textmood 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2013 Stian Grytøyr
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
6
+ this software and associated documentation files (the "Software"), to deal in
7
+ the Software without restriction, including without limitation the rights to
8
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9
+ the Software, and to permit persons to whom the Software is furnished to do so,
10
+ subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,162 @@
1
+ ## TextMood - Simple sentiment analyzer
2
+ *TextMood* is a simple sentiment analyzer, provided as a Ruby gem with a command-line
3
+ tool for simple interoperability with other processes. It takes text as input and
4
+ returns a sentiment score. A score above 0 is typically considered positive; a score
5
+ below 0 is considered negative.
6
+
7
+ The goal is to have a robust and simple tool that comes with baseline sentiment files
8
+ for many languages.
9
+
10
+ ### Installation
11
+ The easiest way to get the latest stable version is to use gem:
12
+
13
+ gem install textmood
14
+
15
+ If you’d like to get the bleeding-edge version:
16
+
17
+ git clone https://github.com/stiang/textmood
18
+
19
+ ### Usage
20
+ TextMood can be used as a Ruby library or as a standalone CLI tool.
21
+
22
+ #### Ruby library
23
+ You can use TextMood in a Ruby program like this:
24
+ ```ruby
25
+ require "textmood"
26
+
27
+ # The :lang parameter makes TextMood use one of the bundled language sentiment files
28
+ scorer = TextMood.new(lang: "en_US")
29
+ score = scorer.score_text("some text")
30
+ #=> '1.121'
31
+
32
+ # The :files parameter makes TextMood ignore the bundled sentiment files and use the
33
+ # specified files instead. You can specify as many files as you want.
34
+ scorer = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
35
+
36
+ # TextMood will by default make one pass over the text, checking every word, but it
37
+ # supports doing several passes for any range of word N-grams. Both the start and end
38
+ # N-gram can be specified using the :start_ngram and :end_ngram options
39
+ scorer = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
40
+ score = scorer.score_text("some long text with many words")
41
+ #=> some long: 0.1
42
+ #=> long text: 0.1
43
+ #=> text with: -0.1
44
+ #=> with many: -0.1
45
+ #=> many words: -0.1
46
+ #=> some long text: -0.1
47
+ #=> long text with: 0.1
48
+ #=> text with many: 0.1
49
+ #=> with many words: 0.1
50
+ #=> '0.1'
51
+
52
+ # Using :normalize, you can make TextMood return a normalized value: 1 for positive,
53
+ # 0 for neutral and -1 for negative
54
+ scorer = TextMood.new(lang: "en_US", normalize: true)
55
+ score = scorer.score_text("some text")
56
+ #=> '1'
57
+
58
+ # :min_threshold and :max_threshold let you customize the way :normalize treats
59
+ # different values. The options below will make all scores below 1 negative,
60
+ # 1-2 will be neutral, and above 2 will be positive.
61
+ scorer = TextMood.new(lang: "en_US", normalize: true, min_threshold: 1, max_threshold: 2)
62
+ score = scorer.score_text("some text")
63
+ #=> '0'
64
+
65
+ # :debug prints out all tokens to stdout, along with their values (or 'nil' when the
66
+ # token was not found)
67
+ scorer = TextMood.new(lang: "en_US", debug: true)
68
+ score = scorer.score_text("some text")
69
+ #=> some: 0.1
70
+ #=> text: 0.1
71
+ #=> some text: -0.1
72
+ #=> '0.1'
73
+ ```
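The N-gram passes described in the example above can be pictured with a small sketch. This is not TextMood's own code: `word_ngrams` is a hypothetical helper, and splitting on whitespace is an assumption about how the text is tokenized. It only illustrates how a `start_ngram`/`end_ngram` range expands a text into candidate tokens.

```ruby
# Hypothetical helper, not part of TextMood's API: builds the word N-grams
# for every size from start_ngram to end_ngram (inclusive).
# Assumption: words are separated by whitespace.
def word_ngrams(text, start_ngram = 1, end_ngram = 1)
  words = text.split
  (start_ngram..end_ngram).flat_map do |n|
    # each_cons yields every run of n consecutive words
    words.each_cons(n).map { |group| group.join(" ") }
  end
end

# Prints the bigrams first, then the trigrams, one per line.
word_ngrams("some long text with many words", 2, 3).each { |token| puts token }
```

Each generated token would then be looked up in the sentiment file, which is why multi-word passes are only useful when the file contains tokens of the same N-gram length.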
74
+
75
+ #### CLI tool
76
+ Or you can simply pass some UTF-8-encoded text to the CLI tool and get a score back, like so:
77
+ ```bash
78
+ textmood -l en_US "<some text>"
79
+ -0.4375
80
+ ```
81
+
82
+ The CLI tool has many useful options, mostly mirroring those of the library:
83
+ ```
84
+ Usage: textmood [options] "<text>"
85
+
86
+ Returns a floating-point sentiment score of the provided text.
87
+ Above 0 is considered positive, below is considered negative.
88
+
89
+ MANDATORY options:
90
+ -l, --language LANGUAGE The IETF language tag for the provided text.
91
+ Examples: en_US, no_NB
92
+
93
+ OR
94
+
95
+ -f, --file PATH TO FILE Use the specified sentiment file. May be used
96
+ multiple times to load several files. No other
97
+ files will be loaded if this option is used.
98
+
99
+ OPTIONAL options:
100
+ --start-ngram INTEGER The lowest word N-gram number to split the text into
101
+ (default 1). Note that this only makes sense if the
102
+ sentiment file has tokens of similar N-gram length
103
+
104
+ --end-ngram INTEGER The highest word N-gram number to split the text into
105
+ (default 1). Note that this only makes sense if the
106
+ sentiment file has tokens of similar N-gram length
107
+
108
+ -n, --normalize Return 1 (positive), -1 (negative) or 0 (neutral)
109
+ instead of the actual score. See also --min-threshold and --max-threshold.
110
+
111
+ --min-threshold FLOAT Scores lower than this are considered negative when
112
+ using --normalize (default -0.5)
113
+
114
+ --max-threshold FLOAT Scores higher than this are considered positive when
115
+ using --normalize (default 0.5)
116
+
117
+ -s, --skip-symbols Do not include symbols file (emoticons etc.).
118
+ Only applies when using -l/--language.
119
+
120
+ -d, --debug Prints out the score for each token in the provided text
121
+ or 'nil' if the token was not found in the sentiment file
122
+
123
+ -h, --help Show this message
124
+ ```
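The thresholding behind `--normalize` can be sketched as follows. This is an illustration based only on the option descriptions above, not the gem's actual implementation; `normalize_score` is a hypothetical name, and the defaults mirror the documented -0.5 and 0.5.

```ruby
# Hypothetical helper, not TextMood's implementation: maps a raw score to
# 1 (positive), -1 (negative) or 0 (neutral) using the two thresholds.
def normalize_score(score, min_threshold = -0.5, max_threshold = 0.5)
  if score > max_threshold
    1
  elsif score < min_threshold
    -1
  else
    0
  end
end

puts normalize_score(1.121)      # above the default max threshold
puts normalize_score(1.5, 1, 2)  # between custom thresholds
```

Scores exactly equal to a threshold are treated as neutral in this sketch, which matches the "lower than" and "higher than" wording of the help text.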
125
+
126
+ ## Sentiment files
127
+ The included sentiment files reside in the *lang* directory. I hope to add many
128
+ more baseline sentiment files in the future.
129
+
130
+ Sentiment files should be named according to the IETF language tag, like *en_US*,
131
+ and contain one colon-separated line per token, like so:
132
+ ```
133
+ 1.0: epic
134
+ 1.0: good
135
+ 1.0: upright
136
+ 0.958: fortunate
137
+ 0.875: wonderfulness
138
+ 0.875: wonderful
139
+ 0.875: wide-eyed
140
+ 0.875: wholesomeness
141
+ 0.875: well-to-do
142
+ 0.875: well-situated
143
+ 0.6: well suited
144
+ ```
145
+ The score is to the left of the first ':', and everything to the right is the
146
+ (potentially multi-word) token.
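A minimal sketch of reading this format, assuming only what is stated above (score before the first ':', token after it). `parse_sentiment_line` is a hypothetical helper for illustration and is not taken from the gem.

```ruby
# Hypothetical parser for one sentiment-file line of the form "score: token".
# Returns [token, score], or nil when the line contains no ':'.
def parse_sentiment_line(line)
  # Split on the first ':' only, so tokens may themselves contain colons.
  score, token = line.split(":", 2)
  return nil if token.nil?
  [token.strip, Float(score.strip)]
end

lines = ["1.0: epic", "0.875: well-to-do", "0.6: well suited"]
scores = lines.map { |line| parse_sentiment_line(line) }.compact.to_h
puts scores["well suited"]
```

Limiting the split to two parts keeps multi-word tokens, and tokens such as emoticons that contain a colon, intact.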
147
+
148
+ ## Contribute
149
+ Including baseline word/N-gram scores for many different languages is one
150
+ of the expressed goals of this project. If you are able to contribute scores
151
+ for a missing language or improve an existing one, it would be much appreciated!
152
+
153
+ The process is the usual:
154
+ * Fork
155
+ * Add/improve
156
+ * Pull request
157
+
158
+ ## Credits
159
+ Loosely based on https://github.com/cmaclell/Basic-Tweet-Sentiment-Analyzer
160
+
161
+ ## Author
162
+ Stian Grytøyr
data/bin/textmood ADDED
@@ -0,0 +1,108 @@
1
+ #!/usr/bin/env ruby
2
+ #encoding: utf-8
3
+
4
+ if RUBY_VERSION < '1.9'
5
+ $KCODE='u'
6
+ else
7
+ Encoding.default_external = Encoding::UTF_8
8
+ Encoding.default_internal = Encoding::UTF_8
9
+ end
10
+
11
+ $:.unshift File.join(File.dirname(__FILE__), *%w{ .. lib })
12
+
13
+ require "optparse"
14
+ require "textmood"
15
+
16
+ usage = "Usage: #{File.basename($0)} [options] \"<text>\""
17
+
18
+ def mini_usage(usage, notext = false)
19
+ puts usage
20
+ puts ""
21
+ if notext
22
+ puts "ERROR: Quoted text must be provided after the last option."
23
+ else
24
+ puts "ERROR: An IETF language tag (-l/--language) or a sentiment file (-f/--file) must be provided."
25
+ end
26
+ puts ""
27
+ puts "Use \"#{File.basename($0)} -h\" for full usage info."
28
+ puts ""
29
+ exit 20
30
+ end
31
+
32
+ if ARGV[0] != "-h" and ARGV[0] != "--help" and not (ARGV[0] and ARGV[1])
33
+ mini_usage(usage)
34
+ end
35
+
36
+ options = {:files => []}
37
+ opts_parser = OptionParser.new do |opts|
38
+ opts.banner = usage
39
+ opts.separator ""
40
+ opts.separator "Returns a floating-point sentiment score of the provided text."
41
+ opts.separator "Above 0 is considered positive, below is considered negative."
42
+ opts.separator ""
43
+ opts.separator "MANDATORY options:"
44
+ opts.on("-l", "--language LANGUAGE", "The IETF language tag for the provided text.",
45
+ "Examples: en_US, no_NB") do |l|
46
+ options[:lang] = l
47
+ end
48
+ opts.separator ""
49
+ opts.separator " OR "
50
+ opts.separator ""
51
+ opts.on("-f", "--file PATH TO FILE", "Use the specified sentiment file. May be used",
52
+ "multiple times to load several files. No other",
53
+ "files will be loaded if this option is used.") do |f|
54
+ options[:files] << f
55
+ end
56
+ opts.separator ""
57
+ opts.separator "OPTIONAL options:"
58
+ opts.on("--start-ngram INTEGER", "The lowest word N-gram number to split the text into",
59
+ "(default 1). Note that this only makes sense if the",
60
+ "sentiment file has tokens of similar N-gram length") do |start_ngram|
61
+ options[:start_ngram] = start_ngram.to_i
62
+ end
63
+ opts.separator ""
64
+ opts.on("--end-ngram INTEGER", "The highest word N-gram number to split the text into",
65
+ "(default 1). Note that this only makes sense if the",
66
+ "sentiment file has tokens of similar N-gram length") do |end_ngram|
67
+ options[:end_ngram] = end_ngram.to_i
68
+ end
69
+ opts.separator ""
70
+ opts.on("-n", "--normalize", "Return 1 (positive), -1 (negative) or 0 (neutral)",
71
+ "instead of the actual score. See also --min-threshold and --max-threshold.") do |n|
72
+ options[:normalize] = true
73
+ end
74
+ opts.separator ""
75
+ opts.on("--min-threshold FLOAT", "Scores lower than this are considered negative when",
76
+ "using --normalize (default -0.5)") do |min|
77
+ options[:min_threshold] = min.to_f
78
+ end
79
+ opts.separator ""
80
+ opts.on("--max-threshold FLOAT", "Scores higher than this are considered positive when",
81
+ "using --normalize (default 0.5)") do |max|
82
+ options[:max_threshold] = max.to_f
83
+ end
84
+ opts.separator ""
85
+ opts.on("-s", "--skip-symbols", "Do not include symbols file (emoticons etc.).",
86
+ "Only applies when using -l/--language.") do |s|
87
+ options[:include_symbols] = false
88
+ end
89
+ opts.separator ""
90
+ opts.on("-d", "--debug", "Prints out the score for each token in the provided text",
91
+ "or 'nil' if the token was not found in the sentiment file") do |d|
92
+ options[:debug] = true
93
+ end
94
+ opts.separator ""
95
+ opts.on_tail("-h", "--help", "Show this message") do
96
+ puts opts
97
+ puts ""
98
+ exit
99
+ end
100
+ end
101
+ opts_parser.parse!
102
+
103
+ if ARGV[0]
104
+ scorer = TextMood.new(options)
105
+ puts scorer.score_text(ARGV[0])
106
+ else
107
+ mini_usage(usage, true)
108
+ end