textmood 0.0.1

data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2013 Stian Grytøyr
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
6
+ this software and associated documentation files (the "Software"), to deal in
7
+ the Software without restriction, including without limitation the rights to
8
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9
+ the Software, and to permit persons to whom the Software is furnished to do so,
10
+ subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,162 @@
1
+ ## TextMood - Simple sentiment analyzer
2
+ *TextMood* is a simple sentiment analyzer, provided as a Ruby gem with a command-line
3
+ tool for simple interoperability with other processes. It takes text as input and
4
+ returns a sentiment score. A score above 0 is typically considered positive; a score
5
+ below 0 is considered negative.
6
+
7
+ The goal is to have a robust and simple tool that comes with baseline sentiment files
8
+ for many languages.
9
+
10
+ ### Installation
11
+ The easiest way to get the latest stable version is to use gem:
12
+
13
+ gem install textmood
14
+
15
+ If you’d like to get the bleeding-edge version:
16
+
17
+ git clone https://github.com/stiang/textmood
18
+
19
+ ### Usage
20
+ TextMood can be used as a Ruby library or as a standalone CLI tool.
21
+
22
+ #### Ruby library
23
+ You can use TextMood in a Ruby program like this:
24
+ ```ruby
25
+ require "textmood"
26
+
27
+ # The :lang parameter makes TextMood use one of the bundled language sentiment files
28
+ scorer = TextMood.new(lang: "en_US")
29
+ score = scorer.score_text("some text")
30
+ #=> '1.121'
31
+
32
+ # The :files parameter makes TextMood ignore the bundled sentiment files and use the
33
+ # specified files instead. You can specify as many files as you want.
34
+ scorer = TextMood.new(files: ["en_US-mod1.txt", "emoticons.txt"])
35
+
36
+ # TextMood will by default make one pass over the text, checking every word, but it
37
+ # supports doing several passes for any range of word N-grams. Both the start and end
38
+ # N-gram can be specified using the :start_ngram and :end_ngram options
39
+ scorer = TextMood.new(lang: "en_US", debug: true, start_ngram: 2, end_ngram: 3)
40
+ score = scorer.score_text("some long text with many words")
41
+ #=> some long: 0.1
42
+ #=> long text: 0.1
43
+ #=> text with: -0.1
44
+ #=> with many: -0.1
45
+ #=> many words: -0.1
46
+ #=> some long text: -0.1
47
+ #=> long text with: 0.1
48
+ #=> text with many: 0.1
49
+ #=> with many words: 0.1
50
+ #=> '0.1'
51
+
52
+ # Using :normalize, you can make TextMood return a normalized value: 1 for positive,
53
+ # 0 for neutral and -1 for negative
54
+ scorer = TextMood.new(lang: "en_US", normalize: true)
55
+ score = scorer.score_text("some text")
56
+ #=> '1'
57
+
58
+ # :min_threshold and :max_threshold let you customize the way :normalize treats
59
+ # different values. The options below will make all scores below 1 negative,
60
+ # 1-2 will be neutral, and above 2 will be positive.
61
+ scorer = TextMood.new(lang: "en_US", normalize: true, min_threshold: 1, max_threshold: 2)
62
+ score = scorer.score_text("some text")
63
+ #=> '0'
64
+
65
+ # :debug prints out all tokens to stdout, along with their values (or 'nil' when the
66
+ # token was not found)
67
+ scorer = TextMood.new(lang: "en_US", debug: true)
68
+ score = scorer.score_text("some text")
69
+ #=> some: 0.1
70
+ #=> text: 0.1
71
+ #=> some text: -0.1
72
+ #=> '0.1'
73
+ ```
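The N-gram passes described in the comments above can be pictured with a short sketch. This is not TextMood's own code: `word_ngrams` is a hypothetical helper, and splitting on whitespace is an assumption about how the text is tokenized. It only shows which tokens a given `:start_ngram`/`:end_ngram` range would produce, in the same order as the debug output shown above.

```ruby
# Hypothetical helper, not part of the TextMood API.
# Returns every word N-gram of the text for N in start_ngram..end_ngram.
def word_ngrams(text, start_ngram = 1, end_ngram = 1)
  words = text.split # assumption: tokens are separated by whitespace
  (start_ngram..end_ngram).flat_map do |n|
    # each_cons(n) yields every run of n consecutive words
    words.each_cons(n).map { |group| group.join(" ") }
  end
end

word_ngrams("some long text", 2, 3)
# => ["some long", "long text", "some long text"]
```

Each resulting token would then be looked up in the sentiment file, which is why multi-word passes only help when the file contains tokens of the same length.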
74
+
75
+ #### CLI tool
76
+ Alternatively, pass some UTF-8-encoded text to the CLI tool and get a score back, like so:
77
+ ```bash
78
+ textmood -l en_US "<some text>"
79
+ -0.4375
80
+ ```
81
+
82
+ The CLI tool has many useful options, mostly mirroring those of the library:
83
+ ```
84
+ Usage: textmood [options] "<text>"
85
+
86
+ Returns a floating-point sentiment score of the provided text.
87
+ Above 0 is considered positive, below is considered negative.
88
+
89
+ MANDATORY options:
90
+ -l, --language LANGUAGE The IETF language tag for the provided text.
91
+ Examples: en_US, no_NB
92
+
93
+ OR
94
+
95
+ -f, --file PATH TO FILE Use the specified sentiment file. May be used
96
+ multiple times to load several files. No other
97
+ files will be loaded if this option is used.
98
+
99
+ OPTIONAL options:
100
+ --start-ngram INTEGER The lowest word N-gram number to split the text into
101
+ (default 1). Note that this only makes sense if the
102
+ sentiment file has tokens of similar N-gram length
103
+
104
+ --end-ngram INTEGER The highest word N-gram number to split the text into
105
+ (default 1). Note that this only makes sense if the
106
+ sentiment file has tokens of similar N-gram length
107
+
108
+ -n, --normalize Return 1 (positive), -1 (negative) or 0 (neutral)
109
+ instead of the actual score. See also --min and --max.
110
+
111
+ --min-threshold FLOAT Scores lower than this are considered negative when
112
+ using --normalize (default -0.5)
113
+
114
+ --max-threshold FLOAT Scores higher than this are considered positive when
115
+ using --normalize (default 0.5)
116
+
117
+ -s, --skip-symbols Do not include symbols file (emoticons etc.).
118
+ Only applies when using -l/--language.
119
+
120
+ -d, --debug Prints out the score for each token in the provided text
121
+ or 'nil' if the token was not found in the sentiment file
122
+
123
+ -h, --help Show this message
124
+ ```
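The `--normalize`, `--min-threshold` and `--max-threshold` descriptions above amount to a three-way comparison. The sketch below illustrates that reading using the documented defaults of -0.5 and 0.5. `normalize_score` is a hypothetical name rather than a TextMood method, and treating a score exactly equal to a threshold as neutral is an assumption based on the "lower than"/"higher than" wording.

```ruby
# Hypothetical helper, not part of the TextMood API.
# Maps a raw score to 1 (positive), -1 (negative) or 0 (neutral).
def normalize_score(score, min_threshold = -0.5, max_threshold = 0.5)
  if score > max_threshold
    1
  elsif score < min_threshold
    -1
  else
    0 # between the thresholds (inclusive, by assumption)
  end
end

normalize_score(1.121)     # => 1
normalize_score(-0.4375)   # => 0
normalize_score(1.5, 1, 2) # => 0
```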
125
+
126
+ ## Sentiment files
127
+ The included sentiment files reside in the *lang* directory. I hope to add many
128
+ more baseline sentiment files in the future.
129
+
130
+ Sentiment files should be named according to the IETF language tag, like *en_US*,
131
+ and contain one colon-separated line per token, like so:
132
+ ```
133
+ 1.0: epic
134
+ 1.0: good
135
+ 1.0: upright
136
+ 0.958: fortunate
137
+ 0.875: wonderfulness
138
+ 0.875: wonderful
139
+ 0.875: wide-eyed
140
+ 0.875: wholesomeness
141
+ 0.875: well-to-do
142
+ 0.875: well-situated
143
+ 0.6: well suited
144
+ ```
145
+ The score is to the left of the first ':', and everything to the right is the
146
+ (potentially multi-word) token.
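
To make the line format concrete, here is a minimal parsing sketch. It is not the loader TextMood ships with; `parse_sentiment_line` is a hypothetical helper that follows the rule stated above: split on the first ':' only, so that tokens may themselves contain colons or spaces.

```ruby
# Hypothetical parser, not TextMood's own loader.
# Returns [token, score], or nil when the line contains no ':'.
def parse_sentiment_line(line)
  score, token = line.split(":", 2) # limit of 2 splits on the first ':' only
  return nil if token.nil?
  [token.strip, Float(score.strip)]
end

lines = ["1.0: epic", "0.875: wide-eyed", "0.6: well suited"]
table = lines.map { |l| parse_sentiment_line(l) }.compact.to_h

table["well suited"] # => 0.6
```

`Float()` raises an `ArgumentError` on a malformed score, which surfaces broken lines early instead of silently scoring them as zero.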
147
+
148
+ ## Contribute
149
+ Including baseline word/N-gram scores for many different languages is one
150
+ of the expressed goals of this project. If you are able to contribute scores
151
+ for a missing language or improve an existing one, it would be much appreciated!
152
+
153
+ The process is the usual:
154
+ * Fork
155
+ * Add/improve
156
+ * Pull request
157
+
158
+ ## Credits
159
+ Loosely based on https://github.com/cmaclell/Basic-Tweet-Sentiment-Analyzer
160
+
161
+ ## Author
162
+ Stian Grytøyr
data/bin/textmood ADDED
@@ -0,0 +1,108 @@
1
+ #!/usr/bin/env ruby
2
+ #encoding: utf-8
3
+
4
+ if RUBY_VERSION < '1.9'
5
+ $KCODE='u'
6
+ else
7
+ Encoding.default_external = Encoding::UTF_8
8
+ Encoding.default_internal = Encoding::UTF_8
9
+ end
10
+
11
+ $:.unshift File.join(File.dirname(__FILE__), *%w{ .. lib })
12
+
13
+ require "optparse"
14
+ require "textmood"
15
+
16
+ usage = "Usage: #{File.basename($0)} [options] \"<text>\""
17
+
18
+ def mini_usage(usage, notext = false)
19
+ puts usage
20
+ puts ""
21
+ if notext
22
+ puts "ERROR: Quoted text must be provided after the last option."
23
+ else
24
+ puts "ERROR: An IETF language tag must be provided using the -l/--language option."
25
+ end
26
+ puts ""
27
+ puts "Use \"#{File.basename($0)} -h\" for full usage info."
28
+ puts ""
29
+ exit 20
30
+ end
31
+
32
+ if ARGV[0] != "-h" and ARGV[0] != "--help" and not (ARGV[0] and ARGV[1])
33
+ mini_usage(usage)
34
+ end
35
+
36
+ options = {:files => []}
37
+ opts_parser = OptionParser.new do |opts|
38
+ opts.banner = usage
39
+ opts.separator ""
40
+ opts.separator "Returns a floating-point sentiment score of the provided text."
41
+ opts.separator "Above 0 is considered positive, below is considered negative."
42
+ opts.separator ""
43
+ opts.separator "MANDATORY options:"
44
+ opts.on("-l", "--language LANGUAGE", "The IETF language tag for the provided text.",
45
+ "Examples: en_US, no_NB") do |l|
46
+ options[:lang] = l
47
+ end
48
+ opts.separator ""
49
+ opts.separator " OR "
50
+ opts.separator ""
51
+ opts.on("-f", "--file PATH TO FILE", "Use the specified sentiment file. May be used",
52
+ "multiple times to load several files. No other",
53
+ "files will be loaded if this option is used.") do |f|
54
+ options[:files] << f
55
+ end
56
+ opts.separator ""
57
+ opts.separator "OPTIONAL options:"
58
+ opts.on("--start-ngram INTEGER", "The lowest word N-gram number to split the text into",
59
+ "(default 1). Note that this only makes sense if the",
60
+ "sentiment file has tokens of similar N-gram length") do |start_ngram|
61
+ options[:start_ngram] = start_ngram.to_i
62
+ end
63
+ opts.separator ""
64
+ opts.on("--end-ngram INTEGER", "The highest word N-gram number to split the text into",
65
+ "(default 1). Note that this only makes sense if the",
66
+ "sentiment file has tokens of similar N-gram length") do |end_ngram|
67
+ options[:end_ngram] = end_ngram.to_i
68
+ end
69
+ opts.separator ""
70
+ opts.on("-n", "--normalize", "Return 1 (positive), -1 (negative) or 0 (neutral)",
71
+ "instead of the actual score. See also --min and --max.") do |n|
72
+ options[:normalize] = true
73
+ end
74
+ opts.separator ""
75
+ opts.on("--min-threshold FLOAT", "Scores lower than this are considered negative when",
76
+ "using --normalize (default -0.5)") do |min|
77
+ options[:min_threshold] = min.to_f
78
+ end
79
+ opts.separator ""
80
+ opts.on("--max-threshold FLOAT", "Scores higher than this are considered positive when",
81
+ "using --normalize (default 0.5)") do |max|
82
+ options[:max_threshold] = max.to_f
83
+ end
84
+ opts.separator ""
85
+ opts.on("-s", "--skip-symbols", "Do not include symbols file (emoticons etc.).",
86
+ "Only applies when using -l/--language.") do |s|
87
+ options[:include_symbols] = false
88
+ end
89
+ opts.separator ""
90
+ opts.on("-d", "--debug", "Prints out the score for each token in the provided text",
91
+ "or 'nil' if the token was not found in the sentiment file") do |d|
92
+ options[:debug] = true
93
+ end
94
+ opts.separator ""
95
+ opts.on_tail("-h", "--help", "Show this message") do
96
+ puts opts
97
+ puts ""
98
+ exit
99
+ end
100
+ end
101
+ opts_parser.parse!
102
+
103
+ if ARGV[0]
104
+ scorer = TextMood.new(options)
105
+ puts scorer.score_text(ARGV[0])
106
+ else
107
+ mini_usage(usage, true)
108
+ end
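
The option handling in the script above can be tried in isolation with a reduced sketch. `parse_cli` is a hypothetical name, and only a few of the options are reproduced. Unlike the script, it uses the non-destructive `OptionParser#parse` on an explicit array rather than `parse!` on `ARGV`, and lets `OptionParser` convert the threshold with the `Float` acceptor.

```ruby
require "optparse"

# Hypothetical, reduced version of the option handling in bin/textmood.
# Returns the options hash and the remaining (non-option) arguments.
def parse_cli(argv)
  options = { files: [] }
  parser = OptionParser.new do |opts|
    opts.on("-l", "--language LANGUAGE") { |l| options[:lang] = l }
    # A repeated -f appends to the list, one entry per occurrence
    opts.on("-f", "--file PATH") { |f| options[:files] << f }
    opts.on("-n", "--normalize") { options[:normalize] = true }
    opts.on("--min-threshold FLOAT", Float) { |v| options[:min_threshold] = v }
  end
  rest = parser.parse(argv) # leaves argv itself untouched
  [options, rest]
end

options, rest = parse_cli(["-f", "en_US-mod1.txt", "-f", "emoticons.txt", "-n", "some text"])
rest # => ["some text"]
```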