words_counted 0.0.4 → 0.0.5
- checksums.yaml +4 -4
- data/.yardopts +2 -1
- data/README.md +78 -3
- data/lib/words_counted/counter.rb +86 -26
- data/lib/words_counted/version.rb +1 -1
- data/spec/words_counted/counter_spec.rb +38 -15
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9847725d713bc20dd1b66d86321c8089e33276bf
+  data.tar.gz: f847a70bf6e008527606f944833979bf952a330e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d5922f2b471ea4bc60650a77972c289e516dae19b5ba4e59ceacf669df85442b775361a43ea9c8765270b005e2a8dfd1db39dbd4b85e29db142a3c064cabaff7
+  data.tar.gz: 2ab1369a5dc2b063c749242339e93d7a5f5424a9bae489fd6b649406834e7e05f18af7a598c797928f256d73eba532db3e1ebde95cbf5798b58b9e32599a6f8f
data/.yardopts
CHANGED
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # Words Counted
 
-
+Words Counted is a Ruby word counter and string analyser. It includes some handy utility methods that go beyond word counting. You can use this gem to get word desnity, words and their number of occurrences, the highest occurring words, and few more things. You can also pass in your custom criteria for splitting strings in the form of a custom regexp.
 
 ### Features
 
@@ -11,6 +11,8 @@ This Ruby gem is a word counter that includes some handy utility methods. It let
 5. Get the longest word(s) and its length.
 6. Ability to filter out words from the count. Useful if you don't want to count `a`, `the`, etc...
 7. Filters special characters but respects hyphens and apostrophes.
+8. Plays nicely with diacritics (utf and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`
+9. Customisable criteria. Pass in your own regexp rules to split strings if you prefer.
 
 See usage instructions for details on each feature.
 
@@ -132,13 +134,55 @@ counter.words
 #=> ["We", "are", "all", "in", "the", "gutter", "but", "some", "of", "us", "are", "looking", "at", "the", "stars"]
 ```
 
+#### `.word_density`
+
+Returns a two-dimentional array of words and their density.
+
+```ruby
+counter.word_density
+#
+# [
+# ["are", 13.33],
+# ["the", 13.33],
+# ["but", 6.67],
+# ["us", 6.67],
+# ["of", 6.67],
+# ["some", 6.67],
+# ["looking", 6.67],
+# ["gutter", 6.67],
+# ["at", 6.67],
+# ["in", 6.67],
+# ["all", 6.67],
+# ["stars", 6.67],
+# ["we", 6.67]
+# ]
+#
+```
+
 ## Filtering
 
 You can pass in a space-delimited word list to filter words that you don't want to count. Filter words should be *lowercase*. The filter will remove both uppercase and lowercase variants of the word.
 
 ```ruby
-WordsCounted::Counter.new("Magnificent! That was magnificent, Trevor.", "was magnificent")
-
+WordsCounted::Counter.new("Magnificent! That was magnificent, Trevor.", filter: "was magnificent")
+counter.words
+#=> ["That", "Trevor"]
+```
+
+## Passing in a Custom Regexp
+
+Defining words is tricky business. Out of the box, the default regexp accounts for letters, hyphenated words, and apostrophes. This means `twenty-one` is treated as one word. So is `Mohamad's`.
+
+```ruby
+/[^\p{Alpha}\-']+/
+```
+
+If you prefer, you can pass in your own criteria in the form of a Ruby regexp to split your string as desired. For example, if you wanted to count numbers as words, you could pass the following regex instead of the default one.
+
+```ruby
+counter = WordsCounted::Counter.new("I am 007.", regex: /[^\p{Alnum}\-']+/)
+counter.words
+=> ["I", "am", "007"]
 ```
 
 ## Gotchas
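Reviewer's note on the hunk above: the `filter:` and `regex:` options it documents can be tried out with plain `String#split`, without installing the gem. The sketch below is only an illustration under that assumption. `split_words` is a hypothetical helper, not part of words_counted; it reuses the two regexps quoted in the README and the split-then-reject approach visible in the `counter.rb` diff.

```ruby
# Hypothetical helper (not the gem's API): split on a "non-word" regexp,
# then drop any word whose lowercase form appears in the filter list.
def split_words(string, filter: "", regex: /[^\p{Alpha}\-']+/)
  excluded = filter.split # space-delimited, expected to be lowercase
  string.split(regex).reject { |word| excluded.include?(word.downcase) }
end

# Filtering removes both "Magnificent" and "magnificent".
filtered = split_words("Magnificent! That was magnificent, Trevor.", filter: "was magnificent")
# => ["That", "Trevor"]

# The default criteria treats digits as separators, so "007" disappears.
default_split = split_words("I am 007.")
# => ["I", "am"]

# Swapping Alpha for Alnum keeps digits as part of words.
alnum_split = split_words("I am 007.", regex: /[^\p{Alnum}\-']+/)
# => ["I", "am", "007"]

# Hyphens and apostrophes are kept inside words by both regexps.
hyphenated = split_words("twenty-one is Mohamad's number")
# => ["twenty-one", "is", "Mohamad's", "number"]
```

Because the regexp describes what separates words rather than what a word is, widening the character class (here from `Alpha` to `Alnum`) is enough to change what counts as a word.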
@@ -167,6 +211,37 @@ counter.word_occurrences
 
 In this example, `-you` and `you` are counted as separate words. Writers should use the correct dash element, but this is not always the case.
 
+The default criteria does not count numbers as words.
+
+## To do
+
+1. Add paragraph counter.
+2. Add ability to open files or URLs.
+3. A character counter, with spaces, and without spaces.
+4. A sentence counter.
+5. Average words in a sentence.
+6. Average sentence chars.
+
+#### Ability to open files or urls
+
+Maybe I can some class methods to open the file and init the counter class.
+
+```ruby
+def self.count_from_url
+  new # open url and send string here after removing html
+end
+
+def self.from_file
+  new # open file and send string here.
+end
+```
+
+## But wait... wait a minute...
+
+#### Isn't it better to write this in JavaScript?
+
+![http://stream1.gifsoup.com/view3/1290449/picard-facepalm-o.gif][Picard face palm]
+
 ## About
 
 Originally I wrote this program for a code challenge. My initial implementation was decent, but it could have been better. Thanks to [Dave Yarwood](http://codereview.stackexchange.com/a/47515/1563) for helping me improve my code. Some of this code is based on his recommendations. You can find the original implementation as well as the code review on [Code Review](http://codereview.stackexchange.com/questions/46105/a-ruby-string-analyser).
data/lib/words_counted/counter.rb
CHANGED
@@ -1,46 +1,78 @@
 module WordsCounted
+
+  # Represents a Counter object.
+  #
   class Counter
     # @!words [Array] an array of words resulting from the string passed to the initializer.
-
+    # @!word_occurrences [Hash] an hash of words as keys and their occurrences as values.
+    # @!word_lengths [Hash] an hash of words as keys and their lengths as values.
+    attr_reader :words, :word_occurrences, :word_lengths
 
     # This is the criteria for defining words.
     #
     # Words are alpha characters and can include hyphens and apostrophes.
+    #
     WORD_REGEX = /[^\p{Alpha}\-']+/
 
     # Initializes an instance of Counter and splits a given string into an array of words.
     #
-    #
-    #
+    # ## @words
+    # This is the array of words that results from the string passed in. For example:
+    #
+    # Counter.new("Bad, bad, piggy!")
+    # => #<WordsCounted::Counter:0x007fd49429bfb0 @words=["Bad", "bad", "piggy"]>
     #
     # @param string [String] the string to act on.
-    # @param
+    # @param options [Hash] a hash of options that includes `filter` and `regex`
     #
-
-
-
-
-    #
+    # ## `filter`
+    # This a list of words to filter from the string. Useful if you want to remove *a*, **you**, and other common words.
+    # Any words included in the filter must be **lowercase**.
+    # defaults to an empty string
+    #
+    # ## `regex`
+    # The criteria used to split a string. It defaults to `/[^\p{Alpha}\-']+/`.
     #
-    def word_count
-      words.size
-    end
-
-    # Returns a hash of words and their occurrences.
-    # Occurrences count is not case sensitive:
     #
-    #
+    # @word_occurrences
+    # This is a hash of words and their occurrences. Occurrences count is not case sensitive.
     #
-    #
+    # ## Example
     #
-
-
+    # "Hello hello" #=> { "hello" => 2 }
+    #
+    # @return [Hash] a hash map of words as keys and their occurrences as values.
+    #
+    #
+    # ## @word_lengths
+    # This is a hash of words and their lengths.
+    #
+    # ## Example
+    #
+    # "Hello sir" #=> { "hello" => 5, "sir" => 3 }
+    #
+    # @return [Hash] a hash map of words as keys and their lengths as values.
+    #
+    def initialize(string, options = {})
+      @options = options
+
+      @words = string.split(regex).reject { |word| filter.split.include? word.downcase }
+
+      @word_occurrences = words.each_with_object(Hash.new(0)) do |word, result|
+        result[word.downcase] += 1
+      end
+
+      @word_lengths = words.each_with_object({}) do |word, result|
+        result[word] ||= word.length
+      end
     end
 
-    # Returns
+    # Returns the total word count.
     #
-
-
+    # @return [Integer] total word count from `words` array size.
+    #
+    def word_count
+      words.size
     end
 
     # Returns a two dimensional array of the most occuring word(s)
@@ -48,30 +80,58 @@ module WordsCounted
     #
     # In the event of a tie, all tied words are returned.
     #
+    # @return [Array] see {#highest_ranking}
+    #
     def most_occurring_words
       highest_ranking word_occurrences
     end
 
     # Returns a two dimensional array of the longest word(s) and
-    # its length.
+    # its length. In the event of a tie, all tied words are returned.
     #
-    #
+    # @return [Array] see {#highest_ranking}
     #
     def longest_words
       highest_ranking word_lengths
     end
 
+    # Returns a hash of word and their word density in percent.
+    #
+    # @returns [Hash] a hash map of words as keys and their density as values in percent.
+    #
+    def word_density
+      word_occurrences.each_with_object({}) { |(word, occ), hash| hash[word] = percent_of_n(occ) }.sort_by { |_, v| v }.reverse
+    end
+
     private
 
     # Takes a hashmap of the form {"foo" => 1, "bar" => 2} and returns an array
     # containing the entries (as an array) with the highest number as a value.
     #
-    # {http://codereview.stackexchange.com/a/47515/1563 See here}.
-    #
     # @param entries [Hash] a hash of entries to analyse
+    # @return [Array] a two dimentional array where each consists of a word its rank
+    #
+    # {http://codereview.stackexchange.com/a/47515/1563 See here}.
     #
     def highest_ranking(entries)
       entries.group_by { |word, occurrence| occurrence }.sort.last.last
     end
+
+    # Calculates the percentege of a word.
+    #
+    # @param n [Integer] the divisor.
+    # @returns [Float] a percentege of n based on {#word_count} rounded to two decimal places.
+    #
+    def percent_of_n(n)
+      ((n.to_f / word_count.to_f) * 100.0).round(2)
+    end
+
+    def regex
+      @options[:regex] || WORD_REGEX
+    end
+
+    def filter
+      @options[:filter] || String.new
+    end
   end
 end
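Reviewer's note on `word_density`: the chain added in this hunk ends in `sort_by { ... }.reverse`, so it yields an array of `[word, density]` pairs rather than a Hash, even though the doc comment and its `@returns` tag say Hash. The stand-alone sketch below reproduces the same arithmetic so it can be checked in isolation. `density_sketch` is a hypothetical name introduced here, not part of the gem, and it sorts by negated density instead of sorting and reversing.

```ruby
# Hypothetical stand-alone helper mirroring the diff's logic; not the gem's API.
def density_sketch(string, regex: /[^\p{Alpha}\-']+/)
  words = string.split(regex)

  # Case-insensitive occurrence count, as in Counter#initialize.
  occurrences = words.each_with_object(Hash.new(0)) do |word, counts|
    counts[word.downcase] += 1
  end

  # Percentage of the total word count, rounded to two decimal places,
  # highest density first. The order among tied words is not guaranteed.
  occurrences
    .map { |word, count| [word, (count.to_f / words.size * 100.0).round(2)] }
    .sort_by { |_, density| -density }
end

density = density_sketch("His name was major, I mean, Major Major Major Major.")
density.first # => ["major", 50.0]
```

Since `sort_by` is not a stable sort, a check on the result is safer when it compares the top entry and the set of tied entries than when it compares the exact order of the ties.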
data/spec/words_counted/counter_spec.rb
CHANGED
@@ -2,10 +2,23 @@ require "spec_helper"
 
 module WordsCounted
   describe Counter do
+    let(:counter) { Counter.new("We are all in the gutter, but some of us are looking at the stars.") }
 
-    describe "
-
+    describe "#initialize" do
+      it "sets @words" do
+        expect(counter.instance_variables).to include(:@words)
+      end
+
+      it "sets @word_occurrences" do
+        expect(counter.instance_variables).to include(:@word_occurrences)
+      end
 
+      it "sets @word_lengths" do
+        expect(counter.instance_variables).to include(:@word_lengths)
+      end
+    end
+
+    describe ".words" do
       it "returns an array" do
         expect(counter.words).to be_a(Array)
       end
@@ -30,65 +43,75 @@ module WordsCounted
       end
 
       it "filters words" do
-        counter = Counter.new("That was magnificent, Trevor.", "magnificent")
+        counter = Counter.new("That was magnificent, Trevor.", filter: "magnificent")
         expect(counter.words).to eq(%w[That was Trevor])
       end
+
+      it "splits words based on regex" do
+        counter = Counter.new("I am 007.", regex: /[^\p{Alnum}\-']+/)
+        expect(counter.words).to eq(["I", "am", "007"])
+      end
     end
 
     describe ".word_count" do
-      let(:counter) { Counter.new("In that case I'll take measures to secure you, woman!") }
-
       it "returns the correct word count" do
-        expect(counter.word_count).to eq(
+        expect(counter.word_count).to eq(15)
       end
     end
 
     describe ".word_occurrences" do
-      let(:counter) { Counter.new("Bad, bad, piggy!") }
-
       it "returns a hash" do
         expect(counter.word_occurrences).to be_a(Hash)
       end
 
       it "treats capitalized words as the same word" do
+        counter = Counter.new("Bad, bad, piggy!")
         expect(counter.word_occurrences).to eq({ "bad" => 2, "piggy" => 1 })
       end
     end
 
     describe ".most_occurring_words" do
-      let(:counter) { Counter.new("One should always be in love. That is the reason one should never marry.") }
-
       it "returns an array" do
         expect(counter.most_occurring_words).to be_a(Array)
       end
 
       it "returns highest occuring words" do
-
+        counter = Counter.new("Orange orange Apple apple banana")
+        expect(counter.most_occurring_words).to eq([["orange", 2],["apple", 2]])
       end
     end
 
     describe '.word_lengths' do
-      let(:counter) { Counter.new("One two three.") }
-
       it "returns a hash" do
         expect(counter.word_lengths).to be_a(Hash)
       end
 
       it "returns a hash of word lengths" do
+        counter = Counter.new("One two three.")
         expect(counter.word_lengths).to eq({ "One" => 3, "two" => 3, "three" => 5 })
       end
     end
 
     describe ".longest_words" do
-      let(:counter) { Counter.new("Those whom the gods love grow young.") }
-
       it "returns an array" do
         expect(counter.longest_words).to be_a(Array)
       end
 
       it "returns the longest words" do
+        counter = Counter.new("Those whom the gods love grow young.")
         expect(counter.longest_words).to eq([["Those", 5],["young", 5]])
       end
     end
+
+    describe ".word_density" do
+      it "returns a hash" do
+        expect(counter.word_density).to be_a(Array)
+      end
+
+      it "returns words and their density in percent" do
+        counter = Counter.new("His name was major, I mean, Major Major Major Major.")
+        expect(counter.word_density).to eq([["major", 50.0], ["mean", 10.0], ["i", 10.0], ["was", 10.0], ["name", 10.0], ["his", 10.0]])
+      end
+    end
   end
 end
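Reviewer's note on the tie handling exercised by the `.most_occurring_words` and `.longest_words` specs: both rely on the private `highest_ranking` idiom (`group_by`, `sort`, `last.last`) from `counter.rb`. The sketch below isolates that idiom so the tie behaviour can be seen on a plain Hash. `highest_ranking_sketch` is a hypothetical name, and the empty-input guard is an addition made for the sketch only; the version in the diff has no such guard.

```ruby
# Hypothetical stand-alone copy of the group_by/sort/last idiom; not the gem's API.
def highest_ranking_sketch(entries)
  # Guard added for this sketch only: without it, an empty Hash would make
  # `.sort.last` return nil and the final `.last` raise NoMethodError.
  return [] if entries.empty?

  # Group [word, rank] pairs by rank, order the groups by rank,
  # then take the pairs belonging to the highest rank.
  entries.group_by { |_, rank| rank }.sort.last.last
end

occurrences = { "orange" => 2, "apple" => 2, "banana" => 1 }
top = highest_ranking_sketch(occurrences)
# => [["orange", 2], ["apple", 2]]
```

Grouping by rank before picking the maximum is what makes every tied word come back, in the order the words were first seen.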
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: words_counted
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.5
 platform: ruby
 authors:
 - Mohamad El-Husseini
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-05-01 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler