words_counted 0.0.7 → 0.0.8
- checksums.yaml +4 -4
- data/README.md +45 -16
- data/lib/words_counted/counter.rb +24 -98
- data/lib/words_counted/version.rb +1 -1
- data/spec/words_counted/counter_spec.rb +32 -0
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f755855668270d89fc16194ce006feb3f534bec3
|
4
|
+
data.tar.gz: 22bccb437e3105c3ba5d4f4bf563b5fa8757ac49
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8e9222364a6ea859ed17a553b07558ff1fd9e8fb84b6d7c4775b023daf64210de1fc0b93511a6a21fb5afaada0897f8951809137b41641b44e52e29f1c0adeaa
|
7
|
+
data.tar.gz: 057299e3bd09f97b2dd75f815dec2b4a542a74e5c4512798301ce4cad8f9ba330b0fc1db5636143241f1e19142fce0b20e39f6aecd82c46ebaa43d61894b3e7c
|
data/README.md
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
# Words Counted
|
2
2
|
|
3
|
-
Words Counted is a Ruby word counter and string analyser. It includes some handy utility methods that go beyond word counting. You can use this gem to get word density, words and their number of occurrences, the highest occurring words, and a few more things.
|
3
|
+
Words Counted is a Ruby word (or anything--see custom regexp) counter and string analyser. It includes some handy utility methods that go beyond word counting. You can use this gem to get word density, words and their number of occurrences, the highest occurring words, and a few more things.
|
4
|
+
|
5
|
+
You can also pass in your custom criteria for splitting strings in the form of a custom regexp, which affords you a great deal of flexibility, whether you want to count words, numbers, or special characters.
|
4
6
|
|
5
7
|
### Features
|
6
8
|
|
@@ -13,6 +15,8 @@ Words Counted is a Ruby word counter and string analyser. It includes some handy
|
|
13
15
|
7. Filters special characters but respects hyphens and apostrophes.
|
14
16
|
8. Plays nicely with diacritics (utf and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`
|
15
17
|
9. Customisable criteria. Pass in your own regexp rules to split strings if you prefer.
|
18
|
+
10. Get `char_count` and `average_chars_per_word`.
|
19
|
+
11. Get unique word count.
|
16
20
|
|
17
21
|
See usage instructions for details on each feature.
|
18
22
|
|
@@ -32,7 +36,7 @@ Or install it yourself as:
|
|
32
36
|
|
33
37
|
## Usage
|
34
38
|
|
35
|
-
Create an instance of `Counter` and pass in a string and an optional filter
|
39
|
+
Create an instance of `Counter` and pass in a string and an optional filter and/or regexp.
|
36
40
|
|
37
41
|
```ruby
|
38
42
|
counter = WordsCounted::Counter.new(
|
@@ -40,9 +44,11 @@ counter = WordsCounted::Counter.new(
|
|
40
44
|
)
|
41
45
|
```
|
42
46
|
|
47
|
+
### API
|
48
|
+
|
43
49
|
#### `.word_count`
|
44
50
|
|
45
|
-
Returns the word count of a given string. The word count includes only alpha characters. Hyphenated and words with apostrophes are considered a single word.
|
51
|
+
Returns the word count of a given string. The word count includes only alpha characters. Hyphenated and words with apostrophes are considered a single word. You can pass in your own regexp if this is not desired behaviour.
|
46
52
|
|
47
53
|
```ruby
|
48
54
|
counter.word_count #=> 15
|
@@ -159,9 +165,36 @@ counter.word_density
|
|
159
165
|
#
|
160
166
|
```
|
161
167
|
|
168
|
+
#### `.char_count`
|
169
|
+
|
170
|
+
Returns the string's character count.
|
171
|
+
|
172
|
+
```ruby
|
173
|
+
counter.char_count
|
174
|
+
#=> 76
|
175
|
+
```
|
176
|
+
|
177
|
+
#### `.average_chars_per_word`
|
178
|
+
|
179
|
+
Returns the average character count per word.
|
180
|
+
|
181
|
+
```ruby
|
182
|
+
counter.average_chars_per_word
|
183
|
+
#=> 4
|
184
|
+
```
|
185
|
+
|
186
|
+
#### `.unique_word_count`
|
187
|
+
|
188
|
+
Returns the count of unique words in the string.
|
189
|
+
|
190
|
+
```ruby
|
191
|
+
counter.unique_word_count
|
192
|
+
#=> 13
|
193
|
+
```
|
194
|
+
|
162
195
|
## Filtering
|
163
196
|
|
164
|
-
You can pass in a space-delimited word list to filter words that you don't want to count.
|
197
|
+
You can pass in a *space-delimited* word list to filter words that you don't want to count. The filter will remove both uppercase and lowercase variants of the word.
|
165
198
|
|
166
199
|
```ruby
|
167
200
|
WordsCounted::Counter.new(
|
@@ -179,7 +212,9 @@ Defining words is tricky business. Out of the box, the default regexp accounts f
|
|
179
212
|
/[\p{Alpha}\-']+/
|
180
213
|
```
|
181
214
|
|
182
|
-
|
215
|
+
But maybe you don't want to count words? Well, count anything you want. What you count is only limited by your knowledge of regular expressions. Pass in your own criteria in the form of a Ruby regexp to split your string as desired.
|
216
|
+
|
217
|
+
For example, if you wanted to count numbers as words, you could pass the following regex instead of the default one.
|
183
218
|
|
184
219
|
```ruby
|
185
220
|
counter = WordsCounted::Counter.new("I am 007.", regex: /[\p{Alnum}\-']+/)
|
@@ -189,7 +224,7 @@ counter.words
|
|
189
224
|
|
190
225
|
## Gotchas
|
191
226
|
|
192
|
-
A hyphen
|
227
|
+
A hyphen used in lieu of an *em* or *en* dash will form part of the word and throw off the `word_occurrences` algorithm.
|
193
228
|
|
194
229
|
```ruby
|
195
230
|
counter = WordsCounted::Counter.new("How do you do?-you are well, I see.")
|
@@ -213,18 +248,12 @@ counter.word_occurrences
|
|
213
248
|
|
214
249
|
In this example, `-you` and `you` are counted as separate words. Writers should use the correct dash element, but this is not always the case.
|
215
250
|
|
216
|
-
Another gotcha is that the default criteria does not count numbers as words.
|
217
|
-
|
218
|
-
Remember that you can pass in your own regexp if the default solution does not fit your needs.
|
251
|
+
Another gotcha is that the default criteria does not count numbers as words. Remember that you can pass in your own regexp if the default solution does not fit your needs.
|
219
252
|
|
220
|
-
##
|
253
|
+
## Road Map
|
221
254
|
|
222
|
-
1. Add
|
223
|
-
2. Add
|
224
|
-
3. A character counter, with spaces, and without spaces.
|
225
|
-
4. A sentence counter.
|
226
|
-
5. Average words in a sentence.
|
227
|
-
6. Average sentence chars.
|
255
|
+
1. Add ability to open files or URLs.
|
256
|
+
2. Add paragraph, sentence, average words per sentence, and average sentence chars counters.
|
228
257
|
|
229
258
|
#### Ability to open files or urls
|
230
259
|
|
data/lib/words_counted/counter.rb
CHANGED
@@ -1,131 +1,53 @@
|
|
1
1
|
module WordsCounted
|
2
|
-
|
3
|
-
# Represents a Counter object.
|
4
|
-
#
|
5
2
|
class Counter
|
6
|
-
|
7
|
-
# @!word_occurrences [Hash] an hash of words as keys and their occurrences as values.
|
8
|
-
# @!word_lengths [Hash] an hash of words as keys and their lengths as values.
|
9
|
-
attr_reader :words, :word_occurrences, :word_lengths
|
3
|
+
attr_reader :words, :word_occurrences, :word_lengths, :char_count
|
10
4
|
|
11
|
-
# This is the criteria for defining words.
|
12
|
-
#
|
13
|
-
# Words are alpha characters and can include hyphens and apostrophes.
|
14
|
-
#
|
15
5
|
WORD_REGEX = /[\p{Alpha}\-']+/
|
16
6
|
|
17
|
-
# Initializes an instance of Counter and splits a given string into an array of words.
|
18
|
-
#
|
19
|
-
# ## @words
|
20
|
-
# This is the array of words that results from the string passed in. For example:
|
21
|
-
#
|
22
|
-
# Counter.new("Bad, bad, piggy!")
|
23
|
-
# => #<WordsCounted::Counter:0x007fd49429bfb0 @words=["Bad", "bad", "piggy"]>
|
24
|
-
#
|
25
|
-
# @param string [String] the string to act on.
|
26
|
-
# @param options [Hash] a hash of options that includes `filter` and `regex`
|
27
|
-
#
|
28
|
-
# ## `filter`
|
29
|
-
# This a list of words to filter from the string. Useful if you want to remove *a*, **you**, and other common words.
|
30
|
-
# Any words included in the filter must be **lowercase**.
|
31
|
-
# defaults to an empty string
|
32
|
-
#
|
33
|
-
# ## `regex`
|
34
|
-
# The criteria used to split a string. It defaults to `/[^\p{Alpha}\-']+/`.
|
35
|
-
#
|
36
|
-
#
|
37
|
-
# @word_occurrences
|
38
|
-
# This is a hash of words and their occurrences. Occurrences count is not case sensitive.
|
39
|
-
#
|
40
|
-
# ## Example
|
41
|
-
#
|
42
|
-
# "Hello hello" #=> { "hello" => 2 }
|
43
|
-
#
|
44
|
-
# @return [Hash] a hash map of words as keys and their occurrences as values.
|
45
|
-
#
|
46
|
-
#
|
47
|
-
# ## @word_lengths
|
48
|
-
# This is a hash of words and their lengths.
|
49
|
-
#
|
50
|
-
# ## Example
|
51
|
-
#
|
52
|
-
# "Hello sir" #=> { "hello" => 5, "sir" => 3 }
|
53
|
-
#
|
54
|
-
# @return [Hash] a hash map of words as keys and their lengths as values.
|
55
|
-
#
|
56
7
|
def initialize(string, options = {})
|
57
8
|
@options = options
|
58
|
-
|
59
|
-
@words = string.scan(regex).reject { |word| filter.
|
60
|
-
|
61
|
-
|
62
|
-
result[word.downcase] += 1
|
63
|
-
end
|
64
|
-
|
65
|
-
@word_lengths = words.each_with_object({}) do |word, result|
|
66
|
-
result[word] ||= word.length
|
9
|
+
@char_count = string.length
|
10
|
+
@words = string.scan(regex).reject { |word| filter.include? word.downcase }
|
11
|
+
@word_occurrences = words.each_with_object(Hash.new(0)) do |word, hash|
|
12
|
+
hash[word.downcase] += 1
|
67
13
|
end
|
14
|
+
@word_lengths = words.each_with_object({}) { |word, hash| hash[word] ||= word.length }
|
68
15
|
end
|
69
16
|
|
70
|
-
# Returns the total word count.
|
71
|
-
#
|
72
|
-
# @return [Integer] total word count from `words` array size.
|
73
|
-
#
|
74
17
|
def word_count
|
75
18
|
words.size
|
76
19
|
end
|
77
20
|
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
21
|
+
def unique_word_count
|
22
|
+
words.uniq.size
|
23
|
+
end
|
24
|
+
|
25
|
+
def average_chars_per_word
|
26
|
+
(char_count / word_count).round(2)
|
27
|
+
end
|
28
|
+
|
85
29
|
def most_occurring_words
|
86
30
|
highest_ranking word_occurrences
|
87
31
|
end
|
88
32
|
|
89
|
-
# Returns a two dimensional array of the longest word(s) and
|
90
|
-
# its length. In the event of a tie, all tied words are returned.
|
91
|
-
#
|
92
|
-
# @return [Array] see {#highest_ranking}
|
93
|
-
#
|
94
33
|
def longest_words
|
95
34
|
highest_ranking word_lengths
|
96
35
|
end
|
97
36
|
|
98
|
-
# Returns a hash of word and their word density in percent.
|
99
|
-
#
|
100
|
-
# @returns [Hash] a hash map of words as keys and their density as values in percent.
|
101
|
-
#
|
102
37
|
def word_density
|
103
38
|
word_occurrences.each_with_object({}) do |(word, occ), hash|
|
104
|
-
hash[word] =
|
105
|
-
end.sort_by { |_,
|
39
|
+
hash[word] = percent_of(occ)
|
40
|
+
end.sort_by { |_, value| value }.reverse
|
106
41
|
end
|
107
42
|
|
108
43
|
private
|
109
44
|
|
110
|
-
# Takes a hashmap of the form {"foo" => 1, "bar" => 2} and returns an array
|
111
|
-
# containing the entries (as an array) with the highest number as a value.
|
112
|
-
#
|
113
|
-
# @param entries [Hash] a hash of entries to analyse
|
114
|
-
# @return [Array] a two dimentional array where each consists of a word its rank
|
115
|
-
#
|
116
|
-
# {http://codereview.stackexchange.com/a/47515/1563 See here}.
|
117
|
-
#
|
118
45
|
def highest_ranking(entries)
|
119
|
-
entries.group_by { |word,
|
46
|
+
entries.group_by { |word, value| value }.sort.last.last
|
120
47
|
end
|
121
48
|
|
122
|
-
|
123
|
-
|
124
|
-
# @param n [Integer] the divisor.
|
125
|
-
# @returns [Float] a percentege of n based on {#word_count} rounded to two decimal places.
|
126
|
-
#
|
127
|
-
def percent_of_n(n)
|
128
|
-
((n.to_f / word_count.to_f) * 100.0).round(2)
|
49
|
+
def percent_of(n)
|
50
|
+
(n.to_f / word_count.to_f * 100.0).round(2)
|
129
51
|
end
|
130
52
|
|
131
53
|
def regex
|
@@ -133,7 +55,11 @@ module WordsCounted
|
|
133
55
|
end
|
134
56
|
|
135
57
|
def filter
|
136
|
-
@options[:filter]
|
58
|
+
if filters = @options[:filter]
|
59
|
+
filters.split.collect { |word| word.downcase }
|
60
|
+
else
|
61
|
+
[]
|
62
|
+
end
|
137
63
|
end
|
138
64
|
end
|
139
65
|
end
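A note on `average_chars_per_word` as it appears in this diff: `char_count` and `word_count` are both Integers, so `/` performs integer division and the trailing `.round(2)` has no fraction left to round. The standalone sketch below does not load the gem; the method names and sample numbers are hypothetical, and the float-based variant is only an illustration, not part of this release.

```ruby
# Standalone sketch; does not load the words_counted gem.
# Method names and sample numbers are hypothetical.

# Mirrors the expression in the diff: Integer / Integer discards the fraction,
# so the trailing round(2) has nothing left to round.
def integer_average(char_count, word_count)
  (char_count / word_count).round(2)
end

# Hypothetical variant: convert to Float first so the fraction survives.
def float_average(char_count, word_count)
  (char_count.to_f / word_count).round(2)
end

puts integer_average(66, 15) # => 4
puts float_average(66, 15)   # => 4.4
```

With the integer form, a string that yields no words makes `word_count` zero, and Integer division by zero raises `ZeroDivisionError`.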
|
data/spec/words_counted/counter_spec.rb
CHANGED
@@ -9,6 +9,10 @@ module WordsCounted
|
|
9
9
|
expect(counter.instance_variables).to include(:@options)
|
10
10
|
end
|
11
11
|
|
12
|
+
it "sets @char_count" do
|
13
|
+
expect(counter.instance_variables).to include(:@char_count)
|
14
|
+
end
|
15
|
+
|
12
16
|
it "sets @words" do
|
13
17
|
expect(counter.instance_variables).to include(:@words)
|
14
18
|
end
|
@@ -46,11 +50,21 @@ module WordsCounted
|
|
46
50
|
expect(counter.words).to eq(%w[Bust 'em Them be Jim's bastards'])
|
47
51
|
end
|
48
52
|
|
53
|
+
it "does not split on unicode chars" do
|
54
|
+
counter = Counter.new("São Paulo")
|
55
|
+
expect(counter.words).to eq(%w[São Paulo])
|
56
|
+
end
|
57
|
+
|
49
58
|
it "filters words" do
|
50
59
|
counter = Counter.new("That was magnificent, Trevor.", filter: "magnificent")
|
51
60
|
expect(counter.words).to eq(%w[That was Trevor])
|
52
61
|
end
|
53
62
|
|
63
|
+
it "filters words when passed in in uppercase" do
|
64
|
+
counter = Counter.new("That was magnificent, Trevor.", filter: "Magnificent")
|
65
|
+
expect(counter.words).to eq(%w[That was Trevor])
|
66
|
+
end
|
67
|
+
|
54
68
|
it "splits words based on regex" do
|
55
69
|
counter = Counter.new("I am 007.", regex: /[\p{Alnum}\-']+/)
|
56
70
|
expect(counter.words).to eq(["I", "am", "007"])
|
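The regex spec above relies on `String#scan` returning every match of the pattern, which is how the `initialize` method in this diff builds `@words`. The sketch below is standalone and does not load the gem; the constant and method names are hypothetical, while the two patterns are the ones shown in the diff.

```ruby
# Standalone sketch; constant and method names are hypothetical.
DEFAULT_WORDS = /[\p{Alpha}\-']+/ # default pattern from the diff: letters, hyphens, apostrophes
WITH_DIGITS   = /[\p{Alnum}\-']+/ # custom pattern from the README example: also matches digits

# Returns every substring matching the pattern, in order of appearance.
def split_with(string, pattern)
  string.scan(pattern)
end

p split_with("I am 007.", DEFAULT_WORDS) # => ["I", "am"]
p split_with("I am 007.", WITH_DIGITS)   # => ["I", "am", "007"]
p split_with("do?-you", DEFAULT_WORDS)   # => ["do", "-you"]
```

The last call shows the hyphen gotcha from the README: because `-` is inside the character class, a leading hyphen stays attached to the word.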
@@ -117,5 +131,23 @@ module WordsCounted
|
|
117
131
|
expect(counter.word_density).to eq([["major", 50.0], ["mean", 10.0], ["i", 10.0], ["was", 10.0], ["name", 10.0], ["his", 10.0]])
|
118
132
|
end
|
119
133
|
end
|
134
|
+
|
135
|
+
describe ".char_count" do
|
136
|
+
it "returns the number of chars in the passed in string" do
|
137
|
+
expect(counter.char_count).to eq(66)
|
138
|
+
end
|
139
|
+
end
|
140
|
+
|
141
|
+
describe ".average_chars_per_word" do
|
142
|
+
it "returns the average number of chars per word" do
|
143
|
+
expect(counter.average_chars_per_word).to eq(4)
|
144
|
+
end
|
145
|
+
end
|
146
|
+
|
147
|
+
describe ".unique_word_count" do
|
148
|
+
it "returns the number of unique words" do
|
149
|
+
expect(counter.unique_word_count).to eq(13)
|
150
|
+
end
|
151
|
+
end
|
120
152
|
end
|
121
153
|
end
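One detail worth keeping in mind when reading the `.unique_word_count` spec: in this diff the method is `words.uniq.size`, and `words` keeps the original casing, whereas `word_occurrences` downcases its keys. The standalone sketch below does not load the gem; the method names are hypothetical, and the case-insensitive variant is shown only for comparison.

```ruby
# Standalone sketch; does not load the words_counted gem.
# Method names are hypothetical; the pattern is the default one shown in the diff.
DEFAULT_REGEX = /[\p{Alpha}\-']+/

# Mirrors words.uniq.size: "Bad" and "bad" count as two different words.
def case_sensitive_unique_count(string)
  string.scan(DEFAULT_REGEX).uniq.size
end

# Comparison variant: downcase first, as word_occurrences does for its keys.
def case_insensitive_unique_count(string)
  string.scan(DEFAULT_REGEX).map(&:downcase).uniq.size
end

puts case_sensitive_unique_count("Bad, bad, piggy!")   # => 3
puts case_insensitive_unique_count("Bad, bad, piggy!") # => 2
```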
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: words_counted
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.8
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Mohamad El-Husseini
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2014-05-
|
11
|
+
date: 2014-05-03 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: bundler
|