words_counted 1.0.0 → 1.0.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: d6302c1802d7da076d1ddafdcbe70e46a89c8f33
- data.tar.gz: 873efaa5e58f883e0dde99094ca53952d46217c7
+ metadata.gz: 9c56052462cafa83864d9f6ae8fcc9edcdd67356
+ data.tar.gz: 562629bb6b6f61d45a2e40ae0fab5b07e749576e
  SHA512:
- metadata.gz: 0e6ddb8db9c060432066d86aed2efe20aa95dee2019d54c950007170c0ffbbcff16fa27a0377419b0d1b718be1625a4376ee9c687a4ae67073aaffe9ef363157
- data.tar.gz: 9df2a0cefe14b9ac77d1741f8980d1b1fb4d8b770738fbd69c8870f73da4b653a1d9462ac8813f88dc48af36e03718773523985f5be0f4999177a6b0a2a89662
+ metadata.gz: 96a6ee9b686893bef4552aedd4dd05f696bf5735cb57c9de65b08f7a6ed6c0a4cbceddd27f8309d0f0522753ff0c556fa5b8ec7bf949a1c3c7c07d9ad1b11337
+ data.tar.gz: 05251ab8e6efa3b29ebd88d7c5ddf82c20bfa3c07321a7a66a9f261c4b1162cdae600cd991b4b243c94d660fc2bfd45d779098f8dcda9f0165474965d8d5b9b8
data/README.md CHANGED
@@ -1,6 +1,6 @@
  # WordsCounted
 
- WordsCounted is a highly customisable Ruby text analyser. Consult the features for more information.
+ WordsCounted is a Ruby natural language processor (NLP) library. It lets you implement powerful tokenisation strategies with a very flexible tokeniser class. [Consult the documentation][2] for more information.
 
  <a href="http://badge.fury.io/rb/words_counted">
  <img src="https://badge.fury.io/rb/words_counted@2x.png" alt="Gem Version" height="18">
@@ -8,25 +8,18 @@ WordsCounted is a highly customisable Ruby text analyser. Consult the features f
  ### Demo
 
- Visit [the gem's website][4] for a demo.
+ Visit [this website][4] for an example of what the gem can do.
 
  ### Features
 
- * Get the following data from any string or readable file:
-   * Word count
-   * Unique word count
-   * Word density
-   * Character count
-   * Average characters per word
-   * A hash map of words and the number of times they occur
-   * A hash map of words and their lengths
-   * The longest word(s) and its length
-   * The most occurring word(s) and its number of occurrences.
- * Count invividual strings for occurrences.
- * A flexible way to exclude words (or anything) from the count. You can pass a **string**, a **regexp**, an **array**, or a **lambda**.
- * Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
-   * Filters special characters but respects hyphens and apostrophes.
-   * Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`.
+ * Out of the box, get the following data from any string, readable file, or URL:
+   * Token count and unique token count
+   * Token densities, frequencies, and lengths
+   * Char count and average chars per token
+   * The longest tokens and their lengths
+   * The most frequent tokens and their frequencies
+ * A flexible way to exclude tokens from the tokeniser. You can pass a **string**, **regexp**, **symbol**, **lambda**, or an **array** of any combination of those types for powerful tokenisation strategies.
+ * Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and unicode characters): *Bayrūt* is treated as `["Bayrūt"]` and not `["Bayr", "ū", "t"]`, for example.
  * Opens and reads files. Pass in a file path or a url instead of a string.
 
  See usage instructions for more details.
@@ -58,62 +51,68 @@ counter = WordsCounted.count(
  counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
  ```
 
- ## API
+ `.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `Counter` initialized with the tokens. The `Tokeniser` and `Counter` classes can be used alone, however.
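 
+ As a quick sketch of that composition (assuming, per the description above, that `Counter.new` accepts the token array returned by `Tokeniser#tokenise`):
+
+ ```ruby
+ tokens = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise
+ counter = WordsCounted::Counter.new(tokens)
+ counter.token_count #=> 2
+ ```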
 
- ### Class methods
+ ## API
 
- #### `count(string, options = {})`
+ **`WordsCounted.count(input, options = {})`**
 
- Initializes an analyser object.
+ Tokenises input and initializes a `Counter` object with the resulting tokens.
 
  ```ruby
  counter = WordsCounted.count("Hello Beirut!")
  ```
 
- Accepts two options: `exclude` and `regexp`. See [Excluding words from the analyser][5] and [Passing in a custom regexp][6] respectively.
+ Accepts two options: `exclude` and `regexp`. See [Excluding tokens from the tokeniser][5] and [Passing in a custom regexp][6] respectively.
 
- #### `from_file(path, options = {})`
+ **`WordsCounted.from_file(path, options = {})`**
 
- Initializes an analyser object from a file path.
+ Reads and tokenises a file, and initializes a `Counter` object with the resulting tokens.
 
  ```ruby
  counter = WordsCounted.from_file("hello_beirut.txt")
  ```
 
- Accepts the same options as `count()`.
+ Accepts the same options as `.count`.
 
- ### Instance methods
+ ### Tokeniser
 
- #### `.word_count`
+ The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules, as long as they can be reduced to a lambda.
 
- Returns the word count of a given string. The word count includes only alpha characters. Hyphenated and words with apostrophes are considered a single word. You can pass in your own regular expression if this is not desired behaviour.
+ Out of the box, the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.
+
+ **`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**
 
  ```ruby
- counter.word_count #=> 15
+ tokens = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise
+
+ # With `exclude`
+ tokens = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
+
+ # With `pattern`
+ tokens = WordsCounted::Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
  ```
 
- #### `.word_occurrences`
+ See [Excluding tokens from the tokeniser][5] and [Passing in a custom regexp][6] for more information.
 
- Returns an unsorted hash map of words and their number of occurrences. Uppercase and lowercase words are counted as the same word.
+ ### Counter
 
- ```ruby
- counter.word_occurrences
+ The `Counter` class allows you to collect various statistics from an array of tokens.
 
- {
-   "we" => 1,
-   "are" => 2,
-   "all" => 1,
-   # ...
-   "stars" => 1
- }
+ **`#token_count`**
+
+ Returns the token count.
+
+ ```ruby
+ counter.token_count #=> 15
  ```
 
- #### `.sorted_word_occurrences`
+ **`#token_frequency`**
 
- Returns a two dimensional array of words and their number of occurrences sorted in descending order. Uppercase and lowercase words are counted as the same word.
+ Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
 
- ```ruby
- counter.sorted_word_occurrences
+ ```ruby
+ counter.token_frequency
 
  [
    ["the", 2],
@@ -124,38 +123,22 @@ counter.sorted_word_occurrences
  ]
  ```
 
- #### `.most_occurring_words`
+ **`#most_frequent_tokens`**
 
- Returns a two dimensional array of the most occurring word and its number of occurrences. In case there is a tie all tied words are returned.
+ Returns a hash of the most frequent token (or tokens, in case of a tie) and its frequency.
 
  ```ruby
- counter.most_occurring_words
+ counter.most_frequent_tokens
 
- [ ["are", 2], ["the", 2] ]
+ { "are" => 2, "the" => 2 }
  ```
 
- #### `.word_lengths`
+ **`#token_lengths`**
 
- Returns an unsorted hash of words and their lengths.
+ Returns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.
 
  ```ruby
- counter.word_lengths
-
- {
-   "We" => 2,
-   "are" => 3,
-   "all" => 3,
-   # ...
-   "stars" => 5
- }
- ```
-
- #### `.sorted_word_lengths`
-
- Returns a two dimensional array of words and their lengths sorted in descending order.
-
- ```ruby
- counter.sorted_word_lengths
+ counter.token_lengths
 
  [
    ["looking", 7],
@@ -166,133 +149,124 @@ counter.sorted_word_lengths
  ]
  ```
 
- #### `.longest_word`
+ **`#longest_tokens`**
 
- Returns a two dimensional array of the longest word and its length. In case there is a tie all tied words are returned.
+ Returns a hash of the longest token (or tokens, in case of a tie) and its length.
 
- ```ruby
- counter.longest_words
-
- [ ["looking", 7] ]
- ```
-
- #### `.words`
-
- Returns an array of words resulting from the string passed into the initialize method.
 
  ```ruby
- counter.words
- #=> ["We", "are", "all", "in", "the", "gutter", "but", "some", "of", "us", "are", "looking", "at", "the", "stars"]
+ counter.longest_tokens
+
+ { "looking" => 7 }
  ```
 
- #### `.word_density([ precision = 2 ])`
+ **`#token_density([ precision: 2 ])`**
 
- Returns a two-dimensional array of words and their density to a precision of two. It accepts a precision argument which defaults to two.
+ Returns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to a precision of two. The array is sorted by density in descending order. It accepts a `precision` argument, which must be a float.
 
  ```ruby
- counter.word_density
+ counter.token_density
 
  [
-   ["are", 13.33],
-   ["the", 13.33],
-   ["but", 6.67],
+   ["are", 0.13],
+   ["the", 0.13],
+   ["but", 0.07],
    # ...
-   ["we", 6.67]
+   ["we", 0.07]
  ]
  ```
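 
+ For instance, the densities above follow from the frequencies and the token count (a sketch reusing the sample counter from the earlier examples):
+
+ ```ruby
+ counter.token_count #=> 15
+ # "the" appears twice, so its density is 2 / 15.0, rounded to 0.13.
+ (2 / 15.0).round(2) #=> 0.13
+ ```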
 
- #### `.char_count`
+ **`#char_count`**
 
- Returns the string's character count.
+ Returns the char count of tokens.
 
  ```ruby
  counter.char_count #=> 76
  ```
 
- #### `.average_chars_per_word([ precision = 2 ])`
+ **`#average_chars_per_token([ precision: 2 ])`**
 
- Returns the average character count per word. Accepts a precision argument which defaults to two.
+ Returns the average char count per token, rounded to two decimal places. Accepts a precision argument, which defaults to two. Precision must be a float.
 
  ```ruby
- counter.average_chars_per_word #=> 4
+ counter.average_chars_per_token #=> 4
  ```
 
- #### `.unique_word_count`
+ **`#unique_token_count`**
 
- Returns the count of unique words in the string. This is case insensitive.
+ Returns the number of unique tokens.
 
  ```ruby
- counter.unique_word_count #=> 13
+ counter.unique_token_count #=> 13
  ```
 
- #### `.count(word)`
+ ## Excluding tokens from the tokeniser
 
- Counts the occurrence of a word in the string.
+ You can exclude anything you want from the input by passing the `exclude` option. The exclude option accepts a variety of filters and is extremely flexible:
 
- ```ruby
- counter.count("are") #=> 2
- ```
-
- ## Excluding words from the analyser
+ 1. A *space-delimited* string. The filter will normalise the string.
+ 2. A regular expression.
+ 3. A lambda.
+ 4. A symbol that is convertible to a proc. For example `:odd?`.
+ 5. An array of any combination of the above.
 
- You can exclude anything you want from the string you want to analyse by passing in the `exclude` option. The exclude option accepts a variety of filters.
-
- 1. A *space-delimited* list of candidates. The filter will remove both uppercase and lowercase variants of the candidate when applicable. Useful for excluding *the*, *a*, and so on.
- 2. An array of string candidates. For example: `['a', 'the']`.
- 3. A regular expression.
- 4. A lambda.
-
- #### Using a string
  ```ruby
- WordsCounted.count(
-   "Magnificent! That was magnificent, Trevor.", exclude: "was magnificent"
- )
- counter.words
- #=> ["That", "Trevor"]
- ```
-
- #### Using an array
- ```ruby
- WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ['1', '2', '3'])
- counter.words
- #=> ["4", "5", "6"]
- ```
-
- #### Using a regular expression
- ```ruby
- WordsCounted.count("Hello Beirut", exclude: /Beirut/)
- counter.words
- #=> ["Hello"]
- ```
-
- #### Using a lambda
- ```ruby
- WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ->(w) { w.to_i.even? })
- counter.words
- #=> ["1", "3", "5"]
+ tokeniser =
+   WordsCounted::Tokeniser.new("Magnificent! That was magnificent, Trevor.")
+
+ # Using a string
+ tokeniser.tokenise(exclude: "was magnificent")
+ # => ["that", "trevor"]
+
+ # Using a regular expression
+ tokeniser.tokenise(exclude: /trevor/)
+ # => ["magnificent", "that", "was", "magnificent"]
+
+ # Using a lambda
+ tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
+ # => ["magnificent", "that", "magnificent", "trevor"]
+
+ # Using a symbol
+ tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
+ tokeniser.tokenise(exclude: :ascii_only?)
+ # => ["محمد"]
+
+ # Using an array
+ tokeniser = WordsCounted::Tokeniser.new(
+   "Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
+ )
+ tokeniser.tokenise(
+   exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6 }, "و"]
+ )
+ # => ["هي", "سامي", "وداني"]
  ```
 
  ## Passing in a Custom Regexp
 
- Defining words is tricky. The default regexp accounts for letters, hyphenated words, and apostrophes. This means *twenty-one* is treated as one word. So is *Mohamad's*.
+ The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.
 
  ```ruby
  /[\p{Alpha}\-']+/
  ```
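 
+ For instance (a sketch; the downcased output assumes the case sensitivity note below):
+
+ ```ruby
+ WordsCounted.count("Twenty-one of Mohamad's sons").tokens
+ #=> ["twenty-one", "of", "mohamad's", "sons"]
+ ```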
 
- But maybe you don't want to count words?&ndash;Well, analyse anything you want. What you analyse is only limited by your knowledge of regular expressions. Pass your own criteria as a Ruby regular expression to split your string as desired.
+ You can pass your own criteria as a Ruby regular expression to split your string as desired.
 
- For example, if you wanted to include numbers in your analysis, you can override the regular expression:
+ For example, if you wanted to include numbers, you can override the regular expression:
 
  ```ruby
  counter = WordsCounted.count("Numbers 1, 2, and 3", regexp: /[\p{Alnum}\-']+/)
- counter.words
+ counter.tokens
  #=> ["Numbers", "1", "2", "and", "3"]
  ```
 
  ## Opening and Reading Files
 
- Use the `from_file` method to open files. `from_file` accepts the same options as `count`. The file path can be a URL.
+ Use the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.
 
  ```ruby
  counter = WordsCounted.from_file("url/or/path/to/file.text")
@@ -300,37 +274,32 @@ counter = WordsCounted.from_file("url/or/path/to/file.text")
 
  ## Gotchas
 
- A hyphen used in leu of an *em* or *en* dash will form part of the word. This affects the `word_occurences` algorithm.
+ A hyphen used in lieu of an *em* or *en* dash will form part of the token. This affects the tokeniser algorithm.
 
  ```ruby
  counter = WordsCounted.count("How do you do?-you are well, I see.")
- counter.word_occurrences
-
- {
-   "how" => 1,
-   "do" => 2,
-   "you" => 1,
-   "-you" => 1, # WTF, mate!
-   "are" => 1,
-   "very" => 1,
-   "well" => 1,
-   "i" => 1,
-   "see" => 1
- }
- ```
+ counter.token_frequency
 
- In this example `-you` and `you` are counted as separate words. Writers should use the correct dash element, but this is not always true.
+ [
+   ["do", 2],
+   ["how", 1],
+   ["you", 1],
+   ["-you", 1], # WTF, mate!
+   ["are", 1],
+   # ...
+ ]
+ ```
 
- Another gotcha is that the default criteria does not include numbers in its analysis. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
+ In this example `-you` and `you` are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
 
  ### A note on case sensitivity
 
- The program will downcase all incoming strings for consistency.
+ The program will normalise (downcase) all incoming strings for consistency and filtering.
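 
+ A quick sketch of the effect:
+
+ ```ruby
+ counter = WordsCounted.count("Hello HELLO hello")
+ counter.token_count        #=> 3
+ counter.unique_token_count #=> 1
+ ```
 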
  ## Road Map
 
  1. Add ability to open URLs.
- 2. Add paragraph, sentence, average words per sentence, and average sentence chars counters.
+ 2. Add Ngram support.
 
  #### Ability to read URLs
 
@@ -342,21 +311,13 @@ def self.from_url
  end
  ```
 
- ## But wait... wait a minute...
-
- #### Isn't it better to write this in JavaScript?
-
- ![Picard face-palm](http://stream1.gifsoup.com/view3/1290449/picard-facepalm-o.gif "Picard face-palm")
-
  ## About
 
  Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on [Code Review][1].
 
  ## Contributors
 
- Thanks to Dave Yarwood for helping me improve my code. Some of my code is based on his recommendations. You can find the original program implementation, as well as Dave's code review, on [Code Review][1].
-
- Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and improving the filter feature to well beyond what I can come up with.
+ See [contributors][3]. Not listed there is [Dave Yarwood][1].
 
  ## Contributing
 
@@ -368,8 +329,8 @@ Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and imp
 
 
  [1]: http://codereview.stackexchange.com/questions/46105/a-ruby-string-analyser
- [2]: https://github.com/wconrad
- [3]: http://codereview.stackexchange.com/a/49476/1563
+ [2]: http://www.rubydoc.info/gems/words_counted
+ [3]: https://github.com/abitdodgy/words_counted/graphs/contributors
  [4]: http://rubywordcount.com
- [5]: https://github.com/abitdodgy/words_counted#excluding-words-from-the-analyser
+ [5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-tokeniser
  [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp
@@ -2,6 +2,8 @@
  module Refinements
    module HashRefinements
      refine Hash do
+       # A convenience method to sort a hash into an
+       # array of tuples by descending value.
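+       #
+       # @example (illustrative)
+       #   { "a" => 1, "b" => 2 }.sort_by_value_desc #=> [["b", 2], ["a", 1]]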
        def sort_by_value_desc
          sort_by(&:last).reverse
        end
@@ -67,10 +67,10 @@ module WordsCounted
  # @example With `exclude` as a mixed array
  #   t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
  #   t.tokenise(exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"])
- #   # => => ["هي", "سامي", "ودان
+ #   # => ["هي", "سامي", "وداني"]
  #
  # @param [Regexp] pattern The pattern used to split the string into tokens.
- # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol nil] exclude The filter to apply.
+ # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil] exclude The filter to apply.
  # @return [Array] the array of filtered tokens.
  def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
    filter_proc = filter_to_proc(exclude)
@@ -1,4 +1,4 @@
  # -*- encoding : utf-8 -*-
  module WordsCounted
-   VERSION = "1.0.0"
+   VERSION = "1.0.1"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: words_counted
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 1.0.1
  platform: ruby
  authors:
  - Mohamad El-Husseini