words_counted 1.0.2 → 1.0.3
- checksums.yaml +5 -5
- data/.ruby-style.yml +2 -2
- data/.ruby-version +1 -0
- data/.travis.yml +2 -2
- data/CHANGELOG.md +5 -0
- data/README.md +39 -43
- data/lib/refinements/hash_refinements.rb +2 -0
- data/lib/words_counted/counter.rb +23 -11
- data/lib/words_counted/deprecated.rb +2 -0
- data/lib/words_counted/tokeniser.rb +53 -29
- data/lib/words_counted/version.rb +1 -1
- data/lib/words_counted.rb +8 -6
- data/words_counted.gemspec +1 -1
- metadata +11 -11
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: a248654f9f76e28bde0f54993a5c5c87504acffed42b1531acc9de7f385f0696
+  data.tar.gz: c057a7ecb20d7989651b6667f39d16820734e63dd751a0182406f268ecf0f347
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2c4a5028624393434586c7570e8a6c98785c6cedfc3a6f5c07b7fa9b8aba2880ddf847be8779f623df8e36becb8e148aeaabfae822dcc4f0c9b1db414f8c7916
+  data.tar.gz: e115d757c34480e9e7425db94f6c78a035b4464c69946aa31cbb45ea28f963dc1088a1617269b506001669753b2725abf4f0b708303ced59aa5c59cb1658096c
data/.ruby-style.yml
CHANGED
@@ -1,2 +1,2 @@
-
-
+Metrics/LineLength:
+  Max: 120
data/.ruby-version
ADDED
@@ -0,0 +1 @@
+3.0.1
data/.travis.yml
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,8 @@
+## Version 1.0.3
+
+1. Adds support for Ruby 3.0.0.
+2. Improves documentation and adds newer configs to Travis CI and Hound.
+
 ## Version 1.0
 
 This version brings lots of improvements to code organisation. The tokeniser has been extracted into its own class. All methods in `Counter` have either been renamed or deprecated. Deprecated methods and their tests have moved into their own modules. Using them will trigger warnings with upgrade instructions outlined below.
data/README.md
CHANGED
@@ -1,14 +1,22 @@
 # WordsCounted
 
-
+> We are all in the gutter, but some of us are looking at the stars.
+>
+> -- Oscar Wilde
+
+WordsCounted is a Ruby NLP (natural language processor). WordsCounted lets you implement powerful tokenisation strategies with a very flexible tokeniser class.
+
+**Are you using WordsCounted to do something interesting?** Please [tell me about it][8].
 
 <a href="http://badge.fury.io/rb/words_counted">
   <img src="https://badge.fury.io/rb/words_counted@2x.png" alt="Gem Version" height="18">
 </a>
 
+[RubyDoc documentation][7].
+
 ### Demo
 
-Visit [this website][4] for
+Visit [this website][4] for one example of what you can do with WordsCounted.
 
 ### Features
 
@@ -22,8 +30,6 @@ Visit [this website][4] for an example of what the gem can do.
 * Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and unicode characters): *Bayrūt* is treated as `["Bayrūt"]` and not `["Bayr", "ū", "t"]`, for example.
 * Opens and reads files. Pass in a file path or a url instead of a string.
 
-See usage instructions for more details.
-
 ## Installation
 
 Add this line to your application's Gemfile:
@@ -51,13 +57,15 @@ counter = WordsCounted.count(
 counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
 ```
 
-`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `Counter` initialized with the tokens. The `Tokeniser` and `Counter` classes can be used alone, however.
+`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `WordsCounted::Counter` initialized with the tokens. The `WordsCounted::Tokeniser` and `WordsCounted::Counter` classes can be used alone, however.
 
 ## API
 
+### WordsCounted
+
 **`WordsCounted.count(input, options = {})`**
 
-Tokenises input and initializes a `Counter` object with the resulting tokens.
+Tokenises input and initializes a `WordsCounted::Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("Hello Beirut!")
@@ -67,10 +75,10 @@ Accepts two options: `exclude` and `regexp`. See [Excluding tokens from the anal
 
 **`WordsCounted.from_file(path, options = {})`**
 
-Reads and tokenises a file, and initializes a `Counter` object with the resulting tokens.
+Reads and tokenises a file, and initializes a `WordsCounted::Counter` object with the resulting tokens.
 
 ```ruby
-counter = WordsCounted.
+counter = WordsCounted.from_file("hello_beirut.txt")
 ````
 
 Accepts the same options as `.count`.
@@ -84,20 +92,20 @@ Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and to
 **`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**
 
 ```ruby
-tokeniser = Tokeniser.new("Hello Beirut!").tokenise
+tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise
 
 # With `exclude`
-tokeniser = Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
+tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
 
 # With `pattern`
-tokeniser = Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
+tokeniser = WordsCounted::Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
 ```
 
 See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] for more information.
 
 ### Counter
 
-The `Counter` class allows you to collect various statistics from an array of tokens.
+The `WordsCounted::Counter` class allows you to collect various statistics from an array of tokens.
 
 **`#token_count`**
 
@@ -111,7 +119,7 @@ counter.token_count #=> 15
 
 Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
 
-```
+```ruby
 counter.token_frequency
 
 [
@@ -192,12 +200,12 @@ Returns the average char count per token rounded to two decimal places. Accepts
 counter.average_chars_per_token #=> 4
 ```
 
-**`#
+**`#uniq_token_count`**
 
-Returns the number unique tokens.
+Returns the number of unique tokens.
 
 ```ruby
-counter.
+counter.uniq_token_count #=> 13
 ```
 
 ## Excluding tokens from the tokeniser
@@ -207,33 +215,30 @@ You can exclude anything you want from the input by passing the `exclude` option
 1. A *space-delimited* string. The filter will normalise the string.
 2. A regular expression.
 3. A lambda.
-4. A symbol that
+4. A symbol that names a predicate method. For example `:odd?`.
 5. An array of any combination of the above.
 
 ```ruby
 tokeniser =
   WordsCounted::Tokeniser.new(
-    "Magnificent! That was magnificent, Trevor."
+    "Magnificent! That was magnificent, Trevor."
   )
 
 # Using a string
 tokeniser.tokenise(exclude: "was magnificent")
-tokeniser.tokens
 # => ["that", "trevor"]
 
 # Using a regular expression
-tokeniser.tokenise(exclude: /
-
-# => ["that", "was", "magnificent"]
+tokeniser.tokenise(exclude: /trevor/)
+# => ["magnificent", "that", "was", "magnificent"]
 
 # Using a lambda
 tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
-
-# => ["magnificent", "trevor"]
+# => ["magnificent", "that", "magnificent", "trevor"]
 
 # Using symbol
 tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
-
+tokeniser.tokenise(exclude: :ascii_only?)
 # => ["محمد"]
 
 # Using an array
@@ -243,10 +248,10 @@ tokeniser = WordsCounted::Tokeniser.new(
 tokeniser.tokenise(
   exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"]
 )
-# => ["هي", "سامي", "
+# => ["هي", "سامي", "وداني"]
 ```
 
-## Passing in a
+## Passing in a custom regexp
 
 The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.
 
@@ -259,12 +264,12 @@ You can pass your own criteria as a Ruby regular expression to split your string
 For example, if you wanted to include numbers, you can override the regular expression:
 
 ```ruby
-counter = WordsCounted.count("Numbers 1, 2, and 3",
+counter = WordsCounted.count("Numbers 1, 2, and 3", pattern: /[\p{Alnum}\-']+/)
 counter.tokens
-#=> ["
+#=> ["numbers", "1", "2", "and", "3"]
 ```
 
-## Opening and
+## Opening and reading files
 
 Use the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.
 
@@ -296,14 +301,9 @@ In this example `-you` and `you` are separate tokens. Also, the tokeniser does n
 
 The program will normalise (downcase) all incoming strings for consistency and filters.
 
-##
-
-1. Add ability to open URLs.
-2. Add Ngram support.
-
-#### Ability to read URLs
+## Roadmap
 
-
+### Ability to open URLs
 
 ```ruby
 def self.from_url
@@ -311,10 +311,6 @@ def self.from_url
 end
 ```
 
-## About
-
-Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on [Code Review][1].
-
 ## Contributors
 
 See [contributors][3]. Not listed there is [Dave Yarwood][1].
@@ -327,10 +323,10 @@ See [contributors][3]. Not listed there is [Dave Yarwood][1].
 4. Push to the branch (`git push origin my-new-feature`)
 5. Create new Pull Request
 
-
-[1]: http://codereview.stackexchange.com/questions/46105/a-ruby-string-analyser
 [2]: http://www.rubydoc.info/gems/words_counted
 [3]: https://github.com/abitdodgy/words_counted/graphs/contributors
 [4]: http://rubywordcount.com
 [5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-analyser
 [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp
+[7]: http://www.rubydoc.info/gems/words_counted/
+[8]: https://github.com/abitdodgy/words_counted/issues/new
data/lib/words_counted/counter.rb
CHANGED

@@ -3,10 +3,21 @@ module WordsCounted
   using Refinements::HashRefinements
 
   class Counter
+    # This module contains several methods to extract useful statistics
+    # from any array of tokens, such as density, frequency, and more.
+    #
+    # @example
+    #   WordsCounted::Counter.new(["hello", "world"]).token_count
+    #   # => 2
+
     include Deprecated
 
+    # @return [Array<String>] an array of tokens.
     attr_reader :tokens
 
+    # Initializes state with an array of tokens.
+    #
+    # @param [Array] An array of tokens to perform operations on
     def initialize(tokens)
       @tokens = tokens
     end
@@ -17,7 +28,7 @@ module WordsCounted
     #   Counter.new(%w[one two two three three three]).token_count
     #   # => 6
     #
-    # @return [Integer]
+    # @return [Integer] The number of tokens
     def token_count
       tokens.size
     end
@@ -28,7 +39,7 @@ module WordsCounted
     #   Counter.new(%w[one two two three three three]).uniq_token_count
     #   # => 3
     #
-    # @return [Integer]
+    # @return [Integer] The number of unique tokens
     def uniq_token_count
       tokens.uniq.size
     end
@@ -39,7 +50,7 @@ module WordsCounted
     #   Counter.new(%w[one two]).char_count
     #   # => 6
     #
-    # @return [Integer]
+    # @return [Integer] The total char count of tokens
     def char_count
       tokens.join.size
     end
@@ -51,7 +62,7 @@ module WordsCounted
     #   Counter.new(%w[one two two three three three]).token_frequency
     #   # => [ ['three', 3], ['two', 2], ['one', 1] ]
     #
-    # @return [Array<Array<String, Integer>>]
+    # @return [Array<Array<String, Integer>>] An array of tokens and their frequencies
     def token_frequency
       tokens.each_with_object(Hash.new(0)) { |token, hash| hash[token] += 1 }.sort_by_value_desc
     end
@@ -63,7 +74,7 @@ module WordsCounted
     #   Counter.new(%w[one two three four five]).token_lenghts
     #   # => [ ['three', 5], ['four', 4], ['five', 4], ['one', 3], ['two', 3] ]
     #
-    # @return [Array<Array<String, Integer>>]
+    # @return [Array<Array<String, Integer>>] An array of tokens and their lengths
     def token_lengths
       tokens.uniq.each_with_object({}) { |token, hash| hash[token] = token.length }.sort_by_value_desc
     end
@@ -80,8 +91,8 @@ module WordsCounted
     #   Counter.new(%w[Maj. Major Major Major]).token_density(precision: 4)
     #   # => [ ['major', .7500], ['maj', .2500] ]
     #
-    # @param [Integer] precision
-    # @return [Array<Array<String, Float>>]
+    # @param [Integer] precision The number of decimal places to round density to
+    # @return [Array<Array<String, Float>>] An array of tokens and their densities
     def token_density(precision: 2)
       token_frequency.each_with_object({}) { |(token, freq), hash|
         hash[token] = (freq / token_count.to_f).round(precision)
@@ -94,18 +105,18 @@ module WordsCounted
     #   Counter.new(%w[one once two two twice twice]).most_frequent_tokens
     #   # => { 'two' => 2, 'twice' => 2 }
     #
-    # @return [Hash
+    # @return [Hash{String => Integer}] A hash of tokens and their frequencies
     def most_frequent_tokens
       token_frequency.group_by(&:last).max.last.to_h
     end
 
-    # Returns a hash of tokens and their lengths for tokens with the highest length
+    # Returns a hash of tokens and their lengths for tokens with the highest length
     #
     # @example
     #   Counter.new(%w[one three five seven]).longest_tokens
     #   # => { 'three' => 5, 'seven' => 5 }
     #
-    # @return [Hash
+    # @return [Hash{String => Integer}] A hash of tokens and their lengths
     def longest_tokens
       token_lengths.group_by(&:last).max.last.to_h
     end
@@ -117,7 +128,8 @@ module WordsCounted
     #   Counter.new(%w[one three five seven]).average_chars_per_token
     #   # => 4.25
     #
-    # @
+    # @param [Integer] precision The number of decimal places to round average char count to
+    # @return [Float] The average char count per token
     def average_chars_per_token(precision: 2)
       (char_count / token_count.to_f).round(precision)
     end
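The `group_by(&:last).max.last.to_h` idiom used by `most_frequent_tokens` and `longest_tokens` above can be hard to read at a glance. Here is a minimal standalone sketch of what it does; the sample data is made up for illustration and is not taken from the gem:

```ruby
# Pairs as produced by #token_frequency: [token, count], sorted by count.
token_frequency = [["three", 3], ["two", 2], ["twice", 2], ["one", 1]]

# Group the pairs by their count (the last element of each pair).
grouped = token_frequency.group_by(&:last)
# => {3=>[["three", 3]], 2=>[["two", 2], ["twice", 2]], 1=>[["one", 1]]}

# Hash#max compares [key, value] entries, so it picks the highest count;
# #last takes that group's pairs, and #to_h turns them back into a hash.
grouped.max.last.to_h
# => {"three"=>3}
```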
data/lib/words_counted/deprecated.rb
CHANGED

@@ -1,6 +1,8 @@
 # -*- encoding : utf-8 -*-
 module WordsCounted
   module Deprecated
+    # The following methods are deprecated and will be removed in version 1.1.0.
+
     # @deprecated use `Counter#token_count`
     def word_count
       warn "`Counter#word_count` is deprecated, please use `Counter#token_count`"
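The hunk above cuts off after the warning line of `word_count`. Presumably each deprecated method then delegates to its replacement; a hedged sketch of that pattern follows (the delegation line is an assumption and is not shown in this view):

```ruby
module WordsCounted
  module Deprecated
    # @deprecated use `Counter#token_count`
    def word_count
      warn "`Counter#word_count` is deprecated, please use `Counter#token_count`"
      token_count # assumed delegation; the diff view ends before this line
    end
  end
end
```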
data/lib/words_counted/tokeniser.rb
CHANGED

@@ -5,28 +5,32 @@ module WordsCounted
     # Using `pattern` and `exclude` allows for powerful tokenisation strategies.
     #
     # @example
-    #   tokeniser
+    #   tokeniser
+    #     = WordsCounted::Tokeniser.new(
+    #       "We are all in the gutter, but some of us are looking at the stars."
+    #     )
     #   tokeniser.tokenise(exclude: "We are all in the gutter")
     #   # => ['but', 'some', 'of', 'us', 'are', 'looking', 'at', 'the', 'stars']
 
     # Default tokenisation strategy
     TOKEN_REGEXP = /[\p{Alpha}\-']+/
 
-    # Initialises state with
+    # Initialises state with the string to be tokenised.
     #
-    # @param [String] input The string to tokenise
-    # @return [Tokeniser]
+    # @param [String] input The string to tokenise
     def initialize(input)
       @input = input
     end
 
     # Converts a string into an array of tokens using a regular expression.
-    # If a regexp is not provided a default one is used. See
+    # If a regexp is not provided a default one is used. See `Tokenizer.TOKEN_REGEXP`.
     #
     # Use `exclude` to remove tokens from the final list. `exclude` can be a string,
     # a regular expression, a lambda, a symbol, or an array of one or more of those types.
     # This allows for powerful and flexible tokenisation strategies.
     #
+    # If a symbol is passed, it must name a predicate method.
+    #
     # @example
     #   WordsCounted::Tokeniser.new("Hello World").tokenise
     #   # => ['hello', 'world']
@@ -44,7 +48,9 @@ module WordsCounted
     #   # => ['dani']
     #
     # @example With `exclude` as a lambda
-    #   WordsCounted::Tokeniser.new("Goodbye Sami").tokenise(
+    #   WordsCounted::Tokeniser.new("Goodbye Sami").tokenise(
+    #     exclude: ->(token) { token.length > 6 }
+    #   )
     #   # => ['sami']
     #
     # @example With `exclude` as a symbol
@@ -52,26 +58,42 @@ module WordsCounted
     #   # => ['محمد']
     #
     # @example With `exclude` as an array of strings
-    #   WordsCounted::Tokeniser.new("Goodbye Sami and hello Dani").tokenise(
+    #   WordsCounted::Tokeniser.new("Goodbye Sami and hello Dani").tokenise(
+    #     exclude: ["goodbye hello"]
+    #   )
     #   # => ['sami', 'and', dani']
     #
     # @example With `exclude` as an array of regular expressions
-    #   WordsCounted::Tokeniser.new("Goodbye and hello Dani").tokenise(
+    #   WordsCounted::Tokeniser.new("Goodbye and hello Dani").tokenise(
+    #     exclude: [/goodbye/i, /and/i]
+    #   )
     #   # => ['hello', 'dani']
     #
     # @example With `exclude` as an array of lambdas
     #   t = WordsCounted::Tokeniser.new("Special Agent 007")
-    #   t.tokenise(
+    #   t.tokenise(
+    #     exclude: [
+    #       ->(t) { t.to_i.odd? },
+    #       ->(t) { t.length > 5}
+    #     ]
+    #   )
     #   # => ['agent']
     #
     # @example With `exclude` as a mixed array
     #   t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
-    #   t.tokenise(
-    #
-    #
-    #
-    #
-    #
+    #   t.tokenise(
+    #     exclude: [
+    #       :ascii_only?,
+    #       /محمد/,
+    #       ->(t) { t.length > 6},
+    #       "و"
+    #     ]
+    #   )
+    #   # => ["هي", "سامي", "وداني"]
+    #
+    # @param [Regexp] pattern The string to tokenise
+    # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil] exclude The filter to apply
+    # @return [Array] The array of filtered tokens
     def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
       filter_proc = filter_to_proc(exclude)
       @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
@@ -79,22 +101,31 @@ module WordsCounted
 
     private
 
-    #
-    # is then used to determine whether a token should be excluded from the final list
+    # The following methods convert any arguments into a callable object. The return value of this
+    # lambda is then used to determine whether a token should be excluded from the final list.
     #
     # `filter` can be a string, a regular expression, a lambda, a symbol, or an array
     # of any combination of those types.
     #
-    # If `filter` is a string,
-    #
+    # If `filter` is a string, it converts the string into an array, and returns a lambda
+    # that returns true if the token is included in the resulting array.
+    #
+    # @see {Tokeniser#filter_proc_from_string}.
+    #
+    # If `filter` is an array, it creates a new array where each element of the original is
+    # converted to a lambda, and returns a lambda that calls each lambda in the resulting array.
+    # If any lambda returns true the token is excluded from the final list.
+    #
+    # @see {Tokeniser#filter_procs_from_array}.
     #
     # If `filter` is a proc, then the proc is simply called. If `filter` is a regexp, a `lambda`
-    # is returned that checks the token for a match.
-    #
+    # is returned that checks the token for a match.
+    #
+    # If a symbol is passed, it is converted to a proc. The symbol must name a predicate method.
     #
     # This method depends on `nil` responding `to_a` with an empty array, which
     # avoids having to check if `exclude` was passed.
-
+
     # @api private
     def filter_to_proc(filter)
       if filter.respond_to?(:to_a)
@@ -113,10 +144,6 @@ module WordsCounted
       end
     end
 
-    # Converts an array of `filters` to an array of lambdas, and returns a lambda that calls
-    # each lambda in the resulting array. If any lambda returns true the token is excluded
-    # from the final list.
-    #
     # @api private
     def filter_procs_from_array(filter)
       filter_procs = Array(filter).map &method(:filter_to_proc)
@@ -125,9 +152,6 @@ module WordsCounted
       }
     end
 
-    # Converts a string `filter` to an array, and returns a lambda
-    # that returns true if the token is included in the array.
-    #
     # @api private
    def filter_proc_from_string(filter)
       normalized_exclusion_list = filter.split.map(&:downcase)
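The expanded comments above describe how `filter_to_proc` turns every accepted `exclude` value into a single callable, but this view only shows fragments of the implementation. Below is a rough sketch of that dispatch written as one recursive method for readability; it is not the gem's exact code, which splits the work across `filter_to_proc`, `filter_procs_from_array`, and `filter_proc_from_string`:

```ruby
# Sketch only: collapse the documented dispatch into one recursive method.
def filter_to_proc(filter)
  case filter
  when nil, Array
    # nil.to_a is [], so an absent filter excludes nothing.
    procs = Array(filter).map { |f| filter_to_proc(f) }
    ->(token) { procs.any? { |p| p.call(token) } }
  when String
    list = filter.split.map(&:downcase)
    ->(token) { list.include?(token) }
  when Regexp
    ->(token) { token =~ filter }
  when Symbol
    filter.to_proc # e.g. :ascii_only? sends token.ascii_only?
  else
    filter # assume an already-callable proc or lambda
  end
end

filter = filter_to_proc([:ascii_only?, ->(t) { t.length > 6 }])
filter.call("hello") # => true  (excluded: ASCII only)
filter.call("محمد")  # => false (kept)
```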
data/lib/words_counted.rb
CHANGED
@@ -19,10 +19,11 @@ module WordsCounted
   # @see Tokeniser.tokenise
   # @see Counter.initialize
   #
-  # @param [String] input The input to be tokenised
-  # @param [Hash] options The options to pass onto `Counter
+  # @param [String] input The input to be tokenised
+  # @param [Hash] options The options to pass onto `Counter`
+  # @return [WordsCounted::Counter] An instance of Counter
   def self.count(input, options = {})
-    tokens = Tokeniser.new(input).tokenise(options)
+    tokens = Tokeniser.new(input).tokenise(**options)
     Counter.new(tokens)
   end
 
@@ -32,11 +33,12 @@ module WordsCounted
   # @see Tokeniser.tokenise
   # @see Counter.initialize
   #
-  # @param [String] path The file to be read and tokenised
-  # @param [Hash] options The options to pass onto `Counter
+  # @param [String] path The file to be read and tokenised
+  # @param [Hash] options The options to pass onto `Counter`
+  # @return [WordsCounted::Counter] An instance of Counter
   def self.from_file(path, options = {})
     tokens = File.open(path) do |file|
-      Tokeniser.new(file.read).tokenise(options)
+      Tokeniser.new(file.read).tokenise(**options)
     end
     Counter.new(tokens)
   end
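The `tokenise(options)` → `tokenise(**options)` change above is the Ruby 3.0 compatibility fix noted in the changelog: Ruby 3 no longer converts a trailing Hash argument into keyword arguments. A small illustration follows; the method body is a simplified stand-in, not the gem's code:

```ruby
# Simplified stand-in for Tokeniser#tokenise, which accepts only keyword arguments.
def tokenise(pattern: /[\p{Alpha}\-']+/, exclude: nil)
  { pattern: pattern, exclude: exclude }
end

options = { exclude: "hello" }

tokenise(**options) # works on Ruby 2 and 3: the hash is splatted into keywords
tokenise(options)   # Ruby 3: ArgumentError (wrong number of arguments);
                    # Ruby 2.7: deprecation warning, then implicit conversion
```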
data/words_counted.gemspec
CHANGED
@@ -19,7 +19,7 @@ Gem::Specification.new do |spec|
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
   spec.require_paths = ["lib"]
 
-  spec.add_development_dependency "bundler"
+  spec.add_development_dependency "bundler"
   spec.add_development_dependency "rake"
   spec.add_development_dependency "rspec"
   spec.add_development_dependency "pry"
metadata
CHANGED
@@ -1,29 +1,29 @@
 --- !ruby/object:Gem::Specification
 name: words_counted
 version: !ruby/object:Gem::Version
-  version: 1.0.2
+  version: 1.0.3
 platform: ruby
 authors:
 - Mohamad El-Husseini
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2021-10-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
     - !ruby/object:Gem::Version
-      version: '
+      version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
     - !ruby/object:Gem::Version
-      version: '
+      version: '0'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -78,6 +78,7 @@ files:
 - ".hound.yml"
 - ".rspec"
 - ".ruby-style.yml"
+- ".ruby-version"
 - ".travis.yml"
 - ".yardopts"
 - CHANGELOG.md
@@ -102,7 +103,7 @@ homepage: https://github.com/abitdodgy/words_counted
 licenses:
 - MIT
 metadata: {}
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -117,9 +118,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 - !ruby/object:Gem::Version
   version: '0'
 requirements: []
-
-
-signing_key:
+rubygems_version: 3.2.15
+signing_key:
 specification_version: 4
 summary: See README.
 test_files: