words_counted 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +133 -172
- data/lib/refinements/hash_refinements.rb +2 -0
- data/lib/words_counted/tokeniser.rb +2 -2
- data/lib/words_counted/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9c56052462cafa83864d9f6ae8fcc9edcdd67356
+  data.tar.gz: 562629bb6b6f61d45a2e40ae0fab5b07e749576e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 96a6ee9b686893bef4552aedd4dd05f696bf5735cb57c9de65b08f7a6ed6c0a4cbceddd27f8309d0f0522753ff0c556fa5b8ec7bf949a1c3c7c07d9ad1b11337
+  data.tar.gz: 05251ab8e6efa3b29ebd88d7c5ddf82c20bfa3c07321a7a66a9f261c4b1162cdae600cd991b4b243c94d660fc2bfd45d779098f8dcda9f0165474965d8d5b9b8
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # WordsCounted
 
-WordsCounted is a
+WordsCounted is a Ruby NLP (natural language processor). WordsCounted lets you implement powerful tokenisation strategies with a very flexible tokeniser class. [Consult the documentation][2] for more information.
 
 <a href="http://badge.fury.io/rb/words_counted">
   <img src="https://badge.fury.io/rb/words_counted@2x.png" alt="Gem Version" height="18">
@@ -8,25 +8,18 @@ WordsCounted is a highly customisable Ruby text analyser. Consult the features f
 
 ### Demo
 
-Visit [
+Visit [this website][4] for an example of what the gem can do.
 
 ### Features
 
-*
-*
-*
-*
-*
-*
-
-
-* The longest word(s) and its length
-* The most occurring word(s) and its number of occurrences.
-* Count invividual strings for occurrences.
-* A flexible way to exclude words (or anything) from the count. You can pass a **string**, a **regexp**, an **array**, or a **lambda**.
-* Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
-  * Filters special characters but respects hyphens and apostrophes.
-  * Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`.
+* Out of the box, get the following data from any string or readable file, or URL:
+  * Token count and unique token count
+  * Token densities, frequencies, and lengths
+  * Char count and average chars per token
+  * The longest tokens and their lengths
+  * The most frequent tokens and their frequencies.
+* A flexible way to exclude tokens from the tokeniser. You can pass a **string**, **regexp**, **symbol**, **lambda**, or an **array** of any combination of those types for powerful tokenisation strategies.
+* Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and unicode characters): *Bayrūt* is treated as `["Bayrūt"]` and not `["Bayr", "ū", "t"]`, for example.
 * Opens and reads files. Pass in a file path or a url instead of a string.
 
 See usage instructions for more details.
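The diacritics behaviour claimed in the rewritten feature list can be checked with plain Ruby, using the default pattern documented later in this diff; this is a sketch that does not require the gem itself:

```ruby
# The gem's documented default token pattern: letters, hyphens, apostrophes.
# \p{Alpha} is Unicode-aware, so a letter with a diacritic such as "ū"
# stays inside its token instead of splitting it.
pattern = /[\p{Alpha}\-']+/

p "Bayrūt".scan(pattern)     # => ["Bayrūt"]
p "São Paulo".scan(pattern)  # => ["São", "Paulo"]
```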
@@ -58,62 +51,68 @@ counter = WordsCounted.count(
 counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
 ```
 
-
+`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `Counter` initialized with the tokens. The `Tokeniser` and `Counter` classes can be used alone, however.
 
-
+## API
 
-
+**`WordsCounted.count(input, options = {})`**
 
-
+Tokenises input and initializes a `Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("Hello Beirut!")
 ````
 
-Accepts two options: `exclude` and `regexp`. See [Excluding
+Accepts two options: `exclude` and `regexp`. See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] respectively.
 
-
+**`WordsCounted.from_file(path, options = {})`**
 
-
+Reads and tokenises a file, and initializes a `Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("hello_beirut.txt")
 ````
 
-Accepts the same options as
+Accepts the same options as `.count`.
 
-###
+### Tokeniser
 
-
+The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules as long as they can boil down into a lambda.
 
-
+Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.
+
+**`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**
 
 ```ruby
-
+tokeniser = Tokeniser.new("Hello Beirut!").tokenise
+
+# With `exclude`
+tokeniser = Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
+
+# With `pattern`
+tokeniser = Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
 ```
 
-
+See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] for more information.
 
-
+### Counter
 
-
-counter.word_occurrences
+The `Counter` class allows you to collect various statistics from an array of tokens.
 
-
-
-
-
-
-
-}
+**`#token_count`**
+
+Returns the token count of a given string.
+
+```ruby
+counter.token_count #=> 15
 ```
 
-
+**`#token_frequency`**
 
-Returns a two
+Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
 
-```
-counter.
+```
+counter.token_frequency
 
 [
   ["the", 2],
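The `#token_frequency` behaviour documented in this hunk can be sketched with plain Ruby's `Enumerable#tally`. The sample text below is the quote the README's figures (15 tokens, 13 unique) appear to come from; the sketch is illustrative, not the gem's implementation:

```ruby
# Sketch of the documented #token_frequency: tally tokens, then sort
# by count in descending order (ties in unspecified order, i.e. unstable).
text = "We are all in the gutter, but some of us are looking at the stars."
tokens = text.downcase.scan(/[\p{Alpha}\-']+/)

token_frequency = tokens.tally.sort_by { |_token, count| -count }

p tokens.size               # => 15
p tokens.uniq.size          # => 13
p token_frequency.first[1]  # => 2  ("are" and "the" both appear twice)
```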
@@ -124,38 +123,22 @@ counter.sorted_word_occurrences
 ]
 ```
 
-
+**`#most_frequent_tokens`**
 
-Returns a
+Returns a hash where each key-value pair is a token and its frequency.
 
 ```ruby
-counter.
+counter.most_frequent_tokens
 
-
+{ "are" => 2, "the" => 2 }
 ```
 
-
+**`#token_lengths`**
 
-Returns
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.
 
 ```ruby
-counter.
-
-{
-  "We" => 2,
-  "are" => 3,
-  "all" => 3,
-  # ...
-  "stars" => 5
-}
-```
-
-#### `.sorted_word_lengths`
-
-Returns a two dimensional array of words and their lengths sorted in descending order.
-
-```ruby
-counter.sorted_word_lengths
+counter.token_lengths
 
 [
   ["looking", 7],
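The `#token_lengths` shape described in this hunk (token/length pairs, longest first) reduces to a map plus a sort in plain Ruby. This is a sketch of the documented behaviour, not the gem's code:

```ruby
# Pair each distinct token with its length, then sort longest first,
# matching the documented #token_lengths output shape.
text = "We are all in the gutter, but some of us are looking at the stars."
tokens = text.downcase.scan(/[\p{Alpha}\-']+/)

token_lengths = tokens.uniq.map { |t| [t, t.length] }.sort_by { |_t, len| -len }

p token_lengths.first  # => ["looking", 7]
```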
@@ -166,133 +149,124 @@ counter.sorted_word_lengths
 ]
 ```
 
-
+**`#longest_tokens`**
 
-Returns a
+Returns a hash where each key-value pair is a token and its length.
 
-```ruby
-counter.longest_words
-
-[ ["looking", 7] ]
-```
-
-#### `.words`
-
-Returns an array of words resulting from the string passed into the initialize method.
 
 ```ruby
-counter.
-
+counter.longest_tokens
+
+{ "looking" => 7 }
 ```
 
-
+**`#token_density([ precision: 2 ])`**
 
-Returns a two-
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to a precision of two. The array is sorted by density in descending order. It accepts a `precision` argument, which must be a float.
 
 ```ruby
-counter.
+counter.token_density
 
 [
-  ["are", 13
-  ["the", 13
-  ["but",
+  ["are", 0.13],
+  ["the", 0.13],
+  ["but", 0.07],
   # ...
-  ["we",
+  ["we", 0.07]
 ]
 ```
 
-
+**`#char_count`**
 
-Returns the
+Returns the char count of tokens.
 
 ```ruby
-counter.char_count
+counter.char_count #=> 76
 ```
 
-
+**`#average_chars_per_token([ precision: 2 ])`**
 
-Returns the average
+Returns the average char count per token rounded to two decimal places. Accepts a precision argument which defaults to two. Precision must be a float.
 
 ```ruby
-counter.
+counter.average_chars_per_token #=> 4
 ```
 
-
+**`#unique_token_count`**
 
-Returns the
+Returns the number of unique tokens.
 
 ```ruby
-counter.
+counter.unique_token_count #=> 13
 ```
 
-
+## Excluding tokens from the tokeniser
 
-
+You can exclude anything you want from the input by passing the `exclude` option. The exclude option accepts a variety of filters and is extremely flexible.
 
-
-
-
-
-
+1. A *space-delimited* string. The filter will normalise the string.
+2. A regular expression.
+3. A lambda.
+4. A symbol that is convertible to a proc. For example `:odd?`.
+5. An array of any combination of the above.
 
-You can exclude anything you want from the string you want to analyse by passing in the `exclude` option. The exclude option accepts a variety of filters.
-
-1. A *space-delimited* list of candidates. The filter will remove both uppercase and lowercase variants of the candidate when applicable. Useful for excluding *the*, *a*, and so on.
-2. An array of string candidates. For example: `['a', 'the']`.
-3. A regular expression.
-4. A lambda.
-
-#### Using a string
 ```ruby
-
-
+tokeniser =
+  WordsCounted::Tokeniser.new(
+    "Magnificent! That was magnificent, Trevor."
+  )
+
+# Using a string
+tokeniser.tokenise(exclude: "was magnificent")
+tokeniser.tokens
+# => ["that", "trevor"]
+
+# Using a regular expression
+tokeniser.tokenise(exclude: /Trevor/)
+tokeniser.tokens
+# => ["that", "was", "magnificent"]
+
+# Using a lambda
+tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
+tokeniser.tokens
+# => ["magnificent", "trevor"]
+
+# Using a symbol
+tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
+tokeniser.tokenise(exclude: :ascii_only?)
+# => ["محمد"]
+
+# Using an array
+tokeniser = WordsCounted::Tokeniser.new(
+  "Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
 )
-
-
-
-
-#### Using an array
-```ruby
-WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ['1', '2', '3'])
-counter.words
-#=> ["4", "5", "6"]
-```
-
-#### Using a regular expression
-```ruby
-WordsCounted.count("Hello Beirut", exclude: /Beirut/)
-counter.words
-#=> ["Hello"]
-```
-
-#### Using a lambda
-```ruby
-WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ->(w) { w.to_i.even? })
-counter.words
-#=> ["1", "3", "5"]
+tokeniser.tokenise(
+  exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"]
+)
+# => ["هي", "سامي", "ودان"]
 ```
 
 ## Passing in a Custom Regexp
 
-
+The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.
 
 ```ruby
 /[\p{Alpha}\-']+/
 ```
 
-
+You can pass your own criteria as a Ruby regular expression to split your string as desired.
 
-For example, if you wanted to include numbers
+For example, if you wanted to include numbers, you can override the regular expression:
 
 ```ruby
 counter = WordsCounted.count("Numbers 1, 2, and 3", regexp: /[\p{Alnum}\-']+/)
-counter.
+counter.tokens
 #=> ["Numbers", "1", "2", "and", "3"]
 ```
 
 ## Opening and Reading Files
 
-Use the `from_file` method to open files. `from_file` accepts the same options as
+Use the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.
 
 ```ruby
 counter = WordsCounted.from_file("url/or/path/to/file.text")
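The exclude filters enumerated above all "boil down into a lambda", as the Tokeniser section puts it. A hypothetical reduction, with names of my own choosing rather than the gem's internals, could look like:

```ruby
# Hypothetical reduction of the five documented exclude filter kinds into
# one predicate. Illustrative only; the gem's actual internals may differ.
def exclude_to_predicate(filter)
  case filter
  when String then ->(t) { filter.downcase.split.include?(t) }
  when Regexp then ->(t) { t.match?(filter) }
  when Symbol then filter.to_proc
  when Proc   then filter
  when Array  then ->(t) { filter.any? { |f| exclude_to_predicate(f).call(t) } }
  when nil    then ->(_t) { false }
  end
end

tokens = ["that", "was", "magnificent", "trevor"]
p tokens.reject(&exclude_to_predicate("was magnificent"))  # => ["that", "trevor"]
p tokens.reject(&exclude_to_predicate(/trevor/))           # => ["that", "was", "magnificent"]
```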
@@ -300,37 +274,32 @@ counter = WordsCounted.from_file("url/or/path/to/file.text")
 
 ## Gotchas
 
-A hyphen used in leu of an *em* or *en* dash will form part of the
+A hyphen used in lieu of an *em* or *en* dash will form part of the token. This affects the tokeniser algorithm.
 
 ```ruby
 counter = WordsCounted.count("How do you do?-you are well, I see.")
-counter.
-
-{
-  "how" => 1,
-  "do" => 2,
-  "you" => 1,
-  "-you" => 1, # WTF, mate!
-  "are" => 1,
-  "very" => 1,
-  "well" => 1,
-  "i" => 1,
-  "see" => 1
-}
-```
+counter.token_frequency
 
-
+[
+  ["do", 2],
+  ["how", 1],
+  ["you", 1],
+  ["-you", 1], # WTF, mate!
+  ["are", 1],
+  # ...
+]
+```
 
-
+In this example `-you` and `you` are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
 
 ### A note on case sensitivity
 
-The program will downcase all incoming strings for consistency.
+The program will normalise (downcase) all incoming strings for consistency and filtering.
 
 ## Road Map
 
 1. Add ability to open URLs.
-2. Add
+2. Add Ngram support.
 
 #### Ability to read URLs
 
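The hyphen gotcha in this hunk is easy to reproduce with the default pattern in plain Ruby (a sketch, without the gem):

```ruby
# A hyphen standing in for a dash glues itself to the following token,
# so "do?-you" yields "-you" as its own token.
tokens = "How do you do?-you are well, I see.".downcase.scan(/[\p{Alpha}\-']+/)

p tokens.include?("-you")  # => true
p tokens.count("do")       # => 2
```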
@@ -342,21 +311,13 @@ def self.from_url
 end
 ```
 
-## But wait... wait a minute...
-
-#### Isn't it better to write this in JavaScript?
-
-![Picard face-palm](http://stream1.gifsoup.com/view3/1290449/picard-facepalm-o.gif "Picard face-palm")
-
 ## About
 
 Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on [Code Review][1].
 
 ## Contributors
 
-
-
-Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and improving the filter feature to well beyond what I can come up with.
+See [contributors][3]. Not listed there is [Dave Yarwood][1].
 
 ## Contributing
 
@@ -368,8 +329,8 @@ Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and imp
 
 
 [1]: http://codereview.stackexchange.com/questions/46105/a-ruby-string-analyser
-[2]:
-[3]:
+[2]: http://www.rubydoc.info/gems/words_counted
+[3]: https://github.com/abitdodgy/words_counted/graphs/contributors
 [4]: http://rubywordcount.com
-[5]: https://github.com/abitdodgy/words_counted#excluding-
+[5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-analyser
 [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp
data/lib/words_counted/tokeniser.rb
CHANGED
@@ -67,10 +67,10 @@ module WordsCounted
     # @example With `exclude` as a mixed array
     #   t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
     #   t.tokenise(exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"])
-    #   # =>
+    #   # => ["هي", "سامي", "ودان"]
     #
     # @param [Regexp] pattern The string to tokenise.
-    # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol nil] exclude The filter to apply.
+    # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil] exclude The filter to apply.
     # @return [Array] the array of filtered tokens.
    def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
      filter_proc = filter_to_proc(exclude)