words_counted 0.1.5 → 1.0.3
- checksums.yaml +5 -5
- data/.gitignore +1 -0
- data/.hound.yml +2 -0
- data/.ruby-style.yml +2 -0
- data/.ruby-version +1 -0
- data/.travis.yml +9 -0
- data/.yardopts +3 -2
- data/CHANGELOG.md +29 -0
- data/README.md +146 -189
- data/lib/refinements/hash_refinements.rb +14 -0
- data/lib/words_counted/counter.rb +113 -72
- data/lib/words_counted/deprecated.rb +78 -0
- data/lib/words_counted/tokeniser.rb +163 -0
- data/lib/words_counted/version.rb +1 -1
- data/lib/words_counted.rb +31 -4
- data/spec/words_counted/counter_spec.rb +49 -204
- data/spec/words_counted/deprecated_spec.rb +99 -0
- data/spec/words_counted/tokeniser_spec.rb +133 -0
- data/spec/words_counted_spec.rb +34 -0
- data/words_counted.gemspec +2 -2
- metadata +25 -12
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz:
-  data.tar.gz:
+SHA256:
+  metadata.gz: a248654f9f76e28bde0f54993a5c5c87504acffed42b1531acc9de7f385f0696
+  data.tar.gz: c057a7ecb20d7989651b6667f39d16820734e63dd751a0182406f268ecf0f347
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2c4a5028624393434586c7570e8a6c98785c6cedfc3a6f5c07b7fa9b8aba2880ddf847be8779f623df8e36becb8e148aeaabfae822dcc4f0c9b1db414f8c7916
+  data.tar.gz: e115d757c34480e9e7425db94f6c78a035b4464c69946aa31cbb45ea28f963dc1088a1617269b506001669753b2725abf4f0b708303ced59aa5c59cb1658096c
data/.gitignore
CHANGED
data/.hound.yml
ADDED
data/.ruby-style.yml
ADDED
data/.ruby-version
ADDED
@@ -0,0 +1 @@
+3.0.1
data/.travis.yml
ADDED
data/.yardopts
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,3 +1,32 @@
+## Version 1.0.3
+
+1. Adds support for Ruby 3.0.0.
+2. Improves documentation and adds newer configs for Travis CI and Hound.
+
+## Version 1.0
+
+This version brings lots of improvements to code organisation. The tokeniser has been extracted into its own class. All methods in `Counter` have been either renamed or deprecated. Deprecated methods and their tests have been moved into their own modules. Using them will trigger warnings with the upgrade instructions outlined below.
+
+1. Extracted tokenisation behaviour from `Counter` into a `Tokeniser` class.
+2. Deprecated all methods that have `word` in their name. Most are renamed such that `word` becomes `token`. They will be removed in version 1.1.
+  - Deprecated `word_count` in favour of `token_count`
+  - Deprecated `unique_word_count` in favour of `unique_token_count`
+  - Deprecated `word_occurrences` and `sorted_word_occurrences` in favour of `token_frequency`
+  - Deprecated `word_lengths` and `sorted_word_lengths` in favour of `token_lengths`
+  - Deprecated `word_density` in favour of `token_density`
+  - Deprecated `most_occurring_words` in favour of `most_frequent_tokens`
+  - Deprecated `longest_words` in favour of `longest_tokens`
+  - Deprecated `average_chars_per_word` in favour of `average_chars_per_token`
+  - Deprecated `count`. Use `Array#count` instead.
+3. `token_lengths`, which replaces `word_lengths`, returns a sorted two-dimensional array instead of a hash. It behaves exactly like `sorted_word_lengths`, which has been deprecated. Use `token_lengths.to_h` for the old behaviour.
+4. `token_frequency`, which replaces `word_occurrences`, returns a sorted two-dimensional array instead of a hash. It behaves like `sorted_word_occurrences`, which has been deprecated. Use `token_frequency.to_h` for the old behaviour.
+5. `token_density`, which replaces `word_density`, returns a decimal with a precision of 2, not a percent. Use `token_density * 100` for the old behaviour.
+6. Added a refinement to Hash under `lib/refinements/hash_refinements.rb` to quickly sort by descending value.
+7. Extracted all deprecated methods to their own module, and their tests to their own spec file.
+8. Added a base `words_counted_spec.rb` and moved the `.from_file` test to the new file.
+9. Added Travis continuous integration.
+10. Added documentation to the code.
+
 ## Version 0.1.5
 
 1. Removed `to_f` from the dividend in `average_chars_per_word` and `word_densities`. The divisor is a float, and dividing by a float returns a float.
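For instance, a minimal sketch of the upgrade path described above (method outputs assumed from the README examples that follow):

```ruby
require "words_counted"

counter = WordsCounted.count("We are all in the gutter")

# Pre-1.0 names still work but emit deprecation warnings,
# and will be removed in 1.1:
counter.word_count            # warns; use `token_count` instead

# 1.0 replacements:
counter.token_count           #=> 6
counter.token_lengths.to_h    # hash, like the deprecated `word_lengths`
counter.token_frequency.to_h  # hash, like the deprecated `word_occurrences`

# `token_density` now returns decimals; scale by 100 for the old percentages.
counter.token_density.map { |token, density| [token, density * 100] }
```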
data/README.md
CHANGED
@@ -1,36 +1,35 @@
 # WordsCounted
 
-
+> We are all in the gutter, but some of us are looking at the stars.
+>
+> -- Oscar Wilde
+
+WordsCounted is a Ruby NLP (natural language processor). WordsCounted lets you implement powerful tokenisation strategies with a very flexible tokeniser class.
+
+**Are you using WordsCounted to do something interesting?** Please [tell me about it][8].
 
 <a href="http://badge.fury.io/rb/words_counted">
 <img src="https://badge.fury.io/rb/words_counted@2x.png" alt="Gem Version" height="18">
 </a>
 
+[RubyDoc documentation][7].
+
 ### Demo
 
-Visit [
+Visit [this website][4] for one example of what you can do with WordsCounted.
 
 ### Features
 
-*
-*
-*
-*
-*
-*
-
-
-* The longest word(s) and its length
-* The most occurring word(s) and its number of occurrences.
-* Count invividual strings for occurrences.
-* A flexible way to exclude words (or anything) from the count. You can pass a **string**, a **regexp**, an **array**, or a **lambda**.
-* Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
-  * Filters special characters but respects hyphens and apostrophes.
-  * Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`.
+* Out of the box, get the following data from any string, readable file, or URL:
+  * Token count and unique token count
+  * Token densities, frequencies, and lengths
+  * Char count and average chars per token
+  * The longest tokens and their lengths
+  * The most frequent tokens and their frequencies
+* A flexible way to exclude tokens from the tokeniser. You can pass a **string**, **regexp**, **symbol**, **lambda**, or an **array** of any combination of those types for powerful tokenisation strategies.
+* Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and unicode characters): *Bayrūt* is treated as `["Bayrūt"]` and not `["Bayr", "ū", "t"]`, for example.
 * Opens and reads files. Pass in a file path or a URL instead of a string.
 
-See usage instructions for more details.
-
 ## Installation
 
 Add this line to your application's Gemfile:
@@ -58,62 +57,70 @@ counter = WordsCounted.count(
 counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
 ```
 
+`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `WordsCounted::Counter` initialized with the tokens. The `WordsCounted::Tokeniser` and `WordsCounted::Counter` classes can be used alone, however.
+
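Since the two classes stand alone, a sketch like the following should also work (constructor arguments inferred from the API notes below; output values assumed):

```ruby
require "words_counted"

# Tokenise by hand, then build a Counter from the resulting tokens.
tokens  = WordsCounted::Tokeniser.new("We are all in the gutter").tokenise
counter = WordsCounted::Counter.new(tokens)

tokens              #=> ["we", "are", "all", "in", "the", "gutter"]
counter.token_count #=> 6
```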
 ## API
 
-###
+### WordsCounted
 
-
+**`WordsCounted.count(input, options = {})`**
 
-
+Tokenises input and initializes a `WordsCounted::Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("Hello Beirut!")
 ```
 
-Accepts two options: `exclude` and `regexp`. See [Excluding
+Accepts two options: `exclude` and `regexp`. See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] respectively.
 
-
+**`WordsCounted.from_file(path, options = {})`**
 
-
+Reads and tokenises a file, and initializes a `WordsCounted::Counter` object with the resulting tokens.
 
 ```ruby
-counter = WordsCounted.
+counter = WordsCounted.from_file("hello_beirut.txt")
 ```
 
-Accepts the same options as
+Accepts the same options as `.count`.
+
+### Tokeniser
 
-
+The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules as long as they can boil down to a lambda.
 
-
+Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.
 
-
+**`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**
 
 ```ruby
-
+tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise
+
+# With `exclude`
+tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
+
+# With `pattern`
+tokeniser = WordsCounted::Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
 ```
 
-
+See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] for more information.
 
-
+### Counter
 
-
-counter.word_occurrences
+The `WordsCounted::Counter` class allows you to collect various statistics from an array of tokens.
 
-
-
-
-
-
-
-}
+**`#token_count`**
+
+Returns the token count of a given string.
+
+```ruby
+counter.token_count #=> 15
 ```
 
-
+**`#token_frequency`**
 
-Returns a two
+Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
 
 ```ruby
-counter.
+counter.token_frequency
 
 [
   ["the", 2],
@@ -124,38 +131,22 @@ counter.sorted_word_occurrences
 ]
 ```
 
-
-
-Returns a two dimensional array of the most occurring word and its number of occurrences. In case there is a tie all tied words are returned.
-
-```ruby
-counter.most_occurring_words
-
-[ ["are", 2], ["the", 2] ]
-```
-
-#### `.word_lengths`
+**`#most_frequent_tokens`**
 
-Returns
+Returns a hash where each key-value pair is a token and its frequency.
 
 ```ruby
-counter.
+counter.most_frequent_tokens
 
-{
-  "We" => 2,
-  "are" => 3,
-  "all" => 3,
-  # ...
-  "stars" => 5
-}
+{ "are" => 2, "the" => 2 }
 ```
 
-
+**`#token_lengths`**
 
-Returns a two
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.
 
 ```ruby
-counter.
+counter.token_lengths
 
 [
   ["looking", 7],
@@ -166,133 +157,121 @@ counter.sorted_word_lengths
 ]
 ```
 
-
-
-Returns a two dimensional array of the longest word and its length. In case there is a tie all tied words are returned.
+**`#longest_tokens`**
 
-
-counter.longest_words
-
-[ ["looking", 7] ]
-```
-
-#### `.words`
+Returns a hash where each key-value pair is a token and its length.
 
-Returns an array of words resulting from the string passed into the initialize method.
 
 ```ruby
-counter.
-
+counter.longest_tokens
+
+{ "looking" => 7 }
 ```
 
-
+**`#token_density([ precision: 2 ])`**
 
-Returns a two-
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to a precision of two. The array is sorted by density in descending order. It accepts a `precision` argument, which must be a float.
 
 ```ruby
-counter.
+counter.token_density
 
 [
-  ["are", 13
-  ["the", 13
-  ["but",
+  ["are", 0.13],
+  ["the", 0.13],
+  ["but", 0.07],
   # ...
-  ["we",
+  ["we", 0.07]
 ]
 ```
 
|
-
|
187
|
+
**`#char_count`**
|
205
188
|
|
206
|
-
Returns the
|
189
|
+
Returns the char count of tokens.
|
207
190
|
|
208
191
|
```ruby
|
209
|
-
counter.char_count
|
192
|
+
counter.char_count #=> 76
|
210
193
|
```
|
211
194
|
|
212
|
-
|
195
|
+
**`#average_chars_per_token([ precision: 2 ])`**
|
213
196
|
|
214
|
-
Returns the average
|
197
|
+
Returns the average char count per token rounded to two decimal places. Accepts a precision argument which defaults to two. Precision must be a float.
|
215
198
|
|
216
199
|
```ruby
|
217
|
-
counter.
|
200
|
+
counter.average_chars_per_token #=> 4
|
218
201
|
```
|
219
202
|
|
220
|
-
|
203
|
+
**`#uniq_token_count`**
|
221
204
|
|
222
|
-
Returns the
|
205
|
+
Returns the number of unique tokens.
|
223
206
|
|
224
207
|
```ruby
|
225
|
-
counter.
|
208
|
+
counter.uniq_token_count #=> 13
|
226
209
|
```
|
227
210
|
|
228
|
-
|
211
|
+
## Excluding tokens from the tokeniser
|
229
212
|
|
230
|
-
|
213
|
+
You can exclude anything you want from the input by passing the `exclude` option. The exclude option accepts a variety of filters and is extremely flexible.
|
231
214
|
|
232
|
-
|
233
|
-
|
234
|
-
|
215
|
+
1. A *space-delimited* string. The filter will normalise the string.
|
216
|
+
2. A regular expression.
|
217
|
+
3. A lambda.
|
218
|
+
4. A symbol that names a predicate method. For example `:odd?`.
|
219
|
+
5. An array of any combination of the above.
|
235
220
|
|
236
|
-
## Excluding words from the analyser
|
237
|
-
|
238
|
-
You can exclude anything you want from the string you want to analyse by passing in the `exclude` option. The exclude option accepts a variety of filters.
|
239
|
-
|
240
|
-
1. A *space-delimited* list of candidates. The filter will remove both uppercase and lowercase variants of the candidate when applicable. Useful for excluding *the*, *a*, and so on.
|
241
|
-
2. An array of string candidates. For example: `['a', 'the']`.
|
242
|
-
3. A regular expression.
|
243
|
-
4. A lambda.
|
244
|
-
|
245
|
-
#### Using a string
|
246
221
|
```ruby
|
247
|
-
|
248
|
-
|
222
|
+
tokeniser =
|
223
|
+
WordsCounted::Tokeniser.new(
|
224
|
+
"Magnificent! That was magnificent, Trevor."
|
225
|
+
)
|
226
|
+
|
227
|
+
# Using a string
|
228
|
+
tokeniser.tokenise(exclude: "was magnificent")
|
229
|
+
# => ["that", "trevor"]
|
230
|
+
|
231
|
+
# Using a regular expression
|
232
|
+
tokeniser.tokenise(exclude: /trevor/)
|
233
|
+
# => ["magnificent", "that", "was", "magnificent"]
|
234
|
+
|
235
|
+
# Using a lambda
|
236
|
+
tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
|
237
|
+
# => ["magnificent", "that", "magnificent", "trevor"]
|
238
|
+
|
239
|
+
# Using symbol
|
240
|
+
tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
|
241
|
+
tokeniser.tokenise(exclude: :ascii_only?)
|
242
|
+
# => ["محمد"]
|
243
|
+
|
244
|
+
# Using an array
|
245
|
+
tokeniser = WordsCounted::Tokeniser.new(
|
246
|
+
"Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
|
249
247
|
)
|
250
|
-
|
251
|
-
|
252
|
-
|
253
|
-
|
254
|
-
#### Using an array
|
255
|
-
```ruby
|
256
|
-
WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ['1', '2', '3'])
|
257
|
-
counter.words
|
258
|
-
#=> ["4", "5", "6"]
|
259
|
-
```
|
260
|
-
|
261
|
-
#### Using a regular expression
|
262
|
-
```ruby
|
263
|
-
WordsCounted.count("Hello Beirut", exclude: /Beirut/)
|
264
|
-
counter.words
|
265
|
-
#=> ["Hello"]
|
266
|
-
```
|
267
|
-
|
268
|
-
#### Using a lambda
|
269
|
-
```ruby
|
270
|
-
WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ->(w) { w.to_i.even? })
|
271
|
-
counter.words
|
272
|
-
#=> ["1", "3", "5"]
|
248
|
+
tokeniser.tokenise(
|
249
|
+
exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"]
|
250
|
+
)
|
251
|
+
# => ["هي", "سامي", "وداني"]
|
273
252
|
```
|
274
253
|
|
275
|
-
## Passing in a
|
254
|
+
## Passing in a custom regexp
|
276
255
|
|
277
|
-
|
256
|
+
The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.
|
278
257
|
|
279
258
|
```ruby
|
280
259
|
/[\p{Alpha}\-']+/
|
281
260
|
```
|
282
261
|
|
283
|
-
|
262
|
+
You can pass your own criteria as a Ruby regular expression to split your string as desired.
|
284
263
|
|
285
|
-
For example, if you wanted to include numbers
|
264
|
+
For example, if you wanted to include numbers, you can override the regular expression:
|
286
265
|
|
287
266
|
```ruby
|
288
|
-
counter = WordsCounted.count("Numbers 1, 2, and 3",
|
289
|
-
counter.
|
290
|
-
#=> ["
|
267
|
+
counter = WordsCounted.count("Numbers 1, 2, and 3", pattern: /[\p{Alnum}\-']+/)
|
268
|
+
counter.tokens
|
269
|
+
#=> ["numbers", "1", "2", "and", "3"]
|
291
270
|
```
|
292
271
|
|
293
|
-
## Opening and
|
272
|
+
## Opening and reading files
|
294
273
|
|
295
|
-
Use the `from_file` method to open files. `from_file` accepts the same options as
|
274
|
+
Use the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.
|
296
275
|
|
297
276
|
```ruby
|
298
277
|
counter = WordsCounted.from_file("url/or/path/to/file.text")
|
@@ -300,41 +279,31 @@ counter = WordsCounted.from_file("url/or/path/to/file.text")
 
 ## Gotchas
 
-A hyphen used in leu of an *em* or *en* dash will form part of the
+A hyphen used in lieu of an *em* or *en* dash will form part of the token. This affects the tokeniser algorithm.
 
 ```ruby
 counter = WordsCounted.count("How do you do?-you are well, I see.")
-counter.
-
-{
-  "how" => 1,
-  "do" => 2,
-  "you" => 1,
-  "-you" => 1, # WTF, mate!
-  "are" => 1,
-  "very" => 1,
-  "well" => 1,
-  "i" => 1,
-  "see" => 1
-}
-```
+counter.token_frequency
 
-
+[
+  ["do", 2],
+  ["how", 1],
+  ["you", 1],
+  ["-you", 1], # WTF, mate!
+  ["are", 1],
+  # ...
+]
+```
 
-
+In this example `-you` and `you` are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
 
 ### A note on case sensitivity
 
-The program will downcase all incoming strings for consistency.
+The program normalises (downcases) all incoming strings for consistency and so that filters match.
 
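For example, a small sketch of that normalisation (output assumed from the downcasing rule):

```ruby
WordsCounted.count("Hello HELLO hello").token_frequency
#=> [["hello", 3]]
```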
-##
+## Roadmap
 
-
-2. Add paragraph, sentence, average words per sentence, and average sentence chars counters.
-
-#### Ability to read URLs
-
-Something like...
+### Ability to open URLs
 
 ```ruby
 def self.from_url
@@ -342,21 +311,9 @@ def self.from_url
 end
 ```
 
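The stub above is only a roadmap note. One hypothetical way to flesh it out, assuming `open-uri` and mirroring `.count` (none of this is the gem's actual code):

```ruby
require "open-uri"

module WordsCounted
  # Hypothetical: fetch a remote document and tokenise its body,
  # returning a Counter like `.count` and `.from_file` do.
  def self.from_url(url, options = {})
    count(URI.parse(url).read, options)
  end
end
```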
-## But wait... wait a minute...
-
-#### Isn't it better to write this in JavaScript?
-
-![Picard face-palm](http://stream1.gifsoup.com/view3/1290449/picard-facepalm-o.gif "Picard face-palm")
-
-## About
-
-Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on [Code Review][1].
-
 ## Contributors
 
-
-
-Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and improving the filter feature to well beyond what I can come up with.
+See [contributors][3]. Not listed there is [Dave Yarwood][1].
 
 ## Contributing
 
@@ -366,10 +323,10 @@ Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and imp
 4. Push to the branch (`git push origin my-new-feature`)
 5. Create new Pull Request
 
-
-[
-[2]: https://github.com/wconrad
-[3]: http://codereview.stackexchange.com/a/49476/1563
+[2]: http://www.rubydoc.info/gems/words_counted
+[3]: https://github.com/abitdodgy/words_counted/graphs/contributors
 [4]: http://rubywordcount.com
-[5]: https://github.com/abitdodgy/words_counted#excluding-
+[5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-analyser
 [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp
+[7]: http://www.rubydoc.info/gems/words_counted/
+[8]: https://github.com/abitdodgy/words_counted/issues/new
data/lib/refinements/hash_refinements.rb
ADDED
@@ -0,0 +1,14 @@
+# -*- encoding : utf-8 -*-
+module Refinements
+  module HashRefinements
+    refine Hash do
+      # A convenience method to sort a hash into an
+      # array of tuples by descending value.
+      #
+      # @return [Array<Array>] a sorted (unstable) array of candidates
+      def sort_by_value_desc
+        sort_by(&:last).reverse
+      end
+    end
+  end
+end