words_counted 1.0.0 → 1.0.1
- checksums.yaml +4 -4
- data/README.md +133 -172
- data/lib/refinements/hash_refinements.rb +2 -0
- data/lib/words_counted/tokeniser.rb +2 -2
- data/lib/words_counted/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9c56052462cafa83864d9f6ae8fcc9edcdd67356
+  data.tar.gz: 562629bb6b6f61d45a2e40ae0fab5b07e749576e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 96a6ee9b686893bef4552aedd4dd05f696bf5735cb57c9de65b08f7a6ed6c0a4cbceddd27f8309d0f0522753ff0c556fa5b8ec7bf949a1c3c7c07d9ad1b11337
+  data.tar.gz: 05251ab8e6efa3b29ebd88d7c5ddf82c20bfa3c07321a7a66a9f261c4b1162cdae600cd991b4b243c94d660fc2bfd45d779098f8dcda9f0165474965d8d5b9b8
data/README.md
CHANGED
@@ -1,6 +1,6 @@
 # WordsCounted
 
-WordsCounted is a
+WordsCounted is a Ruby NLP (natural language processor). WordsCounted lets you implement powerful tokenisation strategies with a very flexible tokeniser class. [Consult the documentation][2] for more information.
 
 <a href="http://badge.fury.io/rb/words_counted">
   <img src="https://badge.fury.io/rb/words_counted@2x.png" alt="Gem Version" height="18">
@@ -8,25 +8,18 @@ WordsCounted is a highly customisable Ruby text analyser. Consult the features f
 
 ### Demo
 
-Visit [
+Visit [this website][4] for an example of what the gem can do.
 
 ### Features
 
-*
-*
-*
-*
-*
-*
-
-
-* The longest word(s) and its length
-* The most occurring word(s) and its number of occurrences.
-* Count invividual strings for occurrences.
-* A flexible way to exclude words (or anything) from the count. You can pass a **string**, a **regexp**, an **array**, or a **lambda**.
-* Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
-  * Filters special characters but respects hyphens and apostrophes.
-  * Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as `["São", "Paulo"]` and not `["S", "", "o", "Paulo"]`.
+* Out of the box, get the following data from any string or readable file, or URL:
+  * Token count and unique token count
+  * Token densities, frequencies, and lengths
+  * Char count and average chars per token
+  * The longest tokens and their lengths
+  * The most frequent tokens and their frequencies.
+* A flexible way to exclude tokens from the tokeniser. You can pass a **string**, **regexp**, **symbol**, **lambda**, or an **array** of any combination of those types for powerful tokenisation strategies.
+* Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (UTF and unicode characters): *Bayrūt* is treated as `["Bayrūt"]` and not `["Bayr", "ū", "t"]`, for example.
 * Opens and reads files. Pass in a file path or a url instead of a string.
 
 See usage instructions for more details.
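The diacritics behaviour claimed in the rewritten feature list can be checked with plain Ruby, using the default pattern documented later in this diff; this is a sketch that does not require the gem itself:

```ruby
# The gem's documented default token pattern: letters, hyphens, apostrophes.
# \p{Alpha} is Unicode-aware, so a letter with a diacritic such as "ū"
# stays inside its token instead of splitting it.
pattern = /[\p{Alpha}\-']+/

p "Bayrūt".scan(pattern)     # => ["Bayrūt"]
p "São Paulo".scan(pattern)  # => ["São", "Paulo"]
```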
@@ -58,62 +51,68 @@ counter = WordsCounted.count(
 counter = WordsCounted.from_file("path/or/url/to/my/file.txt")
 ```
 
-
+`.count` and `.from_file` are convenience methods that take an input, tokenise it, and return an instance of `Counter` initialized with the tokens. The `Tokeniser` and `Counter` classes can be used alone, however.
 
-
+## API
 
-
+**`WordsCounted.count(input, options = {})`**
 
-
+Tokenises input and initializes a `Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("Hello Beirut!")
 ````
 
-Accepts two options: `exclude` and `regexp`. See [Excluding
+Accepts two options: `exclude` and `regexp`. See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] respectively.
 
-
+**`WordsCounted.from_file(path, options = {})`**
 
-
+Reads and tokenises a file, and initializes a `Counter` object with the resulting tokens.
 
 ```ruby
 counter = WordsCounted.count("hello_beirut.txt")
 ````
 
-Accepts the same options as
+Accepts the same options as `.count`.
 
-###
+### Tokeniser
 
-
+The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules as long as they can boil down into a lambda.
 
-
+Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.
+
+**`#tokenise([pattern: TOKEN_REGEXP, exclude: nil])`**
 
 ```ruby
-
+tokeniser = Tokeniser.new("Hello Beirut!").tokenise
+
+# With `exclude`
+tokeniser = Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")
+
+# With `pattern`
+tokeniser = Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)
 ```
 
-
+See [Excluding tokens from the analyser][5] and [Passing in a custom regexp][6] for more information.
 
-
+### Counter
 
-
-counter.word_occurrences
+The `Counter` class allows you to collect various statistics from an array of tokens.
 
-
-
-
-
-
-
-}
+**`#token_count`**
+
+Returns the token count of a given string.
+
+```ruby
+counter.token_count #=> 15
 ```
 
-
+**`#token_frequency`**
 
-Returns a two
+Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.
 
-```
-counter.
+```
+counter.token_frequency
 
 [
   ["the", 2],
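The `#token_frequency` behaviour documented in this hunk can be sketched with plain Ruby's `Enumerable#tally`. The sample text below is the quote the README's figures (15 tokens, 13 unique) appear to come from; the sketch is illustrative, not the gem's implementation:

```ruby
# Sketch of the documented #token_frequency: tally tokens, then sort
# by count in descending order (ties in unspecified order, i.e. unstable).
text = "We are all in the gutter, but some of us are looking at the stars."
tokens = text.downcase.scan(/[\p{Alpha}\-']+/)

token_frequency = tokens.tally.sort_by { |_token, count| -count }

p tokens.size               # => 15
p tokens.uniq.size          # => 13
p token_frequency.first[1]  # => 2  ("are" and "the" both appear twice)
```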
@@ -124,38 +123,22 @@ counter.sorted_word_occurrences
 ]
 ```
 
-
+**`#most_frequent_tokens`**
 
-Returns a
+Returns a hash where each key-value pair is a token and its frequency.
 
 ```ruby
-counter.
+counter.most_frequent_tokens
 
-
+{ "are" => 2, "the" => 2 }
 ```
 
-
+**`#token_lengths`**
 
-Returns
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.
 
 ```ruby
-counter.
-
-{
-  "We" => 2,
-  "are" => 3,
-  "all" => 3,
-  # ...
-  "stars" => 5
-}
-```
-
-#### `.sorted_word_lengths`
-
-Returns a two dimensional array of words and their lengths sorted in descending order.
-
-```ruby
-counter.sorted_word_lengths
+counter.token_lengths
 
 [
   ["looking", 7],
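The `#token_lengths` shape described in this hunk (token/length pairs, longest first) reduces to a map plus a sort in plain Ruby. This is a sketch of the documented behaviour, not the gem's code:

```ruby
# Pair each distinct token with its length, then sort longest first,
# matching the documented #token_lengths output shape.
text = "We are all in the gutter, but some of us are looking at the stars."
tokens = text.downcase.scan(/[\p{Alpha}\-']+/)

token_lengths = tokens.uniq.map { |t| [t, t.length] }.sort_by { |_t, len| -len }

p token_lengths.first  # => ["looking", 7]
```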
@@ -166,133 +149,124 @@ counter.sorted_word_lengths
 ]
 ```
 
-
+**`#longest_tokens`**
 
-Returns a
+Returns a hash where each key-value pair is a token and its length.
 
-```ruby
-counter.longest_words
-
-[ ["looking", 7] ]
-```
-
-#### `.words`
-
-Returns an array of words resulting from the string passed into the initialize method.
 
 ```ruby
-counter.
-
+counter.longest_tokens
+
+{ "looking" => 7 }
 ```
 
-
+**`#token_density([ precision: 2 ])`**
 
-Returns a two-
+Returns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to a precision of two. The array is sorted by density in descending order. It accepts a `precision` argument, which must be a float.
 
 ```ruby
-counter.
+counter.token_density
 
 [
-  ["are", 13
-  ["the", 13
-  ["but",
+  ["are", 0.13],
+  ["the", 0.13],
+  ["but", 0.07],
   # ...
-  ["we",
+  ["we", 0.07]
 ]
 ```
 
-
+**`#char_count`**
 
-Returns the
+Returns the char count of tokens.
 
 ```ruby
-counter.char_count
+counter.char_count #=> 76
 ```
 
-
+**`#average_chars_per_token([ precision: 2 ])`**
 
-Returns the average
+Returns the average char count per token rounded to two decimal places. Accepts a precision argument which defaults to two. Precision must be a float.
 
 ```ruby
-counter.
+counter.average_chars_per_token #=> 4
 ```
 
-
+**`#unique_token_count`**
 
-Returns the
+Returns the number of unique tokens.
 
 ```ruby
-counter.
+counter.unique_token_count #=> 13
 ```
 
-
+## Excluding tokens from the tokeniser
 
-
+You can exclude anything you want from the input by passing the `exclude` option. The exclude option accepts a variety of filters and is extremely flexible.
 
-
-
-
-
-
+1. A *space-delimited* string. The filter will normalise the string.
+2. A regular expression.
+3. A lambda.
+4. A symbol that is convertible to a proc. For example `:odd?`.
+5. An array of any combination of the above.
 
-You can exclude anything you want from the string you want to analyse by passing in the `exclude` option. The exclude option accepts a variety of filters.
-
-1. A *space-delimited* list of candidates. The filter will remove both uppercase and lowercase variants of the candidate when applicable. Useful for excluding *the*, *a*, and so on.
-2. An array of string candidates. For example: `['a', 'the']`.
-3. A regular expression.
-4. A lambda.
-
-#### Using a string
 ```ruby
-
-
+tokeniser =
+  WordsCounted::Tokeniser.new(
+    "Magnificent! That was magnificent, Trevor."
+  )
+
+# Using a string
+tokeniser.tokenise(exclude: "was magnificent")
+tokeniser.tokens
+# => ["that", "trevor"]
+
+# Using a regular expression
+tokeniser.tokenise(exclude: /Trevor/)
+tokeniser.tokens
+# => ["that", "was", "magnificent"]
+
+# Using a lambda
+tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
+tokeniser.tokens
+# => ["magnificent", "trevor"]
+
+# Using a symbol
+tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
+tokeniser.tokenise(exclude: :ascii_only?)
+# => ["محمد"]
+
+# Using an array
+tokeniser = WordsCounted::Tokeniser.new(
+  "Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
 )
-
-
-
-
-#### Using an array
-```ruby
-WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ['1', '2', '3'])
-counter.words
-#=> ["4", "5", "6"]
-```
-
-#### Using a regular expression
-```ruby
-WordsCounted.count("Hello Beirut", exclude: /Beirut/)
-counter.words
-#=> ["Hello"]
-```
-
-#### Using a lambda
-```ruby
-WordsCounted.count("1 2 3 4 5 6", regexp: /[0-9]/, exclude: ->(w) { w.to_i.even? })
-counter.words
-#=> ["1", "3", "5"]
+tokeniser.tokenise(
+  exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"]
+)
+# => ["هي", "سامي", "ودان"]
 ```
 
 ## Passing in a Custom Regexp
 
-
+The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means *twenty-one* is treated as one token. So is *Mohamad's*.
 
 ```ruby
 /[\p{Alpha}\-']+/
 ```
 
-
+You can pass your own criteria as a Ruby regular expression to split your string as desired.
 
-For example, if you wanted to include numbers
+For example, if you wanted to include numbers, you can override the regular expression:
 
 ```ruby
 counter = WordsCounted.count("Numbers 1, 2, and 3", regexp: /[\p{Alnum}\-']+/)
-counter.
+counter.tokens
 #=> ["Numbers", "1", "2", "and", "3"]
 ```
 
 ## Opening and Reading Files
 
-Use the `from_file` method to open files. `from_file` accepts the same options as
+Use the `from_file` method to open files. `from_file` accepts the same options as `.count`. The file path can be a URL.
 
 ```ruby
 counter = WordsCounted.from_file("url/or/path/to/file.text")
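The exclude filters enumerated above all "boil down into a lambda", as the Tokeniser section puts it. A hypothetical reduction, with names of my own choosing rather than the gem's internals, could look like:

```ruby
# Hypothetical reduction of the five documented exclude filter kinds into
# one predicate. Illustrative only; the gem's actual internals may differ.
def exclude_to_predicate(filter)
  case filter
  when String then ->(t) { filter.downcase.split.include?(t) }
  when Regexp then ->(t) { t.match?(filter) }
  when Symbol then filter.to_proc
  when Proc   then filter
  when Array  then ->(t) { filter.any? { |f| exclude_to_predicate(f).call(t) } }
  when nil    then ->(_t) { false }
  end
end

tokens = ["that", "was", "magnificent", "trevor"]
p tokens.reject(&exclude_to_predicate("was magnificent"))  # => ["that", "trevor"]
p tokens.reject(&exclude_to_predicate(/trevor/))           # => ["that", "was", "magnificent"]
```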
@@ -300,37 +274,32 @@ counter = WordsCounted.from_file("url/or/path/to/file.text")
 
 ## Gotchas
 
-A hyphen used in leu of an *em* or *en* dash will form part of the
+A hyphen used in lieu of an *em* or *en* dash will form part of the token. This affects the tokeniser algorithm.
 
 ```ruby
 counter = WordsCounted.count("How do you do?-you are well, I see.")
-counter.
-
-{
-  "how" => 1,
-  "do" => 2,
-  "you" => 1,
-  "-you" => 1, # WTF, mate!
-  "are" => 1,
-  "very" => 1,
-  "well" => 1,
-  "i" => 1,
-  "see" => 1
-}
-```
+counter.token_frequency
 
-
+[
+  ["do", 2],
+  ["how", 1],
+  ["you", 1],
+  ["-you", 1], # WTF, mate!
+  ["are", 1],
+  # ...
+]
+```
 
-
+In this example `-you` and `you` are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.
 
 ### A note on case sensitivity
 
-The program will downcase all incoming strings for consistency.
+The program will normalise (downcase) all incoming strings for consistency and filtering.
 
 ## Road Map
 
 1. Add ability to open URLs.
-2. Add
+2. Add Ngram support.
 
 #### Ability to read URLs
 
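The hyphen gotcha in this hunk is easy to reproduce with the default pattern in plain Ruby (a sketch, without the gem):

```ruby
# A hyphen standing in for a dash glues itself to the following token,
# so "do?-you" yields "-you" as its own token.
tokens = "How do you do?-you are well, I see.".downcase.scan(/[\p{Alpha}\-']+/)

p tokens.include?("-you")  # => true
p tokens.count("do")       # => 2
```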
@@ -342,21 +311,13 @@ def self.from_url
 end
 ```
 
-## But wait... wait a minute...
-
-#### Isn't it better to write this in JavaScript?
-
-![Picard face-palm](http://stream1.gifsoup.com/view3/1290449/picard-facepalm-o.gif "Picard face-palm")
-
 ## About
 
 Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on [Code Review][1].
 
 ## Contributors
 
-
-
-Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and improving the filter feature to well beyond what I can come up with.
+See [contributors][3]. Not listed there is [Dave Yarwood][1].
 
 ## Contributing
 
@@ -368,8 +329,8 @@ Thanks to [Wayne Conrad][2] for providing [an excellent code review][3], and imp
 
 
 [1]: http://codereview.stackexchange.com/questions/46105/a-ruby-string-analyser
-[2]:
-[3]:
+[2]: http://www.rubydoc.info/gems/words_counted
+[3]: https://github.com/abitdodgy/words_counted/graphs/contributors
 [4]: http://rubywordcount.com
-[5]: https://github.com/abitdodgy/words_counted#excluding-
+[5]: https://github.com/abitdodgy/words_counted#excluding-tokens-from-the-analyser
 [6]: https://github.com/abitdodgy/words_counted#passing-in-a-custom-regexp
data/lib/words_counted/tokeniser.rb
CHANGED
@@ -67,10 +67,10 @@ module WordsCounted
     # @example With `exclude` as a mixed array
     #   t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
     #   t.tokenise(exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6}, "و"])
-    #   # =>
+    #   # => ["هي", "سامي", "ودان"]
     #
     # @param [Regexp] pattern The string to tokenise.
-    # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol nil] exclude The filter to apply.
+    # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil] exclude The filter to apply.
     # @return [Array] the array of filtered tokens.
    def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
      filter_proc = filter_to_proc(exclude)