words_counted 0.1.5 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: cba04e2004b13b0ee7b99e46cdf6549f6aebe2f6
- data.tar.gz: 885d494f7f2b2af40f59ed08aaca1db7ec89a54b
+ metadata.gz: d6302c1802d7da076d1ddafdcbe70e46a89c8f33
+ data.tar.gz: 873efaa5e58f883e0dde99094ca53952d46217c7
  SHA512:
- metadata.gz: e2009cd4b401da2b43047699a073a3f541654384d831d73c0d436016eb88325e29c179a59961c6d1d8d48a865f34a2da78e014a28a5e0cf4ccf714cafa7a6bb5
- data.tar.gz: f46e0031db714c0985ef4b2dee5d1f294c9ab0bdb629157110af0b26b76280bfe440207b4f6920156681cc91ded0246e3e66b6dcf26717208cc73ebbe4e86821
+ metadata.gz: 0e6ddb8db9c060432066d86aed2efe20aa95dee2019d54c950007170c0ffbbcff16fa27a0377419b0d1b718be1625a4376ee9c687a4ae67073aaffe9ef363157
+ data.tar.gz: 9df2a0cefe14b9ac77d1741f8980d1b1fb4d8b770738fbd69c8870f73da4b653a1d9462ac8813f88dc48af36e03718773523985f5be0f4999177a6b0a2a89662
data/.gitignore CHANGED
@@ -15,3 +15,4 @@ spec/reports
  test/tmp
  test/version_tmp
  tmp
+ .idea/
data/.hound.yml ADDED
@@ -0,0 +1,2 @@
+ ruby:
+   config_file: .ruby-style.yml
data/.ruby-style.yml ADDED
@@ -0,0 +1,2 @@
+ Style/IfUnlessModifier:
+   MaxLineLength: 120
data/.travis.yml ADDED
@@ -0,0 +1,9 @@
+ language: ruby
+
+ rvm:
+ - 2.1
+ - 2.2
+ - ruby-head
+
+ gemfile:
+ - Gemfile
data/.yardopts CHANGED
@@ -1,3 +1,4 @@
- --title 'Word Counter for Ruby'
+ --title 'Ruby natural language processor'
  --private
- --markup markdown
+ --markup markdown
+ --hide-api private
data/CHANGELOG.md CHANGED
@@ -1,3 +1,27 @@
+ ## Version 1.0
+
+ This version brings many improvements to code organisation. The tokeniser has been extracted into its own class. All methods in `Counter` have either been renamed or deprecated. Deprecated methods and their tests have moved into their own modules, and using them will trigger warnings with the upgrade instructions outlined below.
+
+ 1. Extracted tokenisation behaviour from `Counter` into a `Tokeniser` class.
+ 2. Deprecated all methods that have `word` in their name. Most are renamed such that `word` became `token`. They will be removed in version 1.1.
+     - Deprecated `word_count` in favor of `token_count`
+     - Deprecated `unique_word_count` in favor of `uniq_token_count`
+     - Deprecated `word_occurrences` and `sorted_word_occurrences` in favor of `token_frequency`
+     - Deprecated `word_lengths` and `sorted_word_lengths` in favor of `token_lengths`
+     - Deprecated `word_density` in favor of `token_density`
+     - Deprecated `most_occurring_words` in favor of `most_frequent_tokens`
+     - Deprecated `longest_words` in favor of `longest_tokens`
+     - Deprecated `average_chars_per_word` in favor of `average_chars_per_token`
+     - Deprecated `count`. Use `Array#count` instead.
+ 3. `token_lengths`, which replaces `word_lengths`, returns a sorted two-dimensional array instead of a hash. It behaves exactly like `sorted_word_lengths`, which has been deprecated. Use `token_lengths.to_h` for the old behaviour.
+ 4. `token_frequency`, which replaces `word_occurrences`, returns a sorted two-dimensional array instead of a hash. It behaves like `sorted_word_occurrences`, which has been deprecated. Use `token_frequency.to_h` for the old behaviour.
+ 5. `token_density`, which replaces `word_density`, returns a decimal with a precision of 2, not a percent. Use `token_density * 100` for the old behaviour.
+ 6. Added a refinement to Hash under `lib/refinements/hash_refinements.rb` to quickly sort by descending value.
+ 7. Extracted all deprecated methods to their own module, and their tests to their own spec file.
+ 8. Added a base `words_counted_spec.rb` and moved the `.from_file` test to the new file.
+ 9. Added Travis continuous integration.
+ 10. Added documentation to the code.
+
  ## Version 0.1.5
 
  1. Removed `to_f` from the dividend in `average_chars_per_word` and `word_densities`. The divisor is a float, and dividing by a float returns a float.
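
To make the renames concrete, here is a minimal migration sketch based on the changelog items above; the illustrative outputs assume the 1.0.0 return types described in items 3–5:

```ruby
require "words_counted"

counter = WordsCounted.count("Hello hello world")

# Deprecated 0.1.5 call           # 1.0.0 replacement
counter.word_count                # counter.token_count        => 3
counter.sorted_word_occurrences   # counter.token_frequency    => [["hello", 2], ["world", 1]]
counter.word_lengths              # counter.token_lengths.to_h => { "hello" => 5, "world" => 5 }
counter.word_density              # counter.token_density      => [["hello", 0.67], ["world", 0.33]]
```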
data/lib/refinements/hash_refinements.rb ADDED
@@ -0,0 +1,10 @@
+ # -*- encoding : utf-8 -*-
+ module Refinements
+   module HashRefinements
+     refine Hash do
+       def sort_by_value_desc
+         sort_by(&:last).reverse
+       end
+     end
+   end
+ end
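
The refinement only applies in scopes that opt in with `using`; a minimal usage sketch:

```ruby
require "refinements/hash_refinements"

using Refinements::HashRefinements

# Sorts pairs by value, highest first, returning an array of [key, value] pairs.
{ "one" => 1, "three" => 3, "two" => 2 }.sort_by_value_desc
# => [["three", 3], ["two", 2], ["one", 1]]
```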
data/lib/words_counted/counter.rb CHANGED
@@ -1,96 +1,128 @@
  # -*- encoding : utf-8 -*-
- module WordsCounted
-   class Counter
-     attr_reader :words, :word_occurrences, :word_lengths, :char_count
-
-     WORD_REGEXP = /[\p{Alpha}\-']+/
 
-     def self.from_file(path, options = {})
-       File.open(path) do |file|
-         new file.read, options
-       end
-     end
-
-     def initialize(string, options = {})
-       @options = options
-       exclude = filter_proc(options[:exclude])
-       @words = string.scan(regexp).map(&:downcase).reject { |word| exclude.call(word) }
-       @char_count = words.join.size
-       @word_occurrences = words.each_with_object(Hash.new(0)) { |word, hash| hash[word] += 1 }
-       @word_lengths = words.each_with_object({}) { |word, hash| hash[word] ||= word.length }
-     end
+ require "words_counted/deprecated"
 
-     def word_count
-       words.size
-     end
+ module WordsCounted
+   using Refinements::HashRefinements
 
-     def unique_word_count
-       words.uniq.size
-     end
+   class Counter
+     include Deprecated
 
-     def average_chars_per_word(precision = 2)
-       (char_count / word_count.to_f).round(precision)
-     end
+     attr_reader :tokens
 
-     def most_occurring_words
-       highest_ranking word_occurrences
+     def initialize(tokens)
+       @tokens = tokens
      end
 
-     def longest_words
-       highest_ranking word_lengths
+     # Returns the number of tokens.
+     #
+     # @example
+     #   Counter.new(%w[one two two three three three]).token_count
+     #   # => 6
+     #
+     # @return [Integer] The number of tokens.
+     def token_count
+       tokens.size
      end
 
-     def word_density(precision = 2)
-       word_densities = word_occurrences.each_with_object({}) do |(word, occ), hash|
-         hash[word] = (occ / word_count.to_f * 100).round(precision)
-       end
-       sort_by_descending_value word_densities
+     # Returns the number of unique tokens.
+     #
+     # @example
+     #   Counter.new(%w[one two two three three three]).uniq_token_count
+     #   # => 3
+     #
+     # @return [Integer] The number of unique tokens.
+     def uniq_token_count
+       tokens.uniq.size
      end
 
-     def sorted_word_occurrences
-       sort_by_descending_value word_occurrences
+     # Returns the character count of all tokens.
+     #
+     # @example
+     #   Counter.new(%w[one two]).char_count
+     #   # => 6
+     #
+     # @return [Integer] The total char count of tokens.
+     def char_count
+       tokens.join.size
      end
 
-     def sorted_word_lengths
-       sort_by_descending_value word_lengths
+     # Returns a sorted two-dimensional array where each member array is a token and its frequency.
+     # The array is sorted by frequency in descending order.
+     #
+     # @example
+     #   Counter.new(%w[one two two three three three]).token_frequency
+     #   # => [ ['three', 3], ['two', 2], ['one', 1] ]
+     #
+     # @return [Array<Array<String, Integer>>]
+     def token_frequency
+       tokens.each_with_object(Hash.new(0)) { |token, hash| hash[token] += 1 }.sort_by_value_desc
      end
 
-     def count(match)
-       words.select { |word| word == match.downcase }.size
+     # Returns a sorted two-dimensional array where each member array is a token and its length.
+     # The array is sorted by length in descending order.
+     #
+     # @example
+     #   Counter.new(%w[one two three four five]).token_lengths
+     #   # => [ ['three', 5], ['four', 4], ['five', 4], ['one', 3], ['two', 3] ]
+     #
+     # @return [Array<Array<String, Integer>>]
+     def token_lengths
+       tokens.uniq.each_with_object({}) { |token, hash| hash[token] = token.length }.sort_by_value_desc
      end
 
-     private
-
-     def highest_ranking(entries)
-       entries.group_by { |_, value| value }.sort.last.last
+     # Returns a sorted two-dimensional array where each member array is a token and its density
+     # as a float, rounded to a precision of two decimal places. It accepts a `precision` argument
+     # which defaults to `2`.
+     #
+     # @example
+     #   Counter.new(%w[Maj. Major Major Major]).token_density
+     #   # => [ ['major', 0.75], ['maj', 0.25] ]
+     #
+     # @example with `precision`
+     #   Counter.new(%w[Maj. Major Major Major]).token_density(precision: 4)
+     #   # => [ ['major', 0.7500], ['maj', 0.2500] ]
+     #
+     # @param [Integer] precision The number of decimal places to round density to.
+     # @return [Array<Array<String, Float>>]
+     def token_density(precision: 2)
+       token_frequency.each_with_object({}) { |(token, freq), hash|
+         hash[token] = (freq / token_count.to_f).round(precision)
+       }.sort_by_value_desc
      end
 
-     def sort_by_descending_value(entries)
-       entries.sort_by { |_, value| value }.reverse
+     # Returns a hash of tokens and their frequencies for tokens with the highest frequency.
+     #
+     # @example
+     #   Counter.new(%w[one once two two twice twice]).most_frequent_tokens
+     #   # => { 'two' => 2, 'twice' => 2 }
+     #
+     # @return [Hash<String, Integer>]
+     def most_frequent_tokens
+       token_frequency.group_by(&:last).max.last.to_h
      end
 
-     def regexp
-       @options[:regexp] || WORD_REGEXP
+     # Returns a hash of tokens and their lengths for tokens with the highest length.
+     #
+     # @example
+     #   Counter.new(%w[one three five seven]).longest_tokens
+     #   # => { 'three' => 5, 'seven' => 5 }
+     #
+     # @return [Hash<String, Integer>]
+     def longest_tokens
+       token_lengths.group_by(&:last).max.last.to_h
      end
 
-     def filter_proc(filter)
-       if filter.respond_to?(:to_a)
-         filter_procs = Array(filter).map(&method(:filter_proc))
-         ->(word) {
-           filter_procs.any? { |p| p.call(word) }
-         }
-       elsif filter.respond_to?(:to_str)
-         exclusion_list = filter.split.collect(&:downcase)
-         ->(word) {
-           exclusion_list.include?(word)
-         }
-       elsif regexp_filter = Regexp.try_convert(filter)
-         Proc.new { |word| word =~ regexp_filter }
-       elsif filter.respond_to?(:to_proc)
-         filter.to_proc
-       else
-         raise ArgumentError, "Filter must String, Array, Lambda, or a Regexp"
-       end
+     # Returns the average char count per token rounded to a precision of two decimal places.
+     # Accepts a `precision` argument.
+     #
+     # @example
+     #   Counter.new(%w[one three five seven]).average_chars_per_token
+     #   # => 4.25
+     #
+     # @return [Float] The average char count per token.
+     def average_chars_per_token(precision: 2)
+       (char_count / token_count.to_f).round(precision)
      end
    end
  end
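
Note the changed constructor: `Counter` no longer tokenises input itself; it now expects an array of tokens, typically produced by `Tokeniser`. A minimal sketch of the new flow:

```ruby
require "words_counted"

tokens  = WordsCounted::Tokeniser.new("We are all in the gutter").tokenise
counter = WordsCounted::Counter.new(tokens)

counter.token_count     # => 6
counter.char_count      # => 19
counter.longest_tokens  # => { "gutter" => 6 }
```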
data/lib/words_counted/deprecated.rb ADDED
@@ -0,0 +1,76 @@
+ # -*- encoding : utf-8 -*-
+ module WordsCounted
+   module Deprecated
+     # @deprecated use `Counter#token_count`
+     def word_count
+       warn "`Counter#word_count` is deprecated, please use `Counter#token_count`"
+       token_count
+     end
+
+     # @deprecated use `Counter#uniq_token_count`
+     def unique_word_count
+       warn "`Counter#unique_word_count` is deprecated, please use `Counter#uniq_token_count`"
+       uniq_token_count
+     end
+
+     # @deprecated use `Counter#token_frequency`
+     def word_occurrences
+       warn "`Counter#word_occurrences` is deprecated, please use `Counter#token_frequency`"
+       warn "`Counter#token_frequency` returns a sorted array of arrays, not a hash. Call `token_frequency.to_h` for the old behaviour"
+       token_frequency.to_h
+     end
+
+     # @deprecated use `Counter#token_lengths`
+     def word_lengths
+       warn "`Counter#word_lengths` is deprecated, please use `Counter#token_lengths`"
+       warn "`Counter#token_lengths` returns a sorted array of arrays, not a hash. Call `token_lengths.to_h` for the old behaviour"
+       token_lengths.to_h
+     end
+
+     # @deprecated use `Counter#token_density`
+     def word_density(precision = 2)
+       warn "`Counter#word_density` is deprecated, please use `Counter#token_density`"
+       warn "`Counter#token_density` returns density as a decimal, not a percent"
+
+       token_density(precision: precision * 2).map { |tuple| [tuple.first, (tuple.last * 100).round(precision)] }
+     end
+
+     # @deprecated use `Counter#token_frequency`
+     def sorted_word_occurrences
+       warn "`Counter#sorted_word_occurrences` is deprecated, please use `Counter#token_frequency`"
+       token_frequency
+     end
+
+     # @deprecated use `Counter#token_lengths`
+     def sorted_word_lengths
+       warn "`Counter#sorted_word_lengths` is deprecated, please use `Counter#token_lengths`"
+       token_lengths
+     end
+
+     # @deprecated use `Counter#most_frequent_tokens`
+     def most_occurring_words
+       warn "`Counter#most_occurring_words` is deprecated, please use `Counter#most_frequent_tokens`"
+       warn "`Counter#most_frequent_tokens` returns a hash, not an array. Call `most_frequent_tokens.to_a` for the old behaviour."
+       most_frequent_tokens.to_a
+     end
+
+     # @deprecated use `Counter#longest_tokens`
+     def longest_words
+       warn "`Counter#longest_words` is deprecated, please use `Counter#longest_tokens`"
+       warn "`Counter#longest_tokens` returns a hash, not an array. Call `longest_tokens.to_a` for the old behaviour."
+       longest_tokens.to_a
+     end
+
+     # @deprecated use `Counter#average_chars_per_token`
+     def average_chars_per_word(precision = 2)
+       warn "`Counter#average_chars_per_word` is deprecated, please use `Counter#average_chars_per_token`"
+       average_chars_per_token(precision: precision)
+     end
+
+     # @deprecated use `Array#count`
+     def count(token)
+       warn "`Counter#count` is deprecated, please use `Array#count`"
+       tokens.count(token.downcase)
+     end
+   end
+ end
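
Each shim forwards to its replacement after emitting a warning on standard error, so 0.1.5 call sites keep working during the transition. For example:

```ruby
counter = WordsCounted.count("one two two")

counter.word_count
# warns: `Counter#word_count` is deprecated, please use `Counter#token_count`
# => 3
```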
data/lib/words_counted/tokeniser.rb ADDED
@@ -0,0 +1,139 @@
+ # -*- encoding : utf-8 -*-
+ module WordsCounted
+   class Tokeniser
+     # Takes a string and breaks it into an array of tokens.
+     # Using `pattern` and `exclude` allows for powerful tokenisation strategies.
+     #
+     # @example
+     #   tokeniser = WordsCounted::Tokeniser.new("We are all in the gutter, but some of us are looking at the stars.")
+     #   tokeniser.tokenise(exclude: "We are all in the gutter")
+     #   # => ['but', 'some', 'of', 'us', 'looking', 'at', 'stars']
+
+     # Default tokenisation strategy
+     TOKEN_REGEXP = /[\p{Alpha}\-']+/
+
+     # Initialises state with a string that will be tokenised.
+     #
+     # @param [String] input The string to tokenise.
+     # @return [Tokeniser]
+     def initialize(input)
+       @input = input
+     end
+
+     # Converts a string into an array of tokens using a regular expression.
+     # If a regexp is not provided, a default one is used. See {Tokeniser::TOKEN_REGEXP}.
+     #
+     # Use `exclude` to remove tokens from the final list. `exclude` can be a string,
+     # a regular expression, a lambda, a symbol, or an array of one or more of those types.
+     # This allows for powerful and flexible tokenisation strategies.
+     #
+     # @example
+     #   WordsCounted::Tokeniser.new("Hello World").tokenise
+     #   # => ['hello', 'world']
+     #
+     # @example With `pattern`
+     #   WordsCounted::Tokeniser.new("Hello-Mohamad").tokenise(pattern: /[^-]+/)
+     #   # => ['hello', 'mohamad']
+     #
+     # @example With `exclude` as a string
+     #   WordsCounted::Tokeniser.new("Hello Sami").tokenise(exclude: "hello")
+     #   # => ['sami']
+     #
+     # @example With `exclude` as a regexp
+     #   WordsCounted::Tokeniser.new("Hello Dani").tokenise(exclude: /hello/i)
+     #   # => ['dani']
+     #
+     # @example With `exclude` as a lambda
+     #   WordsCounted::Tokeniser.new("Goodbye Sami").tokenise(exclude: ->(token) { token.length > 6 })
+     #   # => ['sami']
+     #
+     # @example With `exclude` as a symbol
+     #   WordsCounted::Tokeniser.new("Hello محمد").tokenise(exclude: :ascii_only?)
+     #   # => ['محمد']
+     #
+     # @example With `exclude` as an array of strings
+     #   WordsCounted::Tokeniser.new("Goodbye Sami and hello Dani").tokenise(exclude: ["goodbye hello"])
+     #   # => ['sami', 'and', 'dani']
+     #
+     # @example With `exclude` as an array of regular expressions
+     #   WordsCounted::Tokeniser.new("Goodbye and hello Dani").tokenise(exclude: [/goodbye/i, /and/i])
+     #   # => ['hello', 'dani']
+     #
+     # @example With `exclude` as an array of lambdas
+     #   t = WordsCounted::Tokeniser.new("Special Agent 007")
+     #   t.tokenise(exclude: [->(t) { t.to_i.odd? }, ->(t) { t.length > 5 }])
+     #   # => ['agent']
+     #
+     # @example With `exclude` as a mixed array
+     #   t = WordsCounted::Tokeniser.new("Hello! اسماءنا هي محمد، كارولينا، سامي، وداني")
+     #   t.tokenise(exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6 }, "و"])
+     #   # => ["هي", "سامي", "وداني"]
+     #
+     # @param [Regexp] pattern The regular expression used to split the input into tokens.
+     # @param [Array<String, Regexp, Lambda, Symbol>, String, Regexp, Lambda, Symbol, nil] exclude The filter to apply.
+     # @return [Array] The array of filtered tokens.
+     def tokenise(pattern: TOKEN_REGEXP, exclude: nil)
+       filter_proc = filter_to_proc(exclude)
+       @input.scan(pattern).map(&:downcase).reject { |token| filter_proc.call(token) }
+     end
+
+     private
+
+     # This method converts any argument into a callable object. The return value of that
+     # callable is then used to determine whether a token should be excluded from the final list.
+     #
+     # `filter` can be a string, a regular expression, a lambda, a symbol, or an array
+     # of any combination of those types.
+     #
+     # If `filter` is a string, see {Tokeniser#filter_proc_from_string}.
+     # If `filter` is an array, see {Tokeniser#filter_procs_from_array}.
+     #
+     # If `filter` is a proc, then the proc is simply called. If `filter` is a regexp, a lambda
+     # is returned that checks the token for a match. If a symbol is passed, it is converted to
+     # a proc.
+     #
+     # This method depends on `nil` responding to `to_a` with an empty array, which
+     # avoids having to check whether `exclude` was passed at all.
+     #
+     # @api private
+     def filter_to_proc(filter)
+       if filter.respond_to?(:to_a)
+         filter_procs_from_array(filter)
+       elsif filter.respond_to?(:to_str)
+         filter_proc_from_string(filter)
+       elsif regexp_filter = Regexp.try_convert(filter)
+         ->(token) {
+           token =~ regexp_filter
+         }
+       elsif filter.respond_to?(:to_proc)
+         filter.to_proc
+       else
+         raise ArgumentError,
+           "`filter` must be a `String`, `Regexp`, `lambda`, `Symbol`, or an `Array` of any combination of those types"
+       end
+     end
+
+     # Converts an array of filters to an array of lambdas, and returns a lambda that calls
+     # each lambda in the resulting array. If any lambda returns true, the token is excluded
+     # from the final list.
+     #
+     # @api private
+     def filter_procs_from_array(filter)
+       filter_procs = Array(filter).map(&method(:filter_to_proc))
+       ->(token) {
+         filter_procs.any? { |pro| pro.call(token) }
+       }
+     end
+
+     # Converts a string filter to an array, and returns a lambda
+     # that returns true if the token is included in the array.
+     #
+     # @api private
+     def filter_proc_from_string(filter)
+       normalized_exclusion_list = filter.split.map(&:downcase)
+       ->(token) {
+         normalized_exclusion_list.include?(token)
+       }
+     end
+   end
+ end
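
Because `filter_to_proc` reduces every filter type to a lambda, the different `exclude` forms compose freely: a token is dropped as soon as any filter matches it. A quick sketch:

```ruby
tokeniser = WordsCounted::Tokeniser.new("Mary had a little lamb")

# A string, a regexp, and a lambda combined in one exclusion array.
tokeniser.tokenise(exclude: ["mary", /lamb/, ->(t) { t.length < 4 }])
# => ["little"]
```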
data/lib/words_counted/version.rb CHANGED
@@ -1,4 +1,4 @@
  # -*- encoding : utf-8 -*-
  module WordsCounted
-   VERSION = "0.1.5"
+   VERSION = "1.0.0"
  end
data/lib/words_counted.rb CHANGED
@@ -1,6 +1,9 @@
  # -*- encoding : utf-8 -*-
- require "words_counted/version"
+ require "refinements/hash_refinements"
+
+ require "words_counted/tokeniser"
  require "words_counted/counter"
+ require "words_counted/version"
 
  begin
    require "pry"
@@ -9,10 +12,14 @@ end
 
  module WordsCounted
    def self.count(string, options = {})
-     Counter.new(string, options)
+     tokens = Tokeniser.new(string).tokenise(options)
+     Counter.new(tokens)
    end
 
    def self.from_file(path, options = {})
-     Counter.from_file(path, options)
+     tokens = File.open(path) do |file|
+       Tokeniser.new(file.read).tokenise(options)
+     end
+     Counter.new(tokens)
    end
  end
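
With this change both module-level helpers are thin wrappers: they tokenise first, then hand the tokens to `Counter`:

```ruby
require "words_counted"

# `.count` tokenises the string, then wraps the tokens in a Counter.
counter = WordsCounted.count("Hello world hello", exclude: "world")
counter.tokens  # => ["hello", "hello"]

# `.from_file` does the same with a file's contents; the path below is the
# fixture shipped in this gem's specs.
counter = WordsCounted.from_file("spec/support/the_hart_and_the_hunter.txt")
counter.token_count  # => 139
```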
data/spec/words_counted/counter_spec.rb CHANGED
@@ -3,240 +3,85 @@ require_relative "../spec_helper"
 
  module WordsCounted
    describe Counter do
-     let(:counter) { Counter.new("We are all in the gutter, but some of us are looking at the stars.") }
-
-     describe "initialize" do
-       it "sets @options" do
-         expect(counter.instance_variables).to include(:@options)
-       end
-
-       it "sets @char_count" do
-         expect(counter.instance_variables).to include(:@char_count)
-       end
-
-       it "sets @words" do
-         expect(counter.instance_variables).to include(:@words)
-       end
-
-       it "sets @word_occurrences" do
-         expect(counter.instance_variables).to include(:@word_occurrences)
-       end
-
-       it "sets @word_lengths" do
-         expect(counter.instance_variables).to include(:@word_lengths)
-       end
+     let(:counter) do
+       tokens = WordsCounted::Tokeniser.new("one three three three woot woot").tokenise
+       Counter.new(tokens)
      end
 
-     describe "words" do
-       it "returns an array" do
-         expect(counter.words).to be_a(Array)
-       end
-
-       it "splits words" do
-         expect(counter.words).to eq(%w[we are all in the gutter but some of us are looking at the stars])
-       end
-
-       it "removes special characters" do
-         counter = Counter.new("Hello! # $ % 12345 * & % How do you do?")
-         expect(counter.words).to eq(%w[hello how do you do])
-       end
-
-       it "counts hyphenated words as one" do
-         counter = Counter.new("I am twenty-two.")
-         expect(counter.words).to eq(%w[i am twenty-two])
-       end
-
-       it "does not split words on apostrophe" do
-         counter = Counter.new("Bust 'em! Them be Jim's bastards'.")
-         expect(counter.words).to eq(%w[bust 'em them be jim's bastards'])
-       end
-
-       it "does not split on unicode chars" do
-         counter = Counter.new("São Paulo")
-         expect(counter.words).to eq(%w[são paulo])
-       end
-
-       it "it accepts a string filter" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: "magnificent")
-         expect(counter.words).to eq(%w[that was trevor])
-       end
-
-       it "it accepts a string filter with multiple words" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: "was magnificent")
-         expect(counter.words).to eq(%w[that trevor])
-       end
-
-       it "filters words in uppercase when using a string filter" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: "Magnificent")
-         expect(counter.words).to eq(%w[that was trevor])
-       end
-
-       it "accepts a regexp filter" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: /magnificent/i)
-         expect(counter.words).to eq(%w[that was trevor])
-       end
-
-       it "accepts an array filter" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: ['That', 'was'])
-         expect(counter.words).to eq(%w[magnificent trevor])
-       end
-
-       it "accepts a lambda filter" do
-         counter = Counter.new("That was magnificent, Trevor.", exclude: ->(w) { w == 'that' })
-         expect(counter.words).to eq(%w[was magnificent trevor])
-       end
-
-       it "accepts a custom regexp" do
-         counter = Counter.new("I am 007.", regexp: /[\p{Alnum}\-']+/)
-         expect(counter.words).to eq(["i", "am", "007"])
-       end
-
-       it "char_count should be calculated after the filter is applied" do
-         counter = Counter.new("I am Legend.", exclude: "I am")
-         expect(counter.char_count).to eq(6)
-       end
-     end
-
-     describe "word_count" do
-       it "returns the correct word count" do
-         expect(counter.word_count).to eq(15)
+     describe "initialize" do
+       it "sets @tokens" do
+         expect(counter.instance_variables).to include(:@tokens)
        end
      end
 
-     describe "word_occurrences" do
-       it "returns a hash" do
-         expect(counter.word_occurrences).to be_a(Hash)
-       end
-
-       it "treats capitalized words as the same word" do
-         counter = Counter.new("Bad, bad, piggy!")
-         expect(counter.word_occurrences).to eq({ "bad" => 2, "piggy" => 1 })
+     describe "#token_count" do
+       it "returns the correct number of tokens" do
+         expect(counter.token_count).to eq(6)
        end
      end
 
-     describe "sorted_word_occurrences" do
-       it "returns an array" do
-         expect(counter.sorted_word_occurrences).to be_a(Array)
-       end
-
-       it "returns a two dimensional array sorted by descending word occurrence" do
-         counter = Counter.new("Blue, green, green, green, orange, green, orange, red, orange, red")
-         expect(counter.sorted_word_occurrences).to eq([ ["green", 4], ["orange", 3], ["red", 2], ["blue", 1] ])
+     describe "#uniq_token_count" do
+       it "returns the number of unique tokens" do
+         expect(counter.uniq_token_count).to eq(3)
        end
      end
 
-     describe "most_occurring_words" do
-       it "returns an array" do
-         expect(counter.most_occurring_words).to be_a(Array)
-       end
-
-       it "returns highest occuring words" do
-         counter = Counter.new("Orange orange Apple apple banana")
-         expect(counter.most_occurring_words).to eq([["orange", 2],["apple", 2]])
+     describe "#char_count" do
+       it "returns the correct number of chars" do
+         expect(counter.char_count).to eq(26)
        end
      end
 
-     describe 'word_lengths' do
-       it "returns a hash" do
-         expect(counter.word_lengths).to be_a(Hash)
-       end
-
-       it "returns a hash of word lengths" do
-         counter = Counter.new("One two three.")
-         expect(counter.word_lengths).to eq({ "one" => 3, "two" => 3, "three" => 5 })
+     describe "#token_frequency" do
+       it "returns a two-dimensional array where each member array is a token and its frequency in descending order" do
+         expected = [
+           ['three', 3], ['woot', 2], ['one', 1]
+         ]
+         expect(counter.token_frequency).to eq(expected)
        end
      end
 
-     describe "sorted_word_lengths" do
-       it "returns an array" do
-         expect(counter.sorted_word_lengths).to be_a(Array)
-       end
-
-       it "returns a two dimensional array sorted by descending word length" do
-         counter = Counter.new("I am not certain of that")
-         expect(counter.sorted_word_lengths).to eq([ ["certain", 7], ["that", 4], ["not", 3], ["of", 2], ["am", 2], ["i", 1] ])
+     describe "#token_lengths" do
+       it "returns a two-dimensional array where each member array is a token and its length in descending order" do
+         expected = [
+           ['three', 5], ['woot', 4], ['one', 3]
+         ]
+         expect(counter.token_lengths).to eq(expected)
        end
      end
 
-     describe "longest_words" do
-       it "returns an array" do
-         expect(counter.longest_words).to be_a(Array)
-       end
-
-       it "returns the longest words" do
-         counter = Counter.new("Those whom the gods love grow young.")
-         expect(counter.longest_words).to eq([["those", 5],["young", 5]])
-       end
-     end
-
-     describe "word_density" do
-       it "returns an array" do
-         expect(counter.word_density).to be_a(Array)
-       end
-
-       it "returns words and their density in percent" do
-         counter = Counter.new("His name was Major, major Major Major.")
-         expect(counter.word_density).to eq([["major", 57.14], ["was", 14.29], ["name", 14.29], ["his", 14.29]])
+     describe "#token_density" do
+       it "returns a two-dimensional array where each member array is a token and its density in descending order" do
+         expected = [
+           ['three', 0.5], ['woot', 0.33], ['one', 0.17]
+         ]
+         expect(counter.token_density).to eq(expected)
        end
 
        it "accepts a precision" do
-         counter = Counter.new("His name was Major, major Major Major.")
-         expect(counter.word_density(4)).to eq([["major", 57.1429], ["was", 14.2857], ["name", 14.2857], ["his", 14.2857]])
+         expected = [
+           ['three', 0.5], ['woot', 0.3333], ['one', 0.1667]
+         ]
+         expect(counter.token_density(precision: 4)).to eq(expected)
        end
      end
 
-     describe "char_count" do
-       it "returns the number of chars in the passed in string" do
-         counter = Counter.new("His name was Major, major Major Major.")
-         expect(counter.char_count).to eq(30)
-       end
-
-       it "returns the number of chars in the passed in string after the filter is applied" do
-         counter = Counter.new("His name was Major, major Major Major.", exclude: "Major")
-         expect(counter.char_count).to eq(10)
-       end
-     end
-
-     describe "average_chars_per_word" do
-       it "returns the average number of chars per word" do
-         counter = Counter.new("His name was major, Major Major Major.")
-         expect(counter.average_chars_per_word).to eq(4.29)
-       end
-
-       it "returns the average number of chars per word after the filter is applied" do
-         counter = Counter.new("His name was Major, Major Major Major.", exclude: "Major")
-         expect(counter.average_chars_per_word).to eq(3.33)
-       end
-
-       it "accepts precision" do
-         counter = Counter.new("This line should have 39 characters minus spaces.")
-         expect(counter.average_chars_per_word(4)).to eq(5.5714)
+     describe "#most_frequent_tokens" do
+       it "returns a hash of the tokens with the highest frequency, where each key is a token and each value is its frequency" do
+         expected = {
+           'three' => 3
+         }
+         expect(counter.most_frequent_tokens).to eq(expected)
        end
      end
 
-     describe "unique_word_count" do
-       it "returns the number of unique words" do
-         expect(counter.unique_word_count).to eq(13)
-       end
-
-       it "is case insensitive" do
-         counter = Counter.new("Up down. Down up.")
-         expect(counter.unique_word_count).to eq(2)
+     describe "#longest_tokens" do
+       it "returns a hash of the tokens with the highest length, where each key is a token and each value is its length" do
+         expected = {
+           'three' => 5
+         }
+         expect(counter.longest_tokens).to eq(expected)
        end
      end
    end
-
-   describe "count" do
-     it "returns count for a single word" do
-       counter = Counter.new("I am so clever that sometimes I don't understand a single word of what I am saying.")
-       expect(counter.count("i")).to eq(3)
-     end
-   end
-
-   describe "from_file" do
-     it "opens and reads a text file" do
-       counter = WordsCounted.from_file('spec/support/the_hart_and_the_hunter.txt')
-       expect(counter.word_count).to eq(139)
-     end
-   end
  end
data/spec/words_counted/deprecated_spec.rb ADDED
@@ -0,0 +1,99 @@
+ # -*- coding: utf-8 -*-
+ require_relative "../spec_helper"
+
+ module WordsCounted
+   warn "Methods being tested are deprecated"
+
+   describe Counter do
+     let(:counter) do
+       tokens = WordsCounted::Tokeniser.new("one three three three woot woot").tokenise
+       Counter.new(tokens)
+     end
+
+     describe "#word_density" do
+       it "returns words and their density in percent" do
+         expected = [
+           ['three', 50.0], ['woot', 33.33], ['one', 16.67]
+         ]
+         expect(counter.word_density).to eq(expected)
+       end
+
+       it "accepts a precision" do
+         expected = [
+           ['three', 50.0], ['woot', 33.3333], ['one', 16.6667]
+         ]
+         expect(counter.word_density(4)).to eq(expected)
+       end
+     end
+
+     describe "#word_occurrences" do
+       it "returns a hash of words and their frequencies" do
+         expected = {
+           'three' => 3, 'woot' => 2, 'one' => 1
+         }
+         expect(counter.word_occurrences).to eq(expected)
+       end
+     end
+
+     describe "#sorted_word_occurrences" do
+       it "returns a two dimensional array sorted by descending word occurrence" do
+         expected = [
+           ['three', 3], ['woot', 2], ['one', 1]
+         ]
+         expect(counter.sorted_word_occurrences).to eq(expected)
+       end
+     end
+
+     describe "#word_lengths" do
+       it "returns a hash of words and their lengths" do
+         expected = {
+           'three' => 5, 'woot' => 4, 'one' => 3
+         }
+         expect(counter.word_lengths).to eq(expected)
+       end
+     end
+
+     describe "#sorted_word_lengths" do
+       it "returns a two dimensional array sorted by descending word length" do
+         expected = [
+           ['three', 5], ['woot', 4], ['one', 3]
+         ]
+         expect(counter.sorted_word_lengths).to eq(expected)
+       end
+     end
+
+     describe "#longest_words" do
+       it "returns a two-dimensional array of the longest words and their lengths" do
+         expected = [
+           ['three', 5]
+         ]
+         expect(counter.longest_words).to eq(expected)
+       end
+     end
+
+     describe "#most_occurring_words" do
+       it "returns a two-dimensional array of words with the highest frequency and their frequencies" do
+         expected = [
+           ['three', 3]
+         ]
+         expect(counter.most_occurring_words).to eq(expected)
+       end
+     end
+
+     describe "#average_chars_per_word" do
+       it "returns the average number of chars per word" do
+         expect(counter.average_chars_per_word).to eq(4.33)
+       end
+
+       it "accepts precision" do
+         expect(counter.average_chars_per_word(4)).to eq(4.3333)
+       end
+     end
+
+     describe "#count" do
+       it "returns count for a single word" do
+         expect(counter.count('one')).to eq(1)
+       end
+     end
+   end
+ end
data/spec/words_counted/tokeniser_spec.rb ADDED
@@ -0,0 +1,133 @@
+ # -*- coding: utf-8 -*-
+ require_relative "../spec_helper"
+
+ module WordsCounted
+   describe Tokeniser do
+     describe "initialize" do
+       it "sets @input" do
+         tokeniser = Tokeniser.new("Hello World!")
+         expect(tokeniser.instance_variables).to include(:@input)
+       end
+     end
+
+     describe "#tokenise" do
+       it "normalises tokens and returns an array" do
+         tokens = Tokeniser.new("Hello HELLO").tokenise
+         expect(tokens).to eq(%w[hello hello])
+       end
+
+       context "without arguments" do
+         it "removes non-alphanumeric chars" do
+           tokens = Tokeniser.new("Hello world! # $ % 12345 * & % ?").tokenise
+           expect(tokens).to eq(%w[hello world])
+         end
+
+         it "does not split on hyphens" do
+           tokens = Tokeniser.new("I am twenty-two.").tokenise
+           expect(tokens).to eq(%w[i am twenty-two])
+         end
+
+         it "does not split on apostrophe" do
+           tokens = Tokeniser.new("Bust 'em! It's Jim's gang.").tokenise
+           expect(tokens).to eq(%w[bust 'em it's jim's gang])
+         end
+
+         it "does not split on unicode chars" do
+           tokens = Tokeniser.new("Bayrūt").tokenise
+           expect(tokens).to eq(%w[bayrūt])
+         end
+       end
+
+       context "with `pattern` option" do
+         it "accepts a custom pattern" do
+           tokens = Tokeniser.new("We-Are-ALL").tokenise(pattern: /[^-]+/)
+           expect(tokens).to eq(%w[we are all])
+         end
+       end
+
+       context "with `exclude` option" do
+         context "as a string" do
+           let(:tokeniser) { Tokeniser.new("That was magnificent, Trevor.") }
+
+           it "accepts a string filter" do
+             tokens = tokeniser.tokenise(exclude: "magnificent")
+             expect(tokens).to eq(%w[that was trevor])
+           end
+
+           it "accepts a string filter with multiple space-delimited tokens" do
+             tokens = tokeniser.tokenise(exclude: "was magnificent")
+             expect(tokens).to eq(%w[that trevor])
+           end
+
+           it "normalises the string filter" do
+             tokens = tokeniser.tokenise(exclude: "MAGNIFICENT")
+             expect(tokens).to eq(%w[that was trevor])
+           end
+         end
+
+         context "as a regular expression" do
+           it "filters on match" do
+             tokeniser = Tokeniser.new("That was magnificent, Trevor.")
+             tokens = tokeniser.tokenise(exclude: /magnificent/i)
+             expect(tokens).to eq(%w[that was trevor])
+           end
+         end
+
+         context "as a lambda" do
+           it "calls the lambda" do
+             tokeniser = Tokeniser.new("That was magnificent, Trevor.")
+             tokens = tokeniser.tokenise(exclude: ->(token) { token.length < 5 })
+             expect(tokens).to eq(%w[magnificent trevor])
+           end
+
+           it "accepts a symbol for shorthand notation" do
+             tokeniser = Tokeniser.new("That was magnificent, محمد.")
+             tokens = tokeniser.tokenise(exclude: :ascii_only?)
+             expect(tokens).to eq(%w[محمد])
+           end
+         end
+
+         context "as an array" do
+           let(:tokeniser) { Tokeniser.new("That was magnificent, Trevor.") }
+
+           it "accepts an array of strings" do
+             tokens = tokeniser.tokenise(exclude: ["magnificent"])
+             expect(tokens).to eq(%w[that was trevor])
+           end
+
+           it "accepts an array of regular expressions" do
+             tokens = tokeniser.tokenise(exclude: [/that/, /was/])
+             expect(tokens).to eq(%w[magnificent trevor])
+           end
+
+           it "accepts an array of lambdas" do
+             filters = [
+               ->(token) { token.length < 4 },
+               ->(token) { token.length > 6 }
+             ]
+             tokens = tokeniser.tokenise(exclude: filters)
+             expect(tokens).to eq(%w[that trevor])
+           end
+
+           it "accepts a mixed array" do
+             filters = [
+               "that",
+               ->(token) { token.length < 4 },
+               /magnificent/
+             ]
+             tokens = tokeniser.tokenise(exclude: filters)
+             expect(tokens).to eq(["trevor"])
+           end
+         end
+
+         context "with an invalid filter" do
+           it "raises an `ArgumentError`" do
+             expect {
+               Tokeniser.new("Hello world!").tokenise(exclude: 1)
+             }.to raise_error(ArgumentError)
+           end
+         end
+       end
+     end
+   end
+ end
data/spec/words_counted_spec.rb ADDED
@@ -0,0 +1,34 @@
+ # -*- coding: utf-8 -*-
+ require_relative "spec_helper"
+
+ describe WordsCounted do
+   describe ".from_file" do
+     let(:file_path) { "spec/support/the_hart_and_the_hunter.txt" }
+
+     it "opens and reads a text file" do
+       counter = WordsCounted.from_file(file_path)
+       expect(counter.token_count).to eq(139)
+     end
+
+     it "opens and reads a text file with options" do
+       counter = WordsCounted.from_file(file_path, exclude: "hunter")
+       expect(counter.token_count).to eq(135)
+     end
+   end
+
+   describe ".count" do
+     let(:string) do
+       "We are all in the gutter, but some of us are looking at the stars."
+     end
+
+     it "returns a counter instance with given input as tokens" do
+       counter = WordsCounted.count(string)
+       expect(counter.token_count).to eq(15)
+     end
+
+     it "returns a counter instance with given input and options" do
+       counter = WordsCounted.count(string, exclude: "the gutter")
+       expect(counter.token_count).to eq(12)
+     end
+   end
+ end
data/words_counted.gemspec CHANGED
@@ -9,7 +9,7 @@ Gem::Specification.new do |spec|
    spec.version = WordsCounted::VERSION
    spec.authors = ["Mohamad El-Husseini"]
    spec.email = ["husseini.mel@gmail.com"]
-   spec.description = %q{A Ruby word counter and string analyser with helpful utility methods.}
+   spec.description = %q{A Ruby natural language processor to extract stats from text, such as word count and more.}
    spec.summary = %q{See README.}
    spec.homepage = "https://github.com/abitdodgy/words_counted"
    spec.license = "MIT"
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: words_counted
  version: !ruby/object:Gem::Version
-   version: 0.1.5
+   version: 1.0.0
  platform: ruby
  authors:
  - Mohamad El-Husseini
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-12-02 00:00:00.000000000 Z
+ date: 2015-10-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -66,7 +66,8 @@ dependencies:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- description: A Ruby word counter and string analyser with helpful utility methods.
+ description: A Ruby natural language processor to extract stats from text, such as
+   word count and more.
  email:
  - husseini.mel@gmail.com
  executables: []
@@ -74,19 +75,28 @@ extensions: []
  extra_rdoc_files: []
  files:
  - ".gitignore"
+ - ".hound.yml"
  - ".rspec"
+ - ".ruby-style.yml"
+ - ".travis.yml"
  - ".yardopts"
  - CHANGELOG.md
  - Gemfile
  - LICENSE.txt
  - README.md
  - Rakefile
+ - lib/refinements/hash_refinements.rb
  - lib/words_counted.rb
  - lib/words_counted/counter.rb
+ - lib/words_counted/deprecated.rb
+ - lib/words_counted/tokeniser.rb
  - lib/words_counted/version.rb
  - spec/spec_helper.rb
  - spec/support/the_hart_and_the_hunter.txt
  - spec/words_counted/counter_spec.rb
+ - spec/words_counted/deprecated_spec.rb
+ - spec/words_counted/tokeniser_spec.rb
+ - spec/words_counted_spec.rb
  - words_counted.gemspec
  homepage: https://github.com/abitdodgy/words_counted
  licenses:
@@ -108,7 +118,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 2.2.2
+ rubygems_version: 2.4.5
  signing_key:
  specification_version: 4
  summary: See README.
@@ -116,3 +126,6 @@ test_files:
  - spec/spec_helper.rb
  - spec/support/the_hart_and_the_hunter.txt
  - spec/words_counted/counter_spec.rb
+ - spec/words_counted/deprecated_spec.rb
+ - spec/words_counted/tokeniser_spec.rb
+ - spec/words_counted_spec.rb