RubyGems - frequent-algorithm - Versions diffs - 0.0.1 → 0.0.2 - Mend

frequent-algorithm 0.0.1 → 0.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 89e3696841e7c22693c58221b03a793c1f5d94d5
-  data.tar.gz: bed42bcc5345a79a588a3c1ab594520ca67c3d40
+  metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
+  data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
 SHA512:
-  metadata.gz: 45d1ee2b0a585529e06935735b6a53be8a88a45548c659589bc148f1b9024bac19fba9f8f3e9dbebf8d733dd9fa113e6b3451cad6b3ee1a61d2e1988e2b657c4
-  data.tar.gz: 6100d5530291ec757aed6329f3f680fced9a0a7ce8b8ca974bcac72102aa3882f7ef72b47a86c318dcdf7d5ae12671a1e9c87af124f6914b75a602860056cd5c
+  metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
+  data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
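
The SHA1 and SHA512 entries above are hex digests of the two archive members inside the `.gem` file. As a minimal sketch of how such entries could be recomputed and compared, the snippet below uses Ruby's standard `digest` library; the helper names `digests_for` and `checksums_match?` are hypothetical, and a throwaway temp file stands in for the real `data.tar.gz`, which is not available here.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compute the two digests that checksums.yaml records
# for one archive member (metadata.gz or data.tar.gz).
def digests_for(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: true when every recorded digest matches the file.
def checksums_match?(path, expected)
  digests_for(path) == expected
end

# Demo on a throwaway file, since the real archive members are not at hand.
sample = Tempfile.new('data.tar.gz')
sample.write('example bytes')
sample.close

# Stands in for the entry that would be parsed out of checksums.yaml.
recorded = digests_for(sample.path)
puts checksums_match?(sample.path, recorded)
```

Against a real gem, `expected` would be built from the values listed in `checksums.yaml` for the matching member.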

data/CHANGELOG CHANGED

@@ -0,0 +1,9 @@
+## CHANGELOG
+- __2015/03/19__: 0.0.2 release.
+    - First-stage implementation.
+    - API documentation added.
+    - Fleshing out unit tests.
+- __2015/03/11__: 0.0.1 release.
+    - Initial release.

data/README.md CHANGED

@@ -1,4 +1,10 @@
-# frequent-algorithm
+# frequent-algorithm [![Gem Version](https://badge.fury.io/rb/frequent-algorithm.svg)](http://badge.fury.io/rb/frequent-algorithm) [![Build Status](https://travis-ci.org/buruzaemon/frequent-algorithm.svg)](https://travis-ci.org/buruzaemon/frequent-algorithm)
+Web site usage, social network behavior and Internet traffic are examples
+of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
+where most of the events are due to the actions of a very small few.
+Knowing at any given point in time which items are trending is valuable
+in understanding the system.
 `frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
 for identifying frequent items in a data stream in sliding windows.
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
 Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
 Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
-## Getting Started
-Bacon ipsum dolor amet short loin flank swine ham hock tail. T-bone biltong
-beef shoulder salami, leberkas pork chop ribeye pork belly ground round. Filet
-mignon pork chop spare ribs brisket pastrami picanha bacon, biltong beef ribs
-corned beef ham hock tail. Meatloaf kielbasa turducken, salami chuck beef ribs
-venison hamburger t-bone landjaeger pork chop drumstick sausage bacon.
+## Introduction
+### Challenges
+Challenges for real-time processing of data streams for _frequent item queries_
+include:
+* data may be of unknown and possibly unbounded length
+* data may be arriving at a very fast rate
+* it might not be possible to go back and re-read the data
+* too large a window of observation may include stale data
+Therefore, a solution should have the following characteristics:
+* uses limited memory
+* can process events in the stream in O(1) constant time
+* requires only a single-pass over the data
+### The algorithm
+> LOOP<br/>
+> 1. For each element e in the next b elements:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;If a local counter exists for the type of element e:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Increment the local counter.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new local counter for this element type<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to 1.<br/>
+> 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
+> 3. Delete all local counters<br/>
+> 4. For each type named in S:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;If a global counter exists for this type:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add to it the count recorded in S.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new global counter for this element type<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to the count recorded in S.<br/>
+> 5. Add the count of the kth largest type in S to δ.<br/>
+> 6. If sizeOf(Q) > N/b:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(b) For all element types named in S':<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Subtract from their global counters the counts<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;recorded in S'<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If a counter is decremented to zero:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Delete it.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(c) Output the identity and value of each global counter > δ.
+>
+> &mdash; <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
 ## Usage
     require 'frequent-algorithm'
+    # data is pi to 1000 digits
+    pi = File.read('test/frequent/test_data_pi').strip
+    N = 100  # size of main window
+    b =  20  # size of basic window
+    k =   3  # we are interested in top-3 numerals in pi
+    # split the digits into basic windows of b items each
+    data = pi.scan(/./).each_slice(b)
+    alg = Frequent::Algorithm.new(N, b, k)
+    # read in and process the 1st basic window
+    alg.process(data.next)
+    # and the top-3 numerals are?
+    # (statistics is a plain Hash of item => count in 0.0.2)
+    top3 = alg.statistics
+    puts top3
+    # lather, rinse and repeat
+    alg.process(data.next)
 ## Development
 The development of this gem requires the following:
-* [Ruby 2.0 or greater](https://www.ruby-lang.org/en/)
+* [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
 * [rubygems](https://rubygems.org/pages/download)
 * [`bundler`](https://github.com/bundler/bundler)
 * [`rake`](https://github.com/ruby/rake)
+* [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
 * [`yard`](https://rubygems.org/gems/yard) (documentation)
 * [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
-### Documentation
-`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
-[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
-Check out [Getting Started with
-Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
-### Build
-Development, testing and release of this rubygem uses the following
+Building, testing and releasing this rubygem use the following
 `rake` commands:
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
     rake clean    # Remove any temporary products
     rake clobber  # Remove any generated file
     rake install  # Build and install frequent-algorithm-n.n.n.gem into system gems
-    rake release  # Create tag vn.n.n and build and push frequent-algorithm-n.n.n.gem to Rubygems
+    rake release  # Create tag vn.n.n and build and push
+                  # frequent-algorithm-n.n.n.gem to Rubygems
     rake test     # Execute unit tests
+### Documentation
+`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
+[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
+Check out [Getting Started with
+Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
 ### Unit Testing
 `frequent-algorithm` uses
 [`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
 unit testing.
-### Release
+### Releasing
 Please refer to Publishing To Rubygems.org in the
 [Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
+### Contributing
+1. Fork it
+2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
+3. Create your feature branch (`git branch my-new-feature && git checkout
+   my-new-feature`)
+4. Commit your changes (`git commit -am 'Add some feature'`)
+5. Push to the branch (`git push origin my-new-feature:dev-branch`)
+6. Create new Pull Request
+You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
 ## License
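
The LOOP quoted in the README's algorithm section can be sketched as a small standalone class. This is an illustration under stated assumptions, not the gem's implementation: the class name `SlidingTopKSketch` is hypothetical, ties between equal counts are broken by item name (the quoted pseudocode does not specify a tie-break), and empty windows are simply skipped.

```ruby
# Hypothetical, self-contained sketch of the quoted LOOP.
class SlidingTopKSketch
  attr_reader :delta, :counters

  def initialize(n, b, k)
    @max_summaries = n / b   # basic windows that fit in the main window
    @k = k
    @queue = []              # Q: one summary per basic window
    @counters = Hash.new(0)  # global counters
    @delta = 0               # running threshold (δ in the pseudocode)
  end

  # Processes one basic window. Returns the items whose global count
  # exceeds delta once the main window is full, otherwise {}.
  def process(window)
    return {} if window.empty?   # assumption: nothing to summarize

    # Step 1: local counters for this basic window
    local = window.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }

    # Steps 2-3: keep only the k most frequent items as summary S;
    # the local counters are discarded when this method returns
    summary = local.sort_by { |item, count| [-count, item.to_s] }.first(@k)
    @queue << summary

    # Step 4: fold S into the global counters
    summary.each { |item, count| @counters[item] += count }

    # Step 5: add the count of the k-th largest item in S to delta
    @delta += summary.last[1]

    # Step 6: expire the oldest summary once Q outgrows N/b
    return {} unless @queue.length > @max_summaries
    oldest = @queue.shift
    @delta -= oldest.last[1]
    oldest.each { |item, count| @counters[item] -= count }
    @counters.delete_if { |_, count| count <= 0 }
    @counters.select { |_, count| count > @delta }
  end
end

sketch = SlidingTopKSketch.new(6, 3, 2)
sketch.process(%w[a a b])
sketch.process(%w[a a c])
p sketch.process(%w[a b a])
```

With N = 6 and b = 3 the queue holds two summaries, so the third call is the first one that expires a window and reports anything.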

data/lib/frequent-algorithm.rb CHANGED

@@ -1,3 +1,4 @@
+# coding: utf-8
 require 'frequent/algorithm'
 =begin

data/lib/frequent/algorithm.rb CHANGED

@@ -1,23 +1,141 @@
+# coding: utf-8
 require 'frequent/version'
 module Frequent
+  # `Frequent::Algorithm` is the Ruby implementation of the
+  # Demaine et al. FREQUENT algorithm for calculating
+  # top-k items in a stream.
+  #
+  # The aims of this algorithm are:
+  # * use limited memory
+  # * require constant processing time per item
+  # * require a single-pass only
+  #
   class Algorithm
-    attr_reader :n, :b
+    # @return [Integer] the number of items in the main window
+    attr_reader :n
+    # @return [Integer] the number of items in a basic window
+    attr_reader :b
+    # @return [Integer] the number of top item categories to track
+    attr_reader :k
+    # @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
+    attr_reader :queue
+    # @return [Hash<Object,Integer>] global mapping of items and counts
+    attr_reader :statistics
+    # @return [Integer] minimum threshold for membership in top-k items
+    attr_reader :delta
-    def initialize(n, b)
+    # Initializes this top-k frequency-calculating instance.
+    #
+    # @param [Integer] n number of items in the main window
+    # @param [Integer] b number of items in a basic window
+    # @param [Integer] k number of top item categories to track
+    # @raise [ArgumentError] if n is not greater than 0
+    # @raise [ArgumentError] if b is not greater than 0
+    # @raise [ArgumentError] if k is not greater than 0
+    # @raise [ArgumentError] if n/b is less than 1
+    def initialize(n, b, k=1)
+      if n <= 0
+        raise ArgumentError.new('n must be greater than 0')
+      end
+      if b <= 0
+        raise ArgumentError.new('b must be greater than 0')
+      end
+      if k <= 0
+        raise ArgumentError.new('k must be greater than 0')
+      end
+      if n/b < 1
+        raise ArgumentError.new('n/b must be at least 1')
+      end
       @n = n
       @b = b
+      @k = k
+      @queue = []
+      @statistics = {}
+      @delta = 0
     end
-    def process(item)
-      raise NotImplementedError.new
+    # Processes a single basic window of b items, by first adding
+    # a summary of this basic window in the internal global queue;
+    # and then updating the global statistics accordingly.
+    #
+    # @param [Array] elements an array of objects representing a basic window
+    def process(elements)
+      # Do we need this?
+      return if elements.length != @b
+      # Step 1
+      summary = {}
+      elements.each do |e|
+        if summary.key? e
+          summary[e] += 1
+        else
+          summary[e] = 1
+        end
+      end
+      # index of the k-th item
+      kth_index = find_kth_largest(summary)
+      # Step 2 & 3
+      # summary is [[item,count],[item,count],[item,count]....]
+      # sorted by descending order of the item count
+      summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
+      @queue << summary
+      # Step 4
+      summary.each do |t|
+        if @statistics.key? t[0]
+          @statistics[t[0]] += t[1]
+        else
+          @statistics[t[0]] = t[1]
+        end
+      end
+      # Step 5
+      @delta += summary[kth_index][1]
+      # Step 6
+      if should_pop_oldest_summary
+        # a
+        summary_p = @queue.shift
+        @delta -= summary_p[find_kth_largest(summary_p)][1]
+        # b
+        summary_p.each { |t| @statistics[t[0]] -= t[1] }
+        @statistics.delete_if { |k,v| v <= 0 }
+        # c
+        @statistics.select { |k,v| v > @delta }
+      else
+        {}
+      end
     end
+    # Returns the version for this gem.
+    #
+    # @return [String] the version for this gem.
     def version
       Frequent::VERSION
     end
+    private
+      # Return true when it is ready to pop oldest summary from queue
+      #
+      # @return [Boolean] whether it is ready to pop oldest summary from queue
+      def should_pop_oldest_summary
+        @queue.length > @n/@b
+      end
+      # Return the k-th index of a summary object
+      #
+      # @param [Object] summary a summary object
+      # @return [Integer] the k-th index
+      def find_kth_largest(summary)
+        [summary.length, @k].min - 1
+      end
   end
 end
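
Two details of the hunk above are easy to misread: `n/b` is integer division, and `find_kth_largest` returns an index that is clamped when a window holds fewer than `k` distinct items. The sketch below reproduces that arithmetic outside the class; `kth_index` and `top_k_summary` are hypothetical names for illustration, and the sample counts are chosen without ties because the block-based `sort` gives no ordering guarantee for equal counts.

```ruby
# Hypothetical stand-in for the private find_kth_largest helper.
def kth_index(summary, k)
  [summary.length, k].min - 1
end

# Hypothetical stand-in for the sort-and-slice done in steps 2 and 3.
def top_k_summary(counts, k)
  counts.sort { |a, b| b[1] <=> a[1] }[0..kth_index(counts, k)]
end

counts = { '3' => 5, '1' => 9, '4' => 2, '5' => 7 }

p top_k_summary(counts, 3)      # the three highest counts, descending
p kth_index({ 'x' => 1 }, 3)    # clamped: only one distinct item
p 100 / 20                      # summaries kept before the oldest is popped
p 10 / 20                       # integer division; this is the n/b < 1 case
```

Because the division is truncating, the `n/b < 1` check rejects exactly the cases where `n < b`.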

data/lib/frequent/version.rb CHANGED

@@ -1,5 +1,14 @@
+# coding: utf-8
+# `Frequent` is the namespace for objects implementing
+# the Demaine et al. FREQUENT algorithm for finding
+# the most frequently-appearing items (top-k) in a
+# data stream in sliding windows.
+#
+# `Frequent::Algorithm` is the implementation class.
 module Frequent
-  VERSION = '0.0.1'
+  # Version string for this Rubygem.
+  VERSION = '0.0.2'
 end
 =begin

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: frequent-algorithm
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Willie Tong
@@ -9,22 +9,52 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-03-11 00:00:00.000000000 Z
-dependencies: []
+date: 2015-03-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |
-  frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows. Please refer to [Identifying Frequent Items in Sliding Windows over On-Line Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by Golab, DeHaan, Demaine, L&#243;pez-Ortiz and Munro (2003).
-email: buruzaemon@gmail.com
+  frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
+email:
+- tongsinyin@gmail.com
+- buruzaemon@gmail.com
 executables: []
 extensions: []
 extra_rdoc_files: []
 files:
-- .yardopts
-- CHANGELOG
-- LICENSE
-- README.md
 - lib/frequent-algorithm.rb
 - lib/frequent/algorithm.rb
 - lib/frequent/version.rb
+- README.md
+- LICENSE
+- CHANGELOG
+- .yardopts
 homepage: https://github.com/buruzaemon/frequent-algorithm
 licenses:
 - MIT
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.1
+rubygems_version: 2.0.14
 signing_key:
 specification_version: 4
-summary: A Ruby implementation of the FREQUENT algorithm for identifying frequent
-  items in a data stream in sliding windows.
+summary: Identifies frequent items in a data stream in sliding windows using the Demaine
+  et al. FREQUENT algorithm.
 test_files: []
-has_rdoc:
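
The two new `Gem::Dependency` entries of type `:development` with a `>= 0` requirement are what RubyGems serializes when a gemspec declares development dependencies without a version constraint. The gemspec itself is not part of this diff, so the following is only a sketch of a declaration that would yield such metadata; field values are copied from the metadata above where visible, and the author list is limited to the one name shown in the hunk.

```ruby
require 'rubygems'

# Sketch of a gemspec fragment; the real frequent-algorithm.gemspec is not
# shown in this diff, so treat the layout as an assumption.
spec = Gem::Specification.new do |s|
  s.name     = 'frequent-algorithm'
  s.version  = '0.0.2'
  s.authors  = ['Willie Tong']   # only the author visible in the diff
  s.email    = ['tongsinyin@gmail.com', 'buruzaemon@gmail.com']
  s.summary  = 'Identifies frequent items in a data stream in sliding ' \
               'windows using the Demaine et al FREQUENT algorithm.'
  s.homepage = 'https://github.com/buruzaemon/frequent-algorithm'
  s.licenses = ['MIT']

  # No version constraint given, so each requirement defaults to ">= 0".
  s.add_development_dependency 'rake'
  s.add_development_dependency 'minitest'
end

spec.development_dependencies.each do |dep|
  puts "#{dep.name} (#{dep.requirement}) #{dep.type}"
end
```

Development dependencies are not installed for ordinary users of the gem, which is consistent with the runtime dependency list staying empty.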