RubyGems - frequent-algorithm - Versions diffs - 0.0.1 → 0.0.2 - Mend

frequent-algorithm 0.0.1 → 0.0.2

Files changed (7)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 89e3696841e7c22693c58221b03a793c1f5d94d5
-  data.tar.gz: bed42bcc5345a79a588a3c1ab594520ca67c3d40
+  metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
+  data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
 SHA512:
-  metadata.gz: 45d1ee2b0a585529e06935735b6a53be8a88a45548c659589bc148f1b9024bac19fba9f8f3e9dbebf8d733dd9fa113e6b3451cad6b3ee1a61d2e1988e2b657c4
-  data.tar.gz: 6100d5530291ec757aed6329f3f680fced9a0a7ce8b8ca974bcac72102aa3882f7ef72b47a86c318dcdf7d5ae12671a1e9c87af124f6914b75a602860056cd5c
+  metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
+  data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
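
The two digest families above cover the `metadata.gz` and `data.tar.gz` members packed inside the `.gem` archive. As a minimal sketch of how such values can be recomputed for comparison, the helper below hashes a byte string with Ruby's standard `digest` library. The name `gem_checksums` is hypothetical and not part of RubyGems; reading the archive members is left out.

```ruby
require 'digest/sha1'
require 'digest/sha2'

# Hypothetical helper: returns hex digests keyed the way checksums.yaml
# groups them (SHA1 and SHA512) for one archive member's raw bytes.
def gem_checksums(bytes)
  {
    'SHA1'   => Digest::SHA1.hexdigest(bytes),
    'SHA512' => Digest::SHA512.hexdigest(bytes)
  }
end

sums = gem_checksums('')
puts sums['SHA1']          # 40 hex characters
puts sums['SHA512'].length # 128 hex characters
```

Comparing the recomputed strings with the `+` lines above would confirm that a downloaded 0.0.2 archive matches what was published.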

data/CHANGELOG CHANGED

@@ -0,0 +1,9 @@
+## CHANGELOG
+- __2015/03/19__: 0.0.2 release.
+    - First-stage implementation.
+    - API documentation added.
+    - Fleshing out unit tests.
+- __2015/03/11__: 0.0.1 release.
+    - Initial release.

data/README.md CHANGED

@@ -1,4 +1,10 @@
-# frequent-algorithm
+# frequent-algorithm [![Gem Version](https://badge.fury.io/rb/frequent-algorithm.svg)](http://badge.fury.io/rb/frequent-algorithm) [![Build Status](https://travis-ci.org/buruzaemon/frequent-algorithm.svg)](https://travis-ci.org/buruzaemon/frequent-algorithm)
+Web site usage, social network behavior and Internet traffic are examples
+of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
+where most of the events are due to the actions of a very small few.
+Knowing at any given point in time which items are trending is valuable
+in understanding the system.
 `frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
 for identifying frequent items in a data stream in sliding windows.
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
 Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
 Golab, DeHaan, Demaine, L&#243;pez-Ortiz and Munro (2003).
-## Getting Started
-Bacon ipsum dolor amet short loin flank swine ham hock tail. T-bone biltong
-beef shoulder salami, leberkas pork chop ribeye pork belly ground round. Filet
-mignon pork chop spare ribs brisket pastrami picanha bacon, biltong beef ribs
-corned beef ham hock tail. Meatloaf kielbasa turducken, salami chuck beef ribs
-venison hamburger t-bone landjaeger pork chop drumstick sausage bacon.
+## Introduction
+### Challenges
+Challenges for real-time processing of data streams for _frequent item queries_
+include:
+* data may be of unknown and possibly unbounded length
+* data may be arriving at a very fast rate
+* it might not be possible to go back and re-read the data
+* too large a window of observation may include stale data
+Therefore, a solution should have the following characteristics:
+* uses limited memory
+* can process events in the stream in O(1) constant time
+* requires only a single pass over the data
+### The algorithm
+> LOOP<br/>
+> 1. For each element e in the next b elements:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;If a local counter exists for the type of element e:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Increment the local counter.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new local counter for this element type<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to 1.<br/>
+> 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
+> 3. Delete all local counters<br/>
+> 4. For each type named in S:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;If a global counter exists for this type:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add to it the count recorded in S.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new global counter for this element type<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to the count recorded in S.<br/>
+> 5. Add the count of the kth largest type in S to δ.<br/>
+> 6. If sizeOf(Q) > N/b:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(b) For all element types named in S':<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Subtract from their global counters the counts<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;recorded in S'<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If a counter is decremented to zero:<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Delete it.<br/>
+> &nbsp;&nbsp;&nbsp;&nbsp;(c) Output the identity and value of each global counter > δ.
+>
+> &mdash; <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
 ## Usage
     require 'frequent-algorithm'
+    # data is pi to 1000 digits
+    pi = File.read('test/frequent/test_data_pi').strip
+    N = 100  # size of main window
+    b =  20  # size of basic window
+    k =   3  # we are interested in top-3 numerals in pi
+    data = pi.scan(/./).each_slice(b)  # enumerator over basic windows
+    alg = Frequent::Algorithm.new(N, b, k)
+    # read in and process the 1st basic window
+    alg.process(data.next)
+    # and the top-3 numerals are?
+    top3 = alg.statistics  # Hash of item => count
+    puts top3
+    # lather, rinse and repeat
+    alg.process(data.next)
 ## Development
 The development of this gem requires the following:
-* [Ruby 2.0 or greater](https://www.ruby-lang.org/en/)
+* [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
 * [rubygems](https://rubygems.org/pages/download)
 * [`bundler`](https://github.com/bundler/bundler)
 * [`rake`](https://github.com/ruby/rake)
+* [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
 * [`yard`](https://rubygems.org/gems/yard) (documentation)
 * [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
-### Documentation
-`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
-[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
-Check out [Getting Started with
-Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
-### Build
-Development, testing and release of this rubygem uses the following
+Building, testing and releasing this rubygem use the following
 `rake` commands:
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
     rake clean    # Remove any temporary products
     rake clobber  # Remove any generated file
     rake install  # Build and install frequent-algorithm-n.n.n.gem into system gems
-    rake release  # Create tag vn.n.n and build and push frequent-algorithm-n.n.n.gem to Rubygems
+    rake release  # Create tag vn.n.n and build and push
+                  # frequent-algorithm-n.n.n.gem to Rubygems
     rake test     # Execute unit tests
+### Documentation
+`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
+[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
+Check out [Getting Started with
+Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
 ### Unit Testing
 `frequent-algorithm` uses
 [`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
 unit testing.
-### Release
+### Releasing
 Please refer to Publishing To Rubygems.org in the
 [Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
+### Contributing
+1. Fork it
+2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
+3. Create your feature branch (`git branch my-new-feature && git checkout
+   my-new-feature`)
+4. Commit your changes (`git commit -am 'Add some feature'`)
+5. Push to the branch (`git push origin my-new-feature:dev-branch`)
+6. Create new Pull Request
+You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
 ## License
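
The Usage snippet in the README diff above feeds the algorithm one basic window at a time through an enumerator. As a small self-contained sketch of that pattern, the code below slices a stream into basic windows and tallies the most frequent items of one window. The helper `top_k_summary` is hypothetical and not part of the gem, and the 20-digit string is only a stand-in for the `test/frequent/test_data_pi` file.

```ruby
# Hypothetical helper: count each item in one basic window and keep the
# k largest counts as [item, count] pairs, highest count first.
def top_k_summary(window, k)
  counts = Hash.new(0)
  window.each { |e| counts[e] += 1 }
  counts.sort_by { |_, count| -count }.first(k)
end

stream  = '31415926535897932384'.scan(/./)  # assumed sample data
b       = 5                                 # size of a basic window
windows = stream.each_slice(b)              # Enumerator, one window per #next

first_window = windows.next
p first_window                    # => ["3", "1", "4", "1", "5"]
p top_k_summary(first_window, 1)  # => [["1", 2]]
```

Calling `windows.next` again yields the following window, which is the "lather, rinse and repeat" step; `StopIteration` is raised once the stream is exhausted.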

data/lib/frequent-algorithm.rb CHANGED

@@ -1,3 +1,4 @@
+# coding: utf-8
 require 'frequent/algorithm'
 =begin

data/lib/frequent/algorithm.rb CHANGED

@@ -1,23 +1,141 @@
+# coding: utf-8
 require 'frequent/version'
 module Frequent
+  # `Frequent::Algorithm` is the Ruby implementation of the
+  # Demaine et al. FREQUENT algorithm for calculating
+  # top-k items in a stream.
+  #
+  # The aims of this algorithm are:
+  # * use limited memory
+  # * require constant processing time per item
+  # * require a single-pass only
+  #
   class Algorithm
-    attr_reader :n, :b
+    # @return [Integer] the number of items in the main window
+    attr_reader :n
+    # @return [Integer] the number of items in a basic window
+    attr_reader :b
+    # @return [Integer] the number of top item categories to track
+    attr_reader :k
+    # @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
+    attr_reader :queue
+    # @return [Hash<Object,Integer>] global mapping of items and counts
+    attr_reader :statistics
+    # @return [Integer] minimum threshold for membership in top-k items
+    attr_reader :delta
-    def initialize(n, b)
+    # Initializes this top-k frequency-calculating instance.
+    #
+    # @param [Integer] n number of items in the main window
+    # @param [Integer] b number of items in a basic window
+    # @param [Integer] k number of top item categories to track
+    # @raise [ArgumentError] if n is not greater than 0
+    # @raise [ArgumentError] if b is not greater than 0
+    # @raise [ArgumentError] if k is not greater than 0
+    # @raise [ArgumentError] if n/b is less than 1
+    def initialize(n, b, k=1)
+      if n <= 0
+        raise ArgumentError.new('n must be greater than 0')
+      end
+      if b <= 0
+        raise ArgumentError.new('b must be greater than 0')
+      end
+      if k <= 0
+        raise ArgumentError.new('k must be greater than 0')
+      end
+      if n/b < 1
+        raise ArgumentError.new('n/b must be at least 1')
+      end
       @n = n
       @b = b
+      @k = k
+      @queue = []
+      @statistics = {}
+      @delta = 0
     end
-    def process(item)
-      raise NotImplementedError.new
+    # Processes a single basic window of b items, by first adding
+    # a summary of this basic window in the internal global queue;
+    # and then updating the global statistics accordingly.
+    #
+    # @param [Array] elements an array of objects representing a basic window
+    def process(elements)
+      # Do we need this?
+      return if elements.length != @b
+      # Step 1
+      summary = {}
+      elements.each do |e|
+        if summary.key? e
+          summary[e] += 1
+        else
+          summary[e] = 1
+        end
+      end
+      # index of the k-th item
+      kth_index = find_kth_largest(summary)
+      # Step 2 & 3
+      # summary is [[item,count],[item,count],[item,count]....]
+      # sorted by descending order of the item count
+      summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
+      @queue << summary
+      # Step 4
+      summary.each do |t|
+        if @statistics.key? t[0]
+          @statistics[t[0]] += t[1]
+        else
+          @statistics[t[0]] = t[1]
+        end
+      end
+      # Step 5
+      @delta += summary[kth_index][1]
+      # Step 6
+      if should_pop_oldest_summary
+        # a
+        summary_p = @queue.shift
+        @delta -= summary_p[find_kth_largest(summary_p)][1]
+        # b
+        summary_p.each { |t| @statistics[t[0]] -= t[1] }
+        @statistics.delete_if { |k,v| v <= 0 }
+        # c
+        @statistics.select { |k,v| v > @delta }
+      else
+        {}
+      end
     end
+    # Returns the version for this gem.
+    #
+    # @return [String] the version for this gem.
     def version
       Frequent::VERSION
     end
+    private
+      # Return true when it is ready to pop oldest summary from queue
+      #
+      # @return [Boolean] whether it is ready to pop oldest summary from queue
+      def should_pop_oldest_summary
+        @queue.length > @n/@b
+      end
+      # Return the k-th index of a summary object
+      #
+      # @param [Hash, Array] summary a summary object
+      # @return [Integer] the k-th index
+      def find_kth_largest(summary)
+        [summary.length, @k].min - 1
+      end
   end
 end
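
To make the effect of steps 1 to 6 easier to follow, here is a compact standalone sketch that can be traced by hand on a tiny stream. `SlidingTopK` is a hypothetical class written for illustration; it is not the gem's `Frequent::Algorithm`, though it follows the same quoted steps and, like the diffed code, returns an empty hash until the queue holds more than n/b summaries.

```ruby
# Hypothetical illustration class, not part of frequent-algorithm.
class SlidingTopK
  attr_reader :delta, :counters

  def initialize(n, b, k)
    @max_summaries = n / b   # integer division: basic windows per main window
    @k = k
    @queue = []              # basic-window summaries, oldest first
    @counters = Hash.new(0)  # global counters
    @delta = 0
  end

  # Processes one basic window and returns the global counters above delta.
  def process(window)
    # Steps 1-3: count locally, keep the k most frequent, drop the rest
    local = Hash.new(0)
    window.each { |e| local[e] += 1 }
    summary = local.sort_by { |_, count| -count }.first(@k)
    @queue << summary
    # Step 4: fold the summary into the global counters
    summary.each { |item, count| @counters[item] += count }
    # Step 5: add the count of the k-th largest entry to delta
    @delta += summary.last[1]
    # Step 6: once the queue exceeds n/b, evict the oldest summary and report
    return {} unless @queue.length > @max_summaries
    oldest = @queue.shift
    @delta -= oldest.last[1]
    oldest.each { |item, count| @counters[item] -= count }
    @counters.delete_if { |_, count| count <= 0 }
    @counters.select { |_, count| count > @delta }
  end
end

alg = SlidingTopK.new(6, 3, 2)
p alg.process(%w[a a b])  # => {}
p alg.process(%w[a a c])  # => {}
p alg.process(%w[a a b])  # => {"a"=>4}
```

In the third call the first summary is evicted, so the counter for `a` drops from 6 to 4 and delta from 3 to 2, leaving `a` as the only item above the threshold.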

data/lib/frequent/version.rb CHANGED

@@ -1,5 +1,14 @@
+# coding: utf-8
+# `Frequent` is the namespace for objects implementing
+# the Demaine et al. FREQUENT algorithm for finding
+# the most frequently-appearing items (top-k) in a
+# data stream in sliding windows.
+#
+# `Frequent::Algorithm` is the implementation class.
 module Frequent
-  VERSION = '0.0.1'
+  # Version string for this Rubygem.
+  VERSION = '0.0.2'
 end
 =begin

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: frequent-algorithm
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.0.2
 platform: ruby
 authors:
 - Willie Tong
@@ -9,22 +9,52 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2015-03-11 00:00:00.000000000 Z
-dependencies: []
+date: 2015-03-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: minitest
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 description: |
-  frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows. Please refer to [Identifying Frequent Items in Sliding Windows over On-Line Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by Golab, DeHaan, Demaine, L&#243;pez-Ortiz and Munro (2003).
-email: buruzaemon@gmail.com
+  frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
+email:
+- tongsinyin@gmail.com
+- buruzaemon@gmail.com
 executables: []
 extensions: []
 extra_rdoc_files: []
 files:
-- .yardopts
-- CHANGELOG
-- LICENSE
-- README.md
 - lib/frequent-algorithm.rb
 - lib/frequent/algorithm.rb
 - lib/frequent/version.rb
+- README.md
+- LICENSE
+- CHANGELOG
+- .yardopts
 homepage: https://github.com/buruzaemon/frequent-algorithm
 licenses:
 - MIT
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.4.1
+rubygems_version: 2.0.14
 signing_key:
 specification_version: 4
-summary: A Ruby implementation of the FREQUENT algorithm for identifying frequent
-  items in a data stream in sliding windows.
+summary: Identifies frequent items in a data stream in sliding windows using the Demaine
+  et al. FREQUENT algorithm.
 test_files: []
-has_rdoc: