frequent-algorithm 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 89e3696841e7c22693c58221b03a793c1f5d94d5
4
- data.tar.gz: bed42bcc5345a79a588a3c1ab594520ca67c3d40
3
+ metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
4
+ data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
5
5
  SHA512:
6
- metadata.gz: 45d1ee2b0a585529e06935735b6a53be8a88a45548c659589bc148f1b9024bac19fba9f8f3e9dbebf8d733dd9fa113e6b3451cad6b3ee1a61d2e1988e2b657c4
7
- data.tar.gz: 6100d5530291ec757aed6329f3f680fced9a0a7ce8b8ca974bcac72102aa3882f7ef72b47a86c318dcdf7d5ae12671a1e9c87af124f6914b75a602860056cd5c
6
+ metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
7
+ data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
data/CHANGELOG CHANGED
@@ -0,0 +1,9 @@
1
+ ## CHANGELOG
2
+
3
+ - __2015/03/19__: 0.0.2 release.
4
+ - First-stage implementation.
5
+ - API documentation added.
6
+ - Fleshing out unit tests.
7
+
8
+ - __2015/03/11__: 0.0.1 release.
9
+ - Initial release.
data/README.md CHANGED
@@ -1,4 +1,10 @@
1
- # frequent-algorithm
1
+ # frequent-algorithm [![Gem Version](https://badge.fury.io/rb/frequent-algorithm.svg)](http://badge.fury.io/rb/frequent-algorithm) [![Build Status](https://travis-ci.org/buruzaemon/frequent-algorithm.svg)](https://travis-ci.org/buruzaemon/frequent-algorithm)
2
+
3
+ Web site usage, social network behavior and Internet traffic are examples
4
+ of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
5
+ where most of the events are due to the actions of a very small few.
6
+ Knowing at any given point in time which items are trending is valuable
7
+ in understanding the system.
2
8
 
3
9
  `frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
4
10
  for identifying frequent items in a data stream in sliding windows.
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
6
12
  Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
7
13
  Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
8
14
 
9
- ## Getting Started
10
-
11
- Bacon ipsum dolor amet short loin flank swine ham hock tail. T-bone biltong
12
- beef shoulder salami, leberkas pork chop ribeye pork belly ground round. Filet
13
- mignon pork chop spare ribs brisket pastrami picanha bacon, biltong beef ribs
14
- corned beef ham hock tail. Meatloaf kielbasa turducken, salami chuck beef ribs
15
- venison hamburger t-bone landjaeger pork chop drumstick sausage bacon.
15
+ ## Introduction
16
+
17
+ ### Challenges
18
+
19
+ Challenges for real-time processing of data streams for _frequent item queries_
20
+ include:
21
+
22
+ * data may be of unknown and possibly unbounded length
23
+ * data may be arriving at a very fast rate
24
+ * it might not be possible to go back and re-read the data
25
+ * too large a window of observation may include stale data
26
+
27
+ Therefore, a solution should have the following characteristics:
28
+
29
+ * uses limited memory
30
+ * can process events in the stream in Ο(1) constant time
31
+ * requires only a single pass over the data
32
+
33
+
34
+ ### The algorithm
35
+
36
+ > LOOP<br/>
37
+ > 1. For each element e in the next b elements:<br/>
38
+ > &nbsp;&nbsp;&nbsp;&nbsp;If a local counter exists for the type of element e:<br/>
39
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Increment the local counter.<br/>
40
+ > &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
41
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new local counter for this element type<br/>
42
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to 1.<br/>
43
+ > 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
44
+ > 3. Delete all local counters<br/>
45
+ > 4. For each type named in S:<br/>
46
+ > &nbsp;&nbsp;&nbsp;&nbsp;If a global counter exists for this type:<br/>
47
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add to it the count recorded in S.<br/>
48
+ > &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
49
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new global counter for this element type<br/>
50
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to the count recorded in S.<br/>
51
+ > 5. Add the count of the kth largest type in S to δ.<br/>
52
+ > 6. If sizeOf(Q) > N/b:<br/>
53
+ > &nbsp;&nbsp;&nbsp;&nbsp;(a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
54
+ > &nbsp;&nbsp;&nbsp;&nbsp;(b) For all element types named in S':<br/>
55
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Subtract from their global counters the counts<br/>
56
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;recorded in S'<br/>
57
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If a counter is decremented to zero:<br/>
58
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Delete it.<br/>
59
+ > &nbsp;&nbsp;&nbsp;&nbsp;(c) Output the identity and value of each global counter > δ.
60
+ >
61
+ > &mdash; <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
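
Steps 1 and 2 of the loop above can be sketched in a few lines of Ruby. This is only an illustrative sketch, not the gem's code: the helper name `summarize_window` and the sample data are assumptions made for the example.

```ruby
# Hypothetical helper (not part of the gem): summarize one basic window.
def summarize_window(elements, k)
  # Step 1: one local counter per element type, created on first sight.
  counts = Hash.new(0)
  elements.each { |e| counts[e] += 1 }

  # Step 2: keep the k most frequent [item, count] pairs, largest first.
  counts.sort_by { |_item, count| -count }.first(k)
end

# Sample window, invented for illustration.
summary = summarize_window(%w[a b a c a b d], 2)
p summary
```

The local counters live only inside the helper, which mirrors step 3 (deleting them once the summary has been taken).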
16
62
 
17
63
 
18
64
  ## Usage
19
65
 
20
66
  require 'frequent-algorithm'
21
67
 
68
+ N = 100 # size of main window
69
+ b = 20 # size of basic window
70
+ k = 3 # we are interested in top-3 numerals in pi
71
+
72
+ # data is pi to 1000 digits
73
+ pi = File.read('test/frequent/test_data_pi').strip
74
+ data = pi.scan(/./).each_slice(b)
75
+
76
+ alg = Frequent::Algorithm.new(N, b, k)
77
+
78
+ # read in and process the 1st basic window
79
+ alg.process(data.next)
80
+
81
+ # and the top-3 numerals are?
82
+ top3 = alg.statistics
83
+ puts top3
84
+
85
+ # lather, rinse and repeat
86
+ alg.process(data.next)
87
+
88
+
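
To see what `data.next` hands to `process` in the usage example, here is a small self-contained sketch of the same slicing idiom. It assumes a short inline digit string instead of the test data file, and a basic window size of 5 chosen only to keep the example small.

```ruby
# First 20 digits of pi, inlined for illustration (the README reads a file).
digits = '31415926535897932384'
b = 5 # illustrative basic window size

# each_slice without a block returns an Enumerator of b-element arrays,
# so each call to #next yields one basic window.
windows = digits.scan(/./).each_slice(b)
first = windows.next

p first
p windows.to_a.length
```

Each window is an array of one-character strings, so the items being counted are the numerals themselves.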
22
89
  ## Development
23
90
 
24
91
  The development of this gem requires the following:
25
92
 
26
- * [Ruby 2.0 or greater](https://www.ruby-lang.org/en/)
93
+ * [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
27
94
  * [rubygems](https://rubygems.org/pages/download)
28
95
  * [`bundler`](https://github.com/bundler/bundler)
29
96
  * [`rake`](https://github.com/ruby/rake)
97
+ * [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
30
98
  * [`yard`](https://rubygems.org/gems/yard) (documentation)
31
99
  * [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
32
100
 
33
- ### Documentation
34
-
35
- `frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
36
- [`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
37
- Check out [Getting Started with
38
- Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
39
-
40
- ### Build
41
-
42
- Development, testing and release of this rubygem uses the following
101
+ Building, testing and release of this rubygem uses the following
43
102
  `rake` commands:
44
103
 
45
104
 
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
47
106
  rake clean # Remove any temporary products
48
107
  rake clobber # Remove any generated file
49
108
  rake install # Build and install frequent-algorithm-n.n.n.gem into system gems
50
- rake release # Create tag vn.n.n and build and push frequent-algorithm-n.n.n.gem to Rubygems
109
+ rake release # Create tag vn.n.n and build and push
110
+ # frequent-algorithm-n.n.n.gem to Rubygems
51
111
  rake test # Execute unit tests
52
112
 
53
113
 
114
+ ### Documentation
115
+
116
+ `frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
117
+ [`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
118
+ Check out [Getting Started with
119
+ Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
120
+
54
121
  ### Unit Testing
55
122
 
56
123
  `frequent-algorithm` uses
57
124
  [`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
58
125
  unit testing.
59
126
 
60
-
61
- ### Release
127
+ ### Releasing
62
128
 
63
129
  Please refer to Publishing To Rubygems.org in the
64
130
  [Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
65
131
 
132
+ ### Contributing
133
+
134
+ 1. Fork it
135
+ 2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
136
+ 3. Create your feature branch (`git branch my-new-feature && git checkout
137
+ my-new-feature`)
138
+ 4. Commit your changes (`git commit -am 'Add some feature'`)
139
+ 5. Push to the branch (`git push origin my-new-feature:dev-branch`)
140
+ 6. Create new Pull Request
141
+
142
+ You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
143
+
66
144
 
67
145
  ## License
68
146
 
data/lib/frequent-algorithm.rb CHANGED
@@ -1,3 +1,4 @@
1
+ # coding: utf-8
1
2
  require 'frequent/algorithm'
2
3
 
3
4
  =begin
data/lib/frequent/algorithm.rb CHANGED
@@ -1,23 +1,141 @@
1
+ # coding: utf-8
1
2
  require 'frequent/version'
2
3
 
3
4
  module Frequent
4
5
 
6
+ # `Frequent::Algorithm` is the Ruby implementation of the
7
+ # Demaine et al. FREQUENT algorithm for calculating
8
+ # top-k items in a stream.
9
+ #
10
+ # The aims of this algorithm are:
11
+ # * use limited memory
12
+ # * require constant processing time per item
13
+ # * require a single-pass only
14
+ #
5
15
  class Algorithm
6
- attr_reader :n, :b
16
+ # @return [Integer] the number of items in the main window
17
+ attr_reader :n
18
+ # @return [Integer] the number of items in a basic window
19
+ attr_reader :b
20
+ # @return [Integer] the number of top item categories to track
21
+ attr_reader :k
22
+ # @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
23
+ attr_reader :queue
24
+ # @return [Hash<Object,Integer>] global mapping of items and counts
25
+ attr_reader :statistics
26
+ # @return [Integer] minimum threshold for membership in top-k items
27
+ attr_reader :delta
7
28
 
8
- def initialize(n, b)
29
+ # Initializes this top-k frequency-calculating instance.
30
+ #
31
+ # @param [Integer] n number of items in the main window
32
+ # @param [Integer] b number of items in a basic window
33
+ # @param [Integer] k number of top item categories to track
34
+ # @raise [ArgumentError] if n is not greater than 0
35
+ # @raise [ArgumentError] if b is not greater than 0
36
+ # @raise [ArgumentError] if k is not greater than 0
37
+ # @raise [ArgumentError] if n/b is less than 1
38
+ def initialize(n, b, k=1)
39
+ if n <= 0
40
+ raise ArgumentError.new('n must be greater than 0')
41
+ end
42
+ if b <= 0
43
+ raise ArgumentError.new('b must be greater than 0')
44
+ end
45
+ if k <= 0
46
+ raise ArgumentError.new('k must be greater than 0')
47
+ end
48
+ if n/b < 1
49
+ raise ArgumentError.new('n/b must be at least 1')
50
+ end
9
51
  @n = n
10
52
  @b = b
53
+ @k = k
54
+
55
+ @queue = []
56
+ @statistics = {}
57
+ @delta = 0
11
58
  end
12
59
 
13
- def process(item)
14
- raise NotImplementedError.new
60
+ # Processes a single basic window of b items, by first adding
61
+ # a summary of this basic window in the internal global queue;
62
+ # and then updating the global statistics accordingly.
63
+ #
64
+ # @param [Array] elements an array of objects representing a basic window
65
+ def process(elements)
66
+ # Do we need this?
67
+ return if elements.length != @b
68
+
69
+ # Step 1
70
+ summary = {}
71
+ elements.each do |e|
72
+ if summary.key? e
73
+ summary[e] += 1
74
+ else
75
+ summary[e] = 1
76
+ end
77
+ end
78
+
79
+ # index of the k-th item
80
+ kth_index = find_kth_largest(summary)
81
+
82
+ # Step 2 & 3
83
+ # summary is [[item,count],[item,count],[item,count]....]
84
+ # sorted by descending order of the item count
85
+ summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
86
+ @queue << summary
87
+
88
+ # Step 4
89
+ summary.each do |t|
90
+ if @statistics.key? t[0]
91
+ @statistics[t[0]] += t[1]
92
+ else
93
+ @statistics[t[0]] = t[1]
94
+ end
95
+ end
96
+
97
+ # Step 5
98
+ @delta += summary[kth_index][1]
99
+
100
+ # Step 6
101
+ if should_pop_oldest_summary
102
+ # a
103
+ summary_p = @queue.shift
104
+ @delta -= summary_p[find_kth_largest(summary_p)][1]
105
+
106
+ # b
107
+ summary_p.each { |t| @statistics[t[0]] -= t[1] }
108
+ @statistics.delete_if { |k,v| v <= 0 }
109
+
110
+ #c
111
+ @statistics.select { |k,v| v > @delta }
112
+ else
113
+ {}
114
+ end
15
115
  end
16
116
 
117
+ # Returns the version for this gem.
118
+ #
119
+ # @return [String] the version for this gem.
17
120
  def version
18
121
  Frequent::VERSION
19
122
  end
20
123
 
124
+ private
125
+ # Return true when it is ready to pop oldest summary from queue
126
+ #
127
+ # @return [Boolean] whether it is ready to pop oldest summary from queue
128
+ def should_pop_oldest_summary
129
+ @queue.length > @n/@b
130
+ end
131
+
132
+ # Return the k-th index of a summary object
133
+ #
134
+ # @param [Object] summary a summary object
135
+ # @return [Integer] the k-th index
136
+ def find_kth_largest(summary)
137
+ [summary.length, @k].min - 1
138
+ end
21
139
  end
22
140
  end
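
The eviction in step 6 of `process` can be followed in isolation with the sketch below. It is a standalone illustration, not the gem's API: the helper `evict_oldest!` and the sample queue, counters and threshold are all assumptions invented for the example.

```ruby
# Hypothetical helper: drop the oldest summary and roll back its effect.
# Each summary is an array of [item, count] pairs sorted by descending
# count, so its last pair is the k-th largest.
def evict_oldest!(queue, statistics, delta)
  oldest = queue.shift                  # 6(a): remove S' from the front of Q
  delta -= oldest.last[1]               # subtract the k-th largest count
  oldest.each { |item, count| statistics[item] -= count } # 6(b)
  statistics.delete_if { |_item, count| count <= 0 }      # drop empty counters
  delta
end

# Sample state after two basic windows with k = 2 (invented data).
queue      = [[['a', 3], ['b', 2]], [['a', 4], ['c', 1]]]
statistics = { 'a' => 7, 'b' => 2, 'c' => 1 }
delta      = 3 # 2 (from the first summary) + 1 (from the second)

delta = evict_oldest!(queue, statistics, delta)

# 6(c): report every global counter above the threshold.
frequent = statistics.select { |_item, count| count > delta }
p frequent
```

Returning the new threshold rather than mutating an instance variable keeps the sketch free of class state; the gem itself tracks `@delta` on the instance.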
23
141
 
data/lib/frequent/version.rb CHANGED
@@ -1,5 +1,14 @@
1
+ # coding: utf-8
2
+
3
+ # `Frequent` is the namespace for objects implementing
4
+ # the Demaine et al. FREQUENT algorithm for finding
5
+ # the most frequently-appearing items (top-k) in a
6
+ # data stream in sliding windows.
7
+ #
8
+ # `Frequent::Algorithm` is the implementation class.
1
9
  module Frequent
2
- VERSION = '0.0.1'
10
+ # Version string for this Rubygem.
11
+ VERSION = '0.0.2'
3
12
  end
4
13
 
5
14
  =begin
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: frequent-algorithm
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Willie Tong
@@ -9,22 +9,52 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2015-03-11 00:00:00.000000000 Z
13
- dependencies: []
12
+ date: 2015-03-19 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rake
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - '>='
19
+ - !ruby/object:Gem::Version
20
+ version: '0'
21
+ type: :development
22
+ prerelease: false
23
+ version_requirements: !ruby/object:Gem::Requirement
24
+ requirements:
25
+ - - '>='
26
+ - !ruby/object:Gem::Version
27
+ version: '0'
28
+ - !ruby/object:Gem::Dependency
29
+ name: minitest
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - '>='
33
+ - !ruby/object:Gem::Version
34
+ version: '0'
35
+ type: :development
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - '>='
40
+ - !ruby/object:Gem::Version
41
+ version: '0'
14
42
  description: |
15
- frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows. Please refer to [Identifying Frequent Items in Sliding Windows over On-Line Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by Golab, DeHaan, Demaine, L&#243;pez-Ortiz and Munro (2003).
16
- email: buruzaemon@gmail.com
43
+ frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
44
+ email:
45
+ - tongsinyin@gmail.com
46
+ - buruzaemon@gmail.com
17
47
  executables: []
18
48
  extensions: []
19
49
  extra_rdoc_files: []
20
50
  files:
21
- - .yardopts
22
- - CHANGELOG
23
- - LICENSE
24
- - README.md
25
51
  - lib/frequent-algorithm.rb
26
52
  - lib/frequent/algorithm.rb
27
53
  - lib/frequent/version.rb
54
+ - README.md
55
+ - LICENSE
56
+ - CHANGELOG
57
+ - .yardopts
28
58
  homepage: https://github.com/buruzaemon/frequent-algorithm
29
59
  licenses:
30
60
  - MIT
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
45
75
  version: '0'
46
76
  requirements: []
47
77
  rubyforge_project:
48
- rubygems_version: 2.4.1
78
+ rubygems_version: 2.0.14
49
79
  signing_key:
50
80
  specification_version: 4
51
- summary: A Ruby implementation of the FREQUENT algorithm for identifying frequent
52
- items in a data stream in sliding windows.
81
+ summary: Identifies frequent items in a data stream in sliding windows using the Demaine
82
+ et al. FREQUENT algorithm.
53
83
  test_files: []
54
- has_rdoc: