frequent-algorithm 0.0.1 → 0.0.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 89e3696841e7c22693c58221b03a793c1f5d94d5
4
- data.tar.gz: bed42bcc5345a79a588a3c1ab594520ca67c3d40
3
+ metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
4
+ data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
5
5
  SHA512:
6
- metadata.gz: 45d1ee2b0a585529e06935735b6a53be8a88a45548c659589bc148f1b9024bac19fba9f8f3e9dbebf8d733dd9fa113e6b3451cad6b3ee1a61d2e1988e2b657c4
7
- data.tar.gz: 6100d5530291ec757aed6329f3f680fced9a0a7ce8b8ca974bcac72102aa3882f7ef72b47a86c318dcdf7d5ae12671a1e9c87af124f6914b75a602860056cd5c
6
+ metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
7
+ data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
data/CHANGELOG CHANGED
@@ -0,0 +1,9 @@
1
+ ## CHANGELOG
2
+
3
+ - __2015/03/19__: 0.0.2 release.
4
+ - First-stage implementation.
5
+ - API documentation added.
6
+ - Fleshing out unit tests.
7
+
8
+ - __2015/03/11__: 0.0.1 release.
9
+ - Initial release.
data/README.md CHANGED
@@ -1,4 +1,10 @@
1
- # frequent-algorithm
1
+ # frequent-algorithm [![Gem Version](https://badge.fury.io/rb/frequent-algorithm.svg)](http://badge.fury.io/rb/frequent-algorithm) [![Build Status](https://travis-ci.org/buruzaemon/frequent-algorithm.svg)](https://travis-ci.org/buruzaemon/frequent-algorithm)
2
+
3
+ Web site usage, social network behavior and Internet traffic are examples
4
+ of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
5
+ where most of the events are due to the actions of a very small few.
6
+ Knowing at any given point in time which items are trending is valuable
7
+ in understanding the system.
2
8
 
3
9
  `frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
4
10
  for identifying frequent items in a data stream in sliding windows.
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
6
12
  Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
7
13
  Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
8
14
 
9
- ## Getting Started
10
-
11
- Bacon ipsum dolor amet short loin flank swine ham hock tail. T-bone biltong
12
- beef shoulder salami, leberkas pork chop ribeye pork belly ground round. Filet
13
- mignon pork chop spare ribs brisket pastrami picanha bacon, biltong beef ribs
14
- corned beef ham hock tail. Meatloaf kielbasa turducken, salami chuck beef ribs
15
- venison hamburger t-bone landjaeger pork chop drumstick sausage bacon.
15
+ ## Introduction
16
+
17
+ ### Challenges
18
+
19
+ Challenges for real-time processing of data streams for _frequent item queries_
20
+ include:
21
+
22
+ * data may be of unknown and possibly unbounded length
23
+ * data may be arriving at a very fast rate
24
+ * it might not be possible to go back and re-read the data
25
+ * too large a window of observation may include stale data
26
+
27
+ Therefore, a solution should have the following characteristics:
28
+
29
+ * uses limited memory
30
+ * can process events in the stream in Ο(1) constant time
31
+ * requires only a single pass over the data
32
+
33
+
34
+ ### The algorithm
35
+
36
+ > LOOP<br/>
37
+ > 1. For each element e in the next b elements:<br/>
38
+ > &nbsp;&nbsp;&nbsp;&nbsp;If a local counter exists for the type of element e:<br/>
39
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Increment the local counter.<br/>
40
+ > &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
41
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new local counter for this element type<br/>
42
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to 1.<br/>
43
+ > 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
44
+ > 3. Delete all local counters<br/>
45
+ > 4. For each type named in S:<br/>
46
+ > &nbsp;&nbsp;&nbsp;&nbsp;If a global counter exists for this type:<br/>
47
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Add to it the count recorded in S.<br/>
48
+ > &nbsp;&nbsp;&nbsp;&nbsp;Otherwise:<br/>
49
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Create a new global counter for this element type<br/>
50
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and set it equal to the count recorded in S.<br/>
51
+ > 5. Add the count of the kth largest type in S to δ.<br/>
52
+ > 6. If sizeOf(Q) > N/b:<br/>
53
+ > &nbsp;&nbsp;&nbsp;&nbsp;(a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
54
+ > &nbsp;&nbsp;&nbsp;&nbsp;(b) For all element types named in S':<br/>
55
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Subtract from their global counters the counts<br/>
56
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;recorded in S'<br/>
57
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If a counter is decremented to zero:<br/>
58
+ > &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Delete it.<br/>
59
+ > &nbsp;&nbsp;&nbsp;&nbsp;(c) Output the identity and value of each global counter > δ.
60
+ >
61
+ > &mdash; <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
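The steps quoted above can be traced with a small standalone sketch. This is an illustration only, not the gem's `Frequent::Algorithm` class: the name `SlidingTopKSketch` and its internals are hypothetical, and it assumes every basic window is non-empty and that ties between equal counts may be broken arbitrarily.

```ruby
# Hypothetical standalone sketch of the quoted steps (not the gem's API).
class SlidingTopKSketch
  attr_reader :counts, :delta

  def initialize(n, b, k)
    @max_summaries = n / b   # number of basic windows in the main window
    @k = k
    @queue = []              # Q: one summary per basic window
    @counts = Hash.new(0)    # global counters
    @delta = 0               # running sum of k-th largest counts
  end

  # Processes one basic window; returns the items whose global count
  # exceeds delta once the main window is full, else an empty Hash.
  def process(window)
    # Step 1: local counters
    tally = Hash.new(0)
    window.each { |e| tally[e] += 1 }

    # Steps 2-3: keep only the k most frequent items as the summary
    summary = tally.sort_by { |_item, count| -count }.first(@k)
    @queue << summary

    # Step 4: fold the summary into the global counters
    summary.each { |item, count| @counts[item] += count }

    # Step 5: add the smallest count kept in the summary to delta
    @delta += summary.last[1]

    # Step 6: evict the oldest summary once the main window overflows
    return {} unless @queue.length > @max_summaries

    oldest = @queue.shift
    @delta -= oldest.last[1]
    oldest.each { |item, count| @counts[item] -= count }
    @counts.delete_if { |_item, count| count <= 0 }
    @counts.select { |_item, count| count > @delta }
  end
end

sketch = SlidingTopKSketch.new(6, 3, 2)   # N = 6, b = 3, k = 2
r1 = sketch.process(%w[a a b])            # queue not yet full
r2 = sketch.process(%w[a a c])            # queue holds N/b summaries
r3 = sketch.process(%w[a a b])            # oldest summary is evicted
```

In this trace the first two calls return an empty Hash because the queue has not yet exceeded N/b summaries. On the third call the oldest summary is removed, δ drops from 3 to 2, and only `"a"` (global count 4) stays above δ.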
16
62
 
17
63
 
18
64
  ## Usage
19
65
 
20
66
  require 'frequent-algorithm'
21
67
 
68
+ N = 100 # size of main window
69
+ b = 20 # size of basic window
70
+ k = 3 # we are interested in top-3 numerals in pi
71
+
72
+ # data is pi to 1000 digits, split into basic windows of b items
73
+ pi = File.read('test/frequent/test_data_pi').strip
74
+ data = pi.scan(/./).each_slice(b)
75
+
76
+ alg = Frequent::Algorithm.new(N, b, k)
77
+
78
+ # read in and process the 1st basic window
79
+ alg.process(data.next)
80
+
81
+ # and the top-3 numerals are?
82
+ top3 = alg.statistics
83
+ puts top3
84
+
85
+ # lather, rinse and repeat
86
+ alg.process(data.next)
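As a side note on the usage above, `each_slice` called without a block returns an `Enumerator`, which is why `data.next` hands back one basic window at a time. A minimal sketch with a short, made-up digit string (the variable names here are illustrative only):

```ruby
digits = '3141592653'.scan(/./)   # one single-character String per digit
b = 4                             # size of a basic window in this toy example

windows = digits.each_slice(b)    # no block given, so an Enumerator is returned

first  = windows.next             # first b digits
second = windows.next             # next b digits
last   = windows.next             # remainder, shorter than b

begin
  windows.next                    # nothing left to read
  exhausted = false
rescue StopIteration
  exhausted = true
end
```

A trailing slice shorter than `b`, such as `last` here, would be skipped by the `process` method shown in this diff, since it returns early when `elements.length != @b`.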
87
+
88
+
22
89
  ## Development
23
90
 
24
91
  The development of this gem requires the following:
25
92
 
26
- * [Ruby 2.0 or greater](https://www.ruby-lang.org/en/)
93
+ * [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
27
94
  * [rubygems](https://rubygems.org/pages/download)
28
95
  * [`bundler`](https://github.com/bundler/bundler)
29
96
  * [`rake`](https://github.com/ruby/rake)
97
+ * [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
30
98
  * [`yard`](https://rubygems.org/gems/yard) (documentation)
31
99
  * [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
32
100
 
33
- ### Documentation
34
-
35
- `frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
36
- [`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
37
- Check out [Getting Started with
38
- Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
39
-
40
- ### Build
41
-
42
- Development, testing and release of this rubygem uses the following
101
+ Building, testing and release of this rubygem uses the following
43
102
  `rake` commands:
44
103
 
45
104
 
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
47
106
  rake clean # Remove any temporary products
48
107
  rake clobber # Remove any generated file
49
108
  rake install # Build and install frequent-algorithm-n.n.n.gem into system gems
50
- rake release # Create tag vn.n.n and build and push frequent-algorithm-n.n.n.gem to Rubygems
109
+ rake release # Create tag vn.n.n and build and push
110
+ # frequent-algorithm-n.n.n.gem to Rubygems
51
111
  rake test # Execute unit tests
52
112
 
53
113
 
114
+ ### Documentation
115
+
116
+ `frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
117
+ [`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
118
+ Check out [Getting Started with
119
+ Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
120
+
54
121
  ### Unit Testing
55
122
 
56
123
  `frequent-algorithm` uses
57
124
  [`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
58
125
  unit testing.
59
126
 
60
-
61
- ### Release
127
+ ### Releasing
62
128
 
63
129
  Please refer to Publishing To Rubygems.org in the
64
130
  [Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
65
131
 
132
+ ### Contributing
133
+
134
+ 1. Fork it
135
+ 2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
136
+ 3. Create your feature branch (`git branch my-new-feature && git checkout
137
+ my-new-feature`)
138
+ 4. Commit your changes (`git commit -am 'Add some feature'`)
139
+ 5. Push to the branch (`git push origin my-new-feature:dev-branch`)
140
+ 6. Create new Pull Request
141
+
142
+ You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
143
+
66
144
 
67
145
  ## License
68
146
 
@@ -1,3 +1,4 @@
1
+ # coding: utf-8
1
2
  require 'frequent/algorithm'
2
3
 
3
4
  =begin
@@ -1,23 +1,141 @@
1
+ # coding: utf-8
1
2
  require 'frequent/version'
2
3
 
3
4
  module Frequent
4
5
 
6
+ # `Frequent::Algorithm` is the Ruby implementation of the
7
+ # Demaine et al. FREQUENT algorithm for calculating
8
+ # top-k items in a stream.
9
+ #
10
+ # The aims of this algorithm are:
11
+ # * use limited memory
12
+ # * require constant processing time per item
13
+ # * require a single-pass only
14
+ #
5
15
  class Algorithm
6
- attr_reader :n, :b
16
+ # @return [Integer] the number of items in the main window
17
+ attr_reader :n
18
+ # @return [Integer] the number of items in a basic window
19
+ attr_reader :b
20
+ # @return [Integer] the number of top item categories to track
21
+ attr_reader :k
22
+ # @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
23
+ attr_reader :queue
24
+ # @return [Hash<Object,Integer>] global mapping of items and counts
25
+ attr_reader :statistics
26
+ # @return [Integer] minimum threshold for membership in top-k items
27
+ attr_reader :delta
7
28
 
8
- def initialize(n, b)
29
+ # Initializes this top-k frequency-calculating instance.
30
+ #
31
+ # @param [Integer] n number of items in the main window
32
+ # @param [Integer] b number of items in a basic window
33
+ # @param [Integer] k number of top item categories to track
34
+ # @raise [ArgumentError] if n is not greater than 0
35
+ # @raise [ArgumentError] if b is not greater than 0
36
+ # @raise [ArgumentError] if k is not greater than 0
37
+ # @raise [ArgumentError] if n/b is less than 1
38
+ def initialize(n, b, k=1)
39
+ if n <= 0
40
+ raise ArgumentError.new('n must be greater than 0')
41
+ end
42
+ if b <= 0
43
+ raise ArgumentError.new('b must be greater than 0')
44
+ end
45
+ if k <= 0
46
+ raise ArgumentError.new('k must be greater than 0')
47
+ end
48
+ if n/b < 1
49
+ raise ArgumentError.new('n/b must be at least 1')
50
+ end
9
51
  @n = n
10
52
  @b = b
53
+ @k = k
54
+
55
+ @queue = []
56
+ @statistics = {}
57
+ @delta = 0
11
58
  end
12
59
 
13
- def process(item)
14
- raise NotImplementedError.new
60
+ # Processes a single basic window of b items, by first adding
61
+ # a summary of this basic window in the internal global queue;
62
+ # and then updating the global statistics accordingly.
63
+ #
64
+ # @param [Array] elements an array of objects representing a basic window
65
+ def process(elements)
66
+ # Do we need this?
67
+ return if elements.length != @b
68
+
69
+ # Step 1
70
+ summary = {}
71
+ elements.each do |e|
72
+ if summary.key? e
73
+ summary[e] += 1
74
+ else
75
+ summary[e] = 1
76
+ end
77
+ end
78
+
79
+ # index of the k-th item
80
+ kth_index = find_kth_largest(summary)
81
+
82
+ # Step 2 & 3
83
+ # summary is [[item,count],[item,count],[item,count]....]
84
+ # sorted by descending order of the item count
85
+ summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
86
+ @queue << summary
87
+
88
+ # Step 4
89
+ summary.each do |t|
90
+ if @statistics.key? t[0]
91
+ @statistics[t[0]] += t[1]
92
+ else
93
+ @statistics[t[0]] = t[1]
94
+ end
95
+ end
96
+
97
+ # Step 5
98
+ @delta += summary[kth_index][1]
99
+
100
+ # Step 6
101
+ if should_pop_oldest_summary
102
+ # a
103
+ summary_p = @queue.shift
104
+ @delta -= summary_p[find_kth_largest(summary_p)][1]
105
+
106
+ # b
107
+ summary_p.each { |t| @statistics[t[0]] -= t[1] }
108
+ @statistics.delete_if { |k,v| v <= 0 }
109
+
110
+ # c
111
+ @statistics.select { |k,v| v > @delta }
112
+ else
113
+ {}
114
+ end
15
115
  end
16
116
 
117
+ # Returns the version for this gem.
118
+ #
119
+ # @return [String] the version for this gem.
17
120
  def version
18
121
  Frequent::VERSION
19
122
  end
20
123
 
124
+ private
125
+ # Return true when it is ready to pop oldest summary from queue
126
+ #
127
+ # @return [Boolean] whether it is ready to pop oldest summary from queue
128
+ def should_pop_oldest_summary
129
+ @queue.length > @n/@b
130
+ end
131
+
132
+ # Return the k-th index of a summary object
133
+ #
134
+ # @param [Hash, Array] summary a summary object
135
+ # @return [Integer] the k-th index
136
+ def find_kth_largest(summary)
137
+ [summary.length, @k].min - 1
138
+ end
21
139
  end
22
140
  end
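The clamped index computed by `find_kth_largest` matters when a basic window contains fewer than k distinct items. A minimal standalone sketch of the same arithmetic, using a hypothetical free function `kth_index` rather than the private method:

```ruby
# Hypothetical standalone version of the index arithmetic used by
# find_kth_largest: the k-th position, clamped to the summary length.
def kth_index(summary_length, k)
  [summary_length, k].min - 1
end

plenty = kth_index(10, 3)   # ten distinct items, k = 3
scarce = kth_index(2, 3)    # only two distinct items, k = 3

# Slicing a sorted summary with the clamped index never runs past the end.
sorted  = [['a', 5], ['b', 1]]
trimmed = sorted[0..kth_index(sorted.length, 3)]
```

With ten distinct items the index is 2, the third entry. With only two it falls back to 1, the last entry, so the slice keeps both pairs and the δ update reads a real count rather than indexing past the end of the summary.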
23
141
 
@@ -1,5 +1,14 @@
1
+ # coding: utf-8
2
+
3
+ # `Frequent` is the namespace for objects implementing
4
+ # the Demaine et al. FREQUENT algorithm for finding
5
+ # the most frequently-appearing items (top-k) in a
6
+ # data stream in sliding windows.
7
+ #
8
+ # `Frequent::Algorithm` is the implementation class.
1
9
  module Frequent
2
- VERSION = '0.0.1'
10
+ # Version string for this Rubygem.
11
+ VERSION = '0.0.2'
3
12
  end
4
13
 
5
14
  =begin
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: frequent-algorithm
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Willie Tong
@@ -9,22 +9,52 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2015-03-11 00:00:00.000000000 Z
13
- dependencies: []
12
+ date: 2015-03-19 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: rake
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - '>='
19
+ - !ruby/object:Gem::Version
20
+ version: '0'
21
+ type: :development
22
+ prerelease: false
23
+ version_requirements: !ruby/object:Gem::Requirement
24
+ requirements:
25
+ - - '>='
26
+ - !ruby/object:Gem::Version
27
+ version: '0'
28
+ - !ruby/object:Gem::Dependency
29
+ name: minitest
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - '>='
33
+ - !ruby/object:Gem::Version
34
+ version: '0'
35
+ type: :development
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - '>='
40
+ - !ruby/object:Gem::Version
41
+ version: '0'
14
42
  description: |
15
- frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows. Please refer to [Identifying Frequent Items in Sliding Windows over On-Line Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by Golab, DeHaan, Demaine, L&#243;pez-Ortiz and Munro (2003).
16
- email: buruzaemon@gmail.com
43
+ frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
44
+ email:
45
+ - tongsinyin@gmail.com
46
+ - buruzaemon@gmail.com
17
47
  executables: []
18
48
  extensions: []
19
49
  extra_rdoc_files: []
20
50
  files:
21
- - .yardopts
22
- - CHANGELOG
23
- - LICENSE
24
- - README.md
25
51
  - lib/frequent-algorithm.rb
26
52
  - lib/frequent/algorithm.rb
27
53
  - lib/frequent/version.rb
54
+ - README.md
55
+ - LICENSE
56
+ - CHANGELOG
57
+ - .yardopts
28
58
  homepage: https://github.com/buruzaemon/frequent-algorithm
29
59
  licenses:
30
60
  - MIT
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
45
75
  version: '0'
46
76
  requirements: []
47
77
  rubyforge_project:
48
- rubygems_version: 2.4.1
78
+ rubygems_version: 2.0.14
49
79
  signing_key:
50
80
  specification_version: 4
51
- summary: A Ruby implementation of the FREQUENT algorithm for identifying frequent
52
- items in a data stream in sliding windows.
81
+ summary: Identifies frequent items in a data stream in sliding windows using the Demaine
82
+ et al FREQUENT algorithm.
53
83
  test_files: []
54
- has_rdoc: