frequent-algorithm 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/CHANGELOG +9 -0
- data/README.md +100 -22
- data/lib/frequent-algorithm.rb +1 -0
- data/lib/frequent/algorithm.rb +122 -4
- data/lib/frequent/version.rb +10 -1
- metadata +42 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
|
4
|
+
data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
|
7
|
+
data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
|
data/CHANGELOG
CHANGED
data/README.md
CHANGED
@@ -1,4 +1,10 @@
|
|
1
|
-
# frequent-algorithm
|
1
|
+
# frequent-algorithm [![Gem Version](https://badge.fury.io/rb/frequent-algorithm.svg)](http://badge.fury.io/rb/frequent-algorithm) [![Build Status](https://travis-ci.org/buruzaemon/frequent-algorithm.svg)](https://travis-ci.org/buruzaemon/frequent-algorithm)
|
2
|
+
|
3
|
+
Web site usage, social network behavior and Internet traffic are examples
|
4
|
+
of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
|
5
|
+
where most of the events are due to the actions of a very small few.
|
6
|
+
Knowing at any given point in time which items are trending is valuable
|
7
|
+
in understanding the system.
|
2
8
|
|
3
9
|
`frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
|
4
10
|
for identifying frequent items in a data stream in sliding windows.
|
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
|
|
6
12
|
Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
|
7
13
|
Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
|
8
14
|
|
9
|
-
##
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
15
|
+
## Introduction
|
16
|
+
|
17
|
+
### Challenges
|
18
|
+
|
19
|
+
Challenges for real-time processing of data streams for _frequent item queries_
|
20
|
+
include:
|
21
|
+
|
22
|
+
* data may be of unknown and possibly unbounded length
|
23
|
+
* data may be arriving at a very fast rate
|
24
|
+
* it might not be possible to go back and re-read the data
|
25
|
+
* too large a window of observation may include stale data
|
26
|
+
|
27
|
+
Therefore, a solution should have the following characteristics:
|
28
|
+
|
29
|
+
* uses limited memory
|
30
|
+
* can process each event in the stream in O(1) (constant) time
|
31
|
+
* requires only a single pass over the data
|
32
|
+
|
33
|
+
|
34
|
+
### The algorithm
|
35
|
+
|
36
|
+
> LOOP<br/>
|
37
|
+
> 1. For each element e in the next b elements:<br/>
|
38
|
+
> If a local counter exists for the type of element e:<br/>
|
39
|
+
> Increment the local counter.<br/>
|
40
|
+
> Otherwise:<br/>
|
41
|
+
> Create a new local counter for this element type<br/>
|
42
|
+
> and set it equal to 1.<br/>
|
43
|
+
> 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
|
44
|
+
> 3. Delete all local counters<br/>
|
45
|
+
> 4. For each type named in S:<br/>
|
46
|
+
> If a global counter exists for this type:<br/>
|
47
|
+
> Add to it the count recorded in S.<br/>
|
48
|
+
> Otherwise:<br/>
|
49
|
+
> Create a new global counter for this element type<br/>
|
50
|
+
> and set it equal to the count recorded in S.<br/>
|
51
|
+
> 5. Add the count of the kth largest type in S to δ.<br/>
|
52
|
+
> 6. If sizeOf(Q) > N/b:<br/>
|
53
|
+
> (a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
|
54
|
+
> (b) For all element types named in S':<br/>
|
55
|
+
> Subtract from their global counters the counts<br/>
|
56
|
+
> recorded in S'<br/>
|
57
|
+
> If a counter is decremented to zero:<br/>
|
58
|
+
> Delete it.<br/>
|
59
|
+
> (c) Output the identity and value of each global counter > δ.
|
60
|
+
>
|
61
|
+
> — <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
|
16
62
|
|
17
63
|
|
18
64
|
## Usage
|
19
65
|
|
20
66
|
require 'frequent-algorithm'
|
21
67
|
|
68
|
+
# data is pi to 1000 digits
|
69
|
+
pi = File.read('test/frequent/test_data_pi').strip
|
70
|
+
N = 100 # size of main window
|
71
|
+
b = 20 # size of basic window
|
72
|
+
k = 3 # we are interested in top-3 numerals in pi
|
73
|
+
|
74
|
+
data = pi.scan(/./).each_slice(b)
|
75
|
+
|
76
|
+
alg = Frequent::Algorithm.new(N, b, k)
|
77
|
+
|
78
|
+
# read in and process the 1st basic window
|
79
|
+
alg.process(data.next)
|
80
|
+
|
81
|
+
# and the top-3 numerals are?
|
82
|
+
top3 = alg.statistics.report
|
83
|
+
puts top3
|
84
|
+
|
85
|
+
# lather, rinse and repeat
|
86
|
+
alg.process(data.next)
|
87
|
+
|
88
|
+
|
22
89
|
## Development
|
23
90
|
|
24
91
|
The development of this gem requires the following:
|
25
92
|
|
26
|
-
* [Ruby
|
93
|
+
* [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
|
27
94
|
* [rubygems](https://rubygems.org/pages/download)
|
28
95
|
* [`bundler`](https://github.com/bundler/bundler)
|
29
96
|
* [`rake`](https://github.com/ruby/rake)
|
97
|
+
* [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
|
30
98
|
* [`yard`](https://rubygems.org/gems/yard) (documentation)
|
31
99
|
* [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
|
32
100
|
|
33
|
-
|
34
|
-
|
35
|
-
`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
|
36
|
-
[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
|
37
|
-
Check out [Getting Started with
|
38
|
-
Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
|
39
|
-
|
40
|
-
### Build
|
41
|
-
|
42
|
-
Development, testing and release of this rubygem uses the following
|
101
|
+
Building, testing and releasing this rubygem use the following
|
43
102
|
`rake` commands:
|
44
103
|
|
45
104
|
|
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
|
|
47
106
|
rake clean # Remove any temporary products
|
48
107
|
rake clobber # Remove any generated file
|
49
108
|
rake install # Build and install frequent-algorithm-n.n.n.gem into system gems
|
50
|
-
rake release # Create tag vn.n.n and build and push
|
109
|
+
rake release # Create tag vn.n.n and build and push
|
110
|
+
# frequent-algorithm-n.n.n.gem to Rubygems
|
51
111
|
rake test # Execute unit tests
|
52
112
|
|
53
113
|
|
114
|
+
### Documentation
|
115
|
+
|
116
|
+
`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
|
117
|
+
[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
|
118
|
+
Check out [Getting Started with
|
119
|
+
Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
|
120
|
+
|
54
121
|
### Unit Testing
|
55
122
|
|
56
123
|
`frequent-algorithm` uses
|
57
124
|
[`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
|
58
125
|
unit testing.
|
59
126
|
|
60
|
-
|
61
|
-
### Release
|
127
|
+
### Releasing
|
62
128
|
|
63
129
|
Please refer to Publishing To Rubygems.org in the
|
64
130
|
[Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
|
65
131
|
|
132
|
+
### Contributing
|
133
|
+
|
134
|
+
1. Fork it
|
135
|
+
2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
|
136
|
+
3. Create your feature branch (`git branch my-new-feature && git checkout
|
137
|
+
my-new-feature`)
|
138
|
+
4. Commit your changes (`git commit -am 'Add some feature'`)
|
139
|
+
5. Push to the branch (`git push origin my-new-feature:dev-branch`)
|
140
|
+
6. Create new Pull Request
|
141
|
+
|
142
|
+
You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
|
143
|
+
|
66
144
|
|
67
145
|
## License
|
68
146
|
|
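The README usage above slices the input into basic windows with `each_slice(b)` and pulls them one at a time with `next`. The following is a minimal, self-contained sketch of that slicing step plus the per-window counting from step 1 of the quoted algorithm. It does not use the gem: the short `digits` literal, the window size `b = 5` and `k = 2` are assumptions chosen for illustration. Note that `b` has to be assigned before it is passed to `each_slice`.

```ruby
# Sketch only: a short literal stands in for the gem's test data file.
digits = '31415926535897932384' # first 20 digits of pi, decimal point omitted

b = 5 # basic window size; must be assigned before each_slice(b) is called
k = 2 # how many of the most frequent items to keep per window

# each_slice without a block returns an Enumerator, so windows can be
# pulled one at a time with #next.
windows = digits.scan(/./).each_slice(b)
first_window = windows.next

# Count occurrences in the window, then keep the k largest counts.
# Ties between equal counts may come out in any order.
counts = Hash.new(0)
first_window.each { |digit| counts[digit] += 1 }
top_k = counts.sort_by { |_, count| -count }.first(k)

puts first_window.inspect
puts top_k.inspect
```

`Hash.new(0)` removes the need for an explicit `key?` check when incrementing, which keeps the counting step to a single line.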
data/lib/frequent-algorithm.rb
CHANGED
data/lib/frequent/algorithm.rb
CHANGED
@@ -1,23 +1,141 @@
|
|
1
|
+
# coding: utf-8
|
1
2
|
require 'frequent/version'
|
2
3
|
|
3
4
|
module Frequent
|
4
5
|
|
6
|
+
# `Frequent::Algorithm` is the Ruby implementation of the
|
7
|
+
# Demaine et al. FREQUENT algorithm for calculating
|
8
|
+
# top-k items in a stream.
|
9
|
+
#
|
10
|
+
# The aims of this algorithm are:
|
11
|
+
# * use limited memory
|
12
|
+
# * require constant processing time per item
|
13
|
+
# * require a single-pass only
|
14
|
+
#
|
5
15
|
class Algorithm
|
6
|
-
|
16
|
+
# @return [Integer] the number of items in the main window
|
17
|
+
attr_reader :n
|
18
|
+
# @return [Integer] the number of items in a basic window
|
19
|
+
attr_reader :b
|
20
|
+
# @return [Integer] the number of top item categories to track
|
21
|
+
attr_reader :k
|
22
|
+
# @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
|
23
|
+
attr_reader :queue
|
24
|
+
# @return [Hash<Object,Integer>] global mapping of items and counts
|
25
|
+
attr_reader :statistics
|
26
|
+
# @return [Integer] minimum threshold for membership in top-k items
|
27
|
+
attr_reader :delta
|
7
28
|
|
8
|
-
|
29
|
+
# Initializes this top-k frequency-calculating instance.
|
30
|
+
#
|
31
|
+
# @param [Integer] n number of items in the main window
|
32
|
+
# @param [Integer] b number of items in a basic window
|
33
|
+
# @param [Integer] k number of top item categories to track
|
34
|
+
# @raise [ArgumentError] if n is not greater than 0
|
35
|
+
# @raise [ArgumentError] if b is not greater than 0
|
36
|
+
# @raise [ArgumentError] if k is not greater than 0
|
37
|
+
# @raise [ArgumentError] if n/b is less than 1
|
38
|
+
def initialize(n, b, k=1)
|
39
|
+
if n <= 0
|
40
|
+
raise ArgumentError.new('n must be greater than 0')
|
41
|
+
end
|
42
|
+
if b <= 0
|
43
|
+
raise ArgumentError.new('b must be greater than 0')
|
44
|
+
end
|
45
|
+
if k <= 0
|
46
|
+
raise ArgumentError.new('k must be greater than 0')
|
47
|
+
end
|
48
|
+
if n/b < 1
|
49
|
+
raise ArgumentError.new('n/b must be at least 1')
|
50
|
+
end
|
9
51
|
@n = n
|
10
52
|
@b = b
|
53
|
+
@k = k
|
54
|
+
|
55
|
+
@queue = []
|
56
|
+
@statistics = {}
|
57
|
+
@delta = 0
|
11
58
|
end
|
12
59
|
|
13
|
-
|
14
|
-
|
60
|
+
# Processes a single basic window of b items, by first adding
|
61
|
+
# a summary of this basic window in the internal global queue;
|
62
|
+
# and then updating the global statistics accordingly.
|
63
|
+
#
|
64
|
+
# @param [Array] elements an array of objects representing a basic window
|
65
|
+
def process(elements)
|
66
|
+
# Do we need this?
|
67
|
+
return if elements.length != @b
|
68
|
+
|
69
|
+
# Step 1
|
70
|
+
summary = {}
|
71
|
+
elements.each do |e|
|
72
|
+
if summary.key? e
|
73
|
+
summary[e] += 1
|
74
|
+
else
|
75
|
+
summary[e] = 1
|
76
|
+
end
|
77
|
+
end
|
78
|
+
|
79
|
+
# index of the k-th item
|
80
|
+
kth_index = find_kth_largest(summary)
|
81
|
+
|
82
|
+
# Step 2 & 3
|
83
|
+
# summary is [[item,count],[item,count],[item,count]....]
|
84
|
+
# sorted by descending order of the item count
|
85
|
+
summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
|
86
|
+
@queue << summary
|
87
|
+
|
88
|
+
# Step 4
|
89
|
+
summary.each do |t|
|
90
|
+
if @statistics.key? t[0]
|
91
|
+
@statistics[t[0]] += t[1]
|
92
|
+
else
|
93
|
+
@statistics[t[0]] = t[1]
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
# Step 5
|
98
|
+
@delta += summary[kth_index][1]
|
99
|
+
|
100
|
+
# Step 6
|
101
|
+
if should_pop_oldest_summary
|
102
|
+
# a
|
103
|
+
summary_p = @queue.shift
|
104
|
+
@delta -= summary_p[find_kth_largest(summary_p)][1]
|
105
|
+
|
106
|
+
# b
|
107
|
+
summary_p.each { |t| @statistics[t[0]] -= t[1] }
|
108
|
+
@statistics.delete_if { |k,v| v <= 0 }
|
109
|
+
|
110
|
+
#c
|
111
|
+
@statistics.select { |k,v| v > @delta }
|
112
|
+
else
|
113
|
+
{}
|
114
|
+
end
|
15
115
|
end
|
16
116
|
|
117
|
+
# Returns the version for this gem.
|
118
|
+
#
|
119
|
+
# @return [String] the version for this gem.
|
17
120
|
def version
|
18
121
|
Frequent::VERSION
|
19
122
|
end
|
20
123
|
|
124
|
+
private
|
125
|
+
# Return true when it is ready to pop oldest summary from queue
|
126
|
+
#
|
127
|
+
# @return [Boolean] whether it is ready to pop oldest summary from queue
|
128
|
+
def should_pop_oldest_summary
|
129
|
+
@queue.length > @n/@b
|
130
|
+
end
|
131
|
+
|
132
|
+
# Return the k-th index of a summary object
|
133
|
+
#
|
134
|
+
# @param [Object] summary a summary object
|
135
|
+
# @return [Integer] the k-th index
|
136
|
+
def find_kth_largest(summary)
|
137
|
+
[summary.length, @k].min - 1
|
138
|
+
end
|
21
139
|
end
|
22
140
|
end
|
23
141
|
|
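To make the bookkeeping in steps 1 to 6 of the quoted algorithm easier to follow in isolation, here is a compact, self-contained sketch of the same loop. It is not the gem's `Frequent::Algorithm`: the class name `SlidingTopKSketch` and its internals are hypothetical, and it assumes that `n` is a multiple of `b` and that ties between equal counts may be broken arbitrarily.

```ruby
# Hypothetical sketch of the FREQUENT loop; not the gem's implementation.
class SlidingTopKSketch
  attr_reader :delta, :counters

  def initialize(n, b, k)
    raise ArgumentError, 'n, b and k must be positive' unless [n, b, k].all? { |v| v > 0 }
    raise ArgumentError, 'n must be at least b' if n < b

    @max_summaries = n / b  # basic windows that make up the main window
    @k = k
    @queue = []             # summaries, oldest first
    @counters = Hash.new(0) # global counters
    @delta = 0
  end

  # Processes one basic window. Returns the global counters above delta
  # once the queue has overflowed, and an empty hash before that.
  def process(window)
    return {} if window.empty?

    # Step 1: local counters for this basic window.
    local = Hash.new(0)
    window.each { |e| local[e] += 1 }

    # Steps 2-3: the summary keeps only the k most frequent items.
    summary = local.sort_by { |_, count| -count }.first(@k)
    @queue << summary

    # Step 4: fold the summary into the global counters.
    summary.each { |item, count| @counters[item] += count }

    # Step 5: add the count of the k-th largest item to delta.
    @delta += summary.last[1]

    # Step 6: runs only when the queue holds more than n/b summaries.
    return {} unless @queue.length > @max_summaries

    oldest = @queue.shift                                  # (a)
    @delta -= oldest.last[1]
    oldest.each { |item, count| @counters[item] -= count } # (b)
    @counters.delete_if { |_, count| count <= 0 }
    @counters.select { |_, count| count > @delta }         # (c)
  end
end

sketch = SlidingTopKSketch.new(8, 4, 2)
[%w[a a a b], %w[a a a c], %w[a a b b]].each do |window|
  p sketch.process(window)
end
```

With `k = 1` an item's global count can never exceed `delta`, because `delta` is then the sum of the largest count in every summary; the usage above therefore uses `k = 2`.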
data/lib/frequent/version.rb
CHANGED
@@ -1,5 +1,14 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
|
3
|
+
# `Frequent` is the namespace for objects implementing
|
4
|
+
# the Demaine et al. FREQUENT algorithm for finding
|
5
|
+
# the most frequently-appearing items (top-k) in a
|
6
|
+
# data stream in sliding windows.
|
7
|
+
#
|
8
|
+
# `Frequent::Algorithm` is the implementation class.
|
1
9
|
module Frequent
|
2
|
-
|
10
|
+
# Version string for this Rubygem.
|
11
|
+
VERSION = '0.0.2'
|
3
12
|
end
|
4
13
|
|
5
14
|
=begin
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: frequent-algorithm
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Willie Tong
|
@@ -9,22 +9,52 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2015-03-
|
13
|
-
dependencies:
|
12
|
+
date: 2015-03-19 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: rake
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
requirements:
|
18
|
+
- - '>='
|
19
|
+
- !ruby/object:Gem::Version
|
20
|
+
version: '0'
|
21
|
+
type: :development
|
22
|
+
prerelease: false
|
23
|
+
version_requirements: !ruby/object:Gem::Requirement
|
24
|
+
requirements:
|
25
|
+
- - '>='
|
26
|
+
- !ruby/object:Gem::Version
|
27
|
+
version: '0'
|
28
|
+
- !ruby/object:Gem::Dependency
|
29
|
+
name: minitest
|
30
|
+
requirement: !ruby/object:Gem::Requirement
|
31
|
+
requirements:
|
32
|
+
- - '>='
|
33
|
+
- !ruby/object:Gem::Version
|
34
|
+
version: '0'
|
35
|
+
type: :development
|
36
|
+
prerelease: false
|
37
|
+
version_requirements: !ruby/object:Gem::Requirement
|
38
|
+
requirements:
|
39
|
+
- - '>='
|
40
|
+
- !ruby/object:Gem::Version
|
41
|
+
version: '0'
|
14
42
|
description: |
|
15
|
-
frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows.
|
16
|
-
email:
|
43
|
+
frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
|
44
|
+
email:
|
45
|
+
- tongsinyin@gmail.com
|
46
|
+
- buruzaemon@gmail.com
|
17
47
|
executables: []
|
18
48
|
extensions: []
|
19
49
|
extra_rdoc_files: []
|
20
50
|
files:
|
21
|
-
- .yardopts
|
22
|
-
- CHANGELOG
|
23
|
-
- LICENSE
|
24
|
-
- README.md
|
25
51
|
- lib/frequent-algorithm.rb
|
26
52
|
- lib/frequent/algorithm.rb
|
27
53
|
- lib/frequent/version.rb
|
54
|
+
- README.md
|
55
|
+
- LICENSE
|
56
|
+
- CHANGELOG
|
57
|
+
- .yardopts
|
28
58
|
homepage: https://github.com/buruzaemon/frequent-algorithm
|
29
59
|
licenses:
|
30
60
|
- MIT
|
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
45
75
|
version: '0'
|
46
76
|
requirements: []
|
47
77
|
rubyforge_project:
|
48
|
-
rubygems_version: 2.
|
78
|
+
rubygems_version: 2.0.14
|
49
79
|
signing_key:
|
50
80
|
specification_version: 4
|
51
|
-
summary:
|
52
|
-
|
81
|
+
summary: Identifies frequent items in a data stream in sliding windows using the Demaine
|
82
|
+
et al FREQUENT algorithm.
|
53
83
|
test_files: []
|
54
|
-
has_rdoc:
|