frequent-algorithm 0.0.1 → 0.0.2
- checksums.yaml +4 -4
- data/CHANGELOG +9 -0
- data/README.md +100 -22
- data/lib/frequent-algorithm.rb +1 -0
- data/lib/frequent/algorithm.rb +122 -4
- data/lib/frequent/version.rb +10 -1
- metadata +42 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f799561d8b1543e23483918e587dffcbd0522e78
|
4
|
+
data.tar.gz: 39a5a9ceee0dee33bc889111b2745907ffb49018
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: c68087e23dc0ff299f81797a1152c6e1fc6f1f10b7ec46ca39fff35cb8d91836be27eb34fe434c484fc1b28fde939c2f436d42b36f958ddca0afbe9f5107194b
|
7
|
+
data.tar.gz: 6f0da59941492900cc4da2f1266ed0e8398a852cbb55e064fd0ed290f8209a396da2e051c5293a45a21c7564828a099ea03b607dfba53608d824f834a1083bc1
|
data/CHANGELOG
CHANGED
data/README.md
CHANGED
@@ -1,4 +1,10 @@
|
|
1
|
-
# frequent-algorithm
|
1
|
+
# frequent-algorithm [](http://badge.fury.io/rb/frequent-algorithm) [](https://travis-ci.org/buruzaemon/frequent-algorithm)
|
2
|
+
|
3
|
+
Web site usage, social network behavior and Internet traffic are examples
|
4
|
+
of systems that appear to follow the [power law](http://en.wikipedia.org/wiki/Power_law),
|
5
|
+
where most of the events are due to the actions of a very small few.
|
6
|
+
Knowing at any given point in time which items are trending is valuable
|
7
|
+
in understanding the system.
|
2
8
|
|
3
9
|
`frequent-algorithm` is a Ruby implementation of the FREQUENT algorithm
|
4
10
|
for identifying frequent items in a data stream in sliding windows.
|
@@ -6,40 +12,93 @@ Please refer to [Identifying Frequent Items in Sliding Windows over On-Line
|
|
6
12
|
Packet Streams](http://erikdemaine.org/papers/SlidingWindow_IMC2003/), by
|
7
13
|
Golab, DeHaan, Demaine, López-Ortiz and Munro (2003).
|
8
14
|
|
9
|
-
##
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
15
|
+
## Introduction
|
16
|
+
|
17
|
+
### Challenges
|
18
|
+
|
19
|
+
Challenges for real-time processing of data streams for _frequent item queries_
|
20
|
+
include:
|
21
|
+
|
22
|
+
* data may be of unknown and possibly unbounded length
|
23
|
+
* data may be arriving at a very fast rate
|
24
|
+
* it might not be possible to go back and re-read the data
|
25
|
+
* too large a window of observation may include stale data
|
26
|
+
|
27
|
+
Therefore, a solution should have the following characteristics:
|
28
|
+
|
29
|
+
* uses limited memory
|
30
|
+
* can process each event in the stream in O(1) (constant) time
|
31
|
+
* requires only a single pass over the data
|
32
|
+
|
33
|
+
|
34
|
+
### The algorithm
|
35
|
+
|
36
|
+
> LOOP<br/>
|
37
|
+
> 1. For each element e in the next b elements:<br/>
|
38
|
+
> If a local counter exists for the type of element e:<br/>
|
39
|
+
> Increment the local counter.<br/>
|
40
|
+
> Otherwise:<br/>
|
41
|
+
> Create a new local counter for this element type<br/>
|
42
|
+
> and set it equal to 1.<br/>
|
43
|
+
> 2. Add a summary S containing identities and counts of the k most frequent items to the back of queue Q.<br/>
|
44
|
+
> 3. Delete all local counters<br/>
|
45
|
+
> 4. For each type named in S:<br/>
|
46
|
+
> If a global counter exists for this type:<br/>
|
47
|
+
> Add to it the count recorded in S.<br/>
|
48
|
+
> Otherwise:<br/>
|
49
|
+
> Create a new global counter for this element type<br/>
|
50
|
+
> and set it equal to the count recorded in S.<br/>
|
51
|
+
> 5. Add the count of the kth largest type in S to δ.<br/>
|
52
|
+
> 6. If sizeOf(Q) > N/b:<br/>
|
53
|
+
> (a) Remove the summary S' from the front of Q and subtract the count of the kth largest type in S' from δ.<br/>
|
54
|
+
> (b) For all element types named in S':<br/>
|
55
|
+
> Subtract from their global counters the counts<br/>
|
56
|
+
> recorded in S'<br/>
|
57
|
+
> If a counter is decremented to zero:<br/>
|
58
|
+
> Delete it.<br/>
|
59
|
+
> (c) Output the identity and value of each global counter > δ.
|
60
|
+
>
|
61
|
+
> — <cite>Golab, DeHaan, Demaine, López-Ortiz and Munro. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003</cite>
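The quoted steps can be sketched in a few lines of Ruby. This is a minimal illustration only, not the gem's implementation: the class name `SlidingTopK` and its accessors are hypothetical, windows are assumed to be non-empty, and ties between equal counts are broken arbitrarily.

```ruby
# Hypothetical sketch of steps 1-6 of the quoted pseudocode.
class SlidingTopK
  attr_reader :counters, :delta

  def initialize(n, b, k)
    @max_summaries = n / b   # N/b: basic windows per main window
    @k = k
    @queue = []              # Q: summaries of recent basic windows
    @counters = Hash.new(0)  # global counters
    @delta = 0               # running sum of k-th largest counts
  end

  # Consumes one non-empty basic window. Returns the global counters that
  # exceed delta once the queue has overflowed, and an empty Hash before that.
  def process(window)
    # Step 1: local counters for this basic window
    local = Hash.new(0)
    window.each { |e| local[e] += 1 }

    # Steps 2-3: keep only the k most frequent items; drop the local counters
    summary = local.sort_by { |_, count| -count }.first(@k)
    @queue << summary

    # Step 4: fold the summary into the global counters
    summary.each { |item, count| @counters[item] += count }

    # Step 5: add the count of the k-th largest item to delta
    @delta += summary.last[1]

    # Step 6: expire the oldest summary once more than N/b are queued
    return {} unless @queue.length > @max_summaries

    oldest = @queue.shift
    @delta -= oldest.last[1]
    oldest.each { |item, count| @counters[item] -= count }
    @counters.delete_if { |_, count| count <= 0 }
    @counters.select { |_, count| count > @delta }
  end
end

sketch = SlidingTopK.new(6, 3, 2)
first  = sketch.process(%w[a a b])  # queue not yet full
second = sketch.process(%w[a a b])  # queue not yet full
third  = sketch.process(%w[a a c])  # oldest summary expires; only 'a' exceeds delta
```

In this run the main window holds two basic windows, so output only appears on the third call, after the first summary has been subtracted again.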
|
16
62
|
|
17
63
|
|
18
64
|
## Usage
|
19
65
|
|
20
66
|
require 'frequent-algorithm'
|
21
67
|
|
68
|
+
# data is pi to 1000 digits
|
69
|
+
pi = File.read('test/frequent/test_data_pi').strip
|
70
|
+
data = pi.scan(/./).each_slice(20) # 20 = size of basic window (b, defined below)
|
71
|
+
|
72
|
+
N = 100 # size of main window
|
73
|
+
b = 20 # size of basic window
|
74
|
+
k = 3 # we are interested in top-3 numerals in pi
|
75
|
+
|
76
|
+
alg = Frequent::Algorithm.new(N, b, k)
|
77
|
+
|
78
|
+
# read in and process the 1st basic window
|
79
|
+
alg.process(data.next)
|
80
|
+
|
81
|
+
# and the top-3 numerals are?
|
82
|
+
top3 = alg.statistics # Hash of item => count
|
83
|
+
puts top3
|
84
|
+
|
85
|
+
# lather, rinse and repeat
|
86
|
+
alg.process(data.next)
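One detail worth illustrating: in this version `process` returns early when a window does not contain exactly `b` elements, so a trailing partial slice produced by `each_slice` would be skipped. The sketch below shows how the slicing behaves; the short digit string is a hypothetical stand-in for the test data file, and the variable names are illustrative.

```ruby
# Hypothetical stand-in for the digits read from the test data file.
digits = '31415926535897932384626'.chars  # 23 single-character strings
b = 5                                     # size of a basic window

# each_slice yields consecutive groups of b elements; the last may be shorter.
windows = digits.each_slice(b).to_a

# Keep only complete basic windows, which is what process would accept.
full_windows = windows.select { |w| w.length == b }

puts "slices: #{windows.length}, complete: #{full_windows.length}"
```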
|
87
|
+
|
88
|
+
|
22
89
|
## Development
|
23
90
|
|
24
91
|
The development of this gem requires the following:
|
25
92
|
|
26
|
-
* [Ruby
|
93
|
+
* [Ruby 1.9.3 or greater](https://www.ruby-lang.org/en/)
|
27
94
|
* [rubygems](https://rubygems.org/pages/download)
|
28
95
|
* [`bundler`](https://github.com/bundler/bundler)
|
29
96
|
* [`rake`](https://github.com/ruby/rake)
|
97
|
+
* [`minitest`](https://rubygems.org/gems/minitest) (unit testing)
|
30
98
|
* [`yard`](https://rubygems.org/gems/yard) (documentation)
|
31
99
|
* [`rdiscount`](https://rubygems.org/gems/rdiscount) (Markdown)
|
32
100
|
|
33
|
-
|
34
|
-
|
35
|
-
`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
|
36
|
-
[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
|
37
|
-
Check out [Getting Started with
|
38
|
-
Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
|
39
|
-
|
40
|
-
### Build
|
41
|
-
|
42
|
-
Development, testing and release of this rubygem uses the following
|
101
|
+
Building, testing and release of this rubygem uses the following
|
43
102
|
`rake` commands:
|
44
103
|
|
45
104
|
|
@@ -47,22 +106,41 @@ Development, testing and release of this rubygem uses the following
|
|
47
106
|
rake clean # Remove any temporary products
|
48
107
|
rake clobber # Remove any generated file
|
49
108
|
rake install # Build and install frequent-algorithm-n.n.n.gem into system gems
|
50
|
-
rake release # Create tag vn.n.n and build and push
|
109
|
+
rake release # Create tag vn.n.n and build and push
|
110
|
+
# frequent-algorithm-n.n.n.gem to Rubygems
|
51
111
|
rake test # Execute unit tests
|
52
112
|
|
53
113
|
|
114
|
+
### Documentation
|
115
|
+
|
116
|
+
`frequent-algorithm` uses [`yard`](https://rubygems.org/gems/yard) and
|
117
|
+
[`rdiscount`](https://rubygems.org/gems/rdiscount) for Markdown documentation.
|
118
|
+
Check out [Getting Started with
|
119
|
+
Yard](http://www.rubydoc.info/gems/yard/file/docs/GettingStarted.md).
|
120
|
+
|
54
121
|
### Unit Testing
|
55
122
|
|
56
123
|
`frequent-algorithm` uses
|
57
124
|
[`MiniTest::Unit`](https://github.com/seattlerb/minitest) for
|
58
125
|
unit testing.
|
59
126
|
|
60
|
-
|
61
|
-
### Release
|
127
|
+
### Releasing
|
62
128
|
|
63
129
|
Please refer to Publishing To Rubygems.org in the
|
64
130
|
[Rubygems Guide](http://guides.rubygems.org/make-your-own-gem/).
|
65
131
|
|
132
|
+
### Contributing
|
133
|
+
|
134
|
+
1. Fork it
|
135
|
+
2. Begin work on `dev-branch` (`git fetch && git checkout dev-branch`)
|
136
|
+
3. Create your feature branch (`git branch my-new-feature && git checkout
|
137
|
+
my-new-feature`)
|
138
|
+
4. Commit your changes (`git commit -am 'Add some feature'`)
|
139
|
+
5. Push to the branch (`git push origin my-new-feature:dev-branch`)
|
140
|
+
6. Create new Pull Request
|
141
|
+
|
142
|
+
You may wish to read the [Git book online](http://git-scm.com/book/en/v2).
|
143
|
+
|
66
144
|
|
67
145
|
## License
|
68
146
|
|
data/lib/frequent-algorithm.rb
CHANGED
data/lib/frequent/algorithm.rb
CHANGED
@@ -1,23 +1,141 @@
|
|
1
|
+
# coding: utf-8
|
1
2
|
require 'frequent/version'
|
2
3
|
|
3
4
|
module Frequent
|
4
5
|
|
6
|
+
# `Frequent::Algorithm` is the Ruby implementation of the
|
7
|
+
# Demaine et al. FREQUENT algorithm for calculating
|
8
|
+
# top-k items in a stream.
|
9
|
+
#
|
10
|
+
# The aims of this algorithm are:
|
11
|
+
# * use limited memory
|
12
|
+
# * require constant processing time per item
|
13
|
+
# * require a single-pass only
|
14
|
+
#
|
5
15
|
class Algorithm
|
6
|
-
|
16
|
+
# @return [Integer] the number of items in the main window
|
17
|
+
attr_reader :n
|
18
|
+
# @return [Integer] the number of items in a basic window
|
19
|
+
attr_reader :b
|
20
|
+
# @return [Integer] the number of top item categories to track
|
21
|
+
attr_reader :k
|
22
|
+
# @return [Array<Hash<Object,Integer>>] global queue for basic window summaries
|
23
|
+
attr_reader :queue
|
24
|
+
# @return [Hash<Object,Integer>] global mapping of items and counts
|
25
|
+
attr_reader :statistics
|
26
|
+
# @return [Integer] minimum threshold for membership in top-k items
|
27
|
+
attr_reader :delta
|
7
28
|
|
8
|
-
|
29
|
+
# Initializes this top-k frequency-calculating instance.
|
30
|
+
#
|
31
|
+
# @param [Integer] n number of items in the main window
|
32
|
+
# @param [Integer] b number of items in a basic window
|
33
|
+
# @param [Integer] k number of top item categories to track
|
34
|
+
# @raise [ArgumentError] if n is not greater than 0
|
35
|
+
# @raise [ArgumentError] if b is not greater than 0
|
36
|
+
# @raise [ArgumentError] if k is not greater than 0
|
37
|
+
# @raise [ArgumentError] if n/b is less than 1
|
38
|
+
def initialize(n, b, k=1)
|
39
|
+
if n <= 0
|
40
|
+
raise ArgumentError.new('n must be greater than 0')
|
41
|
+
end
|
42
|
+
if b <= 0
|
43
|
+
raise ArgumentError.new('b must be greater than 0')
|
44
|
+
end
|
45
|
+
if k <= 0
|
46
|
+
raise ArgumentError.new('k must be greater than 0')
|
47
|
+
end
|
48
|
+
if n/b < 1
|
49
|
+
raise ArgumentError.new('n/b must be at least 1')
|
50
|
+
end
|
9
51
|
@n = n
|
10
52
|
@b = b
|
53
|
+
@k = k
|
54
|
+
|
55
|
+
@queue = []
|
56
|
+
@statistics = {}
|
57
|
+
@delta = 0
|
11
58
|
end
|
12
59
|
|
13
|
-
|
14
|
-
|
60
|
+
# Processes a single basic window of b items, by first adding
|
61
|
+
# a summary of this basic window in the internal global queue;
|
62
|
+
# and then updating the global statistics accordingly.
|
63
|
+
#
|
64
|
+
# @param [Array] elements an array of objects representing a basic window
|
65
|
+
def process(elements)
|
66
|
+
# Do we need this?
|
67
|
+
return if elements.length != @b
|
68
|
+
|
69
|
+
# Step 1
|
70
|
+
summary = {}
|
71
|
+
elements.each do |e|
|
72
|
+
if summary.key? e
|
73
|
+
summary[e] += 1
|
74
|
+
else
|
75
|
+
summary[e] = 1
|
76
|
+
end
|
77
|
+
end
|
78
|
+
|
79
|
+
# index of the k-th item
|
80
|
+
kth_index = find_kth_largest(summary)
|
81
|
+
|
82
|
+
# Step 2 & 3
|
83
|
+
# summary is [[item,count],[item,count],[item,count]....]
|
84
|
+
# sorted by descending order of the item count
|
85
|
+
summary = summary.sort { |a,b| b[1]<=>a[1] }[0..kth_index]
|
86
|
+
@queue << summary
|
87
|
+
|
88
|
+
# Step 4
|
89
|
+
summary.each do |t|
|
90
|
+
if @statistics.key? t[0]
|
91
|
+
@statistics[t[0]] += t[1]
|
92
|
+
else
|
93
|
+
@statistics[t[0]] = t[1]
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
# Step 5
|
98
|
+
@delta += summary[kth_index][1]
|
99
|
+
|
100
|
+
# Step 6
|
101
|
+
if should_pop_oldest_summary
|
102
|
+
# a
|
103
|
+
summary_p = @queue.shift
|
104
|
+
@delta -= summary_p[find_kth_largest(summary_p)][1]
|
105
|
+
|
106
|
+
# b
|
107
|
+
summary_p.each { |t| @statistics[t[0]] -= t[1] }
|
108
|
+
@statistics.delete_if { |k,v| v <= 0 }
|
109
|
+
|
110
|
+
#c
|
111
|
+
@statistics.select { |k,v| v > @delta }
|
112
|
+
else
|
113
|
+
{}
|
114
|
+
end
|
15
115
|
end
|
16
116
|
|
117
|
+
# Returns the version for this gem.
|
118
|
+
#
|
119
|
+
# @return [String] the version for this gem.
|
17
120
|
def version
|
18
121
|
Frequent::VERSION
|
19
122
|
end
|
20
123
|
|
124
|
+
private
|
125
|
+
# Return true when it is ready to pop oldest summary from queue
|
126
|
+
#
|
127
|
+
# @return [Boolean] whether it is ready to pop oldest summary from queue
|
128
|
+
def should_pop_oldest_summary
|
129
|
+
@queue.length > @n/@b
|
130
|
+
end
|
131
|
+
|
132
|
+
# Return the k-th index of a summary object
|
133
|
+
#
|
134
|
+
# @param [Object] summary a summary object
|
135
|
+
# @return [Integer] the k-th index
|
136
|
+
def find_kth_largest(summary)
|
137
|
+
[summary.length, @k].min - 1
|
138
|
+
end
|
21
139
|
end
|
22
140
|
end
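The index arithmetic in `find_kth_largest` clamps to the summary length, which matters when a basic window contains fewer than `k` distinct items. A small standalone sketch of that behaviour follows; the method name `kth_index` and the sample counts are hypothetical and not part of the gem.

```ruby
# Hypothetical standalone version of the index computation shown above.
def kth_index(summary, k)
  [summary.length, k].min - 1
end

counts = { 'a' => 5, 'b' => 3, 'c' => 1 }

# Pairs of [item, count], sorted by descending count.
sorted = counts.sort_by { |_, count| -count }

top2 = sorted[0..kth_index(sorted, 2)]  # the two most frequent pairs
top5 = sorted[0..kth_index(sorted, 5)]  # only three items exist, so all three are kept
```

Because the index is clamped, asking for more items than exist returns the whole summary rather than raising an error.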
|
23
141
|
|
data/lib/frequent/version.rb
CHANGED
@@ -1,5 +1,14 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
|
3
|
+
# `Frequent` is the namespace for objects implementing
|
4
|
+
# the Demaine et al. FREQUENT algorithm for finding
|
5
|
+
# the most frequently-appearing items (top-k) in a
|
6
|
+
# data stream in sliding windows.
|
7
|
+
#
|
8
|
+
# `Frequent::Algorithm` is the implementation class.
|
1
9
|
module Frequent
|
2
|
-
|
10
|
+
# Version string for this Rubygem.
|
11
|
+
VERSION = '0.0.2'
|
3
12
|
end
|
4
13
|
|
5
14
|
=begin
|
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: frequent-algorithm
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Willie Tong
|
@@ -9,22 +9,52 @@ authors:
|
|
9
9
|
autorequire:
|
10
10
|
bindir: bin
|
11
11
|
cert_chain: []
|
12
|
-
date: 2015-03-
|
13
|
-
dependencies:
|
12
|
+
date: 2015-03-19 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: rake
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
requirements:
|
18
|
+
- - '>='
|
19
|
+
- !ruby/object:Gem::Version
|
20
|
+
version: '0'
|
21
|
+
type: :development
|
22
|
+
prerelease: false
|
23
|
+
version_requirements: !ruby/object:Gem::Requirement
|
24
|
+
requirements:
|
25
|
+
- - '>='
|
26
|
+
- !ruby/object:Gem::Version
|
27
|
+
version: '0'
|
28
|
+
- !ruby/object:Gem::Dependency
|
29
|
+
name: minitest
|
30
|
+
requirement: !ruby/object:Gem::Requirement
|
31
|
+
requirements:
|
32
|
+
- - '>='
|
33
|
+
- !ruby/object:Gem::Version
|
34
|
+
version: '0'
|
35
|
+
type: :development
|
36
|
+
prerelease: false
|
37
|
+
version_requirements: !ruby/object:Gem::Requirement
|
38
|
+
requirements:
|
39
|
+
- - '>='
|
40
|
+
- !ruby/object:Gem::Version
|
41
|
+
version: '0'
|
14
42
|
description: |
|
15
|
-
frequent-algorithm is a Ruby implementation of the FREQUENT algorithm for identifying frequent items in a data stream in sliding windows.
|
16
|
-
email:
|
43
|
+
frequent-algorithm is a Ruby implementation of the Demaine et al. FREQUENT algorithm for identifying frequent items in a data stream in sliding windows (cf. Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, 2003).
|
44
|
+
email:
|
45
|
+
- tongsinyin@gmail.com
|
46
|
+
- buruzaemon@gmail.com
|
17
47
|
executables: []
|
18
48
|
extensions: []
|
19
49
|
extra_rdoc_files: []
|
20
50
|
files:
|
21
|
-
- .yardopts
|
22
|
-
- CHANGELOG
|
23
|
-
- LICENSE
|
24
|
-
- README.md
|
25
51
|
- lib/frequent-algorithm.rb
|
26
52
|
- lib/frequent/algorithm.rb
|
27
53
|
- lib/frequent/version.rb
|
54
|
+
- README.md
|
55
|
+
- LICENSE
|
56
|
+
- CHANGELOG
|
57
|
+
- .yardopts
|
28
58
|
homepage: https://github.com/buruzaemon/frequent-algorithm
|
29
59
|
licenses:
|
30
60
|
- MIT
|
@@ -45,10 +75,9 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
45
75
|
version: '0'
|
46
76
|
requirements: []
|
47
77
|
rubyforge_project:
|
48
|
-
rubygems_version: 2.
|
78
|
+
rubygems_version: 2.0.14
|
49
79
|
signing_key:
|
50
80
|
specification_version: 4
|
51
|
-
summary:
|
52
|
-
|
81
|
+
summary: Identifies frequent items in a data stream in sliding windows using the Demaine
|
82
|
+
et al FREQUENT algorithm.
|
53
83
|
test_files: []
|
54
|
-
has_rdoc:
|