pacer-bloomfilter 1.0.0

@@ -0,0 +1,5 @@
1
+ lib/**/*.rb
2
+ bin/*
3
+ -
4
+ features/**/*.feature
5
+ LICENSE.txt
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --color
data/Gemfile ADDED
@@ -0,0 +1,9 @@
1
+ source "http://rubygems.org"
2
+ gem 'pacer', '>= 0.6.1'
3
+
4
+ group :development do
5
+ gem "bundler", "~> 1.0.0"
6
+ gem "jeweler", "~> 1.5.2"
7
+ gem "rspec", "~> 2.3.0"
8
+ gem "rcov", ">= 0"
9
+ end
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2011 Darrick Wiebe
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,121 @@
1
+ # Pacer BloomFilter plugin (pacer-bloomfilter)
2
+
3
+ This plugin adds set filtering using [bloom filters](http://en.wikipedia.org/wiki/Bloom_filter) to the [Pacer](https://github.com/pangloss/pacer) graph and
4
+ streaming data processing library.
5
+
6
+ This plugin is also meant to serve as an example of how easy it is to
7
+ build a plugin for Pacer.
8
+
9
+ ## Usage
10
+
11
+ The bloomfilter method is added to the core route object in Pacer,
12
+ which means that it will be available to all routes. The method takes 2
13
+ arguments and a block:
14
+
15
+ - false_positive_probability: between 0 and 1 with a lower number
16
+ indicating a lower chance of different keys being considered equal
17
+ - expected_count: the maximum number of elements you think will be added
18
+ to the bloom filter. The more accurate this number is the more
19
+ accurate your false_positive_probability will be.
20
+ - block: The block should map the elements that will be iterated over to
21
+ a value that will be used by the filter. You should return a string.
22
+ This block does not affect the actual output of the route.
23
+ If no block is given, to_s on the element itself will be used for the
24
+ filter.
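To make the relationship between the two numeric arguments concrete, here is a minimal sketch using the textbook bloom filter sizing formulas. It is an assumption that the bundled Java library sizes its filter this way; the helper names `bloom_size` and `actual_rate` are hypothetical and not part of the plugin.

```ruby
# Textbook bloom filter sizing (an assumption, not taken from the plugin):
#   bits   m = -n * ln(p) / (ln 2)^2
#   hashes k = (m / n) * ln 2
# bloom_size and actual_rate are hypothetical helper names.
def bloom_size(false_positive_probability, expected_count)
  n = expected_count
  p = false_positive_probability
  bits = (-n * Math.log(p) / (Math.log(2)**2)).ceil
  hashes = [(bits.to_f / n * Math.log(2)).round, 1].max
  { bits: bits, hashes: hashes }
end

# Approximate false positive rate once `inserted` elements have been added.
def actual_rate(bits, hashes, inserted)
  (1 - Math.exp(-hashes * inserted.to_f / bits))**hashes
end

sized = bloom_size(0.001, 10)
puts sized.inspect
# Rate when the estimate of 10 elements was accurate:
puts actual_rate(sized[:bits], sized[:hashes], 10)
# Rate when ten times more elements were added than expected:
puts actual_rate(sized[:bits], sized[:hashes], 100)
```

Under these formulas, a filter sized for 10 elements stays close to the requested rate only while about 10 elements are added. Far beyond that, nearly every lookup reports a match, which is why an accurate expected_count matters.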
25
+
26
+ ### Example
27
+
28
+ Map the vertices to names and then filter by name:
29
+
30
+ graph.v[:name].bloomfilter(0.001, 10).except(['sam', 'bob'])
31
+
32
+ "steve" "gary"
33
+ Total: 2
34
+ => #<GraphV -> Obj(name) -> Obj-Bloom>
35
+
36
+ Wrong! There is no way to map the vertices to the name, so all vertices
37
+ pass through:
38
+
39
+ graph.v.bloomfilter(0.001, 10).except(['sam', 'bob'])
40
+
41
+ #<V[0]> #<V[1]> #<V[2]> #<V[3]>
42
+ Total: 4
43
+ => #<GraphV -> V-Bloom>
44
+
45
+ That's better. Here we tell the bloomfilter how to map the vertices to
46
+ the name field (we switched it to #only though just to get more mileage
47
+ out of the example):
48
+
49
+ graph.v.bloomfilter(0.001, 10) { |v| v[:name] }.only(['sam', 'bob'])
50
+
51
+ #<V[0]> #<V[3]>
52
+ Total: 2
53
+ => #<GraphV -> V-Bloom>
54
+
55
+ And for completeness, the uniq method is, I hope, pretty self-explanatory:
56
+
57
+ graph.v[:type].bloomfilter(0.001, 10).uniq
58
+
59
+ "band member"
60
+ Total: 1
61
+ => #<GraphV -> V-Bloom>
62
+
63
+ ## Is it Fast?
64
+
65
+ I don't know. This plugin is currently only a proof of concept and has
66
+ not been optimized, profiled or benchmarked! If you want to spend a
67
+ couple of hours to make it blazing fast, I'll be quite impressed!
68
+
69
+ ## How Pacer Plugins Work
70
+
71
+ The plugin architecture of Pacer is very simple. You can define a module
72
+ in one of 3 namespaces which correspond to the categories of functions
73
+ that I've identified thus far:
74
+
75
+ - Pacer::Filter
76
+ - Pacer::Transform
77
+ - Pacer::SideEffect
78
+
79
+ Rather than try to explain those clearly, it is probably best to
80
+ dig into the code and documentation of Pacer, Pipes and Gremlin.
81
+
82
+ The module that you define will be mixed in to a Pacer::Route instance
83
+ at runtime when chain_route is called pointing to your module. Nearly
84
+ every method that is called when a user is building a route is actually
85
+ just a friendlier version of chain_route. The chain_route method takes a
86
+ hash of arguments, a few of which are reserved, and the rest of which
87
+ will be applied to your module via property setters. Note that in this
88
+ plugin, :false_pos_prob, :expected_count, :bloomfilter, and :block all
89
+ have corresponding attr_accessors in the BloomFilter module. In that
90
+ way, settings are carried over from the route definition to the route
91
+ instance.
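The setter mechanism described above can be sketched with a toy stand-in. This is not Pacer's real implementation: `ToyRoute`, `ToyBloom`, the `MODULES` registry and the `RESERVED` list are all hypothetical names, used only to show how a hash of arguments could be applied to a mixed-in module through its attr_accessors.

```ruby
# Toy stand-in for a route; NOT Pacer's real chain_route.
class ToyRoute
  RESERVED = [:filter]  # hypothetical list of reserved keys
  MODULES  = {}         # hypothetical registry: name => module

  def chain_route(args)
    mod = MODULES.fetch(args[:filter])
    route = ToyRoute.new
    route.extend(mod)   # mix the plugin module in at runtime
    args.each do |key, value|
      next if RESERVED.include?(key)
      route.send("#{key}=", value)  # apply the rest via property setters
    end
    route
  end
end

# The plugin module only needs accessors matching the hash keys.
module ToyBloom
  attr_accessor :false_pos_prob, :expected_count, :block
end
ToyRoute::MODULES[:bloom] = ToyBloom

route = ToyRoute.new.chain_route(:filter => :bloom,
                                 :false_pos_prob => 0.001,
                                 :expected_count => 10,
                                 :block => nil)
puts route.false_pos_prob
puts route.expected_count
```

A key without a matching accessor would raise NoMethodError in this toy, which is one reason each setting needs its own attr_accessor.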
92
+
93
+ Once the route is defined, it may be executed multiple times. Whenever
94
+ it is executed, a new pipeline is built, of which your module is just
95
+ one part. In almost all cases, all you will need to do is define the
96
+ protected attach_pipe method. Inside that method you just need to create
97
+ your pipe, call setStarts on it with the pipe that was given as a
98
+ parameter, and return the pipe you created. Pacer will look after the
99
+ rest of the pipe building process.
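The attach_pipe contract can be sketched in plain Ruby as follows. The classes here (`ToyPipe`, `ToyRejectPipe`, `ToyFilter`, `ToySourceRoute`) are hypothetical and only mimic the shape of the real Pipes classes; an exact array lookup stands in for the bloom filter.

```ruby
# Hypothetical base pipe; the camelCase name mirrors the setStarts call
# mentioned above.
class ToyPipe
  def setStarts(source)
    @starts = source
  end
end

# Passes through every element that is not in the rejected list.
class ToyRejectPipe < ToyPipe
  def initialize(rejected)
    @rejected = rejected
  end

  def each
    return enum_for(:each) unless block_given?
    @starts.each { |e| yield e unless @rejected.include?(e) }
  end
end

# The plugin module: create the pipe, set its starts, return it.
module ToyFilter
  attr_accessor :rejected

  protected

  def attach_pipe(upstream)
    pipe = ToyRejectPipe.new(rejected)
    pipe.setStarts(upstream)
    pipe
  end
end

# Hypothetical route that builds a fresh pipe on every execution.
class ToySourceRoute
  include ToyFilter

  def initialize(source)
    @source = source
  end

  def to_a
    attach_pipe(@source).each.to_a
  end
end

names = ToySourceRoute.new(%w[sam bob steve gary])
names.rejected = %w[sam bob]
puts names.to_a.inspect
```

Because the pipe is created inside to_a, running the route a second time builds a new pipe rather than reusing an exhausted one.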
100
+
101
+ ## Contributing to pacer-bloomfilter
102
+
103
+ * Check out the latest master to make sure the feature hasn't been
104
+ implemented or the bug hasn't been fixed yet
105
+ * Check out the issue tracker to make sure someone hasn't already
106
+ requested it and/or contributed it
107
+ * Fork the project
108
+ * Start a feature/bugfix branch
109
+ * Commit and push until you are happy with your contribution
110
+ * Make sure to add tests for it. This is important so I don't break it
111
+ in a future version unintentionally.
112
+ * Please try not to mess with the Rakefile, version, or history. If you
113
+ want to have your own version, or it is otherwise necessary, that is
114
+ fine, but please isolate it in its own commit so I can cherry-pick around
115
+ it.
116
+
117
+ ## Copyright
118
+
119
+ Copyright (c) 2011 Darrick Wiebe. See LICENSE.txt for
120
+ further details.
121
+
@@ -0,0 +1,50 @@
1
+ require 'rubygems'
2
+ require 'bundler'
3
+ begin
4
+ Bundler.setup(:default, :development)
5
+ rescue Bundler::BundlerError => e
6
+ $stderr.puts e.message
7
+ $stderr.puts "Run `bundle install` to install missing gems"
8
+ exit e.status_code
9
+ end
10
+ require 'rake'
11
+
12
+ require 'jeweler'
13
+ Jeweler::Tasks.new do |gem|
14
+ # gem is a Gem::Specification... see http://docs.rubygems.org/read/chapter/20 for more options
15
+ gem.name = "pacer-bloomfilter"
16
+ gem.homepage = "http://github.com/pangloss/pacer-bloomfilter"
17
+ gem.license = "MIT"
18
+ gem.summary = %Q{Filter object streams in Pacer by using a bloom filter}
19
+ gem.description = %Q{Bloom filters are fast, compact, probabilistic data structures that allow set filtering with a configurable rate of false positives. This plugin adds .bloomfilter.uniq, .bloomfilter.only([collection]), and .bloomfilter.except([collection]) to the available route methods in Pacer.}
20
+ gem.email = "darrick@innatesoftware.com"
21
+ gem.authors = ["Darrick Wiebe"]
22
+ # Include your dependencies below. Runtime dependencies are required when using your gem,
23
+ # and development dependencies are only needed for development (ie running rake tasks, tests, etc)
24
+ gem.add_runtime_dependency 'pacer', '>= 0.6.1'
25
+ # gem.add_development_dependency 'rspec', '> 1.2.3'
26
+ end
27
+ Jeweler::RubygemsDotOrgTasks.new
28
+
29
+ require 'rspec/core'
30
+ require 'rspec/core/rake_task'
31
+ RSpec::Core::RakeTask.new(:spec) do |spec|
32
+ spec.pattern = FileList['spec/**/*_spec.rb']
33
+ end
34
+
35
+ RSpec::Core::RakeTask.new(:rcov) do |spec|
36
+ spec.pattern = 'spec/**/*_spec.rb'
37
+ spec.rcov = true
38
+ end
39
+
40
+ task :default => :spec
41
+
42
+ require 'rake/rdoctask'
43
+ Rake::RDocTask.new do |rdoc|
44
+ version = File.exist?('VERSION') ? File.read('VERSION') : ""
45
+
46
+ rdoc.rdoc_dir = 'rdoc'
47
+ rdoc.title = "pacer-bloomfilter #{version}"
48
+ rdoc.rdoc_files.include('README*')
49
+ rdoc.rdoc_files.include('lib/**/*.rb')
50
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
1
+ 1.0.0
@@ -0,0 +1,21 @@
1
+ module PacerBloomFilter
2
+ unless const_defined? :VERSION
3
+ PATH = File.expand_path(File.join(File.dirname(__FILE__), '..'))
4
+ VERSION = File.read(PATH + '/VERSION').chomp
5
+
6
+ $: << File.dirname(__FILE__)
7
+ require File.dirname(__FILE__) + '/../vendor/java-bloomfilter.jar'
8
+ end
9
+
10
+ def self.reload!
11
+ Dir[File.dirname(__FILE__) + '/**/*.rb'].each do |file|
12
+ puts file
13
+ load file
14
+ end
15
+ nil
16
+ end
17
+ end
18
+
19
+ require 'pacer/pipe/bloomfilter_reject'
20
+ require 'pacer/filter/bloomfilter'
21
+
@@ -0,0 +1,77 @@
1
+ module Pacer
2
+ module Core
3
+ module Route
4
+ def bloomfilter(false_pos_prob, expected_count, opts = {}, &block)
5
+ chain_route :filter => :bloom,
6
+ :false_pos_prob => false_pos_prob,
7
+ :expected_count => expected_count,
8
+ :bloomfilter => opts[:bloomfilter],
9
+ :block => block
10
+ end
11
+ end
12
+ end
13
+
14
+ module Filter
15
+ module BloomFilter
16
+ attr_accessor :false_pos_prob, :expected_count, :block, :bloomfilter
17
+
18
+ def uniq
19
+ @except ||= []
20
+ @uniq = true
21
+ self
22
+ end
23
+
24
+ def except(others)
25
+ @except ||= []
26
+ @except << others
27
+ self
28
+ end
29
+
30
+ def only(others)
31
+ @only ||= []
32
+ @only << others
33
+ self
34
+ end
35
+
36
+ protected
37
+
38
+ def attach_pipe(pipe)
39
+ pipe = except_pipe(pipe) if @except
40
+ pipe = only_pipe(pipe) if @only
41
+ pipe
42
+ end
43
+
44
+ private
45
+
46
+ def except_pipe(pipe)
47
+ bfp = Pacer::Pipes::BloomFilter::RejectPipe.new false_pos_prob, expected_count, sideline_pipe
48
+ bfp.accumulate if @uniq
49
+ prepare_pipe(bfp, @except, pipe)
50
+ end
51
+
52
+ def only_pipe(pipe)
53
+ bfp = Pacer::Pipes::BloomFilter::SelectPipe.new false_pos_prob, expected_count, sideline_pipe
54
+ prepare_pipe(bfp, @only, pipe)
55
+ end
56
+
57
+ def prepare_pipe(bfp, all_items, pipe)
58
+ bfp.bloomfilter = bloomfilter if bloomfilter
59
+ all_items.each do |items|
60
+ if items.is_a? Enumerable
61
+ bfp.addAll items
62
+ else
63
+ bfp.addAll [items]
64
+ end
65
+ end
66
+ bfp.setStarts pipe if pipe
67
+ bfp
68
+ end
69
+
70
+ def sideline_pipe
71
+ if block
72
+ Pacer::Route.pipeline Pacer::Route.empty(self).map(&block)
73
+ end
74
+ end
75
+ end
76
+ end
77
+ end
@@ -0,0 +1,103 @@
1
+ module Pacer
2
+ module Pipes
3
+ module BloomFilter
4
+ class SideliningPipe < AbstractPipe
5
+ def initialize(pipe)
6
+ super()
7
+ if pipe
8
+ @sideline = pipe
9
+ @sidelineExpando = ExpandableIterator.new(java.util.ArrayList.new.iterator);
10
+ @sideline.setStarts(@sidelineExpando);
11
+ end
12
+ end
13
+
14
+ protected
15
+
16
+ def sidelineValue(value)
17
+ if @sideline
18
+ @sideline.reset
19
+ @sidelineExpando.add value
20
+ @sideline.next
21
+ else
22
+ value
23
+ end
24
+ rescue NativeException => e
25
+ if e.cause.getClass == Pacer::NoSuchElementException.getClass
26
+ nil
27
+ else
28
+ raise e
29
+ end
30
+ end
31
+ end
32
+
33
+ class RejectPipe < SideliningPipe
34
+ import com.skjegstad.utils.BloomFilter
35
+ field_accessor :starts
36
+ attr_accessor :filter
37
+
38
+ def initialize(false_pos_prob, expected_count, sideline_pipe = nil)
39
+ super(sideline_pipe)
40
+ @filter = BloomFilter.new(false_pos_prob, expected_count)
41
+ end
42
+
43
+ def addAll(elements)
44
+ @filter.addAll(elements)
45
+ end
46
+
47
+ def accumulate
48
+ @accumulate = true
49
+ end
50
+
51
+ protected
52
+
53
+ def processNextStart()
54
+ while raw_element = starts.next
55
+ value = sidelineValue(raw_element)
56
+ unless @filter.contains? value.to_s
57
+ @filter.add(value.to_s) if @accumulate and value
58
+ return raw_element
59
+ end
60
+ end
61
+ rescue NativeException => e
62
+ if e.cause.getClass == Pacer::NoSuchElementException.getClass
63
+ raise e.cause
64
+ else
65
+ raise e
66
+ end
67
+ end
68
+ end
69
+
70
+ class SelectPipe < SideliningPipe
71
+ import com.skjegstad.utils.BloomFilter
72
+ field_accessor :starts
73
+ attr_accessor :filter
74
+
75
+ def initialize(false_pos_prob, expected_count, sideline_pipe = nil)
76
+ super(sideline_pipe)
77
+ @filter = BloomFilter.new(false_pos_prob, expected_count)
78
+ end
79
+
80
+ def addAll(elements)
81
+ @filter.addAll(elements)
82
+ end
83
+
84
+ protected
85
+
86
+ def processNextStart()
87
+ while raw_element = starts.next
88
+ value = sidelineValue(raw_element)
89
+ if @filter.contains? value.to_s
90
+ return raw_element
91
+ end
92
+ end
93
+ rescue NativeException => e
94
+ if e.cause.getClass == Pacer::NoSuchElementException.getClass
95
+ raise e.cause
96
+ else
97
+ raise e
98
+ end
99
+ end
100
+ end
101
+ end
102
+ end
103
+ end
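The reject, select and accumulate behaviour of the pipes above can be sketched in plain Ruby. This sketch swaps in an exact `Set` for the bloom filter, so it never produces false positives, and the names `reject_stream` and `select_stream` are hypothetical. The optional block plays the role of the sideline pipe: it maps each element to a key, while the raw element is what gets emitted.

```ruby
require 'set'

# Emits elements whose key is NOT in the seeded set. With accumulate: true,
# every emitted key is added to the set, which gives uniq-like behaviour.
def reject_stream(elements, seeded, accumulate: false, &key)
  seen = Set.new(seeded.map(&:to_s))
  key ||= ->(e) { e }
  elements.select do |raw|
    k = key.call(raw).to_s
    if seen.include?(k)
      false
    else
      seen << k if accumulate
      true
    end
  end
end

# Emits only elements whose key IS in the seeded set.
def select_stream(elements, seeded, &key)
  wanted = Set.new(seeded.map(&:to_s))
  key ||= ->(e) { e }
  elements.select { |raw| wanted.include?(key.call(raw).to_s) }
end

people = [{ :name => 'sam' }, { :name => 'steve' }, { :name => 'bob' }]
puts reject_stream(%w[sam steve bob gary], %w[sam bob]).inspect
puts reject_stream(%w[a b a c b], [], accumulate: true).inspect
puts select_stream(people, %w[sam bob]) { |h| h[:name] }.inspect
```

With a real bloom filter in place of the Set, reject could occasionally drop an element that was never seeded, and select could occasionally let one through, at roughly the configured false positive rate.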
@@ -0,0 +1,7 @@
1
+ require File.expand_path(File.dirname(__FILE__) + '/spec_helper')
2
+
3
+ describe "PacerBloomfilter" do
4
+ it "fails" do
5
+ fail "hey buddy, you should probably rename this file and start specing for real"
6
+ end
7
+ end
@@ -0,0 +1,12 @@
1
+ $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
2
+ $LOAD_PATH.unshift(File.dirname(__FILE__))
3
+ require 'rspec'
4
+ require 'pacer-bloomfilter'
5
+
6
+ # Requires supporting files with custom matchers and macros, etc,
7
+ # in ./support/ and its subdirectories.
8
+ Dir["#{File.dirname(__FILE__)}/support/**/*.rb"].each {|f| require f}
9
+
10
+ RSpec.configure do |config|
11
+
12
+ end
metadata ADDED
@@ -0,0 +1,138 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: pacer-bloomfilter
3
+ version: !ruby/object:Gem::Version
4
+ prerelease:
5
+ version: 1.0.0
6
+ platform: ruby
7
+ authors:
8
+ - Darrick Wiebe
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+
13
+ date: 2011-03-29 00:00:00 -04:00
14
+ default_executable:
15
+ dependencies:
16
+ - !ruby/object:Gem::Dependency
17
+ name: pacer
18
+ version_requirements: &id001 !ruby/object:Gem::Requirement
19
+ none: false
20
+ requirements:
21
+ - - ">="
22
+ - !ruby/object:Gem::Version
23
+ version: 0.6.1
24
+ requirement: *id001
25
+ prerelease: false
26
+ type: :runtime
27
+ - !ruby/object:Gem::Dependency
28
+ name: bundler
29
+ version_requirements: &id002 !ruby/object:Gem::Requirement
30
+ none: false
31
+ requirements:
32
+ - - ~>
33
+ - !ruby/object:Gem::Version
34
+ version: 1.0.0
35
+ requirement: *id002
36
+ prerelease: false
37
+ type: :development
38
+ - !ruby/object:Gem::Dependency
39
+ name: jeweler
40
+ version_requirements: &id003 !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ version: 1.5.2
46
+ requirement: *id003
47
+ prerelease: false
48
+ type: :development
49
+ - !ruby/object:Gem::Dependency
50
+ name: rspec
51
+ version_requirements: &id004 !ruby/object:Gem::Requirement
52
+ none: false
53
+ requirements:
54
+ - - ~>
55
+ - !ruby/object:Gem::Version
56
+ version: 2.3.0
57
+ requirement: *id004
58
+ prerelease: false
59
+ type: :development
60
+ - !ruby/object:Gem::Dependency
61
+ name: rcov
62
+ version_requirements: &id005 !ruby/object:Gem::Requirement
63
+ none: false
64
+ requirements:
65
+ - - ">="
66
+ - !ruby/object:Gem::Version
67
+ version: "0"
68
+ requirement: *id005
69
+ prerelease: false
70
+ type: :development
71
+ - !ruby/object:Gem::Dependency
72
+ name: pacer
73
+ version_requirements: &id006 !ruby/object:Gem::Requirement
74
+ none: false
75
+ requirements:
76
+ - - ">="
77
+ - !ruby/object:Gem::Version
78
+ version: 0.6.1
79
+ requirement: *id006
80
+ prerelease: false
81
+ type: :runtime
82
+ description: Bloom filters are fast, compact, probabilistic data structures that allow set filtering with a configurable rate of false positives. This plugin adds .bloomfilter.uniq, .bloomfilter.only([collection]), and .bloomfilter.except([collection]) to the available route methods in Pacer.
83
+ email: darrick@innatesoftware.com
84
+ executables: []
85
+
86
+ extensions: []
87
+
88
+ extra_rdoc_files:
89
+ - LICENSE.txt
90
+ - README.md
91
+ files:
92
+ - .document
93
+ - .rspec
94
+ - Gemfile
95
+ - LICENSE.txt
96
+ - README.md
97
+ - Rakefile
98
+ - VERSION
99
+ - lib/pacer-bloomfilter.rb
100
+ - lib/pacer/filter/bloomfilter.rb
101
+ - lib/pacer/pipe/bloomfilter_reject.rb
102
+ - spec/pacer-bloomfilter_spec.rb
103
+ - spec/spec_helper.rb
104
+ - vendor/java-bloomfilter.jar
105
+ has_rdoc: true
106
+ homepage: http://github.com/pangloss/pacer-bloomfilter
107
+ licenses:
108
+ - MIT
109
+ post_install_message:
110
+ rdoc_options: []
111
+
112
+ require_paths:
113
+ - lib
114
+ required_ruby_version: !ruby/object:Gem::Requirement
115
+ none: false
116
+ requirements:
117
+ - - ">="
118
+ - !ruby/object:Gem::Version
119
+ hash: 2
120
+ segments:
121
+ - 0
122
+ version: "0"
123
+ required_rubygems_version: !ruby/object:Gem::Requirement
124
+ none: false
125
+ requirements:
126
+ - - ">="
127
+ - !ruby/object:Gem::Version
128
+ version: "0"
129
+ requirements: []
130
+
131
+ rubyforge_project:
132
+ rubygems_version: 1.5.1
133
+ signing_key:
134
+ specification_version: 3
135
+ summary: Filter object streams in Pacer by using a bloom filter
136
+ test_files:
137
+ - spec/pacer-bloomfilter_spec.rb
138
+ - spec/spec_helper.rb