bud 0.9.2 → 0.9.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/History.txt +17 -4
- data/README.md +5 -0
- data/docs/cheat.md +1 -1
- data/docs/getstarted.md +1 -1
- data/lib/bud.rb +52 -34
- data/lib/bud/aggs.rb +8 -11
- data/lib/bud/bud_meta.rb +18 -18
- data/lib/bud/collections.rb +14 -21
- data/lib/bud/executor/README.rescan +80 -0
- data/lib/bud/executor/elements.rb +25 -44
- data/lib/bud/executor/group.rb +80 -29
- data/lib/bud/executor/join.rb +73 -90
- data/lib/bud/monkeypatch.rb +1 -1
- data/lib/bud/rebl.rb +5 -2
- data/lib/bud/rewrite.rb +18 -14
- data/lib/bud/server.rb +1 -1
- data/lib/bud/source.rb +0 -45
- data/lib/bud/storage/dbm.rb +13 -9
- data/lib/bud/viz.rb +6 -8
- data/lib/bud/viz_util.rb +1 -0
- metadata +3 -18
data/lib/bud/executor/README.rescan
ADDED
@@ -0,0 +1,80 @@
+Notes on Invalidate and Rescan in Bud
+=====================================
+
+(I'll use 'downstream' to mean rhs to lhs (like in budplot). In every stratum,
+data originates at scanned sources at the "top", winds its way through various
+PushElements and ends up in a collection at the "bottom". I'll also use the
+term "elements" to mean both dataflow nodes (PushElements) and collections.)
+
+The invalidation strategy works through two flags/signals, rescan and
+invalidate. Invalidation means that a stateful PushElement's or a scratch's
+contents are erased, or a table is negated. Rescan means that tuples coming
+out of an element represent the entire collection (a full scan), not just
+deltas.
+
+Earlier: all stateful elements were eagerly invalidated.
+  Collections with state: scratches, interfaces, channels, terminal
+  Elements with state: Group, join, sort, reduce, each_with_index
+
+Now: lazy invalidation where possible, based on the observation that the same
+state is often rederived downstream, which means that as long as there are no
+negations, one should be able to go on in incremental mode (working only on
+deltas, not on storage) from one tick to another.
+
+Observations:
+
+1. There are two kinds of elements that are (or may be) invalidated at the
+   beginning of every tick: source scratches (those that are not found on the
+   lhs of any rule), and tables that process pending negations.
+
+2. a. Invalidation implies rescan of its contents.
+
+   b. Rescan of its contents implies invalidation of downstream nodes.
+
+   c. Invalidation involves rebuilding of state, which means that if a node
+      has multiple sources, it has to ask the other sources to rescan as well.
+
+   Example: x, y, z are scratches
+       z <= x.group(....)
+       z <= y.sort {}
+
+   If x is invalidated, it will rescan its contents. The group element then
+   invalidates its state, and rebuilds itself as x is scanned. Since group is
+   in rescan mode, z invalidates its state and is rebuilt from group.
+   However, since part of z's state comes from y.sort, it asks its source
+   element (the sort node) for a rescan as well.
+
+   This push-pull negotiation can be run to fixpoint, until the set of
+   elements that need to be invalidated and rescanned is fully determined.
+
+3. If a node is stateless, it passes the rescan request upstream, and the
+   invalidations downstream. But if it is stateful, it need not pass a rescan
+   request upstream. In the example above, only the sort node needs to rescan
+   its buffer; y doesn't need to be scanned at all.
+
+4. Solving the above constraints to a fixpoint at every tick is a huge
+   overhead, so we determine the strategy at wiring time.
+
+   bud.default_invalidate/default_rescan == the set of elements that we know
+   a priori will _always_ need the corresponding signal.
+
+   scanner.invalidate_set/rescan_set == for each scanner, the set of elements
+   to invalidate/rescan should that scanner's collection be negated.
+
+   bud.prepare_invalidation_scheme works as follows.
+
+   Start the process by determining which tables will invalidate at each tick,
+   and which PushElements will rescan at the beginning of each tick. Then run
+   rescan_invalidate_tc for a transitive closure, where each element gets to
+   determine its own presence in the rescan and invalidate sets, depending on
+   its source or target elements' presence in those sets. This creates the
+   default sets.
+
+   Then for each scanner, prime the pump by setting the scanner to rescan
+   mode, and determine what effect it has on the system by running
+   rescan_invalidate_tc. All the elements that are not already in the default
+   sets are those that need to be additionally informed at run time, should
+   we discover that that scanner's collection has been negated at the
+   beginning of a tick.
+
+The BUD_SAFE environment variable is used to force the old-style behavior,
+where every cached element is invalidated and fully scanned once every tick.
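The push-pull negotiation in observations 2-4 can be modeled as a small fixpoint computation. The sketch below is illustrative only: the `Node` struct and `propagate` function are invented for this example (they are not Bud's `rescan_invalidate_tc`), but they reproduce the x/y/group/sort/z outcome described above.

```ruby
require 'set'

# Illustrative model of the rescan/invalidate negotiation; the Node struct,
# stateful flags, and propagate are invented for this sketch, not Bud's API.
Node = Struct.new(:name, :sources, :stateful)

def propagate(nodes, rescan, invalidate)
  loop do
    changed = false
    nodes.each do |n|
      # 2a/2b: a rescanning source emits a full dump, so this node's derived
      # state is stale: it must be invalidated and rescanned downstream
      if n.sources.any? { |s| rescan.include?(s) }
        changed = true if rescan.add?(n)
        changed = true if n.stateful && invalidate.add?(n)
      end
      # 2c: rebuilding invalidated state needs full input from every source
      if invalidate.include?(n)
        n.sources.each { |s| changed = true if rescan.add?(s) }
      end
      # 3: only stateless nodes pass a rescan request further upstream
      if rescan.include?(n) && !n.stateful
        n.sources.each { |s| changed = true if rescan.add?(s) }
      end
    end
    break unless changed
  end
end

# The example from the text: z <= x.group(...); z <= y.sort {}
x      = Node.new(:x, [], true)
y      = Node.new(:y, [], true)
group  = Node.new(:group, [x], true)
sorter = Node.new(:sort, [y], true)
z      = Node.new(:z, [group, sorter], true)

rescan     = Set.new([x])  # x is invalidated at tick start, so it rescans
invalidate = Set.new([x])
propagate([x, y, group, sorter, z], rescan, invalidate)

p rescan.map(&:name).sort_by(&:to_s)      # [:group, :sort, :x, :z]
p invalidate.map(&:name).sort_by(&:to_s)  # [:group, :x, :z]
```

Under these rules the sort node replays its buffer (z asked it to rescan), but y is never rescanned, matching observation 3.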
data/lib/bud/executor/elements.rb
CHANGED
@@ -144,7 +144,8 @@ module Bud
     # default for stateless elements
     public
     def add_rescan_invalidate(rescan, invalidate)
-      # if any of the source elements are in rescan mode, then put this node in
+      # if any of the source elements are in rescan mode, then put this node in
+      # rescan.
       srcs = non_temporal_predecessors
       if srcs.any?{|p| rescan.member? p}
         rescan << self
@@ -157,7 +158,7 @@ module Bud
       # finally, if this node is in rescan, pass the request on to all source
       # elements
       if rescan.member? self
-        rescan
+        rescan.merge(srcs)
       end
     end
 
@@ -177,14 +178,12 @@ module Bud
     def <<(i)
       insert(i, nil)
     end
+
     public
     def flush
     end
-
     def invalidate_cache
-      #override to get rid of cached information.
     end
-    public
     def stratum_end
     end
 
@@ -220,7 +219,7 @@ module Bud
     def join(elem2, &blk)
       # cached = @bud_instance.push_elems[[self.object_id,:join,[self,elem2], @bud_instance, blk]]
       # if cached.nil?
-      elem2
+      elem2 = elem2.to_push_elem unless elem2.class <= PushElement
       toplevel = @bud_instance.toplevel
       join = Bud::PushSHJoin.new([self, elem2], toplevel.this_rule_context, [])
       self.wire_to(join)
@@ -292,7 +291,6 @@ module Bud
       return g
     end
 
-
     def argagg(aggname, gbkey_cols, collection, &blk)
       gbkey_cols = gbkey_cols.map{|c| canonicalize_col(c)}
       collection = canonicalize_col(collection)
@@ -353,7 +351,6 @@ module Bud
     end
 
     def reduce(initial, &blk)
-      @memo = initial
       retval = Bud::PushReduce.new("reduce#{Time.new.tv_usec}",
                                    @bud_instance, @collection_name,
                                    schema, initial, &blk)
@@ -380,34 +377,18 @@ module Bud
       end
       toplevel.push_elems[[self.object_id, :inspected]]
     end
-
-    def to_enum
-      # scr = @bud_instance.scratch(("scratch_" + Process.pid.to_s + "_" + object_id.to_s + "_" + rand(10000).to_s).to_sym, schema)
-      scr = []
-      self.wire_to(scr)
-      scr
-    end
   end
 
   class PushStatefulElement < PushElement
-    def rescan_at_tick
-      true
-    end
-
-    def rescan
-      true # always gives an entire dump of its contents
-    end
-
     def add_rescan_invalidate(rescan, invalidate)
-
-
-      # (doesn't need to pass a rescan request to its its source nodes).
-      rescan << self
-      srcs = non_temporal_predecessors
-      if srcs.any? {|p| rescan.member? p}
+      if non_temporal_predecessors.any? {|e| rescan.member? e}
+        rescan << self
         invalidate << self
       end
 
+      # Note that we do not need to pass rescan requests up to our source
+      # elements, since a stateful element has enough local information to
+      # reproduce its output.
       invalidate_tables(rescan, invalidate)
     end
   end
@@ -437,27 +418,29 @@ module Bud
   class PushSort < PushStatefulElement
     def initialize(elem_name=nil, bud_instance=nil, collection_name=nil,
                    schema_in=nil, &blk)
-      @sortbuf = []
       super(elem_name, bud_instance, collection_name, schema_in, &blk)
+      @sortbuf = []
+      @seen_new_input = false
     end
 
     def insert(item, source)
       @sortbuf << item
+      @seen_new_input = true
     end
 
     def flush
-
+      if @seen_new_input || @rescan
         @sortbuf.sort!(&@blk)
         @sortbuf.each do |t|
           push_out(t, false)
         end
-      @
+        @seen_new_input = false
+        @rescan = false
       end
-      nil
     end
 
     def invalidate_cache
-      @sortbuf
+      @sortbuf.clear
     end
   end
 
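The new PushSort logic above buffers inserts, sorts lazily in flush, and skips the work entirely when neither new input nor a rescan request arrived. The same dirty-flag pattern in isolation (a hypothetical standalone class, not Bud's PushSort; returning an array stands in for push_out):

```ruby
# Buffer-and-flush with dirty flags: re-sort and re-emit only when there is
# new input or an explicit rescan request.
class SortBuffer
  attr_accessor :rescan
  attr_reader :flushes

  def initialize(&cmp)
    @buf = []
    @cmp = cmp
    @seen_new_input = false
    @rescan = false
    @flushes = 0          # counts the flushes that actually did work
  end

  def insert(item)
    @buf << item
    @seen_new_input = true
  end

  def flush
    return [] unless @seen_new_input || @rescan
    @flushes += 1
    @seen_new_input = false
    @rescan = false
    @buf.sort!(&@cmp)
    @buf.dup              # stand-in for push_out on each tuple
  end

  def invalidate_cache
    @buf.clear
  end
end

s = SortBuffer.new { |a, b| a <=> b }
s.insert(3); s.insert(1); s.insert(2)
p s.flush        # [1, 2, 3]
p s.flush        # []  -- nothing new, no work done
s.rescan = true
p s.flush        # [1, 2, 3] -- a full dump without re-insertion
```

Note that a rescan replays the whole buffer, while invalidate_cache erases it; this is exactly the rescan/invalidate distinction from the README above.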
@@ -488,11 +471,14 @@ module Bud
       @invalidate_set = invalidate
     end
 
-    public
     def add_rescan_invalidate(rescan, invalidate)
-      #
+      # if the collection is to be invalidated, the scanner needs to be in
+      # rescan mode
       rescan << self if invalidate.member? @collection
 
+      # in addition, default PushElement rescan/invalidate logic applies
+      super
+
       # Note also that this node can be nominated for rescan by a target node;
       # in other words, a scanner element can be set to rescan even if the
       # collection is not invalidated.
@@ -555,20 +541,15 @@ module Bud
     end
 
     def add_rescan_invalidate(rescan, invalidate)
-
-      if srcs.any? {|p| rescan.member? p}
-        invalidate << self
-        rescan << self
-      end
-
-      invalidate_tables(rescan, invalidate)
+      super
 
       # This node has some state (@each_index), but not the tuples. If it is in
       # rescan mode, then it must ask its sources to rescan, and restart its
       # index.
       if rescan.member? self
         invalidate << self
-
+        srcs = non_temporal_predecessors
+        rescan.merge(srcs)
       end
     end
 
data/lib/bud/executor/group.rb
CHANGED
@@ -1,39 +1,80 @@
 require 'bud/executor/elements'
+require 'set'
 
 module Bud
   class PushGroup < PushStatefulElement
-    def initialize(elem_name, bud_instance, collection_name,
-
+    def initialize(elem_name, bud_instance, collection_name,
+                   keys_in, aggpairs_in, schema_in, &blk)
       if keys_in.nil?
         @keys = []
       else
         @keys = keys_in.map{|k| k[1]}
       end
-      #
-
+      # An aggpair is an array: [agg class instance, index of input field].
+      # ap[1] is nil for Count.
+      @aggpairs = aggpairs_in.map{|ap| [ap[0], ap[1].nil? ? nil : ap[1][1]]}
+      @groups = {}
+
+      # Check whether we need to eliminate duplicates from our input (we might
+      # see duplicates because of the rescan/invalidation logic, as well as
+      # because we don't do duplicate elimination on the output of a projection
+      # operator). We don't need to dupelim if all the args are exemplary.
+      @elim_dups = @aggpairs.any? {|a| not a[0].kind_of? ArgExemplary}
+      if @elim_dups
+        @input_cache = Set.new
+      end
+
+      @seen_new_data = false
       super(elem_name, bud_instance, collection_name, schema_in, &blk)
     end
 
     def insert(item, source)
+      if @elim_dups
+        return if @input_cache.include? item
+        @input_cache << item
+      end
+
+      @seen_new_data = true
       key = @keys.map{|k| item[k]}
-
-
-
-
-
+      group_state = @groups[key]
+      if group_state.nil?
+        @groups[key] = @aggpairs.map do |ap|
+          input_val = ap[1].nil? ? item : item[ap[1]]
+          ap[0].init(input_val)
+        end
+      else
+        @aggpairs.each_with_index do |ap, agg_ix|
+          input_val = ap[1].nil? ? item : item[ap[1]]
+          state_val = ap[0].trans(group_state[agg_ix], input_val)[0]
+          group_state[agg_ix] = state_val
+        end
       end
     end
 
+    def add_rescan_invalidate(rescan, invalidate)
+      # XXX: need to understand why this is necessary; it is dissimilar to the
+      # way other stateful non-monotonic operators are handled.
+      rescan << self
+      super
+    end
+
     def invalidate_cache
-      puts "
+      puts "#{self.class}/#{self.tabname} invalidated" if $BUD_DEBUG
       @groups.clear
+      @input_cache.clear if @elim_dups
+      @seen_new_data = false
     end
 
     def flush
+      # If we haven't seen any input since the last call to flush(), we're done:
+      # our output would be the same as before.
+      return unless @seen_new_data
+      @seen_new_data = false
+
       @groups.each do |g, grps|
         grp = @keys == $EMPTY ? [[]] : [g]
         @aggpairs.each_with_index do |ap, agg_ix|
-          grp << ap[0].
+          grp << ap[0].final(grps[agg_ix])
         end
         outval = grp[0].flatten
         (1..grp.length-1).each {|i| outval << grp[i]}
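The rewritten insert/flush path above folds each tuple into per-group aggregate state using the init/trans/final protocol of Bud's aggregate classes: init seeds state from the first input, trans folds in each later input (returning the new state first), and final produces the output. A minimal sum aggregate run through the same shape (SumAgg and group_by_agg are hypothetical stand-ins for this sketch, not Bud's Agg hierarchy):

```ruby
# Minimal stand-in for the aggregate protocol: init / trans / final.
class SumAgg
  def init(v)      v        end   # state for a fresh group
  def trans(st, v) [st + v] end   # trans returns [new_state, ...]
  def final(st)    st       end   # state -> output value
end

# Group-by fold in the shape of PushGroup#insert / #flush.
def group_by_agg(items, key_ix, val_ix, agg)
  groups = {}
  items.each do |item|
    key = item[key_ix]
    st  = groups[key]
    groups[key] = st.nil? ? agg.init(item[val_ix])
                          : agg.trans(st, item[val_ix])[0]
  end
  groups.map { |k, st| [k, agg.final(st)] }
end

items = [[:a, 1], [:a, 2], [:b, 5]]
p group_by_agg(items, 0, 1, SumAgg.new)   # [[:a, 3], [:b, 5]]
```

The dup-elimination cache in the real code exists because trans is not idempotent for aggregates like sum: folding the same tuple twice would be wrong, and rescans can redeliver tuples.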
@@ -44,31 +85,38 @@ module Bud
 
   class PushArgAgg < PushGroup
     def initialize(elem_name, bud_instance, collection_name, keys_in, aggpairs_in, schema_in, &blk)
-
+      unless aggpairs_in.length == 1
+        raise Bud::Error, "multiple aggpairs #{aggpairs_in.map{|a| a.class.name}} in ArgAgg; only one allowed"
+      end
       super(elem_name, bud_instance, collection_name, keys_in, aggpairs_in, schema_in, &blk)
-      @agg = @aggpairs[0]
-      @aggcol = @aggpairs[0][1]
+      @agg, @aggcol = @aggpairs[0]
       @winners = {}
     end
 
     public
     def invalidate_cache
-
-      @groups.clear
+      super
       @winners.clear
     end
 
     def insert(item, source)
       key = @keys.map{|k| item[k]}
-
-
-
-
+      group_state = @groups[key]
+      if group_state.nil?
+        @seen_new_data = true
+        @groups[key] = @aggpairs.map do |ap|
          @winners[key] = [item]
-
-
-
-
+          input_val = item[ap[1]]
+          ap[0].init(input_val)
+        end
+      else
+        @aggpairs.each_with_index do |ap, agg_ix|
+          input_val = item[ap[1]]
+          state_val, flag, *rest = ap[0].trans(group_state[agg_ix], input_val)
+          group_state[agg_ix] = state_val
+          @seen_new_data = true unless flag == :ignore
+
+          case flag
           when :ignore
             # do nothing
           when :replace
@@ -76,19 +124,22 @@ module Bud
           when :keep
             @winners[key] << item
           when :delete
-
-            @winners[key].delete t
+            rest.each do |t|
+              @winners[key].delete t
             end
           else
-            raise Bud::Error, "strange result from argagg
+            raise Bud::Error, "strange result from argagg transition func: #{flag}"
           end
         end
-      @groups[key] ||= Array.new(@aggpairs.length)
-      @groups[key][agg_ix] = agg
       end
     end
 
     def flush
+      # If we haven't seen any input since the last call to flush(), we're done:
+      # our output would be the same as before.
+      return unless @seen_new_data
+      @seen_new_data = false
+
       @groups.each_key do |g|
         @winners[g].each do |t|
           push_out(t, false)
data/lib/bud/executor/join.rb
CHANGED
@@ -67,14 +67,16 @@ module Bud
     public
     def state_id # :nodoc: all
       object_id
-
+    end
+
+    def flush
+      replay_join if @rescan
     end
 
     # initialize the state for this join to be carried across iterations within a fixpoint
     private
     def setup_state
       sid = state_id
-
       @tabname = ("(" + @all_rels_below.map{|r| r.tabname}.join('*') +"):"+sid.to_s).to_sym
       @hash_tables = [{}, {}]
     end
@@ -131,21 +133,21 @@ module Bud
       else
         @keys = []
       end
-      # puts "@keys = #{@keys.inspect}"
     end
 
     public
     def invalidate_cache
       @rels.each_with_index do |source_elem, i|
         if source_elem.rescan
-
           puts "#{tabname} rel:#{i}(#{source_elem.tabname}) invalidated" if $BUD_DEBUG
           @hash_tables[i] = {}
-          if
-          #
-          #
-
-          #
+          if i == 0
+            # Only if i == 0 because outer joins in Bloom are left outer joins.
+            # If i == 1, missing_keys will be corrected when items are populated
+            # in the rhs fork.
+            # XXX This is not modular. We are doing invalidation work for outer
+            # joins, which is part of a separate module PushSHOuterJoin.
+            @missing_keys.clear
           end
         end
       end
@@ -268,11 +270,12 @@ module Bud
 
     public
     def insert(item, source)
-      #
-      if
-
-
-
+      # If we need to reproduce the join's output, do that now before we process
+      # the to-be-inserted tuple. This avoids needless duplicates: if the
+      # to-be-inserted tuple produced any join output, we'd produce that output
+      # again if we didn't rescan now.
+      replay_join if @rescan
+
       if @selfjoins.include? source.elem_name
         offsets = []
         @relnames.each_with_index{|r,i| offsets << i if r == source.elem_name}
@@ -308,50 +311,26 @@ module Bud
       end
     end
 
-    public
-    def rescan_at_tick
-      false
-    end
-
-    public
-    def add_rescan_invalidate(rescan, invalidate)
-      if non_temporal_predecessors.any? {|e| rescan.member? e}
-        rescan << self
-        invalidate << self
-      end
-
-      # The distinction between a join node and other stateful elements is that
-      # when a join node needs a rescan it doesn't tell all its sources to
-      # rescan. In fact, it doesn't have to pass a rescan request up to a
-      # source, because if a target needs a rescan, the join node has all the
-      # state necessary to feed the downstream node. And if a source node is in
-      # rescan, then at run-time only the state associated with that particular
-      # source node @hash_tables[offset] will be cleared, and will get filled up
-      # again because that source will rescan anyway.
-      invalidate_tables(rescan, invalidate)
-    end
-
     def replay_join
-
-      b = @hash_tables
-
-
-
-
-
-
-
-
-      end
+      @rescan = false
+      a, b = @hash_tables
+      return if a.empty? or b.empty?
+
+      if a.size < b.size
+        a.each_pair do |key, items|
+          the_matches = b[key]
+          unless the_matches.nil?
+            items.each do |item|
+              process_matches(item, the_matches, 0)
            end
          end
-
-
-
-
-
-
-
+        end
+      else
+        b.each_pair do |key, items|
+          the_matches = a[key]
+          unless the_matches.nil?
+            items.each do |item|
+              process_matches(item, the_matches, 1)
            end
          end
        end
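The new replay_join walks the smaller of the two hash tables and probes the larger, the standard trick for replaying a symmetric hash join from its cached state. The same effect in isolation (a hypothetical free function; collecting pairs into an array stands in for process_matches):

```ruby
# Replay a symmetric hash join from its two { key => [tuples] } tables:
# iterate the smaller side, probe the larger, emit matched (lhs, rhs) pairs.
def replay_join(a, b)
  out = []
  return out if a.empty? || b.empty?
  small, large, swapped = a.size < b.size ? [a, b, false] : [b, a, true]
  small.each_pair do |key, items|
    matches = large[key]
    next if matches.nil?
    items.each do |item|
      matches.each do |m|
        # keep output in (lhs, rhs) order regardless of which side we walked
        out << (swapped ? [m, item] : [item, m])
      end
    end
  end
  out
end

lhs = { 1 => [[:l1]], 2 => [[:l2]] }
rhs = { 1 => [[:r1], [:r2]] }
p replay_join(lhs, rhs)   # [[[:l1], [:r1]], [[:l1], [:r2]]]
```

Iterating the smaller table bounds the number of probe lookups, which matters when a rescan replays a join whose two inputs are very different sizes.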
@@ -489,7 +468,6 @@ module Bud
   end
 
   module PushSHOuterJoin
-
     private
     def insert_item(item, offset)
       if @keys.nil? or @keys.empty?
@@ -517,6 +495,11 @@ module Bud
       end
     end
 
+    public
+    def rescan_at_tick
+      true
+    end
+
     public
     def stratum_end
       flush
@@ -525,23 +508,24 @@ module Bud
 
     private
     def push_missing
-
-      @
-      @
-        push_out([t, @rels[1].null_tuple])
-      end
+      @missing_keys.each do |key|
+        @hash_tables[0][key].each do |t|
+          push_out([t, @rels[1].null_tuple])
         end
       end
     end
   end
 
 
-  # Consider "u <= s.notin(t, s.a => t.b)". notin is a non-monotonic operator,
-  # but negatively on t. Stratification ensures
-  #
-  #
-  #
-  #
+  # Consider "u <= s.notin(t, s.a => t.b)". notin is a non-monotonic operator,
+  # where u depends positively on s, but negatively on t. Stratification ensures
+  # that t is fully computed in a lower stratum, which means that we can expect
+  # multiple iterators on s's side only. If t's scanner were to push its
+  # elements down first, every insert of s merely needs to be cross checked with
+  # the cached elements of 't', and pushed down to the next element if s notin
+  # t. However, if s's scanner were to fire first, we have to wait until the
+  # first flush, at which point we are sure to have seen all the t-side tuples
+  # in this tick.
   class PushNotIn < PushStatefulElement
     def initialize(rellist, bud_instance, preds=nil, &blk) # :nodoc: all
       @lhs, @rhs = rellist
@@ -552,7 +536,6 @@ module Bud
       setup_preds(preds) unless preds.empty?
       @rhs_rcvd = false
       @hash_tables = [{},{}]
-      @rhs_rcvd = false
       if @lhs_keycols.nil? and blk.nil?
         # pointwise comparison. Could use zip, but it creates an array for each field pair
         blk = lambda {|lhs, rhs|
@@ -563,9 +546,10 @@ module Bud
     end
 
     def setup_preds(preds)
-      # This is simpler than PushSHJoin's setup_preds, because notin is a binary
-      # collections.
-      #
+      # This is simpler than PushSHJoin's setup_preds, because notin is a binary
+      # operator where both lhs and rhs are collections. preds is an array of
+      # hash_pairs. For now assume that the attributes are in the same order as
+      # the tables.
       @lhs_keycols, @rhs_keycols = preds.reduce([[], []]) do |memo, item|
         # each item is a hash
         l = item.keys[0]
@@ -578,11 +562,11 @@ module Bud
     def find_col(colspec, rel)
       if colspec.is_a? Symbol
         col_desc = rel.send(colspec)
-        raise "
+        raise Bud::Error, "unknown column #{colspec} in #{@rel.tabname}" if col_desc.nil?
       elsif colspec.is_a? Array
         col_desc = colspec
       else
-        raise "
+        raise Bud::Error, "symbol or column spec expected. Got #{colspec}"
       end
       col_desc[1] # col_desc is of the form [tabname, colnum, colname]
     end
@@ -592,11 +576,6 @@ module Bud
       keycols.nil? ? $EMPTY : keycols.map{|col| item[col]}
     end
 
-    public
-    def invalidate_at_tick
-      true
-    end
-
     public
     def rescan_at_tick
       true
@@ -605,7 +584,6 @@ module Bud
     def insert(item, source)
       offset = source == @lhs ? 0 : 1
       key = get_key(item, offset)
-      #puts "#{key}, #{item}, #{offset}"
       (@hash_tables[offset][key] ||= Set.new).add item
       if @rhs_rcvd and offset == 0
         push_lhs(key, item)
@@ -613,15 +591,14 @@ module Bud
       end
     end
 
     def flush
-      # When flush is called the first time, both lhs and rhs scanners have been
-      # we know that the rhs is not
+      # When flush is called the first time, both lhs and rhs scanners have been
+      # invoked, and because of stratification we know that the rhs is not
+      # growing any more, until the next tick.
       unless @rhs_rcvd
         @rhs_rcvd = true
-        @hash_tables[0].
-        values.each{|item|
-
-        }
-        }
+        @hash_tables[0].each do |key,values|
+          values.each {|item| push_lhs(key, item)}
+        end
       end
     end
 
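The flush logic above implements an anti-join: stratification guarantees the rhs is complete before the buffered lhs is released, and each lhs tuple is emitted only if it has no rhs match. The same effect reduced to plain collections (notin here is a hypothetical helper, not PushNotIn itself; it covers the keyed comparison, not the block form):

```ruby
require 'set'

# Anti-join effect of notin: keep lhs tuples whose key column matches no
# rhs tuple's key column.
def notin(lhs, rhs, lhs_key_ix, rhs_key_ix)
  rhs_keys = rhs.map { |t| t[rhs_key_ix] }.to_set
  lhs.reject { |t| rhs_keys.include?(t[lhs_key_ix]) }
end

s = [[1, :a], [2, :b], [3, :c]]
t = [[:x, 2]]
# u <= s.notin(t, s.a => t.b): keep rows of s whose column 0 misses t's column 1
p notin(s, t, 0, 1)   # [[1, :a], [3, :c]]
```

Building the rhs key set once before filtering mirrors why PushNotIn buffers the lhs until @rhs_rcvd is set: emitting lhs tuples earlier could produce output that a later rhs tuple should have suppressed.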
@@ -661,9 +638,15 @@ module Bud
       @delete_keys.each{|o| o.pending_delete_keys([item])}
       @pendings.each{|o| o.pending_merge([item])}
     end
-
-
-
-
+
+    def invalidate_cache
+      puts "#{self.class}/#{self.tabname} invalidated" if $BUD_DEBUG
+      @hash_tables = [{},{}]
+      @rhs_rcvd = false
+    end
+
+    def stratum_end
+      @rhs_rcvd = false
+    end
   end
 end