distyll 0.0.0 → 1.0.0
- checksums.yaml +4 -4
- data/README.rdoc +41 -10
- data/VERSION +1 -1
- data/distyll.gemspec +2 -2
- data/lib/distyll.rb +49 -47
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 355593d87b582eb84240e2a7792dbca8f3f28fae
|
4
|
+
data.tar.gz: 68405650d538248c22adeb704d1025198d2f1332
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: a8696cafb27c1076f98a3b11cbc11c2045e6db1970c6ec7ce2373090a874522478f04bdb29bb6bc4de731bcb4b731e230668973244b4ab2dbd0d6ffeb2487a34
|
7
|
+
data.tar.gz: 511082bfadd2ae053c232f501d1838a644f84954de26e472ec1bbecab43cba0bf1976f6833e30778084aaf444594ce0c3a5907a36422926a61e83e1cfa31d280
|
data/README.rdoc
CHANGED
@@ -11,14 +11,38 @@ However, a record created today may have an associated record (via a foreign key
|
|
11
11
|
years ago. If you slice the entire database by created_at timestamps, you'll have foreign keys which point
|
12
12
|
nowhere. Not very helpful for ensuring that your new feature works on production data.
|
13
13
|
|
14
|
-
Distyll's solution is to start from a set of "core" ActiveRecord models supplied at
|
14
|
+
Distyll's solution is to start from a set of "core" ActiveRecord models supplied at runtime
|
15
15
|
(plus a date threshold for these models), and only pull those that have been created since the date
|
16
|
-
threshold. It then traverses
|
17
|
-
|
16
|
+
threshold. It then traverses associations from the core models and pulls in associated records (details
|
17
|
+
in the next section).
|
18
18
|
|
19
19
|
Consequently, you end up with a data set that is representative of production, is internally consistent,
|
20
20
|
and is smaller.
|
21
21
|
|
22
|
+
== Associations and associated records
|
23
|
+
|
24
|
+
Distyll starts from the core models and stores the ids of all core records with created_at timestamps
|
25
|
+
after the date threshold. It also marks these core models as <code>include_has_many</code>. This means
|
26
|
+
that <code>has_many</code> and <code>has_and_belongs_to_many</code> relationships SHOULD be traversed out
|
27
|
+
of the model. Distyll then counts all the pertinent ids across all pertinent models and starts a loop.
|
28
|
+
|
29
|
+
At each step of the loop, for each model in distyll's list of pertinent models, it reaches out via all
|
30
|
+
belongs_to associations (and other associations if the model is marked as <code>include_has_many</code>).
|
31
|
+
Using these associations, it expands both (a) the set of pertinent models to be copied and (b) the
|
32
|
+
pertinent ids for each of the pertinent models. In each step, it only reaches out to associated
|
33
|
+
models for pertinent ids which are new since the previous loop. This cuts down on query time.
|
34
|
+
|
35
|
+
There is a trick with has_many association traversal. The core models do allow for traversal "down" (into
|
36
|
+
has_many and habtm associations), and any direct descendants of the core models continue traversal down.
|
37
|
+
However, if a descendant's associations are then followed "up" (through a belongs_to), then that parent
|
38
|
+
cannot subsequently traverse down. In essence, "include_has_many" is contagious down, but not up.
|
39
|
+
|
40
|
+
Distyll also ignores "through" relationships. It assumes that those associations will be covered via the
|
41
|
+
multi-step non-through traversal.
|
42
|
+
|
43
|
+
The looping stops when the total number of ids has not increased since the previous loop. Iterating with
|
44
|
+
a while loop was necessary due to self-referential joins. They prevent the use of a clean recursive algorithm.
|
45
|
+
|
22
46
|
== Using distyll in your project
|
23
47
|
|
24
48
|
1. Add <code>gem 'distyll'</code> to your Gemfile
|
@@ -27,9 +51,11 @@ and is smaller.
|
|
27
51
|
1. Run <code>rake db:create RAILS_ENV=distyll</code>
|
28
52
|
1. Run <code>rake db:schema:load RAILS_ENV=distyll</code>
|
29
53
|
1. Run <code>rails console</code>
|
30
|
-
1. Call <code>Distyll.
|
54
|
+
1. Call <code>Distyll.run(model_names, created_since)</code>, passing it an array of the core models and a
|
55
|
+
date after which core records will be copied.
|
31
56
|
|
32
|
-
If you need to clear out the distyll database and try again with different parameters, just go back to the
|
57
|
+
If you need to clear out the distyll database and try again with different parameters, just go back to the
|
58
|
+
<code>schema:load</code> step and continue from there.
|
33
59
|
|
34
60
|
== Contributing to distyll
|
35
61
|
|
@@ -39,14 +65,19 @@ If you need to clear out the distyll database and try again with different param
|
|
39
65
|
* Start a feature/bugfix branch.
|
40
66
|
* Commit and push until you are happy with your contribution.
|
41
67
|
* Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
|
42
|
-
* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or
|
68
|
+
* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or
|
69
|
+
it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.
|
43
70
|
|
44
71
|
== Next Steps for distyll
|
45
72
|
|
46
|
-
*
|
47
|
-
|
48
|
-
|
49
|
-
*
|
73
|
+
* When finding pertinent ids across polymorphic has_many and habtm associations, Distyll currently ignores
|
74
|
+
the _type field, which may cause it to save more ids from the associated table than needed (if the foreign
|
75
|
+
key matches some other record in a different table but the _type doesn't match the current model.)
|
76
|
+
* Currently performs "IN" queries. In Oracle, this is limited to 1000 values, so I would need to chunk
|
77
|
+
them for that DBMS.
|
78
|
+
* Tests. I know. I just don't yet have my head around how to test something that's SO model- and
|
79
|
+
database-centric, when those models and databases aren't present in the gem. Any advice would be
|
80
|
+
appreciated.
|
50
81
|
|
51
82
|
== Copyright
|
52
83
|
|
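The traversal described in the README's "Associations and associated records" section can be sketched in plain Ruby without a database. This is a minimal illustration, not distyll's code: the `TABLES` and `ASSOCIATIONS` constants, the `Profile` struct, and the `distill` and `associated_ids` methods are all hypothetical stand-ins for ActiveRecord models and reflections. It also assumes that only ids not seen before are queued for the next pass, which is one reading of "new since the previous loop".

```ruby
require 'set'

# Toy "tables" standing in for a database (hypothetical data, not from distyll).
TABLES = {
  customer: [{ id: 1 }, { id: 2 }],
  order:    [{ id: 10, customer_id: 1 }, { id: 11, customer_id: 1 },
             { id: 12, customer_id: 2 }],
  item:     [{ id: 100, order_id: 10 }, { id: 101, order_id: 11 },
             { id: 102, order_id: 12 }]
}.freeze

# Each entry: [owning model, association kind, target model, foreign key].
ASSOCIATIONS = [
  [:order,    :belongs_to, :customer, :customer_id],
  [:order,    :has_many,   :item,     :order_id],
  [:customer, :has_many,   :order,    :customer_id],
  [:item,     :belongs_to, :order,    :order_id]
].freeze

Profile = Struct.new(:include_has_many, :all_ids, :prior_ids, :current_ids)

# Ids on the far side of one association, starting from the given ids.
def associated_ids(model, kind, target, fk, ids)
  if kind == :belongs_to
    # "Up": read the foreign key stored on our own records.
    TABLES[model].select { |r| ids.include?(r[:id]) }.map { |r| r[fk] }.compact
  else
    # "Down": find target records whose foreign key points at us.
    TABLES[target].select { |r| ids.include?(r[fk]) }.map { |r| r[:id] }
  end
end

def distill(core_ids)
  profiles = {}
  core_ids.each do |model, ids|
    # Core models may traverse has_many associations.
    profiles[model] = Profile.new(true, Set.new(ids), Set.new, Set.new(ids))
  end
  total = -> { profiles.values.sum { |p| p.all_ids.size } }

  prior_total = -1
  # Loop until a full pass adds no new ids.
  while prior_total != total.call
    prior_total = total.call
    # Ids found in the last pass become the frontier for this pass.
    profiles.each_value do |p|
      p.prior_ids = p.current_ids
      p.current_ids = Set.new
    end
    # Iterate over a copy because new profiles may be added during the pass.
    profiles.dup.each do |model, profile|
      next if profile.prior_ids.empty?
      ASSOCIATIONS.each do |owner, kind, target, fk|
        next unless owner == model
        next if kind == :has_many && !profile.include_has_many
        # has_many traversal is passed on "down" but never "up".
        contagious = profile.include_has_many && kind == :has_many
        target_profile = (profiles[target] ||= Profile.new(contagious, Set.new, Set.new, Set.new))
        found = associated_ids(model, kind, target, fk, profile.prior_ids)
        fresh = Set.new(found) - target_profile.all_ids
        target_profile.current_ids.merge(fresh)
        target_profile.all_ids.merge(fresh)
      end
    end
  end
  profiles.transform_values { |p| p.all_ids.sort }
end
```

With these toy tables, `distill(order: [10])` collects order 10, customer 1 (reached "up" through `belongs_to`), and item 100 (reached "down" through `has_many`). Order 11 belongs to the same customer but is not collected, because a profile reached by going up does not traverse back down.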
data/VERSION
CHANGED
@@ -1 +1 @@
|
|
1
|
-
|
1
|
+
1.0.0
|
data/distyll.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "distyll"
|
8
|
-
s.version = "
|
8
|
+
s.version = "1.0.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Mason F. Matthews"]
|
12
|
-
s.date = "2014-06-
|
12
|
+
s.date = "2014-06-20"
|
13
13
|
s.description = "Have you ever had a 100GB production database and been unable to test on an internally consistent subset of the data? Distyll is your answer."
|
14
14
|
s.email = "mason.f.matthews@gmail.com"
|
15
15
|
s.extra_rdoc_files = [
|
data/lib/distyll.rb
CHANGED
@@ -1,26 +1,33 @@
|
|
1
1
|
class Distyll
|
2
2
|
|
3
|
-
|
4
|
-
|
5
|
-
def initialize(bms, cs)
|
6
|
-
@created_since = cs.to_date
|
7
|
-
@base_models = bms.map &:constantize
|
8
|
-
set_model_profiles
|
9
|
-
end
|
3
|
+
def self.run(base_models, created_since)
|
4
|
+
@model_profiles = Hash.new
|
10
5
|
|
11
|
-
def run
|
12
6
|
base_models.each do |model|
|
13
|
-
@model_profiles[model].
|
7
|
+
if @model_profiles[model].nil?
|
8
|
+
base_profile = DistyllModelProfile.new(model, true)
|
9
|
+
base_profile.load_ids_by_timestamp(created_since)
|
10
|
+
@model_profiles[model] = base_profile
|
11
|
+
end
|
14
12
|
end
|
15
13
|
|
16
14
|
prior_count = -1
|
17
15
|
while prior_count != current_count
|
18
16
|
prior_count = current_count
|
19
|
-
@model_profiles.each_value &:
|
20
|
-
|
21
|
-
|
22
|
-
|
23
|
-
|
17
|
+
@model_profiles.each_value &:demote_current_ids
|
18
|
+
|
19
|
+
# .dup is necessary here because of Ruby 1.9's "RuntimeError: can't add a new
|
20
|
+
# key into hash during iteration" encountered in self.find_or_store_profile.
|
21
|
+
@model_profiles.dup.each_value do |profile|
|
22
|
+
unless profile.prior_ids.blank?
|
23
|
+
profile.associations.each do |a|
|
24
|
+
# We DO want to make the associated profile continue to traverse has_manies if
|
25
|
+
# (a) The current profile traverses has_manies AND
|
26
|
+
# (b) The association we're about to traverse is a has_many.
|
27
|
+
contagious_has_many = profile.include_has_many && !(a.belongs_to? || a.has_and_belongs_to_many?)
|
28
|
+
|
29
|
+
find_or_store_profile(a.klass, contagious_has_many).load_ids(profile.get_new_associated_ids(a))
|
30
|
+
end
|
24
31
|
end
|
25
32
|
end
|
26
33
|
end
|
@@ -33,24 +40,12 @@ class Distyll
|
|
33
40
|
|
34
41
|
private
|
35
42
|
|
36
|
-
def
|
37
|
-
@model_profiles
|
38
|
-
base_models.each do |bm|
|
39
|
-
@model_profiles = potentially_add_profiles(bm, @model_profiles)
|
40
|
-
end
|
41
|
-
end
|
42
|
-
|
43
|
-
def potentially_add_profiles(model, profiles)
|
44
|
-
return profiles if profiles.include? model
|
45
|
-
profiles[model] = DistyllModelProfile.new(model)
|
46
|
-
profiles[model].associations.each do |a|
|
47
|
-
profiles = potentially_add_profiles(a.klass, profiles)
|
48
|
-
end
|
49
|
-
profiles
|
43
|
+
def self.current_count
|
44
|
+
@model_profiles.each_value.sum &:get_id_count
|
50
45
|
end
|
51
46
|
|
52
|
-
def
|
53
|
-
model_profiles.
|
47
|
+
def self.find_or_store_profile(model, include_has_many)
|
48
|
+
@model_profiles[model] ||= DistyllModelProfile.new(model, include_has_many)
|
54
49
|
end
|
55
50
|
|
56
51
|
end
|
@@ -58,46 +53,51 @@ end
|
|
58
53
|
|
59
54
|
|
60
55
|
class DistyllModelProfile
|
61
|
-
attr_reader :model, :record_count, :associations, :all_ids, :
|
56
|
+
attr_reader :model, :include_has_many, :record_count, :associations, :all_ids, :prior_ids, :current_ids
|
62
57
|
|
63
|
-
def initialize(m)
|
58
|
+
def initialize(m, include_h_m = false)
|
64
59
|
@model = m
|
60
|
+
@include_has_many = include_h_m
|
65
61
|
@record_count = m.count
|
66
|
-
@all_ids =
|
67
|
-
@
|
68
|
-
@
|
62
|
+
@all_ids = Set.new
|
63
|
+
@prior_ids = Set.new
|
64
|
+
@current_ids = Set.new
|
69
65
|
set_associations
|
70
66
|
end
|
71
67
|
|
72
|
-
def
|
73
|
-
@
|
74
|
-
@
|
68
|
+
def demote_current_ids
|
69
|
+
@prior_ids = @current_ids
|
70
|
+
@current_ids = Set.new
|
75
71
|
end
|
76
72
|
|
77
73
|
def load_ids_by_timestamp(timestamp)
|
78
74
|
ids = model.where("created_at >= ?", timestamp).select(:id).map &:id
|
79
|
-
@
|
80
|
-
@all_ids
|
75
|
+
@current_ids.merge(ids)
|
76
|
+
@all_ids.merge(ids)
|
81
77
|
end
|
82
78
|
|
83
79
|
def load_ids(ids)
|
84
|
-
@
|
85
|
-
@all_ids
|
80
|
+
@current_ids.merge(ids)
|
81
|
+
@all_ids.merge(ids)
|
86
82
|
end
|
87
83
|
|
88
84
|
def get_id_count
|
89
|
-
@all_ids = @all_ids.uniq || []
|
90
85
|
@all_ids.count
|
91
86
|
end
|
92
87
|
|
93
88
|
def get_new_associated_ids(a)
|
94
|
-
|
89
|
+
if a.belongs_to?
|
90
|
+
model.where(id: prior_ids.to_a).select(a.foreign_key).map { |r| r.send(a.foreign_key) }
|
91
|
+
else
|
92
|
+
# Polymorphism could slow us down here, causing us to pull more records than we want to.
|
93
|
+
a.klass.where(a.foreign_key => prior_ids.to_a).select(:id).map { |r| r.send(:id) }
|
94
|
+
end
|
95
95
|
end
|
96
96
|
|
97
97
|
def copy_records
|
98
98
|
return nil if all_ids.blank?
|
99
99
|
|
100
|
-
records = model.where(id: all_ids).load
|
100
|
+
records = model.where(id: all_ids.to_a).load
|
101
101
|
|
102
102
|
model.establish_connection("distyll")
|
103
103
|
records.each { |record| model.new(record.attributes).save!(validate: false) }
|
@@ -111,8 +111,10 @@ class DistyllModelProfile
|
|
111
111
|
def set_associations
|
112
112
|
@associations = Array.new
|
113
113
|
model.reflect_on_all_associations.each do |association|
|
114
|
-
if association.
|
115
|
-
|
114
|
+
if association.through_reflection.nil?
|
115
|
+
if association.belongs_to? || self.include_has_many
|
116
|
+
@associations << association
|
117
|
+
end
|
116
118
|
end
|
117
119
|
end
|
118
120
|
end
|
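The README's Next Steps list notes that "IN" queries are limited to 1000 values in Oracle and would need chunking. The id-list queries in `get_new_associated_ids` and `copy_records` are where such a limit would apply. A minimal sketch of the chunking idea follows; the `chunked_lookup` helper and the `IN_LIST_LIMIT` constant are hypothetical names and are not part of distyll.

```ruby
# Oracle's limit on IN-list size, as stated in the README.
IN_LIST_LIMIT = 1000

# Hypothetical helper (not part of distyll): split the ids into slices of at
# most `limit`, run the caller's block once per slice, and concatenate the
# arrays the block returns.
def chunked_lookup(ids, limit = IN_LIST_LIMIT)
  ids.to_a.each_slice(limit).flat_map { |chunk| yield(chunk) }
end

# Stand-in for a database query: report how many ids each "query" received.
sizes = chunked_lookup(1..2500) { |chunk| [chunk.size] }
puts sizes.inspect
```

With ActiveRecord, the block could be something like `{ |chunk| model.where(id: chunk).select(:id).map(&:id) }`, mirroring the query style already used in `get_new_associated_ids`. Whether one query per chunk is acceptable depends on how large the id sets get.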
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: distyll
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version:
|
4
|
+
version: 1.0.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Mason F. Matthews
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2014-06-
|
11
|
+
date: 2014-06-20 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: shoulda
|