distyll 0.0.0 → 1.0.0
- checksums.yaml +4 -4
- data/README.rdoc +41 -10
- data/VERSION +1 -1
- data/distyll.gemspec +2 -2
- data/lib/distyll.rb +49 -47
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 355593d87b582eb84240e2a7792dbca8f3f28fae
+  data.tar.gz: 68405650d538248c22adeb704d1025198d2f1332
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a8696cafb27c1076f98a3b11cbc11c2045e6db1970c6ec7ce2373090a874522478f04bdb29bb6bc4de731bcb4b731e230668973244b4ab2dbd0d6ffeb2487a34
+  data.tar.gz: 511082bfadd2ae053c232f501d1838a644f84954de26e472ec1bbecab43cba0bf1976f6833e30778084aaf444594ce0c3a5907a36422926a61e83e1cfa31d280
data/README.rdoc
CHANGED
@@ -11,14 +11,38 @@ However, a record created today may have an associated record (via a foreign key
 years ago. If you slice the entire database by created_at timestamps, you'll have foreign keys which point
 nowhere. Not very helpful for ensuring that your new feature works on production data.
 
-Distyll's solution is to start from a set of "core" ActiveRecord models supplied at
+Distyll's solution is to start from a set of "core" ActiveRecord models supplied at runtime
 (plus a date threshold for these models), and only pull those that have been created since the date
-threshold. It then traverses
-
+threshold. It then traverses associations from the core models and pulls in associated records (details
+in the next section).
 
 Consequently, you end up with a data set that is representative of production, is internally consistent,
 and is smaller.
 
+== Associations and associated records
+
+Distyll starts from the core models and stores the ids of all core records with created_at timestamps
+after the date threshold. It also marks these core models as <code>include_has_many</code>. This means
+that <code>has_many</code> and <code>has_and_belongs_to_many</code> relationships SHOULD be traversed out
+of the model. Distyll then counts all the pertinent ids across all pertinent models and starts a loop.
+
+At each step of the loop, for each model in distyll's list of pertinent models, it reaches out via all
+belongs_to associations (and other associations if the model is marked as <code>include_has_many</code>).
+Using these associations, it expands both (a) the set of pertinent models to be copied and (b) the
+pertinent ids for each of the pertinient models. In each step, it is only reaching out to associated
+models for pertinent ids which are new since the previous loop. This cuts down on query time.
+
+There is a trick with has_many association traversal. The core models do allow for traversal "down" (into
+has_many and habtm associations), and any direct descendants of the core models continue traversal down.
+However, if a descendant's associations are then followed "up" (through a belongs_to), then that parent
+cannot subsequently traverse down. In essence, "include_has_many" is contagious down, but not up.
+
+Distyll also ignores "through" relationships. It assumes that those associations will be covered via the
+multi-step non-through traversal.
+
+The looping stops when the total number of ids has not increased since the previous loop. Iterating with
+a while was necessary due to self-referrential joins. They prevent the use of a clean recursive algorithm.
+
 == Using distyll in your project
 
 1. Add <code>gem 'distyll'</code> to your gemfile
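The traversal rule described in the README text added above can be illustrated without ActiveRecord. The following is a minimal sketch under stated assumptions: the `BELONGS_TO` and `HAS_MANY` tables, the `Profile` struct and the `distill` helper are hypothetical names invented for this illustration and are not part of the gem. Only the rules themselves come from the README: loop until the total id count stops growing, expand only from ids that are new since the previous pass, and let `include_has_many` spread down but not up. Unlike the gem, the sketch does not distinguish has_many from has_and_belongs_to_many.

```ruby
require "set"

# Hypothetical in-memory data. Each record is a [model, id] pair.
# BELONGS_TO maps a record to its parents ("up"); HAS_MANY maps it to its children ("down").
BELONGS_TO = {
  [:comment, 1] => [[:post, 10]],
  [:comment, 2] => [[:post, 11]],
  [:post, 10]   => [[:user, 100]],
  [:post, 11]   => [[:user, 100]]
}.freeze

HAS_MANY = {
  [:post, 10]  => [[:comment, 1]],
  [:post, 11]  => [[:comment, 2]],
  [:user, 100] => [[:post, 10], [:post, 11]]
}.freeze

# Hypothetical stand-in for a per-model profile.
Profile = Struct.new(:include_has_many, :all_ids, :prior_ids, :current_ids) do
  # Record only ids that have not been seen before.
  def load_ids(ids)
    fresh = ids.to_set - all_ids
    current_ids.merge(fresh)
    all_ids.merge(fresh)
  end

  # Ids found in the last pass become the frontier for the next pass.
  def demote_current_ids
    self.prior_ids = current_ids
    self.current_ids = Set.new
  end
end

def distill(core_records)
  profiles = {}
  find_or_store = lambda do |model, include_has_many|
    profiles[model] ||= Profile.new(include_has_many, Set.new, Set.new, Set.new)
  end
  current_count = -> { profiles.each_value.sum { |profile| profile.all_ids.size } }

  # Core models are always allowed to traverse "down".
  core_records.each { |model, id| find_or_store.call(model, true).load_ids([id]) }

  prior_count = -1
  while prior_count != current_count.call
    prior_count = current_count.call
    profiles.each_value(&:demote_current_ids)

    # Iterate over a copy because new profiles may be stored during the pass.
    profiles.dup.each do |model, profile|
      profile.prior_ids.each do |id|
        # Going "up" never grants permission to go back "down".
        BELONGS_TO.fetch([model, id], []).each do |parent_model, parent_id|
          find_or_store.call(parent_model, false).load_ids([parent_id])
        end
        next unless profile.include_has_many

        # Going "down" from a permitted model passes the permission on.
        HAS_MANY.fetch([model, id], []).each do |child_model, child_id|
          find_or_store.call(child_model, true).load_ids([child_id])
        end
      end
    end
  end

  profiles.transform_values { |profile| profile.all_ids.to_a.sort }
end

RESULT = distill([[:post, 10]])
```

Starting from post 10, the sketch collects its comment and its user, but it does not collect post 11: the user was reached "up" through a belongs_to, so it may not traverse "down" into its other posts.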
@@ -27,9 +51,11 @@ and is smaller.
 1. Run <code>rake db:create RAILS_ENV=distyll</code>
 1. Run <code>rake db:schema:load RAILS_ENV=distyll</code>
 1. Run <code>rails console</code>
-1. Call <code>Distyll.
+1. Call <code>Distyll.run(model_names, created_since)</code>, passing it an array of the core models and a
+date after which core records will be copied.
 
-If you need to clear out the distyll database and try again with different parameters, just go back to the
+If you need to clear out the distyll database and try again with different parameters, just go back to the
+<code>schema:load</code> step and continue from there.
 
 == Contributing to distyll
 
@@ -39,14 +65,19 @@ If you need to clear out the distyll database and try again with different param
 * Start a feature/bugfix branch.
 * Commit and push until you are happy with your contribution.
 * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
-* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or
+* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or
+is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
 
 == Next Steps for distyll
 
-*
-
-
-*
+* When finding pertinent ids across polymorphic has_many and habtm associations, Distyll currently ignores
+the _type field, which may cause it to save more ids from the associated table than needed (if the foreign
+key matches some other record in a different table but the _type doesn't match the current model.)
+* Currently performs "IN" queries. In Oracle, this is limited to 1000 values, so I would need to chunk
+them for that DBMS.
+* Tests. I know. I just don't yet have my head around how to test something that's SO model- and
+database-centric, when those models and databases aren't present in the gem. Any advice would be
+appreciated.
 
 == Copyright
 
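The "Next Steps" entry about "IN" queries mentions chunking the id list for a DBMS that caps the number of values per list. One possible shape for that, offered only as a sketch: `IN_LIST_LIMIT`, `ids_in_chunks` and `fetch_in_chunks` are hypothetical names that do not exist in the gem, and the block passed to `fetch_in_chunks` stands in for whatever query would be run per chunk.

```ruby
require "set"

# Hypothetical limit, taken from the 1000-value figure quoted in the README.
IN_LIST_LIMIT = 1000

# Split any enumerable of ids into arrays of at most `limit` elements.
def ids_in_chunks(ids, limit = IN_LIST_LIMIT)
  ids.to_a.each_slice(limit).to_a
end

# Run the given block once per chunk and concatenate the results.
# The block is a stand-in for a per-chunk query.
def fetch_in_chunks(ids, limit = IN_LIST_LIMIT)
  ids_in_chunks(ids, limit).flat_map { |chunk| yield chunk }
end

CHUNK_SIZES = ids_in_chunks(Set.new(1..2500)).map(&:size)
SCALED      = fetch_in_chunks(1..5, 2) { |chunk| chunk.map { |id| id * 10 } }
```

In a Rails application the block might wrap a call such as `model.where(id: chunk).to_a`, but that wiring is an assumption and is not shown in this diff.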
data/VERSION
CHANGED
@@ -1 +1 @@
-
+1.0.0
data/distyll.gemspec
CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "distyll"
-  s.version = "
+  s.version = "1.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Mason F. Matthews"]
-  s.date = "2014-06-
+  s.date = "2014-06-20"
   s.description = "Have you ever had a 100GB production database and been unable to test on an internally consistent subset of the data? Distyll is your answer."
   s.email = "mason.f.matthews@gmail.com"
   s.extra_rdoc_files = [
data/lib/distyll.rb
CHANGED
@@ -1,26 +1,33 @@
 class Distyll
 
-
-
-  def initialize(bms, cs)
-    @created_since = cs.to_date
-    @base_models = bms.map &:constantize
-    set_model_profiles
-  end
+  def self.run(base_models, created_since)
+    @model_profiles = Hash.new
 
-  def run
     base_models.each do |model|
-      @model_profiles[model].
+      if @model_profiles[model].nil?
+        base_profile = DistyllModelProfile.new(model, true)
+        base_profile.load_ids_by_timestamp(created_since)
+        @model_profiles[model] = base_profile
+      end
     end
 
     prior_count = -1
     while prior_count != current_count
       prior_count = current_count
-      @model_profiles.each_value &:
-
-
-
-
+      @model_profiles.each_value &:demote_current_ids
+
+      # .dup is necessary here because of Ruby 1.9's "RuntimeError: can't add a new
+      # key into hash during iteration" encountered in self.find_or_store_profile.
+      @model_profiles.dup.each_value do |profile|
+        unless profile.prior_ids.blank?
+          profile.associations.each do |a|
+            # We DO want to make the associated profile continue to traverse has_manies if
+            # (a) The current profile traverses has_manies AND
+            # (b) The association we're about to traverse is a has_many.
+            contagious_has_many = profile.include_has_many && !(a.belongs_to? || a.has_and_belongs_to_many?)
+
+            find_or_store_profile(a.klass, contagious_has_many).load_ids(profile.get_new_associated_ids(a))
+          end
         end
       end
     end
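The code comment in the hunk above explains that `.dup` is needed because a Hash cannot gain keys while it is being iterated. A small standalone demonstration of that Ruby behaviour follows; the `profiles` hash and its keys are hypothetical placeholders, not the gem's data.

```ruby
# Hypothetical registry keyed by model name.
profiles = { post: :profile }

# Adding a key while iterating the same Hash raises a RuntimeError.
ERROR = begin
  profiles.each_key { |model| profiles[:"#{model}_parent"] = :profile }
  nil
rescue RuntimeError => e
  e
end

# Iterating over a shallow copy leaves the original Hash free to grow.
profiles.dup.each_key { |model| profiles[:"#{model}_parent"] = :profile }
KEYS_AFTER = profiles.keys
```

Keys stored during a pass over the copy are not visited in that same pass. In the loop shown in the diff that appears acceptable, since newly stored profiles are picked up on the next iteration of the `while`.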
@@ -33,24 +40,12 @@ class Distyll
 
   private
 
-  def
-    @model_profiles
-    base_models.each do |bm|
-      @model_profiles = potentially_add_profiles(bm, @model_profiles)
-    end
-  end
-
-  def potentially_add_profiles(model, profiles)
-    return profiles if profiles.include? model
-    profiles[model] = DistyllModelProfile.new(model)
-    profiles[model].associations.each do |a|
-      profiles = potentially_add_profiles(a.klass, profiles)
-    end
-    profiles
+  def self.current_count
+    @model_profiles.each_value.sum &:get_id_count
   end
 
-  def
-    model_profiles.
+  def self.find_or_store_profile(model, include_has_many)
+    @model_profiles[model] ||= DistyllModelProfile.new(model, include_has_many)
   end
 
 end
@@ -58,46 +53,51 @@ end
 
 
 class DistyllModelProfile
-  attr_reader :model, :record_count, :associations, :all_ids, :
+  attr_reader :model, :include_has_many, :record_count, :associations, :all_ids, :prior_ids, :current_ids
 
-  def initialize(m)
+  def initialize(m, include_h_m = false)
     @model = m
+    @include_has_many = include_h_m
     @record_count = m.count
-    @all_ids =
-    @
-    @
+    @all_ids = Set.new
+    @prior_ids = Set.new
+    @current_ids = Set.new
     set_associations
   end
 
-  def
-    @
-    @
+  def demote_current_ids
+    @prior_ids = @current_ids
+    @current_ids = Set.new
   end
 
   def load_ids_by_timestamp(timestamp)
     ids = model.where("created_at >= ?", timestamp).select(:id).map &:id
-    @
-    @all_ids
+    @current_ids.merge(ids)
+    @all_ids.merge(ids)
   end
 
   def load_ids(ids)
-    @
-    @all_ids
+    @current_ids.merge(ids)
+    @all_ids.merge(ids)
   end
 
   def get_id_count
-    @all_ids = @all_ids.uniq || []
     @all_ids.count
   end
 
   def get_new_associated_ids(a)
-
+    if a.belongs_to?
+      model.where(id: prior_ids.to_a).select(a.foreign_key).map { |r| r.send(a.foreign_key) }
+    else
+      # Polymorphism could slow us down here, causing us to pull more records than we want to.
+      a.klass.where(a.foreign_key => prior_ids.to_a).select(:id).map { |r| r.send(:id) }
+    end
   end
 
   def copy_records
     return nil if all_ids.blank?
 
-    records = model.where(id: all_ids).load
+    records = model.where(id: all_ids.to_a).load
 
     model.establish_connection("distyll")
     records.each { |record| model.new(record.attributes).save!(validate: false) }
@@ -111,8 +111,10 @@ class DistyllModelProfile
   def set_associations
     @associations = Array.new
     model.reflect_on_all_associations.each do |association|
-      if association.
-
+      if association.through_reflection.nil?
+        if association.belongs_to? || self.include_has_many
+          @associations << association
+        end
       end
     end
   end
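The two branches of `get_new_associated_ids` in the diff above look up ids in opposite directions: a belongs_to reads the foreign key from the model's own rows, while the other association types search the associated table by foreign key and return its primary keys. The sketch below mirrors that with plain hashes instead of ActiveRecord. `POSTS`, `belongs_to_ids` and `has_many_ids` are hypothetical names used only for illustration.

```ruby
require "set"

# Hypothetical rows of a posts table, each holding a foreign key to a user.
POSTS = [
  { id: 10, user_id: 100 },
  { id: 11, user_id: 100 },
  { id: 12, user_id: 101 }
].freeze

# "Up": read the foreign key from the rows whose own ids are in the frontier.
def belongs_to_ids(own_rows, prior_ids, foreign_key)
  own_rows.select { |row| prior_ids.include?(row[:id]) }.map { |row| row[foreign_key] }
end

# "Down": find associated rows whose foreign key points at a frontier id.
def has_many_ids(associated_rows, prior_ids, foreign_key)
  associated_rows.select { |row| prior_ids.include?(row[foreign_key]) }.map { |row| row[:id] }
end

USER_IDS = belongs_to_ids(POSTS, Set[10, 11], :user_id)
POST_IDS = has_many_ids(POSTS, Set[100], :user_id)
```

The "up" lookup can return duplicates when several rows share a parent. Merging the result into a `Set`, as `load_ids` does in the diff, removes them.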
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: distyll
 version: !ruby/object:Gem::Version
-  version:
+  version: 1.0.0
 platform: ruby
 authors:
 - Mason F. Matthews
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-
+date: 2014-06-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: shoulda