distyll 0.0.0 → 1.0.0

Files changed (6)
  1. checksums.yaml +4 -4
  2. data/README.rdoc +41 -10
  3. data/VERSION +1 -1
  4. data/distyll.gemspec +2 -2
  5. data/lib/distyll.rb +49 -47
  6. metadata +2 -2
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: b258d9e95565fd009d54bc4a70893f7ec7644b2e
-  data.tar.gz: 06714e3d3c272d347c331e27bf9bfbf6d78f1df4
+  metadata.gz: 355593d87b582eb84240e2a7792dbca8f3f28fae
+  data.tar.gz: 68405650d538248c22adeb704d1025198d2f1332
 SHA512:
-  metadata.gz: aa597d98f6dc4ae3848ae092efc3a2e2f3f511be6a05468f89527e74004f53fccd45ceb8b8b25ea053b846b4c7f1fa5149576e20f54aed61bdc3817aaf4d9a28
-  data.tar.gz: 92a4ef0ea98c1ccdbb3b38564e8438aa0b5502e6c3ecdad89046dc3eb038bb4a33a7b1947d388a61d7b10327bc555b0e4312eec2fdaabdc69e7435102f2724bb
+  metadata.gz: a8696cafb27c1076f98a3b11cbc11c2045e6db1970c6ec7ce2373090a874522478f04bdb29bb6bc4de731bcb4b731e230668973244b4ab2dbd0d6ffeb2487a34
+  data.tar.gz: 511082bfadd2ae053c232f501d1838a644f84954de26e472ec1bbecab43cba0bf1976f6833e30778084aaf444594ce0c3a5907a36422926a61e83e1cfa31d280
data/README.rdoc CHANGED
@@ -11,14 +11,38 @@ However, a record created today may have an associated record (via a foreign key
 years ago. If you slice the entire database by created_at timestamps, you'll have foreign keys which point
 nowhere. Not very helpful for ensuring that your new feature works on production data.
 
-Distyll's solution is to start from a set of "core" ActiveRecord models supplied at initialization time
+Distyll's solution is to start from a set of "core" ActiveRecord models supplied at runtime
 (plus a date threshold for these models), and only pull those that have been created since the date
-threshold. It then traverses all belongs_to relationships from those core models and pulls in all of those
-related records.
+threshold. It then traverses associations from the core models and pulls in associated records (details
+in the next section).
 
 Consequently, you end up with a data set that is representative of production, is internally consistent,
 and is smaller.
 
+== Associations and associated records
+
+Distyll starts from the core models and stores the ids of all core records with created_at timestamps
+after the date threshold. It also marks these core models as <code>include_has_many</code>. This means
+that <code>has_many</code> and <code>has_and_belongs_to_many</code> relationships SHOULD be traversed out
+of the model. Distyll then counts all the pertinent ids across all pertinent models and starts a loop.
+
+At each step of the loop, for each model in distyll's list of pertinent models, it reaches out via all
+belongs_to associations (and other associations if the model is marked as <code>include_has_many</code>).
+Using these associations, it expands both (a) the set of pertinent models to be copied and (b) the
+pertinent ids for each of the pertinent models. In each step, it is only reaching out to associated
+models for pertinent ids which are new since the previous loop. This cuts down on query time.
+
+There is a trick with has_many association traversal. The core models do allow for traversal "down" (into
+has_many and habtm associations), and any direct descendants of the core models continue traversal down.
+However, if a descendant's associations are then followed "up" (through a belongs_to), then that parent
+cannot subsequently traverse down. In essence, "include_has_many" is contagious down, but not up.
+
+Distyll also ignores "through" relationships. It assumes that those associations will be covered via the
+multi-step non-through traversal.
+
+The looping stops when the total number of ids has not increased since the previous loop. Iterating with
+a while was necessary due to self-referential joins. They prevent the use of a clean recursive algorithm.
+
 == Using distyll in your project
 
 1. Add <code>gem 'distyll'</code> to your gemfile
@@ -27,9 +51,11 @@ and is smaller.
 1. Run <code>rake db:create RAILS_ENV=distyll</code>
 1. Run <code>rake db:schema:load RAILS_ENV=distyll</code>
 1. Run <code>rails console</code>
-1. Call <code>Distyll.new(model_names, created_since)</code>, passing it an array of strings of the core models and a date after which core records will be copied.
+1. Call <code>Distyll.run(model_names, created_since)</code>, passing it an array of the core models and a
+date after which core records will be copied.
 
-If you need to clear out the distyll database and try again with different parameters, just go back to the <code>schema:load</code> step and continue from there.
+If you need to clear out the distyll database and try again with different parameters, just go back to the
+<code>schema:load</code> step and continue from there.
 
 == Contributing to distyll
 
@@ -39,14 +65,19 @@ If you need to clear out the distyll database and try again with different param
 * Start a feature/bugfix branch.
 * Commit and push until you are happy with your contribution.
 * Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
-* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
+* Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or
+is otherwise necessary, that is fine, but please isolate to its own commit so I can cherry-pick around it.
 
 == Next Steps for distyll
 
-* Distyll only traverses belongs_to associations for now. Need to consider other association types.
-* Is likely to cause problems with single table inheritance. Could probably refer to table names rather than model names when traversing relationships... but this would still be an issue for the base models.
-* Currently performs "IN" query. In Oracle, this is limited to 1000 values, so I would need to chunk them for that DBMS.
-* Tests. I know. I just don't yet have my head around how to test something that's SO model- and database-centric, when those models and databases aren't present in the gem. Any advice would be appreciated.
+* When finding pertinent ids across polymorphic has_many and habtm associations, Distyll currently ignores
+the _type field, which may cause it to save more ids from the associated table than needed (if the foreign
+key matches some other record in a different table but the _type doesn't match the current model.)
+* Currently performs "IN" queries. In Oracle, this is limited to 1000 values, so I would need to chunk
+them for that DBMS.
+* Tests. I know. I just don't yet have my head around how to test something that's SO model- and
+database-centric, when those models and databases aren't present in the gem. Any advice would be
+appreciated.
 
 == Copyright
 
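The loop described in the new "Associations and associated records" README section can be illustrated without ActiveRecord. The following is only a rough sketch under stated assumptions, not the gem's implementation: the method name <code>pertinent_ids</code>, the in-memory <code>tables</code> hash, and the <code>belongs_to</code> map are all hypothetical stand-ins for models and reflections, and only belongs_to traversal is modeled. It shows why iterating until the id count stops growing terminates even when a self-referential join forms a cycle.

```ruby
require 'set'

# Hypothetical sketch: follow belongs_to foreign keys outward from the core
# ids until a pass adds no new ids.
# tables:     { table => { id => { foreign_key => parent_id } } }  (toy database)
# belongs_to: { table => { foreign_key => target_table } }         (toy reflections)
def pertinent_ids(tables, belongs_to, core_ids)
  all_ids = Hash.new { |hash, table| hash[table] = Set.new }
  current = Hash.new { |hash, table| hash[table] = Set.new }
  core_ids.each do |table, ids|
    all_ids[table].merge(ids)
    current[table].merge(ids)
  end

  prior_count = -1
  loop do
    count = all_ids.values.sum(&:size)
    break if count == prior_count # the last pass found nothing new
    prior_count = count

    # Only ids discovered in the previous pass are followed in this one.
    prior = current
    current = Hash.new { |hash, table| hash[table] = Set.new }
    prior.each do |table, ids|
      belongs_to.fetch(table, {}).each do |foreign_key, target|
        parent_ids = ids.map { |id| tables[table][id][foreign_key] }.compact
        fresh = parent_ids.reject { |id| all_ids[target].include?(id) }
        all_ids[target].merge(fresh)
        current[target].merge(fresh)
      end
    end
  end
  all_ids
end

# Toy data: customers refer to each other in a cycle (10 -> 11 -> 10).
tables = {
  orders:    { 1 => { customer_id: 10 }, 2 => { customer_id: 12 } },
  customers: { 10 => { referrer_id: 11 }, 11 => { referrer_id: 10 }, 12 => { referrer_id: nil } }
}
belongs_to = {
  orders:    { customer_id: :customers },
  customers: { referrer_id: :customers }
}

result = pertinent_ids(tables, belongs_to, { orders: [1] })
```

With order 1 as the only core record, the result holds order 1 plus customers 10 and 11. Order 2 and customer 12 are left out because nothing pertinent points at them.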
data/VERSION CHANGED
@@ -1 +1 @@
-0.0.0
+1.0.0
data/distyll.gemspec CHANGED
@@ -5,11 +5,11 @@
 
 Gem::Specification.new do |s|
   s.name = "distyll"
-  s.version = "0.0.0"
+  s.version = "1.0.0"
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Mason F. Matthews"]
-  s.date = "2014-06-03"
+  s.date = "2014-06-20"
   s.description = "Have you ever had a 100GB production database and been unable to test on an internally consistent subset of the data? Distyll is your answer."
   s.email = "mason.f.matthews@gmail.com"
   s.extra_rdoc_files = [
data/lib/distyll.rb CHANGED
@@ -1,26 +1,33 @@
 class Distyll
 
-  attr_reader :base_models, :model_profiles, :created_since
-
-  def initialize(bms, cs)
-    @created_since = cs.to_date
-    @base_models = bms.map &:constantize
-    set_model_profiles
-  end
+  def self.run(base_models, created_since)
+    @model_profiles = Hash.new
 
-  def run
     base_models.each do |model|
-      @model_profiles[model].load_ids_by_timestamp(@created_since)
+      if @model_profiles[model].nil?
+        base_profile = DistyllModelProfile.new(model, true)
+        base_profile.load_ids_by_timestamp(created_since)
+        @model_profiles[model] = base_profile
+      end
     end
 
     prior_count = -1
     while prior_count != current_count
       prior_count = current_count
-      @model_profiles.each_value &:demote_new_ids
-
-      @model_profiles.each_value do |profile|
-        profile.associations.each do |a|
-          @model_profiles[a.klass].load_ids(profile.get_new_associated_ids(a))
+      @model_profiles.each_value &:demote_current_ids
+
+      # .dup is necessary here because of Ruby 1.9's "RuntimeError: can't add a new
+      # key into hash during iteration" encountered in self.find_or_store_profile.
+      @model_profiles.dup.each_value do |profile|
+        unless profile.prior_ids.blank?
+          profile.associations.each do |a|
+            # We DO want to make the associated profile continue to traverse has_manies if
+            # (a) The current profile traverses has_manies AND
+            # (b) The association we're about to traverse is a has_many.
+            contagious_has_many = profile.include_has_many && !(a.belongs_to? || a.has_and_belongs_to_many?)
+
+            find_or_store_profile(a.klass, contagious_has_many).load_ids(profile.get_new_associated_ids(a))
+          end
         end
       end
     end
@@ -33,24 +40,12 @@ class Distyll
 
   private
 
-  def set_model_profiles
-    @model_profiles = Hash.new
-    base_models.each do |bm|
-      @model_profiles = potentially_add_profiles(bm, @model_profiles)
-    end
-  end
-
-  def potentially_add_profiles(model, profiles)
-    return profiles if profiles.include? model
-    profiles[model] = DistyllModelProfile.new(model)
-    profiles[model].associations.each do |a|
-      profiles = potentially_add_profiles(a.klass, profiles)
-    end
-    profiles
+  def self.current_count
+    @model_profiles.each_value.sum &:get_id_count
   end
 
-  def current_count
-    model_profiles.each_value.sum &:get_id_count
+  def self.find_or_store_profile(model, include_has_many)
+    @model_profiles[model] ||= DistyllModelProfile.new(model, include_has_many)
   end
 
 end
@@ -58,46 +53,51 @@ end
 
 
 class DistyllModelProfile
-  attr_reader :model, :record_count, :associations, :all_ids, :last_ids, :new_ids
+  attr_reader :model, :include_has_many, :record_count, :associations, :all_ids, :prior_ids, :current_ids
 
-  def initialize(m)
+  def initialize(m, include_h_m = false)
     @model = m
+    @include_has_many = include_h_m
     @record_count = m.count
-    @all_ids = Array.new
-    @last_ids = Array.new
-    @new_ids = Array.new
+    @all_ids = Set.new
+    @prior_ids = Set.new
+    @current_ids = Set.new
     set_associations
   end
 
-  def demote_new_ids
-    @last_ids = @new_ids
-    @new_ids = Array.new
+  def demote_current_ids
+    @prior_ids = @current_ids
+    @current_ids = Set.new
   end
 
   def load_ids_by_timestamp(timestamp)
     ids = model.where("created_at >= ?", timestamp).select(:id).map &:id
-    @new_ids += ids
-    @all_ids += ids
+    @current_ids.merge(ids)
+    @all_ids.merge(ids)
   end
 
   def load_ids(ids)
-    @new_ids += ids
-    @all_ids += ids
+    @current_ids.merge(ids)
+    @all_ids.merge(ids)
   end
 
   def get_id_count
-    @all_ids = @all_ids.uniq || []
     @all_ids.count
   end
 
   def get_new_associated_ids(a)
-    model.where(id: last_ids).select(a.foreign_key).map { |r| r.send(a.foreign_key) }
+    if a.belongs_to?
+      model.where(id: prior_ids.to_a).select(a.foreign_key).map { |r| r.send(a.foreign_key) }
+    else
+      # Polymorphism could slow us down here, causing us to pull more records than we want to.
+      a.klass.where(a.foreign_key => prior_ids.to_a).select(:id).map { |r| r.send(:id) }
+    end
   end
 
   def copy_records
     return nil if all_ids.blank?
 
-    records = model.where(id: all_ids).load
+    records = model.where(id: all_ids.to_a).load
 
     model.establish_connection("distyll")
     records.each { |record| model.new(record.attributes).save!(validate: false) }
@@ -111,8 +111,10 @@ class DistyllModelProfile
   def set_associations
     @associations = Array.new
     model.reflect_on_all_associations.each do |association|
-      if association.belongs_to? && association.through_reflection.nil?
-        @associations << association
+      if association.through_reflection.nil?
+        if association.belongs_to? || self.include_has_many
+          @associations << association
+        end
       end
     end
   end
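The comment in <code>Distyll.run</code> about <code>.dup</code> refers to a general Ruby behavior that can be reproduced in isolation. The snippet below is a small illustration with a hypothetical <code>profiles</code> hash, not code from the gem: adding a key to a hash while iterating over it raises a <code>RuntimeError</code>, whereas iterating over a shallow copy lets the original grow.

```ruby
# Hypothetical stand-in for @model_profiles.
profiles = { order: 1 }

# Adding a new key while iterating the same hash raises RuntimeError.
error = nil
begin
  profiles.each_value { |count| profiles[:customer] = count + 1 }
rescue RuntimeError => e
  error = e
end

# Iterating over a shallow copy leaves the original hash free to grow.
# Keys added during this pass are not visited until the next pass.
profiles.dup.each_value { |count| profiles[:customer] ||= count + 1 }
```

In the diff above, profiles added during one pass are picked up on the next iteration of the <code>while</code> loop, so skipping them within the current pass does not lose any records.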
metadata CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: distyll
 version: !ruby/object:Gem::Version
-  version: 0.0.0
+  version: 1.0.0
 platform: ruby
 authors:
 - Mason F. Matthews
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-03 00:00:00.000000000 Z
+date: 2014-06-20 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: shoulda