upsert 0.2.2 → 0.3.0

data/CHANGELOG CHANGED
@@ -1,8 +1,15 @@
+ 0.3.0 / 2012-06-21
+
+ * Enhancements
+
+ * Remove all the sampling - just keep a cumulative total of sql bytes as we build up an ON DUPLICATE KEY UPDATE query.
+ * Deprecate Upsert.stream in favor of Upsert.batch (but provide an alias for backwards compat)
+
  0.2.2 / 2012-06-21

  * Bug fixes

- * Correct and simplify how sql length is calculated when batching (streaming) MySQL upserts.
+ * Correct and simplify how sql length is calculated when batching MySQL upserts.

  0.2.1 / 2012-06-21

data/README.md CHANGED
@@ -10,24 +10,24 @@ The second argument is currently (mis)named a "document" because this was inspir

  ### One by one

- Faster than just doing `Pet.create`... 85% faster on PostgreSQL, for example. But no validations or anything.
+ Faster than just doing `Pet.create`... 85% faster on PostgreSQL, for example, than all the different native ActiveRecord methods I've tried. But no validations or anything.

      upsert = Upsert.new Pet.connection, Pet.table_name
      upsert.row({:name => 'Jerry'}, :breed => 'beagle')
      upsert.row({:name => 'Pierre'}, :breed => 'tabby')

- ### Streaming
+ ### Batch mode

  Rows are buffered in memory until it's efficient to send them to the database. Currently this only provides an advantage on MySQL because it uses `ON DUPLICATE KEY UPDATE`... but if a similar method appears in PostgreSQL, the same code will still work.

-     Upsert.stream(Pet.connection, Pet.table_name) do |upsert|
+     Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
        upsert.row({:name => 'Jerry'}, :breed => 'beagle')
        upsert.row({:name => 'Pierre'}, :breed => 'tabby')
      end

  ### `ActiveRecord::Base.upsert` (optional)

- For bulk upserts, you probably still want to use `Upsert.stream`.
+ For bulk upserts, you probably still want to use `Upsert.batch`.

      require 'upsert/active_record_upsert'
      Pet.upsert({:name => 'Jerry'}, :breed => 'beagle')
@@ -37,7 +37,7 @@ For bulk upserts, you probably still want to use `Upsert.stream`.

  Currently, the first row you pass in determines the columns that will be used. That's useful for mass importing of many rows with the same columns, but is surprising if you're trying to use a single `Upsert` object to add arbitrary data. For example, this won't work:

-     Upsert.stream(Pet.connection, Pet.table_name) do |upsert|
+     Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
        upsert.row({:name => 'Jerry'}, :breed => 'beagle')
        upsert.row({:tag_number => 456}, :spiel => 'great cat') # won't work - doesn't use same columns
      end
@@ -51,10 +51,9 @@ You would need to use a new `Upsert` object. On the other hand, this is totally

  Pull requests for any of these would be greatly appreciated:

- 1. Somebody who understands statistics should look at how I'm sampling rows in `Upsert::Mysql2_Client#estimate_variable_sql_bytesize`... I think we can assume that row sizes are random, so I don't think we actually have to select random elements.
- 2. Fix SQLite tests.
- 3. If you think there's a fix for the "fixed column set" gotcha...
- 4. Naming suggestions: should "document" be called "setters" or "attributes"? Should "stream" be "batch" instead?
+ 1. Fix SQLite tests.
+ 2. If you think there's a fix for the "fixed column set" gotcha...
+ 3. Naming suggestions: should "document" be called "setters" or "attributes"?

  ## Real-world usage

@@ -219,7 +218,7 @@ You could also use [activerecord-import](https://github.com/zdennis/activerecord

      Pet.import columns, all_values, :timestamps => false, :on_duplicate_key_update => columns

- This, however, only works on MySQL and requires ActiveRecord—and if all you are doing is upserts, `upsert` is tested to be 40% faster. And you don't have to put all of the rows to be upserted into a single huge array - you can stream them using `Upsert.stream`.
+ This, however, only works on MySQL and requires ActiveRecord—and if all you are doing is upserts, `upsert` is tested to be 40% faster. And you don't have to put all of the rows to be upserted into a single huge array - you can batch them using `Upsert.batch`.

  ## Copyright

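To make the README's batch-mode claim concrete: on MySQL, every buffered row is compiled into a single multi-row statement. A hedged sketch of the query shape, reusing the README's `pets` example (the exact SQL the gem emits may differ):

    # Roughly what one flush of Upsert.batch sends to MySQL:
    #
    #   INSERT INTO `pets` (`name`, `breed`)
    #   VALUES ('Jerry', 'beagle'), ('Pierre', 'tabby')
    #   ON DUPLICATE KEY UPDATE `name` = VALUES(`name`), `breed` = VALUES(`breed`)
    #
    # so N buffered rows cost one round-trip instead of N.
    Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
      upsert.row({:name => 'Jerry'}, :breed => 'beagle')
      upsert.row({:name => 'Pierre'}, :breed => 'tabby')
    end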
data/lib/upsert.rb CHANGED
@@ -16,7 +16,7 @@ class Upsert
        Binary.new v
      end

-     # @yield [Upsert] An +Upsert+ object in streaming mode. You can call #row on it multiple times and it will try to optimize on speed.
+     # @yield [Upsert] An +Upsert+ object in batch mode. You can call #row on it multiple times and it will try to optimize on speed.
      #
      # @note Buffered in memory until it's efficient to send to the server a packet.
      #
@@ -25,16 +25,19 @@ class Upsert
      # @return [nil]
      #
      # @example Many at once
-     #   Upsert.stream(Pet.connection, Pet.table_name) do |upsert|
+     #   Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
      #     upsert.row({:name => 'Jerry'}, :breed => 'beagle')
      #     upsert.row({:name => 'Pierre'}, :breed => 'tabby')
      #   end
-     def stream(connection, table_name)
+     def batch(connection, table_name)
        upsert = new connection, table_name
        upsert.async!
        yield upsert
        upsert.sync!
      end
+
+     # @deprecated Use .batch instead.
+     alias :stream :batch
    end

  # Raised if a query would be too large to send in a single packet.
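The `alias :stream :batch` above is what keeps 0.2.x callers working: both names dispatch to the same method. A minimal illustration, assuming the README's `Pet` model:

    # Deprecated spelling, still works through the alias:
    Upsert.stream(Pet.connection, Pet.table_name) { |u| u.row({:name => 'Jerry'}, :breed => 'beagle') }
    # Preferred spelling as of 0.3.0:
    Upsert.batch(Pet.connection, Pet.table_name)  { |u| u.row({:name => 'Jerry'}, :breed => 'beagle') }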
@@ -58,13 +61,13 @@ class Upsert
    attr_reader :table_name

    # @private
-   attr_reader :rows
+   attr_reader :buffer

    # @param [Mysql2::Client,Sqlite3::Database,PG::Connection,#raw_connection] connection A supported database connection.
    # @param [String,Symbol] table_name The name of the table into which you will be upserting.
    def initialize(connection, table_name)
      @table_name = table_name
-     @rows = []
+     @buffer = []

      @connection = if connection.respond_to?(:raw_connection)
        # deal with ActiveRecord::Base.connection or ActiveRecord::Base.connection_pool.checkout
@@ -92,7 +95,7 @@ class Upsert
    #   upsert.row({:name => 'Jerry'}, :breed => 'beagle')
    #   upsert.row({:name => 'Pierre'}, :breed => 'tabby')
    def row(selector, document)
-     rows << Row.new(self, selector, document)
+     buffer.push Row.new(self, selector, document)
      if sql = chunk
        execute sql
      end
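Note that the same `#row` path serves both modes: each call buffers a `Row`, asks the adapter's `#chunk` for a finished statement, and executes one only if it is ready. Inside `Upsert.batch`, which wraps the block in `async!`/`sync!`, `chunk` returns nil until a flush is due. A sketch of the observable behavior, inferred from the diff above:

    upsert = Upsert.new Pet.connection, Pet.table_name
    upsert.row({:name => 'Jerry'}, :breed => 'beagle')     # sync mode: chunk returns SQL, executed immediately

    Upsert.batch(Pet.connection, Pet.table_name) do |upsert|
      upsert.row({:name => 'Jerry'},  :breed => 'beagle')  # async mode: chunk returns nil, row is buffered
      upsert.row({:name => 'Pierre'}, :breed => 'tabby')   # still buffered
    end                                                    # sync! flushes what remains in one statement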
data/lib/upsert/mysql2_client.rb CHANGED
@@ -1,50 +1,34 @@
  class Upsert
    # @private
    module Mysql2_Client
-     SAMPLE = 0.1
-
      def chunk
-       return if rows.empty?
-       all = rows.length
-       take = all
-       while take > 1 and probably_oversize?(take)
-         take -= 1
-       end
-       if async? and take == all
-         return
-       end
-       while take > 2 and oversize?(take)
-         $stderr.puts " Length prediction via sampling failed, shrinking" if ENV['UPSERT_DEBUG'] == 'true'
-         take -= 2
+       return if buffer.empty?
+       if not async?
+         retval = sql
+         buffer.clear
+         return retval
        end
-       chunk = sql take
-       while take > 1 and chunk.bytesize > max_sql_bytesize
-         $stderr.puts " Supposedly exact bytesize guess failed, shrinking" if ENV['UPSERT_DEBUG'] == 'true'
-         take -= 1
-         chunk = sql take
-       end
-       if chunk.bytesize > max_sql_bytesize
-         raise TooBig
+       @cumulative_sql_bytesize ||= static_sql_bytesize
+       new_row = buffer.pop
+       d = new_row.values_sql_bytesize + 3 # ),(
+       if @cumulative_sql_bytesize + d > max_sql_bytesize
+         retval = sql
+         buffer.clear
+         @cumulative_sql_bytesize = static_sql_bytesize + d
+       else
+         retval = nil
+         @cumulative_sql_bytesize += d
        end
-       $stderr.puts " Chunk (#{take}/#{chunk.bytesize}) was #{(chunk.bytesize / max_sql_bytesize.to_f * 100).round}% of the max" if ENV['UPSERT_DEBUG'] == 'true'
-       @rows = rows.drop(take)
-       chunk
+       buffer.push new_row
+       retval
      end

      def execute(sql)
        connection.query sql
      end

-     def probably_oversize?(take)
-       estimate_sql_bytesize(take) > max_sql_bytesize
-     end
-
-     def oversize?(take)
-       sql_bytesize(take) > max_sql_bytesize
-     end
-
      def columns
-       @columns ||= rows.first.columns
+       @columns ||= buffer.first.columns
      end

      def insert_part
@@ -65,48 +49,11 @@ class Upsert
        @static_sql_bytesize ||= insert_part.bytesize + update_part.bytesize + 2
      end

-
-     def variable_sql_bytesize(take)
-       memo = rows.first(take).inject(0) { |sum, row| sum + row.values_sql_bytesize }
-       if take > 0
-         # parens and comma
-         memo += 3*(take-1)
-       end
-       memo
-     end
-
-     def estimate_variable_sql_bytesize(take)
-       n = (take * SAMPLE).ceil
-       sample = if RUBY_VERSION >= '1.9'
-         rows.first(take).sample(n)
-       else
-         # based on https://github.com/marcandre/backports/blob/master/lib/backports/1.8.7/array.rb
-         memo = rows.first(take)
-         n.times do |i|
-           r = i + Kernel.rand(take - i)
-           memo[i], memo[r] = memo[r], memo[i]
-         end
-         memo.first(n)
-       end
-       memo = sample.inject(0) { |sum, row| sum + row.values_sql_bytesize } / SAMPLE
-       if take > 0
-         # parens and comma
-         memo += 3*(take-1)
-       end
-       memo
-     end
-
-     def sql_bytesize(take)
-       static_sql_bytesize + variable_sql_bytesize(take)
-     end
-
-     def estimate_sql_bytesize(take)
-       static_sql_bytesize + estimate_variable_sql_bytesize(take)
-     end
-
-     def sql(take)
-       all_value_sql = rows.first(take).map { |row| row.values_sql }
-       [ insert_part, '(', all_value_sql.join('),('), ')', update_part ].join
+     def sql
+       all_value_sql = buffer.map { |row| row.values_sql }
+       retval = [ insert_part, '(', all_value_sql.join('),('), ')', update_part ].join
+       raise TooBig if retval.bytesize > max_sql_bytesize
+       retval
      end

      # since setting an option like :as => :hash actually persists that option to the client, don't pass any options
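The accounting that replaced the sampling can be traced by hand: the running total starts at `static_sql_bytesize` (the INSERT and ON DUPLICATE KEY UPDATE scaffolding), each row adds its `values_sql_bytesize` plus 3 bytes for the `),(` separator, and the buffer is flushed just before the total would exceed `max_sql_bytesize`. A standalone sketch with made-up numbers; real values depend on your rows and the server's max_allowed_packet:

    # Illustrative numbers only; mirrors the bookkeeping in #chunk above.
    max_sql_bytesize = 1_048_576           # e.g. derived from max_allowed_packet
    static           = 120                 # insert_part + update_part + 2
    cumulative       = static
    [380, 512, 1_047_900].each do |values_sql_bytesize|
      d = values_sql_bytesize + 3          # 3 extra bytes for "),("
      if cumulative + d > max_sql_bytesize
        # flush: send the statement built so far, start fresh with this row
        cumulative = static + d
      else
        cumulative += d                    # keep buffering
      end
    end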
data/lib/upsert/pg_connection.rb CHANGED
@@ -7,8 +7,8 @@ class Upsert
      attr_reader :merge_function

      def chunk
-       return if rows.empty?
-       row = rows.shift
+       return if buffer.empty?
+       row = buffer.shift
        unless merge_function
          create_merge_function row
        end
data/lib/upsert/sqlite3_database.rb CHANGED
@@ -2,8 +2,8 @@ class Upsert
    # @private
    module SQLite3_Database
      def chunk
-       return if rows.empty?
-       row = rows.shift
+       return if buffer.empty?
+       row = buffer.shift
        %{INSERT OR IGNORE INTO "#{table_name}" (#{row.columns_sql}) VALUES (#{row.values_sql});UPDATE "#{table_name}" SET #{row.set_sql} WHERE #{row.where_sql}}
      end

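For contrast with the MySQL path, there is no batching win on SQLite, so each row becomes the two-statement trick in the template above. Filled in with the README's Jerry example (hypothetical values; actual quoting is handled by `Row`), one chunk would look roughly like:

    # Approximate output of the SQLite3 #chunk above for a single row:
    #
    #   INSERT OR IGNORE INTO "pets" ("name", "breed") VALUES ('Jerry', 'beagle');
    #   UPDATE "pets" SET "breed" = 'beagle' WHERE "name" = 'Jerry'
    #
    # The INSERT is a no-op if a pet named Jerry already exists; the
    # UPDATE then applies the document either way.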
data/lib/upsert/version.rb CHANGED
@@ -1,3 +1,3 @@
  class Upsert
-   VERSION = "0.2.2"
+   VERSION = "0.3.0"
  end
@@ -77,7 +77,7 @@ MiniTest::Spec.class_eval do

      Pet.delete_all

-     Upsert.stream(connection, :pets) do |upsert|
+     Upsert.batch(connection, :pets) do |upsert|
        records.each do |selector, document|
          upsert.row(selector, document)
        end
@@ -109,7 +109,7 @@ MiniTest::Spec.class_eval do
      sleep 1

      upsert_time = Benchmark.realtime do
-       Upsert.stream(connection, :pets) do |upsert|
+       Upsert.batch(connection, :pets) do |upsert|
          records.each do |selector, document|
            upsert.row(selector, document)
          end
@@ -44,17 +44,17 @@ shared_examples_for 'is a database with an upsert trick' do
          end
        end
      end
-   describe :stream do
+   describe :batch do
      it "works for multiple rows (base case)" do
        assert_creates(Pet, [{:name => 'Jerry', :gender => 'male'}]) do
-         Upsert.stream(connection, :pets) do |upsert|
+         Upsert.batch(connection, :pets) do |upsert|
            upsert.row({:name => 'Jerry'}, :gender => 'male')
          end
        end
      end
      it "works for multiple rows (not changing anything)" do
        assert_creates(Pet, [{:name => 'Jerry', :gender => 'male'}]) do
-         Upsert.stream(connection, :pets) do |upsert|
+         Upsert.batch(connection, :pets) do |upsert|
            upsert.row({:name => 'Jerry'}, :gender => 'male')
            upsert.row({:name => 'Jerry'}, :gender => 'male')
          end
@@ -62,7 +62,7 @@ shared_examples_for 'is a database with an upsert trick' do
      end
      it "works for multiple rows (changing something)" do
        assert_creates(Pet, [{:name => 'Jerry', :gender => 'neutered'}]) do
-         Upsert.stream(connection, :pets) do |upsert|
+         Upsert.batch(connection, :pets) do |upsert|
            upsert.row({:name => 'Jerry'}, :gender => 'male')
            upsert.row({:name => 'Jerry'}, :gender => 'neutered')
          end
@@ -13,9 +13,9 @@ shared_examples_for "supports multibyte" do
          upsert.row({:name => 'I♥NY'}, {:gender => 'jÚrgen'})
        end
      end
-   it "works streaming" do
+   it "works batch" do
      assert_creates(Pet, [{:name => 'I♥NY', :gender => 'jÚrgen'}]) do
-       Upsert.stream(connection, :pets) do |upsert|
+       Upsert.batch(connection, :pets) do |upsert|
          upsert.row({:name => 'I♥NY'}, {:gender => 'périferôl'})
          upsert.row({:name => 'I♥NY'}, {:gender => 'jÚrgen'})
        end
@@ -13,9 +13,9 @@ shared_examples_for 'is thread-safe' do
          end
        end
      end
-   it "is safe to use streaming" do
+   it "is safe to use batch" do
      assert_creates(Pet, [{:name => 'Jerry', :gender => 'neutered'}]) do
-       Upsert.stream(connection, :pets) do |upsert|
+       Upsert.batch(connection, :pets) do |upsert|
          ts = []
          10.times do
            ts << Thread.new do
@@ -38,69 +38,4 @@ describe Upsert::Mysql2_Client do
    it_also "doesn't mess with timezones"

    it_also "doesn't blow up on reserved words"
-
-   describe '#sql_bytesize' do
-     def assert_exact(selector_proc, document_proc, show = false)
-       upsert = Upsert.new connection, :pets
-       0.upto(256) do |i|
-         upsert.rows << Upsert::Row.new(upsert, selector_proc.call(i), document_proc.call(i))
-         i.upto(upsert.rows.length) do |take|
-           expected_sql = upsert.sql(take)
-           actual = upsert.sql_bytesize(take)
-           if show and actual != expected_sql.bytesize
-             $stderr.puts
-             $stderr.puts "Expected: #{expected_sql.bytesize}"
-             $stderr.puts "Actual: #{actual}"
-             $stderr.puts expected_sql
-           end
-           actual.must_equal expected_sql.bytesize
-         end
-       end
-     end
-     def rand_string(length)
-       # http://www.dzone.com/snippets/generate-random-string-letters
-       # Array.new(length) { (rand(122-97) + 97).chr }.join
-       if RUBY_VERSION >= '1.9'
-         Array.new(length) { rand(512).chr(Encoding::UTF_8) }.join
-       else
-         Array.new(length) { rand(512) }.pack('C*')
-       end
-     end
-     it "is exact as selector length changes" do
-       selector_proc = proc do |i|
-         { :name => rand_string(i) }
-       end
-       document_proc = proc do |i|
-         {}
-       end
-       assert_exact selector_proc, document_proc
-     end
-     it "is exact as value length changes" do
-       selector_proc = proc do |i|
-         { :name => 'Jerry' }
-       end
-       document_proc = proc do |i|
-         { :spiel => rand_string(i) }
-       end
-       assert_exact selector_proc, document_proc
-     end
-     it "is exact as both selector and value length change" do
-       selector_proc = proc do |i|
-         { :name => rand_string(i) }
-       end
-       document_proc = proc do |i|
-         { :spiel => rand_string(i) }
-       end
-       assert_exact selector_proc, document_proc
-     end
-     it "is exact with numbers too" do
-       selector_proc = proc do |i|
-         { :tag_number => rand(1e5) }
-       end
-       document_proc = proc do |i|
-         { :lovability => rand }
-       end
-       assert_exact selector_proc, document_proc
-     end
-   end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: upsert
  version: !ruby/object:Gem::Version
-   version: 0.2.2
+   version: 0.3.0
  prerelease:
  platform: ruby
  authors: