postgresql_cursor 0.4.3 → 0.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: c7761053fbe88cc359b47415f7e79b620defe326
4
- data.tar.gz: 8398abd146476cc6ddfa637e3d5215fcc446be38
3
+ metadata.gz: 6d9ff2e625bba6a119c092d2a7af6cc4b0ab2fde
4
+ data.tar.gz: d44d29ed0238aac7fc0597d6e1ae83763b0e7cdb
5
5
  SHA512:
6
- metadata.gz: 18ebbe7bfa3fdee1435909725be137bf7b895f55f3ad87620a5486697fa67f06afab79d117945643ecbe2e4b1f7e6d6c6e30364358763429beb2c720cfe1d9fe
7
- data.tar.gz: 482aeffdbe884a29ca1e05b8a860e897845c9cc86d134c4512475c2039c5f448c2585298c444f89a0afd4f4b781bd4b3221c2deccb321eba6b68f1d178abe49e
6
+ metadata.gz: 7451b81a33709fd76494f97e883a11d97d30611c7854ae712082a06899a14ce8b9bca28fd14925e06c21a695f1d1df20a1d5044883519f0e34148dbcca5b1313
7
+ data.tar.gz: 15350ff60d906758aeb41f104102f89a28d60e713343c62023a907a389c3244db305f0237de2890cc13e9dce17810da67765cc90e56a008ed40ec86f4f51a580
@@ -0,0 +1,24 @@
1
+ ## MAC OS
2
+ .DS_Store
3
+
4
+ ## TEXTMATE
5
+ *.tmproj
6
+ tmtags
7
+
8
+ ## EMACS
9
+ *~
10
+ \#*
11
+ .\#*
12
+
13
+ ## VIM
14
+ *.swp
15
+
16
+ ## IntelliJ/Rubymine
17
+ .idea
18
+
19
+ ## PROJECT::GENERAL
20
+ coverage
21
+ rdoc
22
+ pkg
23
+
24
+ ## PROJECT::SPECIFIC
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in postgresql_cursor.gemspec
4
+ gemspec
@@ -0,0 +1,41 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ postgresql_cursor (0.5.0)
5
+ activerecord (>= 3.2.0)
6
+
7
+ GEM
8
+ remote: https://rubygems.org/
9
+ specs:
10
+ activemodel (4.1.1)
11
+ activesupport (= 4.1.1)
12
+ builder (~> 3.1)
13
+ activerecord (4.1.1)
14
+ activemodel (= 4.1.1)
15
+ activesupport (= 4.1.1)
16
+ arel (~> 5.0.0)
17
+ activesupport (4.1.1)
18
+ i18n (~> 0.6, >= 0.6.9)
19
+ json (~> 1.7, >= 1.7.7)
20
+ minitest (~> 5.1)
21
+ thread_safe (~> 0.1)
22
+ tzinfo (~> 1.1)
23
+ arel (5.0.1.20140414130214)
24
+ builder (3.2.2)
25
+ i18n (0.6.9)
26
+ json (1.8.1)
27
+ minitest (5.3.3)
28
+ pg (0.17.1)
29
+ rake (10.3.1)
30
+ thread_safe (0.3.4)
31
+ tzinfo (1.2.1)
32
+ thread_safe (~> 0.1)
33
+
34
+ PLATFORMS
35
+ ruby
36
+
37
+ DEPENDENCIES
38
+ minitest
39
+ pg
40
+ postgresql_cursor!
41
+ rake
@@ -0,0 +1,185 @@
1
+ #PostgreSQLCursor for handling large Result Sets
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/postgresql_cursor.svg)](http://badge.fury.io/rb/postgresql_cursor)
4
+
5
+ PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries
6
+ returning a large number of rows, and allows you to sort your result set.
7
+
8
+ In PostgreSQL, a
9
+ [cursor](http://www.postgresql.org/docs/9.4/static/plpgsql-cursors.html)
10
+ runs a query, from which you fetch a block of
11
+ (say 1000) rows, process them, and continue fetching until the result
12
+ set is exhausted. By fetching a smaller chunk of data, this reduces the
13
+ amount of memory your application uses and prevents the potential crash
14
+ of running out of memory.
15
+
16
+ Version 0.5.0 has been refactored to install more smoothly into ActiveRecord.
17
+ It supports Rails and ActiveRecord 3.2.x and up.
18
+
19
+ ##Use Cursors
20
+
21
+ PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program
22
+ to declare a cursor to run a given query returning "chunks" of rows to the application program while
23
+ retaining the position of the full result set in the database. This overcomes all the disadvantages
24
+ of using find_each and find_in_batches.
25
+
26
+ Also, with PostgreSQL, you have an option to have raw hashes of the row returned instead of the
27
+ instantiated models. An informal benchmark showed that returning instances is 4 times
28
+ slower than returning hashes. If you can work with the data in this form, you will find better
29
+ performance.
30
+
31
+ With PostgreSQL, you can work with cursors as follows:
32
+
33
+ ```ruby
34
+ Product.where("id>0").order("name").each_row { |hash| Product.process(hash) }
35
+
36
+ Product.where("id>0").each_instance { |product| product.process! }
37
+ Product.where("id>0").each_instance(block_size:100_000) { |product| product.process }
38
+
39
+ Product.each_row { |hash| Product.process(hash) }
40
+ Product.each_instance { |product| product.process }
41
+
42
+ Product.each_row_by_sql("select * from products") { |hash| Product.process(hash) }
43
+ Product.each_instance_by_sql("select * from products") { |product| product.process }
44
+ ```
45
+
46
+ ###PostgreSQLCursor is an Enumerable
47
+
48
+ If you do not pass in a block, the cursor is returned, which mixes in the Enumerable
49
+ library. With that, you can pass it around, or chain in the awesome enumerable things
50
+ like `map` and `reduce`. Furthermore, the cursors already act as `lazy`, but you can
51
+ also chain in `lazy` when you want to keep the memory footprint small for the rest of the process.
52
+
53
+ ```ruby
54
+ Product.each_row.map {|r| r["id"].to_i } #=> [1, 2, 3, ...]
55
+ Product.each_instance.map {|r| r.id }.each {|id| p id } #=> [1, 2, 3, ...]
56
+ Product.each_instance.lazy.inject(0) {|sum,r| sum + r.quantity } #=> 499500
57
+ ```
58
+
59
+ All these methods take an options hash to control things more:
60
+
61
+ block_size:n The number of rows to fetch from the database each time (default 1000)
62
+ while:value Continue looping as long as the block returns this value
63
+ until:value Continue looping until the block returns this value
64
+ connection:conn Use this connection instead of the current Product connection
65
+ fraction:float A value to set for the cursor_tuple_fraction variable.
66
+ PostgreSQL uses 0.1 (optimize for 10% of result set)
67
+ This library uses 1.0 (Optimize for 100% of the result set)
68
+ Do not override this value unless you understand it.
69
+
70
+ Notes:
71
+
72
+ * Use cursors *only* for large result sets. They have more overhead with the database
73
+ than ActiveRecord selecting all matching records.
74
+ * Aliases each_hash and each_hash_by_sql are provided for each_row and each_row_by_sql
75
+ if you prefer to express what types are being returned.
76
+
77
+ ###Hashes vs. Instances
78
+
79
+ The each_row method returns the Hash of strings for speed (as this allows you to process a lot of rows).
80
+ Hashes are returned with String values, and you must take care of any type conversion.
81
+
82
+ When you use each_instance, ActiveRecord lazily casts these strings into
83
+ Ruby types (Time, Fixnum, etc.) only when you read the attribute.
84
+
85
+ If you find you need the types cast for your attributes, consider using each_instance
86
+ instead. ActiveRecord's read casting algorithm will only cast the values you need and
87
+ has become more efficient over time.
88
+
89
+ ###Select and Pluck
90
+
91
+ To limit the columns returned to just those you need, use `.select(:id, :name)`
92
+ query method.
93
+
94
+ ```ruby
95
+ Product.select(:id, :name).each_row { |hash| Product.process(hash) }
96
+ ```
97
+
98
+ Pluck is a great alternative to using a cursor. It does not instantiate
99
+ the row, and builds an array of result values, and translates the values into ruby
100
+ values (numbers, Timestamps, etc.). Using the cursor would still allow you to lazy
101
+ load them in batches for very large sets.
102
+
103
+ You can also use the `pluck_rows` or `pluck_instances` if the results
104
+ won't eat up too much memory.
105
+
106
+ ```ruby
107
+ Product.newly_arrived.pluck(:id) #=> [1, 2, 3, ...]
108
+ Product.newly_arrived.each_row { |hash| }
109
+ Product.select(:id).each_row.map {|r| r["id"].to_i } # cursor instead of pluck
110
+ Product.pluck_rows(:id) #=> ["1", "2", ...]
111
+ Product.pluck_instances(:id, :quantity) #=> [[1, 503], [2, 932], ...]
112
+ ```
113
+
114
+ ###Associations and Eager Loading
115
+
116
+ ActiveRecord performs some magic when eager-loading associated rows. It
117
+ will usually not join the tables, and prefers to load the data in
118
+ separate queries.
119
+
120
+ This library hooks onto the `to_sql` feature of the query builder. As a
121
+ result, it can't do the join if ActiveRecord decided not to join, nor
122
+ can it construct the association objects eagerly.
123
+
124
+ ##Background: Why PostgreSQL Cursors?
125
+
126
+ ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of
127
+ around 20 rows is returned to the user. When you do this
128
+
129
+ ```ruby
130
+ Product.all.each { |product| product.process }
131
+ ```
132
+
133
+ The database returns all matching result set rows to ActiveRecord, which instantiates each row with
134
+ the data returned. This function returns an array of all these rows to the caller.
135
+
136
+ Asynchronous, Background, or Offline processing may require processing a large amount of data.
137
+ When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby
138
+ does not return that memory after processing the array, and this causes your process to "bloat". If you
139
+ don't have enough memory, it will cause an exception.
140
+
141
+ ###ActiveRecord.find_each and find_in_batches
142
+
143
+ To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:
144
+
145
+ ```ruby
146
+ Product.where("id>0").find_each { |model| Product.process(model) }
147
+
148
+ Product.where("id>0").find_in_batches do |batch|
149
+ batch.each { |model| Product.process(model) }
150
+ end
151
+ ```
152
+
153
+ Optionally, you can specify a :batch_size option as the size of the "chunk"; it defaults to 1000.
154
+
155
+ There are drawbacks with these methods:
156
+
157
+ * You cannot specify the order, it will be ordered by the primary key (usually id)
158
+ * The primary key must be numeric
159
+ * The query is rerun for each chunk (1000 rows), starting at the next id sequence.
160
+ * You cannot use overly complex queries as that will be rerun and incur more overhead.
161
+
162
+
163
+ ##Meta
164
+ ###Author
165
+ Allen Fair, [@allenfair](https://twitter.com/allenfair), http://github.com/afair
166
+
167
+ Thanks to:
168
+
169
+ * Iulian Dogariu, http://github.com/iulianu (Fixes)
170
+ * Julian Mehnle, julian@mehnle.net (Suggestions)
171
+ * ...And all the other contributors!
172
+
173
+ ###Note on Patches/Pull Requests
174
+
175
+ * Fork the project.
176
+ * Make your feature addition or bug fix.
177
+ * Add tests for it. This is important so I don't break it in a
178
+ future version unintentionally.
179
+ * Commit, do not mess with rakefile, version, or history.
180
+ (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
181
+ * Send me a pull request. Bonus points for topic branches.
182
+
183
+ ###Copyright
184
+
185
+ Copyright (c) 2010-2014 Allen Fair. See (MIT) LICENSE for details.
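
The README above describes block-wise fetching with the `block_size`, `while`, and `until` options, and notes that `each_row` hands back hashes of Strings. The following is a minimal sketch of those semantics only, not the gem's implementation: `SketchCursor` is a hypothetical class that slices an in-memory array instead of issuing `FETCH` statements, so it runs without PostgreSQL or ActiveRecord. The sample rows and column names are likewise assumptions.

```ruby
# Hypothetical in-memory stand-in for a cursor; it does not talk to a database.
class SketchCursor
  include Enumerable

  attr_reader :fetches # number of simulated block fetches in the last run

  # rows    - Array of Hashes with String values, standing in for a result set
  # options - :block_size, :while, :until (names taken from the README)
  def initialize(rows, options = {})
    @rows = rows
    @options = options
    @fetches = 0
  end

  def each
    return to_enum(:each) unless block_given?

    @fetches = 0
    block_size = @options.fetch(:block_size, 1000)
    @rows.each_slice(block_size) do |block|
      @fetches += 1 # one simulated round trip per block
      block.each do |row|
        rc = yield row
        # :until stops when the block returns the value; :while stops when it does not.
        return self if @options.key?(:until) && rc == @options[:until]
        return self if @options.key?(:while) && rc != @options[:while]
      end
    end
    self
  end
end

rows = (1..10).map { |i| { "id" => i.to_s, "quantity" => (i * 10).to_s } }

cursor = SketchCursor.new(rows, block_size: 4)
ids = cursor.map { |r| r["id"].to_i } # values are Strings, so cast explicitly
total = cursor.inject(0) { |sum, r| sum + r["quantity"].to_i }

seen = []
stopper = SketchCursor.new(rows, block_size: 4, until: true)
stopper.each do |r|
  seen << r["id"].to_i
  r["id"] == "5" # the loop ends once the block returns true
end
```

In this sketch the ten rows arrive in three blocks of at most four, and the `until: true` run stops at id 5 after only two blocks have been fetched. A real cursor gets the same early exit by simply not issuing the next fetch.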
data/Rakefile CHANGED
@@ -1,53 +1,24 @@
1
- require 'rubygems'
2
- require 'rake'
1
+ require "bundler/gem_tasks"
2
+ require "bundler/setup"
3
+ require 'rake/testtask'
3
4
 
4
- begin
5
- require 'jeweler'
6
- Jeweler::Tasks.new do |gem|
7
- gem.name = "postgresql_cursor"
8
- gem.summary = %Q{ActiveRecord PostgreSQL Adapter extension for using a cursor to return a large result set}
9
- gem.description = %Q{PostgreSQL Cursor is an extension to the ActiveRecord PostgreSQLAdapter for very large result sets. It provides a cursor open/fetch/close interface to access data without loading all rows into memory, and instead loads the result rows in "chunks" (default of 10_000 rows), buffers them, and returns the rows one at a time.}
10
- gem.email = "allen.fair@gmail.com"
11
- gem.homepage = "http://github.com/afair/postgresql_cursor"
12
- gem.authors = ["Allen Fair"]
13
- gem.add_dependency 'activerecord'
14
- # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
15
- end
16
- Jeweler::GemcutterTasks.new
17
- rescue LoadError
18
- puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
19
- end
5
+ task :default => :test
20
6
 
21
- require 'rake/testtask'
22
- Rake::TestTask.new(:test) do |test|
23
- test.libs << 'lib' << 'test'
24
- test.pattern = 'test/**/test_*.rb'
25
- test.verbose = true
7
+ desc "Run the Test Suite, toot suite"
8
+ task :test do
9
+ sh "ruby test/test_*"
26
10
  end
27
11
 
28
- begin
29
- require 'rcov/rcovtask'
30
- Rcov::RcovTask.new do |test|
31
- test.libs << 'test'
32
- test.pattern = 'test/**/test_*.rb'
33
- test.verbose = true
34
- end
35
- rescue LoadError
36
- task :rcov do
37
- abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
38
- end
12
+ desc "Open an IRB Console with the gem and test-app loaded"
13
+ task :console do
14
+ sh "bundle exec irb -Ilib -I . -r postgresql_cursor -r test-app/app"
15
+ #require 'irb'
16
+ #ARGV.clear
17
+ #IRB.start
39
18
  end
40
19
 
41
- task :test => :check_dependencies
42
-
43
- task :default => :test
44
-
45
- require 'rdoc/task'
46
- Rake::RDocTask.new do |rdoc|
47
- version = File.exist?('VERSION') ? File.read('VERSION') : ""
48
-
49
- rdoc.rdoc_dir = 'rdoc'
50
- rdoc.title = "postgresql_cursor #{version}"
51
- rdoc.rdoc_files.include('README*')
52
- rdoc.rdoc_files.include('lib/**/*.rb')
20
+ desc "Setup testing database and table"
21
+ task :setup do
22
+ sh %q(createdb postgresql_cursor_test)
23
+ sh %Q<echo "create table products ( id serial primary key);" | psql postgresql_cursor_test>
53
24
  end
@@ -1,180 +1,12 @@
1
- # PostgreSQLCursor: library class provides postgresql cursor for large result
2
- # set processing. Requires ActiveRecord, but can be adapted to other DBI/ORM libraries.
3
- # If you don't use AR, this assumes #connection and #instantiate methods are available.
4
- #
5
- # options - Hash to control operation and loop breaks
6
- # connection: instance - ActiveRecord connection to use
7
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
8
- # block_size: 1..n - The number of rows to fetch per db block fetch
9
- # while: value - Exits loop when block does not return this value.
10
- # until: value - Exits loop when block returns this value.
11
- #
12
- # Examples:
13
- # PostgreSQLCursor.new("select ...").each { |hash| ... }
14
- # ActiveRecordModel.where(...).each_row { |hash| ... }
15
- # ActiveRecordModel.each_row_by_sql("select ...") { |hash| ... }
16
- # ActiveRecordModel.each_instance_by_sql("select ...") { |model| ... }
17
- #
18
- class PostgreSQLCursor
19
- include Enumerable
20
- attr_reader :sql, :options, :connection, :count, :result
21
- @@cursor_seq = 0
22
-
23
- # Public: Start a new PostgreSQL cursor query
24
- # sql - The SQL statement with interpolated values
25
- # options - hash of processing controls
26
- # while: value - Exits loop when block does not return this value.
27
- # until: value - Exits loop when block returns this value.
28
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
29
- # block_size: 1..n - The number of rows to fetch per db block fetch
30
- # Defaults to 1000
31
- #
32
- # Examples
33
- #
34
- # PostgreSQLCursor.new("select ....")
35
- #
36
- # Returns the cursor object when called with new.
37
- def initialize(sql, options={})
38
- @sql = sql
39
- @options = options
40
- @connection = @options.fetch(:connection) { ActiveRecord::Base.connection }
41
- @count = 0
42
- end
43
-
44
- # Public: Yields each row of the result set to the passed block
45
- #
46
- #
47
- # Yields the row to the block. The row is a hash with symbolized keys.
48
- # {colname: value, ....}
49
- #
50
- # Returns the count of rows processed
51
- def each(&block)
52
- has_do_until = @options.has_key?(:until)
53
- has_do_while = @options.has_key?(:while)
54
- @count = 0
55
- @connection.transaction do
56
- begin
57
- open
58
- while (row = fetch) do
59
- break if row.size==0
60
- @count += 1
61
- row = row.symbolize_keys
62
- rc = yield row
63
- # TODO: Handle exceptions raised within block
64
- break if has_do_until && rc == @options[:until]
65
- break if has_do_while && rc != @options[:while]
66
- end
67
- rescue Exception => e
68
- raise e
69
- ensure
70
- close
71
- end
72
- end
73
- @count
74
- end
75
-
76
- # Public: Opens (actually, "declares") the cursor. Call this before fetching
77
- def open
78
- set_cursor_tuple_fraction
79
- @cursor = @@cursor_seq += 1
80
- @result = @connection.execute("declare cursor_#{@cursor} cursor for #{@sql}")
81
- @block = []
82
- end
83
-
84
- # Public: Returns the next row from the cursor, or empty hash if end of results
85
- #
86
- # Returns a row as a hash of {'colname'=>value,...}
87
- def fetch
88
- fetch_block if @block.size==0
89
- @block.shift
90
- end
91
-
92
- # Private: Fetches the next block of rows into @block
93
- def fetch_block(block_size=nil)
94
- block_size ||= @block_size ||= @options.fetch(:block_size) { 1000 }
95
- @result = @connection.execute("fetch #{block_size} from cursor_#{@cursor}")
96
- @block = @result.collect {|row| row } # Make our own
97
- end
98
-
99
- # Public: Closes the cursor
100
- def close
101
- @connection.execute("close cursor_#{@cursor}")
102
- end
103
-
104
- # Private: Sets the PostgreSQL cursor_tuple_fraction value = 1.0 to assume all rows will be fetched
105
- # This is a value between 0.1 and 1.0 (PostgreSQL defaults to 0.1, this library defaults to 1.0)
106
- # used to determine the expected fraction (percent) of result rows returned to the caller.
107
- # This value determines the access path by the query planner.
108
- def set_cursor_tuple_fraction(frac=1.0)
109
- @cursor_tuple_fraction ||= @options.fetch(:fraction) { 1.0 }
110
- return @cursor_tuple_fraction if frac == @cursor_tuple_fraction
111
- @cursor_tuple_fraction = frac
112
- @result = @connection.execute("set cursor_tuple_fraction to #{frac}")
113
- frac
114
- end
115
-
116
- end
117
-
118
- # Defines extension to ActiveRecord to use this library
119
- class ActiveRecord::Base
120
- # Public: Returns each row as a hash to the given block
121
- #
122
- # sql - Full SQL statement, variables interpolated
123
- # options - Hash to control
124
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
125
- # block_size: 1..n - The number of rows to fetch per db block fetch
126
- # while: value - Exits loop when block does not return this value.
127
- # until: value - Exits loop when block returns this value.
128
- #
129
- # Returns the number of rows yielded to the block
130
- def self.each_row_by_sql(sql, options={}, &block)
131
- options = {:connection => self.connection}.merge(options)
132
- PostgreSQLCursor.new(sql, options).each(&block)
133
- end
134
-
135
- # Public: Returns each row as a model instance to the given block
136
- # As this instantiates a model object, it is slower than each_row_by_sql
137
- #
138
- # Parameters: see each_row_by_sql
139
- #
140
- # Returns the number of rows yielded to the block
141
- def self.each_instance_by_sql(sql, options={}, &block)
142
- options = {:connection => self.connection}.merge(options)
143
- PostgreSQLCursor.new(sql, options).each do |row|
144
- model = instantiate(row)
145
- yield model
146
- end
147
- end
148
- end
149
-
150
- # Defines extension to ActiveRecord/AREL to use this library
151
- class ActiveRecord::Relation
152
-
153
- # Public: Executes the query, returning each row as a hash
154
- # to the given block.
155
- #
156
- # options - Hash to control
157
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
158
- # block_size: 1..n - The number of rows to fetch per db block fetch
159
- # while: value - Exits loop when block does not return this value.
160
- # until: value - Exits loop when block returns this value.
161
- #
162
- # Returns the number of rows yielded to the block
163
- def each_row(options={}, &block)
164
- options = {:connection => self.connection}.merge(options)
165
- PostgreSQLCursor.new(to_sql, options).each(&block)
166
- end
167
-
168
- # Public: Like each_row, but returns an instantiated model object to the block
169
- #
170
- # Parameters: same as each_row
171
- #
172
- # Returns the number of rows yielded to the block
173
- def each_instance(options={}, &block)
174
- options = {:connection => self.connection}.merge(options)
175
- PostgreSQLCursor.new(to_sql, options).each do |row|
176
- model = instantiate(row)
177
- block.call model
178
- end
179
- end
180
- end
1
+ require 'postgresql_cursor/version'
2
+ require 'postgresql_cursor/cursor'
3
+ require 'postgresql_cursor/active_record/relation/cursor_iterators'
4
+ require 'postgresql_cursor/active_record/sql_cursor'
5
+ require 'postgresql_cursor/active_record/connection_adapters/postgresql_type_map'
6
+
7
+ # ActiveRecord 4.x
8
+ require 'active_record'
9
+ require 'active_record/connection_adapters/postgresql_adapter'
10
+ ActiveRecord::Base.extend(PostgreSQLCursor::ActiveRecord::SqlCursor)
11
+ ActiveRecord::Relation.send(:include, PostgreSQLCursor::ActiveRecord::Relation::CursorIterators)
12
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.send(:include, PostgreSQLCursor::ActiveRecord::ConnectionAdapters::PostgreSQLTypeMap)
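
The last three added lines attach the gem's modules in two different ways: `extend` on `ActiveRecord::Base` and `send(:include, ...)` on `ActiveRecord::Relation` and the adapter. A minimal sketch of that difference follows, using hypothetical stand-in classes and modules (the `Sketch*` names are not part of the gem or of ActiveRecord). The likely reason for `send` is that `Module#include` was private before Ruby 2.1, though the source does not say so.

```ruby
# Hypothetical stand-ins; none of these are gem or ActiveRecord classes.
module SketchSqlCursor
  # Becomes a class-level method on whatever extends this module.
  def each_row_by_sql(sql)
    "class-level cursor for: #{sql}"
  end
end

module SketchCursorIterators
  # Becomes an instance method on whatever includes this module.
  def each_row
    "instance-level cursor for: #{to_sql}"
  end
end

class SketchBase; end

class SketchRelation
  def to_sql
    "select * from products"
  end
end

# extend adds the module's methods to the class object itself.
SketchBase.extend(SketchSqlCursor)
# include adds them to instances; send works even where include is private.
SketchRelation.send(:include, SketchCursorIterators)

class_result = SketchBase.each_row_by_sql("select 1")
instance_result = SketchRelation.new.each_row
```

This mirrors how the README's examples are called: `Product.each_row_by_sql(...)` on the model class, and `each_row` on a relation such as `Product.where(...)`.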