postgresql_cursor 0.4.3 → 0.5.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: c7761053fbe88cc359b47415f7e79b620defe326
4
- data.tar.gz: 8398abd146476cc6ddfa637e3d5215fcc446be38
3
+ metadata.gz: 6d9ff2e625bba6a119c092d2a7af6cc4b0ab2fde
4
+ data.tar.gz: d44d29ed0238aac7fc0597d6e1ae83763b0e7cdb
5
5
  SHA512:
6
- metadata.gz: 18ebbe7bfa3fdee1435909725be137bf7b895f55f3ad87620a5486697fa67f06afab79d117945643ecbe2e4b1f7e6d6c6e30364358763429beb2c720cfe1d9fe
7
- data.tar.gz: 482aeffdbe884a29ca1e05b8a860e897845c9cc86d134c4512475c2039c5f448c2585298c444f89a0afd4f4b781bd4b3221c2deccb321eba6b68f1d178abe49e
6
+ metadata.gz: 7451b81a33709fd76494f97e883a11d97d30611c7854ae712082a06899a14ce8b9bca28fd14925e06c21a695f1d1df20a1d5044883519f0e34148dbcca5b1313
7
+ data.tar.gz: 15350ff60d906758aeb41f104102f89a28d60e713343c62023a907a389c3244db305f0237de2890cc13e9dce17810da67765cc90e56a008ed40ec86f4f51a580
@@ -0,0 +1,24 @@
1
+ ## MAC OS
2
+ .DS_Store
3
+
4
+ ## TEXTMATE
5
+ *.tmproj
6
+ tmtags
7
+
8
+ ## EMACS
9
+ *~
10
+ \#*
11
+ .\#*
12
+
13
+ ## VIM
14
+ *.swp
15
+
16
+ ## IntelliJ/Rubymine
17
+ .idea
18
+
19
+ ## PROJECT::GENERAL
20
+ coverage
21
+ rdoc
22
+ pkg
23
+
24
+ ## PROJECT::SPECIFIC
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in postgresql_cursor.gemspec
4
+ gemspec
@@ -0,0 +1,41 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ postgresql_cursor (0.5.0)
5
+ activerecord (>= 3.2.0)
6
+
7
+ GEM
8
+ remote: https://rubygems.org/
9
+ specs:
10
+ activemodel (4.1.1)
11
+ activesupport (= 4.1.1)
12
+ builder (~> 3.1)
13
+ activerecord (4.1.1)
14
+ activemodel (= 4.1.1)
15
+ activesupport (= 4.1.1)
16
+ arel (~> 5.0.0)
17
+ activesupport (4.1.1)
18
+ i18n (~> 0.6, >= 0.6.9)
19
+ json (~> 1.7, >= 1.7.7)
20
+ minitest (~> 5.1)
21
+ thread_safe (~> 0.1)
22
+ tzinfo (~> 1.1)
23
+ arel (5.0.1.20140414130214)
24
+ builder (3.2.2)
25
+ i18n (0.6.9)
26
+ json (1.8.1)
27
+ minitest (5.3.3)
28
+ pg (0.17.1)
29
+ rake (10.3.1)
30
+ thread_safe (0.3.4)
31
+ tzinfo (1.2.1)
32
+ thread_safe (~> 0.1)
33
+
34
+ PLATFORMS
35
+ ruby
36
+
37
+ DEPENDENCIES
38
+ minitest
39
+ pg
40
+ postgresql_cursor!
41
+ rake
@@ -0,0 +1,185 @@
1
+ #PostgreSQLCursor for handling large Result Sets
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/postgresql_cursor.svg)](http://badge.fury.io/rb/postgresql_cursor)
4
+
5
+ PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries
6
+ returning a large number of rows, and allows you to sort your result set.
7
+
8
+ In PostgreSQL, a
9
+ [cursor](http://www.postgresql.org/docs/9.4/static/plpgsql-cursors.html)
10
+ runs a query, from which you fetch a block of
11
+ (say 1000) rows, process them, and continue fetching until the result
12
+ set is exhausted. By fetching a smaller chunk of data, this reduces the
13
+ amount of memory your application uses and prevents the potential crash
14
+ of running out of memory.
15
+
16
+ Version 0.5.0 has been refactored to install more smoothly into ActiveRecord.
17
+ It supports Rails and ActiveRecord 3.2.x and up.
18
+
19
+ ##Use Cursors
20
+
21
+ PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program
22
+ to declare a cursor to run a given query returning "chunks" of rows to the application program while
23
+ retaining the position of the full result set in the database. This overcomes all the disadvantages
24
+ of using find_each and find_in_batches.
25
+
26
+ Also, with PostgreSQL, you have an option to have raw hashes of the row returned instead of the
27
+ instantiated models. An informal benchmark showed that returning instances is a factor of 4 times
28
+ slower than returning hashes. If you can work with the data in this form, you will find better
29
+ performance.
30
+
31
+ With PostgreSQL, you can work with cursors as follows:
32
+
33
+ ```ruby
34
+ Product.where("id>0").order("name").each_row { |hash| Product.process(hash) }
35
+
36
+ Product.where("id>0").each_instance { |product| product.process! }
37
+ Product.where("id>0").each_instance(block_size:100_000) { |product| product.process }
38
+
39
+ Product.each_row { |hash| Product.process(hash) }
40
+ Product.each_instance { |product| product.process }
41
+
42
+ Product.each_row_by_sql("select * from products") { |hash| Product.process(hash) }
43
+ Product.each_instance_by_sql("select * from products") { |product| product.process }
44
+ ```
45
+
46
+ ###PostgreSQLCursor is an Enumerable
47
+
48
+ If you do not pass in a block, the cursor is returned, which mixes in the Enumerable
49
+ library. With that, you can pass it around, or chain in the awesome enumerable things
50
+ like `map` and `reduce`. Furthermore, the cursors already act as `lazy`, but you can
51
+ also chain in `lazy` when you want to keep the memory footprint small for the rest of the process.
52
+
53
+ ```ruby
54
+ Product.each_row.map {|r| r["id"].to_i } #=> [1, 2, 3, ...]
55
+ Product.each_instance.map {|r| r.id }.each {|id| p id } #=> [1, 2, 3, ...]
56
+ Product.each_instance.lazy.inject(0) {|sum,r| sum + r.quantity } #=> 499500
57
+ ```
58
+
59
+ All these methods take an options hash to control things more:
60
+
61
+ block_size:n The number of rows to fetch from the database each time (default 1000)
62
+ while:value Continue looping as long as the block returns this value
63
+ until:value Continue looping until the block returns this value
64
+ connection:conn Use this connection instead of the current Product connection
65
+ fraction:float A value to set for the cursor_tuple_fraction variable.
66
+ PostgreSQL uses 0.1 (optimize for 10% of result set)
67
+ This library uses 1.0 (Optimize for 100% of the result set)
68
+ Do not override this value unless you understand it.
69
+
70
+ Notes:
71
+
72
+ * Use cursors *only* for large result sets. They have more overhead with the database
73
+ than ActiveRecord selecting all matching records.
74
+ * Aliases each_hash and each_hash_by_sql are provided for each_row and each_row_by_sql
75
+ if you prefer to express what types are being returned.
76
+
77
+ ###Hashes vs. Instances
78
+
79
+ The each_row method returns the Hash of strings for speed (as this allows you to process a lot of rows).
80
+ Hashes are returned with String values, and you must take care of any type conversion.
81
+
82
+ When you use each_instance, ActiveRecord lazily casts these strings into
83
+ Ruby types (Time, Fixnum, etc.) only when you read the attribute.
84
+
85
+ If you find you need the types cast for your attributes, consider using each_instance
86
+ instead. ActiveRecord's read casting algorithm will only cast the values you need and
87
+ has become more efficient over time.
88
+
89
+ ###Select and Pluck
90
+
91
+ To limit the columns returned to just those you need, use `.select(:id, :name)`
92
+ query method.
93
+
94
+ ```ruby
95
+ Product.select(:id, :name).each_row { |hash| Product.process(hash) }
96
+ ```
97
+
98
+ Pluck is a great alternative to using a cursor. It does not instantiate
99
+ the row, and builds an array of result values, and translates the values into ruby
100
+ values (numbers, Timestamps, etc.). Using the cursor would still allow you to lazy
101
+ load them in batches for very large sets.
102
+
103
+ You can also use the `pluck_rows` or `pluck_instances` if the results
104
+ won't eat up too much memory.
105
+
106
+ ```ruby
107
+ Product.newly_arrived.pluck(:id) #=> [1, 2, 3, ...]
108
+ Product.newly_arrived.each_row { |hash| }
109
+ Product.select(:id).each_row.map {|r| r["id"].to_i } # cursor instead of pluck
110
+ Product.pluck_rows(:id) #=> ["1", "2", ...]
111
+ Product.pluck_instances(:id, :quantity) #=> [[1, 503], [2, 932], ...]
112
+ ```
113
+
114
+ ###Associations and Eager Loading
115
+
116
+ ActiveRecord performs some magic when eager-loading associated rows. It
117
+ will usually not join the tables, and prefers to load the data in
118
+ separate queries.
119
+
120
+ This library hooks onto the `to_sql` feature of the query builder. As a
121
+ result, it can't do the join if ActiveRecord decided not to join, nor
122
+ can it construct the association objects eagerly.
123
+
124
+ ##Background: Why PostgreSQL Cursors?
125
+
126
+ ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of
127
+ around 20 rows is returned to the user. When you do this
128
+
129
+ ```ruby
130
+ Product.all.each { |product| product.process }
131
+ ```
132
+
133
+ The database returns all matching result set rows to ActiveRecord, which instantiates each row with
134
+ the data returned. This function returns an array of all these rows to the caller.
135
+
136
+ Asynchronous, Background, or Offline processing may require processing a large amount of data.
137
+ When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby
138
+ does not return that memory after processing the array, and this causes your process to "bloat". If you
139
+ don't have enough memory, it will cause an exception.
140
+
141
+ ###ActiveRecord.find_each and find_in_batches
142
+
143
+ To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:
144
+
145
+ ```ruby
146
+ Product.where("id>0").find_each { |model| model.process }
147
+
148
+ Product.where("id>0").find_in_batches do |batch|
149
+ batch.each { |model| model.process }
150
+ end
151
+ ```
152
+
153
+ Optionally, you can specify a :batch_size option as the size of the "chunk"; it defaults to 1000.
154
+
155
+ There are drawbacks with these methods:
156
+
157
+ * You cannot specify the order, it will be ordered by the primary key (usually id)
158
+ * The primary key must be numeric
159
+ * The query is rerun for each chunk (1000 rows), starting at the next id sequence.
160
+ * You cannot use overly complex queries as that will be rerun and incur more overhead.
161
+
162
+
163
+ ##Meta
164
+ ###Author
165
+ Allen Fair, [@allenfair](https://twitter/com/allenfair), http://github.com/afair
166
+
167
+ Thanks to:
168
+
169
+ * Iulian Dogariu, http://github.com/iulianu (Fixes)
170
+ * Julian Mehnle, julian@mehnle.net (Suggestions)
171
+ * ...And all the other contributors!
172
+
173
+ ###Note on Patches/Pull Requests
174
+
175
+ * Fork the project.
176
+ * Make your feature addition or bug fix.
177
+ * Add tests for it. This is important so I don't break it in a
178
+ future version unintentionally.
179
+ * Commit, do not mess with rakefile, version, or history.
180
+ (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
181
+ * Send me a pull request. Bonus points for topic branches.
182
+
183
+ ###Copyright
184
+
185
+ Copyright (c) 2010-2014 Allen Fair. See (MIT) LICENSE for details.
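
Note on the README hunk above: the loop it describes (fetch `block_size` rows, yield them one at a time, stop early on `until:` or `while:`) can be sketched in plain Ruby without a database. This is a minimal illustration under stated assumptions, not the gem's implementation. `FakeCursor` and its array-backed `fetch_block` are hypothetical stand-ins for a server-side cursor and its `FETCH n FROM cursor` round trip.

```ruby
# Hypothetical stand-in for a database cursor: an Array plays the role of
# the server-side result set, and fetch_block mimics one fetch round trip.
class FakeCursor
  include Enumerable

  attr_reader :fetches

  def initialize(rows, options = {})
    @rows       = rows
    @options    = options
    @block_size = options.fetch(:block_size, 1000)
    @fetches    = 0 # number of simulated round trips
  end

  # Yields one row at a time while pulling rows from the source in blocks.
  # Without a block it returns an Enumerator, so calls can be chained.
  def each
    return to_enum(:each) unless block_given?

    position = 0
    loop do
      block = fetch_block(position)
      break if block.empty?

      position += block.size
      block.each do |row|
        rc = yield row
        # Same early-exit rules the README lists for until: and while:
        return self if @options.key?(:until) && rc == @options[:until]
        return self if @options.key?(:while) && rc != @options[:while]
      end
    end
    self
  end

  private

  # Returns the next slice of rows, or an empty Array when exhausted.
  def fetch_block(position)
    @fetches += 1
    @rows[position, @block_size] || []
  end
end

rows   = (1..10).map { |i| { "id" => i.to_s } }
cursor = FakeCursor.new(rows, block_size: 4)
ids    = cursor.map { |r| r["id"].to_i } # Enumerable methods work as usual

# Stop as soon as the block returns :stop
seen  = []
early = FakeCursor.new(rows, block_size: 4, until: :stop)
early.each do |r|
  seen << r["id"]
  r["id"] == "3" ? :stop : :continue
end
```

In this sketch only one block of rows is held at a time, which is the memory property the README attributes to cursors. The early exit on `until:` also means the remaining blocks are never fetched.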
data/Rakefile CHANGED
@@ -1,53 +1,24 @@
1
- require 'rubygems'
2
- require 'rake'
1
+ require "bundler/gem_tasks"
2
+ require "bundler/setup"
3
+ require 'rake/testtask'
3
4
 
4
- begin
5
- require 'jeweler'
6
- Jeweler::Tasks.new do |gem|
7
- gem.name = "postgresql_cursor"
8
- gem.summary = %Q{ActiveRecord PostgreSQL Adapter extension for using a cursor to return a large result set}
9
- gem.description = %Q{PostgreSQL Cursor is an extension to the ActiveRecord PostgreSQLAdapter for very large result sets. It provides a cursor open/fetch/close interface to access data without loading all rows into memory, and instead loads the result rows in "chunks" (default of 10_000 rows), buffers them, and returns the rows one at a time.}
10
- gem.email = "allen.fair@gmail.com"
11
- gem.homepage = "http://github.com/afair/postgresql_cursor"
12
- gem.authors = ["Allen Fair"]
13
- gem.add_dependency 'activerecord'
14
- # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
15
- end
16
- Jeweler::GemcutterTasks.new
17
- rescue LoadError
18
- puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
19
- end
5
+ task :default => :test
20
6
 
21
- require 'rake/testtask'
22
- Rake::TestTask.new(:test) do |test|
23
- test.libs << 'lib' << 'test'
24
- test.pattern = 'test/**/test_*.rb'
25
- test.verbose = true
7
+ desc "Run the Test Suite, toot suite"
8
+ task :test do
9
+ sh "ruby test/test_*"
26
10
  end
27
11
 
28
- begin
29
- require 'rcov/rcovtask'
30
- Rcov::RcovTask.new do |test|
31
- test.libs << 'test'
32
- test.pattern = 'test/**/test_*.rb'
33
- test.verbose = true
34
- end
35
- rescue LoadError
36
- task :rcov do
37
- abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
38
- end
12
+ desc "Open an IRB Console with the gem and test-app loaded"
13
+ task :console do
14
+ sh "bundle exec irb -Ilib -I . -r postgresql_cursor -r test-app/app"
15
+ #require 'irb'
16
+ #ARGV.clear
17
+ #IRB.start
39
18
  end
40
19
 
41
- task :test => :check_dependencies
42
-
43
- task :default => :test
44
-
45
- require 'rdoc/task'
46
- Rake::RDocTask.new do |rdoc|
47
- version = File.exist?('VERSION') ? File.read('VERSION') : ""
48
-
49
- rdoc.rdoc_dir = 'rdoc'
50
- rdoc.title = "postgresql_cursor #{version}"
51
- rdoc.rdoc_files.include('README*')
52
- rdoc.rdoc_files.include('lib/**/*.rb')
20
+ desc "Setup testing database and table"
21
+ task :setup do
22
+ sh %q(createdb postgresql_cursor_test)
23
+ sh %Q<echo "create table products ( id serial primary key);" | psql postgresql_cursor_test>
53
24
  end
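
One thing to watch in the new `test` task: `sh "ruby test/test_*"` hands the glob to the shell, and if it ever matches more than one file, `ruby` runs only the first as the script and passes the rest as arguments. Below is a sketch of an alternative using `Rake::TestTask`, which ships with Rake and which the 0.4.3 Rakefile used. The task name `:test_all` and the file pattern are assumptions made for illustration.

```ruby
require 'rake'
require 'rake/testtask'

# Defines a task that loads every file matching the pattern,
# instead of relying on shell glob expansion.
Rake::TestTask.new(:test_all) do |t|
  t.libs << 'test'               # 'lib' is already on the load path by default
  t.pattern = 'test/test_*.rb'   # assumed layout: test files live directly under test/
  t.verbose = true
end
```

With a single test file the two approaches behave the same; the difference only shows up once a second `test/test_*` file is added.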
@@ -1,180 +1,12 @@
1
- # PostgreSQLCursor: library class provides postgresql cursor for large result
2
- # set processing. Requires ActiveRecord, but can be adapted to other DBI/ORM libraries.
3
- # If you don't use AR, this assumes #connection and #instantiate methods are available.
4
- #
5
- # options - Hash to control operation and loop breaks
6
- # connection: instance - ActiveRecord connection to use
7
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
8
- # block_size: 1..n - The number of rows to fetch per db block fetch
9
- # while: value - Exits loop when block does not return this value.
10
- # until: value - Exits loop when block returns this value.
11
- #
12
- # Exmaples:
13
- # PostgreSQLCursor.new("select ...").each { |hash| ... }
14
- # ActiveRecordModel.where(...).each_row { |hash| ... }
15
- # ActiveRecordModel.each_row_by_sql("select ...") { |hash| ... }
16
- # ActiveRecordModel.each_instance_by_sql("select ...") { |model| ... }
17
- #
18
- class PostgreSQLCursor
19
- include Enumerable
20
- attr_reader :sql, :options, :connection, :count, :result
21
- @@cursor_seq = 0
22
-
23
- # Public: Start a new PostgreSQL cursor query
24
- # sql - The SQL statement with interpolated values
25
- # options - hash of processing controls
26
- # while: value - Exits loop when block does not return this value.
27
- # until: value - Exits loop when block returns this value.
28
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
29
- # block_size: 1..n - The number of rows to fetch per db block fetch
30
- # Defaults to 1000
31
- #
32
- # Examples
33
- #
34
- # PostgreSQLCursor.new("select ....")
35
- #
36
- # Returns the cursor object when called with new.
37
- def initialize(sql, options={})
38
- @sql = sql
39
- @options = options
40
- @connection = @options.fetch(:connection) { ActiveRecord::Base.connection }
41
- @count = 0
42
- end
43
-
44
- # Public: Yields each row of the result set to the passed block
45
- #
46
- #
47
- # Yields the row to the block. The row is a hash with symbolized keys.
48
- # {colname: value, ....}
49
- #
50
- # Returns the count of rows processed
51
- def each(&block)
52
- has_do_until = @options.has_key?(:until)
53
- has_do_while = @options.has_key?(:while)
54
- @count = 0
55
- @connection.transaction do
56
- begin
57
- open
58
- while (row = fetch) do
59
- break if row.size==0
60
- @count += 1
61
- row = row.symbolize_keys
62
- rc = yield row
63
- # TODO: Handle exceptions raised within block
64
- break if has_do_until && rc == @options[:until]
65
- break if has_do_while && rc != @options[:while]
66
- end
67
- rescue Exception => e
68
- raise e
69
- ensure
70
- close
71
- end
72
- end
73
- @count
74
- end
75
-
76
- # Public: Opens (actually, "declares") the cursor. Call this before fetching
77
- def open
78
- set_cursor_tuple_fraction
79
- @cursor = @@cursor_seq += 1
80
- @result = @connection.execute("declare cursor_#{@cursor} cursor for #{@sql}")
81
- @block = []
82
- end
83
-
84
- # Public: Returns the next row from the cursor, or empty hash if end of results
85
- #
86
- # Returns a row as a hash of {'colname'=>value,...}
87
- def fetch
88
- fetch_block if @block.size==0
89
- @block.shift
90
- end
91
-
92
- # Private: Fetches the next block of rows into @block
93
- def fetch_block(block_size=nil)
94
- block_size ||= @block_size ||= @options.fetch(:block_size) { 1000 }
95
- @result = @connection.execute("fetch #{block_size} from cursor_#{@cursor}")
96
- @block = @result.collect {|row| row } # Make our own
97
- end
98
-
99
- # Public: Closes the cursor
100
- def close
101
- @connection.execute("close cursor_#{@cursor}")
102
- end
103
-
104
- # Private: Sets the PostgreSQL cursor_tuple_fraction value = 1.0 to assume all rows will be fetched
105
- # This is a value between 0.1 and 1.0 (PostgreSQL defaults to 0.1, this library defaults to 1.0)
106
- # used to determine the expected fraction (percent) of result rows returned the the caller.
107
- # This value determines the access path by the query planner.
108
- def set_cursor_tuple_fraction(frac=1.0)
109
- @cursor_tuple_fraction ||= @options.fetch(:fraction) { 1.0 }
110
- return @cursor_tuple_fraction if frac == @cursor_tuple_fraction
111
- @cursor_tuple_fraction = frac
112
- @result = @connection.execute("set cursor_tuple_fraction to #{frac}")
113
- frac
114
- end
115
-
116
- end
117
-
118
- # Defines extension to ActiveRecord to use this library
119
- class ActiveRecord::Base
120
- # Public: Returns each row as a hash to the given block
121
- #
122
- # sql - Full SQL statement, variables interpolated
123
- # options - Hash to control
124
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
125
- # block_size: 1..n - The number of rows to fetch per db block fetch
126
- # while: value - Exits loop when block does not return this value.
127
- # until: value - Exits loop when block returns this value.
128
- #
129
- # Returns the number of rows yielded to the block
130
- def self.each_row_by_sql(sql, options={}, &block)
131
- options = {:connection => self.connection}.merge(options)
132
- PostgreSQLCursor.new(sql, options).each(&block)
133
- end
134
-
135
- # Public: Returns each row as a model instance to the given block
136
- # As this instantiates a model object, it is slower than each_row_by_sql
137
- #
138
- # Paramaters: see each_row_by_sql
139
- #
140
- # Returns the number of rows yielded to the block
141
- def self.each_instance_by_sql(sql, options={}, &block)
142
- options = {:connection => self.connection}.merge(options)
143
- PostgreSQLCursor.new(sql, options).each do |row|
144
- model = instantiate(row)
145
- yield model
146
- end
147
- end
148
- end
149
-
150
- # Defines extension to ActiveRecord/AREL to use this library
151
- class ActiveRecord::Relation
152
-
153
- # Public: Executes the query, returning each row as a hash
154
- # to the given block.
155
- #
156
- # options - Hash to control
157
- # fraction: 0.1..1.0 - The cursor_tuple_fraction (default 1.0)
158
- # block_size: 1..n - The number of rows to fetch per db block fetch
159
- # while: value - Exits loop when block does not return this value.
160
- # until: value - Exits loop when block returns this value.
161
- #
162
- # Returns the number of rows yielded to the block
163
- def each_row(options={}, &block)
164
- options = {:connection => self.connection}.merge(options)
165
- PostgreSQLCursor.new(to_sql, options).each(&block)
166
- end
167
-
168
- # Public: Like each_row, but returns an instantiated model object to the block
169
- #
170
- # Paramaters: same as each_row
171
- #
172
- # Returns the number of rows yielded to the block
173
- def each_instance(options={}, &block)
174
- options = {:connection => self.connection}.merge(options)
175
- PostgreSQLCursor.new(to_sql, options).each do |row|
176
- model = instantiate(row)
177
- block.call model
178
- end
179
- end
180
- end
1
+ require 'postgresql_cursor/version'
2
+ require 'postgresql_cursor/cursor'
3
+ require 'postgresql_cursor/active_record/relation/cursor_iterators'
4
+ require 'postgresql_cursor/active_record/sql_cursor'
5
+ require 'postgresql_cursor/active_record/connection_adapters/postgresql_type_map'
6
+
7
+ # ActiveRecord 4.x
8
+ require 'active_record'
9
+ require 'active_record/connection_adapters/postgresql_adapter'
10
+ ActiveRecord::Base.extend(PostgreSQLCursor::ActiveRecord::SqlCursor)
11
+ ActiveRecord::Relation.send(:include, PostgreSQLCursor::ActiveRecord::Relation::CursorIterators)
12
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.send(:include, PostgreSQLCursor::ActiveRecord::ConnectionAdapters::PostgreSQLTypeMap)
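
The three hook lines that close the new `lib/postgresql_cursor.rb` rely on the difference between `extend`, which adds a module's methods at the class level, and `include`, which adds them to instances. Here is a minimal sketch of that pattern; `FakeBase`, `FakeRelation`, `SqlCursorSketch` and `CursorIteratorsSketch` are hypothetical stand-ins, not the gem's or ActiveRecord's real classes.

```ruby
# Hypothetical module whose methods should be callable on the class itself.
module SqlCursorSketch
  def each_row_by_sql(sql)
    "cursor over: #{sql}"
  end
end

# Hypothetical module whose methods should be callable on instances.
module CursorIteratorsSketch
  def each_row
    "cursor over: #{to_sql}"
  end
end

class FakeBase; end

class FakeRelation
  def to_sql
    "select * from products"
  end
end

# extend: module methods become singleton (class-level) methods.
FakeBase.extend(SqlCursorSketch)

# include via send: module methods become instance methods.
FakeRelation.send(:include, CursorIteratorsSketch)

class_level    = FakeBase.each_row_by_sql("select 1")
instance_level = FakeRelation.new.each_row
```

`send(:include, ...)` is used rather than a direct call because `Module#include` was private before Ruby 2.1.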