RubyGems - postgresql_cursor - Versions diffs - 0.4.3 → 0.5.0 - Mend

postgresql_cursor 0.4.3 → 0.5.0

Files changed (21)

checksums.yaml +4 -4
data/.gitignore +24 -0
data/Gemfile +4 -0
data/Gemfile.lock +41 -0
data/README.md +185 -0
data/Rakefile +17 -46
data/lib/postgresql_cursor.rb +12 -180
data/lib/postgresql_cursor/active_record/connection_adapters/postgresql_type_map.rb +17 -0
data/lib/postgresql_cursor/active_record/relation/cursor_iterators.rb +64 -0
data/lib/postgresql_cursor/active_record/sql_cursor.rb +92 -0
data/lib/postgresql_cursor/cursor.rb +199 -0
data/lib/postgresql_cursor/version.rb +3 -0
data/postgresql_cursor.gemspec +22 -43
data/test-app/Gemfile +14 -0
data/test-app/Gemfile.lock +34 -0
data/test-app/app.rb +30 -0
data/test-app/run.sh +10 -0
data/test/helper.rb +12 -15
data/test/test_postgresql_cursor.rb +41 -16
metadata +68 -12
data/README.rdoc +0 -97

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: c7761053fbe88cc359b47415f7e79b620defe326
-  data.tar.gz: 8398abd146476cc6ddfa637e3d5215fcc446be38
+  metadata.gz: 6d9ff2e625bba6a119c092d2a7af6cc4b0ab2fde
+  data.tar.gz: d44d29ed0238aac7fc0597d6e1ae83763b0e7cdb
 SHA512:
-  metadata.gz: 18ebbe7bfa3fdee1435909725be137bf7b895f55f3ad87620a5486697fa67f06afab79d117945643ecbe2e4b1f7e6d6c6e30364358763429beb2c720cfe1d9fe
-  data.tar.gz: 482aeffdbe884a29ca1e05b8a860e897845c9cc86d134c4512475c2039c5f448c2585298c444f89a0afd4f4b781bd4b3221c2deccb321eba6b68f1d178abe49e
+  metadata.gz: 7451b81a33709fd76494f97e883a11d97d30611c7854ae712082a06899a14ce8b9bca28fd14925e06c21a695f1d1df20a1d5044883519f0e34148dbcca5b1313
+  data.tar.gz: 15350ff60d906758aeb41f104102f89a28d60e713343c62023a907a389c3244db305f0237de2890cc13e9dce17810da67765cc90e56a008ed40ec86f4f51a580

data/.gitignore ADDED

@@ -0,0 +1,24 @@
+## MAC OS
+.DS_Store
+## TEXTMATE
+*.tmproj
+tmtags
+## EMACS
+*~
+\#*
+.\#*
+## VIM
+*.swp
+## IntelliJ/Rubymine
+.idea
+## PROJECT::GENERAL
+coverage
+rdoc
+pkg
+## PROJECT::SPECIFIC

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in postgresql_cursor.gemspec
+gemspec

data/Gemfile.lock ADDED

@@ -0,0 +1,41 @@
+PATH
+  remote: .
+  specs:
+    postgresql_cursor (0.5.0)
+      activerecord (>= 3.2.0)
+GEM
+  remote: https://rubygems.org/
+  specs:
+    activemodel (4.1.1)
+      activesupport (= 4.1.1)
+      builder (~> 3.1)
+    activerecord (4.1.1)
+      activemodel (= 4.1.1)
+      activesupport (= 4.1.1)
+      arel (~> 5.0.0)
+    activesupport (4.1.1)
+      i18n (~> 0.6, >= 0.6.9)
+      json (~> 1.7, >= 1.7.7)
+      minitest (~> 5.1)
+      thread_safe (~> 0.1)
+      tzinfo (~> 1.1)
+    arel (5.0.1.20140414130214)
+    builder (3.2.2)
+    i18n (0.6.9)
+    json (1.8.1)
+    minitest (5.3.3)
+    pg (0.17.1)
+    rake (10.3.1)
+    thread_safe (0.3.4)
+    tzinfo (1.2.1)
+      thread_safe (~> 0.1)
+PLATFORMS
+  ruby
+DEPENDENCIES
+  minitest
+  pg
+  postgresql_cursor!
+  rake

data/README.md ADDED

@@ -0,0 +1,185 @@
+#PostgreSQLCursor for handling large Result Sets
+[![Gem Version](https://badge.fury.io/rb/postgresql_cursor.svg)](http://badge.fury.io/rb/postgresql_cursor)
+PostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries
+returning a large number of rows, and allows you to sort your result set.
+In PostgreSQL, a
+[cursor](http://www.postgresql.org/docs/9.4/static/plpgsql-cursors.html)
+runs a query, from which you fetch a block of
+(say 1000) rows, process them, and continue fetching until the result
+set is exhausted. By fetching a smaller chunk of data, this reduces the
+amount of memory your application uses and prevents the potential crash
+of running out of memory.
+Version 0.5.0 has been refactored to install more smoothly into ActiveRecord.
+It supports Rails and ActiveRecord 3.2.x and up.
+##Use Cursors
+PostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program
+to declare a cursor to run a given query returning "chunks" of rows to the application program while
+retaining the position of the full result set in the database. This overcomes all the disadvantages
+of using find_each and find_in_batches.
+Also, with PostgreSQL, you have an option to have raw hashes of the row returned instead of the
+instantiated models. An informal benchmark showed that returning instances is about 4 times
+slower than returning hashes. If you can work with the data in this form, you will find better
+performance.
+With PostgreSQL, you can work with cursors as follows:
+```ruby
+Product.where("id>0").order("name").each_row { |hash| Product.process(hash) }
+Product.where("id>0").each_instance { |product| product.process! }
+Product.where("id>0").each_instance(block_size:100_000) { |product| product.process }
+Product.each_row { |hash| Product.process(hash) }
+Product.each_instance { |product| product.process }
+Product.each_row_by_sql("select * from products") { |hash| Product.process(hash) }
+Product.each_instance_by_sql("select * from products") { |product| product.process }
+```
+###PostgreSQLCursor is an Enumerable
+If you do not pass in a block, the cursor is returned, which mixes in the Enumerable
+library. With that, you can pass it around, or chain in the awesome enumerable things
+like `map` and `reduce`. Furthermore, the cursors already act as `lazy`, but you can
+also chain in `lazy` when you want to keep the memory footprint small for the rest of the process.
+```ruby
+Product.each_row.map {|r| r["id"].to_i } #=> [1, 2, 3, ...]
+Product.each_instance.map {|r| r.id }.each {|id| p id } #=> [1, 2, 3, ...]
+Product.each_instance.lazy.inject(0) {|sum,r| sum +  r.quantity } #=> 499500
+```
+All these methods take an options hash to control things more:
+    block_size:n      The number of rows to fetch from the database each time (default 1000)
+    while:value       Continue looping as long as the block returns this value
+    until:value       Continue looping until the block returns this value
+    connection:conn   Use this connection instead of the current Product connection
+    fraction:float    A value to set for the cursor_tuple_fraction variable.
+                      PostgreSQL uses 0.1 (optimize for 10% of result set)
+                      This library uses 1.0 (Optimize for 100% of the result set)
+                      Do not override this value unless you understand it.
+Notes:
+* Use cursors *only* for large result sets. They have more overhead with the database
+  than ActiveRecord selecting all matching records.
+* Aliases each_hash and each_hash_by_sql are provided for each_row and each_row_by_sql
+  if you prefer to express what types are being returned.
+###Hashes vs. Instances
+The each_row method returns the Hash of strings for speed (as this allows you to process a lot of rows).
+Hashes are returned with String values, and you must take care of any type conversion.
+When you use each_instance, ActiveRecord lazily casts these strings into
+Ruby types (Time, Fixnum, etc.) only when you read the attribute.
+If you find you need the types cast for your attributes, consider using each_instance
+instead. ActiveRecord's read casting algorithm will only cast the values you need and
+has become more efficient over time.
+###Select and Pluck
+To limit the columns returned to just those you need, use `.select(:id, :name)`
+query method.
+```ruby
+Product.select(:id, :name).each_row { |hash| Product.process(hash) }
+```
+Pluck is a great alternative to using a cursor. It does not instantiate
+the row; it builds an array of result values and translates them into Ruby
+values (numbers, Timestamps, etc.). Using the cursor would still allow you to lazy
+load them in batches for very large sets.
+You can also use the `pluck_rows` or `pluck_instances` if the results
+won't eat up too much memory.
+```ruby
+Product.newly_arrived.pluck(:id) #=> [1, 2, 3, ...]
+Product.newly_arrived.each_row { |hash| }
+Product.select(:id).each_row.map {|r| r["id"].to_i } # cursor instead of pluck
+Product.pluck_rows(:id) #=> ["1", "2", ...]
+Product.pluck_instances(:id, :quantity) #=> [[1, 503], [2, 932], ...]
+```
+###Associations and Eager Loading
+ActiveRecord performs some magic when eager-loading associated rows. It
+will usually not join the tables, and prefers to load the data in
+separate queries.
+This library hooks onto the `to_sql` feature of the query builder. As a
+result, it can't do the join if ActiveRecord decided not to join, nor
+can it construct the association objects eagerly.
+##Background: Why PostgreSQL Cursors?
+ActiveRecord is designed and optimized for web performance. In a web transaction, only a "page" of
+around 20 rows is returned to the user. When you do this
+```ruby
+Product.find_each { |product| product.process }
+```
+The database returns all matching result set rows to ActiveRecord, which instantiates each row with
+the data returned. This function returns an array of all these rows to the caller.
+Asynchronous, Background, or Offline processing may require processing a large amount of data.
+When there is a very large number of rows, this requires a lot more memory to hold the data. Ruby
+does not return that memory after processing the array, and this causes your process to "bloat". If you
+don't have enough memory, it will cause an exception.
+###ActiveRecord.find_each and find_in_batches
+To solve this problem, ActiveRecord gives us two alternative methods that work in "chunks" of your data:
+```ruby
+Product.where("id>0").find_each { |model| Product.process }
+Product.where("id>0").find_in_batches do |batch|
+  batch.each { |model| Product.process }
+end
+```
+Optionally, you can specify a :batch_size option as the size of the "chunk"; it defaults to 1000.
+There are drawbacks with these methods:
+* You cannot specify the order, it will be ordered by the primary key (usually id)
+* The primary key must be numeric
+* The query is rerun for each chunk (1000 rows), starting at the next id sequence.
+* You cannot use overly complex queries as that will be rerun and incur more overhead.
+##Meta
+###Author
+Allen Fair, [@allenfair](https://twitter.com/allenfair), http://github.com/afair
+Thanks to:
+* Iulian Dogariu, http://github.com/iulianu (Fixes)
+* Julian Mehnle, julian@mehnle.net (Suggestions)
+* ...And all the other contributors!
+###Note on Patches/Pull Requests
+* Fork the project.
+* Make your feature addition or bug fix.
+* Add tests for it. This is important so I don't break it in a
+  future version unintentionally.
+* Commit, do not mess with rakefile, version, or history.
+  (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+* Send me a pull request. Bonus points for topic branches.
+###Copyright
+Copyright (c) 2010-2014 Allen Fair. See (MIT) LICENSE for details.
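
Note on the README above: it describes a cursor as fetching a block of rows, processing them, and continuing until the result set is exhausted, and it says the returned cursor mixes in Enumerable. The following is a minimal pure-Ruby sketch of that buffering pattern only. `FakeCursor`, its in-memory `rows` array, and the `fetches` counter are hypothetical stand-ins for a database cursor; none of them are part of the gem.

```ruby
# Hypothetical stand-in for a database cursor: hands out rows block by block.
# Nothing here talks to PostgreSQL; the array slice plays the role of FETCH.
class FakeCursor
  include Enumerable

  attr_reader :fetches

  def initialize(rows, block_size: 1000)
    @rows = rows
    @block_size = block_size
    @fetches = 0
  end

  # Yields one row at a time, pulling the next block only after the
  # previous one has been fully consumed.
  def each
    return enum_for(:each) unless block_given?

    offset = 0
    loop do
      block = @rows[offset, @block_size] || []
      @fetches += 1
      break if block.empty?

      block.each { |row| yield row }
      offset += block.size
    end
    self
  end
end

# Rows arrive as hashes of strings, as the README describes for each_row.
rows = (1..10).map { |i| { "id" => i.to_s } }

cursor = FakeCursor.new(rows, block_size: 4)
ids = cursor.map { |r| r["id"].to_i }

# Stopping early means later blocks are never requested.
partial_cursor = FakeCursor.new(rows, block_size: 4)
first_three = partial_cursor.first(3)
```

With ten rows and a block size of four, the full pass needs four fetches (three with data, one empty fetch that signals the end), while `first(3)` stops after a single fetch. That early stop is the property that keeps memory bounded when only part of a large result set is consumed.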

data/Rakefile CHANGED

@@ -1,53 +1,24 @@
-require 'rubygems'
-require 'rake'
+require "bundler/gem_tasks"
+require "bundler/setup"
+require 'rake/testtask'
-begin
-  require 'jeweler'
-  Jeweler::Tasks.new do |gem|
-    gem.name = "postgresql_cursor"
-    gem.summary = %Q{ActiveRecord PostgreSQL Adapter extension for using a cursor to return a large result set}
-    gem.description = %Q{PostgreSQL Cursor is an extension to the ActiveRecord PostgreSQLAdapter for very large result sets. It provides a cursor open/fetch/close interface to access data without loading all rows into memory, and instead loads the result rows in "chunks" (default of 10_000 rows), buffers them, and returns the rows one at a time.}
-    gem.email = "allen.fair@gmail.com"
-    gem.homepage = "http://github.com/afair/postgresql_cursor"
-    gem.authors = ["Allen Fair"]
-    gem.add_dependency 'activerecord'
-    # gem is a Gem::Specification... see http://www.rubygems.org/read/chapter/20 for additional settings
-  end
-  Jeweler::GemcutterTasks.new
-rescue LoadError
-  puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
-end
+task :default => :test
-require 'rake/testtask'
-Rake::TestTask.new(:test) do |test|
-  test.libs << 'lib' << 'test'
-  test.pattern = 'test/**/test_*.rb'
-  test.verbose = true
+desc "Run the Test Suite, toot suite"
+task :test do
+  sh "ruby test/test_*"
 end
-begin
-  require 'rcov/rcovtask'
-  Rcov::RcovTask.new do |test|
-    test.libs << 'test'
-    test.pattern = 'test/**/test_*.rb'
-    test.verbose = true
-  end
-rescue LoadError
-  task :rcov do
-    abort "RCov is not available. In order to run rcov, you must: sudo gem install spicycode-rcov"
-  end
+desc "Open an IRB Console with the gem and test-app loaded"
+task :console do
+  sh "bundle exec irb  -Ilib -I . -r postgresql_cursor -r test-app/app"
+  #require 'irb'
+  #ARGV.clear
+  #IRB.start
 end
-task :test => :check_dependencies
-task :default => :test
-require 'rdoc/task'
-Rake::RDocTask.new do |rdoc|
-  version = File.exist?('VERSION') ? File.read('VERSION') : ""
-  rdoc.rdoc_dir = 'rdoc'
-  rdoc.title = "postgresql_cursor #{version}"
-  rdoc.rdoc_files.include('README*')
-  rdoc.rdoc_files.include('lib/**/*.rb')
+desc "Setup testing database and table"
+task :setup do
+  sh %q(createdb postgresql_cursor_test)
+  sh %Q<echo "create table products ( id serial primary key);" | psql postgresql_cursor_test>
 end

data/lib/postgresql_cursor.rb CHANGED

@@ -1,180 +1,12 @@
-# PostgreSQLCursor: library class provides postgresql cursor for large result
-# set processing. Requires ActiveRecord, but can be adapted to other DBI/ORM libraries.
-# If you don't use AR, this assumes #connection and #instantiate methods are available.
-#
-# options     - Hash to control operation and loop breaks
-#   connection: instance  - ActiveRecord connection to use
-#   fraction: 0.1..1.0    - The cursor_tuple_fraction (default 1.0)
-#   block_size: 1..n      - The number of rows to fetch per db block fetch
-#   while: value          - Exits loop when block does not return this value.
-#   until: value          - Exits loop when block returns this value.
-#
-# Exmaples:
-#   PostgreSQLCursor.new("select ...").each { |hash| ... }
-#   ActiveRecordModel.where(...).each_row { |hash| ... }
-#   ActiveRecordModel.each_row_by_sql("select ...") { |hash| ... }
-#   ActiveRecordModel.each_instance_by_sql("select ...") { |model| ... }
-#
-class PostgreSQLCursor
-  include Enumerable
-  attr_reader :sql, :options, :connection, :count, :result
-  @@cursor_seq = 0
-  # Public: Start a new PostgreSQL cursor query
-  # sql     - The SQL statement with interpolated values
-  # options - hash of processing controls
-  #   while: value    - Exits loop when block does not return this value.
-  #   until: value    - Exits loop when block returns this value.
-  #   fraction: 0.1..1.0    - The cursor_tuple_fraction (default 1.0)
-  #   block_size: 1..n      - The number of rows to fetch per db block fetch
-  #                           Defaults to 1000
-  #
-  # Examples
-  #
-  #   PostgreSQLCursor.new("select ....")
-  #
-  # Returns the cursor object when called with new.
-  def initialize(sql, options={})
-    @sql        = sql
-    @options    = options
-    @connection = @options.fetch(:connection) { ActiveRecord::Base.connection }
-    @count      = 0
-  end
-  # Public: Yields each row of the result set to the passed block
-  #
-  #
-  # Yields the row to the block. The row is a hash with symbolized keys.
-  #   {colname: value, ....}
-  #
-  # Returns the count of rows processed
-  def each(&block)
-    has_do_until = @options.has_key?(:until)
-    has_do_while = @options.has_key?(:while)
-    @count      = 0
-    @connection.transaction do
-      begin
-        open
-        while (row = fetch) do
-          break if row.size==0
-          @count += 1
-          row = row.symbolize_keys
-          rc = yield row
-          # TODO: Handle exceptions raised within block
-          break if has_do_until && rc == @options[:until]
-          break if has_do_while && rc != @options[:while]
-        end
-      rescue Exception => e
-        raise e
-      ensure
-        close
-      end
-    end
-    @count
-  end
-  # Public: Opens (actually, "declares") the cursor. Call this before fetching
-  def open
-    set_cursor_tuple_fraction
-    @cursor = @@cursor_seq += 1
-    @result = @connection.execute("declare cursor_#{@cursor} cursor for #{@sql}")
-    @block = []
-  end
-  # Public: Returns the next row from the cursor, or empty hash if end of results
-  #
-  # Returns a row as a hash of {'colname'=>value,...}
-  def fetch
-    fetch_block if @block.size==0
-    @block.shift
-  end
-  # Private: Fetches the next block of rows into @block
-  def  fetch_block(block_size=nil)
-    block_size ||= @block_size ||= @options.fetch(:block_size) { 1000 }
-    @result = @connection.execute("fetch #{block_size} from cursor_#{@cursor}")
-    @block = @result.collect {|row| row } # Make our own
-  end
-  # Public: Closes the cursor
-  def close
-    @connection.execute("close cursor_#{@cursor}")
-  end
-  # Private: Sets the PostgreSQL cursor_tuple_fraction value = 1.0 to assume all rows will be fetched
-  # This is a value between 0.1 and 1.0 (PostgreSQL defaults to 0.1, this library defaults to 1.0)
-  # used to determine the expected fraction (percent) of result rows returned the the caller.
-  # This value determines the access path by the query planner.
-  def set_cursor_tuple_fraction(frac=1.0)
-    @cursor_tuple_fraction ||= @options.fetch(:fraction) { 1.0 }
-    return @cursor_tuple_fraction if frac == @cursor_tuple_fraction
-    @cursor_tuple_fraction = frac
-    @result = @connection.execute("set cursor_tuple_fraction to  #{frac}")
-    frac
-  end
-end
-# Defines extension to ActiveRecord to use this library
-class ActiveRecord::Base
-  # Public: Returns each row as a hash to the given block
-  #
-  # sql         - Full SQL statement, variables interpolated
-  # options     - Hash to control
-  #   fraction: 0.1..1.0    - The cursor_tuple_fraction (default 1.0)
-  #   block_size: 1..n      - The number of rows to fetch per db block fetch
-  #   while: value          - Exits loop when block does not return this value.
-  #   until: value          - Exits loop when block returns this value.
-  #
-  # Returns the number of rows yielded to the block
-  def self.each_row_by_sql(sql, options={}, &block)
-    options = {:connection => self.connection}.merge(options)
-    PostgreSQLCursor.new(sql, options).each(&block)
-  end
-  # Public: Returns each row as a model instance to the given block
-  # As this instantiates a model object, it is slower than each_row_by_sql
-  #
-  # Paramaters: see each_row_by_sql
-  #
-  # Returns the number of rows yielded to the block
-  def self.each_instance_by_sql(sql, options={}, &block)
-    options = {:connection => self.connection}.merge(options)
-    PostgreSQLCursor.new(sql, options).each do |row|
-      model = instantiate(row)
-      yield model
-    end
-  end
-end
-# Defines extension to ActiveRecord/AREL to use this library
-class ActiveRecord::Relation
-  # Public: Executes the query, returning each row as a hash
-  # to the given block.
-  #
-  # options     - Hash to control
-  #   fraction: 0.1..1.0    - The cursor_tuple_fraction (default 1.0)
-  #   block_size: 1..n      - The number of rows to fetch per db block fetch
-  #   while: value          - Exits loop when block does not return this value.
-  #   until: value          - Exits loop when block returns this value.
-  #
-  # Returns the number of rows yielded to the block
-  def each_row(options={}, &block)
-    options = {:connection => self.connection}.merge(options)
-    PostgreSQLCursor.new(to_sql, options).each(&block)
-  end
-  # Public: Like each_row, but returns an instantiated model object to the block
-  #
-  # Paramaters: same as each_row
-  #
-  # Returns the number of rows yielded to the block
-  def each_instance(options={}, &block)
-    options = {:connection => self.connection}.merge(options)
-    PostgreSQLCursor.new(to_sql, options).each do |row|
-      model = instantiate(row)
-      block.call model
-    end
-  end
-end
+require 'postgresql_cursor/version'
+require 'postgresql_cursor/cursor'
+require 'postgresql_cursor/active_record/relation/cursor_iterators'
+require 'postgresql_cursor/active_record/sql_cursor'
+require 'postgresql_cursor/active_record/connection_adapters/postgresql_type_map'
+# ActiveRecord 4.x
+require 'active_record'
+require 'active_record/connection_adapters/postgresql_adapter'
+ActiveRecord::Base.extend(PostgreSQLCursor::ActiveRecord::SqlCursor)
+ActiveRecord::Relation.send(:include, PostgreSQLCursor::ActiveRecord::Relation::CursorIterators)
+ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.send(:include, PostgreSQLCursor::ActiveRecord::ConnectionAdapters::PostgreSQLTypeMap)
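
Note on the three hook-up lines that close the new `lib/postgresql_cursor.rb`: they rely on the difference between `extend`, which adds a module's methods to the receiver itself (class-level methods when the receiver is a class), and `include`, which adds them to the class's instances. The sketch below illustrates that difference with hypothetical stand-in classes and modules (`BaseSketch`, `RelationSketch`, `SqlCursorSketch`, `CursorIteratorsSketch`); it does not load ActiveRecord and does not reproduce the gem's method bodies.

```ruby
# Hypothetical module that plays the role of a class-level extension.
module SqlCursorSketch
  def each_row_by_sql(sql)
    "class-level: #{sql}"
  end
end

# Hypothetical module that plays the role of an instance-level extension.
# It assumes the including class provides #to_sql.
module CursorIteratorsSketch
  def each_row
    "instance-level: #{to_sql}"
  end
end

# Stand-in for a model base class.
class BaseSketch; end

# Stand-in for a relation object that knows its own SQL.
class RelationSketch
  def to_sql
    "select * from products"
  end
end

# extend: the module's methods become singleton methods of the class.
BaseSketch.extend(SqlCursorSketch)

# send(:include, ...): the module's methods become instance methods.
# Going through send works even where Module#include is private (Ruby < 2.1).
RelationSketch.send(:include, CursorIteratorsSketch)

class_result    = BaseSketch.each_row_by_sql("select 1")
instance_result = RelationSketch.new.each_row
```

Splitting the behaviour into modules and attaching them this way, rather than reopening the ActiveRecord classes as 0.4.3 did, keeps each extension in its own file and makes it visible in the class's ancestor chain.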