pupa 0.1.4 → 0.1.5
- checksums.yaml +4 -4
- data/PERFORMANCE.md +129 -0
- data/README.md +5 -131
- data/lib/pupa/processor/client.rb +5 -2
- data/lib/pupa/processor.rb +4 -2
- data/lib/pupa/runner.rb +23 -13
- data/lib/pupa/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d794649266b975f92ee8ff502a3de21390dc540b
+  data.tar.gz: 59b89a81274d35ee848d944da9f4337295ab8567
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 19582ce0e29e5a9ad52d4dabb216664d418fe738b4bf9b534a51b161eae209616df8bc87138a5f34253dace8c657ae21c0a2684024258dc918a94dfc63476a68
+  data.tar.gz: 3464169a23f255de3e4b357e245135fc571bc11642a1929d448a88fc1382e5418ed96b3214317c30bae90cb535f90ab6bd4a2bd94d1403ce840a1ccbf6cad734
data/PERFORMANCE.md
ADDED
@@ -0,0 +1,129 @@
# Pupa.rb: A Data Scraping Framework

## Performance

Pupa.rb offers several ways to significantly improve performance.

In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.

The `import` action's performance is currently limited by the database when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.

### Reducing HTTP requests

HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:

    ruby cat.rb --cache_dir /tmp/my_cache_dir

### Parallelizing HTTP requests

To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:

```ruby
require 'pupa'
require 'typhoeus'
require 'typhoeus/adapters/faraday'
```

Then, in your scraping methods, write code like:

```ruby
responses = []

# Change the maximum number of concurrent requests (default 200). You usually
# need to tweak this number by trial and error.
# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
manager = Typhoeus::Hydra.new(max_concurrency: 20)

begin
  # Send HTTP requests in parallel.
  client.in_parallel(manager) do
    responses << client.get('http://example.com/foo')
    responses << client.get('http://example.com/bar')
    # More requests...
  end
rescue Faraday::Error::ClientError => e
  # Log an error message if, for example, you exceed a server's maximum number
  # of concurrent connections or if you exceed an API's rate limit.
  error(e.response.inspect)
end

# Responses are now available for use.
responses.each do |response|
  # Only process the finished responses.
  if response.success?
    # If success...
  elsif response.finished?
    # If error...
  end
end
```

### Reducing disk I/O

After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.

#### RAM file systems

A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:

    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
    diskutil erasevolume HFS+ 'ramdisk' $ramdisk

You can then set the `output_dir` and `cache_dir` on OS X as:

    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache

Once you are done with the RAM disk, release the memory:

    diskutil unmount $ramdisk
    hdiutil detach $ramdisk
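On Linux, the equivalent with `tmpfs` is a plain mount; a sketch, with a placeholder mount point and size:

    mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=128m tmpfs /mnt/ramdisk
    ruby cat.rb --output_dir /mnt/ramdisk/scraped_data --cache_dir /mnt/ramdisk/web_cache
    sudo umount /mnt/ramdisk  # release the memory when done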
Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.

#### Memcached

You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:

    ruby cat.rb --cache_dir memcached://localhost:11211

The data in Memcached will be lost between reboots.
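Pupa.rb 0.1.5 adds `--memcached_username` and `--memcached_password` switches (see the `runner.rb` and `client.rb` changes below), so a Memcached server that requires authentication can also be used as the cache; the credentials here are placeholders:

    ruby cat.rb --cache_dir memcached://localhost:11211 --memcached_username USERNAME --memcached_password PASSWORD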
#### Redis

You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:

    ruby cat.rb --output_dir redis://localhost:6379/0

To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):

    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
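Conceptually, `--pipelined` batches the Redis writes instead of paying a network round trip per document. A rough sketch of the idea using the `redis` gem directly (not Pupa.rb's actual implementation; `documents` is a placeholder for the scraped documents):

```ruby
require 'json'
require 'redis'

documents = {} # placeholder: JSON document ID => hash pairs from the scrape
redis = Redis.new(url: 'redis://localhost:6379/0')

# Queue every SET command and send the whole batch in one round trip.
redis.pipelined do
  documents.each do |id, document|
    redis.set(id, JSON.dump(document))
  end
end
```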
Requiring the `hiredis` gem will slightly improve performance.

Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database between Pupa.rb and other applications. You can select a database other than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
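For example, to keep Pupa.rb's output in database `15` rather than `0`:

    ruby cat.rb --output_dir redis://localhost:6379/15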
### Skipping validation

The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
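For example, to scrape without validating:

    ruby cat.rb --no-validate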
The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem cut the time to validate 10,000 documents in half.

### Ruby version

Pupa.rb requires Ruby 2.x. If you have already made all of the above optimizations, you may notice a significant improvement by using Ruby 2.1, which has better garbage collection than Ruby 2.0.

### Profiling

You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:

    gem install perftools.rb

Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):

    CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb

You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.
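If you do want to sample in real time instead of CPU time, add the flag to the same invocation (with the caveat above):

    CPUPROFILE_REALTIME=1 CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb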
[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straightforward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):

    pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
data/README.md
CHANGED
@@ -69,7 +69,7 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
 
 JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
 
-
+## [OpenCivicData](http://opencivicdata.org/) compatibility
 
 Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
 
@@ -81,138 +81,8 @@ require 'pupa/refinements/opencivicdata'
 
 It is not currently possible to run the `scrape` action with one of Pupa.rb and Pupa, and to then run the `import` action with the other. Both actions must be run by the same library.
 
-## Performance
-
-Pupa.rb offers several ways to significantly improve performance.
-
-In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
-
-The `import` action's performance is currently limited by MongoDB when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.
-
-### Reducing HTTP requests
-
-HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:
-
-    ruby cat.rb --cache_dir /tmp/my_cache_dir
-
-### Parallelizing HTTP requests
-
-To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:
-
-```ruby
-require 'pupa'
-require 'typhoeus'
-require 'typhoeus/adapters/faraday'
-```
-
-Then, in your scraping methods, write code like:
-
-```ruby
-responses = []
-
-# Change the maximum number of concurrent requests (default 200). You usually
-# need to tweak this number by trial and error.
-# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
-manager = Typhoeus::Hydra.new(max_concurrency: 20)
-
-begin
-  # Send HTTP requests in parallel.
-  client.in_parallel(manager) do
-    responses << client.get('http://example.com/foo')
-    responses << client.get('http://example.com/bar')
-    # More requests...
-  end
-rescue Faraday::Error::ClientError => e
-  # Log an error message if, for example, you exceed a server's maximum number
-  # of concurrent connections or if you exceed an API's rate limit.
-  error(e.response.inspect)
-end
-
-# Responses are now available for use.
-responses.each do |response|
-  # Only process the finished responses.
-  if response.success?
-    # If success...
-  elsif response.finished?
-    # If error...
-  end
-end
-```
-
-### Reducing disk I/O
-
-After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.
-
-#### RAM file systems
-
-A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:
-
-    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
-    diskutil erasevolume HFS+ 'ramdisk' $ramdisk
-
-You can then set the `output_dir` and `cache_dir` on OS X as:
-
-    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache
-
-Once you are done with the RAM disk, release the memory:
-
-    diskutil unmount $ramdisk
-    hdiutil detach $ramdisk
-
-Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.
-
-#### Memcached
-
-You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:
-
-    ruby cat.rb --cache_dir memcached://localhost:11211
-
-The data in Memcached will be lost between reboots.
-
-#### Redis
-
-You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:
-
-    ruby cat.rb --output_dir redis://localhost:6379/0
-
-To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):
-
-    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
-
-Requiring the `hiredis` gem will slightly improve performance.
-
-Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
-
-### Skipping validation
-
-The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
-
-The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem reduced by half the time to validate 10,000 documents.
-
-### Ruby version
-
-Pupa.rb requires Ruby 2.x. If you have already made all the above optimizations, you may notice a significant improvement by using Ruby 2.1, which has better garbage collection than Ruby 2.0.
-
-### Profiling
-
-You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
-
-    gem install perftools.rb
-
-Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):
-
-    CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb
-
-You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.
-
-[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straight-forward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):
-
-    pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
-
 ## Integration with ODMs
 
-### Mongoid
-
 `Pupa::Model` is incompatible with `Mongoid::Document`. Don't do this:
 
 ```ruby
@@ -224,6 +94,10 @@ end
 
 Instead, have a scraping model that includes `Pupa::Model` and an app model that includes `Mongoid::Document`.
 
+## Performance
+
+Pupa.rb offers several ways to significantly improve performance. [Read the documentation.](https://github.com/opennorth/pupa-ruby/blob/master/PERFORMANCE.md#readme)
+
 ## Testing
 
 **DO NOT** run this gem's specs if you are using Redis database number 15 on `localhost`!
data/lib/pupa/processor/client.rb
CHANGED
@@ -30,9 +30,12 @@ module Pupa
     # (e.g. `memcached://localhost:11211`) in which to cache requests
     # @param [Integer] expires_in the cache's expiration time in seconds
     # @param [Integer] value_max_bytes the maximum Memcached item size
+    # @param [String] memcached_username the Memcached username
+    # @param [String] memcached_password the Memcached password
     # @param [String] level the log level
+    # @param [String,IO] logdev the log device
     # @return [Faraday::Connection] a configured Faraday HTTP client
-    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, level: 'INFO') # 1 day
+    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, level: 'INFO', logdev: STDOUT) # 1 day
      Faraday.new do |connection|
        connection.request :url_encoded
        connection.use Middleware::Logger, Logger.new('faraday', level: level)
@@ -59,7 +62,7 @@ module Pupa
        connection.response :caching do
          address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
          if address
-           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes))
+           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes), username: memcached_username, password: memcached_password)
          else
            ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
          end
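For illustration, a sketch of the widened signature at a direct call site (normally `Pupa::Processor` passes these options through from the runner; the `Pupa::Processor::Client` constant is inferred from the file path and the credential values are placeholders):

```ruby
require 'pupa'
require 'dalli' # needed for the memcached:// cache backend

client = Pupa::Processor::Client.new(
  cache_dir: 'memcached://localhost:11211',
  expires_in: 86400,              # 1 day
  memcached_username: 'USERNAME', # new in 0.1.5
  memcached_password: 'PASSWORD', # new in 0.1.5
  level: 'INFO',
  logdev: STDOUT                  # new in 0.1.5
)
```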
data/lib/pupa/processor.rb
CHANGED
@@ -25,14 +25,16 @@ module Pupa
     # (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
     # @param [Integer] expires_in the cache's expiration time in seconds
     # @param [Integer] value_max_bytes the maximum Memcached item size
+    # @param [String] memcached_username the Memcached username
+    # @param [String] memcached_password the Memcached password
     # @param [String] database_url the database URL
     # @param [Boolean] validate whether to validate JSON documents
     # @param [String] level the log level
     # @param [String,IO] logdev the log device
     # @param [Hash] options criteria for selecting the methods to run
-    def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
+    def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
       @store = DocumentStore.new(output_dir, pipelined: pipelined)
-      @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, level: level)
+      @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, memcached_username: memcached_username, memcached_password: memcached_password, level: level, logdev: logdev)
       @connection = Connection.new(database_url)
       @logger = Logger.new('pupa', level: level, logdev: logdev)
       @validate = validate
data/lib/pupa/runner.rb
CHANGED
@@ -11,17 +11,19 @@ module Pupa
     @processor_class = processor_class
 
     @options = OpenStruct.new({
-      actions:
-      tasks:
-      output_dir:
-      pipelined:
-      cache_dir:
-      expires_in:
-      value_max_bytes:
-
-
-
-
+      actions: [],
+      tasks: [],
+      output_dir: File.expand_path('scraped_data', Dir.pwd),
+      pipelined: false,
+      cache_dir: File.expand_path('web_cache', Dir.pwd),
+      expires_in: 86400, # 1 day
+      value_max_bytes: 1048576, # 1 MB
+      memcached_username: nil,
+      memcached_password: nil,
+      database_url: 'mongodb://localhost:27017/pupa',
+      validate: true,
+      level: 'INFO',
+      dry_run: false,
     }.merge(defaults))
 
     @actions = {
@@ -86,7 +88,13 @@
      opts.on('--value_max_bytes BYTES', "The maximum Memcached item size") do |v|
        options.value_max_bytes = v
      end
-      opts.on('
+      opts.on('--memcached_username USERNAME', "The Memcached username") do |v|
+        options.memcached_username = v
+      end
+      opts.on('--memcached_password USERNAME', "The Memcached password") do |v|
+        options.memcached_password = v
+      end
+      opts.on('-d', '--database_url', 'The database URL (e.g. mongodb://USER:PASSWORD@localhost:27017/pupa or postgres://USER:PASSWORD@localhost:5432/pupa') do |v|
        options.database_url = v
      end
      opts.on('--[no-]validate', 'Validate JSON documents') do |v|
@@ -147,6 +155,8 @@
       cache_dir: options.cache_dir,
       expires_in: options.expires_in,
       value_max_bytes: options.value_max_bytes,
+      memcached_username: options.memcached_username,
+      memcached_password: options.memcached_password,
       database_url: options.database_url,
       validate: options.validate,
       level: options.level,
@@ -165,7 +175,7 @@
     end
 
     if options.level == 'DEBUG'
-      %w(output_dir pipelined cache_dir expires_in value_max_bytes database_url validate level).each do |option|
+      %w(output_dir pipelined cache_dir expires_in value_max_bytes memcached_username memcached_password database_url validate level).each do |option|
        puts "#{option}: #{options[option]}"
      end
      unless rest.empty?
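For illustration, a sketch of a scraper script that overrides some of these defaults when constructing the runner. `CatProcessor` is a hypothetical processor class and the `run(ARGV)` entry point is assumed; defaults passed to `Runner.new` are merged into the hash above and can still be overridden on the command line (e.g. `--memcached_username USERNAME`):

```ruby
require 'pupa'

class CatProcessor < Pupa::Processor
  # scraping and importing tasks go here
end

runner = Pupa::Runner.new(CatProcessor, {
  expires_in: 604800, # cache HTTP responses for a week instead of a day
  level: 'WARN',
})
runner.run(ARGV)
```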
data/lib/pupa/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.1.4
+  version: 0.1.5
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-07-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -288,6 +288,7 @@ files:
 - ".yardopts"
 - Gemfile
 - LICENSE
+- PERFORMANCE.md
 - README.md
 - Rakefile
 - USAGE