pupa 0.0.7 → 0.0.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: c06be9f93589b34b71f5b7037ab75cecbfaa43da
- data.tar.gz: 44b4ab787cc85352b84d6edf8d502adc8a7d6b79
+ metadata.gz: d4ec7210671485a2de58673a70088e415a9767b7
+ data.tar.gz: 8b2a77e3fe3c5775509fef59fb847ea2057f63cb
  SHA512:
- metadata.gz: 5d605b3a6e98b3bdf94161e88904653b7e6facdf25ffff29d0d8d5170c4066196772d2c7d9c2736d2789325a66715423078a1b3761a7a41b065f366eba379735
- data.tar.gz: db7166154329a552a9a8ed9d8c1486077536590cff961173f64a595513d4cdfb522db47c11a56d9d5c9faa65a9fe51d745defb4e7d0033f84484b5a8d256bf53
+ metadata.gz: b3cdcf2da535ebd8d840fe2a1f6e6dd0db68de4d92b728f6f33bf7ba80f7499331f6cc9ace81da93b7e1062e7f3b8389fbab1891d81fbc44d1c48cbbc3be8eea
+ data.tar.gz: 641956572610184f0437f0869e2ee9da3c624e493284ac3d4e0e88157395c4b8d41840ac4f1d351b63032aff1fde091862a2b83401c5ee55c1d83ac52a856a70
data/.travis.yml CHANGED
@@ -3,3 +3,4 @@ rvm:
  - 2.0.0
  services:
  - mongodb
+ - redis
data/README.md CHANGED
@@ -45,6 +45,56 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
45
45
 
46
46
  1. You may want more control over the method used to perform a scraping task. For example, a legislature may publish legislators before 1997 in one format and legislators after 1997 in another format. In this case, you may want to select the method used to scrape legislators according to the year. See [legislator.rb](http://opennorth.github.io/pupa-ruby/docs/legislator.html).
47
47
 
48
+ ## Performance
49
+
50
+ Pupa.rb offers several ways to significantly improve performance.
51
+
52
+ In an example case, reducing file I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
53
+
54
+ The `import` action's performance (when using a dependency graph) is currently limited by MongoDB.
55
+
56
+ ### Caching HTTP requests
57
+
58
+ HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:
59
+
60
+ ruby cat.rb --cache_dir my_cache_dir
61
+
62
+ ### Reducing file I/O
63
+
64
+ After HTTP requests, file I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk. You may store HTTP responses in [Memcached](http://memcached.org/) like so:
65
+
66
+ ruby cat.rb --cache_dir memcached://localhost:11211
67
+
68
+ And you may store JSON documents in [Redis](http://redis.io/) like so:
69
+
70
+ ruby cat.rb --output_dir redis://localhost:6379/0
71
+
72
+ Note that Pupa.rb flushes the JSON documents before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/1`, where `1` is the Redis database number.
73
+
74
+ ### Skipping validation
75
+
76
+ The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
77
+
78
+ ### Profiling
79
+
80
+ You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
81
+
82
+ gem install perftools.rb
83
+
84
+ Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):
85
+
86
+ CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb
87
+
88
+ You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to change the behavior of the `json-schema` gem, for whatever reason.
89
+
90
+ [perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straightforward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):
91
+
92
+ pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
93
+
94
+ ## Testing
95
+
96
+ **DO NOT** run this gem's specs if you are using Redis database number 15 on `localhost`!
97
+
48
98
  ## Bugs? Questions?
49
99
 
50
100
  This project's main repository is on GitHub: [http://github.com/opennorth/pupa-ruby](http://github.com/opennorth/pupa-ruby), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
data/lib/pupa.rb CHANGED
@@ -1,3 +1,4 @@
+ require 'fileutils'
  require 'forwardable'
  
  require 'active_support/concern'
@@ -84,9 +84,9 @@ module Pupa
84
84
  self.json_schema = if Hash === value
85
85
  value
86
86
  elsif Pathname.new(value).absolute?
87
- value
87
+ File.read(value)
88
88
  else
89
- File.expand_path(File.join('..', '..', '..', 'schemas', "#{value}.json"), __dir__)
89
+ File.read(File.expand_path(File.join('..', '..', '..', 'schemas', "#{value}.json"), __dir__))
90
90
  end
91
91
  end
92
92
  end
@@ -164,7 +164,7 @@ module Pupa
164
164
  # @raises [JSON::Schema::ValidationError] if the object is invalid
165
165
  def validate!
166
166
  if self.class.json_schema
167
- # JSON::Validator#initialize_data runs fastest if given a hash.
167
+ # JSON::Validator#initialize_schema runs fastest if given a hash.
168
168
  JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h))
169
169
  end
170
170
  end
@@ -6,8 +6,12 @@ require 'pupa/processor/client'
6
6
  require 'pupa/processor/dependency_graph'
7
7
  require 'pupa/processor/helper'
8
8
  require 'pupa/processor/persistence'
9
+ require 'pupa/processor/document_store'
9
10
  require 'pupa/processor/yielder'
10
11
 
12
+ require 'pupa/processor/document_store/file_store'
13
+ require 'pupa/processor/document_store/redis_store'
14
+
11
15
  module Pupa
12
16
  # An abstract processor class from which specific processors inherit.
13
17
  class Processor
@@ -17,23 +21,26 @@ module Pupa
17
21
  class_attribute :tasks
18
22
  self.tasks = []
19
23
 
20
- attr_reader :report, :client, :options
24
+ attr_reader :report, :store, :client, :options
21
25
 
22
26
  def_delegators :@logger, :debug, :info, :warn, :error, :fatal
23
27
 
24
- # @param [String] output_dir the directory in which to dump JSON documents
25
- # @param [String] cache_dir the directory in which to cache HTTP responses
28
+ # @param [String] output_dir the directory or Redis address
29
+ # (e.g. `redis://localhost:6379`) in which to dump JSON documents
30
+ # @param [String] cache_dir the directory or Memcached address
31
+ # (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
26
32
  # @param [Integer] expires_in the cache's expiration time in seconds
33
+ # @param [Boolean] validate whether to validate JSON documents
27
34
  # @param [String] level the log level
28
35
  # @param [String,IO] logdev the log device
29
36
  # @param [Hash] options criteria for selecting the methods to run
30
- def initialize(output_dir, cache_dir: nil, expires_in: 86400, level: 'INFO', logdev: STDOUT, options: {})
31
- @output_dir = output_dir
32
- @options = options
33
- @level = level
34
- @logger = Logger.new('pupa', level: level, logdev: logdev)
35
- @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, level: level)
36
- @report = {}
37
+ def initialize(output_dir, cache_dir: nil, expires_in: 86400, validate: true, level: 'INFO', logdev: STDOUT, options: {})
38
+ @store = DocumentStore.new(output_dir)
39
+ @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, level: level)
40
+ @logger = Logger.new('pupa', level: level, logdev: logdev)
41
+ @validate = validate
42
+ @options = options
43
+ @report = {}
37
44
  end
38
45
 
39
46
  # Retrieves and parses a document with a GET request.
@@ -213,23 +220,22 @@ module Pupa
213
220
  # @raises [Pupa::Errors::DuplicateObjectIdError]
214
221
  def dump_scraped_object(object)
215
222
  type = object.class.to_s.demodulize.underscore
216
- basename = "#{type}_#{object._id.gsub(File::SEPARATOR, '_')}.json"
217
- path = File.join(@output_dir, basename)
223
+ name = "#{type}_#{object._id.gsub(File::SEPARATOR, '_')}.json"
218
224
 
219
- if File.exist?(path)
225
+ if @store.exist?(name)
220
226
  raise Errors::DuplicateObjectIdError, "duplicate object ID: #{object._id} (was the same objected yielded twice?)"
221
227
  end
222
228
 
223
- info {"save #{type} #{object.to_s} as #{basename}"}
229
+ info {"save #{type} #{object.to_s} as #{name}"}
224
230
 
225
- File.open(path, 'w') do |f|
226
- f.write(JSON.dump(object.to_h(include_foreign_objects: true)))
227
- end
231
+ @store.write(name, object.to_h(include_foreign_objects: true))
228
232
 
229
- begin
230
- object.validate!
231
- rescue JSON::Schema::ValidationError => e
232
- warn {e.message}
233
+ if @validate
234
+ begin
235
+ object.validate!
236
+ rescue JSON::Schema::ValidationError => e
237
+ warn {e.message}
238
+ end
233
239
  end
234
240
  end
235
241
 
@@ -238,8 +244,7 @@ module Pupa
238
244
  # @return [Hash] a hash of scraped objects keyed by ID
239
245
  def load_scraped_objects
240
246
  {}.tap do |objects|
241
- Dir[File.join(@output_dir, '*.json')].each do |path|
242
- data = JSON.load(File.read(path))
247
+ @store.read_multi(@store.entries).each do |data|
243
248
  object = data['_type'].camelize.constantize.new(data)
244
249
  objects[object._id] = object
245
250
  end
@@ -276,16 +281,15 @@ module Pupa
276
281
  # @param [Hash] objects a hash of scraped objects keyed by ID
277
282
  # @return [Hash] a mapping from an object ID to the ID of its duplicate
278
283
  def build_losers_to_winners_map(objects)
284
+ inverse = {}
285
+ objects.each do |id,object|
286
+ (inverse[object.to_h.except(:_id)] ||= []) << id
287
+ end
288
+
279
289
  {}.tap do |map|
280
- # We don't need to iterate on the last item in the hash, but skipping
281
- # the last item is more effort than running the last item.
282
- objects.each_with_index do |(id1,object1),index|
283
- unless map.key?(id1) # Don't search for duplicates of duplicates.
284
- objects.drop(index + 1).each do |id2,object2|
285
- if object1 == object2
286
- map[id2] = id1
287
- end
288
- end
290
+ inverse.values.each do |ids|
291
+ ids.drop(1).each do |id|
292
+ map[id] = ids[0]
289
293
  end
290
294
  end
291
295
  end
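The rewritten `build_losers_to_winners_map` above replaces a pairwise comparison with a single grouping pass: objects are bucketed by their content minus `_id`, and every ID after the first in a bucket is mapped to the first. The following is a minimal sketch of that idea, assuming plain hashes in place of Pupa objects; the method name `losers_to_winners` and the sample data are hypothetical, and `Hash#reject` stands in for ActiveSupport's `except`.

```ruby
# Hypothetical stand-ins for scraped objects: plain hashes keyed by ID.
SAMPLE_OBJECTS = {
  'a' => {_id: 'a', name: 'foo'},
  'b' => {_id: 'b', name: 'bar'},
  'c' => {_id: 'c', name: 'foo'},
  'd' => {_id: 'd', name: 'foo'},
}

# Maps each duplicate's ID to the ID of the first object with the same
# content. Runs in one pass over the objects instead of comparing pairs.
def losers_to_winners(objects)
  inverse = {}
  objects.each do |id, object|
    # Group IDs by the object's content, ignoring the ID itself.
    content = object.reject { |key, _| key == :_id }
    (inverse[content] ||= []) << id
  end

  {}.tap do |map|
    inverse.each_value do |ids|
      # The first ID in each group wins; the rest are duplicates of it.
      ids.drop(1).each { |id| map[id] = ids[0] }
    end
  end
end

p losers_to_winners(SAMPLE_OBJECTS)
```

Using the content hash as a key relies on Ruby comparing hashes by value, so two objects with equal properties land in the same bucket regardless of identity.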
@@ -18,7 +18,10 @@ module Pupa
18
18
  class Client
19
19
  # Returns a configured Faraday HTTP client.
20
20
  #
21
- # @param [String] cache_dir a directory in which to cache requests
21
+ # In order to automatically parse XML responses, you must `require 'multi_xml'`.
22
+ #
23
+ # @param [String] cache_dir a directory or a Memcached address
24
+ # (e.g. `memcached://localhost:11211`) in which to cache requests
22
25
  # @param [Integer] expires_in the cache's expiration time in seconds
23
26
  # @param [String] level the log level
24
27
  # @return [Faraday::Connection] a configured Faraday HTTP client
@@ -26,20 +29,30 @@ module Pupa
26
29
  Faraday.new do |connection|
27
30
  connection.request :url_encoded
28
31
  connection.use Middleware::Logger, Logger.new('faraday', level: level)
32
+
29
33
  # @see http://tools.ietf.org/html/rfc2854
30
34
  # @see http://tools.ietf.org/html/rfc3236
31
35
  connection.use Middleware::ParseHtml, content_type: %w(text/html application/xhtml+xml)
36
+
32
37
  # @see http://tools.ietf.org/html/rfc4627
33
38
  connection.use FaradayMiddleware::ParseJson, content_type: /\bjson$/
34
- # @see http://tools.ietf.org/html/rfc3023
39
+
35
40
  if defined?(MultiXml)
41
+ # @see http://tools.ietf.org/html/rfc3023
36
42
  connection.use FaradayMiddleware::ParseXml, content_type: /\bxml$/
37
43
  end
44
+
38
45
  if cache_dir
39
46
  connection.response :caching do
40
- ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
47
+ address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
48
+ if address
49
+ ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in)
50
+ else
51
+ ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
52
+ end
41
53
  end
42
54
  end
55
+
43
56
  connection.adapter Faraday.default_adapter # must be last
44
57
  end
45
58
  end
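The cache selection above hinges on `String#[]` with a regular expression and a capture index: it returns the captured address for a `memcached://` argument and `nil` for anything else, in which case the argument is treated as a directory. A small sketch of just that dispatch, using a hypothetical helper name that is not part of Pupa.rb:

```ruby
# Hypothetical helper for illustration; Pupa.rb inlines this expression.
# Returns the host (and optional port) of a Memcached address, or nil if
# the argument does not start with the memcached:// scheme.
def memcached_address(cache_dir)
  cache_dir[%r{\Amemcached://(.+)\z}, 1]
end

p memcached_address('memcached://localhost:11211')
p memcached_address('/tmp/web_cache')
```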
@@ -0,0 +1,21 @@
1
+ module Pupa
2
+ class Processor
3
+ # A JSON document store factory.
4
+ #
5
+ # Heavily inspired by `ActiveSupport::Cache::Store`.
6
+ class DocumentStore
7
+ # Returns a configured JSON document store.
8
+ #
9
+ # @param [String] argument the filesystem directory or Redis address
10
+ # (e.g. `redis://localhost:6379/0`) in which to dump JSON documents
11
+ # @return a configured JSON document store
12
+ def self.new(argument)
13
+ if argument[%r{\Aredis://}]
14
+ RedisStore.new(argument)
15
+ else
16
+ FileStore.new(argument)
17
+ end
18
+ end
19
+ end
20
+ end
21
+ end
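The factory above returns whichever store matches the argument, so the processor only depends on a small interface: `exist?`, `entries`, `read`, `read_multi`, `write`, `delete` and `clear`. As a sketch of that interface, here is a hypothetical in-memory store (not part of this release) that could stand in for the file and Redis stores in tests. It serializes values to JSON on write, on the assumption that callers expect string keys back as they would from the real stores.

```ruby
require 'json'

# Hypothetical in-memory document store, for illustration only.
class MemoryStore
  def initialize
    @data = {}
  end

  # Returns whether the store contains an entry for the given key.
  def exist?(name)
    @data.key?(name)
  end

  # Returns all keys in the store.
  def entries
    @data.keys
  end

  # Returns the parsed JSON value of the given key.
  def read(name)
    JSON.parse(@data.fetch(name))
  end

  # Returns the parsed JSON values of the given keys, in order.
  def read_multi(names)
    names.map { |name| read(name) }
  end

  # Serializes the value as JSON and stores it under the given key.
  def write(name, value)
    @data[name] = JSON.generate(value)
  end

  # Deletes the given key.
  def delete(name)
    @data.delete(name)
  end

  # Deletes all keys.
  def clear
    @data.clear
  end
end

store = MemoryStore.new
store.write('foo.json', {name: 'foo'})
p store.read('foo.json')
```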
@@ -0,0 +1,83 @@
1
+ module Pupa
2
+ class Processor
3
+ class DocumentStore
4
+ # Stores JSON documents on disk.
5
+ #
6
+ # @see ActiveSupport::Cache::FileStore
7
+ class FileStore
8
+ # @param [String] output_dir the directory in which to dump JSON documents
9
+ def initialize(output_dir)
10
+ @output_dir = output_dir
11
+ FileUtils.mkdir_p(@output_dir)
12
+ end
13
+
14
+ # Returns whether a file with the given name exists.
15
+ #
16
+ # @param [String] name a key
17
+ # @return [Boolean] whether the store contains an entry for the given key
18
+ def exist?(name)
19
+ File.exist?(namespaced_key(name))
20
+ end
21
+
22
+ # Returns all file names in the storage directory.
23
+ #
24
+ # @return [Array<String>] all keys in the store
25
+ def entries
26
+ Dir.chdir(@output_dir) do
27
+ Dir['*.json']
28
+ end
29
+ end
30
+
31
+ # Returns, as JSON, the contents of the file with the given name.
32
+ #
33
+ # @param [String] name a key
34
+ # @return [Hash] the value of the given key
35
+ def read(name)
36
+ File.open(namespaced_key(name)) do |f|
37
+ JSON.load(f)
38
+ end
39
+ end
40
+
41
+ # Returns, as JSON, the contents of the files with the given names.
42
+ #
43
+ # @param [Array<String>] names keys
44
+ # @return [Array<Hash>] the values of the given keys
45
+ def read_multi(names)
46
+ names.map do |name|
47
+ read(name)
48
+ end
49
+ end
50
+
51
+ # Writes, as JSON, the value to a file with the given name.
52
+ #
53
+ # @param [String] name a key
54
+ # @param [Hash] value a value
55
+ def write(name, value)
56
+ File.open(namespaced_key(name), 'w') do |f|
57
+ JSON.dump(value, f)
58
+ end
59
+ end
60
+
61
+ # Delete a file with the given name.
62
+ #
63
+ # @param [String] name a key
64
+ def delete(name)
65
+ File.delete(namespaced_key(name))
66
+ end
67
+
68
+ # Deletes all files in the storage directory.
69
+ def clear
70
+ Dir[File.join(@output_dir, '*.json')].each do |path|
71
+ File.delete(path)
72
+ end
73
+ end
74
+
75
+ private
76
+
77
+ def namespaced_key(name)
78
+ File.join(@output_dir, name)
79
+ end
80
+ end
81
+ end
82
+ end
83
+ end
@@ -0,0 +1,77 @@
1
+ module Pupa
2
+ class Processor
3
+ class DocumentStore
4
+ # Stores JSON documents in Redis.
5
+ #
6
+ # Pupa flushes the JSON document store before scraping. If you use Redis,
7
+ # **DO NOT** share a Redis database with Pupa and other applications. You
8
+ # can select a different database than the default `0` for use with Pupa
9
+ # by passing an argument like `redis://localhost:6379/1`.
10
+ #
11
+ # @note Redis support depends on the `redis` gem. For better performance,
12
+ # use the `hiredis` gem as well.
13
+ class RedisStore
14
+ # @param [String] address the address (e.g. `redis://localhost:6379/0`)
15
+ # in which to dump JSON documents
16
+ def initialize(address)
17
+ options = {}
18
+ if defined?(Hiredis)
19
+ options.update(driver: :hiredis)
20
+ end
21
+ @redis = Redis::Store::Factory.create(address, options)
22
+ end
23
+
24
+ # Returns whether the database contains an entry for the given key.
25
+ #
26
+ # @param [String] name a key
27
+ # @return [Boolean] whether the store contains an entry for the given key
28
+ def exist?(name)
29
+ @redis.exists(name)
30
+ end
31
+
32
+ # Returns all keys in the database.
33
+ #
34
+ # @return [Array<String>] all keys in the store
35
+ def entries
36
+ @redis.keys('*')
37
+ end
38
+
39
+ # Returns, as JSON, the value of the given key.
40
+ #
41
+ # @param [String] name a key
42
+ # @return [Hash] the value of the given key
43
+ def read(name)
44
+ JSON.load(@redis.get(name))
45
+ end
46
+
47
+ # Returns, as JSON, the values of the given keys.
48
+ #
49
+ # @param [Array<String>] names keys
50
+ # @return [Array<Hash>] the values of the given keys
51
+ def read_multi(names)
52
+ @redis.mget(*names).map{|value| JSON.load(value)}
53
+ end
54
+
55
+ # Writes, as JSON, the value to a key.
56
+ #
57
+ # @param [String] name a key
58
+ # @param [Hash] value a value
59
+ def write(name, value)
60
+ @redis.set(name, JSON.dump(value))
61
+ end
62
+
63
+ # Delete a key.
64
+ #
65
+ # @param [String] name a key
66
+ def delete(name)
67
+ @redis.del(name)
68
+ end
69
+
70
+ # Deletes all keys in the database.
71
+ def clear
72
+ @redis.flushdb
73
+ end
74
+ end
75
+ end
76
+ end
77
+ end
@@ -1,6 +1,4 @@
- # A refinement for the Faraday caching middleware to cache all requests, not
- # only GET requests. Using Ruby's refinements doesn't seem to work, possibly
- # because Faraday caches middlewares.
+ # Caches all requests, not only GET requests.
  class FaradayMiddleware::Caching
  def call(env)
  # Remove if-statement to cache any request, not only GET.
@@ -1,9 +1,9 @@
1
1
  module Pupa
2
- class Refinements
2
+ module Refinements
3
3
  # A refinement for JSON Schema to validate "email" and "uri" formats. Using
4
4
  # Ruby's refinements doesn't seem to work, possibly because `refine` can't
5
5
  # be used with `prepend`.
6
- module Format
6
+ module FormatAttribute
7
7
  # @see http://my.rails-royce.org/2010/07/21/email-validation-in-ruby-on-rails-without-regexp/
8
8
  def validate(current_schema, data, fragments, processor, validator, options = {})
9
9
  case current_schema.schema['format']
@@ -33,6 +33,6 @@ end
33
33
 
34
34
  class JSON::Schema::FormatAttribute
35
35
  class << self
36
- prepend Pupa::Refinements::Format
36
+ prepend Pupa::Refinements::FormatAttribute
37
37
  end
38
38
  end
data/lib/pupa/runner.rb CHANGED
@@ -1,4 +1,3 @@
- require 'fileutils'
  require 'optparse'
  require 'ostruct'
  
@@ -19,6 +18,7 @@ module Pupa
19
18
  output_dir: File.expand_path('scraped_data', Dir.pwd),
20
19
  cache_dir: File.expand_path('web_cache', Dir.pwd),
21
20
  expires_in: 86400, # 1 day
21
+ validate: true,
22
22
  host_with_port: 'localhost:27017',
23
23
  database: 'pupa',
24
24
  dry_run: false,
@@ -72,15 +72,18 @@ module Pupa
72
72
  opts.on('-t', '--task TASK', @processor_class.tasks, 'Select a scraping task to run (you may give this switch multiple times)', " (#{@processor_class.tasks.join(', ')})") do |v|
73
73
  options.tasks << v
74
74
  end
75
- opts.on('-o', '--output_dir PATH', 'The directory in which to dump JSON documents') do |v|
75
+ opts.on('-o', '--output_dir PATH', 'The directory or Redis address (e.g. redis://localhost:6379) in which to dump JSON documents') do |v|
76
76
  options.output_dir = v
77
77
  end
78
- opts.on('-c', '--cache_dir PATH', 'The directory in which to cache HTTP requests') do |v|
78
+ opts.on('-c', '--cache_dir PATH', 'The directory or Memcached address (e.g. memcached://localhost:11211) in which to cache HTTP requests') do |v|
79
79
  options.cache_dir = v
80
80
  end
81
81
  opts.on('-e', '--expires_in SECONDS', "The cache's expiration time in seconds") do |v|
82
82
  options.expires_in = v
83
83
  end
84
+ opts.on('--[no-]validate', 'Validate JSON documents') do |v|
85
+ options.validate = v
86
+ end
84
87
  opts.on('-H', '--host HOST:PORT', 'The host and port to MongoDB') do |v|
85
88
  options.host_with_port = v
86
89
  end
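The `--[no-]validate` switch added above uses `OptionParser`'s negatable form: `--validate` yields `true`, `--no-validate` yields `false`, and omitting the switch leaves the default in place. A self-contained sketch of that behaviour, using a plain hash for the options and a hypothetical `parse_switches` method rather than the runner's own code:

```ruby
require 'optparse'

# Hypothetical wrapper for illustration; the runner stores options in an
# OpenStruct instead of a hash.
def parse_switches(args)
  options = {validate: true} # validation is on unless switched off

  OptionParser.new do |opts|
    opts.on('--[no-]validate', 'Validate JSON documents') do |v|
      options[:validate] = v
    end
  end.parse!(args.dup) # parse a copy so the caller's array is untouched

  options
end

p parse_switches(['--no-validate'])
p parse_switches([])
```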
@@ -137,7 +140,12 @@ module Pupa
137
140
  options.tasks = @processor_class.tasks
138
141
  end
139
142
 
140
- processor = @processor_class.new(options.output_dir, cache_dir: options.cache_dir, expires_in: options.expires_in, level: options.level, options: Hash[*rest])
143
+ processor = @processor_class.new(options.output_dir,
144
+ cache_dir: options.cache_dir,
145
+ expires_in: options.expires_in,
146
+ validate: options.validate,
147
+ level: options.level,
148
+ options: Hash[*rest])
141
149
 
142
150
  options.actions.each do |action|
143
151
  unless action == 'scrape' || processor.respond_to?(action)
@@ -174,13 +182,7 @@ module Pupa
174
182
  Pupa.session = Moped::Session.new([options.host_with_port], database: options.database)
175
183
 
176
184
  if options.actions.delete('scrape')
177
- FileUtils.mkdir_p(options.output_dir)
178
- FileUtils.mkdir_p(options.cache_dir)
179
-
180
- Dir[File.join(options.output_dir, '*.json')].each do |path|
181
- FileUtils.rm(path)
182
- end
183
-
185
+ processor.store.clear
184
186
  report[:scrape] = {}
185
187
  options.tasks.each do |task_name|
186
188
  report[:scrape][task_name] = processor.dump_scraped_objects(task_name)
data/lib/pupa/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Pupa
- VERSION = "0.0.7"
+ VERSION = "0.0.8"
  end
data/pupa.gemspec CHANGED
@@ -25,10 +25,12 @@ Gem::Specification.new do |s|
25
25
  s.add_runtime_dependency('nokogiri', '~> 1.6.0')
26
26
 
27
27
  s.add_development_dependency('coveralls')
28
+ s.add_development_dependency('dalli')
28
29
  s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
30
+ s.add_development_dependency('multi_xml')
29
31
  s.add_development_dependency('octokit') # to update Popolo schema
30
32
  s.add_development_dependency('rake')
33
+ s.add_development_dependency('redis-store')
31
34
  s.add_development_dependency('rspec', '~> 2.10')
32
35
  s.add_development_dependency('vcr', '~> 2.5.0')
33
- s.add_development_dependency('multi_xml')
34
36
  end
@@ -0,0 +1,62 @@
1
+ ---
2
+ http_interactions:
3
+ - request:
4
+ method: get
5
+ uri: http://example.com/
6
+ body:
7
+ encoding: US-ASCII
8
+ string: ''
9
+ headers:
10
+ User-Agent:
11
+ - Faraday v0.8.8
12
+ response:
13
+ status:
14
+ code: 200
15
+ message:
16
+ headers:
17
+ accept-ranges:
18
+ - bytes
19
+ cache-control:
20
+ - max-age=604800
21
+ content-type:
22
+ - text/html
23
+ date:
24
+ - Fri, 27 Sep 2013 00:31:23 GMT
25
+ etag:
26
+ - '"3012602696"'
27
+ expires:
28
+ - Fri, 04 Oct 2013 00:31:23 GMT
29
+ last-modified:
30
+ - Fri, 09 Aug 2013 23:54:35 GMT
31
+ server:
32
+ - ECS (mdw/13C6)
33
+ x-cache:
34
+ - HIT
35
+ x-ec-custom-error:
36
+ - '1'
37
+ content-length:
38
+ - '1270'
39
+ connection:
40
+ - close
41
+ body:
42
+ encoding: UTF-8
43
+ string: "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n
44
+ \ <meta charset=\"utf-8\" />\n <meta http-equiv=\"Content-type\" content=\"text/html;
45
+ charset=utf-8\" />\n <meta name=\"viewport\" content=\"width=device-width,
46
+ initial-scale=1\" />\n <style type=\"text/css\">\n body {\n background-color:
47
+ #f0f0f2;\n margin: 0;\n padding: 0;\n font-family: \"Open
48
+ Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n \n }\n
49
+ \ div {\n width: 600px;\n margin: 5em auto;\n padding:
50
+ 50px;\n background-color: #fff;\n border-radius: 1em;\n }\n
51
+ \ a:link, a:visited {\n color: #38488f;\n text-decoration:
52
+ none;\n }\n @media (max-width: 700px) {\n body {\n background-color:
53
+ #fff;\n }\n div {\n width: auto;\n margin:
54
+ 0 auto;\n border-radius: 0;\n padding: 1em;\n }\n
55
+ \ }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n
56
+ \ <p>This domain is established to be used for illustrative examples in
57
+ documents. You may use this\n domain in examples without prior coordination
58
+ or asking for permission.</p>\n <p><a href=\"http://www.iana.org/domains/example\">More
59
+ information...</a></p>\n</div>\n</body>\n</html>\n"
60
+ http_version:
61
+ recorded_at: Fri, 27 Sep 2013 00:31:23 GMT
62
+ recorded_with: VCR 2.5.0
@@ -0,0 +1 @@
+ {"name":"bar"}
@@ -0,0 +1 @@
+ {"name":"baz"}
@@ -0,0 +1 @@
+ {"name":"foo"}
@@ -18,9 +18,15 @@ describe Pupa::Base do
18
18
  },
19
19
  },
20
20
  }
21
- attr_accessor :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links
21
+
22
+ attr_accessor :label, :founding_date, :inactive, :label_id, :manager_id, :links
23
+ attr_reader :name
22
24
  foreign_key :label_id, :manager_id
23
25
  foreign_object :label
26
+
27
+ def name=(name)
28
+ @name = name
29
+ end
24
30
  end
25
31
  end
26
32
 
@@ -32,25 +38,33 @@ describe Pupa::Base do
32
38
  Music::Band.new(properties)
33
39
  end
34
40
 
35
- describe '#attr_accessor' do
41
+ describe '.attr_accessor' do
42
+ it 'should add properties' do
43
+ [:_id, :_type, :extras, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
44
+ Music::Band.properties.to_a.should include(property)
45
+ end
46
+ end
47
+ end
48
+
49
+ describe '.attr_reader' do
36
50
  it 'should add properties' do
37
- Music::Band.properties.to_a.should == [:_id, :_type, :extras, :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links]
51
+ Music::Band.properties.to_a.should include(:name)
38
52
  end
39
53
  end
40
54
 
41
- describe '#foreign_key' do
55
+ describe '.foreign_key' do
42
56
  it 'should add foreign keys' do
43
57
  Music::Band.foreign_keys.to_a.should == [:label_id, :manager_id]
44
58
  end
45
59
  end
46
60
 
47
- describe '#foreign_object' do
61
+ describe '.foreign_object' do
48
62
  it 'should add foreign objects' do
49
63
  Music::Band.foreign_objects.to_a.should == [:label]
50
64
  end
51
65
  end
52
66
 
53
- describe '#schema=' do
67
+ describe '.schema=' do
54
68
  let :klass_with_absolute_path do
55
69
  Class.new(Pupa::Base) do
56
70
  self.schema = '/path/to/schema.json'
@@ -82,11 +96,13 @@ describe Pupa::Base do
82
96
  end
83
97
 
84
98
  it 'should accept an absolute path' do
85
- klass_with_absolute_path.json_schema.should == '/path/to/schema.json'
99
+ File.should_receive(:read).and_return('{}')
100
+ klass_with_absolute_path.json_schema.should == '{}'
86
101
  end
87
102
 
88
103
  it 'should accept a relative path' do
89
- klass_with_relative_path.json_schema.should == File.expand_path(File.join('..', '..', 'schemas', 'schema.json'), __dir__)
104
+ File.should_receive(:read).and_return('{}')
105
+ klass_with_relative_path.json_schema.should == '{}'
90
106
  end
91
107
  end
92
108
 
@@ -1,4 +1,15 @@
1
1
  require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
2
2
 
3
3
  describe Pupa::Processor::Client do
4
+ describe '.new' do
5
+ it 'should use the filesystem' do
6
+ ActiveSupport::Cache::FileStore.should_receive(:new).and_call_original
7
+ Pupa::Processor::Client.new(cache_dir: '/tmp', level: 'UNKNOWN').get('http://example.com/')
8
+ end
9
+
10
+ it 'should use Memcached' do
11
+ ActiveSupport::Cache::MemCacheStore.should_receive(:new).and_call_original
12
+ Pupa::Processor::Client.new(cache_dir: 'memcached://localhost', level: 'UNKNOWN').get('http://example.com/')
13
+ end
14
+ end
4
15
  end
@@ -0,0 +1,65 @@
1
+ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
2
+
3
+ describe Pupa::Processor::DocumentStore::FileStore do
4
+ let :store do
5
+ Pupa::Processor::DocumentStore::FileStore.new(File.expand_path(File.join('..', '..', 'fixtures'), __dir__))
6
+ end
7
+
8
+ describe '#exist?' do
9
+ it 'should return true if the store contains an entry for the given key' do
10
+ store.exist?('foo.json').should == true
11
+ end
12
+
13
+ it 'should return false if the store does not contain an entry for the given key' do
14
+ store.exist?('nonexistent').should == false
15
+ end
16
+ end
17
+
18
+ describe '#entries' do
19
+ it 'should return all keys in the store' do
20
+ store.entries.sort.should == %w(bar.json baz.json foo.json)
21
+ end
22
+ end
23
+
24
+ describe '#read' do
25
+ it 'should return the value of the given key' do
26
+ store.read('foo.json').should == {'name' => 'foo'}
27
+ end
28
+ end
29
+
30
+ describe '#read_multi' do
31
+ it 'should return the values of the given keys' do
32
+ store.read_multi(%w(foo.json bar.json)).should == [{'name' => 'foo'}, {'name' => 'bar'}]
33
+ end
34
+ end
35
+
36
+ describe '#write' do
37
+ it 'should write an entry with the given value for the given key' do
38
+ store.exist?('new.json').should == false
39
+ store.write('new.json', {'name' => 'new'})
40
+ store.read('new.json').should == {'name' => 'new'}
41
+ store.delete('new.json') # cleanup
42
+ end
43
+ end
44
+
45
+ describe '#delete' do
46
+ it 'should delete an entry with the given key from the store' do
47
+ store.write('new.json', {'name' => 'new'})
48
+ store.exist?('new.json').should == true
49
+ store.delete('new.json')
50
+ store.exist?('new.json').should == false
51
+ end
52
+ end
53
+
54
+ describe '#clear' do
55
+ it 'should delete all entries from the store' do
56
+ store.entries.sort.should == %w(bar.json baz.json foo.json)
57
+ store.clear
58
+ store.entries.should == []
59
+
60
+ %w(bar baz foo).each do |name| # cleanup
61
+ store.write("#{name}.json", {'name' => name})
62
+ end
63
+ end
64
+ end
65
+ end
@@ -0,0 +1,71 @@
1
+ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
2
+
3
+ describe Pupa::Processor::DocumentStore::RedisStore do
4
+ def store
5
+ Pupa::Processor::DocumentStore::RedisStore.new('redis://localhost/15')
6
+ end
7
+
8
+ before :all do
9
+ %w(foo bar baz).each do |name|
10
+ store.write("#{name}.json", {'name' => name})
11
+ end
12
+ end
13
+
14
+ describe '#exist?' do
15
+ it 'should return true if the store contains an entry for the given key' do
16
+ store.exist?('foo.json').should == true
17
+ end
18
+
19
+ it 'should return false if the store does not contain an entry for the given key' do
20
+ store.exist?('nonexistent').should == false
21
+ end
22
+ end
23
+
24
+ describe '#entries' do
25
+ it 'should return all keys in the store' do
26
+ store.entries.sort.should == %w(bar.json baz.json foo.json)
27
+ end
28
+ end
29
+
30
+ describe '#read' do
31
+ it 'should return the value of the given key' do
32
+ store.read('foo.json').should == {'name' => 'foo'}
33
+ end
34
+ end
35
+
36
+ describe '#read_multi' do
37
+ it 'should return the values of the given keys' do
38
+ store.read_multi(%w(foo.json bar.json)).should == [{'name' => 'foo'}, {'name' => 'bar'}]
39
+ end
40
+ end
41
+
42
+ describe '#write' do
43
+ it 'should write an entry with the given value for the given key' do
44
+ store.exist?('new.json').should == false
45
+ store.write('new.json', {'name' => 'new'})
46
+ store.read('new.json').should == {'name' => 'new'}
47
+ store.delete('new.json') # cleanup
48
+ end
49
+ end
50
+
51
+ describe '#delete' do
52
+ it 'should delete an entry with the given key from the store' do
53
+ store.write('new.json', {'name' => 'new'})
54
+ store.exist?('new.json').should == true
55
+ store.delete('new.json')
56
+ store.exist?('new.json').should == false
57
+ end
58
+ end
59
+
60
+ describe '#clear' do
61
+ it 'should delete all entries from the store' do
62
+ store.entries.sort.should == %w(bar.json baz.json foo.json)
63
+ store.clear
64
+ store.entries.should == []
65
+
66
+ %w(bar baz foo).each do |name| # cleanup
67
+ store.write("#{name}.json", {'name' => name})
68
+ end
69
+ end
70
+ end
71
+ end
@@ -0,0 +1,15 @@
+ require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+ describe Pupa::Processor::DocumentStore do
+ describe '.new' do
+ it 'should use the filesystem' do
+ Pupa::Processor::DocumentStore::FileStore.should_receive(:new).with('/tmp').and_call_original
+ Pupa::Processor::DocumentStore.new('/tmp')
+ end
+
+ it 'should use Redis' do
+ Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost').and_call_original
+ Pupa::Processor::DocumentStore.new('redis://localhost')
+ end
+ end
+ end
@@ -11,7 +11,7 @@ describe Pupa::Processor::Persistence do
  Pupa.session[:people].insert(_type: 'pupa/person', name: 'non-unique')
  end
 
- describe '#find' do
+ describe '.find' do
  it 'should return nil if no matches' do
  Pupa::Processor::Persistence.find(_type: 'pupa/person', name: 'nonexistent').should == nil
  end
@@ -31,6 +31,10 @@ describe Pupa::Processor do
  PersonProcessor.new('/tmp', level: 'WARN', logdev: io)
  end
 
+ let :novalidate do
+ PersonProcessor.new('/tmp', level: 'WARN', logdev: io, validate: false)
+ end
+
  describe '#get' do
  it 'should send a GET request' do
  processor.get('http://httpbin.org/get', 'foo=bar')['args'].should == {'foo' => 'bar'}
@@ -51,7 +55,7 @@ describe Pupa::Processor do
  end
  end
 
- describe '#add_scraping_task' do
+ describe '.add_scraping_task' do
  it 'should add a scraping task and define a lazy method' do
  PersonProcessor.tasks.should == [:people]
  processor.should respond_to(:people)
@@ -64,9 +68,9 @@ describe Pupa::Processor do
  end
 
  it 'should not overwrite an existing file' do
- FileUtils.touch(path)
+ File.open(path, 'w') {}
  expect{processor.dump_scraped_objects(:people)}.to raise_error(Pupa::Errors::DuplicateObjectIdError)
- FileUtils.rm(path)
+ File.delete(path)
  end
 
  it 'should dump a JSON document' do
@@ -80,6 +84,12 @@ describe Pupa::Processor do
  processor.dump_scraped_objects(:people)
  io.string.should match('http://popoloproject.com/schemas/person.json')
  end
+
+ it 'should not validate the object' do
+ novalidate.make_person_invalid
+ novalidate.dump_scraped_objects(:people)
+ io.string.should_not match('http://popoloproject.com/schemas/person.json')
+ end
  end
 
  describe '#import' do
data/spec/spec_helper.rb CHANGED
@@ -3,6 +3,8 @@ require 'rubygems'
  require 'coveralls'
  Coveralls.wear!
 
+ require 'multi_xml'
+ require 'redis-store'
  require 'rspec'
  require 'vcr'
  require File.dirname(__FILE__) + '/../lib/pupa'
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pupa
  version: !ruby/object:Gem::Version
- version: 0.0.7
+ version: 0.0.8
  platform: ruby
  authors:
  - Open North
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-09-26 00:00:00.000000000 Z
+ date: 2013-09-27 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -122,6 +122,20 @@ dependencies:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: dalli
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: json
  requirement: !ruby/object:Gem::Requirement
@@ -136,6 +150,20 @@ dependencies:
  - - ~>
  - !ruby/object:Gem::Version
  version: 1.7.7
+ - !ruby/object:Gem::Dependency
+ name: multi_xml
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: octokit
  requirement: !ruby/object:Gem::Requirement
@@ -165,47 +193,47 @@ dependencies:
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
- name: rspec
+ name: redis-store
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ~>
+ - - '>='
  - !ruby/object:Gem::Version
- version: '2.10'
+ version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ~>
+ - - '>='
  - !ruby/object:Gem::Version
- version: '2.10'
+ version: '0'
  - !ruby/object:Gem::Dependency
- name: vcr
+ name: rspec
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - ~>
  - !ruby/object:Gem::Version
- version: 2.5.0
+ version: '2.10'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - ~>
  - !ruby/object:Gem::Version
- version: 2.5.0
+ version: '2.10'
  - !ruby/object:Gem::Dependency
- name: multi_xml
+ name: vcr
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ~>
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.5.0
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ~>
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.5.0
  description:
  email:
  - info@opennorth.ca
@@ -240,6 +268,9 @@ files:
  - lib/pupa/processor.rb
  - lib/pupa/processor/client.rb
  - lib/pupa/processor/dependency_graph.rb
+ - lib/pupa/processor/document_store.rb
+ - lib/pupa/processor/document_store/file_store.rb
+ - lib/pupa/processor/document_store/redis_store.rb
  - lib/pupa/processor/helper.rb
  - lib/pupa/processor/middleware/logger.rb
  - lib/pupa/processor/middleware/parse_html.rb
@@ -264,6 +295,10 @@ files:
  - spec/cassettes/ce69ff734ce852d2bfaa482bbf55d7ffb4762e87.yml
  - spec/cassettes/da629b01e0836deda8a5540a4e6a08783dd7aef9.yml
  - spec/cassettes/e398f35bea86b3d4c87a6934bae1eb7fca8744f9.yml
+ - spec/cassettes/f861172f1df3bdb2052af5451f9922699d574b77.yml
+ - spec/fixtures/bar.json
+ - spec/fixtures/baz.json
+ - spec/fixtures/foo.json
  - spec/logger_spec.rb
  - spec/models/base_spec.rb
  - spec/models/concerns/contactable_spec.rb
@@ -280,6 +315,9 @@ files:
  - spec/models/post_spec.rb
  - spec/processor/client_spec.rb
  - spec/processor/dependency_graph_spec.rb
+ - spec/processor/document_store/file_store_spec.rb
+ - spec/processor/document_store/redis_store_spec.rb
+ - spec/processor/document_store_spec.rb
  - spec/processor/helper_spec.rb
  - spec/processor/middleware/logger_spec.rb
  - spec/processor/middleware/parse_html_spec.rb
@@ -319,6 +357,10 @@ test_files:
  - spec/cassettes/ce69ff734ce852d2bfaa482bbf55d7ffb4762e87.yml
  - spec/cassettes/da629b01e0836deda8a5540a4e6a08783dd7aef9.yml
  - spec/cassettes/e398f35bea86b3d4c87a6934bae1eb7fca8744f9.yml
+ - spec/cassettes/f861172f1df3bdb2052af5451f9922699d574b77.yml
+ - spec/fixtures/bar.json
+ - spec/fixtures/baz.json
+ - spec/fixtures/foo.json
  - spec/logger_spec.rb
  - spec/models/base_spec.rb
  - spec/models/concerns/contactable_spec.rb
@@ -335,6 +377,9 @@ test_files:
  - spec/models/post_spec.rb
  - spec/processor/client_spec.rb
  - spec/processor/dependency_graph_spec.rb
+ - spec/processor/document_store/file_store_spec.rb
+ - spec/processor/document_store/redis_store_spec.rb
+ - spec/processor/document_store_spec.rb
  - spec/processor/helper_spec.rb
  - spec/processor/middleware/logger_spec.rb
  - spec/processor/middleware/parse_html_spec.rb