pupa 0.0.8 → 0.0.9

Files changed (39)
  1. checksums.yaml +4 -4
  2. data/README.md +93 -9
  3. data/lib/pupa/models/concerns/contactable.rb +1 -0
  4. data/lib/pupa/models/concerns/identifiable.rb +1 -0
  5. data/lib/pupa/models/concerns/linkable.rb +1 -0
  6. data/lib/pupa/models/concerns/nameable.rb +1 -0
  7. data/lib/pupa/models/concerns/sourceable.rb +1 -0
  8. data/lib/pupa/models/concerns/timestamps.rb +1 -0
  9. data/lib/pupa/models/membership.rb +5 -1
  10. data/lib/pupa/models/{base.rb → model.rb} +35 -45
  11. data/lib/pupa/models/organization.rb +5 -1
  12. data/lib/pupa/models/person.rb +6 -1
  13. data/lib/pupa/models/post.rb +4 -1
  14. data/lib/pupa/processor/client.rb +18 -7
  15. data/lib/pupa/processor/document_store/file_store.rb +28 -2
  16. data/lib/pupa/processor/document_store/redis_store.rb +43 -10
  17. data/lib/pupa/processor/document_store.rb +5 -2
  18. data/lib/pupa/processor/middleware/parse_html.rb +2 -2
  19. data/lib/pupa/processor/middleware/parse_json.rb +16 -0
  20. data/lib/pupa/processor/middleware/raise_error.rb +33 -0
  21. data/lib/pupa/processor/persistence.rb +4 -4
  22. data/lib/pupa/processor.rb +21 -15
  23. data/lib/pupa/runner.rb +7 -2
  24. data/lib/pupa/version.rb +1 -1
  25. data/lib/pupa.rb +3 -1
  26. data/pupa.gemspec +2 -1
  27. data/spec/models/base_spec.rb +19 -23
  28. data/spec/models/concerns/contactable_spec.rb +2 -1
  29. data/spec/models/concerns/identifiable_spec.rb +2 -1
  30. data/spec/models/concerns/linkable_spec.rb +2 -1
  31. data/spec/models/concerns/nameable_spec.rb +2 -1
  32. data/spec/models/concerns/sourceable_spec.rb +2 -1
  33. data/spec/models/concerns/timestamps_spec.rb +2 -1
  34. data/spec/processor/document_store/file_store_spec.rb +32 -0
  35. data/spec/processor/document_store/redis_store_spec.rb +33 -0
  36. data/spec/processor/document_store_spec.rb +1 -1
  37. data/spec/processor_spec.rb +1 -1
  38. data/spec/spec_helper.rb +1 -0
  39. metadata +33 -17
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: d4ec7210671485a2de58673a70088e415a9767b7
4
- data.tar.gz: 8b2a77e3fe3c5775509fef59fb847ea2057f63cb
3
+ metadata.gz: 8124ac65b9df49b337205ce22fe6491ead5a03ec
4
+ data.tar.gz: e28195cea41f576dea3fa58def6ccfa403d2cf8c
5
5
  SHA512:
6
- metadata.gz: b3cdcf2da535ebd8d840fe2a1f6e6dd0db68de4d92b728f6f33bf7ba80f7499331f6cc9ace81da93b7e1062e7f3b8389fbab1891d81fbc44d1c48cbbc3be8eea
7
- data.tar.gz: 641956572610184f0437f0869e2ee9da3c624e493284ac3d4e0e88157395c4b8d41840ac4f1d351b63032aff1fde091862a2b83401c5ee55c1d83ac52a856a70
6
+ metadata.gz: acce9361e6ec70f4daf26bbc205724b25dddb7da5437f4fcbca6b7c5fb2df2a56f22e2f6c1cdcd63fac49c5f61c9ed052336223021c40dab522de0eecdaae562
7
+ data.tar.gz: 606a7695c7a0d722c43e6a574397dfd1870452f972d2d2a7147e67424e750645fdf2c9b5c18edb67dce05d03df1cfa3bc5535c8cda07ae9063a8795ddf4708cb
data/README.md CHANGED
@@ -45,36 +45,120 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
45
45
 
46
46
  1. You may want more control over the method used to perform a scraping task. For example, a legislature may publish legislators before 1997 in one format and legislators after 1997 in another format. In this case, you may want to select the method used to scrape legislators according to the year. See [legislator.rb](http://opennorth.github.io/pupa-ruby/docs/legislator.html).
47
47
 
48
+ ### Automatic response parsing
49
+
50
+ JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
51
+
48
52
  ## Performance
49
53
 
50
54
  Pupa.rb offers several ways to significantly improve performance.
51
55
 
52
- In an example case, reducing file I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
56
+ In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
53
57
 
54
- The `import` action's performance (when using a dependency graph) is currently limited by MongoDB.
58
+ The `import` action's performance is currently limited by MongoDB when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.
55
59
 
56
- ### Caching HTTP requests
60
+ ### Reducing HTTP requests
57
61
 
58
62
  HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:
59
63
 
60
- ruby cat.rb --cache_dir my_cache_dir
64
+ ruby cat.rb --cache_dir /tmp/my_cache_dir
65
+
66
+ ### Parallelizing HTTP requests
67
+
68
+ To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:
69
+
70
+ ```ruby
71
+ require 'pupa'
72
+ require 'typhoeus'
73
+ require 'typhoeus/adapters/faraday'
74
+ ```
75
+
76
+ Then, in your scraping methods, write code like:
77
+
78
+ ```ruby
79
+ responses = []
80
+
81
+ # Change the maximum number of concurrent requests (default 200). You usually
82
+ # need to tweak this number by trial and error.
83
+ # @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
84
+ manager = Typhoeus::Hydra.new(max_concurrency: 20)
85
+
86
+ begin
87
+ # Send HTTP requests in parallel.
88
+ client.in_parallel(manager) do
89
+ responses << client.get('http://example.com/foo')
90
+ responses << client.get('http://example.com/bar')
91
+ # More requests...
92
+ end
93
+ rescue Faraday::Error::ClientError => e
94
+ # Log an error message if, for example, you exceed a server's maximum number
95
+ # of concurrent connections or if you exceed an API's rate limit.
96
+ error(e.response.inspect)
97
+ end
98
+
99
+ # Responses are now available for use.
100
+ responses.each do |response|
101
+ # Only process the finished responses.
102
+ if response.success?
103
+ # If success...
104
+ elsif response.finished?
105
+ # If error...
106
+ end
107
+ end
108
+ ```
109
+
110
+ ### Reducing disk I/O
111
+
112
+ After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.
113
+
114
+ #### RAM file systems
115
+
116
+ A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:
61
117
 
62
- ### Reducing file I/O
118
+ ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
119
+ diskutil erasevolume HFS+ 'ramdisk' $ramdisk
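The `128 * 2048` in the `hdiutil` command above is a size expressed in 512-byte sectors (2048 sectors per megabyte). A minimal sketch of that arithmetic, where `ram_disk_sectors` and `SECTOR_BYTES` are hypothetical names used only for illustration:

```ruby
# hdiutil's ram:// argument is a sector count; sectors are assumed to be 512 bytes.
SECTOR_BYTES = 512

# Converts a size in megabytes to a number of sectors.
def ram_disk_sectors(megabytes)
  megabytes * 1024 * 1024 / SECTOR_BYTES
end

sectors_for_128mb = ram_disk_sectors(128)
```

For a 128MB disk this gives the same value as `128 * 2048`.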
63
120
 
64
- After HTTP requests, file I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk. You may store HTTP responses in [Memcached](http://memcached.org/) like so:
121
+ You can then set the `output_dir` and `cache_dir` on OS X as:
122
+
123
+ ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache
124
+
125
+ Once you are done with the RAM disk, release the memory:
126
+
127
+ diskutil unmount $ramdisk
128
+ hdiutil detach $ramdisk
129
+
130
+ Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.
131
+
132
+ #### Memcached
133
+
134
+ You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:
65
135
 
66
136
  ruby cat.rb --cache_dir memcached://localhost:11211
67
137
 
68
- And you may store JSON documents in [Redis](http://redis.io/) like so:
138
+ The data in Memcached will be lost between reboots.
139
+
140
+ #### Redis
141
+
142
+ You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:
69
143
 
70
144
  ruby cat.rb --output_dir redis://localhost:6379/0
71
145
 
72
- Note that Pupa.rb flushes the JSON documents before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/1`, where `1` is the Redis database number.
146
+ To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):
147
+
148
+ ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
149
+
150
+ Requiring the `hiredis` gem will slightly improve performance.
151
+
152
+ Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
73
153
 
74
154
  ### Skipping validation
75
155
 
76
156
  The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
77
157
 
158
+ ### Parsing JSON
159
+
160
+ If the rest of your scraper is fast, you may see an improvement by using the `oj` gem. Just `require 'oj'` and Pupa.rb will automatically pick it up, since it uses [MultiJson](https://github.com/intridea/multi_json).
161
+
78
162
  ### Profiling
79
163
 
80
164
  You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
@@ -85,7 +169,7 @@ Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `scrip
85
169
 
86
170
  CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb
87
171
 
88
- You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to change the behavior of the `json-schema` gem, for whatever reason.
172
+ You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.
89
173
 
90
174
  [perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straight-forward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):
91
175
 
data/lib/pupa/models/concerns/contactable.rb CHANGED
@@ -6,6 +6,7 @@ module Pupa
6
6
 
7
7
  included do
8
8
  attr_reader :contact_details
9
+ dump :contact_details
9
10
  end
10
11
 
11
12
  # Sets the contact details.
data/lib/pupa/models/concerns/identifiable.rb CHANGED
@@ -6,6 +6,7 @@ module Pupa
6
6
 
7
7
  included do
8
8
  attr_reader :identifiers
9
+ dump :identifiers
9
10
  end
10
11
 
11
12
  # Sets the identifiers.
data/lib/pupa/models/concerns/linkable.rb CHANGED
@@ -6,6 +6,7 @@ module Pupa
6
6
 
7
7
  included do
8
8
  attr_accessor :links
9
+ dump :links
9
10
  end
10
11
 
11
12
  # Adds a URL.
data/lib/pupa/models/concerns/nameable.rb CHANGED
@@ -6,6 +6,7 @@ module Pupa
6
6
 
7
7
  included do
8
8
  attr_accessor :other_names
9
+ dump :other_names
9
10
  end
10
11
 
11
12
  # Adds an alternate or former name.
data/lib/pupa/models/concerns/sourceable.rb CHANGED
@@ -6,6 +6,7 @@ module Pupa
6
6
 
7
7
  included do
8
8
  attr_accessor :sources
9
+ dump :sources
9
10
  end
10
11
 
11
12
  # Adds a source to the object.
data/lib/pupa/models/concerns/timestamps.rb CHANGED
@@ -8,6 +8,7 @@ module Pupa
8
8
 
9
9
  included do
10
10
  attr_accessor :created_at, :updated_at
11
+ dump :created_at, :updated_at
11
12
 
12
13
  set_callback(:create, :before) do |object|
13
14
  object.created_at = Time.now.utc
data/lib/pupa/models/membership.rb CHANGED
@@ -1,6 +1,8 @@
1
1
  module Pupa
2
2
  # A relationship between a person and an organization.
3
- class Membership < Base
3
+ class Membership
4
+ include Model
5
+
4
6
  self.schema = 'popolo/membership'
5
7
 
6
8
  include Concerns::Timestamps
@@ -10,6 +12,8 @@ module Pupa
10
12
 
11
13
  attr_accessor :label, :role, :person_id, :organization_id, :post_id,
12
14
  :start_date, :end_date
15
+ dump :label, :role, :person_id, :organization_id, :post_id,
16
+ :start_date, :end_date
13
17
 
14
18
  foreign_key :person_id, :organization_id, :post_id
15
19
 
data/lib/pupa/models/{base.rb → model.rb} CHANGED
@@ -3,9 +3,6 @@ require 'securerandom'
3
3
  require 'set'
4
4
 
5
5
  require 'active_support/callbacks'
6
- require 'active_support/core_ext/hash/except'
7
- require 'active_support/core_ext/hash/keys'
8
- require 'active_support/core_ext/hash/slice'
9
6
  require 'active_support/core_ext/object/try'
10
7
  require 'json-schema'
11
8
 
@@ -14,43 +11,36 @@ require 'pupa/refinements/json-schema'
14
11
  JSON::Validator.cache_schemas = true
15
12
 
16
13
  module Pupa
17
- # The base class from which other primary Popolo classes inherit.
18
- class Base
19
- include ActiveSupport::Callbacks
20
- define_callbacks :create, :save
21
-
22
- class_attribute :json_schema
23
- class_attribute :properties
24
- class_attribute :foreign_keys
25
- class_attribute :foreign_objects
26
-
27
- self.properties = Set.new
28
- self.foreign_keys = Set.new
29
- self.foreign_objects = Set.new
30
-
31
- class << self
32
- # Declare the class' properties.
33
- #
34
- # When converting an object to a hash using the `to_h` method, only the
35
- # properties declared with `attr_accessor` or `attr_reader` will be
36
- # included in the hash.
37
- #
38
- # @param [Array<Symbol>] the class' properties
39
- def attr_accessor(*attributes)
40
- self.properties += attributes # use assignment to not overwrite the parent's attribute
41
- super
42
- end
14
+ # Adds methods expected by Pupa processors.
15
+ module Model
16
+ extend ActiveSupport::Concern
43
17
 
44
- # Declare the class' properties.
45
- #
46
- # When converting an object to a hash using the `to_h` method, only the
47
- # properties declared with `attr_accessor` or `attr_reader` will be
48
- # included in the hash.
18
+ included do
19
+ include ActiveSupport::Callbacks
20
+ define_callbacks :create, :save
21
+
22
+ class_attribute :json_schema
23
+ class_attribute :properties
24
+ class_attribute :foreign_keys
25
+ class_attribute :foreign_objects
26
+
27
+ self.properties = Set.new
28
+ self.foreign_keys = Set.new
29
+ self.foreign_objects = Set.new
30
+
31
+ attr_reader :_id
32
+ attr_accessor :_type, :extras
33
+
34
+ dump :_id, :_type, :extras
35
+ end
36
+
37
+ module ClassMethods
38
+ # Declare which properties should be dumped to JSON after a scraping task
39
+ # is complete. A subset of these properties will be imported to MongoDB.
49
40
  #
50
- # @param [Array<Symbol>] the class' properties
51
- def attr_reader(*attributes)
41
+ # @param [Array<Symbol>] the properties to dump to JSON
42
+ def dump(*attributes)
52
43
  self.properties += attributes # use assignment to not overwrite the parent's attribute
53
- super
54
44
  end
55
45
 
56
46
  # Declare the class' foreign keys.
@@ -91,8 +81,6 @@ module Pupa
91
81
  end
92
82
  end
93
83
 
94
- attr_accessor :_id, :_type, :extras
95
-
96
84
  # @param [Hash] properties the object's properties
97
85
  def initialize(properties = {})
98
86
  @_type = self.class.to_s.underscore
@@ -149,14 +137,14 @@ module Pupa
149
137
  #
150
138
  # @return [Hash] a subset of the object's properties
151
139
  def fingerprint
152
- to_h.except(:_id)
140
+ to_h(persist: true).except(:_id)
153
141
  end
154
142
 
155
143
  # Returns the object's foreign keys and foreign objects.
156
144
  #
157
145
  # @return [Hash] the object's foreign keys and foreign objects
158
146
  def foreign_properties
159
- to_h(include_foreign_objects: true).slice(*foreign_keys + foreign_objects)
147
+ to_h.slice(*foreign_keys + foreign_objects)
160
148
  end
161
149
 
162
150
  # Validates the object against the schema.
@@ -165,17 +153,19 @@ module Pupa
165
153
  def validate!
166
154
  if self.class.json_schema
167
155
  # JSON::Validator#initialize_schema runs fastest if given a hash.
168
- JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h))
156
+ JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h(persist: true)))
169
157
  end
170
158
  end
171
159
 
172
160
  # Returns the object as a hash.
173
161
  #
174
- # @param [Boolean] include_foreign_objects whether to include foreign objects
162
+ # @param [Boolean] persist whether the object is being persisted, validated
163
+ # or used as a MongoDB selector, in which case foreign objects (i.e. hints)
164
+ # are excluded
175
165
  # @return [Hash] the object as a hash
176
- def to_h(include_foreign_objects: false)
166
+ def to_h(persist: false)
177
167
  {}.tap do |hash|
178
- (include_foreign_objects ? properties : properties - foreign_objects).each do |property|
168
+ (persist ? properties - foreign_objects : properties).each do |property|
179
169
  value = self[property]
180
170
  if value == false || value.present?
181
171
  hash[property] = value
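The change above separates "readable attribute" from "dumped property": `attr_accessor` no longer registers a property, and `to_h(persist: true)` drops foreign objects. The following is a minimal sketch of that idea without ActiveSupport. `SketchModel`, `SketchOrganization` and the per-class `Set` storage are assumptions made for illustration, not Pupa's implementation (which uses `class_attribute` and supports inheritance).

```ruby
require 'set'

# Hypothetical stand-in for the Model concern; names are illustrative only.
module SketchModel
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Properties registered for dumping to JSON.
    def properties
      @properties ||= Set.new
    end

    # Properties that are only hints and are excluded when persisting.
    def foreign_objects
      @foreign_objects ||= Set.new
    end

    def dump(*attributes)
      properties.merge(attributes)
    end

    def foreign_object(*attributes)
      foreign_objects.merge(attributes)
    end
  end

  # Returns the dumped properties; drops foreign objects when persisting.
  def to_h(persist: false)
    names = self.class.properties
    names -= self.class.foreign_objects if persist
    names.each_with_object({}) do |name, hash|
      value = public_send(name)
      # Keep false, skip nil and empty values.
      next if value.nil? || (value.respond_to?(:empty?) && value.empty?)
      hash[name] = value
    end
  end
end

class SketchOrganization
  include SketchModel

  attr_accessor :name, :parent, :scratch
  dump :name, :parent # :scratch is readable but never dumped
  foreign_object :parent
end

org = SketchOrganization.new
org.name = 'City Council'
org.parent = { name: 'City' }
org.scratch = 'not dumped'

full = org.to_h
persisted = org.to_h(persist: true)
```

In this sketch `full` contains both `:name` and `:parent`, while `persisted` contains only `:name`.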
data/lib/pupa/models/organization.rb CHANGED
@@ -1,7 +1,9 @@
1
1
  module Pupa
2
2
  # A group with a common purpose or reason for existence that goes beyond the set
3
3
  # of people belonging to it.
4
- class Organization < Base
4
+ class Organization
5
+ include Model
6
+
5
7
  self.schema = 'popolo/organization'
6
8
 
7
9
  include Concerns::Timestamps
@@ -13,6 +15,8 @@ module Pupa
13
15
 
14
16
  attr_accessor :name, :classification, :parent_id, :parent, :founding_date,
15
17
  :dissolution_date, :image
18
+ dump :name, :classification, :parent_id, :parent, :founding_date,
19
+ :dissolution_date, :image
16
20
 
17
21
  foreign_key :parent_id
18
22
 
data/lib/pupa/models/person.rb CHANGED
@@ -1,6 +1,8 @@
1
1
  module Pupa
2
2
  # A real person, alive or dead.
3
- class Person < Base
3
+ class Person
4
+ include Model
5
+
4
6
  self.schema = 'popolo/person'
5
7
 
6
8
  include Concerns::Timestamps
@@ -13,6 +15,9 @@ module Pupa
13
15
  attr_accessor :name, :family_name, :given_name, :additional_name,
14
16
  :honorific_prefix, :honorific_suffix, :patronymic_name, :sort_name,
15
17
  :email, :gender, :birth_date, :death_date, :image, :summary, :biography
18
+ dump :name, :family_name, :given_name, :additional_name,
19
+ :honorific_prefix, :honorific_suffix, :patronymic_name, :sort_name,
20
+ :email, :gender, :birth_date, :death_date, :image, :summary, :biography
16
21
 
17
22
  # Returns the person's name.
18
23
  #
data/lib/pupa/models/post.rb CHANGED
@@ -1,6 +1,8 @@
1
1
  module Pupa
2
2
  # A position that exists independent of the person holding it.
3
- class Post < Base
3
+ class Post
4
+ include Model
5
+
4
6
  self.schema = 'popolo/post'
5
7
 
6
8
  include Concerns::Timestamps
@@ -9,6 +11,7 @@ module Pupa
9
11
  include Concerns::Linkable
10
12
 
11
13
  attr_accessor :label, :role, :organization_id, :start_date, :end_date
14
+ dump :label, :role, :organization_id, :start_date, :end_date
12
15
 
13
16
  foreign_key :organization_id
14
17
 
data/lib/pupa/processor/client.rb CHANGED
@@ -4,6 +4,8 @@ require 'faraday_middleware/response_middleware'
4
4
 
5
5
  require 'pupa/processor/middleware/logger'
6
6
  require 'pupa/processor/middleware/parse_html'
7
+ require 'pupa/processor/middleware/parse_json'
8
+ require 'pupa/processor/middleware/raise_error'
7
9
  require 'pupa/refinements/faraday_middleware'
8
10
 
9
11
  begin
@@ -18,7 +20,9 @@ module Pupa
18
20
  class Client
19
21
  # Returns a configured Faraday HTTP client.
20
22
  #
21
- # In order to automatically parse XML responses, you must `require 'multi_xml'`.
23
+ # To automatically parse XML responses, you must `require 'multi_xml'`.
24
+ #
25
+ # Memcached support depends on the `dalli` gem.
22
26
  #
23
27
  # @param [String] cache_dir a directory or a Memcached address
24
28
  # (e.g. `memcached://localhost:11211`) in which to cache requests
@@ -29,16 +33,19 @@ module Pupa
29
33
  Faraday.new do |connection|
30
34
  connection.request :url_encoded
31
35
  connection.use Middleware::Logger, Logger.new('faraday', level: level)
36
+ connection.use Middleware::RaiseError # useful for breaking concurrent requests
37
+
38
+ # @see http://tools.ietf.org/html/rfc4627
39
+ connection.use Middleware::ParseJson, content_type: /\bjson$/
32
40
 
33
41
  # @see http://tools.ietf.org/html/rfc2854
34
42
  # @see http://tools.ietf.org/html/rfc3236
35
- connection.use Middleware::ParseHtml, content_type: %w(text/html application/xhtml+xml)
36
-
37
- # @see http://tools.ietf.org/html/rfc4627
38
- connection.use FaradayMiddleware::ParseJson, content_type: /\bjson$/
43
+ if defined?(Nokogiri)
44
+ connection.use Middleware::ParseHtml, content_type: %w(text/html application/xhtml+xml)
45
+ end
39
46
 
47
+ # @see http://tools.ietf.org/html/rfc3023
40
48
  if defined?(MultiXml)
41
- # @see http://tools.ietf.org/html/rfc3023
42
49
  connection.use FaradayMiddleware::ParseXml, content_type: /\bxml$/
43
50
  end
44
51
 
@@ -53,7 +60,11 @@ module Pupa
53
60
  end
54
61
  end
55
62
 
56
- connection.adapter Faraday.default_adapter # must be last
63
+ if defined?(Typhoeus)
64
+ connection.adapter :typhoeus
65
+ else
66
+ connection.adapter Faraday.default_adapter # must be last
67
+ end
57
68
  end
58
69
  end
59
70
  end
data/lib/pupa/processor/document_store/file_store.rb CHANGED
@@ -34,7 +34,7 @@ module Pupa
34
34
  # @return [Hash] the value of the given key
35
35
  def read(name)
36
36
  File.open(namespaced_key(name)) do |f|
37
- JSON.load(f)
37
+ MultiJson.load(f)
38
38
  end
39
39
  end
40
40
 
@@ -54,7 +54,28 @@ module Pupa
54
54
  # @param [Hash] value a value
55
55
  def write(name, value)
56
56
  File.open(namespaced_key(name), 'w') do |f|
57
- JSON.dump(value, f)
57
+ f.write(MultiJson.dump(value))
58
+ end
59
+ end
60
+
61
+ # Writes, as JSON, the value to a file with the given name, unless such
62
+ # a file exists.
63
+ #
64
+ # @param [String] name a key
65
+ # @param [Hash] value a value
66
+ # @return [Boolean] whether the key was set
67
+ def write_unless_exists(name, value)
68
+ !exist?(name).tap do |exists|
69
+ write(name, value) unless exists
70
+ end
71
+ end
72
+
73
+ # Writes, as JSON, the values to files with the given names.
74
+ #
75
+ # @param [Hash] pairs key-value pairs
76
+ def write_multi(pairs)
77
+ pairs.each do |name,value|
78
+ write(name, value)
58
79
  end
59
80
  end
60
81
 
@@ -72,6 +93,11 @@ module Pupa
72
93
  end
73
94
  end
74
95
 
96
+ # Collects commands to run all at once.
97
+ def pipelined
98
+ yield
99
+ end
100
+
75
101
  private
76
102
 
77
103
  def namespaced_key(name)
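The `write_unless_exists` contract added above (write once, report whether the write happened) can be sketched with the standard library alone. `SketchFileStore` is a hypothetical miniature, and stdlib `JSON` is swapped in for `MultiJson`; it is not the gem's class.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical miniature of a file-backed document store.
class SketchFileStore
  def initialize(dir)
    @dir = dir
  end

  def exist?(name)
    File.exist?(path(name))
  end

  def write(name, value)
    File.write(path(name), JSON.generate(value))
  end

  # Returns true if the file was written, false if it already existed.
  def write_unless_exists(name, value)
    return false if exist?(name)

    write(name, value)
    true
  end

  def read(name)
    JSON.parse(File.read(path(name)))
  end

  private

  def path(name)
    File.join(@dir, name)
  end
end

results = Dir.mktmpdir do |dir|
  store = SketchFileStore.new(dir)
  first = store.write_unless_exists('person_1.json', { 'name' => 'Alice' })
  second = store.write_unless_exists('person_1.json', { 'name' => 'Bob' })
  [first, second, store.read('person_1.json')]
end
```

The second write is refused and the first value is kept. Note that the check-then-write in this sketch is not atomic, unlike Redis `SETNX`.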
data/lib/pupa/processor/document_store/redis_store.rb CHANGED
@@ -8,16 +8,17 @@ module Pupa
8
8
  # can select a different database than the default `0` for use with Pupa
9
9
  # by passing an argument like `redis://localhost:6379/0`.
10
10
  #
11
- # @note Redis support depends on the `redis` gem. For better performance,
12
- # use the `hiredis` gem as well.
11
+ # @note Redis support depends on the `redis-store` gem. You may optionally
12
+ # use the `hiredis` gem to further improve performance.
13
13
  class RedisStore
14
14
  # @param [String] address the address (e.g. `redis://localhost:6379/0`)
15
15
  # in which to dump JSON documents
16
- def initialize(address)
17
- options = {}
18
- if defined?(Hiredis)
19
- options.update(driver: :hiredis)
20
- end
16
+ # @param [Boolean] pipelined whether to enable
17
+ # [pipelining](http://redis.io/topics/pipelining)
18
+ def initialize(address, pipelined: false)
19
+ @pipelined = pipelined
20
+ options = {marshalling: false}
21
+ options.update(driver: :hiredis) if defined?(Hiredis)
21
22
  @redis = Redis::Store::Factory.create(address, options)
22
23
  end
23
24
 
@@ -41,7 +42,7 @@ module Pupa
41
42
  # @param [String] name a key
42
43
  # @return [Hash] the value of the given key
43
44
  def read(name)
44
- JSON.load(@redis.get(name))
45
+ MultiJson.load(@redis.get(name))
45
46
  end
46
47
 
47
48
  # Returns, as JSON, the values of the given keys.
@@ -49,7 +50,7 @@ module Pupa
49
50
  # @param [String] names keys
50
51
  # @return [Array<Hash>] the values of the given keys
51
52
  def read_multi(names)
52
- @redis.mget(*names).map{|value| JSON.load(value)}
53
+ @redis.mget(*names).map{|value| MultiJson.load(value)}
53
54
  end
54
55
 
55
56
  # Writes, as JSON, the value to a key.
@@ -57,7 +58,28 @@ module Pupa
57
58
  # @param [String] name a key
58
59
  # @param [Hash] value a value
59
60
  def write(name, value)
60
- @redis.set(name, JSON.dump(value))
61
+ @redis.set(name, MultiJson.dump(value))
62
+ end
63
+
64
+ # Writes, as JSON, the value to a key, unless the key exists.
65
+ #
66
+ # @param [String] name a key
67
+ # @param [Hash] value a value
68
+ # @return [Boolean] whether the key was set
69
+ def write_unless_exists(name, value)
70
+ @redis.setnx(name, MultiJson.dump(value))
71
+ end
72
+
73
+ # Writes, as JSON, the values to keys.
74
+ #
75
+ # @param [Hash] pairs key-value pairs
76
+ def write_multi(pairs)
77
+ args = []
78
+ pairs.each do |key,value|
79
+ args << key
80
+ args << MultiJson.dump(value)
81
+ end
82
+ @redis.mset(*args)
61
83
  end
62
84
 
63
85
  # Delete a key.
@@ -71,6 +93,17 @@ module Pupa
71
93
  def clear
72
94
  @redis.flushdb
73
95
  end
96
+
97
+ # Collects commands to run all at once.
98
+ def pipelined
99
+ if @pipelined
100
+ @redis.pipelined do
101
+ yield
102
+ end
103
+ else
104
+ yield
105
+ end
106
+ end
74
107
  end
75
108
  end
76
109
  end
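Both stores expose a `pipelined` method so the processor can wrap a batch of writes without knowing whether batching is available. A rough in-memory sketch of that pattern follows. `SketchBufferedStore` and its `flushes` counter are hypothetical and only imitate the effect of sending commands in one round trip; they do not use Redis.

```ruby
# Hypothetical in-memory store that counts how many times it "flushes".
class SketchBufferedStore
  attr_reader :data, :flushes

  def initialize(pipelined: false)
    @pipelined = pipelined
    @data = {}
    @flushes = 0
    @buffer = nil
  end

  # Buffers the write inside a pipelined block, otherwise commits immediately.
  def write(name, value)
    if @buffer
      @buffer[name] = value
    else
      commit({ name => value })
    end
  end

  # Collects writes and commits them together when pipelining is enabled;
  # otherwise it simply yields, like the file store.
  def pipelined
    return yield unless @pipelined

    @buffer = {}
    yield
    commit(@buffer)
  ensure
    @buffer = nil
  end

  private

  def commit(pairs)
    @data.update(pairs)
    @flushes += 1
  end
end

plain = SketchBufferedStore.new
plain.pipelined { 3.times { |i| plain.write("doc_#{i}", i) } }

batched = SketchBufferedStore.new(pipelined: true)
batched.pipelined { 3.times { |i| batched.write("doc_#{i}", i) } }
```

Both stores end with the same data, but the plain store commits three times and the batched store once.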
data/lib/pupa/processor/document_store.rb CHANGED
@@ -6,12 +6,15 @@ module Pupa
6
6
  class DocumentStore
7
7
  # Returns a configured JSON document store.
8
8
  #
9
+ # See each document store for more information.
10
+ #
9
11
  # @param [String] argument the filesystem directory or Redis address
10
12
  # (e.g. `redis://localhost:6379/0`) in which to dump JSON documents
13
+ # @param [Hash] options optional arguments
11
14
  # @return a configured JSON document store
12
- def self.new(argument)
15
+ def self.new(argument, **options)
13
16
  if argument[%r{\Aredis://}]
14
- RedisStore.new(argument)
17
+ RedisStore.new(argument, options)
15
18
  else
16
19
  FileStore.new(argument)
17
20
  end
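`DocumentStore.new` chooses a backend from the shape of its argument. The sketch below shows the same dispatch and how the database number can be read from a `redis://` address. `describe_store` is a hypothetical helper that returns a plain hash instead of constructing a store.

```ruby
require 'uri'

# Hypothetical helper: reports which backend an argument would select.
def describe_store(argument)
  if argument[%r{\Aredis://}]
    uri = URI.parse(argument)
    # The path component carries the Redis database number, e.g. "/15".
    { type: :redis, host: uri.host, port: uri.port, db: uri.path.delete_prefix('/').to_i }
  else
    { type: :file, dir: argument }
  end
end

redis_info = describe_store('redis://localhost:6379/15')
file_info = describe_store('/tmp/scraped_data')
```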
data/lib/pupa/processor/middleware/parse_html.rb CHANGED
@@ -7,9 +7,9 @@ module Pupa
7
7
  class ParseHtml < FaradayMiddleware::ResponseMiddleware
8
8
  dependency 'nokogiri'
9
9
 
10
- define_parser { |body|
10
+ define_parser do |body|
11
11
  Nokogiri::HTML(body) unless body.empty?
12
- }
12
+ end
13
13
  end
14
14
  end
15
15
  end
data/lib/pupa/processor/middleware/parse_json.rb ADDED
@@ -0,0 +1,16 @@
1
+ module Pupa
2
+ class Processor
3
+ module Middleware
4
+ # A Faraday response middleware for parsing JSON.
5
+ #
6
+ # @see https://github.com/lostisland/faraday_middleware/issues/30#issuecomment-4706892
7
+ class ParseJson < FaradayMiddleware::ResponseMiddleware
8
+ dependency 'multi_json'
9
+
10
+ define_parser do |body|
11
+ MultiJson.load(body) unless body.strip.empty?
12
+ end
13
+ end
14
+ end
15
+ end
16
+ end
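The parser above skips bodies that are empty or contain only whitespace, which would otherwise raise a parse error. A minimal sketch of that guard, using stdlib `JSON` in place of `MultiJson`; `parse_json_body` is a hypothetical helper name:

```ruby
require 'json'

# Returns the parsed document, or nil for a blank body.
def parse_json_body(body)
  JSON.parse(body) unless body.strip.empty?
end

blank_result = parse_json_body("  \n")
parsed_result = parse_json_body('{"name": "Alice"}')
```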
data/lib/pupa/processor/middleware/raise_error.rb ADDED
@@ -0,0 +1,33 @@
1
+ module Pupa
2
+ class Processor
3
+ module Middleware
4
+ # A Faraday response middleware for raising an error if unsuccessful.
5
+ #
6
+ # @see Faraday::Response::RaiseError
7
+ class RaiseError < Faraday::Response::Middleware
8
+ def on_complete(env)
9
+ case env[:status]
10
+ when 404
11
+ raise Faraday::Error::ResourceNotFound, response_values(env)
12
+ when 407
13
+ # mimic the behavior that we get with proxy requests with HTTPS
14
+ raise Faraday::Error::ConnectionFailed, %{407 "Proxy Authentication Required "}
15
+ when 400...600
16
+ raise Faraday::Error::ClientError, response_values(env)
17
+ end
18
+ end
19
+
20
+ def response_values(env) # XXX add more keys
21
+ {
22
+ method: env[:method],
23
+ url: env[:url].to_s,
24
+ request_headers: env[:request_headers],
25
+ status: env[:status],
26
+ response_headers: env[:response_headers],
27
+ body: env[:body].to_s,
28
+ }
29
+ end
30
+ end
31
+ end
32
+ end
33
+ end
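The order of the `when` clauses matters in the middleware above: 404 and 407 are matched before the exclusive range `400...600`, which catches every other 4xx and 5xx status. A small sketch of the same mapping, with symbols standing in for Faraday's error classes (`error_for_status` is a hypothetical name):

```ruby
# Maps an HTTP status to a symbol naming the error that would be raised,
# or nil for statuses that pass through untouched.
def error_for_status(status)
  case status
  when 404
    :resource_not_found
  when 407
    :connection_failed
  when 400...600 # exclusive range: 400 through 599
    :client_error
  end
end

statuses = [200, 404, 407, 429, 500, 599, 600]
mapping = statuses.to_h { |status| [status, error_for_status(status)] }
```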
data/lib/pupa/processor/persistence.rb CHANGED
@@ -24,7 +24,7 @@ module Pupa
24
24
  when 1
25
25
  query.first
26
26
  else
27
- raise Errors::TooManyMatches, "selector matches multiple documents during find: #{collection_name} #{JSON.dump(selector)}"
27
+ raise Errors::TooManyMatches, "selector matches multiple documents during find: #{collection_name} #{MultiJson.dump(selector)}"
28
28
  end
29
29
  end
30
30
 
@@ -41,14 +41,14 @@ module Pupa
41
41
  case query.count
42
42
  when 0
43
43
  @object.run_callbacks(:create) do
44
- collection.insert(@object.to_h)
44
+ collection.insert(@object.to_h(persist: true))
45
45
  @object._id.to_s
46
46
  end
47
47
  when 1
48
- query.update(@object.to_h)
48
+ query.update(@object.to_h(persist: true))
49
49
  query.first['_id'].to_s
50
50
  else
51
- raise Errors::TooManyMatches, "selector matches multiple documents during save: #{collection_name} #{JSON.dump(selector)}"
51
+ raise Errors::TooManyMatches, "selector matches multiple documents during save: #{collection_name} #{MultiJson.dump(selector)}"
52
52
  end
53
53
  end
54
54
  end
data/lib/pupa/processor.rb CHANGED
@@ -1,7 +1,3 @@
1
- require 'json'
2
-
3
- require 'nokogiri'
4
-
5
1
  require 'pupa/processor/client'
6
2
  require 'pupa/processor/dependency_graph'
7
3
  require 'pupa/processor/helper'
@@ -30,12 +26,13 @@ module Pupa
30
26
  # @param [String] cache_dir the directory or Memcached address
31
27
  # (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
32
28
  # @param [Integer] expires_in the cache's expiration time in seconds
29
+ # @param [Boolean] pipelined whether to dump JSON documents all at once
33
30
  # @param [Boolean] validate whether to validate JSON documents
34
31
  # @param [String] level the log level
35
32
  # @param [String,IO] logdev the log device
36
33
  # @param [Hash] options criteria for selecting the methods to run
37
- def initialize(output_dir, cache_dir: nil, expires_in: 86400, validate: true, level: 'INFO', logdev: STDOUT, options: {})
38
- @store = DocumentStore.new(output_dir)
34
+ def initialize(output_dir, cache_dir: nil, expires_in: 86400, pipelined: false, validate: true, level: 'INFO', logdev: STDOUT, options: {})
35
+ @store = DocumentStore.new(output_dir, pipelined: pipelined)
39
36
  @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, level: level)
40
37
  @logger = Logger.new('pupa', level: level, logdev: logdev)
41
38
  @validate = validate
@@ -73,6 +70,15 @@ module Pupa
73
70
  client.post(url, params).body
74
71
  end
75
72
 
73
+ # Yields the object to the transformation task for processing, e.g. saving
74
+ # to disk, printing to CSV, etc.
75
+ #
76
+ # @param [Object] an object
77
+ # @note All the good terms are taken by Ruby: `return`, `send` and `yield`.
78
+ def dispatch(object)
79
+ Fiber.yield(object)
80
+ end
81
+
76
82
  # Adds a scraping task to Pupa.rb.
77
83
  #
78
84
  # Defines a method whose name is identical to `task_name`. This method
@@ -113,9 +119,11 @@ module Pupa
113
119
  # @return [Integer] the number of scraped objects
114
120
  def dump_scraped_objects(task_name)
115
121
  count = 0
116
- send(task_name).each do |object|
117
- count += 1 # we don't know the size of the enumeration
118
- dump_scraped_object(object)
122
+ @store.pipelined do
123
+ send(task_name).each do |object|
124
+ count += 1 # we don't know the size of the enumeration
125
+ dump_scraped_object(object)
126
+ end
119
127
  end
120
128
  count
121
129
  end
@@ -182,7 +190,7 @@ module Pupa
  end

  unless objects.empty?
- raise Errors::UnprocessableEntity, "couldn't resolve #{objects.size}/#{size} objects:\n #{objects.values.map{|object| JSON.dump(object.foreign_properties)}.join("\n ")}"
+ raise Errors::UnprocessableEntity, "couldn't resolve #{objects.size}/#{size} objects:\n #{objects.values.map{|object| MultiJson.dump(object.foreign_properties)}.join("\n ")}"
  end
  end

@@ -222,14 +230,12 @@ module Pupa
  type = object.class.to_s.demodulize.underscore
  name = "#{type}_#{object._id.gsub(File::SEPARATOR, '_')}.json"

- if @store.exist?(name)
+ if @store.write_unless_exists(name, object.to_h)
+ info {"save #{type} #{object.to_s} as #{name}"}
+ else
  raise Errors::DuplicateObjectIdError, "duplicate object ID: #{object._id} (was the same objected yielded twice?)"
  end

- info {"save #{type} #{object.to_s} as #{name}"}
-
- @store.write(name, object.to_h(include_foreign_objects: true))
-
  if @validate
  begin
  object.validate!
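Note on `dispatch`: the new helper wraps `Fiber.yield`, so a scraping task can hand objects to a consumer one at a time. The following is a minimal sketch of that pattern in isolation. `SketchProcessor`, `collect` and the sample hashes are hypothetical names for illustration and are not part of Pupa.rb; only the body of `dispatch` is taken from the diff.

```ruby
# Hypothetical stand-in for a processor; not Pupa's actual class.
class SketchProcessor
  # Same body as the `dispatch` helper added in this diff.
  def dispatch(object)
    Fiber.yield(object)
  end

  # A scraping task hands each object to the consumer via `dispatch`.
  def scrape_people
    dispatch({name: 'Alice'})
    dispatch({name: 'Bob'})
  end

  # Runs the task inside a Fiber and collects whatever it dispatches.
  def collect(task_name)
    done = Object.new # sentinel returned when the task has finished
    fiber = Fiber.new do
      send(task_name)
      done
    end

    results = []
    loop do
      value = fiber.resume
      break if value.equal?(done)
      results << value
    end
    results
  end
end

people = SketchProcessor.new.collect(:scrape_people)
puts people.inspect
```

Because `dispatch` only suspends the running fiber, it must be called from code that is executing inside a fiber, as `collect` arranges here.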
data/lib/pupa/runner.rb CHANGED
@@ -18,6 +18,7 @@ module Pupa
  output_dir: File.expand_path('scraped_data', Dir.pwd),
  cache_dir: File.expand_path('web_cache', Dir.pwd),
  expires_in: 86400, # 1 day
+ pipelined: false,
  validate: true,
  host_with_port: 'localhost:27017',
  database: 'pupa',
@@ -81,6 +82,9 @@ module Pupa
  opts.on('-e', '--expires_in SECONDS', "The cache's expiration time in seconds") do |v|
  options.expires_in = v
  end
+ opts.on('--pipelined', 'Dump JSON documents all at once') do |v|
+ options.pipelined = v
+ end
  opts.on('--[no-]validate', 'Validate JSON documents') do |v|
  options.validate = v
  end
@@ -143,6 +147,7 @@ module Pupa
  processor = @processor_class.new(options.output_dir,
  cache_dir: options.cache_dir,
  expires_in: options.expires_in,
+ pipelined: options.pipelined,
  validate: options.validate,
  level: options.level,
  options: Hash[*rest])
@@ -173,7 +178,7 @@ module Pupa
  report = {
  plan: {
  processor: @processor_class,
- arguments: options.to_h,
+ arguments: options.dup.to_h,
  options: rest,
  },
  start: Time.now.utc,
@@ -198,7 +203,7 @@ module Pupa

  report[:end] = Time.now.utc
  report[:time] = report[:end] - report[:start]
- puts JSON.dump(report)
+ puts MultiJson.dump(report)
  end
  end
  end
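Note on the `--pipelined` switch: it takes no argument, so `OptionParser` passes `true` to the block when the flag is present, while `--[no-]validate` passes `true` or `false`. A minimal sketch of that behaviour using only the standard library follows. `SketchOptions` and `parse_sketch` are hypothetical names; the runner's own option object is not reproduced here.

```ruby
require 'optparse'

# Hypothetical container for the two options shown; defaults mirror the diff.
SketchOptions = Struct.new(:pipelined, :validate)

def parse_sketch(argv)
  options = SketchOptions.new(false, true)

  OptionParser.new do |opts|
    # A switch without an argument yields `true` when present.
    opts.on('--pipelined', 'Dump JSON documents all at once') do |v|
      options.pipelined = v
    end
    # A `--[no-]` switch yields `true` or `false`.
    opts.on('--[no-]validate', 'Validate JSON documents') do |v|
      options.validate = v
    end
  end.parse!(argv)

  options
end

options = parse_sketch(%w[--pipelined --no-validate])
puts "pipelined=#{options.pipelined} validate=#{options.validate}"
```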
data/lib/pupa/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Pupa
- VERSION = "0.0.8"
+ VERSION = "0.0.9"
  end
data/lib/pupa.rb CHANGED
@@ -3,6 +3,8 @@ require 'forwardable'

  require 'active_support/concern'
  require 'active_support/core_ext/class/attribute'
+ require 'active_support/core_ext/hash/except'
+ require 'active_support/core_ext/hash/slice'
  require 'active_support/core_ext/object/blank'
  require 'active_support/inflector'

@@ -18,7 +20,7 @@ require 'pupa/models/concerns/nameable'
  require 'pupa/models/concerns/sourceable'
  require 'pupa/models/concerns/timestamps'

- require 'pupa/models/base'
+ require 'pupa/models/model'
  require 'pupa/models/contact_detail_list'
  require 'pupa/models/identifier_list'
  require 'pupa/models/membership'
data/pupa.gemspec CHANGED
@@ -22,15 +22,16 @@ Gem::Specification.new do |s|
  s.add_runtime_dependency('json-schema', '~> 2.1.3')
  s.add_runtime_dependency('mail')
  s.add_runtime_dependency('moped', '~> 1.5.1')
- s.add_runtime_dependency('nokogiri', '~> 1.6.0')

  s.add_development_dependency('coveralls')
  s.add_development_dependency('dalli')
  s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
  s.add_development_dependency('multi_xml')
+ s.add_development_dependency('nokogiri', '~> 1.6.0')
  s.add_development_dependency('octokit') # to update Popolo schema
  s.add_development_dependency('rake')
  s.add_development_dependency('redis-store')
  s.add_development_dependency('rspec', '~> 2.10')
+ s.add_development_dependency('typhoeus')
  s.add_development_dependency('vcr', '~> 2.5.0')
  end
data/spec/models/base_spec.rb CHANGED
@@ -1,8 +1,10 @@
  require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')

- describe Pupa::Base do
+ describe Pupa::Model do
  module Music
- class Band < Pupa::Base
+ class Band
+ include Pupa::Model
+
  self.schema = {
  '$schema' => 'http://json-schema.org/draft-03/schema#',
  'properties' => {
@@ -19,14 +21,10 @@ describe Pupa::Base do
  },
  }

- attr_accessor :label, :founding_date, :inactive, :label_id, :manager_id, :links
- attr_reader :name
+ attr_accessor :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links
+ dump :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links
  foreign_key :label_id, :manager_id
  foreign_object :label
-
- def name=(name)
- @name = name
- end
  end
  end

@@ -38,20 +36,14 @@ describe Pupa::Base do
  Music::Band.new(properties)
  end

- describe '.attr_accessor' do
+ describe '.dump' do
  it 'should add properties' do
- [:_id, :_type, :extras, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
+ [:_id, :_type, :extras, :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
  Music::Band.properties.to_a.should include(property)
  end
  end
  end

- describe '.attr_reader' do
- it 'should add properties' do
- Music::Band.properties.to_a.should include(:name)
- end
- end
-
  describe '.foreign_key' do
  it 'should add foreign keys' do
  Music::Band.foreign_keys.to_a.should == [:label_id, :manager_id]
@@ -66,13 +58,15 @@ describe Pupa::Base do

  describe '.schema=' do
  let :klass_with_absolute_path do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  self.schema = '/path/to/schema.json'
  end
  end

  let :klass_with_relative_path do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  self.schema = 'schema'
  end
  end
@@ -178,7 +172,9 @@ describe Pupa::Base do

  describe '#validate!' do
  let :klass_without_schema do
- Class.new(Pupa::Base)
+ Class.new do
+ include Pupa::Model
+ end
  end

  it 'should do nothing if the schema is not set' do
@@ -196,12 +192,12 @@ describe Pupa::Base do
  end

  describe '#to_h' do
- it 'should not include foreign objects by default' do
- object.to_h.should == {_id: object._id, _type: 'music/band', name: 'Moderat', inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
+ it 'should include all properties by default' do
+ object.to_h.should == {_id: object._id, _type: 'music/band', name: 'Moderat', label: {name: 'Mute'}, inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
  end

- it 'should include foreign objects if desired' do
- object.to_h(include_foreign_objects: true).should == {_id: object._id, _type: 'music/band', name: 'Moderat', label: {name: 'Mute'}, inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
+ it 'should exclude foreign objects if persisting' do
+ object.to_h(persist: true).should == {_id: object._id, _type: 'music/band', name: 'Moderat', inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
  end

  it 'should not include blank properties' do
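Note on the `Pupa::Base` to `Pupa::Model` change: the specs above now `include` a module, declare dumped attributes with `dump`, and expect `to_h(persist: true)` to leave out foreign objects. The sketch below shows one way such a mixin could behave, written in plain Ruby. It is not Pupa's implementation, which is not shown in this section; `ModelSketch` and `BandSketch` are hypothetical, and the blank-value handling is simplified to a `nil` check.

```ruby
require 'set'

# Hypothetical mixin; an illustration only, not Pupa::Model itself.
module ModelSketch
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Attributes that `to_h` should emit.
    def properties
      @properties ||= Set.new
    end

    # Declares attributes to include when dumping an instance.
    def dump(*attributes)
      properties.merge(attributes)
    end

    # Attributes holding embedded foreign objects.
    def foreign_objects
      @foreign_objects ||= Set.new
    end

    def foreign_object(*attributes)
      foreign_objects.merge(attributes)
    end
  end

  # Returns declared, non-nil properties. Foreign objects are dropped
  # when `persist` is true.
  def to_h(persist: false)
    self.class.properties.each_with_object({}) do |name, hash|
      next if persist && self.class.foreign_objects.include?(name)
      value = public_send(name)
      hash[name] = value unless value.nil?
    end
  end
end

class BandSketch
  include ModelSketch

  attr_accessor :name, :label
  dump :name, :label
  foreign_object :label
end

band = BandSketch.new
band.name = 'Moderat'
band.label = {name: 'Mute'}
puts band.to_h.inspect
puts band.to_h(persist: true).inspect
```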
data/spec/models/concerns/contactable_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Contactable do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Contactable
  end
  end
data/spec/models/concerns/identifiable_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Identifiable do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Identifiable
  end
  end
data/spec/models/concerns/linkable_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Linkable do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Linkable
  end
  end
data/spec/models/concerns/nameable_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Nameable do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Nameable
  end
  end
data/spec/models/concerns/sourceable_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Sourceable do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Sourceable
  end
  end
data/spec/models/concerns/timestamps_spec.rb CHANGED
@@ -2,7 +2,8 @@ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')

  describe Pupa::Concerns::Timestamps do
  let :klass do
- Class.new(Pupa::Base) do
+ Class.new do
+ include Pupa::Model
  include Pupa::Concerns::Timestamps

  def save
data/spec/processor/document_store/file_store_spec.rb CHANGED
@@ -42,6 +42,38 @@ describe Pupa::Processor::DocumentStore::FileStore do
  end
  end

+ describe '#write_unless_exists' do
+ it 'should write an entry with the given value for the given key' do
+ store.exist?('new.json').should == false
+ store.write_unless_exists('new.json', {'name' => 'new'}).should == true
+ store.read('new.json').should == {'name' => 'new'}
+ store.delete('new.json') # cleanup
+ end
+
+ it 'should not write an entry with the given value for the given key if the key exists' do
+ store.write_unless_exists('foo.json', {'name' => 'new'}).should == false
+ store.read('foo.json').should == {'name' => 'foo'}
+ end
+ end
+
+ describe '#write_multi' do
+ it 'should write entries with the given values for the given keys' do
+ pairs = {}
+ %w(new1 new2).each do |name|
+ pairs["#{name}.json"] = {'name' => name}
+ end
+
+ pairs.keys.each do |name|
+ store.exist?(name).should == false
+ end
+ store.write_multi(pairs)
+ store.read_multi(pairs.keys).should == [{'name' => 'new1'}, {'name' => 'new2'}]
+ pairs.keys.each do |name| # cleanup
+ store.delete(name)
+ end
+ end
+ end
+
  describe '#delete' do
  it 'should delete an entry with the given key from the store' do
  store.write('new.json', {'name' => 'new'})
data/spec/processor/document_store/redis_store_spec.rb CHANGED
@@ -6,6 +6,7 @@ describe Pupa::Processor::DocumentStore::RedisStore do
  end

  before :all do
+ store.clear
  %w(foo bar baz).each do |name|
  store.write("#{name}.json", {'name' => name})
  end
@@ -48,6 +49,38 @@ describe Pupa::Processor::DocumentStore::RedisStore do
  end
  end

+ describe '#write_unless_exists' do
+ it 'should write an entry with the given value for the given key' do
+ store.exist?('new.json').should == false
+ store.write_unless_exists('new.json', {'name' => 'new'}).should == true
+ store.read('new.json').should == {'name' => 'new'}
+ store.delete('new.json') # cleanup
+ end
+
+ it 'should not write an entry with the given value for the given key if the key exists' do
+ store.write_unless_exists('foo.json', {'name' => 'new'}).should == false
+ store.read('foo.json').should == {'name' => 'foo'}
+ end
+ end
+
+ describe '#write_multi' do
+ it 'should write entries with the given values for the given keys' do
+ pairs = {}
+ %w(new1 new2).each do |name|
+ pairs["#{name}.json"] = {'name' => name}
+ end
+
+ pairs.keys.each do |name|
+ store.exist?(name).should == false
+ end
+ store.write_multi(pairs)
+ store.read_multi(pairs.keys).should == [{'name' => 'new1'}, {'name' => 'new2'}]
+ pairs.keys.each do |name| # cleanup
+ store.delete(name)
+ end
+ end
+ end
+
  describe '#delete' do
  it 'should delete an entry with the given key from the store' do
  store.write('new.json', {'name' => 'new'})
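Note on the store contract: both store specs expect `write_unless_exists` to return `true` and write when the key is absent, to return `false` and leave the entry untouched when it exists, and `write_multi`/`read_multi` to handle several keys at once. The in-memory sketch below satisfies those expectations. `MemoryStoreSketch` is a hypothetical class for illustration; it is neither `FileStore` nor `RedisStore`, whose implementations are not shown in this section.

```ruby
# Hypothetical in-memory store that follows the behaviour the specs describe.
class MemoryStoreSketch
  def initialize
    @entries = {}
  end

  def exist?(name)
    @entries.key?(name)
  end

  def read(name)
    @entries[name]
  end

  def read_multi(names)
    names.map { |name| read(name) }
  end

  def write(name, value)
    @entries[name] = value
  end

  # Writes only if the key is absent; reports whether a write happened.
  def write_unless_exists(name, value)
    return false if exist?(name)
    write(name, value)
    true
  end

  def write_multi(pairs)
    pairs.each { |name, value| write(name, value) }
  end

  def delete(name)
    @entries.delete(name)
  end
end

store = MemoryStoreSketch.new
store.write('foo.json', {'name' => 'foo'})
puts store.write_unless_exists('foo.json', {'name' => 'new'}).inspect
puts store.write_unless_exists('new.json', {'name' => 'new'}).inspect
```

Returning a boolean lets the caller combine the existence check and the write in one call, which is how `dump_scraped_object` uses it in this diff.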
data/spec/processor/document_store_spec.rb CHANGED
@@ -8,7 +8,7 @@ describe Pupa::Processor::DocumentStore do
  end

  it 'should use Redis' do
- Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost').and_call_original
+ Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost', {}).and_call_original
  Pupa::Processor::DocumentStore.new('redis://localhost')
  end
  end
data/spec/processor_spec.rb CHANGED
@@ -15,7 +15,7 @@ describe Pupa::Processor do
  end

  def scrape_people
- Fiber.yield(person)
+ dispatch(person)
  end
  end

data/spec/spec_helper.rb CHANGED
@@ -4,6 +4,7 @@ require 'coveralls'
  Coveralls.wear!

  require 'multi_xml'
+ require 'nokogiri'
  require 'redis-store'
  require 'rspec'
  require 'vcr'
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pupa
  version: !ruby/object:Gem::Version
- version: 0.0.8
+ version: 0.0.9
  platform: ruby
  authors:
  - Open North
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-09-27 00:00:00.000000000 Z
+ date: 2013-09-30 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: activesupport
@@ -94,20 +94,6 @@ dependencies:
  - - ~>
  - !ruby/object:Gem::Version
  version: 1.5.1
- - !ruby/object:Gem::Dependency
- name: nokogiri
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - ~>
- - !ruby/object:Gem::Version
- version: 1.6.0
- type: :runtime
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - ~>
- - !ruby/object:Gem::Version
- version: 1.6.0
  - !ruby/object:Gem::Dependency
  name: coveralls
  requirement: !ruby/object:Gem::Requirement
@@ -164,6 +150,20 @@ dependencies:
  - - '>='
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: nokogiri
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ~>
+ - !ruby/object:Gem::Version
+ version: 1.6.0
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ~>
+ - !ruby/object:Gem::Version
+ version: 1.6.0
  - !ruby/object:Gem::Dependency
  name: octokit
  requirement: !ruby/object:Gem::Requirement
@@ -220,6 +220,20 @@ dependencies:
  - - ~>
  - !ruby/object:Gem::Version
  version: '2.10'
+ - !ruby/object:Gem::Dependency
+ name: typhoeus
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - '>='
+ - !ruby/object:Gem::Version
+ version: '0'
  - !ruby/object:Gem::Dependency
  name: vcr
  requirement: !ruby/object:Gem::Requirement
@@ -252,7 +266,6 @@ files:
  - lib/pupa.rb
  - lib/pupa/errors.rb
  - lib/pupa/logger.rb
- - lib/pupa/models/base.rb
  - lib/pupa/models/concerns/contactable.rb
  - lib/pupa/models/concerns/identifiable.rb
  - lib/pupa/models/concerns/linkable.rb
@@ -262,6 +275,7 @@ files:
  - lib/pupa/models/contact_detail_list.rb
  - lib/pupa/models/identifier_list.rb
  - lib/pupa/models/membership.rb
+ - lib/pupa/models/model.rb
  - lib/pupa/models/organization.rb
  - lib/pupa/models/person.rb
  - lib/pupa/models/post.rb
@@ -274,6 +288,8 @@ files:
  - lib/pupa/processor/helper.rb
  - lib/pupa/processor/middleware/logger.rb
  - lib/pupa/processor/middleware/parse_html.rb
+ - lib/pupa/processor/middleware/parse_json.rb
+ - lib/pupa/processor/middleware/raise_error.rb
  - lib/pupa/processor/persistence.rb
  - lib/pupa/processor/yielder.rb
  - lib/pupa/refinements/faraday_middleware.rb