pupa 0.0.8 → 0.0.9
- checksums.yaml +4 -4
- data/README.md +93 -9
- data/lib/pupa/models/concerns/contactable.rb +1 -0
- data/lib/pupa/models/concerns/identifiable.rb +1 -0
- data/lib/pupa/models/concerns/linkable.rb +1 -0
- data/lib/pupa/models/concerns/nameable.rb +1 -0
- data/lib/pupa/models/concerns/sourceable.rb +1 -0
- data/lib/pupa/models/concerns/timestamps.rb +1 -0
- data/lib/pupa/models/membership.rb +5 -1
- data/lib/pupa/models/{base.rb → model.rb} +35 -45
- data/lib/pupa/models/organization.rb +5 -1
- data/lib/pupa/models/person.rb +6 -1
- data/lib/pupa/models/post.rb +4 -1
- data/lib/pupa/processor/client.rb +18 -7
- data/lib/pupa/processor/document_store/file_store.rb +28 -2
- data/lib/pupa/processor/document_store/redis_store.rb +43 -10
- data/lib/pupa/processor/document_store.rb +5 -2
- data/lib/pupa/processor/middleware/parse_html.rb +2 -2
- data/lib/pupa/processor/middleware/parse_json.rb +16 -0
- data/lib/pupa/processor/middleware/raise_error.rb +33 -0
- data/lib/pupa/processor/persistence.rb +4 -4
- data/lib/pupa/processor.rb +21 -15
- data/lib/pupa/runner.rb +7 -2
- data/lib/pupa/version.rb +1 -1
- data/lib/pupa.rb +3 -1
- data/pupa.gemspec +2 -1
- data/spec/models/base_spec.rb +19 -23
- data/spec/models/concerns/contactable_spec.rb +2 -1
- data/spec/models/concerns/identifiable_spec.rb +2 -1
- data/spec/models/concerns/linkable_spec.rb +2 -1
- data/spec/models/concerns/nameable_spec.rb +2 -1
- data/spec/models/concerns/sourceable_spec.rb +2 -1
- data/spec/models/concerns/timestamps_spec.rb +2 -1
- data/spec/processor/document_store/file_store_spec.rb +32 -0
- data/spec/processor/document_store/redis_store_spec.rb +33 -0
- data/spec/processor/document_store_spec.rb +1 -1
- data/spec/processor_spec.rb +1 -1
- data/spec/spec_helper.rb +1 -0
- metadata +33 -17
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 8124ac65b9df49b337205ce22fe6491ead5a03ec
+data.tar.gz: e28195cea41f576dea3fa58def6ccfa403d2cf8c
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: acce9361e6ec70f4daf26bbc205724b25dddb7da5437f4fcbca6b7c5fb2df2a56f22e2f6c1cdcd63fac49c5f61c9ed052336223021c40dab522de0eecdaae562
+data.tar.gz: 606a7695c7a0d722c43e6a574397dfd1870452f972d2d2a7147e67424e750645fdf2c9b5c18edb67dce05d03df1cfa3bc5535c8cda07ae9063a8795ddf4708cb
data/README.md
CHANGED
@@ -45,36 +45,120 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm

 1. You may want more control over the method used to perform a scraping task. For example, a legislature may publish legislators before 1997 in one format and legislators after 1997 in another format. In this case, you may want to select the method used to scrape legislators according to the year. See [legislator.rb](http://opennorth.github.io/pupa-ruby/docs/legislator.html).

+### Automatic response parsing
+
+JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
+
 ## Performance

 Pupa.rb offers several ways to significantly improve performance.

-In an example case, reducing
+In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.

-The `import` action's performance
+The `import` action's performance is currently limited by MongoDB when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.

-###
+### Reducing HTTP requests

 HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:

-    ruby cat.rb --cache_dir my_cache_dir
+    ruby cat.rb --cache_dir /tmp/my_cache_dir
+
+### Parallelizing HTTP requests
+
+To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:
+
+```ruby
+require 'pupa'
+require 'typhoeus'
+require 'typhoeus/adapters/faraday'
+```
+
+Then, in your scraping methods, write code like:
+
+```ruby
+responses = []
+
+# Change the maximum number of concurrent requests (default 200). You usually
+# need to tweak this number by trial and error.
+# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
+manager = Typhoeus::Hydra.new(max_concurrency: 20)
+
+begin
+  # Send HTTP requests in parallel.
+  client.in_parallel(manager) do
+    responses << client.get('http://example.com/foo')
+    responses << client.get('http://example.com/bar')
+    # More requests...
+  end
+rescue Faraday::Error::ClientError => e
+  # Log an error message if, for example, you exceed a server's maximum number
+  # of concurrent connections or if you exceed an API's rate limit.
+  error(e.response.inspect)
+end
+
+# Responses are now available for use.
+responses.each do |response|
+  # Only process the finished responses.
+  if response.success?
+    # If success...
+  elsif response.finished?
+    # If error...
+  end
+end
+```
+
+### Reducing disk I/O
+
+After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.
+
+#### RAM file systems
+
+A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:

-
+    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
+    diskutil erasevolume HFS+ 'ramdisk' $ramdisk

-
+You can then set the `output_dir` and `cache_dir` on OS X as:
+
+    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache
+
+Once you are done with the RAM disk, release the memory:
+
+    diskutil unmount $ramdisk
+    hdiutil detach $ramdisk
+
+Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.
+
+#### Memcached
+
+You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:

     ruby cat.rb --cache_dir memcached://localhost:11211

-
+The data in Memcached will be lost between reboots.
+
+#### Redis
+
+You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:

     ruby cat.rb --output_dir redis://localhost:6379/0

-
+To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):
+
+    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
+
+Requiring the `hiredis` gem will slightly improve performance.
+
+Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.

 ### Skipping validation

 The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.

+### Parsing JSON
+
+If the rest of your scraper is fast, you may see an improvement by using the `oj` gem. Just `require 'oj'` and Pupa.rb will automatically pick it up, since it uses [MultiJson](https://github.com/intridea/multi_json).
+
 ### Profiling

 You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
@@ -85,7 +169,7 @@ Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `scrip

     CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb

-You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to
+You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.

 [perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straight-forward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):

@@ -1,6 +1,8 @@
|
|
1
1
|
module Pupa
|
2
2
|
# A relationship between a person and an organization.
|
3
|
-
class Membership
|
3
|
+
class Membership
|
4
|
+
include Model
|
5
|
+
|
4
6
|
self.schema = 'popolo/membership'
|
5
7
|
|
6
8
|
include Concerns::Timestamps
|
@@ -10,6 +12,8 @@ module Pupa
|
|
10
12
|
|
11
13
|
attr_accessor :label, :role, :person_id, :organization_id, :post_id,
|
12
14
|
:start_date, :end_date
|
15
|
+
dump :label, :role, :person_id, :organization_id, :post_id,
|
16
|
+
:start_date, :end_date
|
13
17
|
|
14
18
|
foreign_key :person_id, :organization_id, :post_id
|
15
19
|
|
@@ -3,9 +3,6 @@ require 'securerandom'
|
|
3
3
|
require 'set'
|
4
4
|
|
5
5
|
require 'active_support/callbacks'
|
6
|
-
require 'active_support/core_ext/hash/except'
|
7
|
-
require 'active_support/core_ext/hash/keys'
|
8
|
-
require 'active_support/core_ext/hash/slice'
|
9
6
|
require 'active_support/core_ext/object/try'
|
10
7
|
require 'json-schema'
|
11
8
|
|
@@ -14,43 +11,36 @@ require 'pupa/refinements/json-schema'
|
|
14
11
|
JSON::Validator.cache_schemas = true
|
15
12
|
|
16
13
|
module Pupa
|
17
|
-
#
|
18
|
-
|
19
|
-
|
20
|
-
define_callbacks :create, :save
|
21
|
-
|
22
|
-
class_attribute :json_schema
|
23
|
-
class_attribute :properties
|
24
|
-
class_attribute :foreign_keys
|
25
|
-
class_attribute :foreign_objects
|
26
|
-
|
27
|
-
self.properties = Set.new
|
28
|
-
self.foreign_keys = Set.new
|
29
|
-
self.foreign_objects = Set.new
|
30
|
-
|
31
|
-
class << self
|
32
|
-
# Declare the class' properties.
|
33
|
-
#
|
34
|
-
# When converting an object to a hash using the `to_h` method, only the
|
35
|
-
# properties declared with `attr_accessor` or `attr_reader` will be
|
36
|
-
# included in the hash.
|
37
|
-
#
|
38
|
-
# @param [Array<Symbol>] the class' properties
|
39
|
-
def attr_accessor(*attributes)
|
40
|
-
self.properties += attributes # use assignment to not overwrite the parent's attribute
|
41
|
-
super
|
42
|
-
end
|
14
|
+
# Adds methods expected by Pupa processors.
|
15
|
+
module Model
|
16
|
+
extend ActiveSupport::Concern
|
43
17
|
|
44
|
-
|
45
|
-
|
46
|
-
|
47
|
-
|
48
|
-
|
18
|
+
included do
|
19
|
+
include ActiveSupport::Callbacks
|
20
|
+
define_callbacks :create, :save
|
21
|
+
|
22
|
+
class_attribute :json_schema
|
23
|
+
class_attribute :properties
|
24
|
+
class_attribute :foreign_keys
|
25
|
+
class_attribute :foreign_objects
|
26
|
+
|
27
|
+
self.properties = Set.new
|
28
|
+
self.foreign_keys = Set.new
|
29
|
+
self.foreign_objects = Set.new
|
30
|
+
|
31
|
+
attr_reader :_id
|
32
|
+
attr_accessor :_type, :extras
|
33
|
+
|
34
|
+
dump :_id, :_type, :extras
|
35
|
+
end
|
36
|
+
|
37
|
+
module ClassMethods
|
38
|
+
# Declare which properties should be dumped to JSON after a scraping task
|
39
|
+
# is complete. A subset of these properties will be imported to MongoDB.
|
49
40
|
#
|
50
|
-
# @param [Array<Symbol>] the
|
51
|
-
def
|
41
|
+
# @param [Array<Symbol>] the properties to dump to JSON
|
42
|
+
def dump(*attributes)
|
52
43
|
self.properties += attributes # use assignment to not overwrite the parent's attribute
|
53
|
-
super
|
54
44
|
end
|
55
45
|
|
56
46
|
# Declare the class' foreign keys.
|
@@ -91,8 +81,6 @@ module Pupa
|
|
91
81
|
end
|
92
82
|
end
|
93
83
|
|
94
|
-
attr_accessor :_id, :_type, :extras
|
95
|
-
|
96
84
|
# @param [Hash] properties the object's properties
|
97
85
|
def initialize(properties = {})
|
98
86
|
@_type = self.class.to_s.underscore
|
@@ -149,14 +137,14 @@ module Pupa
|
|
149
137
|
#
|
150
138
|
# @return [Hash] a subset of the object's properties
|
151
139
|
def fingerprint
|
152
|
-
to_h.except(:_id)
|
140
|
+
to_h(persist: true).except(:_id)
|
153
141
|
end
|
154
142
|
|
155
143
|
# Returns the object's foreign keys and foreign objects.
|
156
144
|
#
|
157
145
|
# @return [Hash] the object's foreign keys and foreign objects
|
158
146
|
def foreign_properties
|
159
|
-
to_h
|
147
|
+
to_h.slice(*foreign_keys + foreign_objects)
|
160
148
|
end
|
161
149
|
|
162
150
|
# Validates the object against the schema.
|
@@ -165,17 +153,19 @@ module Pupa
     def validate!
       if self.class.json_schema
         # JSON::Validator#initialize_schema runs fastest if given a hash.
-        JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h))
+        JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h(persist: true)))
       end
     end

     # Returns the object as a hash.
     #
-    # @param [Boolean]
+    # @param [Boolean] persist whether the object is being persisted, validated
+    # or used as a MongoDB selecto, in which case foreign objects (i.e. hints)
+    # are excluded
     # @return [Hash] the object as a hash
-    def to_h(
+    def to_h(persist: false)
       {}.tap do |hash|
-        (
+        (persist ? properties - foreign_objects : properties).each do |property|
           value = self[property]
           if value == false || value.present?
             hash[property] = value
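The effect of the new `persist:` keyword can be sketched in plain Ruby. This is a minimal illustration and not Pupa's implementation: `SketchOrganization`, its constants and the `blank_value?` helper are hypothetical, and the blank check only approximates ActiveSupport's `present?`.

```ruby
require 'set'

# Hypothetical stand-in for a class that includes Pupa::Model.
class SketchOrganization
  PROPERTIES = Set.new([:name, :parent_id, :parent])
  FOREIGN_OBJECTS = Set.new([:parent])

  def initialize(values)
    @values = values
  end

  # With `persist: true`, foreign objects (hints) are left out of the hash,
  # mirroring the Set subtraction in the hunk above.
  def to_h(persist: false)
    keys = persist ? PROPERTIES - FOREIGN_OBJECTS : PROPERTIES
    keys.each_with_object({}) do |property, hash|
      value = @values[property]
      hash[property] = value if value == false || !blank_value?(value)
    end
  end

  private

  # Rough plain-Ruby approximation of ActiveSupport's `blank?` (an assumption).
  def blank_value?(value)
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

org = SketchOrganization.new(name: 'Council', parent: {name: 'City'})
dumped = org.to_h                   # includes the :parent hint
persisted = org.to_h(persist: true) # omits the :parent hint
```

In this sketch the dumped hash keeps the `parent` hint while the persisted hash drops it, which matches how the diff uses `to_h(persist: true)` for validation, fingerprinting and saving.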
@@ -1,7 +1,9 @@
|
|
1
1
|
module Pupa
|
2
2
|
# A group with a common purpose or reason for existence that goes beyond the set
|
3
3
|
# of people belonging to it.
|
4
|
-
class Organization
|
4
|
+
class Organization
|
5
|
+
include Model
|
6
|
+
|
5
7
|
self.schema = 'popolo/organization'
|
6
8
|
|
7
9
|
include Concerns::Timestamps
|
@@ -13,6 +15,8 @@ module Pupa
|
|
13
15
|
|
14
16
|
attr_accessor :name, :classification, :parent_id, :parent, :founding_date,
|
15
17
|
:dissolution_date, :image
|
18
|
+
dump :name, :classification, :parent_id, :parent, :founding_date,
|
19
|
+
:dissolution_date, :image
|
16
20
|
|
17
21
|
foreign_key :parent_id
|
18
22
|
|
data/lib/pupa/models/person.rb
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
module Pupa
|
2
2
|
# A real person, alive or dead.
|
3
|
-
class Person
|
3
|
+
class Person
|
4
|
+
include Model
|
5
|
+
|
4
6
|
self.schema = 'popolo/person'
|
5
7
|
|
6
8
|
include Concerns::Timestamps
|
@@ -13,6 +15,9 @@ module Pupa
|
|
13
15
|
attr_accessor :name, :family_name, :given_name, :additional_name,
|
14
16
|
:honorific_prefix, :honorific_suffix, :patronymic_name, :sort_name,
|
15
17
|
:email, :gender, :birth_date, :death_date, :image, :summary, :biography
|
18
|
+
dump :name, :family_name, :given_name, :additional_name,
|
19
|
+
:honorific_prefix, :honorific_suffix, :patronymic_name, :sort_name,
|
20
|
+
:email, :gender, :birth_date, :death_date, :image, :summary, :biography
|
16
21
|
|
17
22
|
# Returns the person's name.
|
18
23
|
#
|
data/lib/pupa/models/post.rb
CHANGED
@@ -1,6 +1,8 @@
|
|
1
1
|
module Pupa
|
2
2
|
# A position that exists independent of the person holding it.
|
3
|
-
class Post
|
3
|
+
class Post
|
4
|
+
include Model
|
5
|
+
|
4
6
|
self.schema = 'popolo/post'
|
5
7
|
|
6
8
|
include Concerns::Timestamps
|
@@ -9,6 +11,7 @@ module Pupa
|
|
9
11
|
include Concerns::Linkable
|
10
12
|
|
11
13
|
attr_accessor :label, :role, :organization_id, :start_date, :end_date
|
14
|
+
dump :label, :role, :organization_id, :start_date, :end_date
|
12
15
|
|
13
16
|
foreign_key :organization_id
|
14
17
|
|
@@ -4,6 +4,8 @@ require 'faraday_middleware/response_middleware'
|
|
4
4
|
|
5
5
|
require 'pupa/processor/middleware/logger'
|
6
6
|
require 'pupa/processor/middleware/parse_html'
|
7
|
+
require 'pupa/processor/middleware/parse_json'
|
8
|
+
require 'pupa/processor/middleware/raise_error'
|
7
9
|
require 'pupa/refinements/faraday_middleware'
|
8
10
|
|
9
11
|
begin
|
@@ -18,7 +20,9 @@ module Pupa
|
|
18
20
|
class Client
|
19
21
|
# Returns a configured Faraday HTTP client.
|
20
22
|
#
|
21
|
-
#
|
23
|
+
# To automatically parse XML responses, you must `require 'multi_xml'`.
|
24
|
+
#
|
25
|
+
# Memcached support depends on the `dalli` gem.
|
22
26
|
#
|
23
27
|
# @param [String] cache_dir a directory or a Memcached address
|
24
28
|
# (e.g. `memcached://localhost:11211`) in which to cache requests
|
@@ -29,16 +33,19 @@ module Pupa
|
|
29
33
|
Faraday.new do |connection|
|
30
34
|
connection.request :url_encoded
|
31
35
|
connection.use Middleware::Logger, Logger.new('faraday', level: level)
|
36
|
+
connection.use Middleware::RaiseError # useful for breaking concurrent requests
|
37
|
+
|
38
|
+
# @see http://tools.ietf.org/html/rfc4627
|
39
|
+
connection.use Middleware::ParseJson, content_type: /\bjson$/
|
32
40
|
|
33
41
|
# @see http://tools.ietf.org/html/rfc2854
|
34
42
|
# @see http://tools.ietf.org/html/rfc3236
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
connection.use FaradayMiddleware::ParseJson, content_type: /\bjson$/
|
43
|
+
if defined?(Nokogiri)
|
44
|
+
connection.use Middleware::ParseHtml, content_type: %w(text/html application/xhtml+xml)
|
45
|
+
end
|
39
46
|
|
47
|
+
# @see http://tools.ietf.org/html/rfc3023
|
40
48
|
if defined?(MultiXml)
|
41
|
-
# @see http://tools.ietf.org/html/rfc3023
|
42
49
|
connection.use FaradayMiddleware::ParseXml, content_type: /\bxml$/
|
43
50
|
end
|
44
51
|
|
@@ -53,7 +60,11 @@ module Pupa
|
|
53
60
|
end
|
54
61
|
end
|
55
62
|
|
56
|
-
|
63
|
+
if defined?(Typhoeus)
|
64
|
+
connection.adapter :typhoeus
|
65
|
+
else
|
66
|
+
connection.adapter Faraday.default_adapter # must be last
|
67
|
+
end
|
57
68
|
end
|
58
69
|
end
|
59
70
|
end
|
@@ -34,7 +34,7 @@ module Pupa
|
|
34
34
|
# @return [Hash] the value of the given key
|
35
35
|
def read(name)
|
36
36
|
File.open(namespaced_key(name)) do |f|
|
37
|
-
|
37
|
+
MultiJson.load(f)
|
38
38
|
end
|
39
39
|
end
|
40
40
|
|
@@ -54,7 +54,28 @@ module Pupa
         # @param [Hash] value a value
         def write(name, value)
           File.open(namespaced_key(name), 'w') do |f|
-
+            f.write(MultiJson.dump(value))
+          end
+        end
+
+        # Writes, as JSON, the value to a file with the given name, unless such
+        # a file exists.
+        #
+        # @param [String] name a key
+        # @param [Hash] value a value
+        # @return [Boolean] whether the key was set
+        def write_unless_exists(name, value)
+          !exist?(name).tap do |exists|
+            write(name, value) unless exists
+          end
+        end
+
+        # Writes, as JSON, the values to files with the given names.
+        #
+        # @param [Hash] pairs key-value pairs
+        def write_multi(pairs)
+          pairs.each do |name,value|
+            write(name, value)
           end
         end

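The contract of `write_unless_exists` (write only when the key is absent, and report whether a write happened) can be sketched with an in-memory store. `SketchStore` is a hypothetical stand-in that keeps documents in a Hash instead of files; it is not part of Pupa.

```ruby
# Hypothetical in-memory stand-in for the file-backed document store.
class SketchStore
  def initialize
    @documents = {}
  end

  def exist?(name)
    @documents.key?(name)
  end

  def read(name)
    @documents[name]
  end

  def write(name, value)
    @documents[name] = value
  end

  # Returns true if the key was newly written, false if it already existed.
  def write_unless_exists(name, value)
    existed = exist?(name)
    write(name, value) unless existed
    !existed
  end
end

store = SketchStore.new
first = store.write_unless_exists('person_1.json', {name: 'A'})
second = store.write_unless_exists('person_1.json', {name: 'B'})
```

Note that a check followed by a write, as in this sketch, is not atomic, whereas the Redis store in this release delegates to `SETNX`, which is.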
@@ -72,6 +93,11 @@ module Pupa
|
|
72
93
|
end
|
73
94
|
end
|
74
95
|
|
96
|
+
# Collects commands to run all at once.
|
97
|
+
def pipelined
|
98
|
+
yield
|
99
|
+
end
|
100
|
+
|
75
101
|
private
|
76
102
|
|
77
103
|
def namespaced_key(name)
|
@@ -8,16 +8,17 @@ module Pupa
|
|
8
8
|
# can select a different database than the default `0` for use with Pupa
|
9
9
|
# by passing an argument like `redis://localhost:6379/0`.
|
10
10
|
#
|
11
|
-
# @note Redis support depends on the `redis` gem.
|
12
|
-
# use the `hiredis` gem
|
11
|
+
# @note Redis support depends on the `redis-store` gem. You may optionally
|
12
|
+
# use the `hiredis` gem to further improve performance.
|
13
13
|
class RedisStore
|
14
14
|
# @param [String] address the address (e.g. `redis://localhost:6379/0`)
|
15
15
|
# in which to dump JSON documents
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
16
|
+
# @param [Boolean] pipelined whether to enable
|
17
|
+
# [pipelining](http://redis.io/topics/pipelining)
|
18
|
+
def initialize(address, pipelined: false)
|
19
|
+
@pipelined = pipelined
|
20
|
+
options = {marshalling: false}
|
21
|
+
options.update(driver: :hiredis) if defined?(Hiredis)
|
21
22
|
@redis = Redis::Store::Factory.create(address, options)
|
22
23
|
end
|
23
24
|
|
@@ -41,7 +42,7 @@ module Pupa
|
|
41
42
|
# @param [String] name a key
|
42
43
|
# @return [Hash] the value of the given key
|
43
44
|
def read(name)
|
44
|
-
|
45
|
+
MultiJson.load(@redis.get(name))
|
45
46
|
end
|
46
47
|
|
47
48
|
# Returns, as JSON, the values of the given keys.
|
@@ -49,7 +50,7 @@ module Pupa
|
|
49
50
|
# @param [String] names keys
|
50
51
|
# @return [Array<Hash>] the values of the given keys
|
51
52
|
def read_multi(names)
|
52
|
-
@redis.mget(*names).map{|value|
|
53
|
+
@redis.mget(*names).map{|value| MultiJson.load(value)}
|
53
54
|
end
|
54
55
|
|
55
56
|
# Writes, as JSON, the value to a key.
|
@@ -57,7 +58,28 @@ module Pupa
         # @param [String] name a key
         # @param [Hash] value a value
         def write(name, value)
-          @redis.set(name,
+          @redis.set(name, MultiJson.dump(value))
+        end
+
+        # Writes, as JSON, the value to a key, unless the key exists.
+        #
+        # @param [String] name a key
+        # @param [Hash] value a value
+        # @return [Boolean] whether the key was set
+        def write_unless_exists(name, value)
+          @redis.setnx(name, MultiJson.dump(value))
+        end
+
+        # Writes, as JSON, the values to keys.
+        #
+        # @param [Hash] pairs key-value pairs
+        def write_multi(pairs)
+          args = []
+          pairs.each do |key,value|
+            args << key
+            args << MultiJson.dump(value)
+          end
+          @redis.mset(*args)
         end

         # Delete a key.
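`write_multi` flattens the key-value pairs into the alternating `key, value, key, value` argument list that `MSET` expects. A minimal sketch of that flattening follows; `mset_arguments` is a hypothetical helper, and the standard library's `JSON.generate` stands in for `MultiJson.dump` so the example has no gem dependencies.

```ruby
require 'json'

# Hypothetical helper: builds the flat argument list passed to MSET.
# JSON.generate is used here in place of MultiJson.dump (an assumption made
# to keep the sketch dependency-free).
def mset_arguments(pairs)
  pairs.flat_map { |key, value| [key, JSON.generate(value)] }
end

arguments = mset_arguments({'a.json' => {'x' => 1}, 'b.json' => {'y' => 2}})
```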
@@ -71,6 +93,17 @@ module Pupa
|
|
71
93
|
def clear
|
72
94
|
@redis.flushdb
|
73
95
|
end
|
96
|
+
|
97
|
+
# Collects commands to run all at once.
|
98
|
+
def pipelined
|
99
|
+
if @pipelined
|
100
|
+
@redis.pipelined do
|
101
|
+
yield
|
102
|
+
end
|
103
|
+
else
|
104
|
+
yield
|
105
|
+
end
|
106
|
+
end
|
74
107
|
end
|
75
108
|
end
|
76
109
|
end
|
@@ -6,12 +6,15 @@ module Pupa
|
|
6
6
|
class DocumentStore
|
7
7
|
# Returns a configured JSON document store.
|
8
8
|
#
|
9
|
+
# See each document store for more information.
|
10
|
+
#
|
9
11
|
# @param [String] argument the filesystem directory or Redis address
|
10
12
|
# (e.g. `redis://localhost:6379/0`) in which to dump JSON documents
|
13
|
+
# @param [Hash] options optional arguments
|
11
14
|
# @return a configured JSON document store
|
12
|
-
def self.new(argument)
|
15
|
+
def self.new(argument, **options)
|
13
16
|
if argument[%r{\Aredis://}]
|
14
|
-
RedisStore.new(argument)
|
17
|
+
RedisStore.new(argument, options)
|
15
18
|
else
|
16
19
|
FileStore.new(argument)
|
17
20
|
end
|
@@ -0,0 +1,16 @@
+module Pupa
+  class Processor
+    module Middleware
+      # A Faraday response middleware for parsing JSON.
+      #
+      # @see https://github.com/lostisland/faraday_middleware/issues/30#issuecomment-4706892
+      class ParseJson < FaradayMiddleware::ResponseMiddleware
+        dependency 'multi_json'
+
+        define_parser do |body|
+          MultiJson.load(body) unless body.strip.empty?
+        end
+      end
+    end
+  end
+end
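The parser block above skips blank bodies instead of handing them to the JSON parser. A dependency-free sketch of that guard, using the standard library's `JSON.parse` in place of `MultiJson.load` (an assumption) and a hypothetical `parse_body_sketch` helper:

```ruby
require 'json'

# Hypothetical helper: returns the parsed document, or nil for a blank body.
def parse_body_sketch(body)
  JSON.parse(body) unless body.strip.empty?
end

parsed = parse_body_sketch('{"name": "Council"}')
blank = parse_body_sketch("  \n")
```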
@@ -0,0 +1,33 @@
+module Pupa
+  class Processor
+    module Middleware
+      # A Faraday response middleware for raising an error if unsuccessful.
+      #
+      # @see Faraday::Response::RaiseError
+      class RaiseError < Faraday::Response::Middleware
+        def on_complete(env)
+          case env[:status]
+          when 404
+            raise Faraday::Error::ResourceNotFound, response_values(env)
+          when 407
+            # mimic the behavior that we get with proxy requests with HTTPS
+            raise Faraday::Error::ConnectionFailed, %{407 "Proxy Authentication Required "}
+          when 400...600
+            raise Faraday::Error::ClientError, response_values(env)
+          end
+        end
+
+        def response_values(env) # XXX add more keys
+          {
+            method: env[:method],
+            url: env[:url].to_s,
+            request_headers: env[:request_headers],
+            status: env[:status],
+            response_headers: env[:response_headers],
+            body: env[:body].to_s,
+          }
+        end
+      end
+    end
+  end
+end
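The order of the `when` branches matters: 404 and 407 are matched before the catch-all `400...600` range, and the three-dot range excludes 600. A small sketch of the same dispatch without Faraday; `error_name_for` is a hypothetical helper that returns the name of the error the middleware would raise.

```ruby
# Hypothetical helper mirroring the `case` in `on_complete`.
# Returns nil for statuses that do not raise.
def error_name_for(status)
  case status
  when 404
    'ResourceNotFound'
  when 407
    'ConnectionFailed'
  when 400...600 # exclusive upper bound: 400 through 599
    'ClientError'
  end
end

names = [200, 404, 407, 429, 500, 600].map { |status| error_name_for(status) }
```

In this sketch both 4xx and 5xx statuses map to `ClientError`, as they do in the middleware, so a `rescue Faraday::Error::ClientError` around parallel requests would also be reached by server errors.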
@@ -24,7 +24,7 @@ module Pupa
|
|
24
24
|
when 1
|
25
25
|
query.first
|
26
26
|
else
|
27
|
-
raise Errors::TooManyMatches, "selector matches multiple documents during find: #{collection_name} #{
|
27
|
+
raise Errors::TooManyMatches, "selector matches multiple documents during find: #{collection_name} #{MultiJson.dump(selector)}"
|
28
28
|
end
|
29
29
|
end
|
30
30
|
|
@@ -41,14 +41,14 @@ module Pupa
|
|
41
41
|
case query.count
|
42
42
|
when 0
|
43
43
|
@object.run_callbacks(:create) do
|
44
|
-
collection.insert(@object.to_h)
|
44
|
+
collection.insert(@object.to_h(persist: true))
|
45
45
|
@object._id.to_s
|
46
46
|
end
|
47
47
|
when 1
|
48
|
-
query.update(@object.to_h)
|
48
|
+
query.update(@object.to_h(persist: true))
|
49
49
|
query.first['_id'].to_s
|
50
50
|
else
|
51
|
-
raise Errors::TooManyMatches, "selector matches multiple documents during save: #{collection_name} #{
|
51
|
+
raise Errors::TooManyMatches, "selector matches multiple documents during save: #{collection_name} #{MultiJson.dump(selector)}"
|
52
52
|
end
|
53
53
|
end
|
54
54
|
end
|
data/lib/pupa/processor.rb
CHANGED
@@ -1,7 +1,3 @@
|
|
1
|
-
require 'json'
|
2
|
-
|
3
|
-
require 'nokogiri'
|
4
|
-
|
5
1
|
require 'pupa/processor/client'
|
6
2
|
require 'pupa/processor/dependency_graph'
|
7
3
|
require 'pupa/processor/helper'
|
@@ -30,12 +26,13 @@ module Pupa
|
|
30
26
|
# @param [String] cache_dir the directory or Memcached address
|
31
27
|
# (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
|
32
28
|
# @param [Integer] expires_in the cache's expiration time in seconds
|
29
|
+
# @param [Boolean] pipelined whether to dump JSON documents all at once
|
33
30
|
# @param [Boolean] validate whether to validate JSON documents
|
34
31
|
# @param [String] level the log level
|
35
32
|
# @param [String,IO] logdev the log device
|
36
33
|
# @param [Hash] options criteria for selecting the methods to run
|
37
|
-
def initialize(output_dir, cache_dir: nil, expires_in: 86400, validate: true, level: 'INFO', logdev: STDOUT, options: {})
|
38
|
-
@store = DocumentStore.new(output_dir)
|
34
|
+
def initialize(output_dir, cache_dir: nil, expires_in: 86400, pipelined: false, validate: true, level: 'INFO', logdev: STDOUT, options: {})
|
35
|
+
@store = DocumentStore.new(output_dir, pipelined: pipelined)
|
39
36
|
@client = Client.new(cache_dir: cache_dir, expires_in: expires_in, level: level)
|
40
37
|
@logger = Logger.new('pupa', level: level, logdev: logdev)
|
41
38
|
@validate = validate
|
@@ -73,6 +70,15 @@ module Pupa
       client.post(url, params).body
     end

+    # Yields the object to the transformation task for processing, e.g. saving
+    # to disk, printing to CSV, etc.
+    #
+    # @param [Object] an object
+    # @note All the good terms are taken by Ruby: `return`, `send` and `yield`.
+    def dispatch(object)
+      Fiber.yield(object)
+    end
+
     # Adds a scraping task to Pupa.rb.
     #
     # Defines a method whose name is identical to `task_name`. This method
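`dispatch` hands an object back to whoever resumed the current fiber. The diff does not show the consuming side, so the following is only a sketch of one way a scraping method and its consumer could cooperate; `scrape_sketch` and `collect_sketch` are hypothetical names.

```ruby
# Same body as the new method: suspend the fiber and pass the object out.
def dispatch(object)
  Fiber.yield(object)
end

# Hypothetical scraping task that dispatches two objects.
def scrape_sketch
  dispatch({name: 'first'})
  dispatch({name: 'second'})
  nil # the block's final value; signals completion to the consumer below
end

# Hypothetical consumer: resumes the fiber until the task finishes.
def collect_sketch
  fiber = Fiber.new { scrape_sketch }
  objects = []
  while (object = fiber.resume)
    objects << object
  end
  objects
end

collected = collect_sketch
```

Each `resume` runs the task up to its next `dispatch`, so objects can be processed one at a time without building the whole list first. This sketch relies on the task never dispatching `nil` or `false`.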
@@ -113,9 +119,11 @@ module Pupa
|
|
113
119
|
# @return [Integer] the number of scraped objects
|
114
120
|
def dump_scraped_objects(task_name)
|
115
121
|
count = 0
|
116
|
-
|
117
|
-
|
118
|
-
|
122
|
+
@store.pipelined do
|
123
|
+
send(task_name).each do |object|
|
124
|
+
count += 1 # we don't know the size of the enumeration
|
125
|
+
dump_scraped_object(object)
|
126
|
+
end
|
119
127
|
end
|
120
128
|
count
|
121
129
|
end
|
@@ -182,7 +190,7 @@ module Pupa
|
|
182
190
|
end
|
183
191
|
|
184
192
|
unless objects.empty?
|
185
|
-
raise Errors::UnprocessableEntity, "couldn't resolve #{objects.size}/#{size} objects:\n #{objects.values.map{|object|
|
193
|
+
raise Errors::UnprocessableEntity, "couldn't resolve #{objects.size}/#{size} objects:\n #{objects.values.map{|object| MultiJson.dump(object.foreign_properties)}.join("\n ")}"
|
186
194
|
end
|
187
195
|
end
|
188
196
|
|
@@ -222,14 +230,12 @@ module Pupa
|
|
222
230
|
type = object.class.to_s.demodulize.underscore
|
223
231
|
name = "#{type}_#{object._id.gsub(File::SEPARATOR, '_')}.json"
|
224
232
|
|
225
|
-
if @store.
|
233
|
+
if @store.write_unless_exists(name, object.to_h)
|
234
|
+
info {"save #{type} #{object.to_s} as #{name}"}
|
235
|
+
else
|
226
236
|
raise Errors::DuplicateObjectIdError, "duplicate object ID: #{object._id} (was the same objected yielded twice?)"
|
227
237
|
end
|
228
238
|
|
229
|
-
info {"save #{type} #{object.to_s} as #{name}"}
|
230
|
-
|
231
|
-
@store.write(name, object.to_h(include_foreign_objects: true))
|
232
|
-
|
233
239
|
if @validate
|
234
240
|
begin
|
235
241
|
object.validate!
|
data/lib/pupa/runner.rb
CHANGED
@@ -18,6 +18,7 @@ module Pupa
|
|
18
18
|
output_dir: File.expand_path('scraped_data', Dir.pwd),
|
19
19
|
cache_dir: File.expand_path('web_cache', Dir.pwd),
|
20
20
|
expires_in: 86400, # 1 day
|
21
|
+
pipelined: false,
|
21
22
|
validate: true,
|
22
23
|
host_with_port: 'localhost:27017',
|
23
24
|
database: 'pupa',
|
@@ -81,6 +82,9 @@ module Pupa
         opts.on('-e', '--expires_in SECONDS', "The cache's expiration time in seconds") do |v|
           options.expires_in = v
         end
+        opts.on('--pipelined', 'Dump JSON documents all at once') do |v|
+          options.pipelined = v
+        end
         opts.on('--[no-]validate', 'Validate JSON documents') do |v|
           options.validate = v
         end
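The new `--pipelined` switch takes no argument, so its block receives `true` when the switch is present, while `--[no-]validate` receives `true` or `false`. A standalone sketch with the standard library's `OptionParser`; `parse_sketch` is hypothetical, and a plain Hash stands in for the runner's options object.

```ruby
require 'optparse'

# Hypothetical parser covering only the two switches shown in the hunk above.
def parse_sketch(argv)
  options = {pipelined: false, validate: true} # defaults as in the runner
  OptionParser.new do |opts|
    opts.on('--pipelined', 'Dump JSON documents all at once') do |v|
      options[:pipelined] = v
    end
    opts.on('--[no-]validate', 'Validate JSON documents') do |v|
      options[:validate] = v
    end
  end.parse!(argv)
  options
end

defaults = parse_sketch([])
fast = parse_sketch(%w[--pipelined --no-validate])
```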
@@ -143,6 +147,7 @@ module Pupa
       processor = @processor_class.new(options.output_dir,
         cache_dir: options.cache_dir,
         expires_in: options.expires_in,
+        pipelined: options.pipelined,
         validate: options.validate,
         level: options.level,
         options: Hash[*rest])
@@ -173,7 +178,7 @@ module Pupa
         report = {
           plan: {
             processor: @processor_class,
-            arguments: options.to_h,
+            arguments: options.dup.to_h,
             options: rest,
           },
           start: Time.now.utc,
@@ -198,7 +203,7 @@ module Pupa

         report[:end] = Time.now.utc
         report[:time] = report[:end] - report[:start]
-        puts
+        puts MultiJson.dump(report)
       end
     end
   end
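The runner hunks add `--pipelined` as a plain switch next to the existing `--[no-]validate` switch. The sketch below shows how the two forms behave in Ruby's standard `OptionParser`. The `parse_flags` helper and the `Options` struct are hypothetical names for illustration; the struct only stands in for the runner's options object, and every other runner option is omitted.

```ruby
require 'optparse'

# Stand-in for the runner's options object; only the two boolean flags are kept.
Options = Struct.new(:pipelined, :validate, keyword_init: true)

# Hypothetical helper: parses argv and returns the resulting options.
def parse_flags(argv)
  # Defaults copied from the first runner.rb hunk.
  options = Options.new(pipelined: false, validate: true)

  parser = OptionParser.new do |opts|
    # A switch without an argument yields true when it is present.
    opts.on('--pipelined', 'Dump JSON documents all at once') do |v|
      options.pipelined = v
    end
    # The [no-] form yields true for --validate and false for --no-validate.
    opts.on('--[no-]validate', 'Validate JSON documents') do |v|
      options.validate = v
    end
  end

  parser.parse(argv)
  options
end

defaults = parse_flags([])
custom = parse_flags(['--pipelined', '--no-validate'])
```

With no arguments the defaults survive untouched; passing both flags flips each value.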
data/lib/pupa/version.rb
CHANGED
data/lib/pupa.rb
CHANGED
@@ -3,6 +3,8 @@ require 'forwardable'

 require 'active_support/concern'
 require 'active_support/core_ext/class/attribute'
+require 'active_support/core_ext/hash/except'
+require 'active_support/core_ext/hash/slice'
 require 'active_support/core_ext/object/blank'
 require 'active_support/inflector'

@@ -18,7 +20,7 @@ require 'pupa/models/concerns/nameable'
 require 'pupa/models/concerns/sourceable'
 require 'pupa/models/concerns/timestamps'

-require 'pupa/models/
+require 'pupa/models/model'
 require 'pupa/models/contact_detail_list'
 require 'pupa/models/identifier_list'
 require 'pupa/models/membership'
data/pupa.gemspec
CHANGED
@@ -22,15 +22,16 @@ Gem::Specification.new do |s|
   s.add_runtime_dependency('json-schema', '~> 2.1.3')
   s.add_runtime_dependency('mail')
   s.add_runtime_dependency('moped', '~> 1.5.1')
-  s.add_runtime_dependency('nokogiri', '~> 1.6.0')

   s.add_development_dependency('coveralls')
   s.add_development_dependency('dalli')
   s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
   s.add_development_dependency('multi_xml')
+  s.add_development_dependency('nokogiri', '~> 1.6.0')
   s.add_development_dependency('octokit') # to update Popolo schema
   s.add_development_dependency('rake')
   s.add_development_dependency('redis-store')
   s.add_development_dependency('rspec', '~> 2.10')
+  s.add_development_dependency('typhoeus')
   s.add_development_dependency('vcr', '~> 2.5.0')
 end
data/spec/models/base_spec.rb
CHANGED
@@ -1,8 +1,10 @@
 require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')

-describe Pupa::
+describe Pupa::Model do
   module Music
-    class Band
+    class Band
+      include Pupa::Model
+
       self.schema = {
         '$schema' => 'http://json-schema.org/draft-03/schema#',
         'properties' => {
@@ -19,14 +21,10 @@ describe Pupa::Base do
         },
       }

-      attr_accessor :label, :founding_date, :inactive, :label_id, :manager_id, :links
-
+      attr_accessor :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links
+      dump :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links
       foreign_key :label_id, :manager_id
       foreign_object :label
-
-      def name=(name)
-        @name = name
-      end
     end
   end

@@ -38,20 +36,14 @@ describe Pupa::Base do
     Music::Band.new(properties)
   end

-  describe '.
+  describe '.dump' do
     it 'should add properties' do
-      [:_id, :_type, :extras, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
+      [:_id, :_type, :extras, :name, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
         Music::Band.properties.to_a.should include(property)
       end
     end
   end

-  describe '.attr_reader' do
-    it 'should add properties' do
-      Music::Band.properties.to_a.should include(:name)
-    end
-  end
-
   describe '.foreign_key' do
     it 'should add foreign keys' do
       Music::Band.foreign_keys.to_a.should == [:label_id, :manager_id]
@@ -66,13 +58,15 @@ describe Pupa::Base do

   describe '.schema=' do
     let :klass_with_absolute_path do
-      Class.new
+      Class.new do
+        include Pupa::Model
         self.schema = '/path/to/schema.json'
       end
     end

     let :klass_with_relative_path do
-      Class.new
+      Class.new do
+        include Pupa::Model
         self.schema = 'schema'
       end
     end
@@ -178,7 +172,9 @@ describe Pupa::Base do

   describe '#validate!' do
     let :klass_without_schema do
-      Class.new
+      Class.new do
+        include Pupa::Model
+      end
     end

     it 'should do nothing if the schema is not set' do
@@ -196,12 +192,12 @@ describe Pupa::Base do
   end

   describe '#to_h' do
-    it 'should
-      object.to_h.should == {_id: object._id, _type: 'music/band', name: 'Moderat', inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
+    it 'should include all properties by default' do
+      object.to_h.should == {_id: object._id, _type: 'music/band', name: 'Moderat', label: {name: 'Mute'}, inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
     end

-    it 'should
-      object.to_h(
+    it 'should exclude foreign objects if persisting' do
+      object.to_h(persist: true).should == {_id: object._id, _type: 'music/band', name: 'Moderat', inactive: false, manager_id: '1', links: [{url: 'http://moderat.fm/'}]}
     end

     it 'should not include blank properties' do
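The `#to_h` specs above describe two behaviours: by default the hash includes foreign objects such as `label`, and with `persist: true` it leaves them out. Blank properties are skipped, while `false` is kept. The sketch below imitates that behaviour in plain Ruby. The `Band` class, its `PROPERTIES` and `FOREIGN_OBJECTS` constants, and the blank check are all assumptions made for illustration; this is not how `Pupa::Model` is implemented.

```ruby
# Hypothetical stand-in for a class that includes Pupa::Model; it only
# imitates the #to_h behaviour described by the specs.
class Band
  PROPERTIES = [:name, :label, :inactive, :manager_id, :links].freeze
  FOREIGN_OBJECTS = [:label].freeze

  attr_accessor(*PROPERTIES)

  def initialize(properties = {})
    properties.each { |key, value| public_send("#{key}=", value) }
  end

  # Returns the non-blank properties as a hash. With persist: true,
  # foreign objects are left out.
  def to_h(persist: false)
    PROPERTIES.each_with_object({}) do |property, hash|
      next if persist && FOREIGN_OBJECTS.include?(property)
      value = public_send(property)
      # false is a meaningful value, so only nil and empty values are skipped.
      next if value.nil? || (value.respond_to?(:empty?) && value.empty?)
      hash[property] = value
    end
  end
end

band = Band.new(name: 'Moderat', label: { name: 'Mute' }, inactive: false, manager_id: '1')
full = band.to_h
persisted = band.to_h(persist: true)
```

Here `links` is never set, so it is absent from both hashes, and `label` appears only in `full`.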
data/spec/processor/document_store/file_store_spec.rb
CHANGED
@@ -42,6 +42,38 @@ describe Pupa::Processor::DocumentStore::FileStore do
     end
   end

+  describe '#write_unless_exists' do
+    it 'should write an entry with the given value for the given key' do
+      store.exist?('new.json').should == false
+      store.write_unless_exists('new.json', {'name' => 'new'}).should == true
+      store.read('new.json').should == {'name' => 'new'}
+      store.delete('new.json') # cleanup
+    end
+
+    it 'should not write an entry with the given value for the given key if the key exists' do
+      store.write_unless_exists('foo.json', {'name' => 'new'}).should == false
+      store.read('foo.json').should == {'name' => 'foo'}
+    end
+  end
+
+  describe '#write_multi' do
+    it 'should write entries with the given values for the given keys' do
+      pairs = {}
+      %w(new1 new2).each do |name|
+        pairs["#{name}.json"] = {'name' => name}
+      end
+
+      pairs.keys.each do |name|
+        store.exist?(name).should == false
+      end
+      store.write_multi(pairs)
+      store.read_multi(pairs.keys).should == [{'name' => 'new1'}, {'name' => 'new2'}]
+      pairs.keys.each do |name| # cleanup
+        store.delete(name)
+      end
+    end
+  end
+
   describe '#delete' do
     it 'should delete an entry with the given key from the store' do
       store.write('new.json', {'name' => 'new'})
data/spec/processor/document_store/redis_store_spec.rb
CHANGED
@@ -6,6 +6,7 @@ describe Pupa::Processor::DocumentStore::RedisStore do
   end

   before :all do
+    store.clear
     %w(foo bar baz).each do |name|
       store.write("#{name}.json", {'name' => name})
     end
@@ -48,6 +49,38 @@ describe Pupa::Processor::DocumentStore::RedisStore do
     end
   end

+  describe '#write_unless_exists' do
+    it 'should write an entry with the given value for the given key' do
+      store.exist?('new.json').should == false
+      store.write_unless_exists('new.json', {'name' => 'new'}).should == true
+      store.read('new.json').should == {'name' => 'new'}
+      store.delete('new.json') # cleanup
+    end
+
+    it 'should not write an entry with the given value for the given key if the key exists' do
+      store.write_unless_exists('foo.json', {'name' => 'new'}).should == false
+      store.read('foo.json').should == {'name' => 'foo'}
+    end
+  end
+
+  describe '#write_multi' do
+    it 'should write entries with the given values for the given keys' do
+      pairs = {}
+      %w(new1 new2).each do |name|
+        pairs["#{name}.json"] = {'name' => name}
+      end
+
+      pairs.keys.each do |name|
+        store.exist?(name).should == false
+      end
+      store.write_multi(pairs)
+      store.read_multi(pairs.keys).should == [{'name' => 'new1'}, {'name' => 'new2'}]
+      pairs.keys.each do |name| # cleanup
+        store.delete(name)
+      end
+    end
+  end
+
   describe '#delete' do
     it 'should delete an entry with the given key from the store' do
       store.write('new.json', {'name' => 'new'})
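Both store specs exercise the same `write_multi` and `read_multi` contract: a hash of name/value pairs goes in, and the values come back in the order of the requested names. A minimal sketch of that contract follows, assuming a hypothetical `HashStore` backed by a Ruby hash. It is not pupa's `FileStore` or `RedisStore` and says nothing about how those batch their writes.

```ruby
# Hypothetical hash-backed store that follows the interface used in the specs.
class HashStore
  def initialize
    @entries = {}
  end

  def exist?(name)
    @entries.key?(name)
  end

  # Writes every name/value pair in one call.
  def write_multi(pairs)
    pairs.each { |name, value| @entries[name] = value }
  end

  # Returns the stored values in the same order as the given names.
  def read_multi(names)
    names.map { |name| @entries[name] }
  end

  def delete(name)
    @entries.delete(name)
  end
end

store = HashStore.new
pairs = {}
%w(new1 new2).each do |name|
  pairs["#{name}.json"] = { 'name' => name }
end

existed_before = pairs.keys.any? { |name| store.exist?(name) }
store.write_multi(pairs)
values = store.read_multi(pairs.keys)
pairs.keys.each { |name| store.delete(name) } # cleanup, as in the specs
existed_after = pairs.keys.any? { |name| store.exist?(name) }
```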
data/spec/processor/document_store_spec.rb
CHANGED
@@ -8,7 +8,7 @@ describe Pupa::Processor::DocumentStore do
   end

   it 'should use Redis' do
-    Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost').and_call_original
+    Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost', {}).and_call_original
     Pupa::Processor::DocumentStore.new('redis://localhost')
   end
 end
data/spec/processor_spec.rb
CHANGED
data/spec/spec_helper.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.9
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-09-
+date: 2013-09-30 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -94,20 +94,6 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: 1.5.1
-- !ruby/object:Gem::Dependency
-  name: nokogiri
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 1.6.0
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - ~>
-      - !ruby/object:Gem::Version
-        version: 1.6.0
 - !ruby/object:Gem::Dependency
   name: coveralls
   requirement: !ruby/object:Gem::Requirement
@@ -164,6 +150,20 @@ dependencies:
     - - '>='
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: nokogiri
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ~>
+      - !ruby/object:Gem::Version
+        version: 1.6.0
 - !ruby/object:Gem::Dependency
   name: octokit
   requirement: !ruby/object:Gem::Requirement
@@ -220,6 +220,20 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: '2.10'
+- !ruby/object:Gem::Dependency
+  name: typhoeus
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: vcr
   requirement: !ruby/object:Gem::Requirement
@@ -252,7 +266,6 @@ files:
 - lib/pupa.rb
 - lib/pupa/errors.rb
 - lib/pupa/logger.rb
-- lib/pupa/models/base.rb
 - lib/pupa/models/concerns/contactable.rb
 - lib/pupa/models/concerns/identifiable.rb
 - lib/pupa/models/concerns/linkable.rb
@@ -262,6 +275,7 @@ files:
 - lib/pupa/models/contact_detail_list.rb
 - lib/pupa/models/identifier_list.rb
 - lib/pupa/models/membership.rb
+- lib/pupa/models/model.rb
 - lib/pupa/models/organization.rb
 - lib/pupa/models/person.rb
 - lib/pupa/models/post.rb
@@ -274,6 +288,8 @@ files:
 - lib/pupa/processor/helper.rb
 - lib/pupa/processor/middleware/logger.rb
 - lib/pupa/processor/middleware/parse_html.rb
+- lib/pupa/processor/middleware/parse_json.rb
+- lib/pupa/processor/middleware/raise_error.rb
 - lib/pupa/processor/persistence.rb
 - lib/pupa/processor/yielder.rb
 - lib/pupa/refinements/faraday_middleware.rb