pupa 0.1.4 → 0.1.5
- checksums.yaml +4 -4
- data/PERFORMANCE.md +129 -0
- data/README.md +5 -131
- data/lib/pupa/processor/client.rb +5 -2
- data/lib/pupa/processor.rb +4 -2
- data/lib/pupa/runner.rb +23 -13
- data/lib/pupa/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d794649266b975f92ee8ff502a3de21390dc540b
+  data.tar.gz: 59b89a81274d35ee848d944da9f4337295ab8567
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 19582ce0e29e5a9ad52d4dabb216664d418fe738b4bf9b534a51b161eae209616df8bc87138a5f34253dace8c657ae21c0a2684024258dc918a94dfc63476a68
+  data.tar.gz: 3464169a23f255de3e4b357e245135fc571bc11642a1929d448a88fc1382e5418ed96b3214317c30bae90cb535f90ab6bd4a2bd94d1403ce840a1ccbf6cad734
data/PERFORMANCE.md
ADDED
@@ -0,0 +1,129 @@
# Pupa.rb: A Data Scraping Framework

## Performance

Pupa.rb offers several ways to significantly improve performance.

In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.

The `import` action's performance is currently limited by the database when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.

### Reducing HTTP requests

HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:

    ruby cat.rb --cache_dir /tmp/my_cache_dir

### Parallelizing HTTP requests

To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:

```ruby
require 'pupa'
require 'typhoeus'
require 'typhoeus/adapters/faraday'
```

Then, in your scraping methods, write code like:

```ruby
responses = []

# Change the maximum number of concurrent requests (default 200). You usually
# need to tweak this number by trial and error.
# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
manager = Typhoeus::Hydra.new(max_concurrency: 20)

begin
  # Send HTTP requests in parallel.
  client.in_parallel(manager) do
    responses << client.get('http://example.com/foo')
    responses << client.get('http://example.com/bar')
    # More requests...
  end
rescue Faraday::Error::ClientError => e
  # Log an error message if, for example, you exceed a server's maximum number
  # of concurrent connections or if you exceed an API's rate limit.
  error(e.response.inspect)
end

# Responses are now available for use.
responses.each do |response|
  # Only process the finished responses.
  if response.success?
    # If success...
  elsif response.finished?
    # If error...
  end
end
```

### Reducing disk I/O

After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.

#### RAM file systems

A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:

    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
    diskutil erasevolume HFS+ 'ramdisk' $ramdisk

You can then set the `output_dir` and `cache_dir` on OS X as:

    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache

Once you are done with the RAM disk, release the memory:

    diskutil unmount $ramdisk
    hdiutil detach $ramdisk
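On Linux, the equivalent with `tmpfs` is a plain mount; a sketch, with a placeholder mount point and size:

    mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=128m tmpfs /mnt/ramdisk
    ruby cat.rb --output_dir /mnt/ramdisk/scraped_data --cache_dir /mnt/ramdisk/web_cache
    sudo umount /mnt/ramdisk  # release the memory when done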
Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.

#### Memcached

You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:

    ruby cat.rb --cache_dir memcached://localhost:11211

The data in Memcached will be lost between reboots.
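Pupa.rb 0.1.5 adds `--memcached_username` and `--memcached_password` switches (see the `runner.rb` and `client.rb` changes below), so a Memcached server that requires authentication can also be used as the cache; the credentials here are placeholders:

    ruby cat.rb --cache_dir memcached://localhost:11211 --memcached_username USERNAME --memcached_password PASSWORD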
#### Redis

You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:

    ruby cat.rb --output_dir redis://localhost:6379/0

To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):

    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
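Conceptually, `--pipelined` batches the Redis writes instead of paying a network round trip per document. A rough sketch of the idea using the `redis` gem directly (not Pupa.rb's actual implementation; `documents` is a placeholder for the scraped documents):

```ruby
require 'json'
require 'redis'

documents = {} # placeholder: JSON document ID => hash pairs from the scrape
redis = Redis.new(url: 'redis://localhost:6379/0')

# Queue every SET command and send the whole batch in one round trip.
redis.pipelined do
  documents.each do |id, document|
    redis.set(id, JSON.dump(document))
  end
end
```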
Requiring the `hiredis` gem will slightly improve performance.

Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database between Pupa.rb and other applications. You can select a database other than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
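For example, to keep Pupa.rb's output in database `15` rather than `0`:

    ruby cat.rb --output_dir redis://localhost:6379/15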
### Skipping validation

The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
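For example, to scrape without validating:

    ruby cat.rb --no-validate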
The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem cut the time to validate 10,000 documents in half.

### Ruby version

Pupa.rb requires Ruby 2.x. If you have already made all of the above optimizations, you may notice a significant improvement by using Ruby 2.1, which has better garbage collection than Ruby 2.0.

### Profiling

You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:

    gem install perftools.rb

Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):

    CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb

You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.
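If you do want to sample in real time instead of CPU time, add the flag to the same invocation (with the caveat above):

    CPUPROFILE_REALTIME=1 CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb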
[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straightforward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):

    pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
data/README.md
CHANGED
@@ -69,7 +69,7 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
 
 JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
 
-
+## [OpenCivicData](http://opencivicdata.org/) compatibility
 
 Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
 
@@ -81,138 +81,8 @@ require 'pupa/refinements/opencivicdata'
 
 It is not currently possible to run the `scrape` action with one of Pupa.rb and Pupa, and to then run the `import` action with the other. Both actions must be run by the same library.
 
-## Performance
-
-Pupa.rb offers several ways to significantly improve performance.
-
-In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents from 100 cached HTTP responses from 100 seconds down to 5 seconds. Like fast tests, fast scrapers make development smoother.
-
-The `import` action's performance is currently limited by MongoDB when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.
-
-### Reducing HTTP requests
-
-HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:
-
-    ruby cat.rb --cache_dir /tmp/my_cache_dir
-
-### Parallelizing HTTP requests
-
-To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:
-
-```ruby
-require 'pupa'
-require 'typhoeus'
-require 'typhoeus/adapters/faraday'
-```
-
-Then, in your scraping methods, write code like:
-
-```ruby
-responses = []
-
-# Change the maximum number of concurrent requests (default 200). You usually
-# need to tweak this number by trial and error.
-# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
-manager = Typhoeus::Hydra.new(max_concurrency: 20)
-
-begin
-  # Send HTTP requests in parallel.
-  client.in_parallel(manager) do
-    responses << client.get('http://example.com/foo')
-    responses << client.get('http://example.com/bar')
-    # More requests...
-  end
-rescue Faraday::Error::ClientError => e
-  # Log an error message if, for example, you exceed a server's maximum number
-  # of concurrent connections or if you exceed an API's rate limit.
-  error(e.response.inspect)
-end
-
-# Responses are now available for use.
-responses.each do |response|
-  # Only process the finished responses.
-  if response.success?
-    # If success...
-  elsif response.finished?
-    # If error...
-  end
-end
-```
-
-### Reducing disk I/O
-
-After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.
-
-#### RAM file systems
-
-A simple solution is to create a file system in RAM, like `tmpfs` on Linux for example, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:
-
-    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
-    diskutil erasevolume HFS+ 'ramdisk' $ramdisk
-
-You can then set the `output_dir` and `cache_dir` on OS X as:
-
-    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache
-
-Once you are done with the RAM disk, release the memory:
-
-    diskutil unmount $ramdisk
-    hdiutil detach $ramdisk
-
-Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move the data to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.
-
-#### Memcached
-
-You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:
-
-    ruby cat.rb --cache_dir memcached://localhost:11211
-
-The data in Memcached will be lost between reboots.
-
-#### Redis
-
-You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:
-
-    ruby cat.rb --output_dir redis://localhost:6379/0
-
-To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):
-
-    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined
-
-Requiring the `hiredis` gem will slightly improve performance.
-
-Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
-
-### Skipping validation
-
-The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
-
-The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem reduced by half the time to validate 10,000 documents.
-
-### Ruby version
-
-Pupa.rb requires Ruby 2.x. If you have already made all the above optimizations, you may notice a significant improvement by using Ruby 2.1, which has better garbage collection than Ruby 2.0.
-
-### Profiling
-
-You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
-
-    gem install perftools.rb
-
-Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):
-
-    CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb
-
-You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.
-
-[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straight-forward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):
-
-    pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
-
 ## Integration with ODMs
 
-### Mongoid
-
 `Pupa::Model` is incompatible with `Mongoid::Document`. Don't do this:
 
 ```ruby
@@ -224,6 +94,10 @@ end
 
 Instead, have a scraping model that includes `Pupa::Model` and an app model that includes `Mongoid::Document`.
 
+## Performance
+
+Pupa.rb offers several ways to significantly improve performance. [Read the documentation.](https://github.com/opennorth/pupa-ruby/blob/master/PERFORMANCE.md#readme)
+
 ## Testing
 
 **DO NOT** run this gem's specs if you are using Redis database number 15 on `localhost`!
data/lib/pupa/processor/client.rb
CHANGED
@@ -30,9 +30,12 @@ module Pupa
     # (e.g. `memcached://localhost:11211`) in which to cache requests
     # @param [Integer] expires_in the cache's expiration time in seconds
     # @param [Integer] value_max_bytes the maximum Memcached item size
+    # @param [String] memcached_username the Memcached username
+    # @param [String] memcached_password the Memcached password
     # @param [String] level the log level
+    # @param [String,IO] logdev the log device
     # @return [Faraday::Connection] a configured Faraday HTTP client
-    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, level: 'INFO') # 1 day
+    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, level: 'INFO', logdev: STDOUT) # 1 day
      Faraday.new do |connection|
        connection.request :url_encoded
        connection.use Middleware::Logger, Logger.new('faraday', level: level)
@@ -59,7 +62,7 @@ module Pupa
        connection.response :caching do
          address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
          if address
-           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes))
+           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes), username: memcached_username, password: memcached_password)
          else
            ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
          end
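For illustration, a sketch of the widened signature at a direct call site (normally `Pupa::Processor` passes these options through from the runner; the `Pupa::Processor::Client` constant is inferred from the file path and the credential values are placeholders):

```ruby
require 'pupa'
require 'dalli' # needed for the memcached:// cache backend

client = Pupa::Processor::Client.new(
  cache_dir: 'memcached://localhost:11211',
  expires_in: 86400,              # 1 day
  memcached_username: 'USERNAME', # new in 0.1.5
  memcached_password: 'PASSWORD', # new in 0.1.5
  level: 'INFO',
  logdev: STDOUT                  # new in 0.1.5
)
```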
data/lib/pupa/processor.rb
CHANGED
@@ -25,14 +25,16 @@ module Pupa
     # (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
     # @param [Integer] expires_in the cache's expiration time in seconds
     # @param [Integer] value_max_bytes the maximum Memcached item size
+    # @param [String] memcached_username the Memcached username
+    # @param [String] memcached_password the Memcached password
     # @param [String] database_url the database URL
     # @param [Boolean] validate whether to validate JSON documents
     # @param [String] level the log level
     # @param [String,IO] logdev the log device
     # @param [Hash] options criteria for selecting the methods to run
-    def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
+    def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
       @store = DocumentStore.new(output_dir, pipelined: pipelined)
-      @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, level: level)
+      @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, memcached_username: memcached_username, memcached_password: memcached_password, level: level, logdev: logdev)
       @connection = Connection.new(database_url)
       @logger = Logger.new('pupa', level: level, logdev: logdev)
       @validate = validate
data/lib/pupa/runner.rb
CHANGED
@@ -11,17 +11,19 @@ module Pupa
     @processor_class = processor_class
 
     @options = OpenStruct.new({
-      actions:
-      tasks:
-      output_dir:
-      pipelined:
-      cache_dir:
-      expires_in:
-      value_max_bytes:
-
-
-
-
+      actions: [],
+      tasks: [],
+      output_dir: File.expand_path('scraped_data', Dir.pwd),
+      pipelined: false,
+      cache_dir: File.expand_path('web_cache', Dir.pwd),
+      expires_in: 86400, # 1 day
+      value_max_bytes: 1048576, # 1 MB
+      memcached_username: nil,
+      memcached_password: nil,
+      database_url: 'mongodb://localhost:27017/pupa',
+      validate: true,
+      level: 'INFO',
+      dry_run: false,
     }.merge(defaults))
 
     @actions = {
@@ -86,7 +88,13 @@
      opts.on('--value_max_bytes BYTES', "The maximum Memcached item size") do |v|
        options.value_max_bytes = v
      end
-      opts.on('
+      opts.on('--memcached_username USERNAME', "The Memcached username") do |v|
+        options.memcached_username = v
+      end
+      opts.on('--memcached_password USERNAME', "The Memcached password") do |v|
+        options.memcached_password = v
+      end
+      opts.on('-d', '--database_url', 'The database URL (e.g. mongodb://USER:PASSWORD@localhost:27017/pupa or postgres://USER:PASSWORD@localhost:5432/pupa') do |v|
        options.database_url = v
      end
      opts.on('--[no-]validate', 'Validate JSON documents') do |v|
@@ -147,6 +155,8 @@
       cache_dir: options.cache_dir,
       expires_in: options.expires_in,
       value_max_bytes: options.value_max_bytes,
+      memcached_username: options.memcached_username,
+      memcached_password: options.memcached_password,
       database_url: options.database_url,
       validate: options.validate,
       level: options.level,
@@ -165,7 +175,7 @@
     end
 
     if options.level == 'DEBUG'
-      %w(output_dir pipelined cache_dir expires_in value_max_bytes database_url validate level).each do |option|
+      %w(output_dir pipelined cache_dir expires_in value_max_bytes memcached_username memcached_password database_url validate level).each do |option|
        puts "#{option}: #{options[option]}"
      end
      unless rest.empty?
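For illustration, a sketch of a scraper script that overrides some of these defaults when constructing the runner. `CatProcessor` is a hypothetical processor class and the `run(ARGV)` entry point is assumed; defaults passed to `Runner.new` are merged into the hash above and can still be overridden on the command line (e.g. `--memcached_username USERNAME`):

```ruby
require 'pupa'

class CatProcessor < Pupa::Processor
  # scraping and importing tasks go here
end

runner = Pupa::Runner.new(CatProcessor, {
  expires_in: 604800, # cache HTTP responses for a week instead of a day
  level: 'WARN',
})
runner.run(ARGV)
```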
data/lib/pupa/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.1.4
+  version: 0.1.5
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-07-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -288,6 +288,7 @@ files:
 - ".yardopts"
 - Gemfile
 - LICENSE
+- PERFORMANCE.md
 - README.md
 - Rakefile
 - USAGE