pupa 0.1.4 → 0.1.5
- checksums.yaml +4 -4
- data/PERFORMANCE.md +129 -0
- data/README.md +5 -131
- data/lib/pupa/processor/client.rb +5 -2
- data/lib/pupa/processor.rb +4 -2
- data/lib/pupa/runner.rb +23 -13
- data/lib/pupa/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d794649266b975f92ee8ff502a3de21390dc540b
+  data.tar.gz: 59b89a81274d35ee848d944da9f4337295ab8567
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 19582ce0e29e5a9ad52d4dabb216664d418fe738b4bf9b534a51b161eae209616df8bc87138a5f34253dace8c657ae21c0a2684024258dc918a94dfc63476a68
+  data.tar.gz: 3464169a23f255de3e4b357e245135fc571bc11642a1929d448a88fc1382e5418ed96b3214317c30bae90cb535f90ab6bd4a2bd94d1403ce840a1ccbf6cad734
data/PERFORMANCE.md
ADDED
@@ -0,0 +1,129 @@
# Pupa.rb: A Data Scraping Framework

## Performance

Pupa.rb offers several ways to significantly improve performance.

In an example case, reducing disk I/O and skipping validation as described below reduced the time to scrape 10,000 documents (from 100 cached HTTP responses) from 100 seconds down to 5. Like fast tests, fast scrapers make development smoother.

The `import` action's performance is currently limited by the database when a dependency graph is used to determine the evaluation order. If a dependency graph cannot be used because you don't know a related object's ID, [several optimizations](https://github.com/opennorth/pupa-ruby/issues/12) can be implemented to improve performance.

### Reducing HTTP requests

HTTP requests consume the most time. To avoid repeating HTTP requests while developing a scraper, cache all HTTP responses. By default, Pupa.rb uses a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:

    ruby cat.rb --cache_dir /tmp/my_cache_dir

### Parallelizing HTTP requests

To enable parallel requests, use the `typhoeus` gem. Unless you are using an old version of Typhoeus (< 0.5), both Faraday and Typhoeus define a Faraday adapter, but you must use the one defined by Typhoeus, like so:

```ruby
require 'pupa'
require 'typhoeus'
require 'typhoeus/adapters/faraday'
```

Then, in your scraping methods, write code like:

```ruby
responses = []

# Change the maximum number of concurrent requests (default 200). You usually
# need to tweak this number by trial and error.
# @see https://github.com/lostisland/faraday/wiki/Parallel-requests#advanced-use
manager = Typhoeus::Hydra.new(max_concurrency: 20)

begin
  # Send HTTP requests in parallel.
  client.in_parallel(manager) do
    responses << client.get('http://example.com/foo')
    responses << client.get('http://example.com/bar')
    # More requests...
  end
rescue Faraday::Error::ClientError => e
  # Log an error message if, for example, you exceed a server's maximum number
  # of concurrent connections or if you exceed an API's rate limit.
  error(e.response.inspect)
end

# Responses are now available for use.
responses.each do |response|
  # Only process the finished responses.
  if response.success?
    # If success...
  elsif response.finished?
    # If error...
  end
end
```
### Reducing disk I/O

After HTTP requests, disk I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk.

#### RAM file systems

A simple solution is to create a file system in RAM, such as `tmpfs` on Linux, and to use it as your `output_dir` and `cache_dir`. On OS X, you must create a RAM disk. To create a 128MB RAM disk, for example, run:

    ramdisk=$(hdiutil attach -nomount ram://$((128 * 2048)) | tr -d ' \t')
    diskutil erasevolume HFS+ 'ramdisk' $ramdisk

You can then set the `output_dir` and `cache_dir` on OS X as:

    ruby cat.rb --output_dir /Volumes/ramdisk/scraped_data --cache_dir /Volumes/ramdisk/web_cache

Once you are done with the RAM disk, release the memory:

    diskutil unmount $ramdisk
    hdiutil detach $ramdisk

Using a RAM disk will significantly improve performance; however, the data will be lost between reboots unless you move it to a hard disk. Using Memcached (for caching) and Redis (for storage) is moderately faster than using a RAM disk, and Redis will not lose your output data between reboots.

#### Memcached

You may cache HTTP responses in [Memcached](http://memcached.org/). First, require the `dalli` gem. Then:

    ruby cat.rb --cache_dir memcached://localhost:11211

The data in Memcached will be lost between reboots.
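For instance, a scraper script set up for Memcached caching might begin like this (a minimal sketch; only the two `require` lines matter here):

```ruby
# A sketch of a scraper script (cat.rb in the examples above), assuming
# Memcached is running on localhost:11211. The dalli gem must be loaded
# before Pupa.rb can use a Memcached cache.
require 'pupa'
require 'dalli'

# ... define models and the processor as usual ...
```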
#### Redis

You may dump JSON documents in [Redis](http://redis.io/). First, require the `redis-store` gem. Then:

    ruby cat.rb --output_dir redis://localhost:6379/0

To dump JSON documents in Redis moderately faster, use [pipelining](http://redis.io/topics/pipelining):

    ruby cat.rb --output_dir redis://localhost:6379/0 --pipelined

Requiring the `hiredis` gem will slightly improve performance.

Note that Pupa.rb flushes the Redis database before scraping. If you use Redis, **DO NOT** share a Redis database between Pupa.rb and other applications. You can select a database other than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/15`, where `15` is the database number.
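For example, to keep Pupa.rb's output in database `15` instead:

    ruby cat.rb --output_dir redis://localhost:6379/15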
### Skipping validation

The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.

The [pupa-validate](https://npmjs.org/package/pupa-validate) npm package can be used to validate JSON documents using the faster JSV. In an example case, using JSV instead of the `json-schema` gem halved the time to validate 10,000 documents.
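For example, to scrape without validating and leave validation to a separate pass:

    ruby cat.rb --no-validate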
### Ruby version

Pupa.rb requires Ruby 2.x. If you have already made all of the above optimizations, you may notice a significant improvement by using Ruby 2.1, which has better garbage collection than Ruby 2.0.

### Profiling

You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:

    gem install perftools.rb

Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):

    CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb

You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to interfere with HTTP requests, for whatever reason.

[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straightforward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):

    pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
data/README.md
CHANGED
@@ -69,7 +69,7 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
 
 JSON parsing is enabled by default. To enable automatic parsing of HTML and XML, require the `nokogiri` and `multi_xml` gems.
 
-
+## [OpenCivicData](http://opencivicdata.org/) compatibility
 
 Both Pupa.rb and Sunlight Labs' [Pupa](https://github.com/opencivicdata/pupa) implement models for people, organizations and memberships from the [Popolo](http://popoloproject.com/) open government data specification. Pupa.rb lets you use your own classes, but Pupa only supports a fixed set of classes. A consequence of Pupa.rb's flexibility is that the value of the `_type` property for `Person`, `Organization` and `Membership` objects differs between Pupa.rb and Pupa. Pupa.rb has namespaced types like `pupa/person` – to allow Ruby to load the `Person` class in the `Pupa` module – whereas Pupa has unnamespaced types like `person`.
 
@@ -81,138 +81,8 @@ require 'pupa/refinements/opencivicdata'
 
 It is not currently possible to run the `scrape` action with one of Pupa.rb and Pupa, and to then run the `import` action with the other. Both actions must be run by the same library.
 
-## Performance
-
-[the remaining lines of the Performance section, moved verbatim to the new PERFORMANCE.md above; the one wording change is that "limited by MongoDB" now reads "limited by the database"]
-
 ## Integration with ODMs
 
-### Mongoid
-
 `Pupa::Model` is incompatible with `Mongoid::Document`. Don't do this:
 
 ```ruby
@@ -224,6 +94,10 @@ end
 
 Instead, have a scraping model that includes `Pupa::Model` and an app model that includes `Mongoid::Document`.
 
+## Performance
+
+Pupa.rb offers several ways to significantly improve performance. [Read the documentation.](https://github.com/opennorth/pupa-ruby/blob/master/PERFORMANCE.md#readme)
+
 ## Testing
 
 **DO NOT** run this gem's specs if you are using Redis database number 15 on `localhost`!
data/lib/pupa/processor/client.rb
CHANGED
@@ -30,9 +30,12 @@ module Pupa
     #   (e.g. `memcached://localhost:11211`) in which to cache requests
     # @param [Integer] expires_in the cache's expiration time in seconds
     # @param [Integer] value_max_bytes the maximum Memcached item size
+    # @param [String] memcached_username the Memcached username
+    # @param [String] memcached_password the Memcached password
     # @param [String] level the log level
+    # @param [String,IO] logdev the log device
     # @return [Faraday::Connection] a configured Faraday HTTP client
-    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, level: 'INFO') # 1 day
+    def self.new(cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, level: 'INFO', logdev: STDOUT) # 1 day
      Faraday.new do |connection|
        connection.request :url_encoded
        connection.use Middleware::Logger, Logger.new('faraday', level: level)
@@ -59,7 +62,7 @@ module Pupa
        connection.response :caching do
          address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
          if address
-           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes))
+           ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in, value_max_bytes: Integer(value_max_bytes), username: memcached_username, password: memcached_password)
          else
            ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
          end
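A minimal sketch of calling the client factory with the new keywords; the credentials are hypothetical, and per the docstring above the return value is a configured Faraday connection:

```ruby
# A sketch, assuming a SASL-protected Memcached instance; the credentials are
# hypothetical. The return value is a Faraday::Connection, so the usual
# request methods (get, post, ...) are available on it.
client = Pupa::Processor::Client.new(
  cache_dir: 'memcached://localhost:11211',
  memcached_username: 'user',
  memcached_password: 'secret', # both keywords are new in 0.1.5
  logdev: STDERR)               # newly accepted keyword for the log device

response = client.get('http://example.com/foo')
```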
data/lib/pupa/processor.rb
CHANGED
@@ -25,14 +25,16 @@ module Pupa
    #   (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
    # @param [Integer] expires_in the cache's expiration time in seconds
    # @param [Integer] value_max_bytes the maximum Memcached item size
+   # @param [String] memcached_username the Memcached username
+   # @param [String] memcached_password the Memcached password
    # @param [String] database_url the database URL
    # @param [Boolean] validate whether to validate JSON documents
    # @param [String] level the log level
    # @param [String,IO] logdev the log device
    # @param [Hash] options criteria for selecting the methods to run
-   def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
+   def initialize(output_dir, pipelined: false, cache_dir: nil, expires_in: 86400, value_max_bytes: 1048576, memcached_username: nil, memcached_password: nil, database_url: 'mongodb://localhost:27017/pupa', validate: true, level: 'INFO', logdev: STDOUT, options: {})
      @store = DocumentStore.new(output_dir, pipelined: pipelined)
-     @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, level: level)
+     @client = Client.new(cache_dir: cache_dir, expires_in: expires_in, value_max_bytes: value_max_bytes, memcached_username: memcached_username, memcached_password: memcached_password, level: level, logdev: logdev)
      @connection = Connection.new(database_url)
      @logger = Logger.new('pupa', level: level, logdev: logdev)
      @validate = validate
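A minimal sketch of constructing a processor by hand, now that `logdev` is passed through to `Client.new` as well as to the `pupa` logger (the subclass name and path are hypothetical; `Pupa::Runner` normally builds the processor from command-line options):

```ruby
# Hypothetical: CatProcessor is a scraper-defined Pupa::Processor subclass.
# logdev is forwarded to the HTTP client as of this release.
log = File.open('/tmp/scraper.log', 'a')
processor = CatProcessor.new('/tmp/scraped_data', level: 'DEBUG', logdev: log)
```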
data/lib/pupa/runner.rb
CHANGED
@@ -11,17 +11,19 @@ module Pupa
      @processor_class = processor_class
 
      @options = OpenStruct.new({
-       actions:
-       tasks:
-       output_dir:
-       pipelined:
-       cache_dir:
-       expires_in:
-       value_max_bytes:
-
-
-
-
+       actions: [],
+       tasks: [],
+       output_dir: File.expand_path('scraped_data', Dir.pwd),
+       pipelined: false,
+       cache_dir: File.expand_path('web_cache', Dir.pwd),
+       expires_in: 86400, # 1 day
+       value_max_bytes: 1048576, # 1 MB
+       memcached_username: nil,
+       memcached_password: nil,
+       database_url: 'mongodb://localhost:27017/pupa',
+       validate: true,
+       level: 'INFO',
+       dry_run: false,
      }.merge(defaults))
 
      @actions = {
@@ -86,7 +88,13 @@ module Pupa
      opts.on('--value_max_bytes BYTES', "The maximum Memcached item size") do |v|
        options.value_max_bytes = v
      end
-     opts.on('
+     opts.on('--memcached_username USERNAME', "The Memcached username") do |v|
+       options.memcached_username = v
+     end
+     opts.on('--memcached_password USERNAME', "The Memcached password") do |v|
+       options.memcached_password = v
+     end
+     opts.on('-d', '--database_url', 'The database URL (e.g. mongodb://USER:PASSWORD@localhost:27017/pupa or postgres://USER:PASSWORD@localhost:5432/pupa') do |v|
        options.database_url = v
      end
      opts.on('--[no-]validate', 'Validate JSON documents') do |v|
@@ -147,6 +155,8 @@ module Pupa
        cache_dir: options.cache_dir,
        expires_in: options.expires_in,
        value_max_bytes: options.value_max_bytes,
+       memcached_username: options.memcached_username,
+       memcached_password: options.memcached_password,
        database_url: options.database_url,
        validate: options.validate,
        level: options.level,
@@ -165,7 +175,7 @@ module Pupa
      end
 
      if options.level == 'DEBUG'
-       %w(output_dir pipelined cache_dir expires_in value_max_bytes database_url validate level).each do |option|
+       %w(output_dir pipelined cache_dir expires_in value_max_bytes memcached_username memcached_password database_url validate level).each do |option|
         puts "#{option}: #{options[option]}"
       end
       unless rest.empty?
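With the new switches defined above, credentials for a protected Memcached instance can be supplied on the command line, for example (hypothetical values):

    ruby cat.rb --cache_dir memcached://localhost:11211 --memcached_username user --memcached_password secret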
data/lib/pupa/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.1.4
+  version: 0.1.5
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-07-11 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -288,6 +288,7 @@ files:
 - ".yardopts"
 - Gemfile
 - LICENSE
+- PERFORMANCE.md
 - README.md
 - Rakefile
 - USAGE