pupa 0.0.7 → 0.0.8
- checksums.yaml +4 -4
- data/.travis.yml +1 -0
- data/README.md +50 -0
- data/lib/pupa.rb +1 -0
- data/lib/pupa/models/base.rb +3 -3
- data/lib/pupa/processor.rb +36 -32
- data/lib/pupa/processor/client.rb +16 -3
- data/lib/pupa/processor/document_store.rb +21 -0
- data/lib/pupa/processor/document_store/file_store.rb +83 -0
- data/lib/pupa/processor/document_store/redis_store.rb +77 -0
- data/lib/pupa/refinements/faraday_middleware.rb +1 -3
- data/lib/pupa/refinements/json-schema.rb +3 -3
- data/lib/pupa/runner.rb +13 -11
- data/lib/pupa/version.rb +1 -1
- data/pupa.gemspec +3 -1
- data/spec/cassettes/f861172f1df3bdb2052af5451f9922699d574b77.yml +62 -0
- data/spec/fixtures/bar.json +1 -0
- data/spec/fixtures/baz.json +1 -0
- data/spec/fixtures/foo.json +1 -0
- data/spec/models/base_spec.rb +24 -8
- data/spec/processor/client_spec.rb +11 -0
- data/spec/processor/document_store/file_store_spec.rb +65 -0
- data/spec/processor/document_store/redis_store_spec.rb +71 -0
- data/spec/processor/document_store_spec.rb +15 -0
- data/spec/processor/persistence_spec.rb +1 -1
- data/spec/processor_spec.rb +13 -3
- data/spec/spec_helper.rb +2 -0
- metadata +60 -15
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: d4ec7210671485a2de58673a70088e415a9767b7
|
4
|
+
data.tar.gz: 8b2a77e3fe3c5775509fef59fb847ea2057f63cb
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: b3cdcf2da535ebd8d840fe2a1f6e6dd0db68de4d92b728f6f33bf7ba80f7499331f6cc9ace81da93b7e1062e7f3b8389fbab1891d81fbc44d1c48cbbc3be8eea
|
7
|
+
data.tar.gz: 641956572610184f0437f0869e2ee9da3c624e493284ac3d4e0e88157395c4b8d41840ac4f1d351b63032aff1fde091862a2b83401c5ee55c1d83ac52a856a70
|
data/.travis.yml
CHANGED
data/README.md
CHANGED
@@ -45,6 +45,56 @@ The [organization.rb](http://opennorth.github.io/pupa-ruby/docs/organization.htm
|
|
45
45
|
|
46
46
|
1. You may want more control over the method used to perform a scraping task. For example, a legislature may publish legislators before 1997 in one format and legislators after 1997 in another format. In this case, you may want to select the method used to scrape legislators according to the year. See [legislator.rb](http://opennorth.github.io/pupa-ruby/docs/legislator.html).
|
47
47
|
|
48
|
+
## Performance
|
49
|
+
|
50
|
+
Pupa.rb offers several ways to significantly improve performance.
|
51
|
+
|
52
|
+
In one example, reducing file I/O and skipping validation as described below cut the time needed to scrape 10,000 documents out of 100 cached HTTP responses from 100 seconds to 5 seconds. Like fast tests, fast scrapers make development smoother.
|
53
|
+
|
54
|
+
The `import` action's performance (when using a dependency graph) is currently limited by MongoDB.
|
55
|
+
|
56
|
+
### Caching HTTP requests
|
57
|
+
|
58
|
+
HTTP requests consume the most time. To avoid repeat HTTP requests while developing a scraper, cache all HTTP responses. Pupa.rb will by default use a `web_cache` directory in the same directory as your script. You can change the directory by setting the `--cache_dir` switch on the command line, for example:
|
59
|
+
|
60
|
+
ruby cat.rb --cache_dir my_cache_dir
|
61
|
+
|
62
|
+
### Reducing file I/O
|
63
|
+
|
64
|
+
After HTTP requests, file I/O is the slowest operation. Two types of files are written to disk: HTTP responses are written to the cache directory, and JSON documents are written to the output directory. Writing to memory is much faster than writing to disk. You may store HTTP responses in [Memcached](http://memcached.org/) like so:
|
65
|
+
|
66
|
+
ruby cat.rb --cache_dir memcached://localhost:11211
|
67
|
+
|
68
|
+
And you may store JSON documents in [Redis](http://redis.io/) like so:
|
69
|
+
|
70
|
+
ruby cat.rb --output_dir redis://localhost:6379/0
|
71
|
+
|
72
|
+
Note that Pupa.rb flushes the JSON documents before scraping. If you use Redis, **DO NOT** share a Redis database with Pupa.rb and other applications. You can select a different database than the default `0` for use with Pupa.rb by passing an argument like `redis://localhost:6379/1`, where `1` is the Redis database number.
|
73
|
+
|
74
|
+
### Skipping validation
|
75
|
+
|
76
|
+
The `json-schema` gem is slow compared to, for example, [JSV](https://github.com/garycourt/JSV). Setting the `--no-validate` switch and running JSON Schema validations separately can further reduce a scraper's running time.
|
77
|
+
|
78
|
+
### Profiling
|
79
|
+
|
80
|
+
You can profile your code using [perftools.rb](https://github.com/tmm1/perftools.rb). First, install the gem:
|
81
|
+
|
82
|
+
gem install perftools.rb
|
83
|
+
|
84
|
+
Then, run your script with the profiler (changing `/tmp/PROFILE_NAME` and `script.rb` as appropriate):
|
85
|
+
|
86
|
+
CPUPROFILE=/tmp/PROFILE_NAME RUBYOPT="-r`gem which perftools | tail -1`" ruby script.rb
|
87
|
+
|
88
|
+
You may want to set the `CPUPROFILE_REALTIME=1` flag; however, it seems to change the behavior of the `json-schema` gem, for whatever reason.
|
89
|
+
|
90
|
+
[perftools.rb](https://github.com/tmm1/perftools.rb) has several output formats. If your code is straightforward, you can draw a graph (changing `/tmp/PROFILE_NAME` and `/tmp/PROFILE_NAME.pdf` as appropriate):
|
91
|
+
|
92
|
+
pprof.rb --pdf /tmp/PROFILE_NAME > /tmp/PROFILE_NAME.pdf
|
93
|
+
|
94
|
+
## Testing
|
95
|
+
|
96
|
+
**DO NOT** run this gem's specs if you are using Redis database number 15 on `localhost`!
|
97
|
+
|
48
98
|
## Bugs? Questions?
|
49
99
|
|
50
100
|
This project's main repository is on GitHub: [http://github.com/opennorth/pupa-ruby](http://github.com/opennorth/pupa-ruby), where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.
|
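The README's warning about not sharing a Redis database depends on the trailing path segment of the address, which names the database number. The following is a minimal sketch of that convention using only the standard library's `URI`; it is not Pupa.rb's own parsing (the diff hands the address to `Redis::Store::Factory`), and `redis_database_number` is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: returns the database number named by a Redis address,
# defaulting to 0 when the address has no numeric path segment.
def redis_database_number(address)
  path = URI.parse(address).path.to_s
  digits = path[%r{\A/(\d+)\z}, 1]
  digits ? digits.to_i : 0
end

puts redis_database_number('redis://localhost:6379/1')
puts redis_database_number('redis://localhost:6379')
```

Checking the number before a run is a cheap way to confirm that the database Pupa.rb will flush is not one that another application uses.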
data/lib/pupa.rb
CHANGED
data/lib/pupa/models/base.rb
CHANGED
@@ -84,9 +84,9 @@ module Pupa
|
|
84
84
|
self.json_schema = if Hash === value
|
85
85
|
value
|
86
86
|
elsif Pathname.new(value).absolute?
|
87
|
-
value
|
87
|
+
File.read(value)
|
88
88
|
else
|
89
|
-
File.expand_path(File.join('..', '..', '..', 'schemas', "#{value}.json"), __dir__)
|
89
|
+
File.read(File.expand_path(File.join('..', '..', '..', 'schemas', "#{value}.json"), __dir__))
|
90
90
|
end
|
91
91
|
end
|
92
92
|
end
|
@@ -164,7 +164,7 @@ module Pupa
|
|
164
164
|
# @raises [JSON::Schema::ValidationError] if the object is invalid
|
165
165
|
def validate!
|
166
166
|
if self.class.json_schema
|
167
|
-
# JSON::Validator#
|
167
|
+
# JSON::Validator#initialize_schema runs fastest if given a hash.
|
168
168
|
JSON::Validator.validate!(self.class.json_schema, stringify_keys(to_h))
|
169
169
|
end
|
170
170
|
end
|
data/lib/pupa/processor.rb
CHANGED
@@ -6,8 +6,12 @@ require 'pupa/processor/client'
|
|
6
6
|
require 'pupa/processor/dependency_graph'
|
7
7
|
require 'pupa/processor/helper'
|
8
8
|
require 'pupa/processor/persistence'
|
9
|
+
require 'pupa/processor/document_store'
|
9
10
|
require 'pupa/processor/yielder'
|
10
11
|
|
12
|
+
require 'pupa/processor/document_store/file_store'
|
13
|
+
require 'pupa/processor/document_store/redis_store'
|
14
|
+
|
11
15
|
module Pupa
|
12
16
|
# An abstract processor class from which specific processors inherit.
|
13
17
|
class Processor
|
@@ -17,23 +21,26 @@ module Pupa
|
|
17
21
|
class_attribute :tasks
|
18
22
|
self.tasks = []
|
19
23
|
|
20
|
-
attr_reader :report, :client, :options
|
24
|
+
attr_reader :report, :store, :client, :options
|
21
25
|
|
22
26
|
def_delegators :@logger, :debug, :info, :warn, :error, :fatal
|
23
27
|
|
24
|
-
# @param [String] output_dir the directory
|
25
|
-
#
|
28
|
+
# @param [String] output_dir the directory or Redis address
|
29
|
+
# (e.g. `redis://localhost:6379`) in which to dump JSON documents
|
30
|
+
# @param [String] cache_dir the directory or Memcached address
|
31
|
+
# (e.g. `memcached://localhost:11211`) in which to cache HTTP responses
|
26
32
|
# @param [Integer] expires_in the cache's expiration time in seconds
|
33
|
+
# @param [Boolean] validate whether to validate JSON documents
|
27
34
|
# @param [String] level the log level
|
28
35
|
# @param [String,IO] logdev the log device
|
29
36
|
# @param [Hash] options criteria for selecting the methods to run
|
30
|
-
def initialize(output_dir, cache_dir: nil, expires_in: 86400, level: 'INFO', logdev: STDOUT, options: {})
|
31
|
-
@
|
32
|
-
@
|
33
|
-
@
|
34
|
-
@
|
35
|
-
@
|
36
|
-
@report
|
37
|
+
def initialize(output_dir, cache_dir: nil, expires_in: 86400, validate: true, level: 'INFO', logdev: STDOUT, options: {})
|
38
|
+
@store = DocumentStore.new(output_dir)
|
39
|
+
@client = Client.new(cache_dir: cache_dir, expires_in: expires_in, level: level)
|
40
|
+
@logger = Logger.new('pupa', level: level, logdev: logdev)
|
41
|
+
@validate = validate
|
42
|
+
@options = options
|
43
|
+
@report = {}
|
37
44
|
end
|
38
45
|
|
39
46
|
# Retrieves and parses a document with a GET request.
|
@@ -213,23 +220,22 @@ module Pupa
|
|
213
220
|
# @raises [Pupa::Errors::DuplicateObjectIdError]
|
214
221
|
def dump_scraped_object(object)
|
215
222
|
type = object.class.to_s.demodulize.underscore
|
216
|
-
|
217
|
-
path = File.join(@output_dir, basename)
|
223
|
+
name = "#{type}_#{object._id.gsub(File::SEPARATOR, '_')}.json"
|
218
224
|
|
219
|
-
if
|
225
|
+
if @store.exist?(name)
|
220
226
|
raise Errors::DuplicateObjectIdError, "duplicate object ID: #{object._id} (was the same object yielded twice?)"
|
221
227
|
end
|
222
228
|
|
223
|
-
info {"save #{type} #{object.to_s} as #{
|
229
|
+
info {"save #{type} #{object.to_s} as #{name}"}
|
224
230
|
|
225
|
-
|
226
|
-
f.write(JSON.dump(object.to_h(include_foreign_objects: true)))
|
227
|
-
end
|
231
|
+
@store.write(name, object.to_h(include_foreign_objects: true))
|
228
232
|
|
229
|
-
|
230
|
-
|
231
|
-
|
232
|
-
|
233
|
+
if @validate
|
234
|
+
begin
|
235
|
+
object.validate!
|
236
|
+
rescue JSON::Schema::ValidationError => e
|
237
|
+
warn {e.message}
|
238
|
+
end
|
233
239
|
end
|
234
240
|
end
|
235
241
|
|
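The new `dump_scraped_object` builds one flat key per object and lets the store decide where it lives. Below is a small stand-alone sketch of the key-building step only; `document_name` is a hypothetical name, and `type` is passed in directly rather than derived with ActiveSupport's `demodulize.underscore` as in the diff.

```ruby
# Hypothetical stand-alone version of the key-building step: `type` is the
# underscored class name and `id` is the object's `_id`. Path separators in
# the ID are replaced so the key is a single file name, not a nested path.
def document_name(type, id)
  "#{type}_#{id.gsub(File::SEPARATOR, '_')}.json"
end

puts document_name('organization', 'ocd-organization/country:ca/legislature')
```

Because the same key is used for both `exist?` and `write`, a second object with the same type and ID is detected before it can overwrite the first.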
@@ -238,8 +244,7 @@ module Pupa
|
|
238
244
|
# @return [Hash] a hash of scraped objects keyed by ID
|
239
245
|
def load_scraped_objects
|
240
246
|
{}.tap do |objects|
|
241
|
-
|
242
|
-
data = JSON.load(File.read(path))
|
247
|
+
@store.read_multi(@store.entries).each do |data|
|
243
248
|
object = data['_type'].camelize.constantize.new(data)
|
244
249
|
objects[object._id] = object
|
245
250
|
end
|
@@ -276,16 +281,15 @@ module Pupa
|
|
276
281
|
# @param [Hash] objects a hash of scraped objects keyed by ID
|
277
282
|
# @return [Hash] a mapping from an object ID to the ID of its duplicate
|
278
283
|
def build_losers_to_winners_map(objects)
|
284
|
+
inverse = {}
|
285
|
+
objects.each do |id,object|
|
286
|
+
(inverse[object.to_h.except(:_id)] ||= []) << id
|
287
|
+
end
|
288
|
+
|
279
289
|
{}.tap do |map|
|
280
|
-
|
281
|
-
|
282
|
-
|
283
|
-
unless map.key?(id1) # Don't search for duplicates of duplicates.
|
284
|
-
objects.drop(index + 1).each do |id2,object2|
|
285
|
-
if object1 == object2
|
286
|
-
map[id2] = id1
|
287
|
-
end
|
288
|
-
end
|
290
|
+
inverse.values.each do |ids|
|
291
|
+
ids.drop(1).each do |id|
|
292
|
+
map[id] = ids[0]
|
289
293
|
end
|
290
294
|
end
|
291
295
|
end
|
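The rewritten `build_losers_to_winners_map` replaces the nested pairwise comparison with a single grouping pass. The sketch below reproduces that idea outside Pupa, under the assumption that plain hashes stand in for scraped objects; it uses `Hash#reject` instead of the `except` call in the diff so that it runs without ActiveSupport.

```ruby
# Groups IDs by object content (ignoring the ID itself), then maps every ID
# after the first in each group to that first ID.
def build_losers_to_winners_map(objects)
  inverse = {}
  objects.each do |id, object|
    key = object.reject { |k, _| k == :_id }
    (inverse[key] ||= []) << id
  end

  {}.tap do |map|
    inverse.each_value do |ids|
      ids.drop(1).each do |id|
        map[id] = ids[0]
      end
    end
  end
end

# Plain hashes stand in for scraped objects here.
objects = {
  'a' => {_id: 'a', name: 'Foo'},
  'b' => {_id: 'b', name: 'Bar'},
  'c' => {_id: 'c', name: 'Foo'},
}
p build_losers_to_winners_map(objects)
```

Grouping relies on hashes with equal contents being equal hash keys, so each object is visited once instead of being compared against every later object.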
@@ -18,7 +18,10 @@ module Pupa
|
|
18
18
|
class Client
|
19
19
|
# Returns a configured Faraday HTTP client.
|
20
20
|
#
|
21
|
-
#
|
21
|
+
# In order to automatically parse XML responses, you must `require 'multi_xml'`.
|
22
|
+
#
|
23
|
+
# @param [String] cache_dir a directory or a Memcached address
|
24
|
+
# (e.g. `memcached://localhost:11211`) in which to cache requests
|
22
25
|
# @param [Integer] expires_in the cache's expiration time in seconds
|
23
26
|
# @param [String] level the log level
|
24
27
|
# @return [Faraday::Connection] a configured Faraday HTTP client
|
@@ -26,20 +29,30 @@ module Pupa
|
|
26
29
|
Faraday.new do |connection|
|
27
30
|
connection.request :url_encoded
|
28
31
|
connection.use Middleware::Logger, Logger.new('faraday', level: level)
|
32
|
+
|
29
33
|
# @see http://tools.ietf.org/html/rfc2854
|
30
34
|
# @see http://tools.ietf.org/html/rfc3236
|
31
35
|
connection.use Middleware::ParseHtml, content_type: %w(text/html application/xhtml+xml)
|
36
|
+
|
32
37
|
# @see http://tools.ietf.org/html/rfc4627
|
33
38
|
connection.use FaradayMiddleware::ParseJson, content_type: /\bjson$/
|
34
|
-
|
39
|
+
|
35
40
|
if defined?(MultiXml)
|
41
|
+
# @see http://tools.ietf.org/html/rfc3023
|
36
42
|
connection.use FaradayMiddleware::ParseXml, content_type: /\bxml$/
|
37
43
|
end
|
44
|
+
|
38
45
|
if cache_dir
|
39
46
|
connection.response :caching do
|
40
|
-
|
47
|
+
address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
|
48
|
+
if address
|
49
|
+
ActiveSupport::Cache::MemCacheStore.new(address, expires_in: expires_in)
|
50
|
+
else
|
51
|
+
ActiveSupport::Cache::FileStore.new(cache_dir, expires_in: expires_in)
|
52
|
+
end
|
41
53
|
end
|
42
54
|
end
|
55
|
+
|
43
56
|
connection.adapter Faraday.default_adapter # must be last
|
44
57
|
end
|
45
58
|
end
|
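The cache backend is chosen by a regular expression capture on the `cache_dir` value. The sketch below isolates that decision so it can be tried without Faraday, ActiveSupport or a running Memcached; `cache_backend` and the returned symbols are illustrative names, not part of the gem.

```ruby
# Hypothetical helper mirroring the branch in the client: returns
# [:memcached, address] for a memcached:// value and [:file, path] otherwise.
def cache_backend(cache_dir)
  # String#[] with a regexp and a group index returns the captured text,
  # or nil when the pattern does not match.
  address = cache_dir[%r{\Amemcached://(.+)\z}, 1]
  if address
    [:memcached, address]
  else
    [:file, cache_dir]
  end
end

p cache_backend('memcached://localhost:11211')
p cache_backend('web_cache')
```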
@@ -0,0 +1,21 @@
|
|
1
|
+
module Pupa
|
2
|
+
class Processor
|
3
|
+
# A JSON document store factory.
|
4
|
+
#
|
5
|
+
# Heavily inspired by `ActiveSupport::Cache::Store`.
|
6
|
+
class DocumentStore
|
7
|
+
# Returns a configured JSON document store.
|
8
|
+
#
|
9
|
+
# @param [String] argument the filesystem directory or Redis address
|
10
|
+
# (e.g. `redis://localhost:6379/0`) in which to dump JSON documents
|
11
|
+
# @return a configured JSON document store
|
12
|
+
def self.new(argument)
|
13
|
+
if argument[%r{\Aredis://}]
|
14
|
+
RedisStore.new(argument)
|
15
|
+
else
|
16
|
+
FileStore.new(argument)
|
17
|
+
end
|
18
|
+
end
|
19
|
+
end
|
20
|
+
end
|
21
|
+
end
|
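`DocumentStore` overrides `self.new` so that callers get a concrete store without naming it. A self-contained sketch of that factory pattern follows; the two nested classes are stand-ins that only remember their argument, and the top-level `DocumentStore` here is an illustration rather than the gem's class.

```ruby
class DocumentStore
  # Stand-in for the real file-backed store.
  class FileStore
    attr_reader :argument

    def initialize(argument)
      @argument = argument
    end
  end

  # Stand-in for the real Redis-backed store.
  class RedisStore
    attr_reader :argument

    def initialize(argument)
      @argument = argument
    end
  end

  # The nested classes are not subclasses of DocumentStore, so calling
  # `new` on them uses the ordinary Class#new and does not recurse.
  def self.new(argument)
    if argument[%r{\Aredis://}]
      RedisStore.new(argument)
    else
      FileStore.new(argument)
    end
  end
end

puts DocumentStore.new('redis://localhost:6379/0').class
puts DocumentStore.new('scraped_data').class
```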
@@ -0,0 +1,83 @@
|
|
1
|
+
module Pupa
|
2
|
+
class Processor
|
3
|
+
class DocumentStore
|
4
|
+
# Stores JSON documents on disk.
|
5
|
+
#
|
6
|
+
# @see ActiveSupport::Cache::FileStore
|
7
|
+
class FileStore
|
8
|
+
# @param [String] output_dir the directory in which to dump JSON documents
|
9
|
+
def initialize(output_dir)
|
10
|
+
@output_dir = output_dir
|
11
|
+
FileUtils.mkdir_p(@output_dir)
|
12
|
+
end
|
13
|
+
|
14
|
+
# Returns whether a file with the given name exists.
|
15
|
+
#
|
16
|
+
# @param [String] name a key
|
17
|
+
# @return [Boolean] whether the store contains an entry for the given key
|
18
|
+
def exist?(name)
|
19
|
+
File.exist?(namespaced_key(name))
|
20
|
+
end
|
21
|
+
|
22
|
+
# Returns all file names in the storage directory.
|
23
|
+
#
|
24
|
+
# @return [Array<String>] all keys in the store
|
25
|
+
def entries
|
26
|
+
Dir.chdir(@output_dir) do
|
27
|
+
Dir['*.json']
|
28
|
+
end
|
29
|
+
end
|
30
|
+
|
31
|
+
# Returns, as JSON, the contents of the file with the given name.
|
32
|
+
#
|
33
|
+
# @param [String] name a key
|
34
|
+
# @return [Hash] the value of the given key
|
35
|
+
def read(name)
|
36
|
+
File.open(namespaced_key(name)) do |f|
|
37
|
+
JSON.load(f)
|
38
|
+
end
|
39
|
+
end
|
40
|
+
|
41
|
+
# Returns, as JSON, the contents of the files with the given names.
|
42
|
+
#
|
43
|
+
# @param [Array<String>] names keys
|
44
|
+
# @return [Array<Hash>] the values of the given keys
|
45
|
+
def read_multi(names)
|
46
|
+
names.map do |name|
|
47
|
+
read(name)
|
48
|
+
end
|
49
|
+
end
|
50
|
+
|
51
|
+
# Writes, as JSON, the value to a file with the given name.
|
52
|
+
#
|
53
|
+
# @param [String] name a key
|
54
|
+
# @param [Hash] value a value
|
55
|
+
def write(name, value)
|
56
|
+
File.open(namespaced_key(name), 'w') do |f|
|
57
|
+
JSON.dump(value, f)
|
58
|
+
end
|
59
|
+
end
|
60
|
+
|
61
|
+
# Deletes a file with the given name.
|
62
|
+
#
|
63
|
+
# @param [String] name a key
|
64
|
+
def delete(name)
|
65
|
+
File.delete(namespaced_key(name))
|
66
|
+
end
|
67
|
+
|
68
|
+
# Deletes all files in the storage directory.
|
69
|
+
def clear
|
70
|
+
Dir[File.join(@output_dir, '*.json')].each do |path|
|
71
|
+
File.delete(path)
|
72
|
+
end
|
73
|
+
end
|
74
|
+
|
75
|
+
private
|
76
|
+
|
77
|
+
def namespaced_key(name)
|
78
|
+
File.join(@output_dir, name)
|
79
|
+
end
|
80
|
+
end
|
81
|
+
end
|
82
|
+
end
|
83
|
+
end
|
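Both stores expose the same seven methods (`exist?`, `entries`, `read`, `read_multi`, `write`, `delete`, `clear`), so any object with that interface could serve as a store. As an illustration only, here is a hypothetical in-memory store with the same interface; `MemoryStore` does not exist in the gem, and it round-trips values through JSON so that reads return string keys as the real stores do.

```ruby
require 'json'

# Hypothetical in-memory store exposing the same methods as FileStore.
class MemoryStore
  def initialize
    @documents = {}
  end

  def exist?(name)
    @documents.key?(name)
  end

  def entries
    @documents.keys
  end

  def read(name)
    JSON.parse(@documents.fetch(name))
  end

  def read_multi(names)
    names.map { |name| read(name) }
  end

  # Values are serialized on write, as they would be on disk or in Redis.
  def write(name, value)
    @documents[name] = JSON.generate(value)
  end

  def delete(name)
    @documents.delete(name)
  end

  def clear
    @documents.clear
  end
end

store = MemoryStore.new
store.write('foo.json', {name: 'foo'})
p store.read('foo.json')
```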
@@ -0,0 +1,77 @@
|
|
1
|
+
module Pupa
|
2
|
+
class Processor
|
3
|
+
class DocumentStore
|
4
|
+
# Stores JSON documents in Redis.
|
5
|
+
#
|
6
|
+
# Pupa flushes the JSON document store before scraping. If you use Redis,
|
7
|
+
# **DO NOT** share a Redis database with Pupa and other applications. You
|
8
|
+
# can select a different database than the default `0` for use with Pupa
|
9
|
+
# by passing an argument like `redis://localhost:6379/1`.
|
10
|
+
#
|
11
|
+
# @note Redis support depends on the `redis` gem. For better performance,
|
12
|
+
# use the `hiredis` gem as well.
|
13
|
+
class RedisStore
|
14
|
+
# @param [String] address the address (e.g. `redis://localhost:6379/0`)
|
15
|
+
# in which to dump JSON documents
|
16
|
+
def initialize(address)
|
17
|
+
options = {}
|
18
|
+
if defined?(Hiredis)
|
19
|
+
options.update(driver: :hiredis)
|
20
|
+
end
|
21
|
+
@redis = Redis::Store::Factory.create(address, options)
|
22
|
+
end
|
23
|
+
|
24
|
+
# Returns whether the database contains an entry for the given key.
|
25
|
+
#
|
26
|
+
# @param [String] name a key
|
27
|
+
# @return [Boolean] whether the store contains an entry for the given key
|
28
|
+
def exist?(name)
|
29
|
+
@redis.exists(name)
|
30
|
+
end
|
31
|
+
|
32
|
+
# Returns all keys in the database.
|
33
|
+
#
|
34
|
+
# @return [Array<String>] all keys in the store
|
35
|
+
def entries
|
36
|
+
@redis.keys('*')
|
37
|
+
end
|
38
|
+
|
39
|
+
# Returns, as JSON, the value of the given key.
|
40
|
+
#
|
41
|
+
# @param [String] name a key
|
42
|
+
# @return [Hash] the value of the given key
|
43
|
+
def read(name)
|
44
|
+
JSON.load(@redis.get(name))
|
45
|
+
end
|
46
|
+
|
47
|
+
# Returns, as JSON, the values of the given keys.
|
48
|
+
#
|
49
|
+
# @param [Array<String>] names keys
|
50
|
+
# @return [Array<Hash>] the values of the given keys
|
51
|
+
def read_multi(names)
|
52
|
+
@redis.mget(*names).map{|value| JSON.load(value)}
|
53
|
+
end
|
54
|
+
|
55
|
+
# Writes, as JSON, the value to a key.
|
56
|
+
#
|
57
|
+
# @param [String] name a key
|
58
|
+
# @param [Hash] value a value
|
59
|
+
def write(name, value)
|
60
|
+
@redis.set(name, JSON.dump(value))
|
61
|
+
end
|
62
|
+
|
63
|
+
# Deletes a key.
|
64
|
+
#
|
65
|
+
# @param [String] name a key
|
66
|
+
def delete(name)
|
67
|
+
@redis.del(name)
|
68
|
+
end
|
69
|
+
|
70
|
+
# Deletes all keys in the database.
|
71
|
+
def clear
|
72
|
+
@redis.flushdb
|
73
|
+
end
|
74
|
+
end
|
75
|
+
end
|
76
|
+
end
|
77
|
+
end
|
@@ -1,6 +1,4 @@
|
|
1
|
-
#
|
2
|
-
# only GET requests. Using Ruby's refinements doesn't seem to work, possibly
|
3
|
-
# because Faraday caches middlewares.
|
1
|
+
# Caches all requests, not only GET requests.
|
4
2
|
class FaradayMiddleware::Caching
|
5
3
|
def call(env)
|
6
4
|
# Remove if-statement to cache any request, not only GET.
|
@@ -1,9 +1,9 @@
|
|
1
1
|
module Pupa
|
2
|
-
|
2
|
+
module Refinements
|
3
3
|
# A refinement for JSON Schema to validate "email" and "uri" formats. Using
|
4
4
|
# Ruby's refinements doesn't seem to work, possibly because `refine` can't
|
5
5
|
# be used with `prepend`.
|
6
|
-
module
|
6
|
+
module FormatAttribute
|
7
7
|
# @see http://my.rails-royce.org/2010/07/21/email-validation-in-ruby-on-rails-without-regexp/
|
8
8
|
def validate(current_schema, data, fragments, processor, validator, options = {})
|
9
9
|
case current_schema.schema['format']
|
@@ -33,6 +33,6 @@ end
|
|
33
33
|
|
34
34
|
class JSON::Schema::FormatAttribute
|
35
35
|
class << self
|
36
|
-
prepend Pupa::Refinements::
|
36
|
+
prepend Pupa::Refinements::FormatAttribute
|
37
37
|
end
|
38
38
|
end
|
data/lib/pupa/runner.rb
CHANGED
@@ -1,4 +1,3 @@
|
|
1
|
-
require 'fileutils'
|
2
1
|
require 'optparse'
|
3
2
|
require 'ostruct'
|
4
3
|
|
@@ -19,6 +18,7 @@ module Pupa
|
|
19
18
|
output_dir: File.expand_path('scraped_data', Dir.pwd),
|
20
19
|
cache_dir: File.expand_path('web_cache', Dir.pwd),
|
21
20
|
expires_in: 86400, # 1 day
|
21
|
+
validate: true,
|
22
22
|
host_with_port: 'localhost:27017',
|
23
23
|
database: 'pupa',
|
24
24
|
dry_run: false,
|
@@ -72,15 +72,18 @@ module Pupa
|
|
72
72
|
opts.on('-t', '--task TASK', @processor_class.tasks, 'Select a scraping task to run (you may give this switch multiple times)', " (#{@processor_class.tasks.join(', ')})") do |v|
|
73
73
|
options.tasks << v
|
74
74
|
end
|
75
|
-
opts.on('-o', '--output_dir PATH', 'The directory in which to dump JSON documents') do |v|
|
75
|
+
opts.on('-o', '--output_dir PATH', 'The directory or Redis address (e.g. redis://localhost:6379) in which to dump JSON documents') do |v|
|
76
76
|
options.output_dir = v
|
77
77
|
end
|
78
|
-
opts.on('-c', '--cache_dir PATH', 'The directory in which to cache HTTP requests') do |v|
|
78
|
+
opts.on('-c', '--cache_dir PATH', 'The directory or Memcached address (e.g. memcached://localhost:11211) in which to cache HTTP requests') do |v|
|
79
79
|
options.cache_dir = v
|
80
80
|
end
|
81
81
|
opts.on('-e', '--expires_in SECONDS', "The cache's expiration time in seconds") do |v|
|
82
82
|
options.expires_in = v
|
83
83
|
end
|
84
|
+
opts.on('--[no-]validate', 'Validate JSON documents') do |v|
|
85
|
+
options.validate = v
|
86
|
+
end
|
84
87
|
opts.on('-H', '--host HOST:PORT', 'The host and port to MongoDB') do |v|
|
85
88
|
options.host_with_port = v
|
86
89
|
end
|
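The `--[no-]validate` form tells `OptionParser` to accept both `--validate` and `--no-validate` and to pass a boolean to the block. A minimal sketch of just that switch, separate from the runner, is shown here; `parse_validate` is a hypothetical helper, and its default of `true` mirrors the default added to the runner in this diff.

```ruby
require 'optparse'

# Parses only the validation switch and returns the resulting boolean.
def parse_validate(argv)
  validate = true # same default as the runner
  parser = OptionParser.new do |opts|
    opts.on('--[no-]validate', 'Validate JSON documents') do |v|
      validate = v
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left unchanged
  validate
end

p parse_validate(['--no-validate'])
p parse_validate([])
```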
@@ -137,7 +140,12 @@ module Pupa
|
|
137
140
|
options.tasks = @processor_class.tasks
|
138
141
|
end
|
139
142
|
|
140
|
-
processor = @processor_class.new(options.output_dir,
|
143
|
+
processor = @processor_class.new(options.output_dir,
|
144
|
+
cache_dir: options.cache_dir,
|
145
|
+
expires_in: options.expires_in,
|
146
|
+
validate: options.validate,
|
147
|
+
level: options.level,
|
148
|
+
options: Hash[*rest])
|
141
149
|
|
142
150
|
options.actions.each do |action|
|
143
151
|
unless action == 'scrape' || processor.respond_to?(action)
|
@@ -174,13 +182,7 @@ module Pupa
|
|
174
182
|
Pupa.session = Moped::Session.new([options.host_with_port], database: options.database)
|
175
183
|
|
176
184
|
if options.actions.delete('scrape')
|
177
|
-
|
178
|
-
FileUtils.mkdir_p(options.cache_dir)
|
179
|
-
|
180
|
-
Dir[File.join(options.output_dir, '*.json')].each do |path|
|
181
|
-
FileUtils.rm(path)
|
182
|
-
end
|
183
|
-
|
185
|
+
processor.store.clear
|
184
186
|
report[:scrape] = {}
|
185
187
|
options.tasks.each do |task_name|
|
186
188
|
report[:scrape][task_name] = processor.dump_scraped_objects(task_name)
|
data/lib/pupa/version.rb
CHANGED
data/pupa.gemspec
CHANGED
@@ -25,10 +25,12 @@ Gem::Specification.new do |s|
|
|
25
25
|
s.add_runtime_dependency('nokogiri', '~> 1.6.0')
|
26
26
|
|
27
27
|
s.add_development_dependency('coveralls')
|
28
|
+
s.add_development_dependency('dalli')
|
28
29
|
s.add_development_dependency('json', '~> 1.7.7') # to silence coveralls warning
|
30
|
+
s.add_development_dependency('multi_xml')
|
29
31
|
s.add_development_dependency('octokit') # to update Popolo schema
|
30
32
|
s.add_development_dependency('rake')
|
33
|
+
s.add_development_dependency('redis-store')
|
31
34
|
s.add_development_dependency('rspec', '~> 2.10')
|
32
35
|
s.add_development_dependency('vcr', '~> 2.5.0')
|
33
|
-
s.add_development_dependency('multi_xml')
|
34
36
|
end
|
@@ -0,0 +1,62 @@
|
|
1
|
+
---
|
2
|
+
http_interactions:
|
3
|
+
- request:
|
4
|
+
method: get
|
5
|
+
uri: http://example.com/
|
6
|
+
body:
|
7
|
+
encoding: US-ASCII
|
8
|
+
string: ''
|
9
|
+
headers:
|
10
|
+
User-Agent:
|
11
|
+
- Faraday v0.8.8
|
12
|
+
response:
|
13
|
+
status:
|
14
|
+
code: 200
|
15
|
+
message:
|
16
|
+
headers:
|
17
|
+
accept-ranges:
|
18
|
+
- bytes
|
19
|
+
cache-control:
|
20
|
+
- max-age=604800
|
21
|
+
content-type:
|
22
|
+
- text/html
|
23
|
+
date:
|
24
|
+
- Fri, 27 Sep 2013 00:31:23 GMT
|
25
|
+
etag:
|
26
|
+
- '"3012602696"'
|
27
|
+
expires:
|
28
|
+
- Fri, 04 Oct 2013 00:31:23 GMT
|
29
|
+
last-modified:
|
30
|
+
- Fri, 09 Aug 2013 23:54:35 GMT
|
31
|
+
server:
|
32
|
+
- ECS (mdw/13C6)
|
33
|
+
x-cache:
|
34
|
+
- HIT
|
35
|
+
x-ec-custom-error:
|
36
|
+
- '1'
|
37
|
+
content-length:
|
38
|
+
- '1270'
|
39
|
+
connection:
|
40
|
+
- close
|
41
|
+
body:
|
42
|
+
encoding: UTF-8
|
43
|
+
string: "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n
|
44
|
+
\ <meta charset=\"utf-8\" />\n <meta http-equiv=\"Content-type\" content=\"text/html;
|
45
|
+
charset=utf-8\" />\n <meta name=\"viewport\" content=\"width=device-width,
|
46
|
+
initial-scale=1\" />\n <style type=\"text/css\">\n body {\n background-color:
|
47
|
+
#f0f0f2;\n margin: 0;\n padding: 0;\n font-family: \"Open
|
48
|
+
Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n \n }\n
|
49
|
+
\ div {\n width: 600px;\n margin: 5em auto;\n padding:
|
50
|
+
50px;\n background-color: #fff;\n border-radius: 1em;\n }\n
|
51
|
+
\ a:link, a:visited {\n color: #38488f;\n text-decoration:
|
52
|
+
none;\n }\n @media (max-width: 700px) {\n body {\n background-color:
|
53
|
+
#fff;\n }\n div {\n width: auto;\n margin:
|
54
|
+
0 auto;\n border-radius: 0;\n padding: 1em;\n }\n
|
55
|
+
\ }\n </style> \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n
|
56
|
+
\ <p>This domain is established to be used for illustrative examples in
|
57
|
+
documents. You may use this\n domain in examples without prior coordination
|
58
|
+
or asking for permission.</p>\n <p><a href=\"http://www.iana.org/domains/example\">More
|
59
|
+
information...</a></p>\n</div>\n</body>\n</html>\n"
|
60
|
+
http_version:
|
61
|
+
recorded_at: Fri, 27 Sep 2013 00:31:23 GMT
|
62
|
+
recorded_with: VCR 2.5.0
|
@@ -0,0 +1 @@
|
|
1
|
+
{"name":"bar"}
|
@@ -0,0 +1 @@
|
|
1
|
+
{"name":"baz"}
|
@@ -0,0 +1 @@
|
|
1
|
+
{"name":"foo"}
|
data/spec/models/base_spec.rb
CHANGED
@@ -18,9 +18,15 @@ describe Pupa::Base do
|
|
18
18
|
},
|
19
19
|
},
|
20
20
|
}
|
21
|
-
|
21
|
+
|
22
|
+
attr_accessor :label, :founding_date, :inactive, :label_id, :manager_id, :links
|
23
|
+
attr_reader :name
|
22
24
|
foreign_key :label_id, :manager_id
|
23
25
|
foreign_object :label
|
26
|
+
|
27
|
+
def name=(name)
|
28
|
+
@name = name
|
29
|
+
end
|
24
30
|
end
|
25
31
|
end
|
26
32
|
|
@@ -32,25 +38,33 @@ describe Pupa::Base do
|
|
32
38
|
Music::Band.new(properties)
|
33
39
|
end
|
34
40
|
|
35
|
-
describe '
|
41
|
+
describe '.attr_accessor' do
|
42
|
+
it 'should add properties' do
|
43
|
+
[:_id, :_type, :extras, :label, :founding_date, :inactive, :label_id, :manager_id, :links].each do |property|
|
44
|
+
Music::Band.properties.to_a.should include(property)
|
45
|
+
end
|
46
|
+
end
|
47
|
+
end
|
48
|
+
|
49
|
+
describe '.attr_reader' do
|
36
50
|
it 'should add properties' do
|
37
|
-
Music::Band.properties.to_a.should
|
51
|
+
Music::Band.properties.to_a.should include(:name)
|
38
52
|
end
|
39
53
|
end
|
40
54
|
|
41
|
-
describe '
|
55
|
+
describe '.foreign_key' do
|
42
56
|
it 'should add foreign keys' do
|
43
57
|
Music::Band.foreign_keys.to_a.should == [:label_id, :manager_id]
|
44
58
|
end
|
45
59
|
end
|
46
60
|
|
47
|
-
describe '
|
61
|
+
describe '.foreign_object' do
|
48
62
|
it 'should add foreign objects' do
|
49
63
|
Music::Band.foreign_objects.to_a.should == [:label]
|
50
64
|
end
|
51
65
|
end
|
52
66
|
|
53
|
-
describe '
|
67
|
+
describe '.schema=' do
|
54
68
|
let :klass_with_absolute_path do
|
55
69
|
Class.new(Pupa::Base) do
|
56
70
|
self.schema = '/path/to/schema.json'
|
@@ -82,11 +96,13 @@ describe Pupa::Base do
|
|
82
96
|
end
|
83
97
|
|
84
98
|
it 'should accept an absolute path' do
|
85
|
-
|
99
|
+
File.should_receive(:read).and_return('{}')
|
100
|
+
klass_with_absolute_path.json_schema.should == '{}'
|
86
101
|
end
|
87
102
|
|
88
103
|
it 'should accept a relative path' do
|
89
|
-
|
104
|
+
File.should_receive(:read).and_return('{}')
|
105
|
+
klass_with_relative_path.json_schema.should == '{}'
|
90
106
|
end
|
91
107
|
end
|
92
108
|
|
@@ -1,4 +1,15 @@
|
|
1
1
|
require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
|
2
2
|
|
3
3
|
describe Pupa::Processor::Client do
|
4
|
+
describe '.new' do
|
5
|
+
it 'should use the filesystem' do
|
6
|
+
ActiveSupport::Cache::FileStore.should_receive(:new).and_call_original
|
7
|
+
Pupa::Processor::Client.new(cache_dir: '/tmp', level: 'UNKNOWN').get('http://example.com/')
|
8
|
+
end
|
9
|
+
|
10
|
+
it 'should use Memcached' do
|
11
|
+
ActiveSupport::Cache::MemCacheStore.should_receive(:new).and_call_original
|
12
|
+
Pupa::Processor::Client.new(cache_dir: 'memcached://localhost', level: 'UNKNOWN').get('http://example.com/')
|
13
|
+
end
|
14
|
+
end
|
4
15
|
end
|
@@ -0,0 +1,65 @@
|
|
1
|
+
require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
|
2
|
+
|
3
|
+
describe Pupa::Processor::DocumentStore::FileStore do
|
4
|
+
let :store do
|
5
|
+
Pupa::Processor::DocumentStore::FileStore.new(File.expand_path(File.join('..', '..', 'fixtures'), __dir__))
|
6
|
+
end
|
7
|
+
|
8
|
+
describe '#exist?' do
|
9
|
+
it 'should return true if the store contains an entry for the given key' do
|
10
|
+
store.exist?('foo.json').should == true
|
11
|
+
end
|
12
|
+
|
13
|
+
it 'should return false if the store does not contain an entry for the given key' do
|
14
|
+
store.exist?('nonexistent').should == false
|
15
|
+
end
|
16
|
+
end
|
17
|
+
|
18
|
+
describe '#entries' do
|
19
|
+
it 'should return all keys in the store' do
|
20
|
+
store.entries.sort.should == %w(bar.json baz.json foo.json)
|
21
|
+
end
|
22
|
+
end
|
23
|
+
|
24
|
+
describe '#read' do
|
25
|
+
it 'should return the value of the given key' do
|
26
|
+
store.read('foo.json').should == {'name' => 'foo'}
|
27
|
+
end
|
28
|
+
end
|
29
|
+
|
30
|
+
describe '#read_multi' do
|
31
|
+
it 'should return the values of the given keys' do
|
32
|
+
store.read_multi(%w(foo.json bar.json)).should == [{'name' => 'foo'}, {'name' => 'bar'}]
|
33
|
+
end
|
34
|
+
end
|
35
|
+
|
36
|
+
describe '#write' do
|
37
|
+
it 'should write an entry with the given value for the given key' do
|
38
|
+
store.exist?('new.json').should == false
|
39
|
+
store.write('new.json', {'name' => 'new'})
|
40
|
+
store.read('new.json').should == {'name' => 'new'}
|
41
|
+
store.delete('new.json') # cleanup
|
42
|
+
end
|
43
|
+
end
|
44
|
+
|
45
|
+
describe '#delete' do
|
46
|
+
it 'should delete an entry with the given key from the store' do
|
47
|
+
store.write('new.json', {'name' => 'new'})
|
48
|
+
store.exist?('new.json').should == true
|
49
|
+
store.delete('new.json')
|
50
|
+
store.exist?('new.json').should == false
|
51
|
+
end
|
52
|
+
end
|
53
|
+
|
54
|
+
describe '#clear' do
|
55
|
+
it 'should delete all entries from the store' do
|
56
|
+
store.entries.sort.should == %w(bar.json baz.json foo.json)
|
57
|
+
store.clear
|
58
|
+
store.entries.should == []
|
59
|
+
|
60
|
+
%w(bar baz foo).each do |name| # cleanup
|
61
|
+
store.write("#{name}.json", {'name' => name})
|
62
|
+
end
|
63
|
+
end
|
64
|
+
end
|
65
|
+
end
|
data/spec/processor/document_store/redis_store_spec.rb
ADDED
@@ -0,0 +1,71 @@
+require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
+
+describe Pupa::Processor::DocumentStore::RedisStore do
+  def store
+    Pupa::Processor::DocumentStore::RedisStore.new('redis://localhost/15')
+  end
+
+  before :all do
+    %w(foo bar baz).each do |name|
+      store.write("#{name}.json", {'name' => name})
+    end
+  end
+
+  describe '#exist?' do
+    it 'should return true if the store contains an entry for the given key' do
+      store.exist?('foo.json').should == true
+    end
+
+    it 'should return false if the store does not contain an entry for the given key' do
+      store.exist?('nonexistent').should == false
+    end
+  end
+
+  describe '#entries' do
+    it 'should return all keys in the store' do
+      store.entries.sort.should == %w(bar.json baz.json foo.json)
+    end
+  end
+
+  describe '#read' do
+    it 'should return the value of the given key' do
+      store.read('foo.json').should == {'name' => 'foo'}
+    end
+  end
+
+  describe '#read_multi' do
+    it 'should return the values of the given keys' do
+      store.read_multi(%w(foo.json bar.json)).should == [{'name' => 'foo'}, {'name' => 'bar'}]
+    end
+  end
+
+  describe '#write' do
+    it 'should write an entry with the given value for the given key' do
+      store.exist?('new.json').should == false
+      store.write('new.json', {'name' => 'new'})
+      store.read('new.json').should == {'name' => 'new'}
+      store.delete('new.json') # cleanup
+    end
+  end
+
+  describe '#delete' do
+    it 'should delete an entry with the given key from the store' do
+      store.write('new.json', {'name' => 'new'})
+      store.exist?('new.json').should == true
+      store.delete('new.json')
+      store.exist?('new.json').should == false
+    end
+  end
+
+  describe '#clear' do
+    it 'should delete all entries from the store' do
+      store.entries.sort.should == %w(bar.json baz.json foo.json)
+      store.clear
+      store.entries.should == []
+
+      %w(bar baz foo).each do |name| # cleanup
+        store.write("#{name}.json", {'name' => name})
+      end
+    end
+  end
+end
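The file store and Redis store specs exercise the same set of methods: `exist?`, `entries`, `read`, `read_multi`, `write`, `delete` and `clear`. As a rough illustration of that shared interface, here is a minimal in-memory sketch. `MemoryStore` is a hypothetical class that is not part of pupa, and the real `FileStore` and `RedisStore` implementations may behave differently in details the specs do not cover.

```ruby
# Hypothetical in-memory store (an assumption for illustration, not pupa code).
# It mirrors the methods that the FileStore and RedisStore specs call.
class MemoryStore
  def initialize
    @documents = {}
  end

  # Returns true if the store contains an entry for the given key.
  def exist?(name)
    @documents.key?(name)
  end

  # Returns all keys in the store.
  def entries
    @documents.keys
  end

  # Returns the value stored under the given key.
  def read(name)
    @documents[name]
  end

  # Returns the values of the given keys, in the order requested.
  def read_multi(names)
    names.map { |name| read(name) }
  end

  # Writes an entry with the given value for the given key.
  def write(name, value)
    @documents[name] = value
  end

  # Deletes the entry with the given key.
  def delete(name)
    @documents.delete(name)
  end

  # Deletes all entries.
  def clear
    @documents.clear
  end
end

# Same fixture setup as the `before :all` block in the spec.
store = MemoryStore.new
%w(foo bar baz).each do |name|
  store.write("#{name}.json", {'name' => name})
end
```

Because the specs only rely on these seven methods, any object that responds to them in the same way could in principle be checked with the same examples.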
data/spec/processor/document_store_spec.rb
ADDED
@@ -0,0 +1,15 @@
+require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+describe Pupa::Processor::DocumentStore do
+  describe '.new' do
+    it 'should use the filesystem' do
+      Pupa::Processor::DocumentStore::FileStore.should_receive(:new).with('/tmp').and_call_original
+      Pupa::Processor::DocumentStore.new('/tmp')
+    end
+
+    it 'should use Redis' do
+      Pupa::Processor::DocumentStore::RedisStore.should_receive(:new).with('redis://localhost').and_call_original
+      Pupa::Processor::DocumentStore.new('redis://localhost')
+    end
+  end
+end
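The `DocumentStore` spec only checks that `DocumentStore.new` builds a `FileStore` when given `/tmp` and a `RedisStore` when given `redis://localhost`. One way such a dispatch could work is sketched below. `SketchDocumentStore`, the `Struct` stand-ins and the `redis://` prefix check are all assumptions made for illustration; the diff shown here does not reveal how pupa actually chooses the backend.

```ruby
# Hypothetical stand-in for Pupa::Processor::DocumentStore (not pupa code).
module SketchDocumentStore
  # Minimal placeholders for the two backends named in the spec.
  FileStore = Struct.new(:output_dir)
  RedisStore = Struct.new(:address)

  # Assumption: an argument that begins with "redis://" selects the Redis
  # backend, and anything else is treated as a filesystem path.
  def self.new(argument)
    if argument.start_with?('redis://')
      RedisStore.new(argument)
    else
      FileStore.new(argument)
    end
  end
end

file_store = SketchDocumentStore.new('/tmp')
redis_store = SketchDocumentStore.new('redis://localhost')
```

Defining `new` on a module keeps the caller's code identical for both backends, which is the property the spec's two examples rely on.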
data/spec/processor/persistence_spec.rb
CHANGED
@@ -11,7 +11,7 @@ describe Pupa::Processor::Persistence do
     Pupa.session[:people].insert(_type: 'pupa/person', name: 'non-unique')
   end
 
-  describe '
+  describe '.find' do
     it 'should return nil if no matches' do
       Pupa::Processor::Persistence.find(_type: 'pupa/person', name: 'nonexistent').should == nil
     end
data/spec/processor_spec.rb
CHANGED
@@ -31,6 +31,10 @@ describe Pupa::Processor do
     PersonProcessor.new('/tmp', level: 'WARN', logdev: io)
   end
 
+  let :novalidate do
+    PersonProcessor.new('/tmp', level: 'WARN', logdev: io, validate: false)
+  end
+
   describe '#get' do
     it 'should send a GET request' do
       processor.get('http://httpbin.org/get', 'foo=bar')['args'].should == {'foo' => 'bar'}
@@ -51,7 +55,7 @@ describe Pupa::Processor do
     end
   end
 
-  describe '
+  describe '.add_scraping_task' do
     it 'should add a scraping task and define a lazy method' do
       PersonProcessor.tasks.should == [:people]
       processor.should respond_to(:people)
@@ -64,9 +68,9 @@ describe Pupa::Processor do
     end
 
     it 'should not overwrite an existing file' do
-
+      File.open(path, 'w') {}
       expect{processor.dump_scraped_objects(:people)}.to raise_error(Pupa::Errors::DuplicateObjectIdError)
-
+      File.delete(path)
     end
 
     it 'should dump a JSON document' do
@@ -80,6 +84,12 @@ describe Pupa::Processor do
       processor.dump_scraped_objects(:people)
       io.string.should match('http://popoloproject.com/schemas/person.json')
     end
+
+    it 'should not validate the object' do
+      novalidate.make_person_invalid
+      novalidate.dump_scraped_objects(:people)
+      io.string.should_not match('http://popoloproject.com/schemas/person.json')
+    end
   end
 
   describe '#import' do
data/spec/spec_helper.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pupa
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.8
 platform: ruby
 authors:
 - Open North
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-09-
+date: 2013-09-27 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activesupport
@@ -122,6 +122,20 @@ dependencies:
     - - '>='
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: dalli
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: json
   requirement: !ruby/object:Gem::Requirement
@@ -136,6 +150,20 @@ dependencies:
     - - ~>
       - !ruby/object:Gem::Version
         version: 1.7.7
+- !ruby/object:Gem::Dependency
+  name: multi_xml
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - '>='
+      - !ruby/object:Gem::Version
+        version: '0'
 - !ruby/object:Gem::Dependency
   name: octokit
   requirement: !ruby/object:Gem::Requirement
@@ -165,47 +193,47 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: redis-store
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
-        version: '
+        version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
-        version: '
+        version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: 2.
+        version: '2.10'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ~>
       - !ruby/object:Gem::Version
-        version: 2.
+        version: '2.10'
 - !ruby/object:Gem::Dependency
-  name:
+  name: vcr
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - -
+    - - ~>
       - !ruby/object:Gem::Version
-        version:
+        version: 2.5.0
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - -
+    - - ~>
       - !ruby/object:Gem::Version
-        version:
+        version: 2.5.0
 description:
 email:
 - info@opennorth.ca
@@ -240,6 +268,9 @@ files:
 - lib/pupa/processor.rb
 - lib/pupa/processor/client.rb
 - lib/pupa/processor/dependency_graph.rb
+- lib/pupa/processor/document_store.rb
+- lib/pupa/processor/document_store/file_store.rb
+- lib/pupa/processor/document_store/redis_store.rb
 - lib/pupa/processor/helper.rb
 - lib/pupa/processor/middleware/logger.rb
 - lib/pupa/processor/middleware/parse_html.rb
@@ -264,6 +295,10 @@ files:
 - spec/cassettes/ce69ff734ce852d2bfaa482bbf55d7ffb4762e87.yml
 - spec/cassettes/da629b01e0836deda8a5540a4e6a08783dd7aef9.yml
 - spec/cassettes/e398f35bea86b3d4c87a6934bae1eb7fca8744f9.yml
+- spec/cassettes/f861172f1df3bdb2052af5451f9922699d574b77.yml
+- spec/fixtures/bar.json
+- spec/fixtures/baz.json
+- spec/fixtures/foo.json
 - spec/logger_spec.rb
 - spec/models/base_spec.rb
 - spec/models/concerns/contactable_spec.rb
@@ -280,6 +315,9 @@ files:
 - spec/models/post_spec.rb
 - spec/processor/client_spec.rb
 - spec/processor/dependency_graph_spec.rb
+- spec/processor/document_store/file_store_spec.rb
+- spec/processor/document_store/redis_store_spec.rb
+- spec/processor/document_store_spec.rb
 - spec/processor/helper_spec.rb
 - spec/processor/middleware/logger_spec.rb
 - spec/processor/middleware/parse_html_spec.rb
@@ -319,6 +357,10 @@ test_files:
 - spec/cassettes/ce69ff734ce852d2bfaa482bbf55d7ffb4762e87.yml
 - spec/cassettes/da629b01e0836deda8a5540a4e6a08783dd7aef9.yml
 - spec/cassettes/e398f35bea86b3d4c87a6934bae1eb7fca8744f9.yml
+- spec/cassettes/f861172f1df3bdb2052af5451f9922699d574b77.yml
+- spec/fixtures/bar.json
+- spec/fixtures/baz.json
+- spec/fixtures/foo.json
 - spec/logger_spec.rb
 - spec/models/base_spec.rb
 - spec/models/concerns/contactable_spec.rb
@@ -335,6 +377,9 @@ test_files:
 - spec/models/post_spec.rb
 - spec/processor/client_spec.rb
 - spec/processor/dependency_graph_spec.rb
+- spec/processor/document_store/file_store_spec.rb
+- spec/processor/document_store/redis_store_spec.rb
+- spec/processor/document_store_spec.rb
 - spec/processor/helper_spec.rb
 - spec/processor/middleware/logger_spec.rb
 - spec/processor/middleware/parse_html_spec.rb