spidey-mongo 0.2.0 → 0.3.0
- checksums.yaml +7 -0
- data/.travis.yml +11 -0
- data/CHANGELOG.md +15 -0
- data/CONTRIBUTING.md +116 -0
- data/Gemfile +13 -1
- data/LICENSE.txt +2 -2
- data/README.md +49 -43
- data/Rakefile +12 -1
- data/lib/spidey-mongo.rb +3 -1
- data/lib/spidey-mongo/version.rb +1 -1
- data/lib/spidey/strategies/mongo.rb +9 -12
- data/lib/spidey/strategies/mongo2.rb +59 -0
- data/lib/spidey/strategies/moped.rb +9 -12
- data/spec/spec_helper.rb +12 -2
- data/spec/spidey/strategies/mongo2_spec.rb +61 -0
- data/spec/spidey/strategies/mongo_spec.rb +24 -25
- data/spec/spidey/strategies/moped_spec.rb +24 -25
- data/spidey-mongo.gemspec +14 -17
- metadata +19 -75
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 096f00db0e8368887d3546d1909984af270ef83e
+  data.tar.gz: dbc9ebec27c141264076557098c83e7a1576dc28
+SHA512:
+  metadata.gz: 50b3fc0d2fa3ea7837ff06e5d2f3f5bebf99047d10a43ff27dff755df492a95c649532b0ab5d0b04b8d37755b96a5558237d4c303e89f89bafc1f602e00d0e24
+  data.tar.gz: 6392e63f6d3eba223dade821bedf9c56bf44b1cf48146e0906fb74186ff81d80368522b4f1dbfe9d2cd14335870e64455c66c2ab86c60abfc1c506de00d3e9a6
data/.travis.yml
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,15 @@
+### Next
+
+* Your contribution here...
+
+### 0.3.0
+
+* [#3](https://github.com/joeyAghion/spidey-mongo/pull/3): Added support for Mongo Ruby Driver 2.x - [@dblock](https://github.com/dblock).
+
+### 0.2.0
+
+* [#1](https://github.com/joeyAghion/spidey-mongo/pull/1): Added support for Moped - [@fancyremarker](https://github.com/fancyremarker).
+
+### 0.1.0
+
+* Initial public release - [@joeyAghion](https://github.com/joeyAghion).
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,116 @@
+Contributing
+============
+
+This gem is work of [many of contributors](https://github.com/joeyAghion/spidey-mongo/graphs/contributors). You're encouraged to submit [pull requests](https://github.com/joeyAghion/spidey-mongo/pulls), [propose features, ask questions and discuss issues](https://github.com/joeyAghion/spidey-mongo/issues).
+
+#### Fork the Project
+
+Fork the [project on Github](https://github.com/joeyAghion/spidey-mongo) and check out your copy.
+
+```
+git clone https://github.com/contributor/spidey-mongo.git
+cd spidey-mongo
+git remote add upstream https://github.com/joeyAghion/spidey-mongo.git
+```
+
+#### Create a Topic Branch
+
+Make sure your fork is up-to-date and create a topic branch for your feature or bug fix.
+
+```
+git checkout master
+git pull upstream master
+git checkout -b my-feature-branch
+```
+
+#### Bundle Install and Test
+
+Ensure that you can build the project and run tests.
+
+```
+bundle install
+bundle exec rake
+```
+
+#### Write Tests
+
+Try to write a test that reproduces the problem you're trying to fix or describes a feature that you want to build. Add to [spec/mongoid](spec/mongoid).
+
+We definitely appreciate pull requests that highlight or reproduce a problem, even without a fix.
+
+#### Write Code
+
+Implement your feature or bug fix.
+
+Make sure that `bundle exec rake` completes without errors.
+
+#### Write Documentation
+
+Document any external behavior in the [README](README.md).
+
+#### Update Changelog
+
+Add a line to [CHANGELOG](CHANGELOG.md) under *Next*. Make it look like every other line, including your name and link to your Github account.
+
+#### Commit Changes
+
+Make sure git knows your name and email address:
+
+```
+git config --global user.name "Your Name"
+git config --global user.email "contributor@example.com"
+```
+
+Writing good commit logs is important. A commit log should describe what changed and why.
+
+```
+git add ...
+git commit
+```
+
+#### Push
+
+```
+git push origin my-feature-branch
+```
+
+#### Make a Pull Request
+
+Go to https://github.com/contributor/spidey-mongo and select your feature branch. Click the 'Pull Request' button and fill out the form. Pull requests are usually reviewed within a few days.
+
+#### Rebase
+
+If you've been working on a change for a while, rebase with upstream/master.
+
+```
+git fetch upstream
+git rebase upstream/master
+git push origin my-feature-branch -f
+```
+
+#### Update CHANGELOG Again
+
+Update the [CHANGELOG](CHANGELOG.md) with the pull request number. A typical entry looks as follows.
+
+```
+* [#123](https://github.com/joeyAghion/spidey-mongo/pull/123): Reticulated splines - [@contributor](https://github.com/contributor).
+```
+
+Amend your previous commit and force push the changes.
+
+```
+git commit --amend
+git push origin my-feature-branch -f
+```
+
+#### Check on Your Pull Request
+
+Go back to your pull request after a few minutes and see whether it passed muster with Travis-CI. Everything should look green, otherwise fix issues and amend your commit as described above.
+
+#### Be Patient
+
+It's likely that your change will not be merged and that the nitpicky maintainers will ask you to do more, or fix seemingly benign problems. Hang on there!
+
+#### Thank You
+
+Please do know that we really appreciate and value your time and work. We love you, really.
data/Gemfile
CHANGED
@@ -1,4 +1,16 @@
-source
+source 'http://rubygems.org'
+
+case version = ENV['MONGO_VERSION'] || 'mongo2'
+when /^moped/
+  gem 'moped', '~> 2.0'
+when /^mongo2/
+  gem 'mongo', '~> 2.0'
+when /^mongo/
+  gem 'mongo', '~> 1.12'
+  gem 'bson_ext'
+else
+  fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+end

 # Specify your gem's dependencies in spidey-mongo.gemspec

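(This switch makes the installed driver selectable at bundle time: e.g. `MONGO_VERSION=moped bundle install` would pull in `moped` 2.x instead of the default `mongo` 2.x.)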
data/LICENSE.txt
CHANGED
@@ -1,4 +1,4 @@
-Copyright (c) 2012 Joey Aghion,
+Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors

 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
@@ -17,4 +17,4 @@ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
CHANGED
@@ -1,6 +1,9 @@
 Spidey-Mongo
 ============

+[![Build Status](https://travis-ci.org/joeyAghion/spidey-mongo.svg?branch=master)](https://travis-ci.org/joeyAghion/spidey-mongo)
+[![Gem Version](https://badge.fury.io/rb/spidey-mongo.svg)](https://badge.fury.io/rb/spidey-mongo)
+
 This gem implements a [MongoDB](http://www.mongodb.org/) back-end for [Spidey](https://github.com/joeyAghion/spidey), a very simple framework for crawling and scraping web sites.

 See [Spidey](https://githubcom/joeyAghion/spidey)'s documentation for a basic example spider class.
@@ -12,45 +15,52 @@ Usage

 ### Install the gem

-
-
+``` ruby
+gem install spidey-mongo
+```

 ### `mongo` versus `moped`

-Spidey-Mongo provides
+Spidey-Mongo provides three strategies:

-* `Spidey::Strategies::Mongo`: Compatible with
-* `Spidey::Strategies::
+* `Spidey::Strategies::Mongo`: Compatible with Mongo Ruby Driver 1.x, [`mongo`](https://github.com/mongodb/mongo-ruby-driver)
+* `Spidey::Strategies::Mongo2`: Compatible with Mongo Ruby Driver 2.x, [`mongo`](https://github.com/mongodb/mongo-ruby-driver), e.g., for use with Mongoid 5.x
+* `Spidey::Strategies::Moped`: Compatible with the [`moped`](https://github.com/mongoid/moped) 2.x, e.g., for use with Mongoid 3.x and 4.x

 You can include either strategy in your classes, as appropriate. All the examples in this README assume `Spidey::Strategies::Mongo`.

-
 ### Example spider class

-
-
-
-
-
-
-
-
-
+```ruby
+class EbaySpider < Spidey::AbstractSpider
+  include Spidey::Strategies::Mongo
+
+  handle "http://www.ebay.com", :process_home
+
+  def process_home(page, default_data = {})
+    # ...
+  end
+end
+```

 ### Invocation

 The spider's constructor accepts new parameters for each of the MongoDB collections to employ: `url_collection`, `result_collection`, and `error_collection`.

-
-
-
-
-
-
+```ruby
+db = Mongo::Connection.new['example']
+
+spider = EbaySpider.new(
+  url_collection: db['urls'],
+  result_collection: db['results'],
+  error_collection: db['errors'])
+```

 With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The `crawl` method accepts a new optional `crawl_for` parameter specifying the number of seconds after which to stop.

-
+```
+spider.crawl crawl_for: 600 # seconds, or more conveniently (w/ActiveSupport): 10.minutes
+```

 (The base implementation's `max_urls` parameter is also useful for this purpose.)

@@ -58,32 +68,28 @@ With persistent storage of the URL-crawling queue, it's now possible to stop cra

 By default, invocations of `record(data)` by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, define a `result_key` method that returns a key by which to find the corresponding document. The method is called with a hash of the data being recorded:

-
-
-
-    def result_key(data)
-      data[:detail_url]
-    end
+```ruby
+class EbaySpider < Spidey::AbstractSpider
+  include Spidey::Strategies::Mongo

-
-
+  def result_key(data)
+    data[:detail_url]
+  end

-
+  # ...
+end
+```

-
--------
-
-bundle exec rspec
-
-Contributors
-------------
+This performs an `upsert` instead of the usual `insert` (i.e., an update if a result document matching the key already exists, or insert otherwise).

-
+Contrbuting
+-----------

-
------
-* Extract behaviors shared by `Mongo` and `Moped` strategies.
+Please contribute! See [CONTRIBUTING](CONTRIBUTING.md) for details.

 Copyright
 ---------
-
+
+Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors.
+
+See [LICENSE.txt](LICENSE.txt) for further details.
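(The invocation example above still targets the 1.x driver's `Mongo::Connection`. As a rough sketch only, not taken from this diff: wiring the new `Mongo2` strategy to a driver 2.x client might look like the following, where the spider class and database/collection names are purely illustrative and the connection URL mirrors the one used in the new mongo2 spec further down.)

```ruby
require 'mongo'
require 'spidey-mongo'

class ExampleSpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo2

  handle "http://www.example.com", :process_home

  def process_home(page, default_data = {})
    # extract data from the page, then record(...) results or handle(...) further URLs
  end
end

# Driver 2.x client; database, collection, and class names are illustrative.
client = Mongo::Client.new('mongodb://127.0.0.1:27017/example')

spider = ExampleSpider.new(
  url_collection: client['urls'],
  result_collection: client['results'],
  error_collection: client['errors'])

spider.crawl crawl_for: 600 # stop pulling new URLs after ten minutes
```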
data/Rakefile
CHANGED
@@ -1 +1,12 @@
-require
+require 'bundler/gem_tasks'
+
+Bundler.setup :default, :development
+
+require 'rspec/core'
+require 'rspec/core/rake_task'
+
+RSpec::Core::RakeTask.new(:spec) do |spec|
+  spec.pattern = FileList["spec/**/#{ENV['MONGO_VERSION'] || 'mongo2'}_spec.rb"]
+end
+
+task default: :spec
data/lib/spidey-mongo.rb
CHANGED
data/lib/spidey-mongo/version.rb
CHANGED
data/lib/spidey/strategies/mongo.rb
CHANGED
@@ -18,8 +18,8 @@ module Spidey::Strategies
     def handle(url, handler, default_data = {})
       Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
       url_collection.update(
-        {'spider' => self.class.name, 'url' => url},
-        {'$set' => {'handler' => handler, 'default_data' => default_data}},
+        { 'spider' => self.class.name, 'url' => url },
+        { '$set' => { 'handler' => handler, 'default_data' => default_data } },
         upsert: true
       )
     end
@@ -28,16 +28,16 @@ module Spidey::Strategies
       doc = data.merge('spider' => self.class.name)
       Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
       if respond_to?(:result_key) && key = result_key(doc)
-        result_collection.update({'key' => key}, {'$set' => doc}, upsert: true)
+        result_collection.update({ 'key' => key }, { '$set' => doc }, upsert: true)
       else
         result_collection.insert doc
       end
     end

-    def each_url(&
+    def each_url(&_block)
       while url = get_next_url
-        break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at
-        url_collection.update({'_id' => url['_id']}, '$set' => {last_crawled_at: Time.now})
+        break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+        url_collection.update({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
         yield url['url'], url['handler'], url['default_data'].symbolize_keys
       end
     end
@@ -49,14 +49,11 @@ module Spidey::Strategies
       Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
     end

-
+    private

     def get_next_url
-      return nil if
-      url_collection.find_one({spider: self.class.name},
-        sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]
-      })
+      return nil if @until && Time.now >= @until # exceeded time bound
+      url_collection.find_one({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]])
     end
-
   end
 end
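(Taken together, these hunks show the queue discipline: `get_next_url` returns the least-recently-crawled URL for the spider, `each_url` stamps `last_crawled_at` before yielding, and the loop stops once it cycles back to a URL already stamped in the current batch, or once the `@until` deadline set via `crawl_for` passes.)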
data/lib/spidey/strategies/mongo2.rb
ADDED
@@ -0,0 +1,59 @@
+module Spidey::Strategies
+  module Mongo2
+    attr_accessor :url_collection, :result_collection, :error_collection
+
+    def initialize(attrs = {})
+      self.url_collection = attrs.delete(:url_collection)
+      self.result_collection = attrs.delete(:result_collection)
+      self.error_collection = attrs.delete(:error_collection)
+      super attrs
+    end
+
+    def crawl(options = {})
+      @crawl_started_at = Time.now
+      @until = Time.now + options[:crawl_for] if options[:crawl_for]
+      super options
+    end
+
+    def handle(url, handler, default_data = {})
+      Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
+      url_collection.update_one(
+        { 'spider' => self.class.name, 'url' => url },
+        { '$set' => { 'handler' => handler, 'default_data' => default_data } },
+        upsert: true
+      )
+    end
+
+    def record(data)
+      doc = data.merge('spider' => self.class.name)
+      Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
+      if respond_to?(:result_key) && key = result_key(doc)
+        result_collection.update_one({ 'key' => key }, { '$set' => doc }, upsert: true)
+      else
+        result_collection.insert_one doc
+      end
+    end
+
+    def each_url(&_block)
+      while url = get_next_url
+        break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+        url_collection.update_one({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
+        yield url['url'], url['handler'], url['default_data'].symbolize_keys
+      end
+    end
+
+    def add_error(attrs)
+      error = attrs.delete(:error)
+      doc = attrs.merge(created_at: Time.now, error: error.class.name, message: error.message, spider: self.class.name)
+      error_collection.insert_one doc
+      Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
+    end
+
+    private
+
+    def get_next_url
+      return nil if @until && Time.now >= @until # exceeded time bound
+      url_collection.find({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]).first
+    end
+  end
+end
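(The new `Mongo2` strategy mirrors the `Mongo` strategy above almost line for line; the substantive differences are the driver 2.x method names: `update_one` and `insert_one` in place of the 1.x `update` and `insert`, and `find(...).first` in place of `find_one`.)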
data/lib/spidey/strategies/moped.rb
CHANGED
@@ -18,9 +18,9 @@ module Spidey::Strategies
     def handle(url, handler, default_data = {})
       Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
       url_collection.find(
-
+        'spider' => self.class.name, 'url' => url
       ).upsert(
-
+        '$set' => { 'handler' => handler, 'default_data' => default_data }
       )
     end

@@ -28,16 +28,16 @@ module Spidey::Strategies
       doc = data.merge('spider' => self.class.name)
       Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
       if respond_to?(:result_key) && key = result_key(doc)
-        result_collection.find(
+        result_collection.find('key' => key).upsert('$set' => doc)
       else
         result_collection.insert doc
       end
     end

-    def each_url(&
+    def each_url(&_block)
       while url = get_next_url
-        break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at
-        url_collection.find(
+        break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+        url_collection.find('_id' => url['_id']).update('$set' => { last_crawled_at: Time.now })
         yield url['url'], url['handler'], url['default_data'].symbolize_keys
       end
     end
@@ -49,14 +49,11 @@ module Spidey::Strategies
       Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
     end

-
+    private

     def get_next_url
-      return nil if
-      url_collection.find(
-        'last_crawled_at' => 1, '_id' => 1
-      }).first
+      return nil if @until && Time.now >= @until # exceeded time bound
+      url_collection.find(spider: self.class.name).sort('last_crawled_at' => 1, '_id' => 1).first
     end
-
   end
 end
data/spec/spec_helper.rb
CHANGED
@@ -1,8 +1,18 @@
-
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+
+case version = ENV['MONGO_VERSION'] || 'mongo2'
+when /^moped/
+  require 'moped'
+when /^mongo/
+  require 'mongo'
+else
+  fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+end
+
 require 'spidey-mongo'

 RSpec.configure do |config|
-  config.treat_symbols_as_metadata_keys_with_true_values = true
   config.run_all_when_everything_filtered = true
   config.filter_run :focus
+  config.raise_errors_for_deprecations!
 end
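(Per the new specs below, running the suite assumes a MongoDB server on `127.0.0.1:27017`. The removed `treat_symbols_as_metadata_keys_with_true_values` option no longer exists in RSpec 3, and `raise_errors_for_deprecations!` is an RSpec 3 setting that turns deprecation warnings into errors.)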
data/spec/spidey/strategies/mongo2_spec.rb
ADDED
@@ -0,0 +1,61 @@
+require 'spec_helper'
+require 'mongo'
+
+describe Spidey::Strategies::Mongo do
+  class TestMongoSpider < Spidey::AbstractSpider
+    include Spidey::Strategies::Mongo2
+    handle 'http://www.cnn.com', :process_home
+
+    def result_key(data)
+      data[:detail_url]
+    end
+  end
+
+  before(:each) do
+    @db = Mongo::Client.new('mongodb://127.0.0.1:27017/spidey-mongo-test')
+    @spider = TestMongoSpider.new(
+      url_collection: @db['urls'],
+      result_collection: @db['results'],
+      error_collection: @db['errors'])
+  end
+
+  after(:each) do
+    %w( urls results errors ).each { |col| @db[col].drop }
+  end
+
+  it 'should add initial URLs to collection' do
+    doc = @db['urls'].find(url: 'http://www.cnn.com').first
+    expect(doc['handler']).to eq(:process_home)
+    expect(doc['spider']).to eq('TestMongoSpider')
+  end
+
+  it 'should not add duplicate URLs' do
+    @spider.send :handle, 'http://www.cnn.com', :process_home
+    expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
+  end
+
+  it 'should add results' do
+    @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
+    expect(@db['results'].count).to eq(1)
+    doc = @db['results'].find.first
+    expect(doc['detail_url']).to eq('http://www.cnn.com')
+    expect(doc['foo']).to eq('bar')
+    expect(doc['spider']).to eq('TestMongoSpider')
+  end
+
+  it 'should update existing result' do
+    @db['results'].insert_one key: 'http://foo.bar', detail_url: 'http://foo.bar'
+    @spider.record detail_url: 'http://foo.bar', foo: 'bar'
+    expect(@db['results'].count).to eq(1)
+  end
+
+  it 'should add error' do
+    @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
+    doc = @db['errors'].find.first
+    expect(doc['error']).to eq('Exception')
+    expect(doc['url']).to eq('http://www.cnn.com')
+    expect(doc['handler']).to eq(:blah)
+    expect(doc['message']).to eq('WTF')
+    expect(doc['spider']).to eq('TestMongoSpider')
+  end
+end
data/spec/spidey/strategies/mongo_spec.rb
CHANGED
@@ -4,7 +4,7 @@ require 'mongo'
 describe Spidey::Strategies::Mongo do
   class TestMongoSpider < Spidey::AbstractSpider
     include Spidey::Strategies::Mongo
-    handle
+    handle 'http://www.cnn.com', :process_home

     def result_key(data)
       data[:detail_url]
@@ -20,43 +20,42 @@ describe Spidey::Strategies::Mongo do
   end

   after(:each) do
-    %w
+    %w( urls results errors ).each { |col| @db[col].drop }
   end

-  it
-    doc = @db['urls'].find_one(url:
-    doc['handler'].
-    doc['spider'].
+  it 'should add initial URLs to collection' do
+    doc = @db['urls'].find_one(url: 'http://www.cnn.com')
+    expect(doc['handler']).to eq(:process_home)
+    expect(doc['spider']).to eq('TestMongoSpider')
   end

-  it
-    @spider.send :handle,
-    @db['urls'].find(url:
+  it 'should not add duplicate URLs' do
+    @spider.send :handle, 'http://www.cnn.com', :process_home
+    expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
   end

-  it
+  it 'should add results' do
     @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
-    @db['results'].count.
+    expect(@db['results'].count).to eq(1)
     doc = @db['results'].find_one
-    doc['detail_url'].
-    doc['foo'].
-    doc['spider'].
+    expect(doc['detail_url']).to eq('http://www.cnn.com')
+    expect(doc['foo']).to eq('bar')
+    expect(doc['spider']).to eq('TestMongoSpider')
   end

-  it
+  it 'should update existing result' do
     @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
     @spider.record detail_url: 'http://foo.bar', foo: 'bar'
-    @db['results'].count.
+    expect(@db['results'].count).to eq(1)
   end

-  it
-    @spider.add_error error: Exception.new(
+  it 'should add error' do
+    @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
     doc = @db['errors'].find_one
-    doc['error'].
-    doc['url'].
-    doc['handler'].
-    doc['message'].
-    doc['spider'].
+    expect(doc['error']).to eq('Exception')
+    expect(doc['url']).to eq('http://www.cnn.com')
+    expect(doc['handler']).to eq(:blah)
+    expect(doc['message']).to eq('WTF')
+    expect(doc['spider']).to eq('TestMongoSpider')
   end
-
-end
+end
data/spec/spidey/strategies/moped_spec.rb
CHANGED
@@ -4,7 +4,7 @@ require 'moped'
 describe Spidey::Strategies::Moped do
   class TestMopedSpider < Spidey::AbstractSpider
     include Spidey::Strategies::Moped
-    handle
+    handle 'http://www.cnn.com', :process_home

     def result_key(data)
       data[:detail_url]
@@ -21,43 +21,42 @@ describe Spidey::Strategies::Moped do
   end

   after(:each) do
-    %w
+    %w( urls results errors ).each { |col| @db[col].drop }
   end

-  it
-    doc = @db['urls'].find(url:
-    doc['handler'].
-    doc['spider'].
+  it 'should add initial URLs to collection' do
+    doc = @db['urls'].find(url: 'http://www.cnn.com').first
+    expect(doc['handler']).to eq(:process_home)
+    expect(doc['spider']).to eq('TestMopedSpider')
   end

-  it
-    @spider.send :handle,
-    @db['urls'].find(url:
+  it 'should not add duplicate URLs' do
+    @spider.send :handle, 'http://www.cnn.com', :process_home
+    expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
   end

-  it
+  it 'should add results' do
     @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
-    @db['results'].find.count.
+    expect(@db['results'].find.count).to eq(1)
     doc = @db['results'].find.first
-    doc['detail_url'].
-    doc['foo'].
-    doc['spider'].
+    expect(doc['detail_url']).to eq('http://www.cnn.com')
+    expect(doc['foo']).to eq('bar')
+    expect(doc['spider']).to eq('TestMopedSpider')
   end

-  it
+  it 'should update existing result' do
     @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
     @spider.record detail_url: 'http://foo.bar', foo: 'bar'
-    @db['results'].find.count.
+    expect(@db['results'].find.count).to eq(1)
   end

-  it
-    @spider.add_error error: Exception.new(
+  it 'should add error' do
+    @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
     doc = @db['errors'].find.first
-    doc['error'].
-    doc['url'].
-    doc['handler'].
-    doc['message'].
-    doc['spider'].
+    expect(doc['error']).to eq('Exception')
+    expect(doc['url']).to eq('http://www.cnn.com')
+    expect(doc['handler']).to eq(:blah)
+    expect(doc['message']).to eq('WTF')
+    expect(doc['spider']).to eq('TestMopedSpider')
   end
-
-end
+end
data/spidey-mongo.gemspec
CHANGED
@@ -1,29 +1,26 @@
 # -*- encoding: utf-8 -*-
-
-require
+$LOAD_PATH.push File.expand_path('../lib', __FILE__)
+require 'spidey-mongo/version'

 Gem::Specification.new do |s|
-  s.name =
+  s.name = 'spidey-mongo'
   s.version = Spidey::Mongo::VERSION
-  s.authors = [
-  s.email = [
-  s.homepage =
-  s.summary =
-  s.description =
+  s.authors = ['Joey Aghion']
+  s.email = ['joey@aghion.com']
+  s.homepage = 'https://github.com/joeyAghion/spidey-mongo'
+  s.summary = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
+  s.description = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
   s.license = 'MIT'

-  s.rubyforge_project =
+  s.rubyforge_project = 'spidey-mongo'

   s.files = `git ls-files`.split("\n")
   s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
-  s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
-  s.require_paths = [
+  s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) }
+  s.require_paths = ['lib']

-  s.add_development_dependency
-  s.add_development_dependency
-  s.add_development_dependency "mongo"
-  s.add_development_dependency "bson_ext"
-  s.add_development_dependency "moped"
+  s.add_development_dependency 'rake'
+  s.add_development_dependency 'rspec'

-  s.add_runtime_dependency
+  s.add_runtime_dependency 'spidey', '>= 0.1.0'
 end
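(The `mongo`, `bson_ext`, and `moped` development dependencies dropped here reappear in the Gemfile above, gated on `MONGO_VERSION`, so only the driver under test is installed.)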
metadata
CHANGED
@@ -1,110 +1,55 @@
 --- !ruby/object:Gem::Specification
 name: spidey-mongo
 version: !ruby/object:Gem::Version
-  version: 0.
-  prerelease:
+  version: 0.3.0
 platform: ruby
 authors:
 - Joey Aghion
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2015-11-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: mongo
-  requirement: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: bson_ext
-  requirement: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-- !ruby/object:Gem::Dependency
-  name: moped
-  requirement: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  type: :development
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    none: false
-    requirements:
-    - - ! '>='
+    - - '>='
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: spidey
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: 0.1.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - '>='
       - !ruby/object:Gem::Version
         version: 0.1.0
 description: Implements a MongoDB back-end for Spidey, a framework for crawling and
@@ -116,6 +61,9 @@ extensions: []
 extra_rdoc_files: []
 files:
 - .gitignore
+- .travis.yml
+- CHANGELOG.md
+- CONTRIBUTING.md
 - Gemfile
 - LICENSE.txt
 - README.md
@@ -123,44 +71,40 @@ files:
 - lib/spidey-mongo.rb
 - lib/spidey-mongo/version.rb
 - lib/spidey/strategies/mongo.rb
+- lib/spidey/strategies/mongo2.rb
 - lib/spidey/strategies/moped.rb
 - spec/spec_helper.rb
+- spec/spidey/strategies/mongo2_spec.rb
 - spec/spidey/strategies/mongo_spec.rb
 - spec/spidey/strategies/moped_spec.rb
 - spidey-mongo.gemspec
 homepage: https://github.com/joeyAghion/spidey-mongo
 licenses:
 - MIT
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
   - !ruby/object:Gem::Version
     version: '0'
-  segments:
-  - 0
-  hash: 987129952958952365
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - '>='
   - !ruby/object:Gem::Version
     version: '0'
-  segments:
-  - 0
-  hash: 987129952958952365
 requirements: []
 rubyforge_project: spidey-mongo
-rubygems_version:
+rubygems_version: 2.0.14
 signing_key:
-specification_version:
+specification_version: 4
 summary: Implements a MongoDB back-end for Spidey, a framework for crawling and scraping
   web sites.
 test_files:
 - spec/spec_helper.rb
+- spec/spidey/strategies/mongo2_spec.rb
 - spec/spidey/strategies/mongo_spec.rb
 - spec/spidey/strategies/moped_spec.rb