spidey-mongo 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.travis.yml +11 -0
- data/CHANGELOG.md +15 -0
- data/CONTRIBUTING.md +116 -0
- data/Gemfile +13 -1
- data/LICENSE.txt +2 -2
- data/README.md +49 -43
- data/Rakefile +12 -1
- data/lib/spidey-mongo.rb +3 -1
- data/lib/spidey-mongo/version.rb +1 -1
- data/lib/spidey/strategies/mongo.rb +9 -12
- data/lib/spidey/strategies/mongo2.rb +59 -0
- data/lib/spidey/strategies/moped.rb +9 -12
- data/spec/spec_helper.rb +12 -2
- data/spec/spidey/strategies/mongo2_spec.rb +61 -0
- data/spec/spidey/strategies/mongo_spec.rb +24 -25
- data/spec/spidey/strategies/moped_spec.rb +24 -25
- data/spidey-mongo.gemspec +14 -17
- metadata +19 -75
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 096f00db0e8368887d3546d1909984af270ef83e
+ data.tar.gz: dbc9ebec27c141264076557098c83e7a1576dc28
+ SHA512:
+ metadata.gz: 50b3fc0d2fa3ea7837ff06e5d2f3f5bebf99047d10a43ff27dff755df492a95c649532b0ab5d0b04b8d37755b96a5558237d4c303e89f89bafc1f602e00d0e24
+ data.tar.gz: 6392e63f6d3eba223dade821bedf9c56bf44b1cf48146e0906fb74186ff81d80368522b4f1dbfe9d2cd14335870e64455c66c2ab86c60abfc1c506de00d3e9a6
data/.travis.yml
ADDED
data/CHANGELOG.md
ADDED
@@ -0,0 +1,15 @@
+ ### Next
+
+ * Your contribution here...
+
+ ### 0.3.0
+
+ * [#3](https://github.com/joeyAghion/spidey-mongo/pull/3): Added support for Mongo Ruby Driver 2.x - [@dblock](https://github.com/dblock).
+
+ ### 0.2.0
+
+ * [#1](https://github.com/joeyAghion/spidey-mongo/pull/1): Added support for Moped - [@fancyremarker](https://github.com/fancyremarker).
+
+ ### 0.1.0
+
+ * Initial public release - [@joeyAghion](https://github.com/joeyAghion).
data/CONTRIBUTING.md
ADDED
@@ -0,0 +1,116 @@
+ Contributing
+ ============
+
+ This gem is work of [many of contributors](https://github.com/joeyAghion/spidey-mongo/graphs/contributors). You're encouraged to submit [pull requests](https://github.com/joeyAghion/spidey-mongo/pulls), [propose features, ask questions and discuss issues](https://github.com/joeyAghion/spidey-mongo/issues).
+
+ #### Fork the Project
+
+ Fork the [project on Github](https://github.com/joeyAghion/spidey-mongo) and check out your copy.
+
+ ```
+ git clone https://github.com/contributor/spidey-mongo.git
+ cd spidey-mongo
+ git remote add upstream https://github.com/joeyAghion/spidey-mongo.git
+ ```
+
+ #### Create a Topic Branch
+
+ Make sure your fork is up-to-date and create a topic branch for your feature or bug fix.
+
+ ```
+ git checkout master
+ git pull upstream master
+ git checkout -b my-feature-branch
+ ```
+
+ #### Bundle Install and Test
+
+ Ensure that you can build the project and run tests.
+
+ ```
+ bundle install
+ bundle exec rake
+ ```
+
+ #### Write Tests
+
+ Try to write a test that reproduces the problem you're trying to fix or describes a feature that you want to build. Add to [spec/mongoid](spec/mongoid).
+
+ We definitely appreciate pull requests that highlight or reproduce a problem, even without a fix.
+
+ #### Write Code
+
+ Implement your feature or bug fix.
+
+ Make sure that `bundle exec rake` completes without errors.
+
+ #### Write Documentation
+
+ Document any external behavior in the [README](README.md).
+
+ #### Update Changelog
+
+ Add a line to [CHANGELOG](CHANGELOG.md) under *Next*. Make it look like every other line, including your name and link to your Github account.
+
+ #### Commit Changes
+
+ Make sure git knows your name and email address:
+
+ ```
+ git config --global user.name "Your Name"
+ git config --global user.email "contributor@example.com"
+ ```
+
+ Writing good commit logs is important. A commit log should describe what changed and why.
+
+ ```
+ git add ...
+ git commit
+ ```
+
+ #### Push
+
+ ```
+ git push origin my-feature-branch
+ ```
+
+ #### Make a Pull Request
+
+ Go to https://github.com/contributor/spidey-mongo and select your feature branch. Click the 'Pull Request' button and fill out the form. Pull requests are usually reviewed within a few days.
+
+ #### Rebase
+
+ If you've been working on a change for a while, rebase with upstream/master.
+
+ ```
+ git fetch upstream
+ git rebase upstream/master
+ git push origin my-feature-branch -f
+ ```
+
+ #### Update CHANGELOG Again
+
+ Update the [CHANGELOG](CHANGELOG.md) with the pull request number. A typical entry looks as follows.
+
+ ```
+ * [#123](https://github.com/joeyAghion/spidey-mongo/pull/123): Reticulated splines - [@contributor](https://github.com/contributor).
+ ```
+
+ Amend your previous commit and force push the changes.
+
+ ```
+ git commit --amend
+ git push origin my-feature-branch -f
+ ```
+
+ #### Check on Your Pull Request
+
+ Go back to your pull request after a few minutes and see whether it passed muster with Travis-CI. Everything should look green, otherwise fix issues and amend your commit as described above.
+
+ #### Be Patient
+
+ It's likely that your change will not be merged and that the nitpicky maintainers will ask you to do more, or fix seemingly benign problems. Hang on there!
+
+ #### Thank You
+
+ Please do know that we really appreciate and value your time and work. We love you, really.
data/Gemfile
CHANGED
@@ -1,4 +1,16 @@
- source
+ source 'http://rubygems.org'
+
+ case version = ENV['MONGO_VERSION'] || 'mongo2'
+ when /^moped/
+ gem 'moped', '~> 2.0'
+ when /^mongo2/
+ gem 'mongo', '~> 2.0'
+ when /^mongo/
+ gem 'mongo', '~> 1.12'
+ gem 'bson_ext'
+ else
+ fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+ end

  # Specify your gem's dependencies in spidey-mongo.gemspec

data/LICENSE.txt
CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2012 Joey Aghion,
+ Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors

  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
@@ -17,4 +17,4 @@ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
CHANGED
@@ -1,6 +1,9 @@
  Spidey-Mongo
  ============

+ [](https://travis-ci.org/joeyAghion/spidey-mongo)
+ [](https://badge.fury.io/rb/spidey-mongo)
+
  This gem implements a [MongoDB](http://www.mongodb.org/) back-end for [Spidey](https://github.com/joeyAghion/spidey), a very simple framework for crawling and scraping web sites.

  See [Spidey](https://github.com/joeyAghion/spidey)'s documentation for a basic example spider class.
@@ -12,45 +15,52 @@ Usage

  ### Install the gem

-
-
+ ``` ruby
+ gem install spidey-mongo
+ ```

  ### `mongo` versus `moped`

- Spidey-Mongo provides
+ Spidey-Mongo provides three strategies:

- * `Spidey::Strategies::Mongo`: Compatible with
- * `Spidey::Strategies::
+ * `Spidey::Strategies::Mongo`: Compatible with Mongo Ruby Driver 1.x, [`mongo`](https://github.com/mongodb/mongo-ruby-driver)
+ * `Spidey::Strategies::Mongo2`: Compatible with Mongo Ruby Driver 2.x, [`mongo`](https://github.com/mongodb/mongo-ruby-driver), e.g., for use with Mongoid 5.x
+ * `Spidey::Strategies::Moped`: Compatible with the [`moped`](https://github.com/mongoid/moped) 2.x, e.g., for use with Mongoid 3.x and 4.x

  You can include either strategy in your classes, as appropriate. All the examples in this README assume `Spidey::Strategies::Mongo`.

-
  ### Example spider class

-
-
-
-
-
-
-
-
-
+ ```ruby
+ class EbaySpider < Spidey::AbstractSpider
+ include Spidey::Strategies::Mongo
+
+ handle "http://www.ebay.com", :process_home
+
+ def process_home(page, default_data = {})
+ # ...
+ end
+ end
+ ```

  ### Invocation

  The spider's constructor accepts new parameters for each of the MongoDB collections to employ: `url_collection`, `result_collection`, and `error_collection`.

-
-
-
-
-
-
+ ```ruby
+ db = Mongo::Connection.new['example']
+
+ spider = EbaySpider.new(
+ url_collection: db['urls'],
+ result_collection: db['results'],
+ error_collection: db['errors'])
+ ```

  With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The `crawl` method accepts a new optional `crawl_for` parameter specifying the number of seconds after which to stop.

-
+ ```
+ spider.crawl crawl_for: 600 # seconds, or more conveniently (w/ActiveSupport): 10.minutes
+ ```

  (The base implementation's `max_urls` parameter is also useful for this purpose.)

@@ -58,32 +68,28 @@ With persistent storage of the URL-crawling queue, it's now possible to stop cra

  By default, invocations of `record(data)` by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, define a `result_key` method that returns a key by which to find the corresponding document. The method is called with a hash of the data being recorded:

-
-
-
- def result_key(data)
- data[:detail_url]
- end
+ ```ruby
+ class EbaySpider < Spidey::AbstractSpider
+ include Spidey::Strategies::Mongo

-
-
+ def result_key(data)
+ data[:detail_url]
+ end

-
+ # ...
+ end
+ ```

-
- -------
-
- bundle exec rspec
-
- Contributors
- ------------
+ This performs an `upsert` instead of the usual `insert` (i.e., an update if a result document matching the key already exists, or insert otherwise).

-
+ Contrbuting
+ -----------

-
- -----
- * Extract behaviors shared by `Mongo` and `Moped` strategies.
+ Please contribute! See [CONTRIBUTING](CONTRIBUTING.md) for details.

  Copyright
  ---------
-
+
+ Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors.
+
+ See [LICENSE.txt](LICENSE.txt) for further details.
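The README's invocation example above still targets the 1.x driver (`Mongo::Connection`). For the `Spidey::Strategies::Mongo2` strategy that this release adds, a minimal sketch of the equivalent wiring — assuming a local MongoDB instance and borrowing the `Mongo::Client` connection style used by the gem's new specs, with an illustrative database name — might look like:

```ruby
require 'mongo'
require 'spidey-mongo'

class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo2   # 2.x-driver back-end introduced in 0.3.0

  handle "http://www.ebay.com", :process_home

  def process_home(page, default_data = {})
    # ...
  end
end

# With the Mongo Ruby Driver 2.x, collections come from a Mongo::Client.
db = Mongo::Client.new('mongodb://127.0.0.1:27017/example')

spider = EbaySpider.new(
  url_collection: db['urls'],
  result_collection: db['results'],
  error_collection: db['errors'])

spider.crawl crawl_for: 600
```

The only strategy-level differences from the 1.x example are the `Mongo::Client` connection and the `Mongo2` mixin; the constructor parameters and `crawl` options are unchanged.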
data/Rakefile
CHANGED
@@ -1 +1,12 @@
- require
+ require 'bundler/gem_tasks'
+
+ Bundler.setup :default, :development
+
+ require 'rspec/core'
+ require 'rspec/core/rake_task'
+
+ RSpec::Core::RakeTask.new(:spec) do |spec|
+ spec.pattern = FileList["spec/**/#{ENV['MONGO_VERSION'] || 'mongo2'}_spec.rb"]
+ end
+
+ task default: :spec
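The Gemfile and Rakefile changes above key both dependency selection and the spec pattern off a single `MONGO_VERSION` environment variable (defaulting to `mongo2`). A plausible local workflow, using only the values those case statements recognize, would be:

```
# default: Mongo Ruby Driver 2.x strategy and specs
bundle install
bundle exec rake

# or exercise the legacy 1.x driver or Moped back-ends
MONGO_VERSION=mongo bundle install && MONGO_VERSION=mongo bundle exec rake
MONGO_VERSION=moped bundle install && MONGO_VERSION=moped bundle exec rake
```

The changed spec/spec_helper.rb below uses the same variable to decide which driver to require.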
data/lib/spidey-mongo.rb
CHANGED
data/lib/spidey-mongo/version.rb
CHANGED
data/lib/spidey/strategies/mongo.rb
CHANGED
@@ -18,8 +18,8 @@ module Spidey::Strategies
  def handle(url, handler, default_data = {})
  Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
  url_collection.update(
- {'spider' => self.class.name, 'url' => url},
- {'$set' => {'handler' => handler, 'default_data' => default_data}},
+ { 'spider' => self.class.name, 'url' => url },
+ { '$set' => { 'handler' => handler, 'default_data' => default_data } },
  upsert: true
  )
  end
@@ -28,16 +28,16 @@ module Spidey::Strategies
  doc = data.merge('spider' => self.class.name)
  Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
  if respond_to?(:result_key) && key = result_key(doc)
- result_collection.update({'key' => key}, {'$set' => doc}, upsert: true)
+ result_collection.update({ 'key' => key }, { '$set' => doc }, upsert: true)
  else
  result_collection.insert doc
  end
  end

- def each_url(&
+ def each_url(&_block)
  while url = get_next_url
- break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at
- url_collection.update({'_id' => url['_id']}, '$set' => {last_crawled_at: Time.now})
+ break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+ url_collection.update({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
  yield url['url'], url['handler'], url['default_data'].symbolize_keys
  end
  end
@@ -49,14 +49,11 @@ module Spidey::Strategies
  Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
  end

-
+ private

  def get_next_url
- return nil if
- url_collection.find_one({spider: self.class.name},
- sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]
- })
+ return nil if @until && Time.now >= @until # exceeded time bound
+ url_collection.find_one({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]])
  end
-
  end
  end
data/lib/spidey/strategies/mongo2.rb
ADDED
@@ -0,0 +1,59 @@
+ module Spidey::Strategies
+ module Mongo2
+ attr_accessor :url_collection, :result_collection, :error_collection
+
+ def initialize(attrs = {})
+ self.url_collection = attrs.delete(:url_collection)
+ self.result_collection = attrs.delete(:result_collection)
+ self.error_collection = attrs.delete(:error_collection)
+ super attrs
+ end
+
+ def crawl(options = {})
+ @crawl_started_at = Time.now
+ @until = Time.now + options[:crawl_for] if options[:crawl_for]
+ super options
+ end
+
+ def handle(url, handler, default_data = {})
+ Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
+ url_collection.update_one(
+ { 'spider' => self.class.name, 'url' => url },
+ { '$set' => { 'handler' => handler, 'default_data' => default_data } },
+ upsert: true
+ )
+ end
+
+ def record(data)
+ doc = data.merge('spider' => self.class.name)
+ Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
+ if respond_to?(:result_key) && key = result_key(doc)
+ result_collection.update_one({ 'key' => key }, { '$set' => doc }, upsert: true)
+ else
+ result_collection.insert_one doc
+ end
+ end
+
+ def each_url(&_block)
+ while url = get_next_url
+ break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+ url_collection.update_one({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
+ yield url['url'], url['handler'], url['default_data'].symbolize_keys
+ end
+ end
+
+ def add_error(attrs)
+ error = attrs.delete(:error)
+ doc = attrs.merge(created_at: Time.now, error: error.class.name, message: error.message, spider: self.class.name)
+ error_collection.insert_one doc
+ Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
+ end
+
+ private
+
+ def get_next_url
+ return nil if @until && Time.now >= @until # exceeded time bound
+ url_collection.find({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]).first
+ end
+ end
+ end
data/lib/spidey/strategies/moped.rb
CHANGED
@@ -18,9 +18,9 @@ module Spidey::Strategies
  def handle(url, handler, default_data = {})
  Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
  url_collection.find(
-
+ 'spider' => self.class.name, 'url' => url
  ).upsert(
-
+ '$set' => { 'handler' => handler, 'default_data' => default_data }
  )
  end

@@ -28,16 +28,16 @@ module Spidey::Strategies
  doc = data.merge('spider' => self.class.name)
  Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
  if respond_to?(:result_key) && key = result_key(doc)
- result_collection.find(
+ result_collection.find('key' => key).upsert('$set' => doc)
  else
  result_collection.insert doc
  end
  end

- def each_url(&
+ def each_url(&_block)
  while url = get_next_url
- break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at
- url_collection.find(
+ break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+ url_collection.find('_id' => url['_id']).update('$set' => { last_crawled_at: Time.now })
  yield url['url'], url['handler'], url['default_data'].symbolize_keys
  end
  end
@@ -49,14 +49,11 @@ module Spidey::Strategies
  Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
  end

-
+ private

  def get_next_url
- return nil if
- url_collection.find(
- 'last_crawled_at' => 1, '_id' => 1
- }).first
+ return nil if @until && Time.now >= @until # exceeded time bound
+ url_collection.find(spider: self.class.name).sort('last_crawled_at' => 1, '_id' => 1).first
  end
-
  end
  end
data/spec/spec_helper.rb
CHANGED
@@ -1,8 +1,18 @@
-
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+
+ case version = ENV['MONGO_VERSION'] || 'mongo2'
+ when /^moped/
+ require 'moped'
+ when /^mongo/
+ require 'mongo'
+ else
+ fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+ end
+
  require 'spidey-mongo'

  RSpec.configure do |config|
- config.treat_symbols_as_metadata_keys_with_true_values = true
  config.run_all_when_everything_filtered = true
  config.filter_run :focus
+ config.raise_errors_for_deprecations!
  end
data/spec/spidey/strategies/mongo2_spec.rb
ADDED
@@ -0,0 +1,61 @@
+ require 'spec_helper'
+ require 'mongo'
+
+ describe Spidey::Strategies::Mongo do
+ class TestMongoSpider < Spidey::AbstractSpider
+ include Spidey::Strategies::Mongo2
+ handle 'http://www.cnn.com', :process_home
+
+ def result_key(data)
+ data[:detail_url]
+ end
+ end
+
+ before(:each) do
+ @db = Mongo::Client.new('mongodb://127.0.0.1:27017/spidey-mongo-test')
+ @spider = TestMongoSpider.new(
+ url_collection: @db['urls'],
+ result_collection: @db['results'],
+ error_collection: @db['errors'])
+ end
+
+ after(:each) do
+ %w( urls results errors ).each { |col| @db[col].drop }
+ end
+
+ it 'should add initial URLs to collection' do
+ doc = @db['urls'].find(url: 'http://www.cnn.com').first
+ expect(doc['handler']).to eq(:process_home)
+ expect(doc['spider']).to eq('TestMongoSpider')
+ end
+
+ it 'should not add duplicate URLs' do
+ @spider.send :handle, 'http://www.cnn.com', :process_home
+ expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
+ end
+
+ it 'should add results' do
+ @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
+ expect(@db['results'].count).to eq(1)
+ doc = @db['results'].find.first
+ expect(doc['detail_url']).to eq('http://www.cnn.com')
+ expect(doc['foo']).to eq('bar')
+ expect(doc['spider']).to eq('TestMongoSpider')
+ end
+
+ it 'should update existing result' do
+ @db['results'].insert_one key: 'http://foo.bar', detail_url: 'http://foo.bar'
+ @spider.record detail_url: 'http://foo.bar', foo: 'bar'
+ expect(@db['results'].count).to eq(1)
+ end
+
+ it 'should add error' do
+ @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
+ doc = @db['errors'].find.first
+ expect(doc['error']).to eq('Exception')
+ expect(doc['url']).to eq('http://www.cnn.com')
+ expect(doc['handler']).to eq(:blah)
+ expect(doc['message']).to eq('WTF')
+ expect(doc['spider']).to eq('TestMongoSpider')
+ end
+ end
data/spec/spidey/strategies/mongo_spec.rb
CHANGED
@@ -4,7 +4,7 @@ require 'mongo'
  describe Spidey::Strategies::Mongo do
  class TestMongoSpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo
- handle
+ handle 'http://www.cnn.com', :process_home

  def result_key(data)
  data[:detail_url]
@@ -20,43 +20,42 @@ describe Spidey::Strategies::Mongo do
  end

  after(:each) do
- %w
+ %w( urls results errors ).each { |col| @db[col].drop }
  end

- it
- doc = @db['urls'].find_one(url:
- doc['handler'].
- doc['spider'].
+ it 'should add initial URLs to collection' do
+ doc = @db['urls'].find_one(url: 'http://www.cnn.com')
+ expect(doc['handler']).to eq(:process_home)
+ expect(doc['spider']).to eq('TestMongoSpider')
  end

- it
- @spider.send :handle,
- @db['urls'].find(url:
+ it 'should not add duplicate URLs' do
+ @spider.send :handle, 'http://www.cnn.com', :process_home
+ expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
  end

- it
+ it 'should add results' do
  @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
- @db['results'].count.
+ expect(@db['results'].count).to eq(1)
  doc = @db['results'].find_one
- doc['detail_url'].
- doc['foo'].
- doc['spider'].
+ expect(doc['detail_url']).to eq('http://www.cnn.com')
+ expect(doc['foo']).to eq('bar')
+ expect(doc['spider']).to eq('TestMongoSpider')
  end

- it
+ it 'should update existing result' do
  @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
  @spider.record detail_url: 'http://foo.bar', foo: 'bar'
- @db['results'].count.
+ expect(@db['results'].count).to eq(1)
  end

- it
- @spider.add_error error: Exception.new(
+ it 'should add error' do
+ @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
  doc = @db['errors'].find_one
- doc['error'].
- doc['url'].
- doc['handler'].
- doc['message'].
- doc['spider'].
+ expect(doc['error']).to eq('Exception')
+ expect(doc['url']).to eq('http://www.cnn.com')
+ expect(doc['handler']).to eq(:blah)
+ expect(doc['message']).to eq('WTF')
+ expect(doc['spider']).to eq('TestMongoSpider')
  end
-
- end
+ end
data/spec/spidey/strategies/moped_spec.rb
CHANGED
@@ -4,7 +4,7 @@ require 'moped'
  describe Spidey::Strategies::Moped do
  class TestMopedSpider < Spidey::AbstractSpider
  include Spidey::Strategies::Moped
- handle
+ handle 'http://www.cnn.com', :process_home

  def result_key(data)
  data[:detail_url]
@@ -21,43 +21,42 @@ describe Spidey::Strategies::Moped do
  end

  after(:each) do
- %w
+ %w( urls results errors ).each { |col| @db[col].drop }
  end

- it
- doc = @db['urls'].find(url:
- doc['handler'].
- doc['spider'].
+ it 'should add initial URLs to collection' do
+ doc = @db['urls'].find(url: 'http://www.cnn.com').first
+ expect(doc['handler']).to eq(:process_home)
+ expect(doc['spider']).to eq('TestMopedSpider')
  end

- it
- @spider.send :handle,
- @db['urls'].find(url:
+ it 'should not add duplicate URLs' do
+ @spider.send :handle, 'http://www.cnn.com', :process_home
+ expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
  end

- it
+ it 'should add results' do
  @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
- @db['results'].find.count.
+ expect(@db['results'].find.count).to eq(1)
  doc = @db['results'].find.first
- doc['detail_url'].
- doc['foo'].
- doc['spider'].
+ expect(doc['detail_url']).to eq('http://www.cnn.com')
+ expect(doc['foo']).to eq('bar')
+ expect(doc['spider']).to eq('TestMopedSpider')
  end

- it
+ it 'should update existing result' do
  @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
  @spider.record detail_url: 'http://foo.bar', foo: 'bar'
- @db['results'].find.count.
+ expect(@db['results'].find.count).to eq(1)
  end

- it
- @spider.add_error error: Exception.new(
+ it 'should add error' do
+ @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
  doc = @db['errors'].find.first
- doc['error'].
- doc['url'].
- doc['handler'].
- doc['message'].
- doc['spider'].
+ expect(doc['error']).to eq('Exception')
+ expect(doc['url']).to eq('http://www.cnn.com')
+ expect(doc['handler']).to eq(:blah)
+ expect(doc['message']).to eq('WTF')
+ expect(doc['spider']).to eq('TestMopedSpider')
  end
-
- end
+ end
data/spidey-mongo.gemspec
CHANGED
@@ -1,29 +1,26 @@
  # -*- encoding: utf-8 -*-
-
- require
+ $LOAD_PATH.push File.expand_path('../lib', __FILE__)
+ require 'spidey-mongo/version'

  Gem::Specification.new do |s|
- s.name =
+ s.name = 'spidey-mongo'
  s.version = Spidey::Mongo::VERSION
- s.authors = [
- s.email = [
- s.homepage =
- s.summary =
- s.description =
+ s.authors = ['Joey Aghion']
+ s.email = ['joey@aghion.com']
+ s.homepage = 'https://github.com/joeyAghion/spidey-mongo'
+ s.summary = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
+ s.description = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
  s.license = 'MIT'

- s.rubyforge_project =
+ s.rubyforge_project = 'spidey-mongo'

  s.files = `git ls-files`.split("\n")
  s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
- s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
- s.require_paths = [
+ s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) }
+ s.require_paths = ['lib']

- s.add_development_dependency
- s.add_development_dependency
- s.add_development_dependency "mongo"
- s.add_development_dependency "bson_ext"
- s.add_development_dependency "moped"
+ s.add_development_dependency 'rake'
+ s.add_development_dependency 'rspec'

- s.add_runtime_dependency
+ s.add_runtime_dependency 'spidey', '>= 0.1.0'
  end
metadata
CHANGED
@@ -1,110 +1,55 @@
  --- !ruby/object:Gem::Specification
  name: spidey-mongo
  version: !ruby/object:Gem::Version
- version: 0.
- prerelease:
+ version: 0.3.0
  platform: ruby
  authors:
  - Joey Aghion
  autorequire:
  bindir: bin
  cert_chain: []
- date:
+ date: 2015-11-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rake
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
  name: rspec
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: mongo
- requirement: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: bson_ext
- requirement: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- - !ruby/object:Gem::Dependency
- name: moped
- requirement: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
- - !ruby/object:Gem::Version
- version: '0'
- type: :development
- prerelease: false
- version_requirements: !ruby/object:Gem::Requirement
- none: false
- requirements:
- - - ! '>='
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
  name: spidey
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: 0.1.0
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: 0.1.0
  description: Implements a MongoDB back-end for Spidey, a framework for crawling and
@@ -116,6 +61,9 @@ extensions: []
  extra_rdoc_files: []
  files:
  - .gitignore
+ - .travis.yml
+ - CHANGELOG.md
+ - CONTRIBUTING.md
  - Gemfile
  - LICENSE.txt
  - README.md
@@ -123,44 +71,40 @@ files:
  - lib/spidey-mongo.rb
  - lib/spidey-mongo/version.rb
  - lib/spidey/strategies/mongo.rb
+ - lib/spidey/strategies/mongo2.rb
  - lib/spidey/strategies/moped.rb
  - spec/spec_helper.rb
+ - spec/spidey/strategies/mongo2_spec.rb
  - spec/spidey/strategies/mongo_spec.rb
  - spec/spidey/strategies/moped_spec.rb
  - spidey-mongo.gemspec
  homepage: https://github.com/joeyAghion/spidey-mongo
  licenses:
  - MIT
+ metadata: {}
  post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
- segments:
- - 0
- hash: 987129952958952365
  required_rubygems_version: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - -
+ - - '>='
  - !ruby/object:Gem::Version
  version: '0'
- segments:
- - 0
- hash: 987129952958952365
  requirements: []
  rubyforge_project: spidey-mongo
- rubygems_version:
+ rubygems_version: 2.0.14
  signing_key:
- specification_version:
+ specification_version: 4
  summary: Implements a MongoDB back-end for Spidey, a framework for crawling and scraping
  web sites.
  test_files:
  - spec/spec_helper.rb
+ - spec/spidey/strategies/mongo2_spec.rb
  - spec/spidey/strategies/mongo_spec.rb
  - spec/spidey/strategies/moped_spec.rb