spidey-mongo 0.2.0 → 0.3.0

checksums.yaml.gz ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: 096f00db0e8368887d3546d1909984af270ef83e
+   data.tar.gz: dbc9ebec27c141264076557098c83e7a1576dc28
+ SHA512:
+   metadata.gz: 50b3fc0d2fa3ea7837ff06e5d2f3f5bebf99047d10a43ff27dff755df492a95c649532b0ab5d0b04b8d37755b96a5558237d4c303e89f89bafc1f602e00d0e24
+   data.tar.gz: 6392e63f6d3eba223dade821bedf9c56bf44b1cf48146e0906fb74186ff81d80368522b4f1dbfe9d2cd14335870e64455c66c2ab86c60abfc1c506de00d3e9a6
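A `.gem` package is a tar archive whose members include `metadata.gz` and `data.tar.gz`, so the digests above can be reproduced with standard tools; a sketch, assuming the published 0.3.0 package name:

```
gem fetch spidey-mongo --version 0.3.0
tar -xOf spidey-mongo-0.3.0.gem metadata.gz | shasum -a 512
tar -xOf spidey-mongo-0.3.0.gem data.tar.gz | shasum -a 512
```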
data/.travis.yml ADDED
@@ -0,0 +1,11 @@
+ services:
+   - mongodb
+
+ env:
+   - MONGO_VERSION=moped
+   - MONGO_VERSION=mongo
+   - MONGO_VERSION=mongo2
+
+ rvm:
+   - 2.2
+
data/CHANGELOG.md ADDED
@@ -0,0 +1,15 @@
+ ### Next
+
+ * Your contribution here...
+
+ ### 0.3.0
+
+ * [#3](https://github.com/joeyAghion/spidey-mongo/pull/3): Added support for Mongo Ruby Driver 2.x - [@dblock](https://github.com/dblock).
+
+ ### 0.2.0
+
+ * [#1](https://github.com/joeyAghion/spidey-mongo/pull/1): Added support for Moped - [@fancyremarker](https://github.com/fancyremarker).
+
+ ### 0.1.0
+
+ * Initial public release - [@joeyAghion](https://github.com/joeyAghion).
data/CONTRIBUTING.md ADDED
@@ -0,0 +1,116 @@
+ Contributing
+ ============
+
+ This gem is the work of [many contributors](https://github.com/joeyAghion/spidey-mongo/graphs/contributors). You're encouraged to submit [pull requests](https://github.com/joeyAghion/spidey-mongo/pulls), [propose features, ask questions and discuss issues](https://github.com/joeyAghion/spidey-mongo/issues).
+
+ #### Fork the Project
+
+ Fork the [project on Github](https://github.com/joeyAghion/spidey-mongo) and check out your copy.
+
+ ```
+ git clone https://github.com/contributor/spidey-mongo.git
+ cd spidey-mongo
+ git remote add upstream https://github.com/joeyAghion/spidey-mongo.git
+ ```
+
+ #### Create a Topic Branch
+
+ Make sure your fork is up-to-date and create a topic branch for your feature or bug fix.
+
+ ```
+ git checkout master
+ git pull upstream master
+ git checkout -b my-feature-branch
+ ```
+
+ #### Bundle Install and Test
+
+ Ensure that you can build the project and run tests.
+
+ ```
+ bundle install
+ bundle exec rake
+ ```
+
+ #### Write Tests
+
+ Try to write a test that reproduces the problem you're trying to fix or describes a feature that you want to build. Add to [spec](spec).
+
+ We definitely appreciate pull requests that highlight or reproduce a problem, even without a fix.
+
+ #### Write Code
+
+ Implement your feature or bug fix.
+
+ Make sure that `bundle exec rake` completes without errors.
+
+ #### Write Documentation
+
+ Document any external behavior in the [README](README.md).
+
+ #### Update Changelog
+
+ Add a line to [CHANGELOG](CHANGELOG.md) under *Next*. Make it look like every other line, including your name and link to your Github account.
+
+ #### Commit Changes
+
+ Make sure git knows your name and email address:
+
+ ```
+ git config --global user.name "Your Name"
+ git config --global user.email "contributor@example.com"
+ ```
+
+ Writing good commit logs is important. A commit log should describe what changed and why.
+
+ ```
+ git add ...
+ git commit
+ ```
+
+ #### Push
+
+ ```
+ git push origin my-feature-branch
+ ```
+
+ #### Make a Pull Request
+
+ Go to https://github.com/contributor/spidey-mongo and select your feature branch. Click the 'Pull Request' button and fill out the form. Pull requests are usually reviewed within a few days.
+
+ #### Rebase
+
+ If you've been working on a change for a while, rebase with upstream/master.
+
+ ```
+ git fetch upstream
+ git rebase upstream/master
+ git push origin my-feature-branch -f
+ ```
+
+ #### Update CHANGELOG Again
+
+ Update the [CHANGELOG](CHANGELOG.md) with the pull request number. A typical entry looks as follows.
+
+ ```
+ * [#123](https://github.com/joeyAghion/spidey-mongo/pull/123): Reticulated splines - [@contributor](https://github.com/contributor).
+ ```
+
+ Amend your previous commit and force push the changes.
+
+ ```
+ git commit --amend
+ git push origin my-feature-branch -f
+ ```
+
+ #### Check on Your Pull Request
+
+ Go back to your pull request after a few minutes and see whether it passed muster with Travis-CI. Everything should look green; otherwise, fix issues and amend your commit as described above.
+
+ #### Be Patient
+
+ It's likely that your change will not be merged and that the nitpicky maintainers will ask you to do more, or fix seemingly benign problems. Hang in there!
+
+ #### Thank You
+
+ Please do know that we really appreciate and value your time and work. We love you, really.
data/Gemfile CHANGED
@@ -1,4 +1,16 @@
- source "http://rubygems.org"
+ source 'http://rubygems.org'
+
+ case version = ENV['MONGO_VERSION'] || 'mongo2'
+ when /^moped/
+   gem 'moped', '~> 2.0'
+ when /^mongo2/
+   gem 'mongo', '~> 2.0'
+ when /^mongo/
+   gem 'mongo', '~> 1.12'
+   gem 'bson_ext'
+ else
+   fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+ end

  # Specify your gem's dependencies in spidey-mongo.gemspec
  gemspec
data/LICENSE.txt CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2012 Joey Aghion, Art.sy Inc.
+ Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors

  Permission is hereby granted, free of charge, to any person obtaining
  a copy of this software and associated documentation files (the
@@ -17,4 +17,4 @@ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
  LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
  OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
- WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md CHANGED
@@ -1,6 +1,9 @@
  Spidey-Mongo
  ============

+ [![Build Status](https://travis-ci.org/joeyAghion/spidey-mongo.svg?branch=master)](https://travis-ci.org/joeyAghion/spidey-mongo)
+ [![Gem Version](https://badge.fury.io/rb/spidey-mongo.svg)](https://badge.fury.io/rb/spidey-mongo)
+
  This gem implements a [MongoDB](http://www.mongodb.org/) back-end for [Spidey](https://github.com/joeyAghion/spidey), a very simple framework for crawling and scraping web sites.

  See [Spidey](https://github.com/joeyAghion/spidey)'s documentation for a basic example spider class.
@@ -12,45 +15,52 @@ Usage

  ### Install the gem

-     gem install spidey-mongo
-
+ ```
+ gem install spidey-mongo
+ ```

  ### `mongo` versus `moped`

- Spidey-Mongo provides two strategies:
+ Spidey-Mongo provides three strategies:

- * `Spidey::Strategies::Mongo`: Compatible with 10gen's [`mongo`](https://github.com/mongodb/mongo-ruby-driver) gem
- * `Spidey::Strategies::Moped`: Compatible with the [`moped`](https://github.com/mongoid/moped) gem, e.g., for use with Mongoid 3.x
+ * `Spidey::Strategies::Mongo`: Compatible with the Mongo Ruby Driver 1.x ([`mongo`](https://github.com/mongodb/mongo-ruby-driver))
+ * `Spidey::Strategies::Mongo2`: Compatible with the Mongo Ruby Driver 2.x ([`mongo`](https://github.com/mongodb/mongo-ruby-driver)), e.g., for use with Mongoid 5.x
+ * `Spidey::Strategies::Moped`: Compatible with the [`moped`](https://github.com/mongoid/moped) gem 2.x, e.g., for use with Mongoid 3.x and 4.x

  You can include any of these strategies in your classes, as appropriate. All the examples in this README assume `Spidey::Strategies::Mongo`.

  ### Example spider class

-     class EbaySpider < Spidey::AbstractSpider
-       include Spidey::Strategies::Mongo
-
-       handle "http://www.ebay.com", :process_home
-
-       def process_home(page, default_data = {})
-         # ...
-       end
-     end
+ ```ruby
+ class EbaySpider < Spidey::AbstractSpider
+   include Spidey::Strategies::Mongo
+
+   handle "http://www.ebay.com", :process_home
+
+   def process_home(page, default_data = {})
+     # ...
+   end
+ end
+ ```

  ### Invocation

  The spider's constructor accepts new parameters for each of the MongoDB collections to employ: `url_collection`, `result_collection`, and `error_collection`.

-     db = Mongo::Connection.new['example']
-
-     spider = EbaySpider.new(
-       url_collection: db['urls'],
-       result_collection: db['results'],
-       error_collection: db['errors'])
+ ```ruby
+ db = Mongo::Connection.new['example']
+
+ spider = EbaySpider.new(
+   url_collection: db['urls'],
+   result_collection: db['results'],
+   error_collection: db['errors'])
+ ```

  With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The `crawl` method accepts a new optional `crawl_for` parameter specifying the number of seconds after which to stop.

-     spider.crawl crawl_for: 600 # seconds, or more conveniently (w/ActiveSupport): 10.minutes
+ ```ruby
+ spider.crawl crawl_for: 600 # seconds, or more conveniently (w/ActiveSupport): 10.minutes
+ ```

  (The base implementation's `max_urls` parameter is also useful for this purpose.)

@@ -58,32 +68,28 @@ With persistent storage of the URL-crawling queue, it's now possible to stop crawling

  By default, invocations of `record(data)` by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, define a `result_key` method that returns a key by which to find the corresponding document. The method is called with a hash of the data being recorded:

-     class EbaySpider < Spidey::AbstractSpider
-       include Spidey::Strategies::Mongo
-
-       def result_key(data)
-         data[:detail_url]
-       end
-
-       # ...
-     end
+ ```ruby
+ class EbaySpider < Spidey::AbstractSpider
+   include Spidey::Strategies::Mongo
+
+   def result_key(data)
+     data[:detail_url]
+   end
+
+   # ...
+ end
+ ```

  This performs an `upsert` instead of the usual `insert` (i.e., an update if a result document matching the key already exists, or insert otherwise).

- Testing
- -------
-
-     bundle exec rspec
-
- Contributors
- ------------
+ Contributing
+ ------------

- [Joey Aghion](https://github.com/joeyAghion), [Frank Macreery](https://github.com/fancyremarker)
+ Please contribute! See [CONTRIBUTING](CONTRIBUTING.md) for details.

- To Do
- -----
- * Extract behaviors shared by `Mongo` and `Moped` strategies.

  Copyright
  ---------
- Copyright (c) 2012, 2013 Joey Aghion, Artsy Inc. See [LICENSE.txt](LICENSE.txt) for further details.
+
+ Copyright (c) 2012-2015 Joey Aghion, Artsy Inc., and Contributors.
+
+ See [LICENSE.txt](LICENSE.txt) for further details.
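Note that the README's invocation example above still instantiates the 1.x driver's `Mongo::Connection`; under the 2.x driver the entry point is `Mongo::Client`, as used in the added `mongo2_spec.rb` further down. A minimal end-to-end sketch of the new strategy, with illustrative database and collection names:

```ruby
require 'mongo'
require 'spidey-mongo'

class EbaySpider < Spidey::AbstractSpider
  include Spidey::Strategies::Mongo2 # 2.x-driver strategy added in 0.3.0

  handle 'http://www.ebay.com', :process_home

  def process_home(page, default_data = {})
    # extract results and/or queue more URLs here...
  end
end

# With the 2.x driver, collections hang off a Mongo::Client.
db = Mongo::Client.new('mongodb://127.0.0.1:27017/example')

spider = EbaySpider.new(
  url_collection: db['urls'],
  result_collection: db['results'],
  error_collection: db['errors'])

# crawl_for bounds the run in seconds; the base implementation's
# max_urls option (e.g. `spider.crawl max_urls: 100`) also works.
spider.crawl crawl_for: 600
```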
data/Rakefile CHANGED
@@ -1 +1,12 @@
- require "bundler/gem_tasks"
+ require 'bundler/gem_tasks'
+
+ Bundler.setup :default, :development
+
+ require 'rspec/core'
+ require 'rspec/core/rake_task'
+
+ RSpec::Core::RakeTask.new(:spec) do |spec|
+   spec.pattern = FileList["spec/**/#{ENV['MONGO_VERSION'] || 'mongo2'}_spec.rb"]
+ end
+
+ task default: :spec
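Taken together, the Gemfile's `case` block, this Rakefile, and the Travis `env` matrix above all key off the same `MONGO_VERSION` variable, so one setting selects both the driver gem and the matching `*_spec.rb` file; with the variable unset, both default to `mongo2`. To run the suite against a single driver locally (assuming a MongoDB server on the default port):

```
MONGO_VERSION=moped bundle install
MONGO_VERSION=moped bundle exec rake
```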
data/lib/spidey-mongo.rb CHANGED
@@ -1,5 +1,7 @@
  require 'spidey'

  require 'spidey-mongo/version'
+
  require 'spidey/strategies/mongo'
- require 'spidey/strategies/moped'
+ require 'spidey/strategies/mongo2'
+ require 'spidey/strategies/moped'
data/lib/spidey-mongo/version.rb CHANGED
@@ -1,5 +1,5 @@
  module Spidey
    module Mongo
-     VERSION = "0.2.0"
+     VERSION = '0.3.0'
    end
  end
data/lib/spidey/strategies/mongo.rb CHANGED
@@ -18,8 +18,8 @@ module Spidey::Strategies
      def handle(url, handler, default_data = {})
        Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
        url_collection.update(
-         {'spider' => self.class.name, 'url' => url},
-         {'$set' => {'handler' => handler, 'default_data' => default_data}},
+         { 'spider' => self.class.name, 'url' => url },
+         { '$set' => { 'handler' => handler, 'default_data' => default_data } },
          upsert: true
        )
      end
@@ -28,16 +28,16 @@ module Spidey::Strategies
        doc = data.merge('spider' => self.class.name)
        Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
        if respond_to?(:result_key) && key = result_key(doc)
-         result_collection.update({'key' => key}, {'$set' => doc}, upsert: true)
+         result_collection.update({ 'key' => key }, { '$set' => doc }, upsert: true)
        else
          result_collection.insert doc
        end
      end

-     def each_url(&block)
+     def each_url(&_block)
        while url = get_next_url
-         break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at  # crawled already in this batch
-         url_collection.update({'_id' => url['_id']}, '$set' => {last_crawled_at: Time.now})
+         break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+         url_collection.update({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
          yield url['url'], url['handler'], url['default_data'].symbolize_keys
        end
      end
@@ -49,14 +49,11 @@ module Spidey::Strategies
        Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
      end

-       private
+     private

      def get_next_url
-       return nil if (@until && Time.now >= @until) # exceeded time bound
-       url_collection.find_one({spider: self.class.name}, {
-         sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]
-       })
+       return nil if @until && Time.now >= @until # exceeded time bound
+       url_collection.find_one({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]])
      end
-
    end
  end
data/lib/spidey/strategies/mongo2.rb ADDED
@@ -0,0 +1,59 @@
+ module Spidey::Strategies
+   module Mongo2
+     attr_accessor :url_collection, :result_collection, :error_collection
+
+     def initialize(attrs = {})
+       self.url_collection = attrs.delete(:url_collection)
+       self.result_collection = attrs.delete(:result_collection)
+       self.error_collection = attrs.delete(:error_collection)
+       super attrs
+     end
+
+     def crawl(options = {})
+       @crawl_started_at = Time.now
+       @until = Time.now + options[:crawl_for] if options[:crawl_for]
+       super options
+     end
+
+     def handle(url, handler, default_data = {})
+       Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
+       url_collection.update_one(
+         { 'spider' => self.class.name, 'url' => url },
+         { '$set' => { 'handler' => handler, 'default_data' => default_data } },
+         upsert: true
+       )
+     end
+
+     def record(data)
+       doc = data.merge('spider' => self.class.name)
+       Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
+       if respond_to?(:result_key) && key = result_key(doc)
+         result_collection.update_one({ 'key' => key }, { '$set' => doc }, upsert: true)
+       else
+         result_collection.insert_one doc
+       end
+     end
+
+     def each_url(&_block)
+       while url = get_next_url
+         break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+         url_collection.update_one({ '_id' => url['_id'] }, '$set' => { last_crawled_at: Time.now })
+         yield url['url'], url['handler'], url['default_data'].symbolize_keys
+       end
+     end
+
+     def add_error(attrs)
+       error = attrs.delete(:error)
+       doc = attrs.merge(created_at: Time.now, error: error.class.name, message: error.message, spider: self.class.name)
+       error_collection.insert_one doc
+       Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
+     end
+
+     private
+
+     def get_next_url
+       return nil if @until && Time.now >= @until # exceeded time bound
+       url_collection.find({ spider: self.class.name }, sort: [[:last_crawled_at, ::Mongo::ASCENDING], [:_id, ::Mongo::ASCENDING]]).first
+     end
+   end
+ end
data/lib/spidey/strategies/moped.rb CHANGED
@@ -18,9 +18,9 @@ module Spidey::Strategies
      def handle(url, handler, default_data = {})
        Spidey.logger.info "Queueing #{url.inspect[0..200]}..."
        url_collection.find(
-         {'spider' => self.class.name, 'url' => url}
+         'spider' => self.class.name, 'url' => url
        ).upsert(
-         {'$set' => {'handler' => handler, 'default_data' => default_data}}
+         '$set' => { 'handler' => handler, 'default_data' => default_data }
        )
      end

@@ -28,16 +28,16 @@ module Spidey::Strategies
        doc = data.merge('spider' => self.class.name)
        Spidey.logger.info "Recording #{doc.inspect[0..500]}..."
        if respond_to?(:result_key) && key = result_key(doc)
-         result_collection.find({'key' => key}).upsert({'$set' => doc})
+         result_collection.find('key' => key).upsert('$set' => doc)
        else
          result_collection.insert doc
        end
      end

-     def each_url(&block)
+     def each_url(&_block)
        while url = get_next_url
-         break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at  # crawled already in this batch
-         url_collection.find({'_id' => url['_id']}).update('$set' => {last_crawled_at: Time.now})
+         break if url['last_crawled_at'] && url['last_crawled_at'] >= @crawl_started_at # crawled already in this batch
+         url_collection.find('_id' => url['_id']).update('$set' => { last_crawled_at: Time.now })
          yield url['url'], url['handler'], url['default_data'].symbolize_keys
        end
      end
@@ -49,14 +49,11 @@ module Spidey::Strategies
        Spidey.logger.error "Error on #{attrs[:url]}. #{error.class}: #{error.message}"
      end

-       private
+     private

      def get_next_url
-       return nil if (@until && Time.now >= @until) # exceeded time bound
-       url_collection.find({spider: self.class.name}).sort({
-         'last_crawled_at' => 1, '_id' => 1
-       }).first
+       return nil if @until && Time.now >= @until # exceeded time bound
+       url_collection.find(spider: self.class.name).sort('last_crawled_at' => 1, '_id' => 1).first
      end
-
    end
  end
data/spec/spec_helper.rb CHANGED
@@ -1,8 +1,18 @@
- $:.unshift(File.dirname(__FILE__) + '/../lib')
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+
+ case version = ENV['MONGO_VERSION'] || 'mongo2'
+ when /^moped/
+   require 'moped'
+ when /^mongo/
+   require 'mongo'
+ else
+   fail "Invalid MONGO_VERSION: #{ENV['MONGO_VERSION']}."
+ end
+
  require 'spidey-mongo'

  RSpec.configure do |config|
-   config.treat_symbols_as_metadata_keys_with_true_values = true
    config.run_all_when_everything_filtered = true
    config.filter_run :focus
+   config.raise_errors_for_deprecations!
  end
data/spec/spidey/strategies/mongo2_spec.rb ADDED
@@ -0,0 +1,61 @@
+ require 'spec_helper'
+ require 'mongo'
+
+ describe Spidey::Strategies::Mongo2 do
+   class TestMongoSpider < Spidey::AbstractSpider
+     include Spidey::Strategies::Mongo2
+     handle 'http://www.cnn.com', :process_home
+
+     def result_key(data)
+       data[:detail_url]
+     end
+   end
+
+   before(:each) do
+     @db = Mongo::Client.new('mongodb://127.0.0.1:27017/spidey-mongo-test')
+     @spider = TestMongoSpider.new(
+       url_collection: @db['urls'],
+       result_collection: @db['results'],
+       error_collection: @db['errors'])
+   end
+
+   after(:each) do
+     %w( urls results errors ).each { |col| @db[col].drop }
+   end
+
+   it 'should add initial URLs to collection' do
+     doc = @db['urls'].find(url: 'http://www.cnn.com').first
+     expect(doc['handler']).to eq(:process_home)
+     expect(doc['spider']).to eq('TestMongoSpider')
+   end
+
+   it 'should not add duplicate URLs' do
+     @spider.send :handle, 'http://www.cnn.com', :process_home
+     expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
+   end
+
+   it 'should add results' do
+     @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
+     expect(@db['results'].count).to eq(1)
+     doc = @db['results'].find.first
+     expect(doc['detail_url']).to eq('http://www.cnn.com')
+     expect(doc['foo']).to eq('bar')
+     expect(doc['spider']).to eq('TestMongoSpider')
+   end
+
+   it 'should update existing result' do
+     @db['results'].insert_one key: 'http://foo.bar', detail_url: 'http://foo.bar'
+     @spider.record detail_url: 'http://foo.bar', foo: 'bar'
+     expect(@db['results'].count).to eq(1)
+   end
+
+   it 'should add error' do
+     @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
+     doc = @db['errors'].find.first
+     expect(doc['error']).to eq('Exception')
+     expect(doc['url']).to eq('http://www.cnn.com')
+     expect(doc['handler']).to eq(:blah)
+     expect(doc['message']).to eq('WTF')
+     expect(doc['spider']).to eq('TestMongoSpider')
+   end
+ end
data/spec/spidey/strategies/mongo_spec.rb CHANGED
@@ -4,7 +4,7 @@ require 'mongo'
  describe Spidey::Strategies::Mongo do
    class TestMongoSpider < Spidey::AbstractSpider
      include Spidey::Strategies::Mongo
-     handle "http://www.cnn.com", :process_home
+     handle 'http://www.cnn.com', :process_home

      def result_key(data)
        data[:detail_url]
@@ -20,43 +20,42 @@ describe Spidey::Strategies::Mongo do
    end

    after(:each) do
-     %w{ urls results errors }.each{ |col| @db[col].drop }
+     %w( urls results errors ).each { |col| @db[col].drop }
    end

-   it "should add initial URLs to collection" do
-     doc = @db['urls'].find_one(url: "http://www.cnn.com")
-     doc['handler'].should == :process_home
-     doc['spider'].should == 'TestMongoSpider'
+   it 'should add initial URLs to collection' do
+     doc = @db['urls'].find_one(url: 'http://www.cnn.com')
+     expect(doc['handler']).to eq(:process_home)
+     expect(doc['spider']).to eq('TestMongoSpider')
    end

-   it "should not add duplicate URLs" do
-     @spider.send :handle, "http://www.cnn.com", :process_home
-     @db['urls'].find(url: "http://www.cnn.com").count.should == 1
+   it 'should not add duplicate URLs' do
+     @spider.send :handle, 'http://www.cnn.com', :process_home
+     expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
    end

-   it "should add results" do
+   it 'should add results' do
      @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
-     @db['results'].count.should == 1
+     expect(@db['results'].count).to eq(1)
      doc = @db['results'].find_one
-     doc['detail_url'].should == 'http://www.cnn.com'
-     doc['foo'].should == 'bar'
-     doc['spider'].should == 'TestMongoSpider'
+     expect(doc['detail_url']).to eq('http://www.cnn.com')
+     expect(doc['foo']).to eq('bar')
+     expect(doc['spider']).to eq('TestMongoSpider')
    end

-   it "should update existing result" do
+   it 'should update existing result' do
      @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
      @spider.record detail_url: 'http://foo.bar', foo: 'bar'
-     @db['results'].count.should == 1
+     expect(@db['results'].count).to eq(1)
    end

-   it "should add error" do
-     @spider.add_error error: Exception.new("WTF"), url: "http://www.cnn.com", handler: :blah
+   it 'should add error' do
+     @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
      doc = @db['errors'].find_one
-     doc['error'].should == 'Exception'
-     doc['url'].should == 'http://www.cnn.com'
-     doc['handler'].should == :blah
-     doc['message'].should == 'WTF'
-     doc['spider'].should == 'TestMongoSpider'
+     expect(doc['error']).to eq('Exception')
+     expect(doc['url']).to eq('http://www.cnn.com')
+     expect(doc['handler']).to eq(:blah)
+     expect(doc['message']).to eq('WTF')
+     expect(doc['spider']).to eq('TestMongoSpider')
    end
-
-  end
+ end
data/spec/spidey/strategies/moped_spec.rb CHANGED
@@ -4,7 +4,7 @@ require 'moped'
  describe Spidey::Strategies::Moped do
    class TestMopedSpider < Spidey::AbstractSpider
      include Spidey::Strategies::Moped
-     handle "http://www.cnn.com", :process_home
+     handle 'http://www.cnn.com', :process_home

      def result_key(data)
        data[:detail_url]
@@ -21,43 +21,42 @@ describe Spidey::Strategies::Moped do
    end

    after(:each) do
-     %w{ urls results errors }.each{ |col| @db[col].drop }
+     %w( urls results errors ).each { |col| @db[col].drop }
    end

-   it "should add initial URLs to collection" do
-     doc = @db['urls'].find(url: "http://www.cnn.com").first
-     doc['handler'].should == :process_home
-     doc['spider'].should == 'TestMopedSpider'
+   it 'should add initial URLs to collection' do
+     doc = @db['urls'].find(url: 'http://www.cnn.com').first
+     expect(doc['handler']).to eq(:process_home)
+     expect(doc['spider']).to eq('TestMopedSpider')
    end

-   it "should not add duplicate URLs" do
-     @spider.send :handle, "http://www.cnn.com", :process_home
-     @db['urls'].find(url: "http://www.cnn.com").count.should == 1
+   it 'should not add duplicate URLs' do
+     @spider.send :handle, 'http://www.cnn.com', :process_home
+     expect(@db['urls'].find(url: 'http://www.cnn.com').count).to eq(1)
    end

-   it "should add results" do
+   it 'should add results' do
      @spider.record detail_url: 'http://www.cnn.com', foo: 'bar'
-     @db['results'].find.count.should == 1
+     expect(@db['results'].find.count).to eq(1)
      doc = @db['results'].find.first
-     doc['detail_url'].should == 'http://www.cnn.com'
-     doc['foo'].should == 'bar'
-     doc['spider'].should == 'TestMopedSpider'
+     expect(doc['detail_url']).to eq('http://www.cnn.com')
+     expect(doc['foo']).to eq('bar')
+     expect(doc['spider']).to eq('TestMopedSpider')
    end

-   it "should update existing result" do
+   it 'should update existing result' do
      @db['results'].insert key: 'http://foo.bar', detail_url: 'http://foo.bar'
      @spider.record detail_url: 'http://foo.bar', foo: 'bar'
-     @db['results'].find.count.should == 1
+     expect(@db['results'].find.count).to eq(1)
    end

-   it "should add error" do
-     @spider.add_error error: Exception.new("WTF"), url: "http://www.cnn.com", handler: :blah
+   it 'should add error' do
+     @spider.add_error error: Exception.new('WTF'), url: 'http://www.cnn.com', handler: :blah
      doc = @db['errors'].find.first
-     doc['error'].should == 'Exception'
-     doc['url'].should == 'http://www.cnn.com'
-     doc['handler'].should == :blah
-     doc['message'].should == 'WTF'
-     doc['spider'].should == 'TestMopedSpider'
+     expect(doc['error']).to eq('Exception')
+     expect(doc['url']).to eq('http://www.cnn.com')
+     expect(doc['handler']).to eq(:blah)
+     expect(doc['message']).to eq('WTF')
+     expect(doc['spider']).to eq('TestMopedSpider')
    end
-
-  end
+ end
data/spidey-mongo.gemspec CHANGED
@@ -1,29 +1,26 @@
  # -*- encoding: utf-8 -*-
- $:.push File.expand_path("../lib", __FILE__)
- require "spidey-mongo/version"
+ $LOAD_PATH.push File.expand_path('../lib', __FILE__)
+ require 'spidey-mongo/version'

  Gem::Specification.new do |s|
-   s.name        = "spidey-mongo"
+   s.name        = 'spidey-mongo'
    s.version     = Spidey::Mongo::VERSION
-   s.authors     = ["Joey Aghion"]
-   s.email       = ["joey@aghion.com"]
-   s.homepage    = "https://github.com/joeyAghion/spidey-mongo"
-   s.summary     = %q{Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.}
-   s.description = %q{Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.}
+   s.authors     = ['Joey Aghion']
+   s.email       = ['joey@aghion.com']
+   s.homepage    = 'https://github.com/joeyAghion/spidey-mongo'
+   s.summary     = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
+   s.description = 'Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.'
    s.license     = 'MIT'

-   s.rubyforge_project = "spidey-mongo"
+   s.rubyforge_project = 'spidey-mongo'

    s.files         = `git ls-files`.split("\n")
    s.test_files    = `git ls-files -- {test,spec,features}/*`.split("\n")
-   s.executables   = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
-   s.require_paths = ["lib"]
+   s.executables   = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) }
+   s.require_paths = ['lib']

-   s.add_development_dependency "rake"
-   s.add_development_dependency "rspec"
-   s.add_development_dependency "mongo"
-   s.add_development_dependency "bson_ext"
-   s.add_development_dependency "moped"
+   s.add_development_dependency 'rake'
+   s.add_development_dependency 'rspec'

-   s.add_runtime_dependency "spidey", ">= 0.1.0"
+   s.add_runtime_dependency 'spidey', '>= 0.1.0'
  end
metadata CHANGED
@@ -1,110 +1,55 @@
  --- !ruby/object:Gem::Specification
  name: spidey-mongo
  version: !ruby/object:Gem::Version
-   version: 0.2.0
-   prerelease:
+   version: 0.3.0
  platform: ruby
  authors:
  - Joey Aghion
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2013-08-02 00:00:00.000000000 Z
+ date: 2015-11-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: '0'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: '0'
  - !ruby/object:Gem::Dependency
    name: rspec
    requirement: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: '0'
    type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: mongo
-   requirement: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: bson_ext
-   requirement: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
- - !ruby/object:Gem::Dependency
-   name: moped
-   requirement: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
-       - !ruby/object:Gem::Version
-         version: '0'
-   type: :development
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     none: false
-     requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: '0'
  - !ruby/object:Gem::Dependency
    name: spidey
    requirement: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: 0.1.0
    type: :runtime
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
-     none: false
      requirements:
-     - - ! '>='
+     - - '>='
        - !ruby/object:Gem::Version
          version: 0.1.0
  description: Implements a MongoDB back-end for Spidey, a framework for crawling and
@@ -116,6 +61,9 @@ extensions: []
  extra_rdoc_files: []
  files:
  - .gitignore
+ - .travis.yml
+ - CHANGELOG.md
+ - CONTRIBUTING.md
  - Gemfile
  - LICENSE.txt
  - README.md
@@ -123,44 +71,40 @@ files:
  - lib/spidey-mongo.rb
  - lib/spidey-mongo/version.rb
  - lib/spidey/strategies/mongo.rb
+ - lib/spidey/strategies/mongo2.rb
  - lib/spidey/strategies/moped.rb
  - spec/spec_helper.rb
+ - spec/spidey/strategies/mongo2_spec.rb
  - spec/spidey/strategies/mongo_spec.rb
  - spec/spidey/strategies/moped_spec.rb
  - spidey-mongo.gemspec
  homepage: https://github.com/joeyAghion/spidey-mongo
  licenses:
  - MIT
+ metadata: {}
  post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - '>='
      - !ruby/object:Gem::Version
        version: '0'
-   segments:
-   - 0
-   hash: 987129952958952365
  required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - '>='
      - !ruby/object:Gem::Version
        version: '0'
-   segments:
-   - 0
-   hash: 987129952958952365
  requirements: []
  rubyforge_project: spidey-mongo
- rubygems_version: 1.8.25
+ rubygems_version: 2.0.14
  signing_key:
- specification_version: 3
+ specification_version: 4
  summary: Implements a MongoDB back-end for Spidey, a framework for crawling and scraping
    web sites.
  test_files:
  - spec/spec_helper.rb
+ - spec/spidey/strategies/mongo2_spec.rb
  - spec/spidey/strategies/mongo_spec.rb
  - spec/spidey/strategies/moped_spec.rb