spidey-mongo 0.0.1 → 0.0.3

data/LICENSE.txt ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2012 Joey Aghion, Art.sy Inc.
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,67 @@
1
+ Spidey-Mongo
2
+ ============
3
+
4
+ This gem implements a [MongoDB](http://www.mongodb.org/) back-end for [Spidey](https://github.com/joeyAghion/spidey), a very simple framework for crawling and scraping web sites.
5
+
6
+ See [Spidey](https://github.com/joeyAghion/spidey)'s documentation for a basic example spider class.
7
+
8
+ The default implementation stores the queue of URLs being crawled, any generated results, and errors as attributes on the spider instance (i.e., in memory). By including this gem's `Spidey::Strategies::Mongo` module, spider implementations can store them in a MongoDB database instead.
9
+
10
+ Usage
11
+ -----
12
+
13
+ ### Install the gem
14
+
15
+ gem install spidey-mongo
16
+
17
+
18
+ ### Example spider class
19
+
20
+ class EbaySpider < Spidey::AbstractSpider
21
+ include Spidey::Strategies::Mongo
22
+
23
+ handle "http://www.ebay.com", :process_home
24
+
25
+ def process_home(page, default_data = {})
26
+ # ...
27
+ end
28
+ end
29
+
30
+ ### Invocation
31
+
32
+ The spider's constructor accepts new parameters for each of the MongoDB collections to employ: `url_collection`, `result_collection`, and `error_collection`.
33
+
34
+ db = Mongo::Connection.new['example']
35
+
36
+ spider = EbaySpider.new(
37
+ url_collection: db['urls'],
38
+ result_collection: db['results'],
39
+ error_collection: db['errors'])
40
+
41
+ With persistent storage of the URL-crawling queue, it's now possible to stop crawling and resume at a later point. The `crawl` method accepts a new optional `crawl_for` parameter specifying the number of seconds after which to stop.
42
+
43
+ spider.crawl crawl_for: 600 # seconds, or more conveniently (w/ActiveSupport): 10.minutes
44
+
45
+ (The base implementation's `max_urls` parameter is also useful for this purpose.)
46
+
47
+ ### Recording Results
48
+
49
+ By default, invocations of `record(data)` by the spider simply insert new documents into the result collection. If corresponding results may already exist in the collection and should instead be updated, use the `set_result_key` helper (with a proc or method symbol) to specify how to find the document to update:
50
+
51
+ class EbaySpider < Spidey::AbstractSpider
52
+ include Spidey::Strategies::Mongo
53
+ set_result_key ->(data) { data[:auction_id] }
54
+
55
+ # ...
56
+ end
57
+
58
+ This performs an `upsert` instead of the usual `insert` (i.e., an update if a result document matching the key already exists, or insert otherwise).
59
+
60
+ Testing
61
+ -------
62
+
63
+ bundle exec rspec
64
+
65
+ Copyright
66
+ ---------
67
+ Copyright (c) 2012 Joey Aghion, Art.sy Inc. See [LICENSE.txt](LICENSE.txt) for further details.
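Note on the README's "Recording Results" section: the insert-versus-upsert behaviour it describes can be illustrated with a small sketch. This is not the gem's implementation. `FakeCollection`, the free-standing `record` helper, and the `spider_key` field are hypothetical names invented for illustration, and an in-memory array stands in for a MongoDB collection so the example runs without a database.

```ruby
# Hypothetical in-memory stand-in for a MongoDB collection.
# Not part of spidey-mongo or the mongo driver.
class FakeCollection
  attr_reader :docs

  def initialize
    @docs = []
  end

  # Always appends a new document.
  def insert(doc)
    @docs << doc.dup
  end

  # Merges doc into the first document matching every key/value pair in
  # selector, or appends selector + doc when nothing matches.
  def upsert(selector, doc)
    index = @docs.index { |d| selector.all? { |k, v| d[k] == v } }
    if index
      @docs[index] = @docs[index].merge(doc)
    else
      @docs << selector.merge(doc)
    end
  end
end

# Sketch of the documented behaviour: plain insert when no result key is
# configured, upsert keyed on result_key.call(data) otherwise.
# The :spider_key field name is an assumption made for this example.
def record(collection, data, result_key = nil)
  if result_key
    collection.upsert({ spider_key: result_key.call(data) }, data)
  else
    collection.insert(data)
  end
end

# Usage: recording the same auction twice with a result key updates in place.
keyed = FakeCollection.new
by_auction = ->(data) { data[:auction_id] }
record(keyed, { auction_id: 1, price: 10 }, by_auction)
record(keyed, { auction_id: 1, price: 12 }, by_auction)
puts keyed.docs.size

# Without a result key, each call adds another document.
plain = FakeCollection.new
record(plain, { auction_id: 1, price: 10 })
record(plain, { auction_id: 1, price: 12 })
puts plain.docs.size
```

With a key, the second `record` call overwrites the price on the existing document. Without one, the collection ends up holding both versions, which is the duplicate-result situation `set_result_key` is meant to avoid.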
data/lib/spidey-mongo/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  module Spidey
2
2
  module Mongo
3
- VERSION = "0.0.1"
3
+ VERSION = "0.0.3"
4
4
  end
5
5
  end
data/spidey-mongo.gemspec CHANGED
@@ -10,6 +10,7 @@ Gem::Specification.new do |s|
10
10
  s.homepage = "https://github.com/joeyAghion/spidey-mongo"
11
11
  s.summary = %q{Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.}
12
12
  s.description = %q{Implements a MongoDB back-end for Spidey, a framework for crawling and scraping web sites.}
13
+ s.license = 'MIT'
13
14
 
14
15
  s.rubyforge_project = "spidey-mongo"
15
16
 
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: spidey-mongo
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.1
4
+ version: 0.0.3
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -13,7 +13,7 @@ date: 2012-06-27 00:00:00.000000000Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: rake
16
- requirement: &70361603997680 !ruby/object:Gem::Requirement
16
+ requirement: &70227312727160 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: '0'
22
22
  type: :development
23
23
  prerelease: false
24
- version_requirements: *70361603997680
24
+ version_requirements: *70227312727160
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: rspec
27
- requirement: &70361603997220 !ruby/object:Gem::Requirement
27
+ requirement: &70227312726140 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: '0'
33
33
  type: :development
34
34
  prerelease: false
35
- version_requirements: *70361603997220
35
+ version_requirements: *70227312726140
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: spidey
38
- requirement: &70361603996320 !ruby/object:Gem::Requirement
38
+ requirement: &70227312725220 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: '0'
44
44
  type: :runtime
45
45
  prerelease: false
46
- version_requirements: *70361603996320
46
+ version_requirements: *70227312725220
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: mongo
49
- requirement: &70361603995160 !ruby/object:Gem::Requirement
49
+ requirement: &70227312724420 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: '0'
55
55
  type: :runtime
56
56
  prerelease: false
57
- version_requirements: *70361603995160
57
+ version_requirements: *70227312724420
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: bson_ext
60
- requirement: &70361603994540 !ruby/object:Gem::Requirement
60
+ requirement: &70227312717760 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ! '>='
@@ -65,7 +65,7 @@ dependencies:
65
65
  version: '0'
66
66
  type: :runtime
67
67
  prerelease: false
68
- version_requirements: *70361603994540
68
+ version_requirements: *70227312717760
69
69
  description: Implements a MongoDB back-end for Spidey, a framework for crawling and
70
70
  scraping web sites.
71
71
  email:
@@ -76,6 +76,8 @@ extra_rdoc_files: []
76
76
  files:
77
77
  - .gitignore
78
78
  - Gemfile
79
+ - LICENSE.txt
80
+ - README.md
79
81
  - Rakefile
80
82
  - lib/spidey-mongo.rb
81
83
  - lib/spidey-mongo/version.rb
@@ -84,7 +86,8 @@ files:
84
86
  - spec/spidey/strategies/mongo_spec.rb
85
87
  - spidey-mongo.gemspec
86
88
  homepage: https://github.com/joeyAghion/spidey-mongo
87
- licenses: []
89
+ licenses:
90
+ - MIT
88
91
  post_install_message:
89
92
  rdoc_options: []
90
93
  require_paths:
@@ -97,7 +100,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
97
100
  version: '0'
98
101
  segments:
99
102
  - 0
100
- hash: 3377333768066102144
103
+ hash: 128629480123059091
101
104
  required_rubygems_version: !ruby/object:Gem::Requirement
102
105
  none: false
103
106
  requirements:
@@ -106,7 +109,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
106
109
  version: '0'
107
110
  segments:
108
111
  - 0
109
- hash: 3377333768066102144
112
+ hash: 128629480123059091
110
113
  requirements: []
111
114
  rubyforge_project: spidey-mongo
112
115
  rubygems_version: 1.8.10