harvestman 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/.gitignore +17 -0
- data/Gemfile +4 -0
- data/LICENSE.txt +20 -0
- data/README.md +107 -0
- data/Rakefile +8 -0
- data/examples/etsy.rb +15 -0
- data/examples/wikipedia.rb +18 -0
- data/harvestman.gemspec +25 -0
- data/lib/harvestman.rb +36 -0
- data/lib/harvestman/crawler.rb +24 -0
- data/lib/harvestman/crawler/base.rb +17 -0
- data/lib/harvestman/crawler/fast.rb +19 -0
- data/lib/harvestman/crawler/parser.rb +32 -0
- data/lib/harvestman/crawler/plain.rb +17 -0
- data/lib/harvestman/version.rb +3 -0
- data/spec/harvestman_spec.rb +5 -0
- data/spec/spec_helper.rb +4 -0
- data/test/example1.html +23 -0
- data/test/example2.html +23 -0
- data/test/example3.html +23 -0
- data/test/helper.rb +2 -0
- data/test/test_harvestman.rb +33 -0
- metadata +128 -0
data/.gitignore
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,20 @@
|
|
1
|
+
MIT License
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
4
|
+
a copy of this software and associated documentation files (the
|
5
|
+
"Software"), to deal in the Software without restriction, including
|
6
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
7
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
8
|
+
permit persons to whom the Software is furnished to do so, subject to
|
9
|
+
the following conditions:
|
10
|
+
|
11
|
+
The above copyright notice and this permission notice shall be
|
12
|
+
included in all copies or substantial portions of the Software.
|
13
|
+
|
14
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
15
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
16
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
17
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
18
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
19
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
20
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,107 @@
|
|
1
|
+
# Harvestman
|
2
|
+
|
3
|
+
Harvestman is a very simple, lightweight web crawler for Quick'n'Dirty™ web scraping.<br>
|
4
|
+
It's quite useful for scraping search result pages:
|
5
|
+
|
6
|
+
```ruby
|
7
|
+
require 'harvestman'
|
8
|
+
|
9
|
+
Harvestman.crawl 'http://www.foo.com/bars?page=*', (1..5) do
|
10
|
+
price = css 'div.item-price a'
|
11
|
+
...
|
12
|
+
end
|
13
|
+
```
|
14
|
+
|
15
|
+
**[!] Warning**: this gem is in alpha stage (no tests), don't use it for anything serious.
|
16
|
+
|
17
|
+
## Installation
|
18
|
+
|
19
|
+
Via command line:
|
20
|
+
|
21
|
+
$ gem install harvestman
|
22
|
+
|
23
|
+
## Basic usage
|
24
|
+
|
25
|
+
Harvestman is fairly simple to use: you specify the URL to crawl and pass in a block.
|
26
|
+
Inside the block you can call the ``css`` (or ``xpath``) method to search the HTML document and get the inner text inside each node.
|
27
|
+
See [Nokogiri](http://nokogiri.org/tutorials/searching_a_xml_html_document.html) for more information.
|
28
|
+
|
29
|
+
###### Perhaps this is best understood with an example:
|
30
|
+
|
31
|
+
```ruby
|
32
|
+
Harvestman.crawl "http://www.24pullrequests.com" do
|
33
|
+
headline = xpath "//h3"
|
34
|
+
catchy_phrase = css "div.visible-phone h3"
|
35
|
+
|
36
|
+
puts "Headline: #{headline}"
|
37
|
+
puts "Catchy phrase: #{catchy_phrase}"
|
38
|
+
end
|
39
|
+
```
|
40
|
+
|
41
|
+
## One node to rule them all
|
42
|
+
|
43
|
+
Harvestman assumes there's only one node at the path you passed to the ``css``.
|
44
|
+
If there is **more than one node** at that path, you can pass in an additional block.
|
45
|
+
|
46
|
+
###### Another example:
|
47
|
+
|
48
|
+
```ruby
|
49
|
+
Harvestman.crawl 'http://en.wikipedia.org/wiki/Main_Page' do
|
50
|
+
# Print today's featured article
|
51
|
+
tfa = css "div#mp-tfa"
|
52
|
+
|
53
|
+
puts "Today's featured article: #{tfa}"
|
54
|
+
|
55
|
+
# Print all the sister projects
|
56
|
+
sister_projects = []
|
57
|
+
|
58
|
+
css "div#mp-sister b" do
|
59
|
+
sister_projects << css("a")
|
60
|
+
end
|
61
|
+
|
62
|
+
puts "Sister projects:"
|
63
|
+
sister_projects.each { |sp| puts "- #{sp}" }
|
64
|
+
end
|
65
|
+
```
|
66
|
+
|
67
|
+
Note that inside the block we use ``css("a")`` and *not* ``css("div#mp-sister b a")``. Calls to ``css`` or ``xpath`` here assume ``div#mp-sister b`` is the parent node.
|
68
|
+
|
69
|
+
## Pages / Search results
|
70
|
+
|
71
|
+
If you want to crawl a group of similar pages (eg: search results, as shown above), you can insert a ``*`` somewhere in the URL string and it will be replaced by each element in the second argument.
|
72
|
+
|
73
|
+
###### Final example:
|
74
|
+
|
75
|
+
```ruby
|
76
|
+
require 'harvestman'
|
77
|
+
|
78
|
+
Harvestman.crawl 'http://www.etsy.com/browse/vintage-category/electronics/*', (1..3) do
|
79
|
+
css "div.listing-hover" do
|
80
|
+
title = css "div.title a"
|
81
|
+
price = css "span.listing-price"
|
82
|
+
|
83
|
+
puts "* #{title} (#{price})"
|
84
|
+
end
|
85
|
+
end
|
86
|
+
```
|
87
|
+
|
88
|
+
The above code is going to crawl Etsy's electronics category pages (from 1 to 3) and output every item's title and price. Here we're using a range ``(1..3)`` but you could've passed an array with search queries:
|
89
|
+
|
90
|
+
"http://www.site.com?query=*", ["dogs", "cats", "birds"]
|
91
|
+
|
92
|
+
### Performance
|
93
|
+
When using the * feature described above, each page is run inside a separate thread. You can disable multithreading by passing an additional argument `:plain` to the `crawl` method, like this:
|
94
|
+
|
95
|
+
```ruby
|
96
|
+
require 'harvestman'
|
97
|
+
|
98
|
+
Harvestman.crawl 'http://www.store.com/products?page=*', (1..99), :plain do
|
99
|
+
...
|
100
|
+
end
|
101
|
+
```
|
102
|
+
|
103
|
+
Needless to say, this will greatly decrease performance.
|
104
|
+
|
105
|
+
## License
|
106
|
+
|
107
|
+
See LICENSE.txt
|
data/Rakefile
ADDED
data/examples/etsy.rb
ADDED
@@ -0,0 +1,15 @@
|
|
1
|
+
require 'harvestman'
|
2
|
+
|
3
|
+
# Crawl Etsy's electronics category pages (from 1 to 3) and output every item's
|
4
|
+
# title and price.
|
5
|
+
|
6
|
+
base_url = 'http://www.etsy.com/browse/vintage-category/electronics/*'
|
7
|
+
|
8
|
+
Harvestman.crawl base_url, (1..3) do
|
9
|
+
css "div.listing-hover" do
|
10
|
+
title = css "div.title a"
|
11
|
+
price = css "span.listing-price"
|
12
|
+
|
13
|
+
puts "* #{title} (#{price})"
|
14
|
+
end
|
15
|
+
end
|
@@ -0,0 +1,18 @@
|
|
1
|
+
require 'harvestman'
|
2
|
+
|
3
|
+
Harvestman.crawl 'http://en.wikipedia.org/wiki/Main_Page' do
|
4
|
+
# Print today's featured article
|
5
|
+
tfa = css "div#mp-tfa"
|
6
|
+
|
7
|
+
puts "Today's featured article: #{tfa}"
|
8
|
+
|
9
|
+
# Print all the sister projects
|
10
|
+
sister_projects = []
|
11
|
+
|
12
|
+
css "div#mp-sister b" do
|
13
|
+
sister_projects << css("a")
|
14
|
+
end
|
15
|
+
|
16
|
+
puts "Sister projects:"
|
17
|
+
sister_projects.each { |sp| puts "- #{sp}" }
|
18
|
+
end
|
data/harvestman.gemspec
ADDED
@@ -0,0 +1,25 @@
|
|
1
|
+
# -*- encoding: utf-8 -*-
|
2
|
+
lib = File.expand_path('../lib', __FILE__)
|
3
|
+
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
4
|
+
require 'harvestman/version'
|
5
|
+
|
6
|
+
Gem::Specification.new do |gem|
|
7
|
+
gem.name = "harvestman"
|
8
|
+
gem.version = Harvestman::VERSION
|
9
|
+
gem.authors = ["Gabriel Vieira"]
|
10
|
+
gem.email = ["gluisvieira@gmail.com"]
|
11
|
+
gem.summary = %q{Lightweight web crawler}
|
12
|
+
gem.homepage = ""
|
13
|
+
|
14
|
+
# Runtime dependencies
|
15
|
+
gem.add_dependency "nokogiri", "~> 1.5.6"
|
16
|
+
|
17
|
+
# Development dependencies
|
18
|
+
gem.add_development_dependency "rake"
|
19
|
+
gem.add_development_dependency "rspec", "~> 2.0"
|
20
|
+
|
21
|
+
gem.files = `git ls-files`.split($/)
|
22
|
+
gem.executables = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
|
23
|
+
gem.test_files = gem.files.grep(%r{^(test|spec|features)/})
|
24
|
+
gem.require_paths = ["lib"]
|
25
|
+
end
|
data/lib/harvestman.rb
ADDED
@@ -0,0 +1,36 @@
|
|
1
|
+
require "nokogiri"
|
2
|
+
require "open-uri"
|
3
|
+
|
4
|
+
require 'harvestman/version'
|
5
|
+
require 'harvestman/crawler'
|
6
|
+
|
7
|
+
module Harvestman
|
8
|
+
# Public: Crawl a website. You can visit similar URLs (eg: pages in a search
|
9
|
+
# result) by passing an optional argument.
|
10
|
+
#
|
11
|
+
# url - A String containing the url to be crawled.
|
12
|
+
# pages - Zero or more Strings that will replace a * in the
|
13
|
+
# base url. Note: this does not need to be an Array.
|
14
|
+
# type - Optional. You can use a "plain" (default) or "fast" crawler.
|
15
|
+
# Fast mode uses threads for performance.
|
16
|
+
#
|
17
|
+
# Example: Crawl Etsy.com, printing the title and price of each item in
|
18
|
+
# pages 1, 2 and 3 of the Electronics category.
|
19
|
+
#
|
20
|
+
# Harvestman.crawl 'http://www.etsy.com/browse/vintage-category/electronics/*', (1..3) do
|
21
|
+
# css "div.listing-hover" do
|
22
|
+
# title = css "div.title a"
|
23
|
+
# price = css "span.listing-price"
|
24
|
+
#
|
25
|
+
# puts "* #{title} (#{price})"
|
26
|
+
# end
|
27
|
+
# end
|
28
|
+
#
|
29
|
+
# Returns nothing.
|
30
|
+
def self.crawl(url, pages = nil, type = :fast, &block)
|
31
|
+
crawler = Harvestman::Crawler.new(url, pages, type)
|
32
|
+
if block_given?
|
33
|
+
crawler.crawl(&block)
|
34
|
+
end
|
35
|
+
end
|
36
|
+
end
|
@@ -0,0 +1,24 @@
|
|
1
|
+
module Harvestman
|
2
|
+
module Crawler
|
3
|
+
# Raised when given crawler type is not registered.
|
4
|
+
class UnknownCrawler < ArgumentError; end
|
5
|
+
|
6
|
+
def self.register(type, klass)
|
7
|
+
@crawlers ||= {}
|
8
|
+
@crawlers[type] = klass
|
9
|
+
end
|
10
|
+
|
11
|
+
def self.new(base_url, pages, type)
|
12
|
+
if crawler = @crawlers[type]
|
13
|
+
crawler.new(base_url, pages)
|
14
|
+
else
|
15
|
+
raise UnknownCrawler, "No such type: #{type}"
|
16
|
+
end
|
17
|
+
end
|
18
|
+
|
19
|
+
require 'harvestman/crawler/parser'
|
20
|
+
require 'harvestman/crawler/base'
|
21
|
+
require 'harvestman/crawler/plain'
|
22
|
+
require 'harvestman/crawler/fast'
|
23
|
+
end
|
24
|
+
end
|
@@ -0,0 +1,17 @@
|
|
1
|
+
module Harvestman
|
2
|
+
module Crawler
|
3
|
+
class Base
|
4
|
+
def initialize(base_url, pages)
|
5
|
+
@base_url = base_url
|
6
|
+
@pages = pages
|
7
|
+
end
|
8
|
+
|
9
|
+
protected
|
10
|
+
|
11
|
+
def crawl_url(url, &block)
|
12
|
+
parser = Parser.new(url)
|
13
|
+
parser.instance_eval(&block)
|
14
|
+
end
|
15
|
+
end
|
16
|
+
end
|
17
|
+
end
|
@@ -0,0 +1,19 @@
|
|
1
|
+
module Harvestman
|
2
|
+
module Crawler
|
3
|
+
class Fast < Base
|
4
|
+
def crawl(&block)
|
5
|
+
if @pages.nil?
|
6
|
+
crawl_url(@base_url, &block)
|
7
|
+
else
|
8
|
+
threads = []
|
9
|
+
@pages.each do |p|
|
10
|
+
threads << Thread.new(p) { |page| crawl_url(@base_url.gsub('*', p.to_s), &block) }
|
11
|
+
end
|
12
|
+
threads.each { |t| t.join }
|
13
|
+
end
|
14
|
+
end
|
15
|
+
end
|
16
|
+
|
17
|
+
register :fast, Fast
|
18
|
+
end
|
19
|
+
end
|
@@ -0,0 +1,32 @@
|
|
1
|
+
module Harvestman
|
2
|
+
module Crawler
|
3
|
+
class Parser
|
4
|
+
def initialize(url)
|
5
|
+
@document = Nokogiri::HTML(open(url))
|
6
|
+
end
|
7
|
+
|
8
|
+
def css(path, &block)
|
9
|
+
parse(:css, path, &block)
|
10
|
+
end
|
11
|
+
|
12
|
+
def xpath(path, &block)
|
13
|
+
parse(:xpath, path, &block)
|
14
|
+
end
|
15
|
+
|
16
|
+
private
|
17
|
+
|
18
|
+
def parse(path_type, path, &block)
|
19
|
+
if block_given?
|
20
|
+
@document.send(path_type, path).each do |node|
|
21
|
+
doc = @document
|
22
|
+
@document = node
|
23
|
+
instance_eval(&block)
|
24
|
+
@document = doc
|
25
|
+
end
|
26
|
+
else
|
27
|
+
@document.send("at_#{path_type}", path).inner_text
|
28
|
+
end
|
29
|
+
end
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
@@ -0,0 +1,17 @@
|
|
1
|
+
module Harvestman
|
2
|
+
module Crawler
|
3
|
+
class Plain < Base
|
4
|
+
def crawl(&block)
|
5
|
+
if @pages.nil?
|
6
|
+
crawl_url(@base_url, &block)
|
7
|
+
else
|
8
|
+
@pages.each do |p|
|
9
|
+
crawl_url(@base_url.gsub('*', p.to_s), &block)
|
10
|
+
end
|
11
|
+
end
|
12
|
+
end
|
13
|
+
end
|
14
|
+
|
15
|
+
register :plain, Plain
|
16
|
+
end
|
17
|
+
end
|
data/spec/spec_helper.rb
ADDED
data/test/example1.html
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
<html lang="en">
|
2
|
+
<head>
|
3
|
+
<meta charset="UTF-8" />
|
4
|
+
<title>ex1</title>
|
5
|
+
</head>
|
6
|
+
<body>
|
7
|
+
<header>
|
8
|
+
<div class="title">
|
9
|
+
<h1>ex1_header_h1</h1>
|
10
|
+
</div>
|
11
|
+
</header>
|
12
|
+
<div class="main">
|
13
|
+
<ul>
|
14
|
+
<li>ex1_main_li_1</li>
|
15
|
+
<li>ex1_main_li_2</li>
|
16
|
+
<li>ex1_main_li_3</li>
|
17
|
+
</ul>
|
18
|
+
</div>
|
19
|
+
<footer>
|
20
|
+
<span><a href="#">ex1_footer_span_a</a></span>
|
21
|
+
</footer>
|
22
|
+
</body>
|
23
|
+
</html>
|
data/test/example2.html
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
<html lang="en">
|
2
|
+
<head>
|
3
|
+
<meta charset="UTF-8" />
|
4
|
+
<title>ex2</title>
|
5
|
+
</head>
|
6
|
+
<body>
|
7
|
+
<header>
|
8
|
+
<div class="title">
|
9
|
+
<h1>ex2_header_h1</h1>
|
10
|
+
</div>
|
11
|
+
</header>
|
12
|
+
<div class="main">
|
13
|
+
<ul>
|
14
|
+
<li>ex2_main_li_1</li>
|
15
|
+
<li>ex2_main_li_2</li>
|
16
|
+
<li>ex2_main_li_3</li>
|
17
|
+
</ul>
|
18
|
+
</div>
|
19
|
+
<footer>
|
20
|
+
<span><a href="#">ex2_footer_span_a</a></span>
|
21
|
+
</footer>
|
22
|
+
</body>
|
23
|
+
</html>
|
data/test/example3.html
ADDED
@@ -0,0 +1,23 @@
|
|
1
|
+
<html lang="en">
|
2
|
+
<head>
|
3
|
+
<meta charset="UTF-8" />
|
4
|
+
<title>ex3</title>
|
5
|
+
</head>
|
6
|
+
<body>
|
7
|
+
<header>
|
8
|
+
<div class="title">
|
9
|
+
<h1>ex3_header_h1</h1>
|
10
|
+
</div>
|
11
|
+
</header>
|
12
|
+
<div class="main">
|
13
|
+
<ul>
|
14
|
+
<li>ex3_main_li_1</li>
|
15
|
+
<li>ex3_main_li_2</li>
|
16
|
+
<li>ex3_main_li_3</li>
|
17
|
+
</ul>
|
18
|
+
</div>
|
19
|
+
<footer>
|
20
|
+
<span><a href="#">ex3_footer_span_a</a></span>
|
21
|
+
</footer>
|
22
|
+
</body>
|
23
|
+
</html>
|
data/test/helper.rb
ADDED
@@ -0,0 +1,33 @@
|
|
1
|
+
require 'helper'
|
2
|
+
|
3
|
+
class TestHarvestman < Test::Unit::TestCase
|
4
|
+
def test_namespace
|
5
|
+
assert Harvestman.is_a?(Module)
|
6
|
+
end
|
7
|
+
|
8
|
+
def test_scraping
|
9
|
+
results = []
|
10
|
+
|
11
|
+
Harvestman.crawl "test/example*.html", (1..3), :plain do
|
12
|
+
r = {
|
13
|
+
:title => css("head title"),
|
14
|
+
:header => css("header div.title h1"),
|
15
|
+
:footer => css("footer span a"),
|
16
|
+
:list => []
|
17
|
+
}
|
18
|
+
|
19
|
+
css "div.main ul" do
|
20
|
+
r[:list] << css("li")
|
21
|
+
end
|
22
|
+
|
23
|
+
results << r
|
24
|
+
end
|
25
|
+
|
26
|
+
results.each_with_index do |r, i|
|
27
|
+
assert_equal(r[:title], "ex#{i+1}")
|
28
|
+
assert_equal(r[:header], "#{r[:title]}_header_h1")
|
29
|
+
assert_equal(r[:footer], "#{r[:title]}_footer_span_a")
|
30
|
+
assert_equal(r[:list].count, 3)
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
metadata
ADDED
@@ -0,0 +1,128 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: harvestman
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.1.1
|
5
|
+
prerelease:
|
6
|
+
platform: ruby
|
7
|
+
authors:
|
8
|
+
- Gabriel Vieira
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2013-08-28 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: nokogiri
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
none: false
|
18
|
+
requirements:
|
19
|
+
- - ~>
|
20
|
+
- !ruby/object:Gem::Version
|
21
|
+
version: 1.5.6
|
22
|
+
type: :runtime
|
23
|
+
prerelease: false
|
24
|
+
version_requirements: !ruby/object:Gem::Requirement
|
25
|
+
none: false
|
26
|
+
requirements:
|
27
|
+
- - ~>
|
28
|
+
- !ruby/object:Gem::Version
|
29
|
+
version: 1.5.6
|
30
|
+
- !ruby/object:Gem::Dependency
|
31
|
+
name: rake
|
32
|
+
requirement: !ruby/object:Gem::Requirement
|
33
|
+
none: false
|
34
|
+
requirements:
|
35
|
+
- - ! '>='
|
36
|
+
- !ruby/object:Gem::Version
|
37
|
+
version: '0'
|
38
|
+
type: :development
|
39
|
+
prerelease: false
|
40
|
+
version_requirements: !ruby/object:Gem::Requirement
|
41
|
+
none: false
|
42
|
+
requirements:
|
43
|
+
- - ! '>='
|
44
|
+
- !ruby/object:Gem::Version
|
45
|
+
version: '0'
|
46
|
+
- !ruby/object:Gem::Dependency
|
47
|
+
name: rspec
|
48
|
+
requirement: !ruby/object:Gem::Requirement
|
49
|
+
none: false
|
50
|
+
requirements:
|
51
|
+
- - ~>
|
52
|
+
- !ruby/object:Gem::Version
|
53
|
+
version: '2.0'
|
54
|
+
type: :development
|
55
|
+
prerelease: false
|
56
|
+
version_requirements: !ruby/object:Gem::Requirement
|
57
|
+
none: false
|
58
|
+
requirements:
|
59
|
+
- - ~>
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '2.0'
|
62
|
+
description:
|
63
|
+
email:
|
64
|
+
- gluisvieira@gmail.com
|
65
|
+
executables: []
|
66
|
+
extensions: []
|
67
|
+
extra_rdoc_files: []
|
68
|
+
files:
|
69
|
+
- .gitignore
|
70
|
+
- Gemfile
|
71
|
+
- LICENSE.txt
|
72
|
+
- README.md
|
73
|
+
- Rakefile
|
74
|
+
- examples/etsy.rb
|
75
|
+
- examples/wikipedia.rb
|
76
|
+
- harvestman.gemspec
|
77
|
+
- lib/harvestman.rb
|
78
|
+
- lib/harvestman/crawler.rb
|
79
|
+
- lib/harvestman/crawler/base.rb
|
80
|
+
- lib/harvestman/crawler/fast.rb
|
81
|
+
- lib/harvestman/crawler/parser.rb
|
82
|
+
- lib/harvestman/crawler/plain.rb
|
83
|
+
- lib/harvestman/version.rb
|
84
|
+
- spec/harvestman_spec.rb
|
85
|
+
- spec/spec_helper.rb
|
86
|
+
- test/example1.html
|
87
|
+
- test/example2.html
|
88
|
+
- test/example3.html
|
89
|
+
- test/helper.rb
|
90
|
+
- test/test_harvestman.rb
|
91
|
+
homepage: ''
|
92
|
+
licenses: []
|
93
|
+
post_install_message:
|
94
|
+
rdoc_options: []
|
95
|
+
require_paths:
|
96
|
+
- lib
|
97
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
98
|
+
none: false
|
99
|
+
requirements:
|
100
|
+
- - ! '>='
|
101
|
+
- !ruby/object:Gem::Version
|
102
|
+
version: '0'
|
103
|
+
segments:
|
104
|
+
- 0
|
105
|
+
hash: 129965805
|
106
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
107
|
+
none: false
|
108
|
+
requirements:
|
109
|
+
- - ! '>='
|
110
|
+
- !ruby/object:Gem::Version
|
111
|
+
version: '0'
|
112
|
+
segments:
|
113
|
+
- 0
|
114
|
+
hash: 129965805
|
115
|
+
requirements: []
|
116
|
+
rubyforge_project:
|
117
|
+
rubygems_version: 1.8.25
|
118
|
+
signing_key:
|
119
|
+
specification_version: 3
|
120
|
+
summary: Lightweight web crawler
|
121
|
+
test_files:
|
122
|
+
- spec/harvestman_spec.rb
|
123
|
+
- spec/spec_helper.rb
|
124
|
+
- test/example1.html
|
125
|
+
- test/example2.html
|
126
|
+
- test/example3.html
|
127
|
+
- test/helper.rb
|
128
|
+
- test/test_harvestman.rb
|