harvestman 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,17 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .yardoc
6
+ Gemfile.lock
7
+ InstalledFiles
8
+ _yardoc
9
+ coverage
10
+ doc/
11
+ lib/bundler/man
12
+ pkg
13
+ rdoc
14
+ spec/reports
15
+ test/tmp
16
+ test/version_tmp
17
+ tmp
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in harvestman.gemspec
4
+ gemspec
@@ -0,0 +1,20 @@
1
+ MIT License
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,107 @@
1
+ # Harvestman
2
+
3
+ Harvestman is a very simple, lightweight web crawler for Quick'n'Dirty™ web scraping.<br>
4
+ It's quite useful for scraping search result pages:
5
+
6
+ ```ruby
7
+ require 'harvestman'
8
+
9
+ Harvestman.crawl 'http://www.foo.com/bars?page=*', (1..5) do
10
+ price = css 'div.item-price a'
11
+ ...
12
+ end
13
+ ```
14
+
15
+ **[!] Warning**: this gem is in alpha stage (no tests), don't use it for anything serious.
16
+
17
+ ## Installation
18
+
19
+ Via command line:
20
+
21
+ $ gem install harvestman
22
+
23
+ ## Basic usage
24
+
25
+ Harvestman is fairly simple to use: you specify the URL to crawl and pass in a block.
26
+ Inside the block you can call the ``css`` (or ``xpath``) method to search the HTML document and get the inner text inside each node.
27
+ See [Nokogiri](http://nokogiri.org/tutorials/searching_a_xml_html_document.html) for more information.
28
+
29
+ ###### Perhaps this is best understood with an example:
30
+
31
+ ```ruby
32
+ Harvestman.crawl "http://www.24pullrequests.com" do
33
+ headline = xpath "//h3"
34
+ catchy_phrase = css "div.visible-phone h3"
35
+
36
+ puts "Headline: #{headline}"
37
+ puts "Catchy phrase: #{catchy_phrase}"
38
+ end
39
+ ```
40
+
41
+ ## One node to rule them all
42
+
43
+ Harvestman assumes there's only one node at the path you passed to the ``css`` method.
44
+ If there is **more than one node** at that path, you can pass in an additional block.
45
+
46
+ ###### Another example:
47
+
48
+ ```ruby
49
+ Harvestman.crawl 'http://en.wikipedia.org/wiki/Main_Page' do
50
+ # Print today's featured article
51
+ tfa = css "div#mp-tfa"
52
+
53
+ puts "Today's featured article: #{tfa}"
54
+
55
+ # Print all the sister projects
56
+ sister_projects = []
57
+
58
+ css "div#mp-sister b" do
59
+ sister_projects << css("a")
60
+ end
61
+
62
+ puts "Sister projects:"
63
+ sister_projects.each { |sp| puts "- #{sp}" }
64
+ end
65
+ ```
66
+
67
+ Note that inside the block we use ``css("a")`` and *not* ``css("div#mp-sister b a")``. Calls to ``css`` or ``xpath`` here assume ``div#mp-sister b`` is the parent node.
68
+
69
+ ## Pages / Search results
70
+
71
+ If you want to crawl a group of similar pages (eg: search results, as shown above), you can insert a ``*`` somewhere in the URL string and it will be replaced by each element in the second argument.
72
+
73
+ ###### Final example:
74
+
75
+ ```ruby
76
+ require 'harvestman'
77
+
78
+ Harvestman.crawl 'http://www.etsy.com/browse/vintage-category/electronics/*', (1..3) do
79
+ css "div.listing-hover" do
80
+ title = css "div.title a"
81
+ price = css "span.listing-price"
82
+
83
+ puts "* #{title} (#{price})"
84
+ end
85
+ end
86
+ ```
87
+
88
+ The above code is going to crawl Etsy's electronics category pages (from 1 to 3) and output every item's title and price. Here we're using a range ``(1..3)``, but you could have passed an array of search queries instead:
89
+
90
+ "http://www.site.com?query=*", ["dogs", "cats", "birds"]
91
+
92
+ ### Performance
93
+ When using the * feature described above, each page is run inside a separate thread. You can disable multithreading by passing an additional argument `:plain` to the `crawl` method, like this:
94
+
95
+ ```ruby
96
+ require 'harvestman'
97
+
98
+ Harvestman.crawl 'http://www.store.com/products?page=*', (1..99), :plain do
99
+ ...
100
+ end
101
+ ```
102
+
103
+ Needless to say, this will greatly decrease performance.
104
+
105
+ ## License
106
+
107
+ See LICENSE.txt
@@ -0,0 +1,8 @@
# Rake tasks: gem packaging (via Bundler) plus the Test::Unit suite.
require "bundler/gem_tasks"
require "rake/testtask"

# Run the tests under test/ with `rake test`.
Rake::TestTask.new do |test_task|
  test_task.libs << "test"
end

task :default => :test
@@ -0,0 +1,15 @@
require 'harvestman'

# Crawl pages 1 through 3 of Etsy's vintage electronics category and print
# each item's title and price.

base_url = 'http://www.etsy.com/browse/vintage-category/electronics/*'

Harvestman.crawl base_url, (1..3) do
  # Each listing is wrapped in a div.listing-hover node.
  css "div.listing-hover" do
    item_title = css "div.title a"
    item_price = css "span.listing-price"

    puts "* #{item_title} (#{item_price})"
  end
end
@@ -0,0 +1,18 @@
require 'harvestman'

Harvestman.crawl 'http://en.wikipedia.org/wiki/Main_Page' do
  # Today's featured article lives in the div#mp-tfa node.
  tfa = css "div#mp-tfa"

  puts "Today's featured article: #{tfa}"

  # Collect the sister-project links, one per bold entry, then print them.
  sister_projects = []

  css "div#mp-sister b" do
    sister_projects << css("a")
  end

  puts "Sister projects:"
  sister_projects.each { |sp| puts "- #{sp}" }
end
@@ -0,0 +1,25 @@
# -*- encoding: utf-8 -*-
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'harvestman/version'

Gem::Specification.new do |gem|
  gem.name          = "harvestman"
  gem.version       = Harvestman::VERSION
  gem.authors       = ["Gabriel Vieira"]
  gem.email         = ["gluisvieira@gmail.com"]
  gem.summary       = %q{Lightweight web crawler}
  gem.description   = %q{Harvestman is a simple, lightweight web crawler for quick web scraping.}
  gem.homepage      = ""
  # The bundled LICENSE.txt is MIT; declare it so RubyGems/tools can see it.
  gem.license       = "MIT"

  # Runtime dependencies
  gem.add_dependency "nokogiri", "~> 1.5.6"

  # Development dependencies
  gem.add_development_dependency "rake"
  gem.add_development_dependency "rspec", "~> 2.0"

  gem.files         = `git ls-files`.split($/)
  gem.executables   = gem.files.grep(%r{^bin/}).map{ |f| File.basename(f) }
  gem.test_files    = gem.files.grep(%r{^(test|spec|features)/})
  gem.require_paths = ["lib"]
end
@@ -0,0 +1,36 @@
require "nokogiri"
require "open-uri"

require 'harvestman/version'
require 'harvestman/crawler'

module Harvestman
  # Public: Crawl a website. You can visit similar URLs (eg: pages in a search
  # result) by passing an optional argument.
  #
  # url   - A String containing the url to be crawled.
  # pages - Zero or more Strings that will replace a * in the
  #         base url. Note: this does not need to be an Array.
  # type  - Optional crawler type. :fast (the default) crawls each page in
  #         its own thread; :plain crawls pages sequentially.
  #         (The previous comment called :plain the default, contradicting
  #         the actual `type = :fast` signature below.)
  #
  # Example: Crawl Etsy.com, printing the title and price of each item in
  # pages 1, 2 and 3 of the Electronics category.
  #
  #   Harvestman.crawl 'http://www.etsy.com/browse/vintage-category/electronics/*', (1..3) do
  #     css "div.listing-hover" do
  #       title = css "div.title a"
  #       price = css "span.listing-price"
  #
  #       puts "* #{title} (#{price})"
  #     end
  #   end
  #
  # Returns nothing.
  def self.crawl(url, pages = nil, type = :fast, &block)
    crawler = Harvestman::Crawler.new(url, pages, type)
    # Without a block there is nothing to evaluate against each page.
    crawler.crawl(&block) if block_given?
  end
end
@@ -0,0 +1,24 @@
module Harvestman
  module Crawler
    # Raised when the requested crawler type has not been registered.
    class UnknownCrawler < ArgumentError; end

    # Register a crawler implementation under a symbolic +type+ key.
    def self.register(type, klass)
      @crawlers ||= {}
      @crawlers[type] = klass
    end

    # Factory: instantiate the crawler class registered for +type+.
    # Raises UnknownCrawler for unregistered types.
    def self.new(base_url, pages, type)
      klass = @crawlers[type]
      raise UnknownCrawler, "No such type: #{type}" unless klass
      klass.new(base_url, pages)
    end

    # Load the concrete crawlers, which self-register on load.
    require 'harvestman/crawler/parser'
    require 'harvestman/crawler/base'
    require 'harvestman/crawler/plain'
    require 'harvestman/crawler/fast'
  end
end
@@ -0,0 +1,17 @@
module Harvestman
  module Crawler
    # Shared plumbing for concrete crawlers: stores the target URL (and the
    # optional page substitutions) and knows how to run a crawl block
    # against a single fetched page.
    class Base
      def initialize(base_url, pages)
        @base_url = base_url
        @pages    = pages
      end

      protected

      # Fetch +url+, then evaluate the crawl block in the context of a
      # Parser so the block can call css/xpath directly.
      def crawl_url(url, &block)
        Parser.new(url).instance_eval(&block)
      end
    end
  end
end
@@ -0,0 +1,19 @@
module Harvestman
  module Crawler
    # Multithreaded crawler: each page substitution is fetched in its own
    # thread. This is the default crawler type.
    class Fast < Base
      # Crawl the base URL directly when no pages were given; otherwise
      # spawn one thread per page and wait for all of them.
      #
      # Returns nothing meaningful.
      def crawl(&block)
        if @pages.nil?
          crawl_url(@base_url, &block)
        else
          threads = []
          @pages.each do |p|
            # Pass the page via Thread.new and use the thread-local +page+
            # argument. The original passed +p+ in but then captured the
            # outer variable anyway, leaving the argument unused and
            # defeating the capture-avoidance idiom.
            threads << Thread.new(p) { |page| crawl_url(@base_url.gsub('*', page.to_s), &block) }
          end
          threads.each { |t| t.join }
        end
      end
    end

    register :fast, Fast
  end
end
@@ -0,0 +1,32 @@
module Harvestman
  module Crawler
    # Wraps a Nokogiri HTML document and evaluates crawl blocks against it.
    # Crawl blocks are instance_eval'd here, which is what makes bare
    # css/xpath calls work inside Harvestman.crawl blocks.
    class Parser
      def initialize(url)
        # NOTE(review): Kernel#open (patched by open-uri) accepts both URLs
        # and local file paths; the test suite relies on the local-path
        # behaviour, so don't switch this to URI.open without updating it.
        @document = Nokogiri::HTML(open(url))
      end

      # Search the current node with a CSS selector. Without a block,
      # returns the inner text of the first match; with a block, evaluates
      # the block once per matching node.
      def css(path, &block)
        parse(:css, path, &block)
      end

      # Same as #css but with an XPath expression.
      def xpath(path, &block)
        parse(:xpath, path, &block)
      end

      private

      def parse(path_type, path, &block)
        unless block_given?
          # Single-node lookup. NOTE(review): raises NoMethodError when the
          # path matches nothing (at_css/at_xpath return nil).
          return @document.send("at_#{path_type}", path).inner_text
        end

        @document.send(path_type, path).each do |node|
          # Temporarily rebind @document so nested css/xpath calls inside
          # the block are scoped to this node, then restore it.
          previous = @document
          @document = node
          instance_eval(&block)
          @document = previous
        end
      end
    end
  end
end
@@ -0,0 +1,17 @@
module Harvestman
  module Crawler
    # Sequential crawler: pages are fetched one after another, with no
    # threading. Selected by passing :plain to Harvestman.crawl.
    class Plain < Base
      # Crawl the base URL directly when no pages were given; otherwise
      # crawl each page substitution in order.
      def crawl(&block)
        return crawl_url(@base_url, &block) if @pages.nil?

        @pages.each do |page|
          crawl_url(@base_url.gsub('*', page.to_s), &block)
        end
      end
    end

    register :plain, Plain
  end
end
@@ -0,0 +1,3 @@
module Harvestman
  # Gem version, referenced by the gemspec.
  VERSION = "0.1.1"
end
@@ -0,0 +1,5 @@
require 'spec_helper'

describe Harvestman do
  # Smoke test: the top-level namespace is defined.
  # (`it` is an RSpec alias of `specify`.)
  it { Harvestman.should be_a(Module) }
end
@@ -0,0 +1,4 @@
# Make lib/ loadable without installing the gem first.
# Uses the readable $LOAD_PATH name instead of the cryptic $: alias.
$LOAD_PATH.unshift(File.expand_path('../../lib', __FILE__))

require 'rspec'
require 'harvestman'
@@ -0,0 +1,23 @@
1
+ <html lang="en">
2
+ <head>
3
+ <meta charset="UTF-8" />
4
+ <title>ex1</title>
5
+ </head>
6
+ <body>
7
+ <header>
8
+ <div class="title">
9
+ <h1>ex1_header_h1</h1>
10
+ </div>
11
+ </header>
12
+ <div class="main">
13
+ <ul>
14
+ <li>ex1_main_li_1</li>
15
+ <li>ex1_main_li_2</li>
16
+ <li>ex1_main_li_3</li>
17
+ </ul>
18
+ </div>
19
+ <footer>
20
+ <span><a href="#">ex1_footer_span_a</a></span>
21
+ </footer>
22
+ </body>
23
+ </html>
@@ -0,0 +1,23 @@
1
+ <html lang="en">
2
+ <head>
3
+ <meta charset="UTF-8" />
4
+ <title>ex2</title>
5
+ </head>
6
+ <body>
7
+ <header>
8
+ <div class="title">
9
+ <h1>ex2_header_h1</h1>
10
+ </div>
11
+ </header>
12
+ <div class="main">
13
+ <ul>
14
+ <li>ex2_main_li_1</li>
15
+ <li>ex2_main_li_2</li>
16
+ <li>ex2_main_li_3</li>
17
+ </ul>
18
+ </div>
19
+ <footer>
20
+ <span><a href="#">ex2_footer_span_a</a></span>
21
+ </footer>
22
+ </body>
23
+ </html>
@@ -0,0 +1,23 @@
1
+ <html lang="en">
2
+ <head>
3
+ <meta charset="UTF-8" />
4
+ <title>ex3</title>
5
+ </head>
6
+ <body>
7
+ <header>
8
+ <div class="title">
9
+ <h1>ex3_header_h1</h1>
10
+ </div>
11
+ </header>
12
+ <div class="main">
13
+ <ul>
14
+ <li>ex3_main_li_1</li>
15
+ <li>ex3_main_li_2</li>
16
+ <li>ex3_main_li_3</li>
17
+ </ul>
18
+ </div>
19
+ <footer>
20
+ <span><a href="#">ex3_footer_span_a</a></span>
21
+ </footer>
22
+ </body>
23
+ </html>
@@ -0,0 +1,2 @@
1
+ require 'test/unit'
2
+ require 'harvestman'
@@ -0,0 +1,33 @@
require 'helper'

class TestHarvestman < Test::Unit::TestCase
  # The library defines the top-level Harvestman namespace.
  def test_namespace
    assert Harvestman.is_a?(Module)
  end

  # Crawl the three local fixture pages (example1..3.html) sequentially and
  # check every scraped field against the known fixture content.
  def test_scraping
    results = []

    Harvestman.crawl "test/example*.html", (1..3), :plain do
      r = {
        :title  => css("head title"),
        :header => css("header div.title h1"),
        :footer => css("footer span a"),
        :list   => []
      }

      css "div.main ul" do
        r[:list] << css("li")
      end

      results << r
    end

    results.each_with_index do |r, i|
      # assert_equal takes (expected, actual); the original had the
      # arguments reversed, which yields misleading failure messages.
      assert_equal("ex#{i + 1}", r[:title])
      assert_equal("#{r[:title]}_header_h1", r[:header])
      assert_equal("#{r[:title]}_footer_span_a", r[:footer])
      assert_equal(3, r[:list].count)
    end
  end
end
metadata ADDED
@@ -0,0 +1,128 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: harvestman
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.1
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Gabriel Vieira
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2013-08-28 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: nokogiri
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: 1.5.6
22
+ type: :runtime
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ version: 1.5.6
30
+ - !ruby/object:Gem::Dependency
31
+ name: rake
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ! '>='
36
+ - !ruby/object:Gem::Version
37
+ version: '0'
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ! '>='
44
+ - !ruby/object:Gem::Version
45
+ version: '0'
46
+ - !ruby/object:Gem::Dependency
47
+ name: rspec
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ~>
52
+ - !ruby/object:Gem::Version
53
+ version: '2.0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ~>
60
+ - !ruby/object:Gem::Version
61
+ version: '2.0'
62
+ description:
63
+ email:
64
+ - gluisvieira@gmail.com
65
+ executables: []
66
+ extensions: []
67
+ extra_rdoc_files: []
68
+ files:
69
+ - .gitignore
70
+ - Gemfile
71
+ - LICENSE.txt
72
+ - README.md
73
+ - Rakefile
74
+ - examples/etsy.rb
75
+ - examples/wikipedia.rb
76
+ - harvestman.gemspec
77
+ - lib/harvestman.rb
78
+ - lib/harvestman/crawler.rb
79
+ - lib/harvestman/crawler/base.rb
80
+ - lib/harvestman/crawler/fast.rb
81
+ - lib/harvestman/crawler/parser.rb
82
+ - lib/harvestman/crawler/plain.rb
83
+ - lib/harvestman/version.rb
84
+ - spec/harvestman_spec.rb
85
+ - spec/spec_helper.rb
86
+ - test/example1.html
87
+ - test/example2.html
88
+ - test/example3.html
89
+ - test/helper.rb
90
+ - test/test_harvestman.rb
91
+ homepage: ''
92
+ licenses: []
93
+ post_install_message:
94
+ rdoc_options: []
95
+ require_paths:
96
+ - lib
97
+ required_ruby_version: !ruby/object:Gem::Requirement
98
+ none: false
99
+ requirements:
100
+ - - ! '>='
101
+ - !ruby/object:Gem::Version
102
+ version: '0'
103
+ segments:
104
+ - 0
105
+ hash: 129965805
106
+ required_rubygems_version: !ruby/object:Gem::Requirement
107
+ none: false
108
+ requirements:
109
+ - - ! '>='
110
+ - !ruby/object:Gem::Version
111
+ version: '0'
112
+ segments:
113
+ - 0
114
+ hash: 129965805
115
+ requirements: []
116
+ rubyforge_project:
117
+ rubygems_version: 1.8.25
118
+ signing_key:
119
+ specification_version: 3
120
+ summary: Lightweight web crawler
121
+ test_files:
122
+ - spec/harvestman_spec.rb
123
+ - spec/spec_helper.rb
124
+ - test/example1.html
125
+ - test/example2.html
126
+ - test/example3.html
127
+ - test/helper.rb
128
+ - test/test_harvestman.rb