medusa-crawler 1.0.0.pre.1

@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 2ad0d3a02d8345991481a05954b6776801c51c20b61b38b37b61ab91659ea420
+   data.tar.gz: 84bc2865f48f987e60bde0fcf60fd6cef2b0b019e374ed64183a8f9cae7c3ee5
+ SHA512:
+   metadata.gz: ca1ccfad54b1337fa5c6cd5975888efe8ed237db0e5b3eecd3d993fdcf0a0afc5d9e18c24f54f87918dfdce9e92bb7304d73c1ee276bee7c6a2622ab0b3560e4
+   data.tar.gz: 64f0701dfed21963879edc810ee8eb2d444fccb19543566774dde59f4973866b9bbdbbe9418eb766b6b6fd8f317539347da4b26edcfb56b098d0eb56f6132779
Binary file
Binary file
@@ -0,0 +1,20 @@
+
+ ## Anemone forked into Medusa (2014-12-13)
+
+ Features:
+
+ - Switched to `OpenURI` instead of `net/http`, gaining out-of-the-box support for:
+   - HTTP basic auth options
+   - Proxy configuration options
+   - Automatic string encoding detection based on charset
+   - Connection read timeout option
+   - Ability to control the RETRY_LIMIT upon connection errors
+
+ Changes:
+
+ - Renamed Anemone to Medusa
+ - Revamped the [README](https://github.com/brutuscat/medusa/blob/master/README.md) file
+ - Revamped the [CHANGELOG](https://github.com/brutuscat/medusa/blob/master/CHANGELOG.md) file
+ - Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa/blob/master/CONTRIBUTORS.md) file
+
+ > Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) for a trip back in time.
@@ -0,0 +1,22 @@
+ # Contributors
+
+ Many thanks to the following folks who have contributed code to Medusa (a fork of Anemone).
+
+ In no particular order:
+
+
+ | Person         | GitHub                                    | Twitter                                     |
+ | -------------- |:-----------------------------------------:| -------------------------------------------:|
+ | Chris Kite     | [chriskite](https://github.com/chriskite)  |                                             |
+ | Marc Seeger    |                                            |                                             |
+ | Joost Baaij    |                                            |                                             |
+ | Laurent Arnoud |                                            |                                             |
+ | Cheng Huang    | [zzzhc](https://github.com/zzzhc)          |                                             |
+ | Mauro Asprea   | [brutuscat](https://github.com/brutuscat)  | [@brutuscat](https://twitter.com/brutuscat) |
+ | Alex Pooley    |                                            |                                             |
+ | Luca Pradovera | [polysics](https://github.com/polysics)    |                                             |
+ | Sergey Kojin   |                                            |                                             |
+ | Richard Paul   |                                            |                                             |
+
+
+ > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
@@ -0,0 +1,20 @@
+ Copyright (c) 2009 Vertive, Inc.
+ Copyright (c) 2020 Mauro Asprea <mauroasprea@gmail.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
@@ -0,0 +1,48 @@
+ # Medusa: a Ruby crawler framework ![Ruby](https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push)
+
+ Medusa is a framework for crawling websites and collecting useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+
+ ## Features
+
+ - Choose the links to follow on each page with `focus_crawl()`
+ - Multi-threaded design for high performance
+ - Tracks 301 HTTP redirects
+ - Allows exclusion of URLs based on regular expressions
+ - HTTPS support
+ - Records response time for each page
+ - Obeys robots.txt
+ - In-memory or persistent storage of pages during crawl, using [Moneta](https://github.com/moneta-rb/moneta) adapters
+ - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options)
+
+ ## Examples
+
+ See the scripts under the `lib/medusa/cli` directory for examples of several useful Medusa tasks, and the minimal sketch below for basic usage.
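+
+ A minimal usage sketch (the URL and the filter patterns here are illustrative only, and `page.links` plus the block form of `focus_crawl` are assumed from Medusa's Anemone lineage; the crawl options and callbacks mirror those used by the bundled CLI scripts):
+
+ ```ruby
+ require 'medusa'
+
+ Medusa.crawl('https://example.com/', discard_page_bodies: true, obey_robots_txt: true) do |medusa|
+   # Never follow links matching these patterns.
+   medusa.skip_links_like %r{/login}, %r{\.pdf$}
+
+   # Decide, per page, which of its links should be followed.
+   medusa.focus_crawl { |page| page.links.first(10) }
+
+   # Called for every page as it is fetched.
+   medusa.on_every_page { |page| puts page.url }
+ end
+ ```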
+
+ ## TODO
+
+ - [x] Simplify the storage module using [Moneta](https://github.com/minad/moneta)
+ - [x] Add a multiverse of Ruby versions and runtimes to the test suite
+ - [ ] Solve memory issues with a persistent Queue
+ - [ ] Improve docs & examples
+ - [ ] Allow controlling the crawler, e.g. "stop", "resume"
+ - [ ] Improve logging facilities to collect stats, catch errors & failures
+ - [ ] Add the concept of "bots" or drivers to interact with pages (e.g. capybara)
+
+ **Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
+
+ ## Requirements
+
+ - moneta
+ - nokogiri
+ - robotex
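+
+ A possible Gemfile entry (a sketch: the gem is published as `medusa-crawler`, while the library itself is required as `medusa`, matching `lib/medusa.rb` in this release):
+
+ ```ruby
+ # Gemfile sketch; the version is pinned to this pre-release for illustration.
+ gem 'medusa-crawler', '1.0.0.pre.1', require: 'medusa'
+ ```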
+
+ ## Development
+
+ To test and develop this gem, the additional requirements are:
+
+ - rspec
+ - webmock
+
+ ## Disclaimer
+
+ Medusa is a revamped version of the defunct Anemone gem.
@@ -0,0 +1,24 @@
+ require "bundler/gem_tasks"
+ require "rspec/core/rake_task"
+ require 'rdoc/task'
+
+ desc "Run all specs"
+ RSpec::Core::RakeTask.new(:rspec) do |spec|
+   spec.pattern = 'spec/**/*_spec.rb'
+ end
+
+ RSpec::Core::RakeTask.new(:rcov) do |spec|
+   spec.pattern = 'spec/**/*_spec.rb'
+   spec.rcov = true
+ end
+
+ task :default => :rspec
+
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+   version = File.exist?('VERSION') ? File.read('VERSION') : ""
+
+   rdoc.rdoc_dir = 'rdoc'
+   rdoc.title = "medusa #{version}"
+   rdoc.rdoc_files.include('README*')
+   rdoc.rdoc_files.include('lib/**/*.rb')
+ end
data/VERSION ADDED
@@ -0,0 +1 @@
+ 1.0.0.pre.1
@@ -0,0 +1,4 @@
+ #!/usr/bin/env ruby
+ require 'medusa/cli'
+
+ Medusa::CLI.run
@@ -0,0 +1,2 @@
+ require 'rubygems'
+ require 'medusa/core'
@@ -0,0 +1,24 @@
+ module Medusa
+   module CLI
+     COMMANDS = %w[count cron pagedepth serialize url-list]
+
+     def self.run
+       command = ARGV.shift
+
+       if COMMANDS.include? command
+         load "medusa/cli/#{command.tr('-', '_')}.rb"
+       else
+         puts <<-INFO
+ Medusa is a web spider framework that can collect
+ useful information about pages it visits.
+
+ Usage:
+   medusa <command> [arguments]
+
+ Commands:
+   #{COMMANDS.join(', ')}
+         INFO
+       end
+     end
+   end
+ end
@@ -0,0 +1,22 @@
+ require 'medusa'
+
+ begin
+   # make sure that the first argument is a URL we can crawl
+   url = URI(ARGV[0])
+ rescue
+   puts <<-INFO
+ Usage:
+   medusa count <url>
+
+ Synopsis:
+   Crawls a site starting at the given URL and outputs the total number
+   of unique pages on the site.
+ INFO
+   exit(0)
+ end
+
+ Medusa.crawl(url) do |medusa|
+   medusa.after_crawl do |pages|
+     puts pages.uniq!.size
+   end
+ end
@@ -0,0 +1,90 @@
+ require 'medusa'
+ require 'optparse'
+ require 'ostruct'
+
+ options = OpenStruct.new
+ options.relative = false
+ options.output_file = 'urls.txt'
+
+ begin
+   # make sure that the last argument is a URL we can crawl
+   root = URI(ARGV.last)
+ rescue
+   puts <<-INFO
+ Usage:
+   medusa cron [options] <url>
+
+ Synopsis:
+   Combination of the `count`, `pagedepth` and `url-list` commands.
+   Performs the pagedepth, url-list, and count functionality in a single crawl.
+   Outputs results to STDOUT and the link list to a file (urls.txt by default).
+   Meant to be run daily as a cron job.
+
+ Options:
+   -r, --relative           Output relative URLs (rather than absolute)
+   -o, --output filename    Filename to save the URL list to. Defaults to urls.txt.
+ INFO
+   exit(0)
+ end
+
+ # parse command-line options
+ opts = OptionParser.new
+ opts.on('-r', '--relative') { options.relative = true }
+ opts.on('-o', '--output filename') { |o| options.output_file = o }
+ opts.parse!(ARGV)
+
+ Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
+
+   medusa.after_crawl do |pages|
+     puts "Crawl results for #{root}\n"
+
+     # print a list of 404's
+     not_found = []
+     pages.each_value do |page|
+       url = page.url.to_s
+       not_found << url if page.not_found?
+     end
+     unless not_found.empty?
+       puts "\n404's:"
+
+       missing_links = pages.urls_linking_to(not_found)
+       missing_links.each do |url, links|
+         if options.relative
+           puts URI(url).path.to_s
+         else
+           puts url
+         end
+         links.slice(0..10).each do |u|
+           u = u.path if options.relative
+           puts " linked from #{u}"
+         end
+
+         puts " ..." if links.size > 10
+       end
+
+       print "\n"
+     end
+
+     # remove redirect aliases, and calculate pagedepths
+     pages = pages.shortest_paths!(root).uniq
+     depths = pages.values.inject({}) do |depths, page|
+       depths[page.depth] ||= 0
+       depths[page.depth] += 1
+       depths
+     end
+
+     # print the page count
+     puts "Total pages: #{pages.size}\n"
+
+     # print a list of depths
+     depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
+
+     # output a list of urls to file
+     file = open(options.output_file, 'w')
+     pages.each_key do |url|
+       url = options.relative ? url.path.to_s : url.to_s
+       file.puts url
+     end
+   end
+
+ end
@@ -0,0 +1,32 @@
+ require 'medusa'
+
+ begin
+   # make sure that the first argument is a URL we can crawl
+   root = URI(ARGV[0])
+ rescue
+   puts <<-INFO
+ Usage:
+   medusa pagedepth <url>
+
+ Synopsis:
+   Crawls a site starting at the given URL and outputs a count of
+   the number of pages at each depth of the crawl.
+ INFO
+   exit(0)
+ end
+
+ Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
+   medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
+
+   medusa.after_crawl do |pages|
+     pages = pages.shortest_paths!(root).uniq!
+
+     depths = pages.values.inject({}) do |depths, page|
+       depths[page.depth] ||= 0
+       depths[page.depth] += 1
+       depths
+     end
+
+     depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
+   end
+ end
@@ -0,0 +1,35 @@
+ require 'medusa'
+ require 'optparse'
+ require 'ostruct'
+
+ begin
+   # make sure that the first argument is a URL we can crawl
+   root = URI(ARGV[0])
+ rescue
+   puts <<-INFO
+ Usage:
+   medusa serialize [options] <url>
+
+ Synopsis:
+   Crawls a site starting at the given URL and saves the resulting
+   PageStore object to a file using Marshal serialization.
+
+ Options:
+   -o, --output filename    Filename to save the PageStore to. Defaults to crawl.{Time.now}
+ INFO
+   exit(0)
+ end
+
+ options = OpenStruct.new
+ options.output_file = "crawl.#{Time.now.to_i}"
+
+ # parse command-line options
+ opts = OptionParser.new
+ opts.on('-o', '--output filename') { |o| options.output_file = o }
+ opts.parse!(ARGV)
+
+ Medusa.crawl(root) do |medusa|
+   medusa.after_crawl do |pages|
+     open(options.output_file, 'w') { |f| Marshal.dump(pages, f) }
+   end
+ end
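+
+ # To load a saved crawl back later (a sketch; the filename is whatever the -o
+ # option produced):
+ #
+ #   pages = File.open('crawl.1600000000', 'rb') { |f| Marshal.load(f) }
+ #   pages.each_value { |page| puts page.url }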
@@ -0,0 +1,41 @@
+ require 'medusa'
+ require 'optparse'
+ require 'ostruct'
+
+ options = OpenStruct.new
+ options.relative = false
+
+ begin
+   # make sure that the last argument is a URL we can crawl
+   root = URI(ARGV.last)
+ rescue
+   puts <<-INFO
+ Usage:
+   medusa url-list [options] <url>
+
+ Synopsis:
+   Crawls a site starting at the given URL and outputs the URL of each page
+   in the domain as it is encountered.
+
+ Options:
+   -r, --relative    Output relative URLs (rather than absolute)
+ INFO
+   exit(0)
+ end
+
+ # parse command-line options
+ opts = OptionParser.new
+ opts.on('-r', '--relative') { options.relative = true }
+ opts.parse!(ARGV)
+
+ Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
+
+   medusa.on_every_page do |page|
+     if options.relative
+       puts page.url.path
+     else
+       puts page.url
+     end
+   end
+
+ end
@@ -0,0 +1,35 @@
+ require 'delegate'
+ require 'webrick/cookie'
+
+ class WEBrick::Cookie
+   def expired?
+     !!expires && expires < Time.now
+   end
+ end
+
+ module Medusa
+   class CookieStore < DelegateClass(Hash)
+
+     def initialize(cookies = nil)
+       @cookies = {}
+       cookies.each { |name, value| @cookies[name] = WEBrick::Cookie.new(name, value) } if cookies
+       super(@cookies)
+     end
+
+     def merge!(set_cookie_str)
+       begin
+         cookie_hash = WEBrick::Cookie.parse_set_cookies(set_cookie_str).inject({}) do |hash, cookie|
+           hash[cookie.name] = cookie if !!cookie
+           hash
+         end
+         @cookies.merge! cookie_hash
+       rescue
+       end
+     end
+
+     def to_s
+       @cookies.values.reject { |cookie| cookie.expired? }.map { |cookie| "#{cookie.name}=#{cookie.value}" }.join(';')
+     end
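+
+     # A rough usage sketch (comment only; the Set-Cookie value below is invented):
+     #
+     #   store = Medusa::CookieStore.new('session' => 'abc123')
+     #   store.merge! 'token=xyz; Path=/'
+     #   store.to_s  #=> "session=abc123;token=xyz"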
+
+   end
+ end