medusa-crawler 1.0.0.pre.1 → 1.0.0.pre.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2ad0d3a02d8345991481a05954b6776801c51c20b61b38b37b61ab91659ea420
- data.tar.gz: 84bc2865f48f987e60bde0fcf60fd6cef2b0b019e374ed64183a8f9cae7c3ee5
+ metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
+ data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
  SHA512:
- metadata.gz: ca1ccfad54b1337fa5c6cd5975888efe8ed237db0e5b3eecd3d993fdcf0a0afc5d9e18c24f54f87918dfdce9e92bb7304d73c1ee276bee7c6a2622ab0b3560e4
- data.tar.gz: 64f0701dfed21963879edc810ee8eb2d444fccb19543566774dde59f4973866b9bbdbbe9418eb766b6b6fd8f317539347da4b26edcfb56b098d0eb56f6132779
+ metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
+ data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
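These digests cover the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a rough verification sketch only, assuming a locally downloaded copy of the gem (the file name below is a placeholder, not something this release ships), the SHA256 of `metadata.gz` can be recomputed with Ruby's standard library:

```ruby
require 'digest'
require 'rubygems/package'

# Assumed local path to the downloaded gem; adjust as needed.
gem_path = 'medusa-crawler-1.0.0.pre.2.gem'

File.open(gem_path, 'rb') do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless entry.full_name == 'metadata.gz'
      # Should print the SHA256 listed for metadata.gz in checksums.yaml above.
      puts Digest::SHA256.hexdigest(entry.read)
    end
  end
end
```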
Binary file
data.tar.gz.sig CHANGED
Binary file
data/CONTRIBUTORS.md CHANGED
@@ -5,18 +5,19 @@ Many thanks to the following folks who have contributed code to Medusa (a fork o
  In no particular order:


- | Person | Github | Twitter |
- | ------------- |:-------------:| --------:|
- | Chris Kite | [chriskite](https://github.com/chriskite) | |
- | Marc Seeger | | |
- | Joost Baaij | | |
- | Laurent Arnoud | | |
- | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
- | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
- | Alex Pooley | | |
- | Luca Pradovera | [polysics](https://github.com/polysics) | |
- | Sergey Kojin | | |
- | Richard Paul | | |
+ | Person | Github | Twitter |
+ | --------------- |:-------------:| --------:|
+ | Chris Kite | [chriskite](https://github.com/chriskite) | |
+ | Marc Seeger | | |
+ | Joost Baaij | | |
+ | Laurent Arnoud | | |
+ | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
+ | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
+ | Alex Pooley | | |
+ | Luca Pradovera | [polysics](https://github.com/polysics) | |
+ | Sergey Kojin | | |
+ | Richard Paul | | |
+ | Martha Thompson | [MothOnMars](https://github.com/MothOnMars) | |


  > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
data/README.rdoc ADDED
@@ -0,0 +1,44 @@
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+ === Features
+
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ ---
+
+ === Requirements
+
+ moneta:: for the key/value storage adapters
+ nokogiri:: for parsing the HTML of webpages
+ robotex:: for support of the robots.txt directives
+
+ === Development
+
+ To test and develop this gem, additional requirements are:
+ - rspec
+ - webmock
+
+ === About
+
+ Medusa is a revamped version of the defunct _anemone_ gem.
+
+ === License
+
+ Copyright (c) 2009 Vertive, Inc.
+
+ Copyright (c) 2020 Mauro Asprea
+
+ Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]
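As a quick orientation for the rdoc above, a minimal crawl script in the spirit of the CLI examples removed later in this diff could look as follows; the start URL and skip pattern are placeholders, and the options mirror the `Medusa.crawl` calls that appear in those removed scripts:

```ruby
require 'medusa'

# Hypothetical start URL; options mirror those used in the removed CLI scripts.
Medusa.crawl(URI('https://example.com/'), discard_page_bodies: true, obey_robots_txt: true) do |medusa|
  # Skip URLs matching a pattern (placeholder regexp).
  medusa.skip_links_like %r{/private/}

  # Print each page's response code and URL as it is fetched.
  medusa.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end
end
```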
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.0.pre.1
+ 1.0.0.pre.2
data/lib/medusa/core.rb CHANGED
@@ -298,7 +298,7 @@ module Medusa
  # its URL matches a skip_link pattern.
  #
  def skip_link?(link)
- @skip_link_patterns.any? { |pattern| link.path =~ pattern }
+ @skip_link_patterns.any? { |pattern| link.to_s =~ pattern }
  end

  end
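The practical effect of this one-line change is that skip patterns are now matched against the full URL string instead of only its path. A small, made-up comparison (URL and pattern are illustrative only):

```ruby
require 'uri'

link = URI('https://example.com/shop/item?page=2')
pattern = /example\.com\/shop/

# Before: only the path was tested, so a host-qualified pattern never matched.
link.path =~ pattern  # => nil  ("/shop/item" does not contain "example.com/shop")

# After: the full URL string is tested, so the same pattern now matches.
link.to_s =~ pattern  # => 8   (index where the match starts)
```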
data/lib/medusa/page.rb CHANGED
@@ -22,8 +22,6 @@ module Medusa
  attr_accessor :data
  # Integer response code of the page
  attr_accessor :code
- # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
- attr_accessor :visited
  # Depth of this page from the root of the crawl. This is not necessarily the
  # shortest path; use PageStore#shortest_paths! to find that value.
  attr_accessor :depth
@@ -40,7 +38,6 @@ module Medusa
  @data = OpenStruct.new

  @links = nil
- @visited = false
  @body = nil
  @doc = nil
  @base = nil
data/lib/medusa/page_store.rb CHANGED
@@ -58,44 +58,6 @@ module Medusa
  has_key? url
  end

- #
- # Use a breadth-first search to calculate the single-source
- # shortest paths from *root* to all pages in the PageStore
- #
- def shortest_paths!(root)
- root = URI(root) if root.is_a?(String)
- raise "Root node not found" if !has_key?(root)
-
- q = Queue.new
-
- q.enq root
- root_page = self[root]
- root_page.depth = 0
- root_page.visited = true
- self[root] = root_page
- while !q.empty?
- page = self[q.deq]
- page.links.each do |u|
- begin
- link = self[u]
- next if link.nil? || !link.fetched? || link.visited
-
- q << u unless link.redirect?
- link.visited = true
- link.depth = page.depth + 1
- self[u] = link
-
- if link.redirect?
- u = link.redirect_to
- redo
- end
- end
- end
- end
-
- self
- end
-
  #
  # Removes all Pages from storage where redirect? is true
  #
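With `shortest_paths!` (and the `visited` flag it relied on) gone, callers such as the removed `cron` and `pagedepth` scripts below have no built-in replacement. A rough external sketch of the same breadth-first idea, using only the hash-like `PageStore` access and `Page#links` visible in the removed code (and ignoring redirects, which the original also followed), might look like this:

```ruby
require 'uri'

# Hypothetical helper: returns a Hash of URL => depth from the crawl root.
# `pages` is assumed to behave like the PageStore above (pages[url] -> Page).
def shortest_depths(pages, root)
  root = URI(root) if root.is_a?(String)
  depths = { root => 0 }
  queue = [root]

  until queue.empty?
    url = queue.shift
    page = pages[url]
    next if page.nil?

    page.links.each do |link|
      next if depths.key?(link) || pages[link].nil?
      depths[link] = depths[url] + 1
      queue << link
    end
  end

  depths
end
```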
data/lib/medusa/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Medusa
- VERSION = '1.0.0.pre.1'
+ VERSION = '1.0.0.pre.2'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: medusa-crawler
  version: !ruby/object:Gem::Version
- version: 1.0.0.pre.1
+ version: 1.0.0.pre.2
  platform: ruby
  authors:
  - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
  g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
  mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
  -----END CERTIFICATE-----
- date: 2020-08-06 00:00:00.000000000 Z
+ date: 2020-08-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: moneta
@@ -97,44 +97,39 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: 1.0.0
- description: |-
- == Medusa: a ruby crawler framework
+ description: |+
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

- Medusa is a ruby framework to crawl and collect useful information about the pages it visits.
- It is versatile, allowing you to write your own specialized tasks quickly and easily.
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

- #### Features
+ === Features

- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using Moneta adapters.
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
- email:
- executables:
- - medusa
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ email: mauroasprea@gmail.com
+ executables: []
  extensions: []
  extra_rdoc_files:
- - README.md
+ - README.rdoc
  files:
  - CHANGELOG.md
  - CONTRIBUTORS.md
  - LICENSE.txt
- - README.md
+ - README.rdoc
  - Rakefile
  - VERSION
- - bin/medusa
  - lib/medusa.rb
- - lib/medusa/cli.rb
- - lib/medusa/cli/count.rb
- - lib/medusa/cli/cron.rb
- - lib/medusa/cli/pagedepth.rb
- - lib/medusa/cli/serialize.rb
- - lib/medusa/cli/url_list.rb
  - lib/medusa/cookie_store.rb
  - lib/medusa/core.rb
  - lib/medusa/exceptions.rb
@@ -156,11 +151,12 @@ licenses:
  - MIT
  metadata:
  bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
- source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.1
+ source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+ description_markup_format: rdoc
  post_install_message:
  rdoc_options:
  - "-m"
- - README.md
+ - README.rdoc
  - "-t"
  - Medusa
  require_paths:
@@ -169,7 +165,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.3.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">"
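In gemspec terms, bumping `required_ruby_version` from unconstrained (`'0'`) to `2.3.0` corresponds to a single attribute; this is a generic illustration, not the gem's actual gemspec file:

```ruby
Gem::Specification.new do |spec|
  spec.name    = 'medusa-crawler'
  spec.version = '1.0.0.pre.2'

  # Serialized into the metadata shown above as
  # "required_ruby_version: ... version: 2.3.0".
  spec.required_ruby_version = '>= 2.3.0'
end
```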
metadata.gz.sig CHANGED
Binary file
data/README.md DELETED
@@ -1,48 +0,0 @@
- # Medusa: a ruby crawler framework ![Ruby](https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push)
-
- Medusa is a framework to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
-
- ## Features
-
- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using [Moneta](https://github.com/moneta-rb/moneta) adapters
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
-
- ## Examples
-
- See the scripts under the <tt>lib/Medusa/cli</tt> directory for examples of several useful Medusa tasks.
-
- ## TODO
-
- - [x] Simplify storage module using [Moneta](https://github.com/minad/moneta)
- - [x] Add multiverse of ruby versions and runtimes in test suite
- - [ ] Solve memory issues with a persistent Queue
- - [ ] Improve docs & examples
- - [ ] Allow to control the crawler, eg: "stop", "resume"
- - [ ] Improve logging facilities to collect stats, catch errors & failures
- - [ ] Add the concept of "bots" or drivers to interact with pages (eg: capybara)
-
- **Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
-
- ## Requirements
-
- - moneta
- - nokogiri
- - robotex
-
- ## Development
-
- To test and develop this gem, additional requirements are:
- - rspec
- - webmock
-
- ## Disclaimer
-
- Medusa is a revamped version of the defunk anemone gem.
data/bin/medusa DELETED
@@ -1,4 +0,0 @@
- #!/usr/bin/env ruby
- require 'medusa/cli'
-
- Medusa::CLI::run
data/lib/medusa/cli.rb DELETED
@@ -1,24 +0,0 @@
- module Medusa
- module CLI
- COMMANDS = %w[count cron pagedepth serialize url-list]
-
- def self.run
- command = ARGV.shift
-
- if COMMANDS.include? command
- load "medusa/cli/#{command.tr('-', '_')}.rb"
- else
- puts <<-INFO
- Medusa is a web spider framework that can collect
- useful information about pages it visits.
-
- Usage:
- medusa <command> [arguments]
-
- Commands:
- #{COMMANDS.join(', ')}
- INFO
- end
- end
- end
- end
data/lib/medusa/cli/count.rb DELETED
@@ -1,22 +0,0 @@
- require 'medusa'
-
- begin
- # make sure that the first option is a URL we can crawl
- url = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa count <url>
-
- Synopsis:
- Crawls a site starting at the given URL and outputs the total number
- of unique pages on the site.
- INFO
- exit(0)
- end
-
- Medusa.crawl(url) do |medusa|
- medusa.after_crawl do |pages|
- puts pages.uniq!.size
- end
- end
data/lib/medusa/cli/cron.rb DELETED
@@ -1,90 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
- options.output_file = 'urls.txt'
-
- begin
- # make sure that the last argument is a URL we can crawl
- root = URI(ARGV.last)
- rescue
- puts <<-INFO
- Usage:
- medusa cron [options] <url>
-
- Synopsis:
- Combination of `count`, `pagedepth` and `url-list` commands.
- Performs pagedepth, url list, and count functionality.
- Outputs results to STDOUT and link list to file (urls.txt).
- Meant to be run daily as a cron job.
-
- Options:
- -r, --relative Output relative URLs (rather than absolute)
- -o, --output filename Filename to save URL list to. Defautls to urls.txt.
- INFO
- exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, {:discard_page_bodies => true}) do |medusa|
-
- medusa.after_crawl do |pages|
- puts "Crawl results for #{root}\n"
-
- # print a list of 404's
- not_found = []
- pages.each_value do |page|
- url = page.url.to_s
- not_found << url if page.not_found?
- end
- unless not_found.empty?
- puts "\n404's:"
-
- missing_links = pages.urls_linking_to(not_found)
- missing_links.each do |url, links|
- if options.relative
- puts URI(url).path.to_s
- else
- puts url
- end
- links.slice(0..10).each do |u|
- u = u.path if options.relative
- puts " linked from #{u}"
- end
-
- puts " ..." if links.size > 10
- end
-
- print "\n"
- end
-
- # remove redirect aliases, and calculate pagedepths
- pages = pages.shortest_paths!(root).uniq
- depths = pages.values.inject({}) do |depths, page|
- depths[page.depth] ||= 0
- depths[page.depth] += 1
- depths
- end
-
- # print the page count
- puts "Total pages: #{pages.size}\n"
-
- # print a list of depths
- depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-
- # output a list of urls to file
- file = open(options.output_file, 'w')
- pages.each_key do |url|
- url = options.relative ? url.path.to_s : url.to_s
- file.puts url
- end
- end
-
- end
data/lib/medusa/cli/pagedepth.rb DELETED
@@ -1,32 +0,0 @@
- require 'medusa'
-
- begin
- # make sure that the first option is a URL we can crawl
- root = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa pagedepth <url>
-
- Synopsis:
- Crawls a site starting at the given URL and outputs a count of
- the number of pages at each depth of the crawl.
- INFO
- exit(0)
- end
-
- Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
- medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
-
- medusa.after_crawl do |pages|
- pages = pages.shortest_paths!(root).uniq!
-
- depths = pages.values.inject({}) do |depths, page|
- depths[page.depth] ||= 0
- depths[page.depth] += 1
- depths
- end
-
- depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
- end
- end
data/lib/medusa/cli/serialize.rb DELETED
@@ -1,35 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- begin
- # make sure that the first option is a URL we can crawl
- root = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa serialize [options] <url>
-
- Synopsis:
- Crawls a site starting at the given URL and saves the resulting
- PageStore object to a file using Marshal serialization.
-
- Options:
- -o, --output filename Filename to save PageStore to. Defaults to crawl.{Time.now}
- INFO
- exit(0)
- end
-
- options = OpenStruct.new
- options.output_file = "crawl.#{Time.now.to_i}"
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root) do |medusa|
- medusa.after_crawl do |pages|
- open(options.output_file, 'w') {|f| Marshal.dump(pages, f)}
- end
- end
data/lib/medusa/cli/url_list.rb DELETED
@@ -1,41 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
-
- begin
- # make sure that the last option is a URL we can crawl
- root = URI(ARGV.last)
- rescue
- puts <<-INFO
- Usage:
- medusa url-list [options] <url>
-
- Synopsis:
- Crawls a site starting at the given URL, and outputs the URL of each page
- in the domain as they are encountered.
-
- Options:
- -r, --relative Output relative URLs (rather than absolute)
- INFO
- exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
-
- medusa.on_every_page do |page|
- if options.relative
- puts page.url.path
- else
- puts page.url
- end
- end
-
- end