medusa-crawler 1.0.0.pre.2 → 1.0.0
This diff shows the changes between publicly released versions of the package, as published to their respective public registries. It is provided for informational purposes only.
- checksums.yaml +4 -4
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -0
- data/CHANGELOG.md +33 -6
- data/CONTRIBUTORS.md +1 -1
- data/README.rdoc +32 -2
- data/Rakefile +0 -5
- data/VERSION +1 -1
- data/lib/medusa/core.rb +2 -4
- data/lib/medusa/http.rb +4 -14
- data/lib/medusa/page_store.rb +0 -53
- data/lib/medusa/storage/base.rb +0 -1
- data/lib/medusa/version.rb +1 -1
- data/spec/fakeweb_helper.rb +11 -6
- data/spec/medusa_spec.rb +2 -1
- metadata +38 -8
- metadata.gz.sig +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f2398745018a9c162fa8587b634753fe5aa93b46c541c6a45018ab243ab738ec
+  data.tar.gz: 792d6308a0b24958de72506e45acba5f2814e3f26982787bba913233285c5c08
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e773494251c93ba4f1b9b33a9bc778e605f13d16368520ab38f757ff341489a495a265882ad49da7fcf17b809fdfbe23bd46289732386971619b957bcf36e3d6
+  data.tar.gz: e9c98d600dffed81b9bd3fc82ced0479913b1b6c42ff27bda9bb0607166d8eba1dbd7a34f63108ce04759d0720ed408ffcde615ec0d11d7de00242698a2a601d
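A `.gem` file is a plain tar archive containing `metadata.gz`, `data.tar.gz`, and `checksums.yaml.gz`, so the new SHA256 values above can be checked locally after fetching and unpacking the gem (e.g. `gem fetch medusa-crawler -v 1.0.0` followed by `tar -xf medusa-crawler-1.0.0.gem`). A minimal Ruby sketch of the comparison:

  require 'digest'

  # Print the digest of each unpacked archive member; compare against
  # the published SHA256 values in checksums.yaml above.
  %w[metadata.gz data.tar.gz].each do |member|
    puts "#{member}: #{Digest::SHA256.file(member).hexdigest}"
  end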
checksums.yaml.gz.sig
CHANGED
Binary file

data.tar.gz.sig
CHANGED
Binary file
data/CHANGELOG.md
CHANGED
@@ -1,8 +1,36 @@
+## Release v1.0.0 (2020-08-17)
+Features:
+- Remove `PageStore#pages_linking_to`, `PageStore#urls_linking_to`
+- Remove `verbose` setting
 
-
+Changes:
+- Add an examples section to the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Update the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
+- Update the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
 
+## Pre-release v1.0.0.pre.2
 Features:
+- Remove CLI bins
+- Remove `PageStore#shortest_paths!`
+
+Fixes
+- Skip link regex filter to consider the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
+
+## Pre-release v1.0.0.pre.1
+Features:
+- Switch to use `Moneta` instead of custom storage provider adapters
+
+Fixes
+- Fix link skip regex to include the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
 
+Dev
+- Use webmock gem for testing
+
+Changes:
+- Rename Medusa to medusa-crawler gem
+
+## Anemone forked into Medusa (2014-12-13)
+Features:
 - Switch to use `OpenURI` instead of `net/http`, gaining out of the box support for:
 - Http basic auth options
 - Proxy configuration options
@@ -11,10 +39,9 @@ Features:
 - Ability to control the RETRY_LIMIT upon connection errors
 
 Changes:
-
 - Renamed Anemone to Medusa
-- Revamped the [README](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa/blob/
+- Revamped the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
+- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
 
-> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc)
+> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) to go back to the past.
data/CONTRIBUTORS.md
CHANGED
data/README.rdoc
CHANGED
@@ -1,4 +1,4 @@
-== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
 Medusa is a framework for the ruby language to crawl and collect useful information about the pages
 it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -9,7 +9,6 @@ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 * Multi-threaded design for high performance
 * Tracks +301+ HTTP redirects
 * Allows exclusion of URLs based on regular expressions
-* HTTPS support
 * Records response time for each page
 * Obey _robots.txt_ directives (optional, but recommended)
 * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -17,6 +16,37 @@ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 
 <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+=== Examples
+
+Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+    crawler.discard_page_bodies = some_flag
+
+    # Persist all the pages state across crawl-runs.
+    crawler.clear_on_startup = false
+    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+    crawler.skip_links_like(/private/)
+
+    crawler.on_pages_like(/public/) do |page|
+      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+    end
+
+    # Use an arbitrary logic, page by page, to continue customize the crawling.
+    crawler.focus_crawl(/public/) do |page|
+      page.links.first
+    end
+  end
+
 ---
 
 === Requirements
data/Rakefile
CHANGED
@@ -7,11 +7,6 @@ RSpec::Core::RakeTask.new(:rspec) do |spec|
   spec.pattern = 'spec/**/*_spec.rb'
 end
 
-RSpec::Core::RakeTask.new(:rcov) do |spec|
-  spec.pattern = 'spec/**/*_spec.rb'
-  spec.rcov = true
-end
-
 task :default => :rspec
 
 Rake::RDocTask.new(:rdoc) do |rdoc|
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0.pre.2
+1.0.0
data/lib/medusa/core.rb
CHANGED
@@ -26,8 +26,6 @@ module Medusa
     DEFAULT_OPTS = {
       # run 4 Tentacle threads to fetch pages
       :threads => 4,
-      # disable verbose output
-      :verbose => false,
       # don't throw away the page response body after scanning it for links
       :discard_page_bodies => false,
       # identify self as Medusa/VERSION
@@ -40,7 +38,7 @@ module Medusa
       :depth_limit => false,
       # number of times HTTP redirects will be followed
       :redirect_limit => 5,
-      # storage engine defaults to
+      # storage engine defaults to In-memory store in +process_options+ if none specified
       :storage => nil,
       # cleanups of the storage on every startup of the crawler
       :clear_on_startup => true,
@@ -148,6 +146,7 @@ module Medusa
     # Perform the crawl
     #
     def run
+
       process_options
 
       @urls.delete_if { |url| !visit_link?(url) }
@@ -165,7 +164,6 @@ module Medusa
       loop do
         page = page_queue.deq
         @pages.touch_key page.url
-        puts "#{page.url} Queue: #{link_queue.size}" if @opts[:verbose]
         do_page_blocks page
         page.discard_doc! if @opts[:discard_page_bodies]
 
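With `:verbose` gone from `DEFAULT_OPTS` and the `puts` removed from the crawl loop, callers that relied on that output can recover equivalent progress logging through the page callbacks the gem already exposes. A minimal sketch, assuming (as the README examples suggest) that `on_pages_like` matches its pattern against the page URL; the logger is my addition, not part of the gem:

  require 'logger'
  require 'medusa'

  logger = Logger.new($stdout)

  Medusa.crawl('https://www.example.com') do |crawler|
    # Match every page, approximating the removed :verbose output.
    crawler.on_pages_like(/./) do |page|
      logger.info "fetched #{page.url} (response time #{page.response_time})"
    end
  end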
data/lib/medusa/http.rb
CHANGED
@@ -29,9 +29,9 @@ module Medusa
     # including redirects
     #
     def fetch_pages(url, referer = nil, depth = nil)
+      pages = []
       begin
         url = URI(url) unless url.is_a?(URI)
-        pages = []
         get(url, referer) do |response, headers, code, location, redirect_to, response_time|
           pages << Page.new(location, :body => response,
                             :headers => headers,
@@ -43,13 +43,8 @@ module Medusa
         end
 
         return pages
-      rescue
-
-        puts e.inspect
-        puts e.backtrace
-      end
-      pages ||= []
-      return pages << Page.new(url, :error => e)
+      rescue StandardError => e
+        return pages << Page.new(url, error: e)
       end
     end
 
@@ -180,18 +175,13 @@ module Medusa
 
     rescue Timeout::Error, EOFError, Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::ECONNRESET => e
       retries += 1
-      puts "[medusa] Retrying ##{retries} on url #{url} because of: #{e.inspect}" if verbose?
       sleep(3 ^ retries)
       retry unless retries > RETRY_LIMIT
     ensure
-      resource
+      resource&.close unless resource&.closed?
     end
   end
 
-    def verbose?
-      @opts[:verbose]
-    end
-
     #
     # Allowed to connect to the requested url?
     #
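Hoisting `pages = []` above the `begin` is what makes the new two-line rescue work: if the assignment stays inside the `begin`, a failure before it leaves the variable unset, which is why the old code needed the `pages ||= []` fallback. A standalone sketch of the pattern, not tied to Medusa's internals:

  require 'open-uri'

  def fetch_all(urls)
    pages = []                   # hoisted: defined even when the first read raises
    urls.each { |url| pages << URI.parse(url).read }
    pages
  rescue StandardError => e      # explicit StandardError, as in the new code
    pages << e                   # pages gathered before the failure are preserved
  end

Separately, note that the retained back-off line `sleep(3 ^ retries)` appears to use Ruby's integer XOR (2, 1, 0 seconds for retries 1, 2, 3) rather than an exponential `3 ** retries`.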
data/lib/medusa/page_store.rb
CHANGED
@@ -65,58 +65,5 @@ module Medusa
       each_value { |page| delete page.url if page.redirect? }
       self
     end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of Pages which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [Page, Page...]) of Pages linking to those URLs
-    #
-    def pages_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls]
-        single = true
-      end
-
-      urls.map! do |url|
-        unless url.is_a?(URI)
-          URI(url) rescue nil
-        else
-          url
-        end
-      end
-      urls.compact
-
-      links = {}
-      urls.each { |url| links[url] = [] }
-      values.each do |page|
-        urls.each { |url| links[url] << page if page.links.include?(url) }
-      end
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of URLs which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [URI, URI...]) of URLs linking to those URLs
-    #
-    def urls_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls] unless urls.is_a?(Array)
-        single = true
-      end
-
-      links = pages_linking_to(urls)
-      links.each { |url, pages| links[url] = pages.map{|p| p.url} }
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
   end
 end
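The deleted queries only used `values` and `Page#links`, so callers that depended on them can rebuild the lookup outside the library. A sketch based on the deleted implementation, assuming `PageStore#values` remains public as its internal use above suggests (method and parameter names here are mine):

  # Pages in the store that link to a given URL.
  def pages_linking_to(page_store, url)
    url = URI(url) unless url.is_a?(URI)
    page_store.values.select { |page| page.links.include?(url) }
  end

  # The same query, reduced to the URLs of the linking pages.
  def urls_linking_to(page_store, url)
    pages_linking_to(page_store, url).map(&:url)
  end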
data/lib/medusa/storage/base.rb
CHANGED
data/lib/medusa/version.rb
CHANGED
data/spec/fakeweb_helper.rb
CHANGED
@@ -29,6 +29,8 @@ module Medusa
       @base = options[:base] if options.has_key?(:base)
       @content_type = options[:content_type] || "text/html"
       @body = options[:body]
+      @status = options[:status] || [200, 'OK']
+      @exception = options[:exception]
 
       create_body unless @body
       add_to_fakeweb
@@ -56,7 +58,7 @@ module Medusa
     end
 
     def add_to_fakeweb
-      options = {body: @body, status:
+      options = {body: @body, status: @status, headers: {'Content-Type' => @content_type}}
 
       if @redirect
         options[:status] = [301, 'Moved Permanently']
@@ -66,7 +68,7 @@ module Medusa
         options[:headers]['Location'] = redirect_url
 
         # register the page this one redirects to
-        WebMock.stub_request(:get, redirect_url).to_return(body: '', status:
+        WebMock.stub_request(:get, redirect_url).to_return(body: '', status: @status, headers: {'Content-Type' => @content_type})
       end
 
       if @auth
@@ -75,11 +77,14 @@ module Medusa
         WebMock.stub_request(:get, url).to_return(unautorized_options)
         WebMock.stub_request(:get, url).with(basic_auth: AUTH).to_return(options)
       else
-        WebMock.stub_request(:get, url).
+        WebMock.stub_request(:get, url).tap do |req|
+          if @exception
+            req.to_raise(@exception)
+          else
+            req.to_return(options)
+          end
+        end
       end
     end
   end
 end
-
-#default root
-Medusa::FakePage.new
data/spec/medusa_spec.rb
CHANGED
@@ -1,3 +1,4 @@
+require 'fakeweb_helper'
 
 RSpec.describe Medusa do
 
@@ -6,9 +7,9 @@ RSpec.describe Medusa do
   end
 
   it "should return a Medusa::Core from the crawl, which has a PageStore" do
+    Medusa::FakePage.new
     result = Medusa.crawl(SPEC_DOMAIN)
     expect(result).to be_an_instance_of(Medusa::Core)
     expect(result.pages).to be_an_instance_of(Medusa::PageStore)
   end
-
 end
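Moving `Medusa::FakePage.new` from the helper's file scope into the example keeps stub registration next to the test that uses it, and survives suites that reset WebMock between examples. For instance, under a typical RSpec setup like the following (an assumption about this suite's configuration, not shown in the diff), a stub registered at load time would be wiped before the first example runs:

  RSpec.configure do |config|
    config.before(:each) do
      WebMock.reset!
    end
  end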
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: medusa-crawler
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.2
+  version: 1.0.0
 platform: ruby
 authors:
 - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
   g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
   mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
   -----END CERTIFICATE-----
-date: 2020-08-
+date: 2020-08-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moneta
@@ -98,7 +98,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: 1.0.0
 description: |+
-  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
   Medusa is a framework for the ruby language to crawl and collect useful information about the pages
   it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -109,7 +109,6 @@ description: |+
   * Multi-threaded design for high performance
   * Tracks +301+ HTTP redirects
   * Allows exclusion of URLs based on regular expressions
-  * HTTPS support
   * Records response time for each page
   * Obey _robots.txt_ directives (optional, but recommended)
   * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -117,6 +116,37 @@ description: |+
 
   <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+  === Examples
+
+  Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+  Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+      crawler.discard_page_bodies = some_flag
+
+      # Persist all the pages state across crawl-runs.
+      crawler.clear_on_startup = false
+      crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+      crawler.skip_links_like(/private/)
+
+      crawler.on_pages_like(/public/) do |page|
+        logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+      end
+
+      # Use an arbitrary logic, page by page, to continue customize the crawling.
+      crawler.focus_crawl(/public/) do |page|
+        page.links.first
+      end
+    end
+
 email: mauroasprea@gmail.com
 executables: []
 extensions: []
@@ -151,7 +181,7 @@ licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
-  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0
 description_markup_format: rdoc
 post_install_message:
 rdoc_options:
@@ -168,11 +198,11 @@ required_ruby_version: !ruby/object:Gem::Requirement
     version: 2.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
   - !ruby/object:Gem::Version
-    version:
+    version: '0'
 requirements: []
-rubygems_version: 3.1.
+rubygems_version: 3.1.4
 signing_key:
 specification_version: 4
 summary: Medusa is a ruby crawler framework
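With the version finalized and `required_rubygems_version` relaxed to `>= 0`, the gem can now be pulled in without a prerelease constraint; for example, in a Gemfile (the pessimistic constraint is a suggestion, not from the diff):

  gem 'medusa-crawler', '~> 1.0'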
metadata.gz.sig
CHANGED
Binary file