medusa-crawler 1.0.0.pre.1 → 1.0.0.pre.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2ad0d3a02d8345991481a05954b6776801c51c20b61b38b37b61ab91659ea420
- data.tar.gz: 84bc2865f48f987e60bde0fcf60fd6cef2b0b019e374ed64183a8f9cae7c3ee5
+ metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
+ data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
  SHA512:
- metadata.gz: ca1ccfad54b1337fa5c6cd5975888efe8ed237db0e5b3eecd3d993fdcf0a0afc5d9e18c24f54f87918dfdce9e92bb7304d73c1ee276bee7c6a2622ab0b3560e4
- data.tar.gz: 64f0701dfed21963879edc810ee8eb2d444fccb19543566774dde59f4973866b9bbdbbe9418eb766b6b6fd8f317539347da4b26edcfb56b098d0eb56f6132779
+ metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
+ data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
Binary file
data.tar.gz.sig CHANGED
Binary file
data/CONTRIBUTORS.md CHANGED
@@ -5,18 +5,19 @@ Many thanks to the following folks who have contributed code to Medusa (a fork o
  In no particular order:


- | Person | Github | Twitter |
- | ------------- |:-------------:| --------:|
- | Chris Kite | [chriskite](https://github.com/chriskite) | |
- | Marc Seeger | | |
- | Joost Baaij | | |
- | Laurent Arnoud | | |
- | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
- | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
- | Alex Pooley | | |
- | Luca Pradovera | [polysics](https://github.com/polysics) | |
- | Sergey Kojin | | |
- | Richard Paul | | |
+ | Person | Github | Twitter |
+ | --------------- |:-------------:| --------:|
+ | Chris Kite | [chriskite](https://github.com/chriskite) | |
+ | Marc Seeger | | |
+ | Joost Baaij | | |
+ | Laurent Arnoud | | |
+ | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
+ | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
+ | Alex Pooley | | |
+ | Luca Pradovera | [polysics](https://github.com/polysics) | |
+ | Sergey Kojin | | |
+ | Richard Paul | | |
+ | Martha Thompson | [MothOnMars](https://github.com/MothOnMars) | |


  > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
data/README.rdoc ADDED
@@ -0,0 +1,44 @@
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+ === Features
+
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ ---
+
+ === Requirements
+
+ moneta:: for the key/value storage adapters
+ nokogiri:: for parsing the HTML of webpages
+ robotex:: for support of the robots.txt directives
+
+ === Development
+
+ To test and develop this gem, additional requirements are:
+ - rspec
+ - webmock
+
+ === About
+
+ Medusa is a revamped version of the defunct _anemone_ gem.
+
+ === License
+
+ Copyright (c) 2009 Vertive, Inc.
+
+ Copyright (c) 2020 Mauro Asprea
+
+ Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]
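
For orientation, a minimal usage sketch of the crawler described by the new README. It is assembled only from calls that appear elsewhere in this diff (Medusa.crawl, skip_links_like, focus_crawl, on_every_page, after_crawl, and the :obey_robots_txt / :discard_page_bodies options used by the CLI scripts removed further down); the URL and the patterns are placeholders, not part of the package.

require 'medusa'

# Crawl a site, obeying robots.txt, and report each page as it is visited.
Medusa.crawl('https://example.com/', obey_robots_txt: true, discard_page_bodies: true) do |medusa|
  # Never fetch URLs matching these (placeholder) patterns.
  medusa.skip_links_like %r{/login}, %r{\.pdf$}

  # Only follow links that stay on the same host.
  medusa.focus_crawl do |page|
    page.links.select { |link| link.host == page.url.host }
  end

  medusa.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end

  medusa.after_crawl do |pages|
    puts "Crawled #{pages.size} pages"
  end
end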
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.0.pre.1
+ 1.0.0.pre.2
data/lib/medusa/core.rb CHANGED
@@ -298,7 +298,7 @@ module Medusa
  # its URL matches a skip_link pattern.
  #
  def skip_link?(link)
-   @skip_link_patterns.any? { |pattern| link.path =~ pattern }
+   @skip_link_patterns.any? { |pattern| link.to_s =~ pattern }
  end

  end
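
The change above means skip_link? now tests each pattern against the full URL string rather than only its path. A small illustration of the behavioral difference (hypothetical URL and pattern, not taken from the gem's test suite):

require 'uri'

link    = URI('https://example.com/stores/index.html?page=2')
pattern = %r{example\.com/stores}

# Before: only the path was tested, so host-anchored patterns never matched.
link.path =~ pattern   #=> nil ("/stores/index.html" does not contain the host)

# After: the whole URL string is tested, so patterns can anchor on the
# scheme, host, or query string as well as the path.
link.to_s =~ pattern   #=> 8 (truthy; the match starts right after "https://")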
data/lib/medusa/page.rb CHANGED
@@ -22,8 +22,6 @@ module Medusa
  attr_accessor :data
  # Integer response code of the page
  attr_accessor :code
- # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
- attr_accessor :visited
  # Depth of this page from the root of the crawl. This is not necessarily the
  # shortest path; use PageStore#shortest_paths! to find that value.
  attr_accessor :depth
@@ -40,7 +38,6 @@ module Medusa
  @data = OpenStruct.new

  @links = nil
- @visited = false
  @body = nil
  @doc = nil
  @base = nil
data/lib/medusa/page_store.rb CHANGED
@@ -58,44 +58,6 @@ module Medusa
  has_key? url
  end

- #
- # Use a breadth-first search to calculate the single-source
- # shortest paths from *root* to all pages in the PageStore
- #
- def shortest_paths!(root)
-   root = URI(root) if root.is_a?(String)
-   raise "Root node not found" if !has_key?(root)
-
-   q = Queue.new
-
-   q.enq root
-   root_page = self[root]
-   root_page.depth = 0
-   root_page.visited = true
-   self[root] = root_page
-   while !q.empty?
-     page = self[q.deq]
-     page.links.each do |u|
-       begin
-         link = self[u]
-         next if link.nil? || !link.fetched? || link.visited
-
-         q << u unless link.redirect?
-         link.visited = true
-         link.depth = page.depth + 1
-         self[u] = link
-
-         if link.redirect?
-           u = link.redirect_to
-           redo
-         end
-       end
-     end
-   end
-
-   self
- end
-
  #
  # Removes all Pages from storage where redirect? is true
  #
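
With shortest_paths! removed, there is no longer a post-crawl BFS that recomputes depths; Page#depth, still present above and set while crawling, remains the available measure. A rough sketch of tallying pages by depth, adapted from the counting loop in the removed cron/pagedepth scripts further down; the URL is a placeholder, and the crawl-time depth is not necessarily the shortest path to a page.

require 'medusa'

Medusa.crawl('https://example.com/') do |medusa|
  medusa.after_crawl do |pages|
    # Count pages per crawl depth using the depth recorded on each Page.
    depths = Hash.new(0)
    pages.each_value { |page| depths[page.depth] += 1 }
    depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
  end
end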
data/lib/medusa/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Medusa
-   VERSION = '1.0.0.pre.1'
+   VERSION = '1.0.0.pre.2'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: medusa-crawler
  version: !ruby/object:Gem::Version
- version: 1.0.0.pre.1
+ version: 1.0.0.pre.2
  platform: ruby
  authors:
  - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
  g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
  mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
  -----END CERTIFICATE-----
- date: 2020-08-06 00:00:00.000000000 Z
+ date: 2020-08-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: moneta
@@ -97,44 +97,39 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: 1.0.0
- description: |-
- == Medusa: a ruby crawler framework
+ description: |+
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

- Medusa is a ruby framework to crawl and collect useful information about the pages it visits.
- It is versatile, allowing you to write your own specialized tasks quickly and easily.
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

- #### Features
+ === Features

- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using Moneta adapters.
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
- email:
- executables:
- - medusa
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ email: mauroasprea@gmail.com
+ executables: []
  extensions: []
  extra_rdoc_files:
- - README.md
+ - README.rdoc
  files:
  - CHANGELOG.md
  - CONTRIBUTORS.md
  - LICENSE.txt
- - README.md
+ - README.rdoc
  - Rakefile
  - VERSION
- - bin/medusa
  - lib/medusa.rb
- - lib/medusa/cli.rb
- - lib/medusa/cli/count.rb
- - lib/medusa/cli/cron.rb
- - lib/medusa/cli/pagedepth.rb
- - lib/medusa/cli/serialize.rb
- - lib/medusa/cli/url_list.rb
  - lib/medusa/cookie_store.rb
  - lib/medusa/core.rb
  - lib/medusa/exceptions.rb
@@ -156,11 +151,12 @@ licenses:
  - MIT
  metadata:
  bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
- source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.1
+ source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+ description_markup_format: rdoc
  post_install_message:
  rdoc_options:
  - "-m"
- - README.md
+ - README.rdoc
  - "-t"
  - Medusa
  require_paths:
@@ -169,7 +165,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.3.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">"
metadata.gz.sig CHANGED
Binary file
data/README.md DELETED
@@ -1,48 +0,0 @@
- # Medusa: a ruby crawler framework ![Ruby](https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push)
-
- Medusa is a framework to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
-
- ## Features
-
- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using [Moneta](https://github.com/moneta-rb/moneta) adapters
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
-
- ## Examples
-
- See the scripts under the <tt>lib/Medusa/cli</tt> directory for examples of several useful Medusa tasks.
-
- ## TODO
-
- - [x] Simplify storage module using [Moneta](https://github.com/minad/moneta)
- - [x] Add multiverse of ruby versions and runtimes in test suite
- - [ ] Solve memory issues with a persistent Queue
- - [ ] Improve docs & examples
- - [ ] Allow to control the crawler, eg: "stop", "resume"
- - [ ] Improve logging facilities to collect stats, catch errors & failures
- - [ ] Add the concept of "bots" or drivers to interact with pages (eg: capybara)
-
- **Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
-
- ## Requirements
-
- - moneta
- - nokogiri
- - robotex
-
- ## Development
-
- To test and develop this gem, additional requirements are:
- - rspec
- - webmock
-
- ## Disclaimer
-
- Medusa is a revamped version of the defunk anemone gem.
data/bin/medusa DELETED
@@ -1,4 +0,0 @@
- #!/usr/bin/env ruby
- require 'medusa/cli'
-
- Medusa::CLI::run
data/lib/medusa/cli.rb DELETED
@@ -1,24 +0,0 @@
- module Medusa
-   module CLI
-     COMMANDS = %w[count cron pagedepth serialize url-list]
-
-     def self.run
-       command = ARGV.shift
-
-       if COMMANDS.include? command
-         load "medusa/cli/#{command.tr('-', '_')}.rb"
-       else
-         puts <<-INFO
- Medusa is a web spider framework that can collect
- useful information about pages it visits.
-
- Usage:
-   medusa <command> [arguments]
-
- Commands:
-   #{COMMANDS.join(', ')}
- INFO
-       end
-     end
-   end
- end
data/lib/medusa/cli/count.rb DELETED
@@ -1,22 +0,0 @@
- require 'medusa'
-
- begin
-   # make sure that the first option is a URL we can crawl
-   url = URI(ARGV[0])
- rescue
-   puts <<-INFO
- Usage:
-   medusa count <url>
-
- Synopsis:
-   Crawls a site starting at the given URL and outputs the total number
-   of unique pages on the site.
- INFO
-   exit(0)
- end
-
- Medusa.crawl(url) do |medusa|
-   medusa.after_crawl do |pages|
-     puts pages.uniq!.size
-   end
- end
data/lib/medusa/cli/cron.rb DELETED
@@ -1,90 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
- options.output_file = 'urls.txt'
-
- begin
-   # make sure that the last argument is a URL we can crawl
-   root = URI(ARGV.last)
- rescue
-   puts <<-INFO
- Usage:
-   medusa cron [options] <url>
-
- Synopsis:
-   Combination of `count`, `pagedepth` and `url-list` commands.
-   Performs pagedepth, url list, and count functionality.
-   Outputs results to STDOUT and link list to file (urls.txt).
-   Meant to be run daily as a cron job.
-
- Options:
-   -r, --relative Output relative URLs (rather than absolute)
-   -o, --output filename Filename to save URL list to. Defautls to urls.txt.
- INFO
-   exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, {:discard_page_bodies => true}) do |medusa|
-
-   medusa.after_crawl do |pages|
-     puts "Crawl results for #{root}\n"
-
-     # print a list of 404's
-     not_found = []
-     pages.each_value do |page|
-       url = page.url.to_s
-       not_found << url if page.not_found?
-     end
-     unless not_found.empty?
-       puts "\n404's:"
-
-       missing_links = pages.urls_linking_to(not_found)
-       missing_links.each do |url, links|
-         if options.relative
-           puts URI(url).path.to_s
-         else
-           puts url
-         end
-         links.slice(0..10).each do |u|
-           u = u.path if options.relative
-           puts " linked from #{u}"
-         end
-
-         puts " ..." if links.size > 10
-       end
-
-       print "\n"
-     end
-
-     # remove redirect aliases, and calculate pagedepths
-     pages = pages.shortest_paths!(root).uniq
-     depths = pages.values.inject({}) do |depths, page|
-       depths[page.depth] ||= 0
-       depths[page.depth] += 1
-       depths
-     end
-
-     # print the page count
-     puts "Total pages: #{pages.size}\n"
-
-     # print a list of depths
-     depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-
-     # output a list of urls to file
-     file = open(options.output_file, 'w')
-     pages.each_key do |url|
-       url = options.relative ? url.path.to_s : url.to_s
-       file.puts url
-     end
-   end
-
- end
data/lib/medusa/cli/pagedepth.rb DELETED
@@ -1,32 +0,0 @@
- require 'medusa'
-
- begin
-   # make sure that the first option is a URL we can crawl
-   root = URI(ARGV[0])
- rescue
-   puts <<-INFO
- Usage:
-   medusa pagedepth <url>
-
- Synopsis:
-   Crawls a site starting at the given URL and outputs a count of
-   the number of pages at each depth of the crawl.
- INFO
-   exit(0)
- end
-
- Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
-   medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
-
-   medusa.after_crawl do |pages|
-     pages = pages.shortest_paths!(root).uniq!
-
-     depths = pages.values.inject({}) do |depths, page|
-       depths[page.depth] ||= 0
-       depths[page.depth] += 1
-       depths
-     end
-
-     depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-   end
- end
data/lib/medusa/cli/serialize.rb DELETED
@@ -1,35 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- begin
-   # make sure that the first option is a URL we can crawl
-   root = URI(ARGV[0])
- rescue
-   puts <<-INFO
- Usage:
-   medusa serialize [options] <url>
-
- Synopsis:
-   Crawls a site starting at the given URL and saves the resulting
-   PageStore object to a file using Marshal serialization.
-
- Options:
-   -o, --output filename Filename to save PageStore to. Defaults to crawl.{Time.now}
- INFO
-   exit(0)
- end
-
- options = OpenStruct.new
- options.output_file = "crawl.#{Time.now.to_i}"
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root) do |medusa|
-   medusa.after_crawl do |pages|
-     open(options.output_file, 'w') {|f| Marshal.dump(pages, f)}
-   end
- end
data/lib/medusa/cli/url_list.rb DELETED
@@ -1,41 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
-
- begin
-   # make sure that the last option is a URL we can crawl
-   root = URI(ARGV.last)
- rescue
-   puts <<-INFO
- Usage:
-   medusa url-list [options] <url>
-
- Synopsis:
-   Crawls a site starting at the given URL, and outputs the URL of each page
-   in the domain as they are encountered.
-
- Options:
-   -r, --relative Output relative URLs (rather than absolute)
- INFO
-   exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
-
-   medusa.on_every_page do |page|
-     if options.relative
-       puts page.url.path
-     else
-       puts page.url
-     end
-   end
-
- end
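
Since the CLI and its commands are gone in this release, the equivalent of the removed "medusa url-list" command can be written directly against the library. The sketch below uses only calls taken from the deleted script above (Medusa.crawl, the :discard_page_bodies option, on_every_page); the argument handling is a simplified placeholder.

#!/usr/bin/env ruby
# Rough library-level stand-in for the removed `medusa url-list` command.
require 'medusa'

root = URI(ARGV.fetch(0) { abort 'Usage: url_list.rb <url>' })

Medusa.crawl(root, discard_page_bodies: true) do |medusa|
  medusa.on_every_page do |page|
    puts page.url
  end
end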