medusa-crawler 1.0.0.pre.1 → 1.0.0.pre.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 2ad0d3a02d8345991481a05954b6776801c51c20b61b38b37b61ab91659ea420
- data.tar.gz: 84bc2865f48f987e60bde0fcf60fd6cef2b0b019e374ed64183a8f9cae7c3ee5
+ metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
+ data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
  SHA512:
- metadata.gz: ca1ccfad54b1337fa5c6cd5975888efe8ed237db0e5b3eecd3d993fdcf0a0afc5d9e18c24f54f87918dfdce9e92bb7304d73c1ee276bee7c6a2622ab0b3560e4
- data.tar.gz: 64f0701dfed21963879edc810ee8eb2d444fccb19543566774dde59f4973866b9bbdbbe9418eb766b6b6fd8f317539347da4b26edcfb56b098d0eb56f6132779
+ metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
+ data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
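These digests cover the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a rough verification sketch only, assuming a locally downloaded copy of the gem (the file name below is a placeholder, not something this release ships), the SHA256 of `metadata.gz` can be recomputed with Ruby's standard library:

```ruby
require 'digest'
require 'rubygems/package'

# Assumed local path to the downloaded gem; adjust as needed.
gem_path = 'medusa-crawler-1.0.0.pre.2.gem'

File.open(gem_path, 'rb') do |io|
  Gem::Package::TarReader.new(io) do |tar|
    tar.each do |entry|
      next unless entry.full_name == 'metadata.gz'
      # Should print the SHA256 listed for metadata.gz in checksums.yaml above.
      puts Digest::SHA256.hexdigest(entry.read)
    end
  end
end
```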
Binary file
data.tar.gz.sig CHANGED
Binary file
data/CONTRIBUTORS.md CHANGED
@@ -5,18 +5,19 @@ Many thanks to the following folks who have contributed code to Medusa (a fork o
  In no particular order:


- | Person | Github | Twitter |
- | ------------- |:-------------:| --------:|
- | Chris Kite | [chriskite](https://github.com/chriskite) | |
- | Marc Seeger | | |
- | Joost Baaij | | |
- | Laurent Arnoud | | |
- | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
- | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
- | Alex Pooley | | |
- | Luca Pradovera | [polysics](https://github.com/polysics) | |
- | Sergey Kojin | | |
- | Richard Paul | | |
+ | Person | Github | Twitter |
+ | --------------- |:-------------:| --------:|
+ | Chris Kite | [chriskite](https://github.com/chriskite) | |
+ | Marc Seeger | | |
+ | Joost Baaij | | |
+ | Laurent Arnoud | | |
+ | Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
+ | Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
+ | Alex Pooley | | |
+ | Luca Pradovera | [polysics](https://github.com/polysics) | |
+ | Sergey Kojin | | |
+ | Richard Paul | | |
+ | Martha Thompson | [MothOnMars](https://github.com/MothOnMars) | |


  > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
data/README.rdoc ADDED
@@ -0,0 +1,44 @@
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+ === Features
+
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ ---
+
+ === Requirements
+
+ moneta:: for the key/value storage adapters
+ nokogiri:: for parsing the HTML of webpages
+ robotex:: for support of the robots.txt directives
+
+ === Development
+
+ To test and develop this gem, additional requirements are:
+ - rspec
+ - webmock
+
+ === About
+
+ Medusa is a revamped version of the defunct _anemone_ gem.
+
+ === License
+
+ Copyright (c) 2009 Vertive, Inc.
+
+ Copyright (c) 2020 Mauro Asprea
+
+ Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]
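As a quick orientation for the rdoc above, a minimal crawl script in the spirit of the CLI examples removed later in this diff could look as follows; the start URL and skip pattern are placeholders, and the options mirror the `Medusa.crawl` calls that appear in those removed scripts:

```ruby
require 'medusa'

# Hypothetical start URL; options mirror those used in the removed CLI scripts.
Medusa.crawl(URI('https://example.com/'), discard_page_bodies: true, obey_robots_txt: true) do |medusa|
  # Skip URLs matching a pattern (placeholder regexp).
  medusa.skip_links_like %r{/private/}

  # Print each page's response code and URL as it is fetched.
  medusa.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end
end
```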
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.0.pre.1
+ 1.0.0.pre.2
data/lib/medusa/core.rb CHANGED
@@ -298,7 +298,7 @@ module Medusa
  # its URL matches a skip_link pattern.
  #
  def skip_link?(link)
- @skip_link_patterns.any? { |pattern| link.path =~ pattern }
+ @skip_link_patterns.any? { |pattern| link.to_s =~ pattern }
  end

  end
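The practical effect of this one-line change is that skip patterns are now matched against the full URL string instead of only its path. A small, made-up comparison (URL and pattern are illustrative only):

```ruby
require 'uri'

link = URI('https://example.com/shop/item?page=2')
pattern = /example\.com\/shop/

# Before: only the path was tested, so a host-qualified pattern never matched.
link.path =~ pattern  # => nil  ("/shop/item" does not contain "example.com/shop")

# After: the full URL string is tested, so the same pattern now matches.
link.to_s =~ pattern  # => 8   (index where the match starts)
```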
data/lib/medusa/page.rb CHANGED
@@ -22,8 +22,6 @@ module Medusa
  attr_accessor :data
  # Integer response code of the page
  attr_accessor :code
- # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
- attr_accessor :visited
  # Depth of this page from the root of the crawl. This is not necessarily the
  # shortest path; use PageStore#shortest_paths! to find that value.
  attr_accessor :depth
@@ -40,7 +38,6 @@ module Medusa
  @data = OpenStruct.new

  @links = nil
- @visited = false
  @body = nil
  @doc = nil
  @base = nil
data/lib/medusa/page_store.rb CHANGED
@@ -58,44 +58,6 @@ module Medusa
  has_key? url
  end

- #
- # Use a breadth-first search to calculate the single-source
- # shortest paths from *root* to all pages in the PageStore
- #
- def shortest_paths!(root)
- root = URI(root) if root.is_a?(String)
- raise "Root node not found" if !has_key?(root)
-
- q = Queue.new
-
- q.enq root
- root_page = self[root]
- root_page.depth = 0
- root_page.visited = true
- self[root] = root_page
- while !q.empty?
- page = self[q.deq]
- page.links.each do |u|
- begin
- link = self[u]
- next if link.nil? || !link.fetched? || link.visited
-
- q << u unless link.redirect?
- link.visited = true
- link.depth = page.depth + 1
- self[u] = link
-
- if link.redirect?
- u = link.redirect_to
- redo
- end
- end
- end
- end
-
- self
- end
-
  #
  # Removes all Pages from storage where redirect? is true
  #
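With `shortest_paths!` (and the `visited` flag it relied on) gone, callers such as the removed `cron` and `pagedepth` scripts below have no built-in replacement. A rough external sketch of the same breadth-first idea, using only the hash-like `PageStore` access and `Page#links` visible in the removed code (and ignoring redirects, which the original also followed), might look like this:

```ruby
require 'uri'

# Hypothetical helper: returns a Hash of URL => depth from the crawl root.
# `pages` is assumed to behave like the PageStore above (pages[url] -> Page).
def shortest_depths(pages, root)
  root = URI(root) if root.is_a?(String)
  depths = { root => 0 }
  queue = [root]

  until queue.empty?
    url = queue.shift
    page = pages[url]
    next if page.nil?

    page.links.each do |link|
      next if depths.key?(link) || pages[link].nil?
      depths[link] = depths[url] + 1
      queue << link
    end
  end

  depths
end
```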
data/lib/medusa/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Medusa
- VERSION = '1.0.0.pre.1'
+ VERSION = '1.0.0.pre.2'
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: medusa-crawler
  version: !ruby/object:Gem::Version
- version: 1.0.0.pre.1
+ version: 1.0.0.pre.2
  platform: ruby
  authors:
  - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
  g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
  mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
  -----END CERTIFICATE-----
- date: 2020-08-06 00:00:00.000000000 Z
+ date: 2020-08-14 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: moneta
@@ -97,44 +97,39 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: 1.0.0
- description: |-
- == Medusa: a ruby crawler framework
+ description: |+
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

- Medusa is a ruby framework to crawl and collect useful information about the pages it visits.
- It is versatile, allowing you to write your own specialized tasks quickly and easily.
+ Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

- #### Features
+ === Features

- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using Moneta adapters.
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
- email:
- executables:
- - medusa
+ * Choose the links to follow on each page with +focus_crawl+
+ * Multi-threaded design for high performance
+ * Tracks +301+ HTTP redirects
+ * Allows exclusion of URLs based on regular expressions
+ * HTTPS support
+ * Records response time for each page
+ * Obey _robots.txt_ directives (optional, but recommended)
+ * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+ * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+ <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+ email: mauroasprea@gmail.com
+ executables: []
  extensions: []
  extra_rdoc_files:
- - README.md
+ - README.rdoc
  files:
  - CHANGELOG.md
  - CONTRIBUTORS.md
  - LICENSE.txt
- - README.md
+ - README.rdoc
  - Rakefile
  - VERSION
- - bin/medusa
  - lib/medusa.rb
- - lib/medusa/cli.rb
- - lib/medusa/cli/count.rb
- - lib/medusa/cli/cron.rb
- - lib/medusa/cli/pagedepth.rb
- - lib/medusa/cli/serialize.rb
- - lib/medusa/cli/url_list.rb
  - lib/medusa/cookie_store.rb
  - lib/medusa/core.rb
  - lib/medusa/exceptions.rb
@@ -156,11 +151,12 @@ licenses:
  - MIT
  metadata:
  bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
- source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.1
+ source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+ description_markup_format: rdoc
  post_install_message:
  rdoc_options:
  - "-m"
- - README.md
+ - README.rdoc
  - "-t"
  - Medusa
  require_paths:
@@ -169,7 +165,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
  - !ruby/object:Gem::Version
- version: '0'
+ version: 2.3.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">"
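In gemspec terms, bumping `required_ruby_version` from unconstrained (`'0'`) to `2.3.0` corresponds to a single attribute; this is a generic illustration, not the gem's actual gemspec file:

```ruby
Gem::Specification.new do |spec|
  spec.name    = 'medusa-crawler'
  spec.version = '1.0.0.pre.2'

  # Serialized into the metadata shown above as
  # "required_ruby_version: ... version: 2.3.0".
  spec.required_ruby_version = '>= 2.3.0'
end
```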
metadata.gz.sig CHANGED
Binary file
data/README.md DELETED
@@ -1,48 +0,0 @@
- # Medusa: a ruby crawler framework ![Ruby](https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push)
-
- Medusa is a framework to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
-
- ## Features
-
- - Choose the links to follow on each page with `focus_crawl()`
- - Multi-threaded design for high performance
- - Tracks 301 HTTP redirects
- - Allows exclusion of URLs based on regular expressions
- - HTTPS support
- - Records response time for each page
- - Obey robots.txt
- - In-memory or persistent storage of pages during crawl using [Moneta](https://github.com/moneta-rb/moneta) adapters
- - Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
-
- ## Examples
-
- See the scripts under the <tt>lib/Medusa/cli</tt> directory for examples of several useful Medusa tasks.
-
- ## TODO
-
- - [x] Simplify storage module using [Moneta](https://github.com/minad/moneta)
- - [x] Add multiverse of ruby versions and runtimes in test suite
- - [ ] Solve memory issues with a persistent Queue
- - [ ] Improve docs & examples
- - [ ] Allow to control the crawler, eg: "stop", "resume"
- - [ ] Improve logging facilities to collect stats, catch errors & failures
- - [ ] Add the concept of "bots" or drivers to interact with pages (eg: capybara)
-
- **Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
-
- ## Requirements
-
- - moneta
- - nokogiri
- - robotex
-
- ## Development
-
- To test and develop this gem, additional requirements are:
- - rspec
- - webmock
-
- ## Disclaimer
-
- Medusa is a revamped version of the defunk anemone gem.
data/bin/medusa DELETED
@@ -1,4 +0,0 @@
- #!/usr/bin/env ruby
- require 'medusa/cli'
-
- Medusa::CLI::run
data/lib/medusa/cli.rb DELETED
@@ -1,24 +0,0 @@
- module Medusa
- module CLI
- COMMANDS = %w[count cron pagedepth serialize url-list]
-
- def self.run
- command = ARGV.shift
-
- if COMMANDS.include? command
- load "medusa/cli/#{command.tr('-', '_')}.rb"
- else
- puts <<-INFO
- Medusa is a web spider framework that can collect
- useful information about pages it visits.
-
- Usage:
- medusa <command> [arguments]
-
- Commands:
- #{COMMANDS.join(', ')}
- INFO
- end
- end
- end
- end
data/lib/medusa/cli/count.rb DELETED
@@ -1,22 +0,0 @@
- require 'medusa'
-
- begin
- # make sure that the first option is a URL we can crawl
- url = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa count <url>
-
- Synopsis:
- Crawls a site starting at the given URL and outputs the total number
- of unique pages on the site.
- INFO
- exit(0)
- end
-
- Medusa.crawl(url) do |medusa|
- medusa.after_crawl do |pages|
- puts pages.uniq!.size
- end
- end
data/lib/medusa/cli/cron.rb DELETED
@@ -1,90 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
- options.output_file = 'urls.txt'
-
- begin
- # make sure that the last argument is a URL we can crawl
- root = URI(ARGV.last)
- rescue
- puts <<-INFO
- Usage:
- medusa cron [options] <url>
-
- Synopsis:
- Combination of `count`, `pagedepth` and `url-list` commands.
- Performs pagedepth, url list, and count functionality.
- Outputs results to STDOUT and link list to file (urls.txt).
- Meant to be run daily as a cron job.
-
- Options:
- -r, --relative Output relative URLs (rather than absolute)
- -o, --output filename Filename to save URL list to. Defautls to urls.txt.
- INFO
- exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, {:discard_page_bodies => true}) do |medusa|
-
- medusa.after_crawl do |pages|
- puts "Crawl results for #{root}\n"
-
- # print a list of 404's
- not_found = []
- pages.each_value do |page|
- url = page.url.to_s
- not_found << url if page.not_found?
- end
- unless not_found.empty?
- puts "\n404's:"
-
- missing_links = pages.urls_linking_to(not_found)
- missing_links.each do |url, links|
- if options.relative
- puts URI(url).path.to_s
- else
- puts url
- end
- links.slice(0..10).each do |u|
- u = u.path if options.relative
- puts " linked from #{u}"
- end
-
- puts " ..." if links.size > 10
- end
-
- print "\n"
- end
-
- # remove redirect aliases, and calculate pagedepths
- pages = pages.shortest_paths!(root).uniq
- depths = pages.values.inject({}) do |depths, page|
- depths[page.depth] ||= 0
- depths[page.depth] += 1
- depths
- end
-
- # print the page count
- puts "Total pages: #{pages.size}\n"
-
- # print a list of depths
- depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-
- # output a list of urls to file
- file = open(options.output_file, 'w')
- pages.each_key do |url|
- url = options.relative ? url.path.to_s : url.to_s
- file.puts url
- end
- end
-
- end
data/lib/medusa/cli/pagedepth.rb DELETED
@@ -1,32 +0,0 @@
- require 'medusa'
-
- begin
- # make sure that the first option is a URL we can crawl
- root = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa pagedepth <url>
-
- Synopsis:
- Crawls a site starting at the given URL and outputs a count of
- the number of pages at each depth of the crawl.
- INFO
- exit(0)
- end
-
- Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
- medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
-
- medusa.after_crawl do |pages|
- pages = pages.shortest_paths!(root).uniq!
-
- depths = pages.values.inject({}) do |depths, page|
- depths[page.depth] ||= 0
- depths[page.depth] += 1
- depths
- end
-
- depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
- end
- end
data/lib/medusa/cli/serialize.rb DELETED
@@ -1,35 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- begin
- # make sure that the first option is a URL we can crawl
- root = URI(ARGV[0])
- rescue
- puts <<-INFO
- Usage:
- medusa serialize [options] <url>
-
- Synopsis:
- Crawls a site starting at the given URL and saves the resulting
- PageStore object to a file using Marshal serialization.
-
- Options:
- -o, --output filename Filename to save PageStore to. Defaults to crawl.{Time.now}
- INFO
- exit(0)
- end
-
- options = OpenStruct.new
- options.output_file = "crawl.#{Time.now.to_i}"
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-o', '--output filename') {|o| options.output_file = o }
- opts.parse!(ARGV)
-
- Medusa.crawl(root) do |medusa|
- medusa.after_crawl do |pages|
- open(options.output_file, 'w') {|f| Marshal.dump(pages, f)}
- end
- end
data/lib/medusa/cli/url_list.rb DELETED
@@ -1,41 +0,0 @@
- require 'medusa'
- require 'optparse'
- require 'ostruct'
-
- options = OpenStruct.new
- options.relative = false
-
- begin
- # make sure that the last option is a URL we can crawl
- root = URI(ARGV.last)
- rescue
- puts <<-INFO
- Usage:
- medusa url-list [options] <url>
-
- Synopsis:
- Crawls a site starting at the given URL, and outputs the URL of each page
- in the domain as they are encountered.
-
- Options:
- -r, --relative Output relative URLs (rather than absolute)
- INFO
- exit(0)
- end
-
- # parse command-line options
- opts = OptionParser.new
- opts.on('-r', '--relative') { options.relative = true }
- opts.parse!(ARGV)
-
- Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
-
- medusa.on_every_page do |page|
- if options.relative
- puts page.url.path
- else
- puts page.url
- end
- end
-
- end