medusa-crawler 1.0.0.pre.1 → 1.0.0.pre.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -0
- data/CONTRIBUTORS.md +13 -12
- data/README.rdoc +44 -0
- data/VERSION +1 -1
- data/lib/medusa/core.rb +1 -1
- data/lib/medusa/page.rb +0 -3
- data/lib/medusa/page_store.rb +0 -38
- data/lib/medusa/version.rb +1 -1
- metadata +27 -31
- metadata.gz.sig +0 -0
- data/README.md +0 -48
- data/bin/medusa +0 -4
- data/lib/medusa/cli.rb +0 -24
- data/lib/medusa/cli/count.rb +0 -22
- data/lib/medusa/cli/cron.rb +0 -90
- data/lib/medusa/cli/pagedepth.rb +0 -32
- data/lib/medusa/cli/serialize.rb +0 -35
- data/lib/medusa/cli/url_list.rb +0 -41
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
+  data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
+  data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
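
These are SHA-256/SHA-512 digests of the metadata.gz and data.tar.gz entries inside the .gem archive (a .gem file is itself a tar archive). A minimal Ruby sketch of re-deriving them locally, assuming the gem has already been fetched to medusa-crawler-1.0.0.pre.2.gem in the current directory:

require 'digest'
require 'rubygems/package'

# Assumed local filename of the fetched gem.
GEM_FILE = 'medusa-crawler-1.0.0.pre.2.gem'

File.open(GEM_FILE, 'rb') do |io|
  # Walk the tar entries of the .gem and digest the two checksummed files.
  Gem::Package::TarReader.new(io).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)
    data = entry.read
    puts "#{entry.full_name} SHA256: #{Digest::SHA256.hexdigest(data)}"
    puts "#{entry.full_name} SHA512: #{Digest::SHA512.hexdigest(data)}"
  end
end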
checksums.yaml.gz.sig
CHANGED
Binary file
data.tar.gz.sig
CHANGED
Binary file
data/CONTRIBUTORS.md
CHANGED
@@ -5,18 +5,19 @@ Many thanks to the following folks who have contributed code to Medusa (a fork o
 In no particular order:
 
 
-| Person
-|
-| Chris Kite
-| Marc Seeger
-| Joost Baaij
-| Laurent Arnoud
-| Cheng Huang
-| Mauro Asprea
-| Alex Pooley
-| Luca Pradovera
-| Sergey Kojin
-| Richard Paul
+| Person | Github | Twitter |
+| --------------- |:-------------:| --------:|
+| Chris Kite | [chriskite](https://github.com/chriskite) | |
+| Marc Seeger | | |
+| Joost Baaij | | |
+| Laurent Arnoud | | |
+| Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
+| Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
+| Alex Pooley | | |
+| Luca Pradovera | [polysics](https://github.com/polysics) | |
+| Sergey Kojin | | |
+| Richard Paul | | |
+| Martha Thompson | [MothOnMars](https://github.com/MothOnMars) | |
 
 
 > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
data/README.rdoc
ADDED
@@ -0,0 +1,44 @@
+== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+
+Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+=== Features
+
+* Choose the links to follow on each page with +focus_crawl+
+* Multi-threaded design for high performance
+* Tracks +301+ HTTP redirects
+* Allows exclusion of URLs based on regular expressions
+* HTTPS support
+* Records response time for each page
+* Obey _robots.txt_ directives (optional, but recommended)
+* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+<b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+---
+
+=== Requirements
+
+moneta:: for the key/value storage adapters
+nokogiri:: for parsing the HTML of webpages
+robotex:: for support of the robots.txt directives
+
+=== Development
+
+To test and develop this gem, additional requirements are:
+- rspec
+- webmock
+
+=== About
+
+Medusa is a revamped version of the defunk _anemone_ gem.
+
+=== License
+
+Copyright (c) 2009 Vertive, Inc.
+
+Copyright (c) 2020 Mauro Asprea
+
+Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]
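
The feature list in the new README maps onto a small block-based crawl DSL. A hedged sketch of a crawl exercising those hooks, using only calls that appear in this gem's own scripts and README (Medusa.crawl, skip_links_like, focus_crawl, on_every_page, after_crawl); the target URL, option values, and patterns are placeholders:

require 'medusa'

# Placeholder root URL; the options mirror those used by the (now removed)
# CLI scripts: discard_page_bodies, obey_robots_txt, read_timeout.
Medusa.crawl('https://example.com/', obey_robots_txt: true, discard_page_bodies: true) do |medusa|
  # Exclusion of URLs based on regular expressions.
  medusa.skip_links_like %r{/login}, %r{\.pdf$}

  # focus_crawl chooses which of a page's links get followed.
  medusa.focus_crawl do |page|
    page.links.reject { |link| link.path.to_s.start_with?('/archive') }
  end

  # Runs for every fetched page.
  medusa.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end

  # Runs once the crawl finishes, with the PageStore holding every page.
  medusa.after_crawl do |pages|
    puts "Crawled #{pages.size} pages"
  end
end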
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0.pre.1
+1.0.0.pre.2
data/lib/medusa/core.rb
CHANGED
data/lib/medusa/page.rb
CHANGED
@@ -22,8 +22,6 @@ module Medusa
     attr_accessor :data
     # Integer response code of the page
     attr_accessor :code
-    # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
-    attr_accessor :visited
     # Depth of this page from the root of the crawl. This is not necessarily the
     # shortest path; use PageStore#shortest_paths! to find that value.
     attr_accessor :depth
@@ -40,7 +38,6 @@ module Medusa
       @data = OpenStruct.new
 
       @links = nil
-      @visited = false
       @body = nil
       @doc = nil
       @base = nil
data/lib/medusa/page_store.rb
CHANGED
@@ -58,44 +58,6 @@ module Medusa
       has_key? url
     end
 
-    #
-    # Use a breadth-first search to calculate the single-source
-    # shortest paths from *root* to all pages in the PageStore
-    #
-    def shortest_paths!(root)
-      root = URI(root) if root.is_a?(String)
-      raise "Root node not found" if !has_key?(root)
-
-      q = Queue.new
-
-      q.enq root
-      root_page = self[root]
-      root_page.depth = 0
-      root_page.visited = true
-      self[root] = root_page
-      while !q.empty?
-        page = self[q.deq]
-        page.links.each do |u|
-          begin
-            link = self[u]
-            next if link.nil? || !link.fetched? || link.visited
-
-            q << u unless link.redirect?
-            link.visited = true
-            link.depth = page.depth + 1
-            self[u] = link
-
-            if link.redirect?
-              u = link.redirect_to
-              redo
-            end
-          end
-        end
-      end
-
-      self
-    end
-
     #
     # Removes all Pages from storage where redirect? is true
     #
data/lib/medusa/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: medusa-crawler
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.1
+  version: 1.0.0.pre.2
 platform: ruby
 authors:
 - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
   g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
   mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
   -----END CERTIFICATE-----
-date: 2020-08-
+date: 2020-08-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moneta
@@ -97,44 +97,39 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: 1.0.0
-description:
-  == Medusa: a ruby crawler framework
+description: |+
+  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
-  Medusa is a ruby
-  It is versatile, allowing you to write your own specialized tasks quickly and easily.
+  Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+  it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 
-
+  === Features
 
-
-
-
-
-
-
-
-
-
-
-
-
+  * Choose the links to follow on each page with +focus_crawl+
+  * Multi-threaded design for high performance
+  * Tracks +301+ HTTP redirects
+  * Allows exclusion of URLs based on regular expressions
+  * HTTPS support
+  * Records response time for each page
+  * Obey _robots.txt_ directives (optional, but recommended)
+  * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+  * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+  <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+email: mauroasprea@gmail.com
+executables: []
 extensions: []
 extra_rdoc_files:
-- README.
+- README.rdoc
 files:
 - CHANGELOG.md
 - CONTRIBUTORS.md
 - LICENSE.txt
-- README.
+- README.rdoc
 - Rakefile
 - VERSION
-- bin/medusa
 - lib/medusa.rb
-- lib/medusa/cli.rb
-- lib/medusa/cli/count.rb
-- lib/medusa/cli/cron.rb
-- lib/medusa/cli/pagedepth.rb
-- lib/medusa/cli/serialize.rb
-- lib/medusa/cli/url_list.rb
 - lib/medusa/cookie_store.rb
 - lib/medusa/core.rb
 - lib/medusa/exceptions.rb
@@ -156,11 +151,12 @@ licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
-  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.1
+  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+  description_markup_format: rdoc
 post_install_message:
 rdoc_options:
- "-m"
-- README.
+- README.rdoc
 - "-t"
 - Medusa
 require_paths:
@@ -169,7 +165,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
    - !ruby/object:Gem::Version
-      version:
+      version: 2.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">"
metadata.gz.sig
CHANGED
Binary file
data/README.md
DELETED
@@ -1,48 +0,0 @@
-# Medusa: a ruby crawler framework
-
-Medusa is a framework to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
-
-## Features
-
-- Choose the links to follow on each page with `focus_crawl()`
-- Multi-threaded design for high performance
-- Tracks 301 HTTP redirects
-- Allows exclusion of URLs based on regular expressions
-- HTTPS support
-- Records response time for each page
-- Obey robots.txt
-- In-memory or persistent storage of pages during crawl using [Moneta](https://github.com/moneta-rb/moneta) adapters
-- Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
-
-## Examples
-
-See the scripts under the <tt>lib/Medusa/cli</tt> directory for examples of several useful Medusa tasks.
-
-## TODO
-
-- [x] Simplify storage module using [Moneta](https://github.com/minad/moneta)
-- [x] Add multiverse of ruby versions and runtimes in test suite
-- [ ] Solve memory issues with a persistent Queue
-- [ ] Improve docs & examples
-- [ ] Allow to control the crawler, eg: "stop", "resume"
-- [ ] Improve logging facilities to collect stats, catch errors & failures
-- [ ] Add the concept of "bots" or drivers to interact with pages (eg: capybara)
-
-**Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
-
-## Requirements
-
-- moneta
-- nokogiri
-- robotex
-
-## Development
-
-To test and develop this gem, additional requirements are:
-- rspec
-- webmock
-
-## Disclaimer
-
-Medusa is a revamped version of the defunk anemone gem.
data/bin/medusa
DELETED
data/lib/medusa/cli.rb
DELETED
@@ -1,24 +0,0 @@
-module Medusa
-  module CLI
-    COMMANDS = %w[count cron pagedepth serialize url-list]
-
-    def self.run
-      command = ARGV.shift
-
-      if COMMANDS.include? command
-        load "medusa/cli/#{command.tr('-', '_')}.rb"
-      else
-        puts <<-INFO
-Medusa is a web spider framework that can collect
-useful information about pages it visits.
-
-Usage:
-  medusa <command> [arguments]
-
-Commands:
-  #{COMMANDS.join(', ')}
-        INFO
-      end
-    end
-  end
-end
data/lib/medusa/cli/count.rb
DELETED
@@ -1,22 +0,0 @@
-require 'medusa'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  url = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa count <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and outputs the total number
-  of unique pages on the site.
-INFO
-  exit(0)
-end
-
-Medusa.crawl(url) do |medusa|
-  medusa.after_crawl do |pages|
-    puts pages.uniq!.size
-  end
-end
data/lib/medusa/cli/cron.rb
DELETED
@@ -1,90 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-options = OpenStruct.new
-options.relative = false
-options.output_file = 'urls.txt'
-
-begin
-  # make sure that the last argument is a URL we can crawl
-  root = URI(ARGV.last)
-rescue
-  puts <<-INFO
-Usage:
-  medusa cron [options] <url>
-
-Synopsis:
-  Combination of `count`, `pagedepth` and `url-list` commands.
-  Performs pagedepth, url list, and count functionality.
-  Outputs results to STDOUT and link list to file (urls.txt).
-  Meant to be run daily as a cron job.
-
-Options:
-  -r, --relative           Output relative URLs (rather than absolute)
-  -o, --output filename    Filename to save URL list to. Defautls to urls.txt.
-INFO
-  exit(0)
-end
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-r', '--relative') { options.relative = true }
-opts.on('-o', '--output filename') {|o| options.output_file = o }
-opts.parse!(ARGV)
-
-Medusa.crawl(root, {:discard_page_bodies => true}) do |medusa|
-
-  medusa.after_crawl do |pages|
-    puts "Crawl results for #{root}\n"
-
-    # print a list of 404's
-    not_found = []
-    pages.each_value do |page|
-      url = page.url.to_s
-      not_found << url if page.not_found?
-    end
-    unless not_found.empty?
-      puts "\n404's:"
-
-      missing_links = pages.urls_linking_to(not_found)
-      missing_links.each do |url, links|
-        if options.relative
-          puts URI(url).path.to_s
-        else
-          puts url
-        end
-        links.slice(0..10).each do |u|
-          u = u.path if options.relative
-          puts "  linked from #{u}"
-        end
-
-        puts "  ..." if links.size > 10
-      end
-
-      print "\n"
-    end
-
-    # remove redirect aliases, and calculate pagedepths
-    pages = pages.shortest_paths!(root).uniq
-    depths = pages.values.inject({}) do |depths, page|
-      depths[page.depth] ||= 0
-      depths[page.depth] += 1
-      depths
-    end
-
-    # print the page count
-    puts "Total pages: #{pages.size}\n"
-
-    # print a list of depths
-    depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-
-    # output a list of urls to file
-    file = open(options.output_file, 'w')
-    pages.each_key do |url|
-      url = options.relative ? url.path.to_s : url.to_s
-      file.puts url
-    end
-  end
-
-end
data/lib/medusa/cli/pagedepth.rb
DELETED
@@ -1,32 +0,0 @@
-require 'medusa'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  root = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa pagedepth <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and outputs a count of
-  the number of pages at each depth of the crawl.
-INFO
-  exit(0)
-end
-
-Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
-  medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
-
-  medusa.after_crawl do |pages|
-    pages = pages.shortest_paths!(root).uniq!
-
-    depths = pages.values.inject({}) do |depths, page|
-      depths[page.depth] ||= 0
-      depths[page.depth] += 1
-      depths
-    end
-
-    depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-  end
-end
data/lib/medusa/cli/serialize.rb
DELETED
@@ -1,35 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  root = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa serialize [options] <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and saves the resulting
-  PageStore object to a file using Marshal serialization.
-
-Options:
-  -o, --output filename    Filename to save PageStore to. Defaults to crawl.{Time.now}
-INFO
-  exit(0)
-end
-
-options = OpenStruct.new
-options.output_file = "crawl.#{Time.now.to_i}"
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-o', '--output filename') {|o| options.output_file = o }
-opts.parse!(ARGV)
-
-Medusa.crawl(root) do |medusa|
-  medusa.after_crawl do |pages|
-    open(options.output_file, 'w') {|f| Marshal.dump(pages, f)}
-  end
-end
data/lib/medusa/cli/url_list.rb
DELETED
@@ -1,41 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-options = OpenStruct.new
-options.relative = false
-
-begin
-  # make sure that the last option is a URL we can crawl
-  root = URI(ARGV.last)
-rescue
-  puts <<-INFO
-Usage:
-  medusa url-list [options] <url>
-
-Synopsis:
-  Crawls a site starting at the given URL, and outputs the URL of each page
-  in the domain as they are encountered.
-
-Options:
-  -r, --relative           Output relative URLs (rather than absolute)
-INFO
-  exit(0)
-end
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-r', '--relative') { options.relative = true }
-opts.parse!(ARGV)
-
-Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
-
-  medusa.on_every_page do |page|
-    if options.relative
-      puts page.url.path
-    else
-      puts page.url
-    end
-  end
-
-end
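
With the `medusa` executable and all of the CLI scripts above removed in this release (`executables: []` in the metadata), their behavior has to be driven through the library API instead. For example, a minimal stand-in for the deleted `medusa count <url>` command, reusing the exact calls from the removed count.rb:

require 'medusa'

# Equivalent of the removed `medusa count <url>` script: crawl a site and
# print the total number of unique pages found.
url = URI(ARGV[0])

Medusa.crawl(url) do |medusa|
  medusa.after_crawl do |pages|
    puts pages.uniq!.size
  end
end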