medusa-crawler 1.0.0.pre.1 → 1.0.0.pre.2
- checksums.yaml +4 -4
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -0
- data/CONTRIBUTORS.md +13 -12
- data/README.rdoc +44 -0
- data/VERSION +1 -1
- data/lib/medusa/core.rb +1 -1
- data/lib/medusa/page.rb +0 -3
- data/lib/medusa/page_store.rb +0 -38
- data/lib/medusa/version.rb +1 -1
- metadata +27 -31
- metadata.gz.sig +0 -0
- data/README.md +0 -48
- data/bin/medusa +0 -4
- data/lib/medusa/cli.rb +0 -24
- data/lib/medusa/cli/count.rb +0 -22
- data/lib/medusa/cli/cron.rb +0 -90
- data/lib/medusa/cli/pagedepth.rb +0 -32
- data/lib/medusa/cli/serialize.rb +0 -35
- data/lib/medusa/cli/url_list.rb +0 -41
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
+  data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
+  data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
checksums.yaml.gz.sig
CHANGED
Binary file
data.tar.gz.sig
CHANGED
Binary file
data/CONTRIBUTORS.md
CHANGED
@@ -5,18 +5,19 @@ Many thanks to the following folks who have contributed code to Medusa (a fork of Anemone):
 In no particular order:
 
 
-| Person
-|
-| Chris Kite
-| Marc Seeger
-| Joost Baaij
-| Laurent Arnoud
-| Cheng Huang
-| Mauro Asprea
-| Alex Pooley
-| Luca Pradovera
-| Sergey Kojin
-| Richard Paul
+| Person | Github | Twitter |
+| --------------- |:-------------:| --------:|
+| Chris Kite | [chriskite](https://github.com/chriskite) | |
+| Marc Seeger | | |
+| Joost Baaij | | |
+| Laurent Arnoud | | |
+| Cheng Huang | [zzzhc](https://github.com/zzzhc) | |
+| Mauro Asprea | [brutuscat](https://github.com/brutuscat) | [@brutuscat](https://twitter.com/brutuscat) |
+| Alex Pooley | | |
+| Luca Pradovera | [polysics](https://github.com/polysics) | |
+| Sergey Kojin | | |
+| Richard Paul | | |
+| Martha Thompson | [MothOnMars](https://github.com/MothOnMars) | |
 
 
 > If you are submitting a [PR](https://help.github.com/articles/using-pull-requests/), feel free to add yourself to this table.
data/README.rdoc
ADDED
@@ -0,0 +1,44 @@
+== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+
+Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
+
+=== Features
+
+* Choose the links to follow on each page with +focus_crawl+
+* Multi-threaded design for high performance
+* Tracks +301+ HTTP redirects
+* Allows exclusion of URLs based on regular expressions
+* HTTPS support
+* Records response time for each page
+* Obey _robots.txt_ directives (optional, but recommended)
+* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+<b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+---
+
+=== Requirements
+
+moneta:: for the key/value storage adapters
+nokogiri:: for parsing the HTML of webpages
+robotex:: for support of the robots.txt directives
+
+=== Development
+
+To test and develop this gem, additional requirements are:
+- rspec
+- webmock
+
+=== About
+
+Medusa is a revamped version of the defunct _anemone_ gem.
+
+=== License
+
+Copyright (c) 2009 Vertive, Inc.
+
+Copyright (c) 2020 Mauro Asprea
+
+Released under the {MIT License}[https://github.com/brutuscat/medusa-crawler/blob/master/LICENSE.txt]
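The features listed in the new README map onto Medusa's block-based crawl API. A rough usage sketch follows; the start URL and the skip patterns are placeholders, while the options and hooks mirror the ones exercised by the CLI scripts removed later in this release:

require 'medusa'

# Sketch only: start URL and regexes are placeholders.
Medusa.crawl('https://example.com/', discard_page_bodies: true, obey_robots_txt: true) do |medusa|
  # Exclude URLs matching regular expressions
  medusa.skip_links_like %r{/login}, %r{\.pdf$}

  # Choose the links to follow on each page (the focus_crawl feature)
  medusa.focus_crawl { |page| page.links.first(10) }

  # Runs for every fetched page; Page exposes the response code and URL
  medusa.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end
end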
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0.pre.1
+1.0.0.pre.2
data/lib/medusa/core.rb
CHANGED
data/lib/medusa/page.rb
CHANGED
@@ -22,8 +22,6 @@ module Medusa
     attr_accessor :data
     # Integer response code of the page
     attr_accessor :code
-    # Boolean indicating whether or not this page has been visited in PageStore#shortest_paths!
-    attr_accessor :visited
     # Depth of this page from the root of the crawl. This is not necessarily the
     # shortest path; use PageStore#shortest_paths! to find that value.
     attr_accessor :depth
@@ -40,7 +38,6 @@ module Medusa
       @data = OpenStruct.new
 
       @links = nil
-      @visited = false
       @body = nil
       @doc = nil
       @base = nil
data/lib/medusa/page_store.rb
CHANGED
@@ -58,44 +58,6 @@ module Medusa
       has_key? url
     end
 
-    #
-    # Use a breadth-first search to calculate the single-source
-    # shortest paths from *root* to all pages in the PageStore
-    #
-    def shortest_paths!(root)
-      root = URI(root) if root.is_a?(String)
-      raise "Root node not found" if !has_key?(root)
-
-      q = Queue.new
-
-      q.enq root
-      root_page = self[root]
-      root_page.depth = 0
-      root_page.visited = true
-      self[root] = root_page
-      while !q.empty?
-        page = self[q.deq]
-        page.links.each do |u|
-          begin
-            link = self[u]
-            next if link.nil? || !link.fetched? || link.visited
-
-            q << u unless link.redirect?
-            link.visited = true
-            link.depth = page.depth + 1
-            self[u] = link
-
-            if link.redirect?
-              u = link.redirect_to
-              redo
-            end
-          end
-        end
-      end
-
-      self
-    end
-
     #
     # Removes all Pages from storage where redirect? is true
     #
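Callers that relied on the removed shortest_paths! can recompute depths outside the store. A simplified sketch based on the deleted BFS above; it tracks depths in a local Hash instead of the removed Page#visited flag, and omits the original's redirect-following redo step:

# Sketch: `pages` is a PageStore-like object responding to [] and has_key?.
def shortest_depths(pages, root)
  root = URI(root) if root.is_a?(String)
  raise "Root node not found" unless pages.has_key?(root)

  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    pages[url].links.each do |u|
      link = pages[u]
      next if link.nil? || !link.fetched? || depths.key?(u)
      depths[u] = depths[url] + 1
      queue << u unless link.redirect?
    end
  end
  depths
end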
data/lib/medusa/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: medusa-crawler
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.1
+  version: 1.0.0.pre.2
 platform: ruby
 authors:
 - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
   g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
   mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
   -----END CERTIFICATE-----
-date: 2020-08-
+date: 2020-08-14 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moneta
@@ -97,44 +97,39 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: 1.0.0
-description:
-  == Medusa: a ruby crawler framework
+description: |+
+  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
-  Medusa is a ruby
-  It is versatile, allowing you to write your own specialized tasks quickly and easily.
+  Medusa is a framework for the ruby language to crawl and collect useful information about the pages
+  it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 
-
+  === Features
 
-
-
-
-
-
-
-
-
-
-
-
-
+  * Choose the links to follow on each page with +focus_crawl+
+  * Multi-threaded design for high performance
+  * Tracks +301+ HTTP redirects
+  * Allows exclusion of URLs based on regular expressions
+  * HTTPS support
+  * Records response time for each page
+  * Obey _robots.txt_ directives (optional, but recommended)
+  * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
+  * Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
+
+  <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
+
+email: mauroasprea@gmail.com
+executables: []
 extensions: []
 extra_rdoc_files:
-- README.md
+- README.rdoc
 files:
 - CHANGELOG.md
 - CONTRIBUTORS.md
 - LICENSE.txt
-- README.md
+- README.rdoc
 - Rakefile
 - VERSION
-- bin/medusa
 - lib/medusa.rb
-- lib/medusa/cli.rb
-- lib/medusa/cli/count.rb
-- lib/medusa/cli/cron.rb
-- lib/medusa/cli/pagedepth.rb
-- lib/medusa/cli/serialize.rb
-- lib/medusa/cli/url_list.rb
 - lib/medusa/cookie_store.rb
 - lib/medusa/core.rb
 - lib/medusa/exceptions.rb
@@ -156,11 +151,12 @@ licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
-  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.1
+  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+  description_markup_format: rdoc
 post_install_message:
 rdoc_options:
 - "-m"
-- README.md
+- README.rdoc
 - "-t"
 - Medusa
 require_paths:
@@ -169,7 +165,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
   - !ruby/object:Gem::Version
-      version:
+      version: 2.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">"
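For context, the entries changed above are declared in the project's gemspec. A hedged sketch of how such values are typically set; the file name, layout, and surrounding fields are assumptions, while the values come from the metadata diff itself:

# medusa-crawler.gemspec (sketch; not the gem's actual file)
Gem::Specification.new do |s|
  s.name    = 'medusa-crawler'
  s.version = '1.0.0.pre.2'
  s.metadata = {
    'bug_tracker_uri' => 'https://github.com/brutuscat/medusa-crawler/issues',
    'source_code_uri' => 'https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2',
    'description_markup_format' => 'rdoc',
  }
  s.extra_rdoc_files = ['README.rdoc']
  s.rdoc_options = ['-m', 'README.rdoc', '-t', 'Medusa']
  s.required_ruby_version = '>= 2.3.0'
end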
metadata.gz.sig
CHANGED
Binary file
data/README.md
DELETED
@@ -1,48 +0,0 @@
-# Medusa: a ruby crawler framework ![Ruby](https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push)
-
-Medusa is a framework to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
-
-
-## Features
-
-- Choose the links to follow on each page with `focus_crawl()`
-- Multi-threaded design for high performance
-- Tracks 301 HTTP redirects
-- Allows exclusion of URLs based on regular expressions
-- HTTPS support
-- Records response time for each page
-- Obey robots.txt
-- In-memory or persistent storage of pages during crawl using [Moneta](https://github.com/moneta-rb/moneta) adapters
-- Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).
-
-## Examples
-
-See the scripts under the <tt>lib/Medusa/cli</tt> directory for examples of several useful Medusa tasks.
-
-## TODO
-
-- [x] Simplify storage module using [Moneta](https://github.com/minad/moneta)
-- [x] Add multiverse of ruby versions and runtimes in test suite
-- [ ] Solve memory issues with a persistent Queue
-- [ ] Improve docs & examples
-- [ ] Allow to control the crawler, eg: "stop", "resume"
-- [ ] Improve logging facilities to collect stats, catch errors & failures
-- [ ] Add the concept of "bots" or drivers to interact with pages (eg: capybara)
-
-**Do you have an idea? [Open an issue so we can discuss it](https://github.com/brutuscat/medusa-crawler/issues/new)**
-
-## Requirements
-
-- moneta
-- nokogiri
-- robotex
-
-## Development
-
-To test and develop this gem, additional requirements are:
-- rspec
-- webmock
-
-## Disclaimer
-
-Medusa is a revamped version of the defunk anemone gem.
data/bin/medusa
DELETED
data/lib/medusa/cli.rb
DELETED
@@ -1,24 +0,0 @@
-module Medusa
-  module CLI
-    COMMANDS = %w[count cron pagedepth serialize url-list]
-
-    def self.run
-      command = ARGV.shift
-
-      if COMMANDS.include? command
-        load "medusa/cli/#{command.tr('-', '_')}.rb"
-      else
-        puts <<-INFO
-Medusa is a web spider framework that can collect
-useful information about pages it visits.
-
-Usage:
-  medusa <command> [arguments]
-
-Commands:
-  #{COMMANDS.join(', ')}
-        INFO
-      end
-    end
-  end
-end
data/lib/medusa/cli/count.rb
DELETED
@@ -1,22 +0,0 @@
-require 'medusa'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  url = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa count <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and outputs the total number
-  of unique pages on the site.
-  INFO
-  exit(0)
-end
-
-Medusa.crawl(url) do |medusa|
-  medusa.after_crawl do |pages|
-    puts pages.uniq!.size
-  end
-end
data/lib/medusa/cli/cron.rb
DELETED
@@ -1,90 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-options = OpenStruct.new
-options.relative = false
-options.output_file = 'urls.txt'
-
-begin
-  # make sure that the last argument is a URL we can crawl
-  root = URI(ARGV.last)
-rescue
-  puts <<-INFO
-Usage:
-  medusa cron [options] <url>
-
-Synopsis:
-  Combination of `count`, `pagedepth` and `url-list` commands.
-  Performs pagedepth, url list, and count functionality.
-  Outputs results to STDOUT and link list to file (urls.txt).
-  Meant to be run daily as a cron job.
-
-Options:
-  -r, --relative           Output relative URLs (rather than absolute)
-  -o, --output filename    Filename to save URL list to. Defautls to urls.txt.
-  INFO
-  exit(0)
-end
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-r', '--relative') { options.relative = true }
-opts.on('-o', '--output filename') {|o| options.output_file = o }
-opts.parse!(ARGV)
-
-Medusa.crawl(root, {:discard_page_bodies => true}) do |medusa|
-
-  medusa.after_crawl do |pages|
-    puts "Crawl results for #{root}\n"
-
-    # print a list of 404's
-    not_found = []
-    pages.each_value do |page|
-      url = page.url.to_s
-      not_found << url if page.not_found?
-    end
-    unless not_found.empty?
-      puts "\n404's:"
-
-      missing_links = pages.urls_linking_to(not_found)
-      missing_links.each do |url, links|
-        if options.relative
-          puts URI(url).path.to_s
-        else
-          puts url
-        end
-        links.slice(0..10).each do |u|
-          u = u.path if options.relative
-          puts "  linked from #{u}"
-        end
-
-        puts "  ..." if links.size > 10
-      end
-
-      print "\n"
-    end
-
-    # remove redirect aliases, and calculate pagedepths
-    pages = pages.shortest_paths!(root).uniq
-    depths = pages.values.inject({}) do |depths, page|
-      depths[page.depth] ||= 0
-      depths[page.depth] += 1
-      depths
-    end
-
-    # print the page count
-    puts "Total pages: #{pages.size}\n"
-
-    # print a list of depths
-    depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-
-    # output a list of urls to file
-    file = open(options.output_file, 'w')
-    pages.each_key do |url|
-      url = options.relative ? url.path.to_s : url.to_s
-      file.puts url
-    end
-  end
-
-end
data/lib/medusa/cli/pagedepth.rb
DELETED
@@ -1,32 +0,0 @@
-require 'medusa'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  root = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa pagedepth <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and outputs a count of
-  the number of pages at each depth of the crawl.
-  INFO
-  exit(0)
-end
-
-Medusa.crawl(root, read_timeout: 3, discard_page_bodies: true, obey_robots_txt: true) do |medusa|
-  medusa.skip_links_like %r{^/c/$}, %r{^/stores/$}
-
-  medusa.after_crawl do |pages|
-    pages = pages.shortest_paths!(root).uniq!
-
-    depths = pages.values.inject({}) do |depths, page|
-      depths[page.depth] ||= 0
-      depths[page.depth] += 1
-      depths
-    end
-
-    depths.sort.each { |depth, count| puts "Depth: #{depth} Count: #{count}" }
-  end
-end
data/lib/medusa/cli/serialize.rb
DELETED
@@ -1,35 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-begin
-  # make sure that the first option is a URL we can crawl
-  root = URI(ARGV[0])
-rescue
-  puts <<-INFO
-Usage:
-  medusa serialize [options] <url>
-
-Synopsis:
-  Crawls a site starting at the given URL and saves the resulting
-  PageStore object to a file using Marshal serialization.
-
-Options:
-  -o, --output filename    Filename to save PageStore to. Defaults to crawl.{Time.now}
-  INFO
-  exit(0)
-end
-
-options = OpenStruct.new
-options.output_file = "crawl.#{Time.now.to_i}"
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-o', '--output filename') {|o| options.output_file = o }
-opts.parse!(ARGV)
-
-Medusa.crawl(root) do |medusa|
-  medusa.after_crawl do |pages|
-    open(options.output_file, 'w') {|f| Marshal.dump(pages, f)}
-  end
-end
data/lib/medusa/cli/url_list.rb
DELETED
@@ -1,41 +0,0 @@
-require 'medusa'
-require 'optparse'
-require 'ostruct'
-
-options = OpenStruct.new
-options.relative = false
-
-begin
-  # make sure that the last option is a URL we can crawl
-  root = URI(ARGV.last)
-rescue
-  puts <<-INFO
-Usage:
-  medusa url-list [options] <url>
-
-Synopsis:
-  Crawls a site starting at the given URL, and outputs the URL of each page
-  in the domain as they are encountered.
-
-Options:
-  -r, --relative           Output relative URLs (rather than absolute)
-  INFO
-  exit(0)
-end
-
-# parse command-line options
-opts = OptionParser.new
-opts.on('-r', '--relative') { options.relative = true }
-opts.parse!(ARGV)
-
-Medusa.crawl(root, :discard_page_bodies => true) do |medusa|
-
-  medusa.on_every_page do |page|
-    if options.relative
-      puts page.url.path
-    else
-      puts page.url
-    end
-  end
-
-end
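With the CLI removed in this release (note executables: [] in the metadata above), the deleted commands remain reproducible in a few lines against the library API. A minimal sketch combining the behavior of medusa count and medusa url-list, adapted from the deleted scripts; the start URL is a placeholder:

require 'medusa'

# Sketch only: replace the URL with the site to crawl.
root = URI('https://example.com/')

Medusa.crawl(root, discard_page_bodies: true) do |medusa|
  # url-list: print each page's URL as it is encountered
  medusa.on_every_page { |page| puts page.url }

  # count: report the number of unique pages once the crawl finishes
  medusa.after_crawl { |pages| puts "Total pages: #{pages.uniq!.size}" }
end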