varnisher 1.0.beta.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/LICENSE +8 -0
- data/README.md +86 -0
- data/bin/varnisher +96 -0
- data/lib/varnisher.rb +4 -0
- data/lib/varnisher/domainpurger.rb +27 -0
- data/lib/varnisher/pagepurger.rb +181 -0
- data/lib/varnisher/spider.rb +164 -0
- data/lib/varnisher/version.rb +3 -0
- metadata +81 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA1:
  metadata.gz: 89bc9abf14d10e978e9d63af4652deebd7912205
  data.tar.gz: 3074562679934cc26c5b413a281fe9a4aef31f23
SHA512:
  metadata.gz: ec489165fc9a6b70f8824cfef3b962657467ad319dc584a4478497d76eedf8c0adf74d6af8b1b4201935038a57affc156d1e33ec41927f1eb39d5cac6147a112
  data.tar.gz: b41896764dfab0db49743b7d30236de8f6ea30810b72f6e135d8b252ac8bd9b58c91c65fd9c10a61c7e2cb16bc7f7a869a3999d2f2e2dacdfd57abe59852bca6
data/LICENSE
ADDED
@@ -0,0 +1,8 @@
Copyright (c) 2011 Rob Miller

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,86 @@
# Varnish Toolkit

Administering Varnish is generally a breeze, but sometimes you want to do one of the few things that aren't painless out of the box. Hopefully, that's where this toolbox comes in.

Varnish Toolkit relies on the wonderful libraries [hpricot](http://hpricot.com/), by why the lucky stiff, and [Parallel](https://github.com/grosser/parallel), by Michael Grosser. If you don't have them, install them with:

    $ sudo gem install hpricot parallel

## Usage

    Usage: varnish.rb [options] action target
        -h, --help                       Display this help
        -v, --verbose                    Output more information
        -H, --hostname HOSTNAME          Hostname/IP address of your Varnish server. Default is localhost
        -p, --port PORT                  Port your Varnish server is listening on. Default is 80
        -n, --num-pages NUM              Number of pages to crawl when in spider mode. -1 will crawl all pages
        -#, --hashes                     If true, /foo.html#foo and /foo.html#bar will be seen as different in spider mode
        -q, --ignore-query-string        If true, /foo?foo=bar and /foo?foo=baz will be seen as the same in spider mode

If you find yourself typing certain parameters every time you use the script, you can specify them in an RC file called `.varnishrc` in your home directory. The file format is YAML; the default options, if you want to paste and override them, are:

    verbose: false
    hostname: localhost
    port: 80
    num_pages: 100
    ignore_hash: true
    ignore_query_string: false

## Examples

### Purging a page and all the resources on it

Quite often, it's necessary to redevelop a page on a website in a way that involves changes not only to the page but also to CSS files, images, JavaScript files, etc. Purging pages in this instance can be a painful process, or at least one that requires a few `ban` commands in `varnishadm`. No longer!

Just enter:

    $ varnish.rb purge http://www.example.com/path/to/page

...and `/path/to/page`, along with all its images, CSS files, JavaScript files, and other external accoutrements, will be purged from Varnish's cache.

As a bonus, this action is multithreaded, meaning even resource-heavy pages should purge quickly and evenly.

This action requires your VCL to have something like the following, which is fairly standard:

    if (req.request == "PURGE") {
        if ( client.ip ~ auth ) {
            ban("obj.http.x-url == " + req.url + " && obj.http.x-host == " + req.http.host);
            error 200 "Purged.";
        }
    }

(For an explanation of just what `obj.http.x-url` means, and why you should use it rather than `req.url`, see [this page](http://kristianlyng.wordpress.com/2010/07/28/smart-bans-with-varnish/).)

### Purging an entire domain

Provided your VCL has something akin to the following in it:

    if ( req.request == "DOMAINPURGE" ) {
        if ( client.ip ~ auth ) {
            ban("obj.http.x-host == " + req.http.host);
            error 200 "Purged.";
        }
    }

...then you should be able to quickly purge an entire domain's worth of pages and resources by simply issuing the command:

    $ varnish.rb purge www.example.com

### Repopulating the cache

If you've purged a whole domain, and particularly if your backend is slow, you might want to quickly repopulate the cache so that users never see your slow misses. Well, you can! Use the `spider` action:

    $ varnish.rb spider www.example.com

`spider` accepts either a hostname or a URL as its starting point, and will only fetch pages on the same domain as its origin. You can limit the number of pages it will process using the `-n` parameter:

    $ varnish.rb -n 500 spider www.example.com

If you'd like to combine purging and spidering, you can use the `reindex` action:

    $ varnish.rb reindex www.example.com

…which is functionally equivalent to:

    $ varnish.rb purge www.example.com
    $ varnish.rb spider www.example.com
data/bin/varnisher
ADDED
@@ -0,0 +1,96 @@
#!/usr/bin/env ruby

require 'optparse'
require 'yaml'

require 'varnisher'

$options = {
  :verbose => false,
  :hostname => 'localhost',
  :port => 80,
  :num_pages => 100,
  :ignore_hash => true,
  :spider_threads => 16,
  :ignore_query_string => false
}

rcfile = File.expand_path("~/.varnishrc")
if FileTest.readable? rcfile
  rc = YAML::load(File.open(rcfile))
  # Convert to symbols
  rc = rc.inject({}){ |memo,(k,v)| memo[k.to_sym] = v; memo }
  $options.merge!(rc)
end

optparse = OptionParser.new do |opts|
  opts.banner = 'Usage: varnish.rb [options] action target'

  opts.on('-h', '--help', 'Display this help') do
    puts opts
  end

  opts.on('-v', '--verbose', 'Output more information') do
    $options[:verbose] = true
  end

  opts.on('-H', '--hostname HOSTNAME', 'Hostname/IP address of your Varnish server. Default is localhost') do |hostname|
    $options[:hostname] = hostname
  end

  opts.on('-p', '--port PORT', 'Port your Varnish server is listening on. Default is 80') do |port|
    $options[:port] = port
  end

  opts.on('-n', '--num-pages NUM', 'Number of pages to crawl when in spider mode. -1 will crawl all pages') do |num|
    $options[:num_pages] = num.to_i
  end

  opts.on('-t', '--spider-threads NUM', 'Number of threads to use when spidering. Default is 16') do |num|
    $options[:spider_threads] = num.to_i
  end

  opts.on('-#', '--hashes', 'If true, /foo.html#foo and /foo.html#bar will be seen as different in spider mode') do
    $options[:ignore_hash] = false
  end

  opts.on('-q', '--ignore-query-string', 'If true, /foo?foo=bar and /foo?foo=baz will be seen as the same in spider mode') do
    $options[:ignore_query_string] = true
  end
end

optparse.parse!

# All our libs use these constants.
PROXY_HOSTNAME = $options[:hostname]
PROXY_PORT = $options[:port].to_i

if ( ARGV.length < 2 )
  puts "You must specify both an action and a target."
end

action = ARGV[0]
target = ARGV[1]

case action
when "purge"
  # If target is a valid URL, then assume we're purging a page and its contents.
  if target =~ /^[a-z]+:\/\//
    Varnisher::PagePurger.new target
  end

  # If target is a hostname, assume we want to purge an entire domain.
  if target =~ /^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$/
    Varnisher::DomainPurger.new target
  end

when "spider"
  Varnisher::Spider.new target

when "reindex"
  Varnisher::DomainPurger.new target
  Varnisher::Spider.new target

else
  puts "Invalid action."
end
data/lib/varnisher/domainpurger.rb
ADDED
@@ -0,0 +1,27 @@
require 'net/http'

# This requires a special bit of VCL:
#
# if ( req.request == "DOMAINPURGE" ) {
#   if ( client.ip ~ auth ) {
#     ban("obj.http.x-host == " + req.http.host);
#     error 200 "Purged.";
#   }
# }

module Varnisher
  class DomainPurger
    def initialize(domain)
      s = TCPSocket.open(PROXY_HOSTNAME, PROXY_PORT)
      s.print("DOMAINPURGE / HTTP/1.1\r\nHost: #{domain}\r\n\r\n")

      if s.read =~ /HTTP\/1\.1 200 Purged\./
        puts "Purged #{domain}"
      else
        puts "Failed to purge #{domain}"
      end

      s.close
    end
  end
end
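For illustration, the DOMAINPURGE request that `DomainPurger` sends can also be issued by hand, which is a quick way to check that the VCL hook above is in place. This is a minimal sketch, not part of the gem; it assumes Varnish is listening on localhost port 80, and `www.example.com` is a placeholder hostname.

    # Send the same DOMAINPURGE request DomainPurger sends, by hand.
    # Assumes a local Varnish on port 80 with the DOMAINPURGE VCL configured;
    # 'www.example.com' is a placeholder hostname.
    require 'socket'

    s = TCPSocket.open('localhost', 80)
    s.print("DOMAINPURGE / HTTP/1.1\r\nHost: www.example.com\r\n\r\n")
    response = s.read
    s.close

    if response =~ /200 Purged/
      puts "Purge accepted"
    else
      puts "Purge refused"
    end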
data/lib/varnisher/pagepurger.rb
ADDED
@@ -0,0 +1,181 @@
require 'rubygems'
require 'hpricot'
require 'net/http'
require 'parallel'

module Varnisher
  class PagePurger

    def initialize(url)
      @url = url
      @uri = URI.parse(url)

      @urls = []

      # First, purge the URL itself; that means we'll get up-to-date references within that page.
      puts "Purging #{@url}...\n\n"
      purge(@url)

      # Then, do a fresh GET of the page and queue any resources we find on it.
      puts "Looking for external resources on #{@url}..."

      if $options[:verbose]
        puts "\n\n"
      end

      fetch_page(@url)

      if $options[:verbose]
        puts "\n"
      end

      puts "#{@urls.length} total resources found.\n\n"

      if @urls.length == 0
        puts "No resources found. Abort!"
        return
      end

      # Let's figure out which of these resources we can actually purge — whether they're on our server, etc.
      puts "Tidying resources...\n"
      tidy_resources
      puts "#{@urls.length} purgeable resources found.\n\n"

      # Now, purge all of the resources we just queued.
      puts "Purging resources..."

      if $options[:verbose]
        puts "\n\n"
      end

      purge_queue

      if $options[:verbose]
        puts "\n"
      end

      puts "Nothing more to do!\n\n"
    end

    # Sends a PURGE request to the Varnish server, asking it to purge the given URL from its cache.
    def purge(url)
      begin
        uri = URI.parse(URI.encode(url.to_s.strip))
      rescue
        puts "Couldn't parse URL for purging: #{$!}"
        return
      end

      s = TCPSocket.open(PROXY_HOSTNAME, PROXY_PORT)
      s.print("PURGE #{uri.path} HTTP/1.1\r\nHost: #{uri.host}\r\n\r\n")

      if $options[:verbose]
        if s.read =~ /HTTP\/1\.1 200 Purged\./
          puts "Purged #{url}"
        else
          puts "Failed to purge #{url}"
        end
      end

      s.close
    end

    # Fetches a page and parses out any external resources (e.g. JavaScript files, images, CSS files) it finds on it.
    def fetch_page(url)
      begin
        uri = URI.parse(URI.encode(url.to_s.strip))
      rescue
        puts "Couldn't parse URL for resource-searching: #{url}"
        return
      end

      headers = {
        "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.106 Safari/535.2",
        "Accept-Charset" => "utf-8",
        "Accept" => "text/html"
      }

      begin
        doc = Hpricot(Net::HTTP.get_response(uri).body)
      rescue
        puts "Hmm, I couldn't seem to fetch that URL. Sure it's right?\n"
        return
      end

      find_resources(doc) do |resource|
        if $options[:verbose]
          puts "Found #{resource}"
        end
        queue_resource(resource)
      end
    end

    def find_resources(doc)
      return unless doc.respond_to? 'search'

      # A bash at an abstract representation of resources. All you need is an XPath, and what attribute to select from the matched elements.
      resource = Struct.new :name, :xpath, :attribute
      resources = [
        resource.new('stylesheet', 'link[@rel*=stylesheet]', 'href'),
        resource.new('JavaScript file', 'script[@src]', 'src'),
        resource.new('image file', 'img[@src]', 'src')
      ]

      resources.each { |resource|
        doc.search(resource.xpath).each { |e|
          att = e.get_attribute(resource.attribute)
          yield att
        }
      }
    end

    # Adds a URL to the processing queue.
    def queue_resource(url)
      @urls << url.to_s
    end

    def tidy_resources
      valid_urls = []

      @urls.each { |url|
        # If we're dealing with a host-relative URL (e.g. <img src="/foo/bar.jpg">), absolutify it.
        if url.to_s =~ /^\//
          url = @uri.scheme + "://" + @uri.host + url.to_s
        end

        # If we're dealing with a path-relative URL, make it relative to the current directory.
        unless url.to_s =~ /[a-z]+:\/\//
          # Take everything up to the final / in the path to be the current directory.
          /^(.*)\//.match(@uri.path)
          url = @uri.scheme + "://" + @uri.host + $1 + "/" + url.to_s
        end

        begin
          uri = URI.parse(url)
        rescue
          next
        end

        # Skip URLs that aren't HTTP, or that are on different domains.
        next if uri.scheme != "http"
        next if uri.host != @uri.host

        valid_urls << url
      }

      @urls = valid_urls.dup
    end

    # Processes the queue of URLs, sending a purge request for each of them.
    def purge_queue()
      Parallel.map(@urls) { |url|
        if $options[:verbose]
          puts "Purging #{url}..."
        end

        purge(url)
      }
    end

  end
end
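The purging and spidering classes are normally driven by bin/varnisher, which defines the `$options` global and the `PROXY_HOSTNAME`/`PROXY_PORT` constants they read. As a rough sketch (not part of the gem), a standalone script could set those up itself before calling the classes directly; the host, port and URL below are placeholder assumptions.

    # Drive PagePurger without going through bin/varnisher.
    # $options and the PROXY_* constants mirror what bin/varnisher would set;
    # all values below are placeholder assumptions.
    require 'varnisher'

    $options = {
      :verbose => true,
      :num_pages => 100,
      :spider_threads => 16,
      :ignore_hash => true,
      :ignore_query_string => false
    }

    PROXY_HOSTNAME = 'localhost' # your Varnish server
    PROXY_PORT     = 80          # the port Varnish listens on

    Varnisher::PagePurger.new 'http://www.example.com/path/to/page'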
data/lib/varnisher/spider.rb
ADDED
@@ -0,0 +1,164 @@
require 'rubygems'
require 'hpricot'
require 'net/http'
require 'parallel'

module Varnisher
  class Spider

    def initialize(url)
      if url =~ /^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$/
        url = 'http://' + url
      end

      @uri = URI.parse(url)

      @pages_hit = 0

      @visited = []
      @to_visit = []

      puts "Beginning spider of #{url}"
      crawl_page(url)
      spider
      puts "Done; #{@pages_hit} pages hit."
    end

    def queue_link(url)
      @to_visit << url
    end

    def crawl_page(url, limit = 10)
      # Don't crawl a page twice
      return if @visited.include? url

      # Let's not hit this again
      @visited << url

      begin
        uri = URI.parse(URI.encode(url.to_s.strip))
      rescue
        return
      end

      headers = {
        "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31",
        "Accept-Charset" => "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
      }

      begin
        req = Net::HTTP::Get.new(uri.path, headers)
        response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }

        case response
        when Net::HTTPRedirection
          return crawl_page(response['location'], limit - 1)
        when Net::HTTPSuccess
          doc = Hpricot(response.body)
        end
      rescue
        return
      end

      @pages_hit += 1

      if $options[:verbose]
        puts "Fetched #{url}..."
      end

      find_links(doc, url) do |link|
        next if @visited.include? link
        next if @to_visit.include? link

        @to_visit << link
      end
    end

    def find_links(doc, url)
      return unless doc.respond_to? 'search'

      begin
        uri = URI.parse(URI.encode(url.to_s.strip))
      rescue
        return
      end

      hrefs = []

      # Looks like a valid document! Let's parse it for links
      doc.search("//a[@href]").each do |e|
        hrefs << e.get_attribute("href")
      end

      # Let's also look for commented-out URIs
      doc.search("//comment()").each do |e|
        e.to_html.scan(/https?:\/\/[^\s\"]*/) { |url| hrefs << url; }
      end

      hrefs.each do |href|
        # Skip mailto links
        next if href =~ /^mailto:/

        # If we're dealing with a host-relative URL (e.g. <img src="/foo/bar.jpg">), absolutify it.
        if href.to_s =~ /^\//
          href = uri.scheme + "://" + uri.host + href.to_s
        end

        # If we're dealing with a path-relative URL, make it relative to the current directory.
        unless href.to_s =~ /[a-z]+:\/\//
          # Take everything up to the final / in the path to be the current directory.
          if uri.path =~ /\//
            /^(.*)\//.match(uri.path)
            path = $1
          # If we're on the homepage, then we don't need a path.
          else
            path = ""
          end

          href = uri.scheme + "://" + uri.host + path + "/" + href.to_s
        end

        # At this point, we should have an absolute URL regardless of
        # its original format.

        # Strip hash links
        if ( $options[:ignore_hash] )
          href.gsub!(/(#.*?)$/, '')
        end

        # Strip query strings
        if ( $options[:ignore_query_string] )
          href.gsub!(/(\?.*?)$/, '')
        end

        begin
          href_uri = URI.parse(href)
        rescue
          # No harm in this — if we can't parse it as a URI, it probably isn't one (`javascript:` links, etc.) and we can safely ignore it.
          next
        end

        next if href_uri.host != uri.host
        next unless href_uri.scheme =~ /^https?$/

        yield href
      end
    end

    def spider
      Parallel.in_threads($options[:spider_threads]) { |thread_number|
        # We've crawled too many pages
        next if @pages_hit > $options[:num_pages] && $options[:num_pages] >= 0

        while @to_visit.length > 0 do
          begin
            url = @to_visit.pop
          end while ( @visited.include? url )

          crawl_page(url)
        end
      }
    end
  end
end
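Likewise, a spider run can be capped before it starts by adjusting the options the `Spider` class reads, roughly what `varnish.rb -n 50 -t 8 spider www.example.com` does from the command line. A sketch under the same assumptions as the previous one (`$options` already defined, placeholder hostname):

    # Crawl at most 50 pages on 8 threads; Spider prepends "http://" when
    # given a bare hostname. 'www.example.com' is a placeholder.
    $options[:num_pages] = 50
    $options[:spider_threads] = 8

    Varnisher::Spider.new 'www.example.com'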
metadata
ADDED
@@ -0,0 +1,81 @@
--- !ruby/object:Gem::Specification
name: varnisher
version: !ruby/object:Gem::Version
  version: 1.0.beta.1
platform: ruby
authors:
- Rob Miller
autorequire:
bindir: bin
cert_chain: []
date: 2013-08-11 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: hpricot
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 0.8.6
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 0.8.6
- !ruby/object:Gem::Dependency
  name: parallel
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 0.7.1
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 0.7.1
description: Some tools that make working with the Varnish HTTP cache easier, including
  things like doing mass purges of entire domains.
email: rob@bigfish.co.uk
executables:
- varnisher
extensions: []
extra_rdoc_files: []
files:
- bin/varnisher
- lib/varnisher/domainpurger.rb
- lib/varnisher/pagepurger.rb
- lib/varnisher/spider.rb
- lib/varnisher/version.rb
- lib/varnisher.rb
- LICENSE
- README.md
homepage: http://github.com/robmiller/varnisher
licenses:
- MIT
metadata: {}
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>='
    - !ruby/object:Gem::Version
      version: '0'
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - '>'
    - !ruby/object:Gem::Version
      version: 1.3.1
requirements: []
rubyforge_project:
rubygems_version: 2.0.3
signing_key:
specification_version: 4
summary: Helpful tools for working with Varnish caches
test_files: []