staticizer 0.0.8 → 0.0.9
- checksums.yaml +4 -4
- data/README.md +48 -13
- data/Rakefile +6 -0
- data/lib/staticizer/command.rb +10 -1
- data/lib/staticizer/crawler.rb +92 -72
- data/lib/staticizer/version.rb +1 -1
- data/staticizer.gemspec +1 -0
- data/tests/crawler_test.rb +75 -10
- data/tests/fake_page.html +288 -0
- metadata +16 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8b1b737ad4357e646eb47042fbede1e239ec6c5f963ce1c25b48a810d2efb31d
+  data.tar.gz: 186898d812bca0e4ad732bae97d91a25cb43d76643435d6f9d9a9a7ca78a76b4
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 96d6225ae4416784dd80f80d2cce73418494186b9135258e59880ae536d95c23841a0baca5433aa8baf32689b47e7bd575d80c0ab609f88b84a5463dddc4a27b
+  data.tar.gz: 8329950984e4556791cb87f17618905146add7133401288ca270af35dc37a9fde49c102201acd44529a1bd96e7b808d67cd40fe3ccb20e5dac90907642108068
data/README.md
CHANGED
@@ -9,15 +9,30 @@ website. If the website goes down this backup would be available
 with reduced functionality.
 
 S3 and Route 53 provide an great way to host a static emergency backup for a website.
-See this article - http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html
-. In our experience it works
-
-
-
-
-the
+See this article - http://aws.typepad.com/aws/2013/02/create-a-backup-website-using-route-53-dns-failover-and-s3-website-hosting.html
+. In our experience it works well and is incredibly cheap. Our average sized website
+with a few hundred pages and assets is less than US$1 a month.
+
+We tried using existing tools httrack/wget to crawl and create a static version
+of the site to upload to S3, but we found that they did not work well with S3 hosting.
+We wanted the site uploaded to S3 to respond to the *exact* same URLs (where possible) as
+the existing site. This way when the site goes down incoming links from Google search
 results etc. will still work.
 
+## TODO
+
+* Abillity to specify AWS credentials via file or environment options
+* Tests!
+* Decide what to do with URLs with query strings. Currently they are crawled and uploaded to S3, but those keys cannot be accessed. ex http://squaremill.com/file?test=1 will be uploaded with the key file?test=1, but can only be accessed by encoding the ? like this %3Ftest=1
+* Create a 404 file on S3
+* Provide the option to rewrite absolute URLs to relative urls so that hosting can work on a different domain.
+* Multithread the crawler
+* Check for too many redirects
+* Provide regex options for what urls are scraped
+* Better handling of incorrect server mime types (ex. server returns text/plain for css instead of text/css)
+* Provide more options for uploading (upload via scp, ftp, custom etc.). Split out save/uploading into an interface.
+* Handle large files in a more memory efficient way by streaming uploads/downloads
+
 ## Installation
 
 Add this line to your application's Gemfile:
@@ -30,12 +45,11 @@ And then execute:
 
 
 Or install it yourself as:
-    $ gem install
+    $ gem install staticizer
 
 ## Command line usage
 
-
-
+Staticizer can be used through the commandline tool or by requiring the library.
 
 ### Crawl a website and write to disk
 
@@ -61,6 +75,8 @@ This will only crawl urls in the domain squaremill.com
 
     s = Staticizer::Crawler.new("http://squaremill.com",
       :aws => {
+        :region => "us-west-1",
+        :endpoint => "http://s3.amazonaws.com",
         :bucket_name => "www.squaremill.com",
         :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
         :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
@@ -73,10 +89,27 @@ This will only crawl urls in the domain squaremill.com
     s = Staticizer::Crawler.new("http://squaremill.com", :output_dir => "/tmp/crawl")
     s.crawl
 
+
+### Crawl a website and make all pages contain 'noindex' meta tag
+
+    s = Staticizer::Crawler.new("http://squaremill.com",
+      :output_dir => "/tmp/crawl",
+      :process_body => lambda {|body, uri, opts|
+        # not the best regex, but it will do for our use
+        body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i,'')
+        body = body.gsub(/<head>/i,"<head>\n<meta name='robots' content='noindex'>")
+        body
+      }
+    )
+    s.crawl
+
+
 ### Crawl a website and rewrite all non www urls to www
 
     s = Staticizer::Crawler.new("http://squaremill.com",
       :aws => {
+        :region => "us-west-1",
+        :endpoint => "http://s3.amazonaws.com",
         :bucket_name => "www.squaremill.com",
         :secret_access_key => "HIA7T189234aADfFAdf322Vs12duRhOHy+23mc1+s",
         :access_key_id => "HJFJS5gSJHMDZDFFSSDQQ"
@@ -92,14 +125,16 @@ This will only crawl urls in the domain squaremill.com
     )
     s.crawl
 
-##
+## Crawler Options
 
 * :aws - Hash of connection options passed to aws/sdk gem
-* :filter_url -
+* :filter_url - lambda called to see if a discovered URL should be crawled, return the url (can be modified) to crawl, return nil otherwise
 * :output_dir - if writing a site to disk the directory to write to, will be created if it does not exist
 * :logger - A logger object responding to the usual Ruby Logger methods.
 * :log_level - Log level - defaults to INFO.
-
+* :valid_domains - Array of domains that should be crawled. Domains not in this list will be ignored.
+* :process_body - lambda called to pre-process body of content before writing it out.
+* :skip_write - don't write retrieved files to disk or s3, just crawl the site (can be used to find 404s etc.)
 
 ## Contributing
 
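The `:process_body` lambda shown in the new README example can be exercised standalone, since it only transforms a string; this sketch copies the lambda from the diff and runs it on a hypothetical page (no crawling or AWS setup needed):

```ruby
# The :process_body hook from the README, applied to a plain string.
process_body = lambda do |body, uri, opts|
  # not the best regex, but it will do for our use
  body = body.gsub(/<meta\s+name=['"]robots[^>]+>/i, '')
  body.gsub(/<head>/i, "<head>\n<meta name='robots' content='noindex'>")
end

html = "<html><head><meta name='robots' content='index,follow'></head><body>Hi</body></html>"
puts process_body.call(html, nil, {})
```

The existing robots meta tag is stripped and a `noindex` tag is injected right after `<head>`, which is what keeps the static backup out of search indexes.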
data/Rakefile
CHANGED
data/lib/staticizer/command.rb
CHANGED
@@ -18,6 +18,11 @@ module Staticizer
         options[:aws][:bucket_name] = v
       end
 
+      opts.on("--aws-region [STRING]", "AWS Region of S3 bucket") do |v|
+        options[:aws] ||= {}
+        options[:aws][:region] = v
+      end
+
       opts.on("--aws-access-key [STRING]", "AWS Access Key ID") do |v|
         options[:aws] ||= {}
         options[:aws][:access_key_id] = v
@@ -44,6 +49,10 @@ module Staticizer
         options[:logger] = Logger.new(v)
       end
 
+      opts.on("--skip-write [PATH]", "Don't write out files to disk or s3") do |v|
+        options[:skip_write] = true
+      end
+
       opts.on("--valid-domains x,y,z", Array, "Comma separated list of domains that should be crawled, other domains will be ignored") do |v|
         options[:valid_domains] = v
       end
@@ -55,7 +64,7 @@ module Staticizer
       end
     end
 
-    begin
+    begin
       parser.parse!(args)
       initial_page = ARGV.pop
       raise ArgumentError, "Need to specify an initial URL to start the crawl" unless initial_page
data/lib/staticizer/crawler.rb
CHANGED
@@ -6,6 +6,9 @@ require 'logger'
 
 module Staticizer
   class Crawler
+    attr_reader :url_queue
+    attr_accessor :output_dir
+
     def initialize(initial_page, opts = {})
       if initial_page.nil?
         raise ArgumentError, "Initial page required"
@@ -14,24 +17,36 @@ module Staticizer
       @opts = opts.dup
       @url_queue = []
       @processed_urls = []
-      @opts[:output_dir]
+      @output_dir = @opts[:output_dir] || File.expand_path("crawl/")
       @log = @opts[:logger] || Logger.new(STDOUT)
       @log.level = @opts[:log_level] || Logger::INFO
 
       if @opts[:aws]
         bucket_name = @opts[:aws].delete(:bucket_name)
-
-        @s3_bucket =
-        @s3_bucket.acl = :public_read
+        Aws.config.update(opts[:aws])
+        @s3_bucket = Aws::S3::Resource.new.bucket(bucket_name)
       end
 
       if @opts[:valid_domains].nil?
         uri = URI.parse(initial_page)
         @opts[:valid_domains] ||= [uri.host]
       end
+
+      if @opts[:process_body]
+        @process_body = @opts[:process_body]
+      end
+
       add_url(initial_page)
     end
 
+    def log_level
+      @log.level
+    end
+
+    def log_level=(level)
+      @log.level = level
+    end
+
     def crawl
       @log.info("Starting crawl")
       while(@url_queue.length > 0)
@@ -42,15 +57,6 @@ module Staticizer
       @log.info("Finished crawl")
     end
 
-    def extract_videos(doc, base_uri)
-      doc.xpath("//video").map do |video|
-        sources = video.xpath("//source/@src").map {|src| make_absolute(base_uri, src)}
-        poster = video.attributes["poster"].to_s
-        make_absolute(base_uri, poster)
-        [poster, sources]
-      end.flatten.uniq.compact
-    end
-
     def extract_hrefs(doc, base_uri)
       doc.xpath("//a/@href").map {|href| make_absolute(base_uri, href) }
     end
@@ -63,17 +69,21 @@ module Staticizer
       doc.xpath("//link/@href").map {|href| make_absolute(base_uri, href) }
     end
 
+    def extract_videos(doc, base_uri)
+      doc.xpath("//video").map do |video|
+        sources = video.xpath("//source/@src").map {|src| make_absolute(base_uri, src)}
+        poster = video.attributes["poster"].to_s
+        make_absolute(base_uri, poster)
+        [poster, sources]
+      end.flatten.uniq.compact
+    end
+
     def extract_scripts(doc, base_uri)
       doc.xpath("//script/@src").map {|src| make_absolute(base_uri, src) }
     end
 
     def extract_css_urls(css, base_uri)
-      css.scan(/url\(([
-      path = src[0]
-      # URLS in css can be wrapped with " or 'ex: url("http:://something/"), strip these
-      path = path.strip.gsub(/^['"]/, "").gsub(/['"]$/,"")
-      make_absolute(base_uri, path)
-      end
+      css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map {|src| make_absolute(base_uri, src[0]) }
     end
 
     def add_urls(urls, info = {})
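The rewritten `extract_css_urls` collapses the old quote-stripping loop into one regex whose optional `['"]?` groups absorb the quotes. A stdlib-only sketch of that regex, with `URI.join` standing in for the crawler's `make_absolute` helper and hypothetical CSS input:

```ruby
require 'uri'

# Sample CSS covering the three quoting styles url() accepts.
css = <<~CSS
  body  { background: url("/images/bg.png"); }
  .logo { background-image: url('http://example.com/logo.png'); }
  .icon { background: url(sprite.gif); }
CSS

base = "http://example.com/css/site.css"
# Same scan as the new extract_css_urls: quotes and padding stay
# outside the capture group, so src[0] is the bare path.
urls = css.scan(/url\(\s*['"]?(.+?)['"]?\s*\)/).map { |src| URI.join(base, src[0]).to_s }
p urls
```

Relative paths resolve against the stylesheet's URL, while already-absolute URLs pass through unchanged.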
@@ -81,14 +91,11 @@ module Staticizer
     end
 
     def make_absolute(base_uri, href)
-
-
-
+      dup_uri = base_uri.dup
+      dup_uri.query = nil
+      if href.to_s =~ /https?/i
         href.to_s.gsub(" ", "+")
       else
-        dup_uri = base_uri.dup
-        # Remove the query params as otherwise will try use those when making absolute uri
-        dup_uri.query = nil
         URI::join(dup_uri.to_s, href).to_s
       end
     rescue StandardError => e
@@ -110,22 +117,23 @@ module Staticizer
       @url_queue << [url, info]
     end
 
-    def save_page(response, uri
+    def save_page(response, uri)
+      return if @opts[:skip_write]
       if @opts[:aws]
-        save_page_to_aws(response, uri
+        save_page_to_aws(response, uri)
       else
-        save_page_to_disk(response, uri
+        save_page_to_disk(response, uri)
       end
     end
 
-    def save_page_to_disk(response, uri
+    def save_page_to_disk(response, uri)
       path = uri.path
-      path += "?#{uri.query}" if uri.query
+      path += "?#{uri.query}" if uri.query
 
       path_segments = path.scan(%r{[^/]*/})
       filename = path.include?("/") ? path[path.rindex("/")+1..-1] : path
 
-      current = @
+      current = @output_dir
       FileUtils.mkdir_p(current) unless File.exist?(current)
 
       # Create all the directories necessary for this file
@@ -145,71 +153,77 @@ module Staticizer
       end
 
       body = response.respond_to?(:read_body) ? response.read_body : response
-      body =
+      body = process_body(body, uri, {})
       outfile = File.join(current, "/#{filename}")
-
       if filename == ""
         indexfile = File.join(outfile, "/index.html")
-        return if opts[:no_overwrite] && File.exists?(indexfile)
         @log.info "Saving #{indexfile}"
         File.open(indexfile, "wb") {|f| f << body }
       elsif File.directory?(outfile)
         dirfile = outfile + ".d"
-        outfile = File.join(outfile, "/index.html")
-        return if opts[:no_overwrite] && File.exists?(outfile)
         @log.info "Saving #{dirfile}"
         File.open(dirfile, "wb") {|f| f << body }
-        FileUtils.cp(dirfile, outfile)
+        FileUtils.cp(dirfile, File.join(outfile, "/index.html"))
       else
-        return if opts[:no_overwrite] && File.exists?(outfile)
         @log.info "Saving #{outfile}"
         File.open(outfile, "wb") {|f| f << body }
       end
     end
 
-    def save_page_to_aws(response, uri
+    def save_page_to_aws(response, uri)
       key = uri.path
       key += "?#{uri.query}" if uri.query
+      key = key.gsub(%r{/$},"/index.html")
       key = key.gsub(%r{^/},"")
       key = "index.html" if key == ""
       # Upload this file directly to AWS::S3
-      opts = {:acl =>
+      opts = {:acl => "public-read"}
       opts[:content_type] = response['content-type'] rescue "text/html"
       @log.info "Uploading #{key} to s3 with content type #{opts[:content_type]}"
       if response.respond_to?(:read_body)
-
+        body = process_body(response.read_body, uri, opts)
+        @s3_bucket.object(key).put(opts.merge(body: body))
       else
-
-
+        body = process_body(response, uri, opts)
+        @s3_bucket.object(key).put(opts.merge(body: body))
+      end
     end
-
+
     def process_success(response, parsed_uri)
       url = parsed_uri.to_s
+      if @opts[:filter_process]
+        return if @opts[:filter_process].call(response, parsed_uri)
+      end
       case response['content-type']
       when /css/
-        save_page(response, parsed_uri
-        add_urls(extract_css_urls(response.body,
+        save_page(response, parsed_uri)
+        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
       when /html/
-
-
-        doc
+        save_page(response, parsed_uri)
+        doc = Nokogiri::HTML(response.body)
+        add_urls(extract_links(doc, url), {:type_hint => "link"})
+        add_urls(extract_scripts(doc, url), {:type_hint => "script"})
+        add_urls(extract_images(doc, url), {:type_hint => "image"})
+        add_urls(extract_css_urls(response.body, url), {:type_hint => "css_url"})
         add_urls(extract_videos(doc, parsed_uri), {:type_hint => "video"})
-        add_urls(
-        add_urls(extract_scripts(doc, parsed_uri), {:type_hint => "script"})
-        add_urls(extract_images(doc, parsed_uri), {:type_hint => "image"})
-        add_urls(extract_hrefs(doc, parsed_uri), {:type_hint => "href"})
-        # extract inline style="background-image:url('https://')" type of urls
-        add_urls(extract_css_urls(body, parsed_uri), {:type_hint => "css_url"})
+        add_urls(extract_hrefs(doc, url), {:type_hint => "href"}) unless @opts[:single_page]
       else
-        save_page(response, parsed_uri
+        save_page(response, parsed_uri)
       end
     end
 
     # If we hit a redirect we save the redirect as a meta refresh page
     # TODO: for AWS S3 hosting we could instead create a redirect?
-    def process_redirect(url, destination_url
+    def process_redirect(url, destination_url)
       body = "<html><head><META http-equiv='refresh' content='0;URL=\"#{destination_url}\"'></head><body>You are being redirected to <a href='#{destination_url}'>#{destination_url}</a>.</body></html>"
-      save_page(body, url
+      save_page(body, url)
+    end
+
+    def process_body(body, uri, opts)
+      if @process_body
+        body = @process_body.call(body, uri, opts)
+      end
+      body
     end
 
     # Fetch a URI and save it to disk
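The new `gsub(%r{/$},"/index.html")` line in `save_page_to_aws` is what maps directory-style URLs onto keys that S3 static hosting can serve. The key derivation can be isolated as a pure function (upload omitted, since it needs the aws-sdk; input paths are hypothetical):

```ruby
# S3 key normalization from save_page_to_aws, as a pure function.
def s3_key(path, query = nil)
  key = path.dup
  key += "?#{query}" if query
  key = key.gsub(%r{/$}, "/index.html")  # trailing slash -> index.html
  key = key.gsub(%r{^/}, "")             # S3 keys have no leading slash
  key = "index.html" if key == ""        # bare domain root
  key
end

p s3_key("/")             # root page
p s3_key("/about/")       # directory-style URL
p s3_key("/css/site.css") # plain asset
```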
@@ -218,31 +232,37 @@ module Staticizer
       parsed_uri = URI(url)
 
       @log.debug "Fetching #{parsed_uri}"
-
+
       # Attempt to use an already open Net::HTTP connection
       key = parsed_uri.host + parsed_uri.port.to_s
       connection = @http_connections[key]
       if connection.nil?
         connection = Net::HTTP.new(parsed_uri.host, parsed_uri.port)
-        connection.use_ssl = true if parsed_uri.scheme == "https"
+        connection.use_ssl = true if parsed_uri.scheme.downcase == "https"
         @http_connections[key] = connection
       end
 
       request = Net::HTTP::Get.new(parsed_uri.request_uri)
-
-
-
-
-
-
-
-
-
-
-
+      begin
+        connection.request(request) do |response|
+          case response
+          when Net::HTTPSuccess
+            process_success(response, parsed_uri)
+          when Net::HTTPRedirection
+            redirect_url = response['location']
+            @log.debug "Processing redirect to #{redirect_url}"
+            process_redirect(parsed_uri, redirect_url)
+            add_url(redirect_url)
+          else
+            @log.error "Error #{response.code}:#{response.message} fetching url #{url}"
+          end
         end
+      rescue OpenSSL::SSL::SSLError => e
+        @log.error "SSL Error #{e.message} fetching url #{url}"
+      rescue Errno::ECONNRESET => e
+        @log.error "Error #{e.class}:#{e.message} fetching url #{url}"
       end
     end
 
   end
-end
+end
data/lib/staticizer/version.rb
CHANGED
data/staticizer.gemspec
CHANGED
@@ -20,6 +20,7 @@ Gem::Specification.new do |spec|
 
   spec.add_development_dependency "bundler", "~> 1.3"
   spec.add_development_dependency "rake"
+  spec.add_development_dependency "webmock"
 
   spec.add_runtime_dependency 'nokogiri'
   spec.add_runtime_dependency 'aws-sdk'
data/tests/crawler_test.rb
CHANGED
@@ -1,15 +1,80 @@
 require 'minitest/autorun'
+require 'ostruct'
 
-
+lib = File.expand_path(File.dirname(__FILE__) + '/../lib')
+$LOAD_PATH.unshift(lib) if File.directory?(lib) && !$LOAD_PATH.include?(lib)
+
+require 'staticizer'
 
 class TestFilePaths < MiniTest::Unit::TestCase
-
-
-
-
-
-
-
-  "
-
+  def setup
+    @crawler = Staticizer::Crawler.new("http://test.com")
+    @crawler.log_level = Logger::FATAL
+    @fake_page = File.read(File.expand_path(File.dirname(__FILE__) + "/fake_page.html"))
+  end
+
+  def test_save_page_to_disk
+    fake_response = OpenStruct.new(:read_body => "test", :body => "test")
+    file_paths = {
+      "http://test.com" => "index.html",
+      "http://test.com/" => "index.html",
+      "http://test.com/asdfdf/dfdf" => "/asdfdf/dfdf",
+      "http://test.com/asdfdf/dfdf/" => ["/asdfdf/dfdf","/asdfdf/dfdf/index.html"],
+      "http://test.com/asdfad/asdffd.test" => "/asdfad/asdffd.test",
+      "http://test.com/?asdfsd=12312" => "/?asdfsd=12312",
+      "http://test.com/asdfad/asdffd.test?123=sdff" => "/asdfad/asdffd.test?123=sdff",
+    }
+
+    # TODO: Stub out file system using https://github.com/defunkt/fakefs?
+    outputdir = "/tmp/staticizer_crawl_test"
+    FileUtils.rm_rf(outputdir)
+    @crawler.output_dir = outputdir
+
+    file_paths.each do |k,v|
+      @crawler.save_page_to_disk(fake_response, URI.parse(k))
+      [v].flatten.each do |file|
+        expected = File.expand_path(outputdir + "/#{file}")
+        assert File.exists?(expected), "File #{expected} not created for url #{k}"
+      end
+    end
+  end
+
+  def test_save_page_to_aws
+  end
+
+  def test_add_url_with_valid_domains
+    test_url = "http://test.com/test"
+    @crawler.add_url(test_url)
+    assert(@crawler.url_queue[-1] == [test_url, {}], "URL #{test_url} not added to queue")
+  end
+
+  def test_add_url_with_filter
+  end
+
+  def test_initialize_options
+  end
+
+  def test_process_url
+  end
+
+  def test_make_absolute
+  end
+
+  def test_link_extraction
+  end
+
+  def test_href_extraction
+  end
+
+  def test_css_extraction
+  end
+
+  def test_css_url_extraction
+  end
+
+  def test_image_extraction
+  end
+
+  def test_script_extraction
+  end
 end
data/tests/fake_page.html
ADDED
@@ -0,0 +1,288 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<title>Web Application Design and Development — Square Mill Labs</title>
+<meta content="authenticity_token" name="csrf-param" />
+<meta content="LshjtNLXmjVY9NINXYQds+2Ur+jxUtqKVjjbDbVl+9w=" name="csrf-token" />
+<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
+<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
+<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
+<meta property="og:type" content="website">
+<meta property="og:url" content="http://squaremill.com/">
+<meta property="og:image" content="">
+<meta name="viewport" content="width=device-width, maximum-scale=1.0, initial-scale=1.0">
+<meta name="description" content="Web Application Design and Development — Square Mill Labs">
+<link rel="shortcut icon" type="image/png" href="http://squaremill.com/assets/icons/favicon-0fecbe6b20ff5bdf623357a3fac76b4b.png">
+<link data-turbolinks-track="true" href="/assets/mn_application-5ddad96f16e03ad2137bf02270506e61.css" media="all" rel="stylesheet" />
+<!--[if lt IE 9]>
+<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
+<![endif]-->
+
+<script type="text/javascript" src="//use.typekit.net/cjr4fwy.js"></script>
+<script type="text/javascript">try{Typekit.load();}catch(e){}</script>
+</head>
+
+<body id="public">
+<script type="text/javascript">
+
+var _gaq = _gaq || [];
+_gaq.push(['_setAccount', 'UA-30460332-1']);
+_gaq.push(['_setDomainName', 'squaremill.com']);
+_gaq.push(['_trackPageview']);
+
+(function() {
+var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+})();
+
+</script>
+
+
+<header id="header">
+<nav class="nav container">
+<a class="branding" href="http://squaremill.com/" rel="home" title="Square Mill - Digital Products for Web and Mobile">
+<img alt="Square Mill Logo" class="logo" height="16" src="/assets/m2-wordmark-black-97525464acd136ce26b77e39c7ed2ba3.png" width="128" />
+<p class="description">
+Digital Products for Web and Mobile
+</p>
+</a> <a class="menu-trigger" href="#">Menu</a>
+<div class="main-nav">
+<ul class="container">
+<li><a href="/projects">Projects</a></li>
+<li><a href="/about">About Us</a></li>
+<li><a href="/blog">Blog</a></li>
+
+<!-- <li class="link-biography"><a href="http://squaremill.com/#biography">People</a></li> -->
+</ul>
+</div>
+
+</nav>
+
+</header>
+
+
+
+<div id="site-content">
+
+
+
+
+<div class="container" id="home-projects">
+<section class="big-promo">
+<div class="project" id="project-7">
+<a href="/projects/bon-voyaging">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/7/bonvoyaging-desktop.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/7/bonvoyagin-handheld.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Bon Voyaging <i class="icon-play-sign"></i></h2>
+<p>Bon Voyaging enables discerning travelers to expertly envision their next voyage from inspiration to exploration. Powerful search tools and a interactive javascript interface make planning trips fun.</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-1">
+<a href="/projects/kpcb-fellows">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/1/kpcb-fellows-screenshot.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/1/kpcb-fellows-iphone-screenshot.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>KPCB Fellows Website and Brand <i class="icon-play-sign"></i></h2>
+<p>The Fellows Program is a three-month work-based program that pairs top U.S. Engineering, Design and Product Design students with leading technology companies</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-2">
+<a href="/projects/thomson-reuters-messenger">
+<div class="devices">
+<div class="laptop device no-handheld">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/2/thomson-reuters-messenger-desktop.png" />
+</div>
+</div>
+
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Thomson Reuters Messenger <i class="icon-play-sign"></i></h2>
+<p>Messenger is an html5 / javascript instant messenger application for financial professionals</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+<section class="big-promo">
+<div class="project" id="project-3">
+<a href="/projects/kleiner-perkins-caufield-byers-digital-presence">
+<div class="devices">
+<div class="laptop device">
+<img alt="" class="chrome" src="/assets/projects/macbook-pro-1be04a78f99b40e2c676d78fc24fcc3d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/desktop_image/3/kpcb-screenshot.jpg" />
+</div>
+</div>
+
+<div class="handheld device">
+<img alt="" class="chrome" src="/assets/projects/iphone5-c1ea3a16be931f1e80bacab5cfec932d.png" />
+<div class="screenshot">
+<img alt="" src="/uploads/project/iphone_image/3/kpcb-iphone-screenshot.jpg" />
+</div>
+</div>
+</div>
+
+<div class="project-description">
+<div class="summary">
+<h2>Kleiner Perkins Caufield & Byers Digital Presence <i class="icon-play-sign"></i></h2>
+<p>KPCB is a venture capital stalwart located in Silicon Valley with over 40 years of tech and science investment.</p>
+</div>
+</div>
+</a>
+</div>
+</section>
+</div>
+
+<section class="clients full-width">
+<div class="container">
+<h2>Clients</h2>
+<ul class="hlist">
+<li><a href="http://kpcb.com" rel="friend" target="_blank" title="KPCB's Website"><img alt="KPCB" src="/uploads/client/image/1/home_logo_kpcb-logo.png" /></a></li>
+<li><a href="http://thomsonreuters.com" rel="friend" target="_blank" title="Thomson Reuters's Website"><img alt="Thomson Reuters" src="/uploads/client/image/2/home_logo_thomsonreuters.png" /></a></li>
+<li><a href="http://sumzero.com" rel="friend" target="_blank" title="SumZero's Website"><img alt="SumZero" src="/uploads/client/image/3/home_logo_sumzero.png" /></a></li>
+<li><a href="http://marlboroughgallery.com" rel="friend" target="_blank" title="Marlborough Gallery's Website"><img alt="Marlborough Gallery" src="/uploads/client/image/4/home_logo_marlborough.png" /></a></li>
+<li><a href="http://flurry.com" rel="friend" target="_blank" title="Flurry Analytics's Website"><img alt="Flurry Analytics" src="/uploads/client/image/8/home_logo_flurry.png" /></a></li>
+</ul>
+</div>
+</section>
+
+<section class="quote">
+<blockquote>
+<p>"Square Mill really took the time to understand our business and think strategically about how we want to engage and communicate with our entrepreneurs online. Together, their small team is responsive, nimble and efficient and has the deep design and technical chops to back it up."</p>
+<small><a href="http://kpcb.com/partner/christina-lee" rel="friend" title="Christina Lee, Operating Partner at KPCB">Christina Lee</a>, <em>Operating Partner at KPCB</em></small>
+</blockquote>
+</section>
+
+
+
+
+</div>
+
+<footer id="footer">
+<section class="container">
+<a class="logo" href="http://squaremill.com/" rel="home">
+<img alt="Square Mill Logo" height="64" src="/assets/md-logo-black-942423ecfd86c43ec6f13f163ea03f97.png" width="64" />
+</a> <div class="main-nav">
+<ul class="container">
+<li><a href="/projects">Projects</a></li>
+<li><a href="/about">About Us</a></li>
+<li><a href="/blog">Blog</a></li>
+
+<!-- <li class="link-biography"><a href="http://squaremill.com/#biography">People</a></li> -->
+</ul>
+</div>
+
+</section>
+
+<p class="copyright">
+© 2014 Square Mill Labs, LLC. All rights reserved.
+</p>
+</footer>
+
+<script type="text/javascript" src="http://code.jquery.com/jquery-2.0.0.js"></script>
+<script type="text/javascript" src="http://code.jquery.com/jquery-migrate-1.1.1.js"></script>
+<script src="/assets/mn_application-82f6787dca307be34ec0c9fa6b7ba7d4.js"></script>
+<script>
+$(document).ready(function() {
+
+var controller = $.superscrollorama({
+triggerAtCenter: true,
+playoutAnimations: true
+});
+
+if ( $(window).width() >= 767 ) {
+controller.addTween('#project-7',
+TweenMax.from($('#project-7'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-7').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-1',
+TweenMax.from($('#project-1'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-1').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-2',
+TweenMax.from($('#project-2'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-2').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+controller.addTween('#project-3',
+TweenMax.from($('#project-3'), .7, {
+css:{"opacity":"0"},
+onComplete: function(){
+$('#project-3').toggleClass('active-in')
+}
+}),
+300, // duration of scroll in pixel units
+-100, // scroll offset (from center of viewport)
+true
+);
+}
+
+});
+</script>
+
+</body>
+</html>
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: staticizer
 version: !ruby/object:Gem::Version
-  version: 0.0.8
+  version: 0.0.9
 platform: ruby
 authors:
 - Conor Hunt
@@ -38,6 +38,20 @@ dependencies:
   - - ">="
   - !ruby/object:Gem::Version
     version: '0'
+- !ruby/object:Gem::Dependency
+  name: webmock
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
@@ -87,6 +101,7 @@ files:
 - lib/staticizer/version.rb
 - staticizer.gemspec
 - tests/crawler_test.rb
+- tests/fake_page.html
 homepage: https://github.com/SquareMill/staticizer
 licenses:
 - MIT