lame-sitemapper 0.1.0

checksums.yaml.gz ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: c783779dbc3421bcd12f2efb44e6d55fb4d96275621d1733015893c1c5168ce9
4
+ data.tar.gz: af111696eb28cc307ef70d72b0c44e66c0dc2ea4ded5eea075ecc4912e350f3a
5
+ SHA512:
6
+ metadata.gz: 2626d9617f772a6bfef6b62bb66f01fc58a68c2caa46c93636cc8b76118ed8ce4e19df96cc9efc619e4f0f42f43b3a71262d0df0b4e13289ab7f12cd2de323ed
7
+ data.tar.gz: 188db116c63128a3a3ad2486a9b7e38b6da8fdd5fbce0326f792840cbf3b0723f4c2f99a892b38e59f1537ccd18c538ed4541895319da23f11adf1e6d7987fde
data/.gitignore ADDED
@@ -0,0 +1,4 @@
1
+ /tmp/
2
+ *.gem
3
+ *.log
4
+ Gemfile.lock
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --color
2
+ --format documentation
3
+ --require spec_helper
data/Gemfile ADDED
@@ -0,0 +1,16 @@
1
+ source "https://rubygems.org"
2
+
3
+ gemspec
4
+
5
+ gem "typhoeus"
6
+ gem "nokogiri"
7
+ gem "webrobots"
8
+ gem "addressable"
9
+ gem "public_suffix"
10
+ gem "digest-murmurhash"
11
+ gem "graphviz"
12
+ gem "activesupport"
13
+
14
+ group :development, :test do
15
+ gem "rspec"
16
+ end
data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2014 Orest Kulik
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,37 @@
+ # lame-sitemapper
+
+ A tool for simple, static web page hierarchy exploration. It starts from an arbitrary page you provide and descends into the tree of links until it has either traversed all reachable content on the site or stopped at a predefined traversal depth. It is written in Ruby and implemented as a CLI application. Based on user preference, it can output a report as a standard sitemap.xml file (used by many search engines), a dot file (for easier site hierarchy visualization, [graphviz][graphviz] compatible), a plain text file (displaying detailed hierarchical relations between pages) or a simple HTML page.
+
+ The main challenge in traversing a site's links is knowing whether a link has already been seen so that the crawler does not keep exploring in that direction; this is what prevents infinite traversal, jumping from link to link forever (a sketch of the approach follows this README).
+
+ See [http://www.nisdom.com/blog/2014/04/12/a-simple-ruby-sitemap-dot-xml-generator/][nisdom-sitemapper] for more details.
+
+ ## Features
+ * Obeys robots.txt (can optionally be disregarded).
+ * Produces four report types: 'text', 'sitemap', 'html' and 'graph'.
+ * Tracks HTTP redirects.
+ * Configurable number of concurrent scraper threads.
+
+ ## Installation
+ Install it from RubyGems.org using `gem install lame-sitemapper`.
+
+ ## Examples
+ Crawls page links up to depth 3, uses 6 threads with the most verbose logging, disregards robots.txt and creates a hierarchical text report:
+ ```
+ lame-sitemapper "http://www.some.site.mom" -l 0 -d 3 -t 6 --no-robots
+ ```
+ Crawls up to depth 4, uses 6 threads, disregards robots.txt, creates a dot file, converts it to a PNG and opens it (requires [graphviz][graphviz] to be installed):
+ ```
+ lame-sitemapper "http://www.some.site.mom" -l 0 -d 4 -t 6 --no-robots \
+   -r graph > site.dot && dot -Tpng site.dot > site.png && open site.png
+ ```
+ Traverses up to depth 2, obeys robots.txt and creates an HTML report:
+ ```
+ lame-sitemapper "http://www.some.site.mom" -d 2 -r html > site.html \
+   && open site.html
+ ```
+
+ [graphviz]: http://www.graphviz.org/
+ [github-sitemapper]: http://github.com/okulik/lame-sitemapper/
+ [bundler]: http://bundler.io/
+ [nisdom-sitemapper]: http://www.nisdom.com/blog/2014/04/12/a-simple-ruby-sitemap-dot-xml-generator/
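The already-seen check referenced in the README lives in `lib/scraper.rb`: every normalized URL is hashed with MurmurHash (scheme omitted, so `http` and `https` variants collapse together) and recorded in a hash table shared by all scraper threads. A minimal sketch of that idea, assuming the `digest-murmurhash` and `addressable` gems are installed (the variable names here are illustrative, not the gem's API):

```ruby
require "digest/murmurhash"
require "addressable/uri"

# Key a URL by the 64-bit MurmurHash of its scheme-less form.
seen_key = ->(url) { Digest::MurmurHash64B.hexdigest(Addressable::URI.parse(url).omit(:scheme).to_s) }

seen_urls = {}
url = "http://www.nisdom.com/blog/"

unless seen_urls[seen_key.call(url)]
  seen_urls[seen_key.call(url)] = true
  # ... fetch and scrape the page, then enqueue its anchors ...
end
```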
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "lame_sitemapper"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,9 @@
1
+ #!/usr/bin/env bash
2
+
3
+ set -euo pipefail
4
+ IFS=$'\n\t'
5
+ set -vx
6
+
7
+ bundle install
8
+
9
+ # Do any other automated setup that you need to do here
data/exe/lame-sitemapper ADDED
@@ -0,0 +1,7 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+
5
+ require "cli"
6
+
7
+ LameSitemapper::Cli.new($stdout, ARGV, File.basename(__FILE__)).run
data/lame-sitemapper.gemspec ADDED
@@ -0,0 +1,34 @@
1
+ require_relative "lib/lame_sitemapper/version"
2
+
3
+ Gem::Specification.new do |spec|
4
+ spec.name = "lame-sitemapper"
5
+ spec.version = LameSitemapper::VERSION
6
+ spec.authors = ["Orest Kulik"]
7
+ spec.email = ["orest@nisdom.com"]
8
+
9
+ spec.summary = %q{A tool for simple, static web page hierarchy exploration.}
10
+ spec.description = %q{It starts from an arbitrary page you provide and descends into the tree of links until it has either traversed all possible content on the web site or has stopped at some predefined traversal depth. It is written in Ruby and implemented as a CLI application. Based on user preference, it can output text reports in a standard sitemap.xml form (used by many search engines), a dot file (for easier site hierarchy visualization, graphviz compatible), a plain text file (displaying detailed hierarchical relations between pages) and a simple HTML format.}
11
+ spec.required_ruby_version = Gem::Requirement.new(">= 2.3.0")
12
+
13
+ spec.metadata["source_code_uri"] = "https://github.com/okulik/lame-sitemapper"
14
+
15
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
16
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(spec)/}) }
17
+ end
18
+ spec.bindir = "exe"
19
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
20
+ spec.require_paths = ["lib"]
21
+
22
+ spec.add_runtime_dependency("typhoeus", "~> 0.6", ">= 0.6.8")
23
+ spec.add_runtime_dependency("nokogiri", "~> 1.6", ">= 1.6.1")
24
+ spec.add_runtime_dependency("webrobots", "~> 0.1", ">= 0.1.1")
25
+ spec.add_runtime_dependency("addressable", "~> 2.3", ">= 2.3.6")
26
+ spec.add_runtime_dependency("public_suffix", "~> 1.4", ">= 1.4.2")
27
+ spec.add_runtime_dependency("digest-murmurhash", "~> 0.3", ">= 0.3.0")
28
+ spec.add_runtime_dependency("graphviz", "~> 0.4", ">= 0.4.0")
29
+ spec.add_runtime_dependency("activesupport", "~> 6.0", ">= 6.0.3.2")
30
+
31
+ spec.add_development_dependency("pry")
32
+ spec.add_development_dependency("pry-doc")
33
+ spec.add_development_dependency("pry-byebug")
34
+ end
data/lib/cli.rb ADDED
@@ -0,0 +1,120 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "optparse"
4
+ require "ostruct"
5
+
6
+ require "lame_sitemapper"
7
+ require "core"
8
+ require "url_helper"
9
+ require "report_generator"
10
+
11
+ module LameSitemapper
12
+ class Cli
13
+ attr_reader :opt_parser
14
+
15
+ def initialize(out = nil, args = [], run_file = File.basename(__FILE__))
16
+ @out = out
17
+ @args = args
18
+
19
+ @options = OpenStruct.new
20
+ @options.use_robots = LameSitemapper::SETTINGS[:use_robots]
21
+ @options.max_page_depth = LameSitemapper::SETTINGS[:max_page_depth]
22
+ @options.log_level = LameSitemapper::SETTINGS[:log_level].to_i
23
+ @options.report_type = LameSitemapper::SETTINGS[:report_type]
24
+ @options.frequency_type = LameSitemapper::SETTINGS[:sitemap_frequency_type]
25
+ @options.scraper_threads = LameSitemapper::SETTINGS[:scraper_threads].to_i
26
+
27
+ Thread.current[:name] = "**"
28
+
29
+ @opt_parser = OptionParser.new do |opts|
30
+ opts.banner = "Generate sitemap.xml for a given url."
31
+ opts.separator ""
32
+ opts.separator "Usage: ruby #{run_file} [options] <uri>"
33
+ opts.separator "url needs to be in the form of e.g. http://www.nisdom.com"
34
+ opts.separator ""
35
+ opts.separator "Specific options:"
36
+
37
+ opts.on("--[no-]robots", "Run with robots.txt") do |r|
38
+ @options.use_robots = r
39
+ end
40
+
41
+ opts.on("-l", "--log-level LEVEL", "Set log level from 0 to 4, 0 is most verbose (default 1)") do |level|
42
+ if level.to_i < 0 || level.to_i > 4
43
+ @out.puts opts if @out
44
+ exit
45
+ end
46
+ LOGGER.level = level.to_i
47
+ end
48
+
49
+ opts.on("-d", "--depth DEPTH", "Set maximum page traversal depth from 1 to 10 (default 10)") do |depth|
50
+ if depth.to_i < 1 || depth.to_i > 10
51
+ @out.puts opts if @out
52
+ exit
53
+ end
54
+ @options.max_page_depth = depth.to_i
55
+ end
56
+
57
+ report_types = [:text, :sitemap, :html, :graph, :test_yml]
58
+ opts.on("-r", "--report-type TYPE", report_types, "Set report type #{report_types.map {|f| '\'' + f.to_s + '\''}.join(", ")} (defalut 'text')") do |type|
59
+ @options.report_type = type
60
+ end
61
+
62
+ change_frequency = [:none, :always, :hourly, :daily, :weekly, :monthly, :yearly, :never]
63
+ opts.on("--change-frequency FREQ", change_frequency, "Set sitemap's page change frequency #{change_frequency.map {|f| '\'' + f.to_s + '\''}.join(", ")} (default 'daily')") do |freq|
64
+ @options.frequency_type = freq
65
+ end
66
+
67
+ opts.on("-t", "--scraper-threads NUM", "Set number of scraper threads from 1 to 10 (default 1)") do |num|
68
+ if num.to_i < 1 || num.to_i > 10
69
+ @out.puts opts if @out
70
+ exit
71
+ end
72
+ @options.scraper_threads = num.to_i
73
+ end
74
+
75
+ opts.separator ""
76
+ opts.separator "Common options:"
77
+
78
+ opts.on_tail("-h", "--help", "Display this screen") do
79
+ @out.puts opts if @out
80
+ exit
81
+ end
82
+
83
+ opts.on_tail("-v", "--version", "Show version") do
84
+ @out.puts LameSitemapper::VERSION if @out
85
+ exit
86
+ end
87
+ end
88
+ end
89
+
90
+ def run
91
+ @opt_parser.parse! @args
92
+ if @args.empty?
93
+ @out.puts @opt_parser if @out
94
+ exit
95
+ end
96
+
97
+ start_url = @args.shift
98
+ normalized_host = UrlHelper::get_normalized_host(start_url)
99
+ normalized_start_url = UrlHelper::get_normalized_url(normalized_host, start_url)
100
+ if normalized_host.nil? || normalized_start_url.nil?
101
+ @out.puts @opt_parser if @out
102
+ exit
103
+ end
104
+
105
+ LOGGER.info "starting with #{normalized_start_url}, options #{@options.inspect}"
106
+
107
+ start_time = Time.now
108
+ root, normalized_start_url = Core.new(@out, @options).start(normalized_host, normalized_start_url)
109
+ return unless root
110
+
111
+ LOGGER.info "found #{root.count} pages in #{Time.now - start_time}s"
112
+
113
+ @out.puts ReportGenerator.new(@options, normalized_start_url).send("to_#{@options.report_type}", root) if @out
114
+ rescue OptionParser::InvalidArgument, OptionParser::InvalidOption, OptionParser::MissingArgument => e
115
+ @out.puts e if @out
116
+ @out.puts @opt_parser if @out
117
+ exit
118
+ end
119
+ end
120
+ end
data/lib/core.rb ADDED
@@ -0,0 +1,105 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "typhoeus"
4
+ require "webrobots"
5
+ require "addressable/uri"
6
+
7
+ require_relative "scraper"
8
+ require_relative "page"
9
+ require_relative "url_helper"
10
+ require_relative "web_helper"
11
+
12
+ module LameSitemapper
13
+ class Core
14
+ def initialize(out, opts)
15
+ @out = out
16
+ @opts = opts
17
+ end
18
+
19
+ def start(host, start_url)
20
+ if @opts.use_robots
21
+ @robots = WebRobots.new(SETTINGS[:web_settings][:useragent], {
22
+ crawl_delay: :sleep,
23
+ :http_get => lambda do |url|
24
+ response = WebHelper.get_http_response(url)
25
+ return unless response
26
+ return response.body.force_encoding("UTF-8")
27
+ end
28
+ })
29
+
30
+ if error = @robots.error(host)
31
+ msg = "unable to fetch robots.txt"
32
+ LOGGER.fatal msg
33
+ $stderr.puts msg
34
+ return [nil, start_url]
35
+ end
36
+ end
37
+
38
+ # check if our host redirects to somewhere else, if it does, change start_url to redirect url
39
+ response = WebHelper.get_http_response(start_url, :head)
40
+ unless response
41
+ msg = "unable to fetch starting url"
42
+ LOGGER.fatal msg
43
+ $stderr.puts msg
44
+
45
+ return [nil, start_url]
46
+ end
47
+
48
+ if response.redirect_count.to_i > 0
49
+ host = UrlHelper::get_normalized_host(response.effective_url)
50
+ start_url = UrlHelper::get_normalized_url(host, response.effective_url)
51
+ end
52
+
53
+ urls_queue = Queue.new
54
+ pages_queue = Queue.new
55
+ seen_urls = {}
56
+ threads = []
57
+ root = nil
58
+
59
+ Thread.abort_on_exception = true
60
+ (1..@opts.scraper_threads.to_i).each_with_index do |index|
61
+ threads << Thread.new { Scraper.new(seen_urls, urls_queue, pages_queue, index, @opts, @robots).run }
62
+ end
63
+
64
+ urls_queue.push(host: host, url: start_url, depth: 0, parent: root)
65
+
66
+ loop do
67
+ msg = pages_queue.pop
68
+ if msg[:page]
69
+ if LOGGER.info?
70
+ if msg[:page].scraped?
71
+ details = ": a(#{msg[:page].anchors.count}), img(#{msg[:page].images.count}), link(#{msg[:page].links.count}), script(#{msg[:page].scripts.count})"
72
+ else
73
+ details = ": #{msg[:page].format_codes}"
74
+ end
75
+ LOGGER.info "#{UrlHelper.log_prefix(msg[:depth])} created at #{msg[:page].path}#{details}"
76
+ end
77
+
78
+ msg[:page].anchors.each do |anchor|
79
+ urls_queue.push(host: host, url: anchor, depth: msg[:depth] + 1, parent: msg[:page])
80
+ end
81
+
82
+ if msg[:parent].nil?
83
+ root = msg[:page]
84
+ else
85
+ msg[:parent].sub_pages << msg[:page]
86
+ end
87
+ end
88
+
89
+ if urls_queue.empty? && pages_queue.empty?
90
+ until urls_queue.num_waiting == threads.size
91
+ Thread.pass
92
+ end
93
+ if pages_queue.empty?
94
+ threads.size.times { urls_queue << nil }
95
+ break
96
+ end
97
+ end
98
+ end
99
+
100
+ threads.each { |thread| thread.join }
101
+
102
+ [root, start_url]
103
+ end
104
+ end
105
+ end
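The trickiest part of `Core#start` above is knowing when to stop: the main loop only shuts down once every scraper thread is parked on `urls_queue.pop` and no pages are left to drain, at which point it pushes one `nil` sentinel per worker so each `Scraper#run` loop exits. A stripped-down sketch of that producer/consumer handshake (generic names, not the gem's API):

```ruby
work_queue = Queue.new

workers = Array.new(4) do
  Thread.new do
    while (job = work_queue.pop)   # a nil sentinel ends the loop
      # ... do the work for this job ...
    end
  end
end

10.times { |i| work_queue.push(i) }

# Once every worker is blocked on pop, the queue is drained; send one sentinel each.
Thread.pass until work_queue.num_waiting == workers.size
workers.size.times { work_queue.push(nil) }
workers.each(&:join)
```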
data/lib/lame_sitemapper.rb ADDED
@@ -0,0 +1,27 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "time"
4
+ require "logger"
5
+ require "yaml"
6
+ require "active_support/core_ext/hash/conversions"
7
+
8
+ require_relative "lame_sitemapper/version"
9
+
10
+ module LameSitemapper
11
+ SETTINGS = begin
12
+ settings_file = File.join(__dir__, "settings.yml")
13
+ env = $PROGRAM_NAME =~ /rspec$/i ? "test" : "production"
14
+ YAML::load(IO.read(settings_file))[env].deep_symbolize_keys
15
+ end
16
+
17
+ LOGGER = begin
18
+ log_file = SETTINGS[:log][:file_name]
19
+ Logger.new(log_file, SETTINGS[:log][:file_count], SETTINGS[:log][:file_size]).tap do |logger|
20
+ logger.level = SETTINGS[:log_level].to_i
21
+ logger.datetime_format = "%Y-%m-%d %H:%M:%S "
22
+ logger.formatter = proc do |severity, datetime, progname, msg|
23
+ "[#{datetime.strftime('%Y-%m-%d %H:%M:%S')} #{Thread.current[:name]}] #{severity[0]} -- : #{msg}\n"
24
+ end
25
+ end
26
+ end
27
+ end
data/lib/lame_sitemapper/version.rb ADDED
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module LameSitemapper
4
+ VERSION = "0.1.0"
5
+ end
data/lib/page.rb ADDED
@@ -0,0 +1,124 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "forwardable"
4
+
5
+ module LameSitemapper
6
+ class Page
7
+ extend Forwardable
8
+
9
+ def_delegators :each, :count
10
+
11
+ attr_accessor :path
12
+ attr_reader :sub_pages
13
+ attr_reader :anchors
14
+ attr_reader :images
15
+ attr_reader :links
16
+ attr_reader :scripts
17
+
18
+ NON_SCRAPED_DEPTH = 1
19
+ NON_SCRAPED_DOMAIN = 2
20
+ NON_SCRAPED_ROBOTS = 4
21
+ NON_SCRAPED_NO_HTML = 8
22
+ NON_SCRAPED_NOT_ACCESSIBLE = 16
23
+
24
+ def initialize(path)
25
+ @path = path
26
+ @sub_pages = []
27
+ @anchors = []
28
+ @images = []
29
+ @links = []
30
+ @scripts = []
31
+ @non_scraped_code = 0
32
+ end
33
+
34
+ def <<(page)
35
+ @sub_pages << page
36
+
37
+ self
38
+ end
39
+
40
+ def scraped?
41
+ @non_scraped_code == 0
42
+ end
43
+
44
+ def robots_forbidden?
45
+ @non_scraped_code & Page::NON_SCRAPED_ROBOTS > 0
46
+ end
47
+
48
+ def robots_forbidden=(value)
49
+ if value
50
+ @non_scraped_code |= Page::NON_SCRAPED_ROBOTS
51
+ else
52
+ @non_scraped_code &= ~Page::NON_SCRAPED_ROBOTS
53
+ end
54
+ end
55
+
56
+ def external_domain?
57
+ @non_scraped_code & Page::NON_SCRAPED_DOMAIN > 0
58
+ end
59
+
60
+ def external_domain=(value)
61
+ if value
62
+ @non_scraped_code |= Page::NON_SCRAPED_DOMAIN
63
+ else
64
+ @non_scraped_code &= ~Page::NON_SCRAPED_DOMAIN
65
+ end
66
+ end
67
+
68
+ def depth_reached?
69
+ @non_scraped_code & Page::NON_SCRAPED_DEPTH > 0
70
+ end
71
+
72
+ def depth_reached=(value)
73
+ if value
74
+ @non_scraped_code |= Page::NON_SCRAPED_DEPTH
75
+ else
76
+ @non_scraped_code &= ~Page::NON_SCRAPED_DEPTH
77
+ end
78
+ end
79
+
80
+ def no_html?
81
+ @non_scraped_code & Page::NON_SCRAPED_NO_HTML > 0
82
+ end
83
+
84
+ def no_html=(value)
85
+ if value
86
+ @non_scraped_code |= Page::NON_SCRAPED_NO_HTML
87
+ else
88
+ @non_scraped_code &= ~Page::NON_SCRAPED_NO_HTML
89
+ end
90
+ end
91
+
92
+ def not_accessible?
93
+ @non_scraped_code & Page::NON_SCRAPED_NOT_ACCESSIBLE > 0
94
+ end
95
+
96
+ def not_accessible=(value)
97
+ if value
98
+ @non_scraped_code |= Page::NON_SCRAPED_NOT_ACCESSIBLE
99
+ else
100
+ @non_scraped_code &= ~Page::NON_SCRAPED_NOT_ACCESSIBLE
101
+ end
102
+ end
103
+
104
+ def format_codes
105
+ reasons = []
106
+
107
+ reasons << "depth" if depth_reached?
108
+ reasons << "robots" if robots_forbidden?
109
+ reasons << "ext" if external_domain?
110
+ reasons << "nohtml" if no_html?
111
+ reasons << "noacc" if not_accessible?
112
+
113
+ "#{reasons.join('|')} "
114
+ end
115
+
116
+ def each(&block)
117
+ return enum_for(:each) unless block_given?
118
+
119
+ yield self
120
+ @sub_pages.each { |p| p.each(&block) }
121
+ end
122
+
123
+ end
124
+ end
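`Page` is both a tree node and an enumerable: `#each` yields the page itself and then recurses into `sub_pages`, which is what the delegated `count` and the report generators rely on, while the `NON_SCRAPED_*` bit flags record why a page was left unscraped. A small usage sketch (URLs are made up; it assumes the gem's `lib/` directory is on the load path):

```ruby
require "page"   # defines LameSitemapper::Page

root  = LameSitemapper::Page.new("http://example.com/")
child = LameSitemapper::Page.new("http://example.com/about")
child.depth_reached = true        # sets the NON_SCRAPED_DEPTH bit
root << child

root.count                        # => 2, root plus one sub-page
root.each { |p| puts p.path }     # depth-first walk over the whole tree
child.scraped?                    # => false
child.format_codes                # => "depth " (the reason it was not scraped)
```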
data/lib/report_generator.rb ADDED
@@ -0,0 +1,181 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "graphviz"
4
+ require_relative "page"
5
+
6
+ module LameSitemapper
7
+ class ReportGenerator
8
+ INDENT = " "
9
+ XML_PROLOG = <<-EOS
10
+ <?xml version="1.0" encoding="UTF-8"?>
11
+ <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
12
+ EOS
13
+ XML_EPILOG = "</urlset>"
14
+ XML_NODE_TEMPLATE1 = <<-EOS
15
+ <url>
16
+ <loc>%s</loc>
17
+ </url>
18
+ EOS
19
+ XML_NODE_TEMPLATE2 = <<-EOS
20
+ <url>
21
+ <loc>%s</loc>
22
+ <changefreq>%s</changefreq>
23
+ </url>
24
+ EOS
25
+ HTML_PROLOG = <<-EOS
26
+ <html>
27
+ <head>
28
+ <title>%s sitemap</title>
29
+ </head>
30
+ <body>
31
+ <h1>Site %s</h1>
32
+ EOS
33
+ HTML_EPILOG = <<-EOS
34
+ </body>
35
+ </html>
36
+ EOS
37
+
38
+ def initialize options, host
39
+ @options = options
40
+ @host = host
41
+ end
42
+
43
+ def to_text(page)
44
+ out = []
45
+ tree_to_text(page, out)
46
+ out.join
47
+ end
48
+
49
+ def to_sitemap(page)
50
+ out = []
51
+ out << XML_PROLOG
52
+ page.each do |p|
53
+ if @options.frequency_type != :none
54
+ out << XML_NODE_TEMPLATE2 % [ p.path, @options.frequency_type ]
55
+ else
56
+ out << XML_NODE_TEMPLATE1 % [ p.path ]
57
+ end
58
+ end
59
+ out << XML_EPILOG
60
+ out.join
61
+ end
62
+
63
+ def to_html(page)
64
+ out = []
65
+ out << HTML_PROLOG % [ page.path, page.path ]
66
+ page.each do |p|
67
+ out << "<h2>#{scraped_mark(p)}#{p.format_codes}<a href=\"#{p.path}\">#{p.path}</a></h2>\n"
68
+ if p.scraped?
69
+ out << "<h3>Images</h3>\n" if p.images.count > 0
70
+ p.images.each do |img|
71
+ uri = UrlHelper.get_normalized_url(@host, img)
72
+ out << "<div>\n"
73
+ out << "<a href=\"#{uri}\">#{uri}</a>\n"
74
+ out << "</div>\n"
75
+ end
76
+ out << "<h3>Links</h3>\n" if p.links.count > 0
77
+ p.links.each do |link|
78
+ uri = UrlHelper.get_normalized_url(@host, link)
79
+ out << "<div>\n"
80
+ out << "<p>#{uri}</p>\n"
81
+ out << "</div>\n"
82
+ end
83
+ out << "<h3>Scripts</h3>\n" if p.scripts.count > 0
84
+ p.scripts.each do |script|
85
+ uri = UrlHelper.get_normalized_url(@host, script)
86
+ out << "<div>\n"
87
+ out << "<p>#{uri}</p>\n"
88
+ out << "</div>\n"
89
+ end
90
+ end
91
+ end
92
+ out << HTML_EPILOG
93
+ out.join
94
+ end
95
+
96
+ def to_graph(page)
97
+ graph = Graphviz::Graph.new
98
+ tree_to_graph(page, graph)
99
+ graph.to_dot
100
+ end
101
+
102
+ def to_test_yml(page)
103
+ out = []
104
+ tree_to_test_yml(page, out)
105
+ out.join
106
+ end
107
+
108
+ private
109
+
110
+ def tree_to_graph(page, node)
111
+ n = node.add_node(page.path.to_s)
112
+ unless page.scraped?
113
+ n.attributes[:shape] = "box"
114
+ n.attributes[:color] = (
115
+ if page.robots_forbidden?
116
+ "crimson"
117
+ elsif page.depth_reached?
118
+ "darkorange"
119
+ elsif page.external_domain?
120
+ "deeppink"
121
+ elsif page.no_html?
122
+ "blue"
123
+ elsif page.not_accessible?
124
+ "blueviolet"
125
+ end
126
+ )
127
+ end
128
+ page.sub_pages.each do |p|
129
+ tree_to_graph(p, n)
130
+ end
131
+ end
132
+
133
+ def tree_to_text(page, out, depth=0)
134
+ indent = INDENT * 2 * depth
135
+ if page.scraped?
136
+ details = ": a(#{page.anchors.count}), img(#{page.images.count}), link(#{page.links.count}), script(#{page.scripts.count})"
137
+ else
138
+ details = ": #{page.format_codes}"
139
+ end
140
+ out << "#{indent}(#{depth})#{scraped_mark(page)}page #{page.path}#{details}\n"
141
+ return unless page.scraped?
142
+
143
+ if page.images.count > 0
144
+ out << "#{indent}#{INDENT}images:\n"
145
+ page.images.each { |img| out << "#{indent}#{INDENT*2}#{img}\n" }
146
+ end
147
+ if page.links.count > 0
148
+ out << "#{indent}#{INDENT}links:\n"
149
+ page.links.each { |link| out << "#{indent}#{INDENT*2}#{link}\n" }
150
+ end
151
+ if page.scripts.count > 0
152
+ out << "#{indent}#{INDENT}scripts:\n"
153
+ page.scripts.each { |script| out << "#{indent}#{INDENT*2}#{script}\n" }
154
+ end
155
+ if page.sub_pages.count > 0
156
+ out << "#{indent}#{INDENT}pages:\n"
157
+ page.sub_pages.each do |sub_page|
158
+ tree_to_text(sub_page, out, depth + 1)
159
+ end
160
+ end
161
+ end
162
+
163
+ def tree_to_test_yml(page, out)
164
+ if page.scraped?
165
+ out << "\"#{page.path}\": \"\n"
166
+ out << "<html><body>\n"
167
+ page.sub_pages.each do |p|
168
+ out << " <a href=\\\"#{p.path}\\\"></a>\n"
169
+ end
170
+ out << "</body></html>\"\n"
171
+ page.sub_pages.each do |p|
172
+ tree_to_test_yml(p, out)
173
+ end
174
+ end
175
+ end
176
+
177
+ def scraped_mark(page)
178
+ page.scraped? ? "* " : " "
179
+ end
180
+ end
181
+ end
data/lib/scraper.rb ADDED
@@ -0,0 +1,138 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "digest/murmurhash"
4
+
5
+ require_relative "page"
6
+ require_relative "url_helper"
7
+ require_relative "web_helper"
8
+
9
+ module LameSitemapper
10
+ class Scraper
11
+ EXTRACT_TAGS = [
12
+ ["//a/@href", "anchors"],
13
+ ["//img/@src", "images"],
14
+ ["//link/@href", "links"],
15
+ ["//script/@src", "scripts"]
16
+ ]
17
+
18
+ def initialize(seen_urls, urls_queue, pages_queue, index, opts, robots)
19
+ @seen_urls = seen_urls
20
+ @urls_queue = urls_queue
21
+ @pages_queue = pages_queue
22
+ @index = index
23
+ @opts = opts
24
+ @robots = robots
25
+ end
26
+
27
+ def run
28
+ Thread.current[:name] = "%02d" % @index
29
+ LOGGER.debug "running scraper #{@index}"
30
+ loop do
31
+ msg = @urls_queue.pop
32
+ unless msg
33
+ LOGGER.debug "scraper #{@index} received finish message"
34
+ break
35
+ end
36
+
37
+ page = create_page(msg)
38
+
39
+ @pages_queue.push(page: page, url: msg[:url], depth: msg[:depth], parent: msg[:parent])
40
+ end
41
+ end
42
+
43
+ private
44
+
45
+ def create_page(args)
46
+ normalized_url = UrlHelper.get_normalized_url(args[:host], args[:url])
47
+ unless normalized_url
48
+ LOGGER.error "failed to normalize url #{args[:url]}"
49
+ return
50
+ end
51
+
52
+ return if is_url_already_seen?(normalized_url, args[:depth])
53
+ set_already_seen_url(normalized_url)
54
+ page = Page.new(normalized_url)
55
+ return page unless should_crawl_page?(args[:host], page, args[:depth])
56
+
57
+ response = WebHelper.get_http_response(normalized_url)
58
+ unless response
59
+ LOGGER.error "failed to get resource for #{normalized_url}"
60
+ page.not_accessible = true
61
+ return page
62
+ end
63
+
64
+ if response.headers && response.headers["Content-Type"] !~ /text\/html/
65
+ LOGGER.debug "#{UrlHelper.log_prefix(args[:depth])} stopping, #{page.path} is not html"
66
+ page.no_html = true
67
+ return page
68
+ end
69
+
70
+ # if we had redirect, verify url once more
71
+ if response.redirect_count.to_i > 0
72
+ normalized_url = UrlHelper.get_normalized_url(args[:host], response.effective_url)
73
+ unless normalized_url
74
+ LOGGER.error "failed to normalize url #{response.effective_url}"
75
+ return page
76
+ end
77
+
78
+ page.path = normalized_url # modify path to match redirect
79
+ return if is_url_already_seen?(normalized_url, args[:depth])
80
+ set_already_seen_url(normalized_url)
81
+ return page unless should_crawl_page?(args[:host], page, args[:depth])
82
+ end
83
+
84
+ doc = Nokogiri::HTML(response.body)
85
+ unless doc
86
+ LOGGER.error "failed to parse document from url #{normalized_url}"
87
+ return page
88
+ end
89
+
90
+ EXTRACT_TAGS.each do |expression, collection|
91
+ doc.xpath(expression).each { |attr| page.send(collection) << attr.value }
92
+ page.instance_variable_set("@#{collection}", page.send(collection).reject(&:empty?).uniq)
93
+ end
94
+
95
+ LOGGER.debug "#{UrlHelper.log_prefix(args[:depth])} scraped page at #{normalized_url}"
96
+
97
+ page
98
+ end
99
+
100
+ def is_url_already_seen?(url, depth)
101
+ if @seen_urls[Digest::MurmurHash64B.hexdigest(url.omit(:scheme).to_s)]
102
+ LOGGER.debug "#{UrlHelper.log_prefix(depth)} skipping #{url}, already seen"
103
+ return true
104
+ end
105
+
106
+ false
107
+ end
108
+
109
+ def set_already_seen_url(url)
110
+ @seen_urls[Digest::MurmurHash64B.hexdigest(url.omit(:scheme).to_s)] = true
111
+ end
112
+
113
+ def should_crawl_page?(host, page, depth)
114
+ # check if url is on the same domain as host
115
+ unless UrlHelper.is_url_same_domain?(host, page.path)
116
+ LOGGER.debug "#{UrlHelper.log_prefix(depth)} stopping, #{page.path} is ext host"
117
+ page.external_domain = true
118
+ return false
119
+ end
120
+
121
+ # check if url is allowed with robots.txt
122
+ if @robots && @robots.disallowed?(page.path.to_s)
123
+ LOGGER.debug "#{UrlHelper.log_prefix(depth)} stopping, #{page.path} is robots.txt disallowed"
124
+ page.robots_forbidden = true
125
+ return false
126
+ end
127
+
128
+ # check if max traversal depth has been reached
129
+ if depth >= @opts[:max_page_depth].to_i
130
+ LOGGER.debug "#{UrlHelper.log_prefix(depth)} stopping, max traversal depth reached"
131
+ page.depth_reached = true
132
+ return false
133
+ end
134
+
135
+ true
136
+ end
137
+ end
138
+ end
data/lib/settings.yml ADDED
@@ -0,0 +1,28 @@
1
+ default: &default
2
+ max_page_depth: 10
3
+ log_level: 1
4
+ use_robots: true
5
+ web_settings:
6
+ followlocation: true
7
+ ssl_verifypeer: false
8
+ ssl_verifyhost: 2
9
+ useragent: lame-sitemapper v0.0.0.1
10
+ connecttimeout: 5
11
+ timeout: 5
12
+ report_type: text
13
+ sitemap_frequency_type: daily
14
+ scraper_threads: 5
15
+ log:
16
+ file_name: crawl.log
17
+ file_count: 10
18
+ file_size: 10485760
19
+ production:
20
+ <<: *default
21
+ test:
22
+ <<: *default
23
+ log_level: 0
24
+ scraper_threads: 1
25
+ log:
26
+ file_name: crawl-test.log
27
+ file_count: 10
28
+ file_size: 10485760
data/lib/url_helper.rb ADDED
@@ -0,0 +1,56 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "addressable/uri"
4
+ require "public_suffix"
5
+
6
+ module LameSitemapper
7
+ class UrlHelper
8
+ SUPPORTED_SCHEMAS = %w(http https)
9
+ LOG_INDENT = " " * 2
10
+
11
+ def self.get_normalized_host(host_string)
12
+ host_url = Addressable::URI.heuristic_parse(host_string, scheme: "http")
13
+
14
+ return unless SUPPORTED_SCHEMAS.include?(host_url.scheme)
15
+ return unless host_url.host
16
+ return if host_url.host =~ /\s/
17
+ return unless PublicSuffix.valid?(host_url.host)
18
+ host_url.omit!(:path, :query, :fragment)
19
+
20
+ Addressable::URI.encode(host_url, ::Addressable::URI).normalize
21
+ rescue Addressable::URI::InvalidURIError, TypeError
22
+ nil
23
+ end
24
+
25
+ def self.get_normalized_url(host_url, resource_url)
26
+ host_url = Addressable::URI.parse(host_url)
27
+ resource_url = Addressable::URI.parse(resource_url)
28
+
29
+ m = {}
30
+ m[:scheme] = host_url.scheme unless resource_url.scheme
31
+ unless resource_url.host
32
+ m[:host] = host_url.host
33
+ m[:port] = host_url.port
34
+ end
35
+ resource_url.merge!(m) unless m.empty?
36
+ return unless SUPPORTED_SCHEMAS.include?(resource_url.scheme)
37
+ return unless PublicSuffix.valid?(resource_url.host)
38
+ resource_url.omit!(:fragment)
39
+ resource_url.query = resource_url.query.split("&").map(&:strip).sort.join("&") unless resource_url.query.nil? || resource_url.query.empty?
40
+
41
+ Addressable::URI.encode(resource_url, ::Addressable::URI).normalize
42
+ rescue Addressable::URI::InvalidURIError, TypeError
43
+ nil
44
+ end
45
+
46
+ def self.is_url_same_domain?(host_url, resource_url)
47
+ host_url = Addressable::URI.parse(host_url)
48
+ resource_url = Addressable::URI.parse(resource_url)
49
+ host_url.host == resource_url.host
50
+ end
51
+
52
+ def self.log_prefix(depth)
53
+ "#{LOG_INDENT * depth}(#{depth})"
54
+ end
55
+ end
56
+ end
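To make the two normalization helpers above concrete, here is roughly how they compose; the exact return values depend on Addressable's normalization rules, so treat the commented results as indicative rather than guaranteed (assumes the gem's `lib/` directory is on the load path):

```ruby
require "url_helper"   # defines LameSitemapper::UrlHelper

host = LameSitemapper::UrlHelper.get_normalized_host("www.nisdom.com")
# scheme added heuristically, path/query/fragment dropped => http://www.nisdom.com/

LameSitemapper::UrlHelper.get_normalized_url(host, "/blog/post?b=2&a=1#top")
# relative path resolved against the host, query params sorted, fragment removed
# => http://www.nisdom.com/blog/post?a=1&b=2

LameSitemapper::UrlHelper.get_normalized_host("not a url")
# => nil -- hosts with whitespace or an invalid public suffix are rejected
```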
data/lib/web_helper.rb ADDED
@@ -0,0 +1,29 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "typhoeus"
4
+
5
+ module LameSitemapper
6
+ class WebHelper
7
+ def self.get_http_response(url, method=:get)
8
+ response = Typhoeus.send(method, url.to_s, SETTINGS[:web_settings])
9
+ return if response.nil?
10
+
11
+ if response.timed_out?
12
+ LOGGER.warn "resource at #{url} timed-out"
13
+ return
14
+ end
15
+
16
+ unless response.success?
17
+ LOGGER.warn "resource at #{url} returned error code #{response.code}"
18
+ return
19
+ end
20
+
21
+ if response.body.nil?
22
+ LOGGER.warn "resource at #{url} returned empty body"
23
+ return
24
+ end
25
+
26
+ response
27
+ end
28
+ end
29
+ end
metadata ADDED
@@ -0,0 +1,271 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: lame-sitemapper
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Orest Kulik
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-08-26 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: typhoeus
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '0.6'
20
+ - - ">="
21
+ - !ruby/object:Gem::Version
22
+ version: 0.6.8
23
+ type: :runtime
24
+ prerelease: false
25
+ version_requirements: !ruby/object:Gem::Requirement
26
+ requirements:
27
+ - - "~>"
28
+ - !ruby/object:Gem::Version
29
+ version: '0.6'
30
+ - - ">="
31
+ - !ruby/object:Gem::Version
32
+ version: 0.6.8
33
+ - !ruby/object:Gem::Dependency
34
+ name: nokogiri
35
+ requirement: !ruby/object:Gem::Requirement
36
+ requirements:
37
+ - - "~>"
38
+ - !ruby/object:Gem::Version
39
+ version: '1.6'
40
+ - - ">="
41
+ - !ruby/object:Gem::Version
42
+ version: 1.6.1
43
+ type: :runtime
44
+ prerelease: false
45
+ version_requirements: !ruby/object:Gem::Requirement
46
+ requirements:
47
+ - - "~>"
48
+ - !ruby/object:Gem::Version
49
+ version: '1.6'
50
+ - - ">="
51
+ - !ruby/object:Gem::Version
52
+ version: 1.6.1
53
+ - !ruby/object:Gem::Dependency
54
+ name: webrobots
55
+ requirement: !ruby/object:Gem::Requirement
56
+ requirements:
57
+ - - "~>"
58
+ - !ruby/object:Gem::Version
59
+ version: '0.1'
60
+ - - ">="
61
+ - !ruby/object:Gem::Version
62
+ version: 0.1.1
63
+ type: :runtime
64
+ prerelease: false
65
+ version_requirements: !ruby/object:Gem::Requirement
66
+ requirements:
67
+ - - "~>"
68
+ - !ruby/object:Gem::Version
69
+ version: '0.1'
70
+ - - ">="
71
+ - !ruby/object:Gem::Version
72
+ version: 0.1.1
73
+ - !ruby/object:Gem::Dependency
74
+ name: addressable
75
+ requirement: !ruby/object:Gem::Requirement
76
+ requirements:
77
+ - - "~>"
78
+ - !ruby/object:Gem::Version
79
+ version: '2.3'
80
+ - - ">="
81
+ - !ruby/object:Gem::Version
82
+ version: 2.3.6
83
+ type: :runtime
84
+ prerelease: false
85
+ version_requirements: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - "~>"
88
+ - !ruby/object:Gem::Version
89
+ version: '2.3'
90
+ - - ">="
91
+ - !ruby/object:Gem::Version
92
+ version: 2.3.6
93
+ - !ruby/object:Gem::Dependency
94
+ name: public_suffix
95
+ requirement: !ruby/object:Gem::Requirement
96
+ requirements:
97
+ - - "~>"
98
+ - !ruby/object:Gem::Version
99
+ version: '1.4'
100
+ - - ">="
101
+ - !ruby/object:Gem::Version
102
+ version: 1.4.2
103
+ type: :runtime
104
+ prerelease: false
105
+ version_requirements: !ruby/object:Gem::Requirement
106
+ requirements:
107
+ - - "~>"
108
+ - !ruby/object:Gem::Version
109
+ version: '1.4'
110
+ - - ">="
111
+ - !ruby/object:Gem::Version
112
+ version: 1.4.2
113
+ - !ruby/object:Gem::Dependency
114
+ name: digest-murmurhash
115
+ requirement: !ruby/object:Gem::Requirement
116
+ requirements:
117
+ - - ">="
118
+ - !ruby/object:Gem::Version
119
+ version: 0.3.0
120
+ - - "~>"
121
+ - !ruby/object:Gem::Version
122
+ version: '0.3'
123
+ type: :runtime
124
+ prerelease: false
125
+ version_requirements: !ruby/object:Gem::Requirement
126
+ requirements:
127
+ - - ">="
128
+ - !ruby/object:Gem::Version
129
+ version: 0.3.0
130
+ - - "~>"
131
+ - !ruby/object:Gem::Version
132
+ version: '0.3'
133
+ - !ruby/object:Gem::Dependency
134
+ name: graphviz
135
+ requirement: !ruby/object:Gem::Requirement
136
+ requirements:
137
+ - - ">="
138
+ - !ruby/object:Gem::Version
139
+ version: 0.4.0
140
+ - - "~>"
141
+ - !ruby/object:Gem::Version
142
+ version: '0.4'
143
+ type: :runtime
144
+ prerelease: false
145
+ version_requirements: !ruby/object:Gem::Requirement
146
+ requirements:
147
+ - - ">="
148
+ - !ruby/object:Gem::Version
149
+ version: 0.4.0
150
+ - - "~>"
151
+ - !ruby/object:Gem::Version
152
+ version: '0.4'
153
+ - !ruby/object:Gem::Dependency
154
+ name: activesupport
155
+ requirement: !ruby/object:Gem::Requirement
156
+ requirements:
157
+ - - "~>"
158
+ - !ruby/object:Gem::Version
159
+ version: '6.0'
160
+ - - ">="
161
+ - !ruby/object:Gem::Version
162
+ version: 6.0.3.2
163
+ type: :runtime
164
+ prerelease: false
165
+ version_requirements: !ruby/object:Gem::Requirement
166
+ requirements:
167
+ - - "~>"
168
+ - !ruby/object:Gem::Version
169
+ version: '6.0'
170
+ - - ">="
171
+ - !ruby/object:Gem::Version
172
+ version: 6.0.3.2
173
+ - !ruby/object:Gem::Dependency
174
+ name: pry
175
+ requirement: !ruby/object:Gem::Requirement
176
+ requirements:
177
+ - - ">="
178
+ - !ruby/object:Gem::Version
179
+ version: '0'
180
+ type: :development
181
+ prerelease: false
182
+ version_requirements: !ruby/object:Gem::Requirement
183
+ requirements:
184
+ - - ">="
185
+ - !ruby/object:Gem::Version
186
+ version: '0'
187
+ - !ruby/object:Gem::Dependency
188
+ name: pry-doc
189
+ requirement: !ruby/object:Gem::Requirement
190
+ requirements:
191
+ - - ">="
192
+ - !ruby/object:Gem::Version
193
+ version: '0'
194
+ type: :development
195
+ prerelease: false
196
+ version_requirements: !ruby/object:Gem::Requirement
197
+ requirements:
198
+ - - ">="
199
+ - !ruby/object:Gem::Version
200
+ version: '0'
201
+ - !ruby/object:Gem::Dependency
202
+ name: pry-byebug
203
+ requirement: !ruby/object:Gem::Requirement
204
+ requirements:
205
+ - - ">="
206
+ - !ruby/object:Gem::Version
207
+ version: '0'
208
+ type: :development
209
+ prerelease: false
210
+ version_requirements: !ruby/object:Gem::Requirement
211
+ requirements:
212
+ - - ">="
213
+ - !ruby/object:Gem::Version
214
+ version: '0'
215
+ description: It starts from an arbitrary page you provide and descends into the tree
216
+ of links until it has either traversed all possible content on the web site or has
217
+ stopped at some predefined traversal depth. It is written in Ruby and implemented
218
+ as a CLI application. Based on user preference, it can output text reports in a
219
+ standard sitemap.xml form (used by many search engines), a dot file (for easier
220
+ site hierarchy visualization, graphviz compatible), a plain text file (displaying
221
+ detailed hierarchical relations between pages) and a simple HTML format.
222
+ email:
223
+ - orest@nisdom.com
224
+ executables:
225
+ - lame-sitemapper
226
+ extensions: []
227
+ extra_rdoc_files: []
228
+ files:
229
+ - ".gitignore"
230
+ - ".rspec"
231
+ - Gemfile
232
+ - LICENSE
233
+ - README.md
234
+ - bin/console
235
+ - bin/setup
236
+ - exe/lame-sitemapper
237
+ - lame-sitemapper.gemspec
238
+ - lib/cli.rb
239
+ - lib/core.rb
240
+ - lib/lame_sitemapper.rb
241
+ - lib/lame_sitemapper/version.rb
242
+ - lib/page.rb
243
+ - lib/report_generator.rb
244
+ - lib/scraper.rb
245
+ - lib/settings.yml
246
+ - lib/url_helper.rb
247
+ - lib/web_helper.rb
248
+ homepage:
249
+ licenses: []
250
+ metadata:
251
+ source_code_uri: https://github.com/okulik/lame-sitemapper
252
+ post_install_message:
253
+ rdoc_options: []
254
+ require_paths:
255
+ - lib
256
+ required_ruby_version: !ruby/object:Gem::Requirement
257
+ requirements:
258
+ - - ">="
259
+ - !ruby/object:Gem::Version
260
+ version: 2.3.0
261
+ required_rubygems_version: !ruby/object:Gem::Requirement
262
+ requirements:
263
+ - - ">="
264
+ - !ruby/object:Gem::Version
265
+ version: '0'
266
+ requirements: []
267
+ rubygems_version: 3.0.3
268
+ signing_key:
269
+ specification_version: 4
270
+ summary: A tool for a simple, static web pages hierarchy exploration.
271
+ test_files: []