probot 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 65dd2e16a696a2500937d32ae744262f74be414620e5505838fde49b8847d34d
+   data.tar.gz: 5f50733228c4c4eec37218eda00ceb4c2dd7eca801df6dd7345dd9e7edc94516
+ SHA512:
+   metadata.gz: 33a44b9aba61643781e697e1b0fd54ac8d1afb40a61b1e2dc13d174aeea1b4ec1e6c2b762d122a2e1f5e50b447881b28c57cedaaff9a05f1174ce0d531c7f605
+   data.tar.gz: 85615bf3573de8af826308f5bc36df231581de0da4346081a44632cd5882df8c7574ce9889f586dc7fdde2c881a6da13edf49b806bc9f76d0676b20cd516a789
data/.standard.yml ADDED
@@ -0,0 +1,3 @@
+ # For available configuration options, see:
+ # https://github.com/testdouble/standard
+ ruby_version: 3.0.0
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2023-09-09
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2023 Dan Milne
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,93 @@
+ # Probot
+
+ OMG, another Ruby robots.txt parser? It was an accident: I didn't mean to make it, and I shouldn't have, but here we are. It started out tiny and grew. Yes, I should have used one of the other gems.
+
+ Does this even deserve a gem? Feel free to just copy and paste the single file that implements it - one less dependency, eh?
+
+ On the plus side of this yak shaving, there are some nice features I don't think the others have.
+
+ 1. Support for consecutive user agents making up a single record:
+
+ ```txt
+ User-agent: first-agent
+ User-agent: second-agent
+ Disallow: /
+ ```
+
+ This record blocks both first-agent and second-agent from the site (see the example just after this list).
+
+ 2. It selects the most specific allow / disallow rule, using rule length as a proxy for specificity. You can also ask it to show you the matching rules and their scores.
+
+ ```ruby
+ txt = %Q{
+ User-agent: *
+ Disallow: /dir1
+ Allow: /dir1/dir2
+ Disallow: /dir1/dir2/dir3
+ }
+ Probot.new(txt).matches("/dir1/dir2/dir3")
+ => {:disallowed=>{/\/dir1/=>5, /\/dir1\/dir2\/dir3/=>15}, :allowed=>{/\/dir1\/dir2/=>10}}
+ ```
+
+ In this case, the Disallow rule with length 15 is the most specific, so that's the rule that is followed.
+
+ 3. It sets the User-Agent string when fetching robots.txt.
+
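To illustrate points 1 and 2 together, here is a short sketch (the agent names and paths are invented for the example; expected results are shown as comments):

```ruby
require "probot"

# One record naming two agents: the Disallow applies to both of them,
# while any other agent falls back to the (empty) "*" record.
robots = %Q{
User-agent: first-agent
User-agent: second-agent
Disallow: /private
}

Probot.new(robots, agent: "first-agent").allowed?("/private")   # => false
Probot.new(robots, agent: "second-agent").allowed?("/private")  # => false
Probot.new(robots, agent: "other-agent").allowed?("/private")   # => true

# matching_rule reports the winning (most specific) rule for a URL.
txt = %Q{
User-agent: *
Disallow: /dir1
Allow: /dir1/dir2
Disallow: /dir1/dir2/dir3
}
Probot.new(txt).matching_rule("/dir1/dir2/dir3")  # => {:disallow=>/\/dir1\/dir2\/dir3/}
Probot.new(txt).allowed?("/dir1/dir2")            # => true
```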
+ ## Installation
+
+ Install the gem and add to the application's Gemfile by executing:
+
+ $ bundle add probot
+
+ If bundler is not being used to manage dependencies, install the gem by executing:
+
+ $ gem install probot
+
+ ## Usage
+
+ It's straightforward to use. Instantiate it if you'll make a few requests:
+
+ ```ruby
+ > r = Probot.new('https://booko.info', agent: 'BookScraper')
+ > r.rules
+ => {"*"=>{"disallow"=>[/\/search/, /\/products\/search/, /\/.*\/refresh_prices/, /\/.*\/add_to_cart/, /\/.*\/get_prices/, /\/lists\/add/, /\/.*\/add$/, /\/api\//, /\/users\/bits/, /\/users\/create/, /\/prices\//, /\/widgets\/issue/], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>0.1},
+ "YandexBot"=>{"disallow"=>[], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>300.0}}
+
+ > r.allowed?("/abc/refresh_prices")
+ => false
+ > r.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+ => true
+ > r.allowed?("https://booko.info/9780765397522/refresh_prices")
+ => false
+ ```
+
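If you want to see why a URL was allowed or blocked, `matching_rule` and `matches` return the winning rule and every candidate rule with its specificity score, and other directives can be read straight out of `rules`. A quick sketch, assuming the booko.info rules shown above:

```ruby
> r.matching_rule("/abc/refresh_prices")
=> {:disallow=>/\/.*\/refresh_prices/}
> r.matches("/abc/refresh_prices")
=> {:disallowed=>{/\/.*\/refresh_prices/=>18}, :allowed=>{}}

> r.found_agents
=> ["*", "YandexBot"]
> r.rules.dig("YandexBot", "crawl-delay")
=> 300.0
```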
+ Or just one-shot it for one-offs:
+
+ ```ruby
+ Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red", agent: "BookScraper")
+ ```
+
+ Note that the class-level `Probot.allowed?` fetches and parses robots.txt on every call, so instantiate `Probot` when checking more than a couple of URLs.
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/dkam/probot.
+
+ ## Further Reading
+
+ * https://moz.com/learn/seo/robotstxt
+ * https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+ * https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+ * https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+ * https://github.com/google/robotstxt - Google's official parser
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,14 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rake/testtask"
+
+ Rake::TestTask.new(:test) do |t|
+   t.libs << "test"
+   t.libs << "lib"
+   t.test_files = FileList["test/**/test_*.rb"]
+ end
+
+ require "standard/rake"
+
+ task default: %i[test standard]
data/lib/probot/version.rb ADDED
@@ -0,0 +1,3 @@
+ class Probot
+   VERSION = "0.1.0"
+ end
data/lib/probot.rb ADDED
@@ -0,0 +1,157 @@
+ # frozen_string_literal: true
+
+ require "uri"
+ require "net/http"
+
+ # https://moz.com/learn/seo/robotstxt
+ # https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+ # https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+ # https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+ #
+ # https://github.com/google/robotstxt - Google's official parser
+
+ # Note: User-agent directives found on consecutive lines are considered to be part of the same record.
+ # Note: Google ignores crawl_delay.
+ # Note: Google does not consider crawl_delay or sitemap to be part of the per-agent records.
+
+ # Two main parts of this class:
+ #   Parse a robots.txt file.
+ #   Find the most specific rule for a given URL. We use the length of the regexp as a proxy for specificity.
+
+ class Probot
+   attr_reader :rules, :sitemap, :doc
+   attr_accessor :agent
+
+   def initialize(data, agent: "*")
+     raise ArgumentError, "The first argument must be a string" unless data.is_a?(String)
+     @agent = agent
+
+     @rules = {}
+     @current_agents = ["*"]
+     @current_agents.each { |agent| @rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+     @sitemaps = []
+
+     @doc = data.start_with?("http") ? fetch_robots_txt(data) : data
+     parse(@doc)
+   end
+
+   def request_headers = (agent == "*") ? {} : {"User-Agent" => @agent}
+
+   def fetch_robots_txt(url)
+     Net::HTTP.get(URI(url).tap { |u| u.path = "/robots.txt" }, request_headers)
+   rescue
+     ""
+   end
+
+   def crawl_delay = rules.dig(@agent, "crawl_delay")
+
+   def found_agents = rules.keys
+
+   def disallowed = rules.dig(@agent, "disallow") || rules.dig("*", "disallow")
+
+   def allowed = rules.dig(@agent, "allow") || rules.dig("*", "allow")
+
+   def disallowed_matches(url) = disallowed.select { |disallowed_url| url.match?(disallowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+   def allowed_matches(url) = allowed.select { |allowed_url| url.match?(allowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+   def matches(url) = {disallowed: disallowed_matches(url), allowed: allowed_matches(url)}
+
+   def disallowed_best(url) = disallowed_matches(url).max_by { |k, v| v }
+
+   def allowed_best(url) = allowed_matches(url).max_by { |k, v| v }
+
+   def matching_rule(url) = (disallowed_best(url)&.last.to_i > allowed_best(url)&.last.to_i) ? {disallow: disallowed_best(url)&.first} : {allow: allowed_best(url)&.first}
+
+   # If a URL is not disallowed, it is allowed - so we check whether it is explicitly disallowed and, if not, it's allowed.
+   def allowed?(url) = !disallowed?(url)
+
+   def disallowed?(url) = matching_rule(url)&.keys&.first == :disallow
+
+   def parse(doc)
+     # We need to handle consecutive user-agent lines, which are considered to be part of the same record.
+     subsequent_agent = false
+
+     doc.lines.each do |line|
+       next if line.start_with?("#") || !line.include?(":") || line.split(":").length < 2
+
+       data = ParsedLine.new(line)
+
+       if data.agent?
+         if subsequent_agent
+           @current_agents << data.value
+         else
+           @current_agents = [data.value]
+           subsequent_agent = true
+         end
+
+         @current_agents.each { |agent| rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+         next
+       end
+
+       # All Regexp characters are escaped, then we unescape * and $ as they may be used in robots.txt patterns.
+
+       if data.allow? || data.disallow?
+         @current_agents.each { |agent| rules[agent][data.key] << Regexp.new(Regexp.escape(data.value).gsub('\*', ".*").gsub('\$', "$")) }
+
+         subsequent_agent = false # When user-agent strings are found on consecutive lines, they are considered to be part of the same record.
+         next
+       end
+
+       if data.crawl_delay?
+         @current_agents.each { |agent| rules[agent][data.key] = data.value }
+         next
+       end
+
+       if data.sitemap?
+         @sitemap = URI(data.value).path
+         next
+       end
+
+       @current_agents.each { |agent| rules[agent][data.key] = data.value }
+     end
+   end
+
+   def pattern_length(regexp) = regexp.source.gsub(/(\\[\*\$\.])/, "*").length
+
+   # ParsedLine note: in the case of 'Sitemap: https://example.com/sitemap.xml', raw_value needs to rejoin the URL after splitting on ":".
+
+   ParsedLine = Struct.new(:input_string) do
+     def key = input_string.split(":").first&.strip&.downcase
+
+     def raw_value = input_string.split(":").slice(1..)&.join(":")&.strip
+
+     def clean_value = raw_value.split("#").first&.strip
+
+     def agent? = key == "user-agent"
+
+     def disallow? = key == "disallow"
+
+     def allow? = key == "allow"
+
+     def crawl_delay? = key == "crawl-delay"
+
+     def sitemap? = key == "sitemap"
+
+     def value
+       return clean_value.to_f if crawl_delay?
+       return URI(clean_value).to_s if disallow? || allow?
+
+       raw_value
+     rescue URI::InvalidURIError
+       raw_value
+     end
+   end
+
+   def self.allowed?(url, agent: "*") = Probot.new(url, agent: agent).allowed?(url)
+ end
+
+ # Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+ # => true
+ # r = Probot.new('https://booko.info', agent: 'YandexBot')
+ # r = Probot.new('https://www.allenandunwin.com')
+ # Probot.new('https://www.amazon.com/').matches("/gp/wishlist/ipad-install/gcrnsts")
+ # => {:disallowed=>{/\/wishlist\//=>10, /\/gp\/wishlist\//=>13, /.*\/gcrnsts/=>10}, :allowed=>{/\/gp\/wishlist\/ipad\-install.*/=>28}}
+ #
+ # Test with:
+ # assert Probot.new(%Q{allow: /$\ndisallow: /}).matching_rule('https://example.com/page.htm') == {disallow: /\//}
data/probot.gemspec ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require_relative "lib/probot/version"
+
+ Gem::Specification.new do |spec|
+   spec.name = "probot"
+   spec.version = Probot::VERSION
+   spec.authors = ["Dan Milne"]
+   spec.email = ["d@nmilne.com"]
+
+   spec.summary = "A robots.txt parser."
+   spec.description = "A fully featured robots.txt parser."
+   spec.homepage = "http://github.com/dkam/probot"
+   spec.license = "MIT"
+   spec.required_ruby_version = ">= 3.0"
+
+   spec.metadata["homepage_uri"] = spec.homepage
+   spec.metadata["source_code_uri"] = "http://github.com/dkam/probot"
+   spec.metadata["changelog_uri"] = "http://github.com/dkam/probot/CHANGELOG.md"
+
+   # Specify which files should be added to the gem when it is released.
+   # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+   spec.files = Dir.chdir(__dir__) do
+     `git ls-files -z`.split("\x0").reject do |f|
+       (File.expand_path(f) == __FILE__) ||
+         f.start_with?(*%w[bin/ test/ spec/ features/ .git .circleci appveyor Gemfile])
+     end
+   end
+   spec.bindir = "exe"
+   spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   # Uncomment to register a new dependency of your gem
+   # spec.add_dependency "example-gem", "~> 1.0"
+
+   # For more information and examples about making a new gem, check out our
+   # guide at: https://bundler.io/guides/creating_gem.html
+ end
data/sig/probot.rbs ADDED
@@ -0,0 +1,4 @@
+ class Probot
+   VERSION: String
+   # See the RBS writing guide: https://github.com/ruby/rbs#guides
+ end
metadata ADDED
@@ -0,0 +1,55 @@
+ --- !ruby/object:Gem::Specification
+ name: probot
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Dan Milne
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2023-09-10 00:00:00.000000000 Z
+ dependencies: []
+ description: A fully featured robots.txt parser.
+ email:
+ - d@nmilne.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".standard.yml"
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - lib/probot.rb
+ - lib/probot/version.rb
+ - probot.gemspec
+ - sig/probot.rbs
+ homepage: http://github.com/dkam/probot
+ licenses:
+ - MIT
+ metadata:
+   homepage_uri: http://github.com/dkam/probot
+   source_code_uri: http://github.com/dkam/probot
+   changelog_uri: http://github.com/dkam/probot/CHANGELOG.md
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.19
+ signing_key:
+ specification_version: 4
+ summary: A robots.txt parser.
+ test_files: []