probot 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: 65dd2e16a696a2500937d32ae744262f74be414620e5505838fde49b8847d34d
+   data.tar.gz: 5f50733228c4c4eec37218eda00ceb4c2dd7eca801df6dd7345dd9e7edc94516
+ SHA512:
+   metadata.gz: 33a44b9aba61643781e697e1b0fd54ac8d1afb40a61b1e2dc13d174aeea1b4ec1e6c2b762d122a2e1f5e50b447881b28c57cedaaff9a05f1174ce0d531c7f605
+   data.tar.gz: 85615bf3573de8af826308f5bc36df231581de0da4346081a44632cd5882df8c7574ce9889f586dc7fdde2c881a6da13edf49b806bc9f76d0676b20cd516a789
data/.standard.yml ADDED
@@ -0,0 +1,3 @@
+ # For available configuration options, see:
+ # https://github.com/testdouble/standard
+ ruby_version: 3.0.0
data/CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## [Unreleased]
+
+ ## [0.1.0] - 2023-09-09
+
+ - Initial release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+ The MIT License (MIT)
+
+ Copyright (c) 2023 Dan Milne
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in
+ all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,93 @@
+ # Probot
+
+ OMG, another Ruby robots.txt parser? It was an accident: I didn't mean to make it, and I shouldn't have, but here we are. It started out tiny and grew. Yes, I should have used one of the other gems.
+
+ Does this even deserve a gem? Feel free to just copy and paste the single file which implements this - one less dependency, eh?
+
+ On the plus side of this yak shaving, there are some nice features I don't think the others have.
+
+ 1. Support for consecutive user agents making up a single record:
+
+ ```txt
+ User-agent: first-agent
+ User-agent: second-agent
+ Disallow: /
+ ```
+
+ This record blocks both first-agent and second-agent from the site (see the example after this list).
+
+ 2. It selects the most specific allow / disallow rule, using rule length as a proxy for specificity. You can also ask it to show you the matching rules and their scores.
+
+ ```ruby
+ txt = %Q{
+ User-agent: *
+ Disallow: /dir1
+ Allow: /dir1/dir2
+ Disallow: /dir1/dir2/dir3
+ }
+ Probot.new(txt).matches("/dir1/dir2/dir3")
+ => {:disallowed=>{/\/dir1/=>5, /\/dir1\/dir2\/dir3/=>15}, :allowed=>{/\/dir1\/dir2/=>10}}
+ ```
+
+ In this case, we can see that the Disallow rule with length 15 would be followed.
+
+ 3. It sets the User-Agent string when fetching robots.txt.
+
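+ For example, here's a minimal sketch of how the first record above is parsed; the return values shown are illustrative:
+
+ ```ruby
+ require "probot"
+
+ txt = %Q{
+ User-agent: first-agent
+ User-agent: second-agent
+ Disallow: /
+ }
+
+ r = Probot.new(txt)
+ r.found_agents            # => ["*", "first-agent", "second-agent"]
+ r.agent = "second-agent"  # switch the agent queries apply to
+ r.allowed?("/any/path")   # => false - both agents share the Disallow: / rule
+ ```
+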
+ ## Installation
+
+ Install the gem and add it to the application's Gemfile by executing:
+
+     $ bundle add probot
+
+ If bundler is not being used to manage dependencies, install the gem by executing:
+
+     $ gem install probot
+
+ ## Usage
+
+ It's straightforward to use. Instantiate it if you'll make a few requests:
+
+ ```ruby
+ > r = Probot.new('https://booko.info', agent: 'BookScraper')
+ > r.rules
+ => {"*"=>{"disallow"=>[/\/search/, /\/products\/search/, /\/.*\/refresh_prices/, /\/.*\/add_to_cart/, /\/.*\/get_prices/, /\/lists\/add/, /\/.*\/add$/, /\/api\//, /\/users\/bits/, /\/users\/create/, /\/prices\//, /\/widgets\/issue/], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>0.1},
+  "YandexBot"=>{"disallow"=>[], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>300.0}}
+
+ > r.allowed?("/abc/refresh_prices")
+ => false
+ > r.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+ => true
+ > r.allowed?("https://booko.info/9780765397522/refresh_prices")
+ => false
+ ```
+
+ Or just one-shot it for one-offs:
+
+ ```ruby
+ Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red", agent: "BookScraper")
+ ```
+
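+ You can also ask which rule produced a decision. A rough sketch, continuing the example above; the exact patterns and scores depend on the site's live robots.txt:
+
+ ```ruby
+ > r.found_agents
+ => ["*", "YandexBot"]
+ > r.matching_rule("/search")
+ => {:disallow=>/\/search/}
+ > r.matching_rule("/some/other/page")
+ => {:allow=>nil}          # no rule matched, so the URL is allowed
+ > r.sitemap               # path of the site's Sitemap directive, if one was declared
+ > r.crawl_delay           # crawl delay recorded for the current agent, if any
+ ```
+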
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/dkam/probot.
+
+ ## Further Reading
+
+ * https://moz.com/learn/seo/robotstxt
+ * https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+ * https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+ * https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+ * https://github.com/google/robotstxt - Google's official parser
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,14 @@
+ # frozen_string_literal: true
+
+ require "bundler/gem_tasks"
+ require "rake/testtask"
+
+ Rake::TestTask.new(:test) do |t|
+   t.libs << "test"
+   t.libs << "lib"
+   t.test_files = FileList["test/**/test_*.rb"]
+ end
+
+ require "standard/rake"
+
+ task default: %i[test standard]
data/lib/probot/version.rb ADDED
@@ -0,0 +1,3 @@
+ class Probot
+   VERSION = "0.1.0"
+ end
data/lib/probot.rb ADDED
@@ -0,0 +1,157 @@
+ # frozen_string_literal: true
+
+ require "uri"
+ require "net/http"
+
+ # https://moz.com/learn/seo/robotstxt
+ # https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+ # https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+ # https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+ #
+ # https://github.com/google/robotstxt - Google's official parser
+
+ # Note: User-agent directives found on consecutive lines are considered to be part of the same record.
+ # Note: Google ignores crawl_delay.
+ # Note: Google does not consider crawl_delay or sitemap to be part of the per-agent records.
+
+ # Two main parts of this class:
+ #   Parse a robots.txt file.
+ #   Find the most specific rule for a given URL. We use the length of the regexp as a proxy for specificity.
+
+ class Probot
+   attr_reader :rules, :sitemap, :doc
+   attr_accessor :agent
+
+   def initialize(data, agent: "*")
+     raise ArgumentError, "The first argument must be a string" unless data.is_a?(String)
+     @agent = agent
+
+     @rules = {}
+     @current_agents = ["*"]
+     @current_agents.each { |agent| @rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+     @sitemaps = []
+
+     @doc = data.start_with?("http") ? fetch_robots_txt(data) : data
+     parse(@doc)
+   end
+
+   def request_headers = (agent == "*") ? {} : {"User-Agent" => @agent}
+
+   def fetch_robots_txt(url)
+     Net::HTTP.get(URI(url).tap { |u| u.path = "/robots.txt" }, request_headers)
+   rescue
+     ""
+   end
+
+   def crawl_delay = rules.dig(@agent, "crawl_delay")
+
+   def found_agents = rules.keys
+
+   def disallowed = rules.dig(@agent, "disallow") || rules.dig("*", "disallow")
+
+   def allowed = rules.dig(@agent, "allow") || rules.dig("*", "allow")
+
+   def disallowed_matches(url) = disallowed.select { |disallowed_url| url.match?(disallowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+   def allowed_matches(url) = allowed.select { |allowed_url| url.match?(allowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+   def matches(url) = {disallowed: disallowed_matches(url), allowed: allowed_matches(url)}
+
+   def disallowed_best(url) = disallowed_matches(url).max_by { |k, v| v }
+
+   def allowed_best(url) = allowed_matches(url).max_by { |k, v| v }
+
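+   # The most specific (longest) matching rule wins. On a tied score, the allow rule is preferred because the comparison below is strictly greater-than.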
+   def matching_rule(url) = (disallowed_best(url)&.last.to_i > allowed_best(url)&.last.to_i) ? {disallow: disallowed_best(url)&.first} : {allow: allowed_best(url)&.first}
+
+   # If a URL is not disallowed, it is allowed - so we check whether it is explicitly disallowed and, if not, it's allowed.
+   def allowed?(url) = !disallowed?(url)
+
+   def disallowed?(url) = matching_rule(url)&.keys&.first == :disallow
+
+   def parse(doc)
+     # We need to handle consecutive user-agent lines, which are considered to be part of the same record.
+     subsequent_agent = false
+
+     doc.lines.each do |line|
+       next if line.start_with?("#") || !line.include?(":") || line.split(":").length < 2
+
+       data = ParsedLine.new(line)
+
+       if data.agent?
+         if subsequent_agent
+           @current_agents << data.value
+         else
+           @current_agents = [data.value]
+           subsequent_agent = true
+         end
+
+         @current_agents.each { |agent| rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+         next
+       end
+
+       # All regex characters are escaped, then we unescape * and $ as they may be used in robots.txt.
+
+       if data.allow? || data.disallow?
+         @current_agents.each { |agent| rules[agent][data.key] << Regexp.new(Regexp.escape(data.value).gsub('\*', ".*").gsub('\$', "$")) }
+
+         subsequent_agent = false # When user-agent strings are found on consecutive lines, they are considered to be part of the same record. Google ignores crawl_delay.
+         next
+       end
+
+       if data.crawl_delay?
+         @current_agents.each { |agent| rules[agent][data.key] = data.value }
+         next
+       end
+
+       if data.sitemap?
+         @sitemap = URI(data.value).path
+         next
+       end
+
+       @current_agents.each { |agent| rules[agent][data.key] = data.value }
+     end
+   end
+
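+   # Score a rule by its effective length: escaped wildcard characters (\*, \$, \.) are collapsed back to a single character before counting, so longer patterns score as more specific.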
+   def pattern_length(regexp) = regexp.source.gsub(/(\\[\*\$\.])/, "*").length
+
+   # ParsedLine note: in the case of 'Sitemap: https://example.com/sitemap.xml', raw_value needs to rejoin the URL after splitting on ":".
+
+   ParsedLine = Struct.new(:input_string) do
+     def key = input_string.split(":").first&.strip&.downcase
+
+     def raw_value = input_string.split(":").slice(1..)&.join(":")&.strip
+
+     def clean_value = raw_value.split("#").first&.strip
+
+     def agent? = key == "user-agent"
+
+     def disallow? = key == "disallow"
+
+     def allow? = key == "allow"
+
+     def crawl_delay? = key == "crawl-delay"
+
+     def sitemap? = key == "sitemap"
+
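+     # Coerce the value by directive: crawl-delay becomes a Float, allow/disallow paths are normalised through URI, and anything else (or anything that fails URI parsing) is returned as the raw string.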
+     def value
+       return clean_value.to_f if crawl_delay?
+       return URI(clean_value).to_s if disallow? || allow?
+
+       raw_value
+     rescue URI::InvalidURIError
+       raw_value
+     end
+   end
+
+   def self.allowed?(url, agent: "*") = Probot.new(url, agent: agent).allowed?(url)
+ end
+
+ # Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+ # => true
+ # r = Probot.new('https://booko.info', agent: 'YandexBot')
+ # r = Probot.new('https://www.allenandunwin.com')
+ # Probot.new('https://www.amazon.com/').matches("/gp/wishlist/ipad-install/gcrnsts")
+ # => {:disallowed=>{/\/wishlist\//=>10, /\/gp\/wishlist\//=>13, /.*\/gcrnsts/=>10}, :allowed=>{/\/gp\/wishlist\/ipad\-install.*/=>28}}
+ #
+ # Test with:
+ # assert Probot.new(nil, doc: %Q{allow: /$\ndisallow: /}).matching_rule('https://example.com/page.htm') == {disallow: /\//}
data/probot.gemspec ADDED
@@ -0,0 +1,38 @@
+ # frozen_string_literal: true
+
+ require_relative "lib/probot/version"
+
+ Gem::Specification.new do |spec|
+   spec.name = "probot"
+   spec.version = Probot::VERSION
+   spec.authors = ["Dan Milne"]
+   spec.email = ["d@nmilne.com"]
+
+   spec.summary = "A robots.txt parser."
+   spec.description = "A fully featured robots.txt parser."
+   spec.homepage = "http://github.com/dkam/probot"
+   spec.license = "MIT"
+   spec.required_ruby_version = ">= 3.0"
+
+   spec.metadata["homepage_uri"] = spec.homepage
+   spec.metadata["source_code_uri"] = "http://github.com/dkam/probot"
+   spec.metadata["changelog_uri"] = "http://github.com/dkam/probot/CHANGELOG.md"
+
+   # Specify which files should be added to the gem when it is released.
+   # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+   spec.files = Dir.chdir(__dir__) do
+     `git ls-files -z`.split("\x0").reject do |f|
+       (File.expand_path(f) == __FILE__) ||
+         f.start_with?(*%w[bin/ test/ spec/ features/ .git .circleci appveyor Gemfile])
+     end
+   end
+   spec.bindir = "exe"
+   spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+   spec.require_paths = ["lib"]
+
+   # Uncomment to register a new dependency of your gem
+   # spec.add_dependency "example-gem", "~> 1.0"
+
+   # For more information and examples about making a new gem, check out our
+   # guide at: https://bundler.io/guides/creating_gem.html
+ end
data/sig/probot.rbs ADDED
@@ -0,0 +1,4 @@
+ module Probot
+   VERSION: String
+   # See the writing guide of rbs: https://github.com/ruby/rbs#guides
+ end
metadata ADDED
@@ -0,0 +1,55 @@
+ --- !ruby/object:Gem::Specification
+ name: probot
+ version: !ruby/object:Gem::Version
+   version: 0.1.0
+ platform: ruby
+ authors:
+ - Dan Milne
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2023-09-10 00:00:00.000000000 Z
+ dependencies: []
+ description: A fully featured robots.txt parser.
+ email:
+ - d@nmilne.com
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".standard.yml"
+ - CHANGELOG.md
+ - LICENSE.txt
+ - README.md
+ - Rakefile
+ - lib/probot.rb
+ - lib/probot/version.rb
+ - probot.gemspec
+ - sig/probot.rbs
+ homepage: http://github.com/dkam/probot
+ licenses:
+ - MIT
+ metadata:
+   homepage_uri: http://github.com/dkam/probot
+   source_code_uri: http://github.com/dkam/probot
+   changelog_uri: http://github.com/dkam/probot/CHANGELOG.md
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.0'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.19
+ signing_key:
+ specification_version: 4
+ summary: A robots.txt parser.
+ test_files: []