probot 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.standard.yml +3 -0
- data/CHANGELOG.md +5 -0
- data/LICENSE.txt +21 -0
- data/README.md +93 -0
- data/Rakefile +14 -0
- data/lib/probot/version.rb +3 -0
- data/lib/probot.rb +157 -0
- data/probot.gemspec +38 -0
- data/sig/probot.rbs +4 -0
- metadata +55 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 65dd2e16a696a2500937d32ae744262f74be414620e5505838fde49b8847d34d
+  data.tar.gz: 5f50733228c4c4eec37218eda00ceb4c2dd7eca801df6dd7345dd9e7edc94516
+SHA512:
+  metadata.gz: 33a44b9aba61643781e697e1b0fd54ac8d1afb40a61b1e2dc13d174aeea1b4ec1e6c2b762d122a2e1f5e50b447881b28c57cedaaff9a05f1174ce0d531c7f605
+  data.tar.gz: 85615bf3573de8af826308f5bc36df231581de0da4346081a44632cd5882df8c7574ce9889f586dc7fdde2c881a6da13edf49b806bc9f76d0676b20cd516a789
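These digests cover the `metadata.gz` and `data.tar.gz` members inside the released `.gem` archive (itself a plain tar file). A minimal verification sketch, assuming a locally downloaded copy named `probot-0.1.0.gem` that has been unpacked with `tar -xf probot-0.1.0.gem`; compare the printed values against the YAML above:

```ruby
require "digest"

# Print SHA256 digests of the two archive members extracted from the .gem,
# for comparison against the SHA256 entries in checksums.yaml.
%w[metadata.gz data.tar.gz].each do |name|
  puts "#{name}: #{Digest::SHA256.file(name).hexdigest}"
end
```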
data/.standard.yml
ADDED
data/CHANGELOG.md
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2023 Dan Milne
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,93 @@
+# Probot
+
+OMG another Ruby Robot.txt parser? It was an accident, I didn't mean to make it and I shouldn't have but here we are. It started out tiny and grew. Yes I should have used one of the other gems.
+
+Does this even deserve a gem? Feel free to just copy and paste the single file which implements this - one less dependency eh?
+
+On the plus side of this yak shaving, there are some nice features I don't think the others have.
+
+1. Support for consecutive user agents making up a single record:
+
+```txt
+User-agent: first-agent
+User-agent: second-agent
+Disallow: /
+```
+
+This record blocks both first-agent and second-agent from the site.
+
+2. It selects the most specific allow / disallow rule, using rule length as a proxy for specificity. You can also ask it to show you the matching rules and their scores.
+
+```ruby
+txt = %Q{
+User-agent: *
+Disallow: /dir1
+Allow: /dir1/dir2
+Disallow: /dir1/dir2/dir3
+}
+Probot.new(txt).matches("/dir1/dir2/dir3")
+=> {:disallowed=>{/\/dir1/=>5, /\/dir1\/dir2\/dir3/=>15}, :allowed=>{/\/dir1\/dir2/=>10}}
+```
+
+In this case, we can see the Disallow rule with length 15 would be followed.
+
+3. It sets the User-Agent string when fetching robots.txt
+
+## Installation
+
+Install the gem and add to the application's Gemfile by executing:
+
+    $ bundle add probot
+
+If bundler is not being used to manage dependencies, install the gem by executing:
+
+    $ gem install probot
+
+## Usage
+
+It's straightforward to use. Instantiate it if you'll make a few requests:
+
+```ruby
+> r = Probot.new('https://booko.info', agent: 'BookScraper')
+> r.rules
+=> {"*"=>{"disallow"=>[/\/search/, /\/products\/search/, /\/.*\/refresh_prices/, /\/.*\/add_to_cart/, /\/.*\/get_prices/, /\/lists\/add/, /\/.*\/add$/, /\/api\//, /\/users\/bits/, /\/users\/create/, /\/prices\//, /\/widgets\/issue/], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>0.1},
+ "YandexBot"=>{"disallow"=>[], "allow"=>[], "crawl_delay"=>0, "crawl-delay"=>300.0}}
+
+> r.allowed?("/abc/refresh_prices")
+=> false
+> r.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+=> true
+> r.allowed?("https://booko.info/9780765397522/refresh_prices")
+=> false
+```
+
+Or just one-shot it for one-offs:
+
+```ruby
+Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red", agent: "BookScraper")
+```
+
+
+## Development
+
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+## Contributing
+
+Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/Probot.
+
+## Further Reading
+
+* https://moz.com/learn/seo/robotstxt
+* https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+* https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+* https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+
+* https://github.com/google/robotstxt - Google's official parser
+
+
+## License
+
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
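As a small companion to the README above, a hedged sketch of the two headline features (consecutive user-agent records and longest-rule-wins matching), using only the `Probot.new`, `rules`, `matching_rule` and `allowed?` calls shown there; the robots.txt inputs are made up for illustration and the `# =>` values are indicative rather than captured output:

```ruby
require "probot"

# Consecutive User-agent lines form one record, so both agents get the rule.
txt = <<~ROBOTS
  User-agent: first-agent
  User-agent: second-agent
  Disallow: /private
ROBOTS

r = Probot.new(txt)
r.rules["first-agent"]["disallow"]   # => [/\/private/]
r.rules["second-agent"]["disallow"]  # => [/\/private/]

# The longest matching rule wins: the Allow for /dir1/dir2 (length 10)
# beats the Disallow for /dir1 (length 5).
r2 = Probot.new("User-agent: *\nDisallow: /dir1\nAllow: /dir1/dir2")
r2.matching_rule("/dir1/dir2/page")  # => {:allow=>/\/dir1\/dir2/}
r2.allowed?("/dir1/dir2/page")       # => true
```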
data/Rakefile
ADDED
@@ -0,0 +1,14 @@
+# frozen_string_literal: true
+
+require "bundler/gem_tasks"
+require "rake/testtask"
+
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.libs << "lib"
+  t.test_files = FileList["test/**/test_*.rb"]
+end
+
+require "standard/rake"
+
+task default: %i[test standard]
data/lib/probot.rb
ADDED
@@ -0,0 +1,157 @@
+# frozen_string_literal: true
+
+require "uri"
+require "net/http"
+
+# https://moz.com/learn/seo/robotstxt
+# https://stackoverflow.com/questions/45293419/order-of-directives-in-robots-txt-do-they-overwrite-each-other-or-complement-ea
+# https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt
+# https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
+#
+# https://github.com/google/robotstxt - Google's official parser
+
+# Note: User-agent directives found on consecutive lines are considered to be part of the same record.
+# Note: Google ignores crawl_delay
+# Note: Google does not consider crawl_delay or sitemap to be part of the per-agent records.
+
+# Two main parts of this class:
+#   Parse a robots.txt file
+#   Find the most specific rule for a given URL. We use the length of the regexp as a proxy for specificity.
+
+class Probot
+  attr_reader :rules, :sitemap, :doc
+  attr_accessor :agent
+
+  def initialize(data, agent: "*")
+    raise ArgumentError, "The first argument must be a string" unless data.is_a?(String)
+    @agent = agent
+
+    @rules = {}
+    @current_agents = ["*"]
+    @current_agents.each { |agent| @rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+    @sitemaps = []
+
+    @doc = data.start_with?("http") ? fetch_robots_txt(data) : data
+    parse(@doc)
+  end
+
+  def request_headers = (agent == "*") ? {} : {"User-Agent" => @agent}
+
+  def fetch_robots_txt(url)
+    Net::HTTP.get(URI(url).tap { |u| u.path = "/robots.txt" }, request_headers)
+  rescue
+    ""
+  end
+
+  def crawl_delay = rules.dig(@agent, "crawl_delay")
+
+  def found_agents = rules.keys
+
+  def disallowed = rules.dig(@agent, "disallow") || rules.dig("*", "disallow")
+
+  def allowed = rules.dig(@agent, "allow") || rules.dig("*", "allow")
+
+  def disallowed_matches(url) = disallowed.select { |disallowed_url| url.match?(disallowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+  def allowed_matches(url) = allowed.select { |allowed_url| url.match?(allowed_url) }.to_h { |rule| [rule, pattern_length(rule)] }
+
+  def matches(url) = {disallowed: disallowed_matches(url), allowed: allowed_matches(url)}
+
+  def disallowed_best(url) = disallowed_matches(url).max_by { |k, v| v }
+
+  def allowed_best(url) = allowed_matches(url).max_by { |k, v| v }
+
+  def matching_rule(url) = (disallowed_best(url)&.last.to_i > allowed_best(url)&.last.to_i) ? {disallow: disallowed_best(url)&.first} : {allow: allowed_best(url)&.first}
+
+  # If a URL is not disallowed, it is allowed - so we check if it is explicitly disallowed and if not, it's allowed.
+  def allowed?(url) = !disallowed?(url)
+
+  def disallowed?(url) = matching_rule(url)&.keys&.first == :disallow
+
+  def parse(doc)
+    # We need to handle consecutive user-agent lines, which are considered to be part of the same record.
+    subsequent_agent = false
+
+    doc.lines.each do |line|
+      next if line.start_with?("#") || !line.include?(":") || line.split(":").length < 2
+
+      data = ParsedLine.new(line)
+
+      if data.agent?
+        if subsequent_agent
+          @current_agents << data.value
+        else
+          @current_agents = [data.value]
+          subsequent_agent = true
+        end
+
+        @current_agents.each { |agent| rules[agent] ||= {"disallow" => [], "allow" => [], "crawl_delay" => 0} }
+        next
+      end
+
+      # All Regex characters are escaped, then we unescape * and $ as they may be used in robots.txt
+
+      if data.allow? || data.disallow?
+        @current_agents.each { |agent| rules[agent][data.key] << Regexp.new(Regexp.escape(data.value).gsub('\*', ".*").gsub('\$', "$")) }
+
+        subsequent_agent = false # When user-agent strings are found on consecutive lines, they are considered to be part of the same record. Google ignores crawl_delay.
+        next
+      end
+
+      if data.crawl_delay?
+        @current_agents.each { |agent| rules[agent][data.key] = data.value }
+        next
+      end
+
+      if data.sitemap?
+        @sitemap = URI(data.value).path
+        next
+      end
+
+      @current_agents.each { |agent| rules[agent][data.key] = data.value }
+    end
+  end
+
+  def pattern_length(regexp) = regexp.source.gsub(/(\\[\*\$\.])/, "*").length
+
+  # ParsedLine Note: In the case of 'Sitemap: https://example.com/sitemap.xml', raw_value needs to rejoin after splitting the URL.
+
+  ParsedLine = Struct.new(:input_string) do
+    def key = input_string.split(":").first&.strip&.downcase
+
+    def raw_value = input_string.split(":").slice(1..)&.join(":")&.strip
+
+    def clean_value = raw_value.split("#").first&.strip
+
+    def agent? = key == "user-agent"
+
+    def disallow? = key == "disallow"
+
+    def allow? = key == "allow"
+
+    def crawl_delay? = key == "crawl-delay"
+
+    def sitemap? = key == "sitemap"
+
+    def value
+      return clean_value.to_f if crawl_delay?
+      return URI(clean_value).to_s if disallow? || allow?
+
+      raw_value
+    rescue URI::InvalidURIError
+      raw_value
+    end
+  end
+
+  def self.allowed?(url, agent: "*") = Probot.new(url, agent: agent).allowed?(url)
+end
+
+# Probot.allowed?("https://booko.info/9780765397522/All-Systems-Red")
+# => true
+# r = Probot.new('https://booko.info', agent: 'YandexBot')
+# r = Probot.new('https://www.allenandunwin.com')
+# $ Probot.new('https://www.amazon.com/').matches("/gp/wishlist/ipad-install/gcrnsts")
+# => {:disallowed=>{/\/wishlist\//=>10, /\/gp\/wishlist\//=>13, /.*\/gcrnsts/=>10}, :allowed=>{/\/gp\/wishlist\/ipad\-install.*/=>28}}
+#
+# Test with
+# assert Probot.new(nil, doc: %Q{allow: /$\ndisallow: /}).matching_rule('https://example.com/page.htm') == {disallow: /\//}
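To make the parsing internals above a little more concrete, a short sketch exercising `ParsedLine` and `pattern_length` directly (both are public on the class as written); the inputs are made up and the commented results follow from the code above, as illustration rather than captured output:

```ruby
require "probot"

# ParsedLine splits one robots.txt line into key/value, strips trailing
# comments, and rejoins values containing ":" (e.g. Sitemap URLs).
sitemap = Probot::ParsedLine.new("Sitemap: https://example.com/sitemap.xml")
sitemap.key    # => "sitemap"
sitemap.value  # => "https://example.com/sitemap.xml"

rule = Probot::ParsedLine.new("Disallow: /search # internal search pages")
rule.disallow?  # => true
rule.value      # => "/search"

# Allow/Disallow values become regexps with * and $ restored, and
# pattern_length scores each rule by the length of its source.
r = Probot.new("User-agent: *\nDisallow: /*/add_to_cart")
r.rules["*"]["disallow"]                         # => [/\/.*\/add_to_cart/]
r.pattern_length(r.rules["*"]["disallow"].first) # => 15
```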
data/probot.gemspec
ADDED
@@ -0,0 +1,38 @@
+# frozen_string_literal: true
+
+require_relative "lib/probot/version"
+
+Gem::Specification.new do |spec|
+  spec.name = "probot"
+  spec.version = Probot::VERSION
+  spec.authors = ["Dan Milne"]
+  spec.email = ["d@nmilne.com"]
+
+  spec.summary = "A robots.txt parser."
+  spec.description = "A fully featured robots.txt parser."
+  spec.homepage = "http://github.com/dkam/probot"
+  spec.license = "MIT"
+  spec.required_ruby_version = ">= 3.0"
+
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = "http://github.com/dkam/probot"
+  spec.metadata["changelog_uri"] = "http://github.com/dkam/probot/CHANGELOG.md"
+
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files = Dir.chdir(__dir__) do
+    `git ls-files -z`.split("\x0").reject do |f|
+      (File.expand_path(f) == __FILE__) ||
+        f.start_with?(*%w[bin/ test/ spec/ features/ .git .circleci appveyor Gemfile])
+    end
+  end
+  spec.bindir = "exe"
+  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+
+  # Uncomment to register a new dependency of your gem
+  # spec.add_dependency "example-gem", "~> 1.0"
+
+  # For more information and examples about making a new gem, check out our
+  # guide at: https://bundler.io/guides/creating_gem.html
+end
data/sig/probot.rbs
ADDED
metadata
ADDED
@@ -0,0 +1,55 @@
+--- !ruby/object:Gem::Specification
+name: probot
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Dan Milne
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2023-09-10 00:00:00.000000000 Z
+dependencies: []
+description: A fully featured robots.txt parser.
+email:
+- d@nmilne.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".standard.yml"
+- CHANGELOG.md
+- LICENSE.txt
+- README.md
+- Rakefile
+- lib/probot.rb
+- lib/probot/version.rb
+- probot.gemspec
+- sig/probot.rbs
+homepage: http://github.com/dkam/probot
+licenses:
+- MIT
+metadata:
+  homepage_uri: http://github.com/dkam/probot
+  source_code_uri: http://github.com/dkam/probot
+  changelog_uri: http://github.com/dkam/probot/CHANGELOG.md
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.19
+signing_key:
+specification_version: 4
+summary: A robots.txt parser.
+test_files: []