legitbot 1.8.0 → 1.9.0
This diff shows the changes between publicly available package versions as released to their public registry. It is provided for informational purposes only.
- checksums.yaml +4 -4
- data/README.md +40 -34
- data/lib/legitbot/duckduckgo.rb +1 -1
- data/lib/legitbot/gptbot.rb +21 -0
- data/lib/legitbot/version.rb +1 -1
- data/lib/legitbot.rb +1 -0
- data/lib/rubocop/cop/custom/ip_ranges.rb +37 -11
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 76098bb34095ff5b37ed3732b6f00baa6e8491b813f61faf0d717c0c35018885
+  data.tar.gz: 5ed3f6c8d09d019685e9a5ff33844de03b3bbf31b1a5156fc4e88264dbcc5d08
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4dcd231f388e8134347db6c22286dfbf3e9155f5b79714c70f83fef8a1d2c2d3b990ea0b2d4f7beab4a59bfe0f1a326558d02b393340b628c55aebfd10cfdbb7
+  data.tar.gz: b66d66abe8eeace6c74bdfacab6cbe3dafa18c9c7c2325df313bc8a6842e81e510040dd1f7c1f0ff32c0692f75e976ff79089ac31a34125391d76d3bd2c2d8fc
data/README.md
CHANGED
@@ -11,8 +11,8 @@ Suppose you have a Web request and you would like to check it is not diguised:
 bot = Legitbot.bot(userAgent, ip)
 ```
 
-`bot` will be `nil` if no bot signature was found in the `User-Agent`.
-it will be an object with methods
+`bot` will be `nil` if no bot signature was found in the `User-Agent`.
+Otherwise, it will be an object with methods
 
 ```ruby
 bot.detected_as # => :google
@@ -29,9 +29,9 @@ Rack::Attack.blocklist("fake Googlebot") do |req|
 end
 ```
 
-Or if you do not like all those ghoulish crawlers stealing your
-
-
+Or if you do not like all those ghoulish crawlers stealing your content,
+evaluating it and getting ready to invade your site with spammers, then block
+them all:
 
 ```ruby
 Rack::Attack.blocklist 'fake search engines' do |request|
@@ -43,27 +43,31 @@ end
 
 [Semantic versioning](https://semver.org/) with the following clarifications:
 
-
-
+- MINOR version is incremented when support for new bots is added.
+- PATCH version is incremented when validation logic for a bot changes (IP list
+  updated, for example).
 
 ## Supported
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+- [Ahrefs](https://ahrefs.com/robot)
+- [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
+- [Amazon AdBot](https://adbot.amazon.com/index.html)
+- [Applebot](https://support.apple.com/en-us/HT204683)
+- [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
+- [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
+- [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
+- [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
+- [Google crawlers](https://support.google.com/webmasters/answer/1061943)
+- [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
+- [OpenAI GPTBot](https://platform.openai.com/docs/gptbot)
+- [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
+- [Petal search engine](http://aspiegel.com/petalbot)
+- [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
+- [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started),
+  the list of IPs is in the
+  [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
+- [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
+- [You.com](https://about.you.com/youbot/)
 
 ## License
 
@@ -71,16 +75,18 @@ Apache 2.0
 
 ## Other projects
 
-
-
-
+- Play Framework variant in Scala:
+  [play-legitbot](https://github.com/osinka/play-legitbot)
+- Article
+  [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
+- [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
   detects bots by `User-Agent`
-
-  middleware to detect crawlers by few different request headers, including
-
-
-  classify IP as a search engine, but also label them as suspicious
-  reports the number of days since the last activity. My implementation of
+- [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and
+  Rack middleware to detect crawlers by few different request headers, including
+  `User-Agent`
+- Project Honeypot's [http:BL](https://www.projecthoneypot.org/httpbl_api.php)
+  can not only classify IP as a search engine, but also label them as suspicious
+  and reports the number of days since the last activity. My implementation of
   the protocol in Scala is [here](https://github.com/osinka/httpbl).
-
-  to validate bots.
+- [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with
+  built-in support to validate bots.
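The lookup contract the README describes (`nil` when no bot signature matches the `User-Agent`, otherwise an object with query methods) can be sketched without the gem installed. `BotHit`, `SIGNATURES`, and the `bot` helper below are illustrative stand-ins, not legitbot's internals, and the real gem also validates the source IP rather than trusting the `User-Agent` alone:

```ruby
# Illustrative stand-in for the README's lookup contract: nil for no
# match, otherwise an object exposing detected_as / valid? / fake?.
class BotHit
  attr_reader :detected_as

  def initialize(name, valid)
    @detected_as = name
    @valid = valid
  end

  def valid?
    @valid
  end

  def fake?
    !@valid
  end
end

# Hypothetical signature table; the real gem derives this from its matchers.
SIGNATURES = { 'Googlebot' => :google, 'GPTBot' => :gptbot }.freeze

def bot(user_agent, _ip)
  SIGNATURES.each do |needle, name|
    return BotHit.new(name, true) if user_agent.include?(needle)
  end
  nil
end

bot('Mozilla/5.0 (compatible; Googlebot/2.1)', '66.249.66.1')&.detected_as # => :google
bot('curl/8.1.2', '203.0.113.9')                                           # => nil
```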
data/lib/legitbot/duckduckgo.rb
CHANGED
@@ -3,7 +3,7 @@
 module Legitbot # :nodoc:
   # https://duckduckgo.com/duckduckbot
   class DuckDuckGo < BotMatch
-    # @fetch:url https://
+    # @fetch:url https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
     # @fetch:selector section.main article.content ul > li
     ip_ranges %w[
       20.185.79.15
data/lib/legitbot/gptbot.rb
ADDED
@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+module Legitbot # :nodoc:
+  # https://platform.openai.com/docs/gptbot
+  class GPTBot < BotMatch
+    # @fetch:url https://openai.com/gptbot-ranges.txt
+    ip_ranges %w[
+      20.15.240.64/28
+      20.15.240.80/28
+      20.15.240.96/28
+      20.15.240.176/28
+      20.15.241.0/28
+      20.15.242.128/28
+      20.15.242.144/28
+      20.15.242.192/28
+      40.83.2.64/28
+    ]
+  end
+
+  rule Legitbot::GPTBot, %w[GPTBot]
+end
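Conceptually, a matcher like the new `GPTBot` class combines a `User-Agent` signature with the published CIDR list: the UA substring alone is never trusted. A stdlib-only sketch of that check — the ranges copy the list added above, while `GPTBOT_RANGES` and `valid_gptbot?` are our own illustrative names, not the gem's API:

```ruby
require 'ipaddr'

# The CIDR ranges added in this release, parsed once up front.
GPTBOT_RANGES = %w[
  20.15.240.64/28 20.15.240.80/28 20.15.240.96/28 20.15.240.176/28
  20.15.241.0/28 20.15.242.128/28 20.15.242.144/28 20.15.242.192/28
  40.83.2.64/28
].map { |cidr| IPAddr.new(cidr) }.freeze

# A User-Agent claiming to be GPTBot is genuine only if the client IP
# falls inside one of the published ranges.
def valid_gptbot?(user_agent, ip)
  return false unless user_agent.include?('GPTBot')

  addr = IPAddr.new(ip)
  GPTBOT_RANGES.any? { |range| range.include?(addr) }
end

valid_gptbot?('Mozilla/5.0; compatible; GPTBot/1.0', '20.15.240.70') # => true
valid_gptbot?('Mozilla/5.0; compatible; GPTBot/1.0', '203.0.113.9')  # => false
```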
data/lib/legitbot/version.rb
CHANGED
data/lib/legitbot.rb
CHANGED
@@ -12,6 +12,7 @@ require_relative 'legitbot/bing'
 require_relative 'legitbot/duckduckgo'
 require_relative 'legitbot/facebook'
 require_relative 'legitbot/google'
+require_relative 'legitbot/gptbot'
 require_relative 'legitbot/ias'
 require_relative 'legitbot/oracle'
 require_relative 'legitbot/petalbot'
data/lib/rubocop/cop/custom/ip_ranges.rb
CHANGED
@@ -25,8 +25,9 @@ module RuboCop
         params = fetch_params(node)
         return unless mandatory_params?(params)
 
-        existing_ips = read_node_ips
-        new_ips = fetch_ips(**params)
+        existing_ips = normalise_list(read_node_ips(value))
+        new_ips = normalise_list(fetch_ips(**params))
+        return unless new_ips
         return if existing_ips == new_ips
 
         register_offense(value, new_ips, **params)
@@ -36,20 +37,45 @@ module RuboCop
       private
 
       def fetch_ips(url:, selector: nil, jsonpath: nil)
+        body = get_url url
+        return unless body
+        return parse_html(body, selector) if selector
+        return parse_json(body, jsonpath) if jsonpath
+
+        parse_text(body)
+      end
+
+      def get_url(url)
         response = Net::HTTP.get_response URI(url)
+        unless response.is_a?(Net::HTTPOK)
+          add_global_offense "Could not fetch IPs from #{url} , HTTP status code #{response.code}"
+          return
+        end
+
         response.value
+        response.body
+      end
 
-
-
-
-
-
-
-
+      def parse_html(body, selector)
+        document = Nokogiri::HTML body
+        document.css(selector).map(&:content)
+      end
+
+      def parse_json(body, jsonpath)
+        document = JSON.parse body
+        JsonPath.new(jsonpath).on(document)
+      end
+
+      def parse_text(body)
+        body.lines.map(&:chomp)
       end
 
       def read_node_ips(value)
-        value.child_nodes.map(&:value)
+        value.child_nodes.map(&:value)
+      end
+
+      def normalise_list(ips)
+        ips.sort_by(&IPAddr.method(:new))
       end
 
       def register_offense(node, new_ips, **params)
@@ -60,7 +86,7 @@ module RuboCop
       end
 
       def mandatory_params?(params)
-        params.include?(:url)
+        params.include?(:url)
       end
 
       def fetch_params(node)
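The `normalise_list` step added above is what lets the cop compare the fetched list against the one already in the source order-insensitively: sorting by numeric address (via `IPAddr`) rather than by string makes two lists with the same members compare equal. A standalone sketch of just that helper, stdlib only, with the method name mirroring the diff:

```ruby
require 'ipaddr'

# Sort CIDR strings by their numeric address, as the cop now does before
# comparing existing and freshly fetched IP lists.
def normalise_list(ips)
  ips.sort_by(&IPAddr.method(:new))
end

normalise_list(%w[40.83.2.64/28 20.15.241.0/28 20.15.240.64/28])
# => ["20.15.240.64/28", "20.15.241.0/28", "40.83.2.64/28"]
```

Lexicographic sorting would misorder dotted quads (e.g. `9.x` after `80.x`), so sorting through `IPAddr`, which is `Comparable` on the numeric address, is the robust choice.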
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: legitbot
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.9.0
 platform: ruby
 authors:
 - Alexander Azarov
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-08-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast_interval_tree
@@ -81,6 +81,7 @@ files:
 - lib/legitbot/duckduckgo.rb
 - lib/legitbot/facebook.rb
 - lib/legitbot/google.rb
+- lib/legitbot/gptbot.rb
 - lib/legitbot/ias.rb
 - lib/legitbot/legitbot.rb
 - lib/legitbot/oracle.rb