legitbot 1.8.0 → 1.9.0

This diff shows the changes between publicly released versions of the package as they appear in the public registry, and is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a7cde94cf9e8a396867e4e97a490c1b4da0b300652e619ac200477a5e3aed1d5
- data.tar.gz: 1d505df51aa086231f85080fcf3a60291a660cfce3ce79bdeab0591ba9d27578
+ metadata.gz: 76098bb34095ff5b37ed3732b6f00baa6e8491b813f61faf0d717c0c35018885
+ data.tar.gz: 5ed3f6c8d09d019685e9a5ff33844de03b3bbf31b1a5156fc4e88264dbcc5d08
  SHA512:
- metadata.gz: 41f811fd8c20c9a442218e36a8e54e1e72731b8443c6141205b5f3b7accbdbaeb85491134d2ca3f4cb817d01dc774348c52dae705a7655d98345f975782b5c4a
- data.tar.gz: 77912f09be50c5d868099a6ecc2402b582d17039a82b9a57ee790724bf3b99123d7b5db071928f944c814afcc226818530b953c62f31d6f4a5d1df6bfdcefc54
+ metadata.gz: 4dcd231f388e8134347db6c22286dfbf3e9155f5b79714c70f83fef8a1d2c2d3b990ea0b2d4f7beab4a59bfe0f1a326558d02b393340b628c55aebfd10cfdbb7
+ data.tar.gz: b66d66abe8eeace6c74bdfacab6cbe3dafa18c9c7c2325df313bc8a6842e81e510040dd1f7c1f0ff32c0692f75e976ff79089ac31a34125391d76d3bd2c2d8fc
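Checksums like the SHA256 entries above can be reproduced locally with Ruby's stdlib `Digest`. A minimal sketch; the temp file here is a hypothetical stand-in, in practice you would run `gem fetch legitbot -v 1.9.0` and hash the packaged artifacts:

```ruby
require 'digest'
require 'tempfile'

# Hypothetical stand-in for a downloaded archive such as data.tar.gz.
file = Tempfile.new('data.tar.gz')
file.write('example archive bytes')
file.flush

# Hash the file on disk and compare against the published hex digest.
actual   = Digest::SHA256.file(file.path).hexdigest
expected = Digest::SHA256.hexdigest('example archive bytes')

puts actual == expected # prints true when the content is intact
puts actual.length      # a SHA256 hex digest is 64 characters, as above
file.close!
```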
data/README.md CHANGED
@@ -11,8 +11,8 @@ Suppose you have a Web request and you would like to check it is not disguised:
  bot = Legitbot.bot(userAgent, ip)
  ```

- `bot` will be `nil` if no bot signature was found in the `User-Agent`. Otherwise,
- it will be an object with methods
+ `bot` will be `nil` if no bot signature was found in the `User-Agent`.
+ Otherwise, it will be an object with methods

  ```ruby
  bot.detected_as # => :google
@@ -29,9 +29,9 @@ Rack::Attack.blocklist("fake Googlebot") do |req|
  end
  ```

- Or if you do not like all those ghoulish crawlers stealing your
- content, evaluating it and getting ready to invade your site with spammers,
- then block them all:
+ Or if you do not like all those ghoulish crawlers stealing your content,
+ evaluating it and getting ready to invade your site with spammers, then block
+ them all:

  ```ruby
  Rack::Attack.blocklist 'fake search engines' do |request|
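The blocklist above hinges on one check: a request whose `User-Agent` claims a known crawler but whose IP falls outside that crawler's published ranges is fake. A simplified stdlib-only sketch of that idea; the single range below is an illustrative subset, not the full list the gem maintains, and `fake_googlebot?` is a hypothetical helper, not the gem's API:

```ruby
require 'ipaddr'

# Illustrative subset of Googlebot's published IP space.
GOOGLEBOT_RANGES = [IPAddr.new('66.249.64.0/19')].freeze

# True when the UA claims Googlebot but the IP is outside the known ranges.
def fake_googlebot?(user_agent, ip)
  return false unless user_agent.include?('Googlebot')

  GOOGLEBOT_RANGES.none? { |range| range.include?(IPAddr.new(ip)) }
end

fake_googlebot?('Mozilla/5.0 (compatible; Googlebot/2.1)', '203.0.113.7') # => true
fake_googlebot?('Mozilla/5.0 (compatible; Googlebot/2.1)', '66.249.66.1') # => false
```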
@@ -43,27 +43,31 @@ end

  [Semantic versioning](https://semver.org/) with the following clarifications:

- * MINOR version is incremented when support for new bots is added.
- * PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).
+ - MINOR version is incremented when support for new bots is added.
+ - PATCH version is incremented when validation logic for a bot changes (IP list
+   updated, for example).

  ## Supported

- * [Ahrefs](https://ahrefs.com/robot)
- * [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
- * [Amazon AdBot](https://adbot.amazon.com/index.html)
- * [Applebot](https://support.apple.com/en-us/HT204683)
- * [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
- * [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
- * [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
- * [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
- * [Google crawlers](https://support.google.com/webmasters/answer/1061943)
- * [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
- * [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
- * [Petal search engine](http://aspiegel.com/petalbot)
- * [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
- * [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started), the list of IPs is in the [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
- * [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
- * [You.com](https://about.you.com/youbot/)
+ - [Ahrefs](https://ahrefs.com/robot)
+ - [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
+ - [Amazon AdBot](https://adbot.amazon.com/index.html)
+ - [Applebot](https://support.apple.com/en-us/HT204683)
+ - [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
+ - [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
+ - [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
+ - [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
+ - [Google crawlers](https://support.google.com/webmasters/answer/1061943)
+ - [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
+ - [OpenAI GPTBot](https://platform.openai.com/docs/gptbot)
+ - [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
+ - [Petal search engine](http://aspiegel.com/petalbot)
+ - [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
+ - [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started),
+   the list of IPs is in the
+   [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
+ - [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
+ - [You.com](https://about.you.com/youbot/)

  ## License
@@ -71,16 +75,18 @@ Apache 2.0

  ## Other projects

- * Play Framework variant in Scala: [play-legitbot](https://github.com/osinka/play-legitbot)
- * Article [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
- * [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
+ - Play Framework variant in Scala:
+   [play-legitbot](https://github.com/osinka/play-legitbot)
+ - Article
+   [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
+ - [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
    detects bots by `User-Agent`
- * [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and Rack
-   middleware to detect crawlers by few different request headers, including `User-Agent`
- * Project Honeypot's
-   [http:BL](https://www.projecthoneypot.org/httpbl_api.php) can not only
-   classify IP as a search engine, but also label them as suspicious and
-   reports the number of days since the last activity. My implementation of
+ - [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and
+   Rack middleware to detect crawlers by few different request headers, including
+   `User-Agent`
+ - Project Honeypot's [http:BL](https://www.projecthoneypot.org/httpbl_api.php)
+   can not only classify IP as a search engine, but also label them as suspicious
+   and reports the number of days since the last activity. My implementation of
    the protocol in Scala is [here](https://github.com/osinka/httpbl).
- * [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with built-in support
-   to validate bots.
+ - [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with
+   built-in support to validate bots.
data/lib/legitbot/duckduckgo.rb CHANGED
@@ -3,7 +3,7 @@
  module Legitbot # :nodoc:
    # https://duckduckgo.com/duckduckbot
    class DuckDuckGo < BotMatch
-     # @fetch:url https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
+     # @fetch:url https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
      # @fetch:selector section.main article.content ul > li
      ip_ranges %w[
        20.185.79.15
data/lib/legitbot/gptbot.rb ADDED
@@ -0,0 +1,21 @@
+ # frozen_string_literal: true
+
+ module Legitbot # :nodoc:
+   # https://platform.openai.com/docs/gptbot
+   class GPTBot < BotMatch
+     # @fetch:url https://openai.com/gptbot-ranges.txt
+     ip_ranges %w[
+       20.15.240.64/28
+       20.15.240.80/28
+       20.15.240.96/28
+       20.15.240.176/28
+       20.15.241.0/28
+       20.15.242.128/28
+       20.15.242.144/28
+       20.15.242.192/28
+       40.83.2.64/28
+     ]
+   end
+
+   rule Legitbot::GPTBot, %w[GPTBot]
+ end
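The `ip_ranges` entries in the new GPTBot matcher are CIDR blocks, and matching a visitor IP against them is plain stdlib `IPAddr` arithmetic, roughly what `BotMatch` does internally (the helper below is an illustrative sketch, not the gem's API):

```ruby
require 'ipaddr'

# A few of the CIDR blocks from the GPTBot matcher above.
GPTBOT_RANGES = %w[
  20.15.240.64/28
  20.15.240.80/28
  40.83.2.64/28
].map { |cidr| IPAddr.new(cidr) }

# True when the address falls inside any published GPTBot range.
def gptbot_ip?(ip)
  addr = IPAddr.new(ip)
  GPTBOT_RANGES.any? { |range| range.include?(addr) }
end

gptbot_ip?('20.15.240.70') # => true  (inside 20.15.240.64/28)
gptbot_ip?('198.51.100.1') # => false
```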
data/lib/legitbot/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Legitbot
-   VERSION = '1.8.0'
+   VERSION = '1.9.0'
  end
data/lib/legitbot.rb CHANGED
@@ -12,6 +12,7 @@ require_relative 'legitbot/bing'
  require_relative 'legitbot/duckduckgo'
  require_relative 'legitbot/facebook'
  require_relative 'legitbot/google'
+ require_relative 'legitbot/gptbot'
  require_relative 'legitbot/ias'
  require_relative 'legitbot/oracle'
  require_relative 'legitbot/petalbot'
@@ -25,8 +25,9 @@ module RuboCop
  params = fetch_params(node)
  return unless mandatory_params?(params)

- existing_ips = read_node_ips value
- new_ips = fetch_ips(**params)
+ existing_ips = normalise_list(read_node_ips(value))
+ new_ips = normalise_list(fetch_ips(**params))
+ return unless new_ips
  return if existing_ips == new_ips

  register_offense(value, new_ips, **params)
@@ -36,20 +37,45 @@ module RuboCop
  private

  def fetch_ips(url:, selector: nil, jsonpath: nil)
+   body = get_url url
+   return unless body
+   return parse_html(body, selector) if selector
+   return parse_json(body, jsonpath) if jsonpath
+
+   parse_text(body)
+ end
+
+ def get_url(url)
    response = Net::HTTP.get_response URI(url)
+   unless response.is_a?(Net::HTTPOK)
+     add_global_offense "Could not fetch IPs from #{url} , HTTP status code #{response.code}"
+     return
+   end
+
    response.value
+   response.body
+ end

-   if selector
-     document = Nokogiri::HTML response.body
-     document.css(selector).map(&:content).sort_by(&IPAddr.method(:new))
-   else
-     document = JSON.parse response.body
-     JsonPath.new(jsonpath).on(document).sort_by(&IPAddr.method(:new))
-   end
+ def parse_html(body, selector)
+   document = Nokogiri::HTML body
+   document.css(selector).map(&:content)
+ end
+
+ def parse_json(body, jsonpath)
+   document = JSON.parse body
+   JsonPath.new(jsonpath).on(document)
+ end
+
+ def parse_text(body)
+   body.lines.map(&:chomp)
  end

  def read_node_ips(value)
-   value.child_nodes.map(&:value).sort_by(&IPAddr.method(:new))
+   value.child_nodes.map(&:value)
+ end
+
+ def normalise_list(ips)
+   ips.sort_by(&IPAddr.method(:new))
  end

  def register_offense(node, new_ips, **params)
@@ -60,7 +86,7 @@ module RuboCop
  end

  def mandatory_params?(params)
-   params.include?(:url) && (params.include?(:selector) || params.include?(:jsonpath))
+   params.include?(:url)
  end

  def fetch_params(node)
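The `normalise_list` helper introduced in the cop refactor sorts both the fetched and the on-disk IP lists into a canonical order before comparing, so a reordered upstream list does not register as a change. The same idea in isolation, using only stdlib `IPAddr`:

```ruby
require 'ipaddr'

# Canonical ordering: sort CIDR strings by their numeric address value,
# mirroring the sort_by(&IPAddr.method(:new)) call in normalise_list.
normalise = ->(ips) { ips.sort_by(&IPAddr.method(:new)) }

fetched = ['20.15.242.128/28', '20.15.240.64/28', '40.83.2.64/28']
on_disk = ['40.83.2.64/28', '20.15.240.64/28', '20.15.242.128/28']

# Same entries in a different order compare equal after normalisation.
puts normalise.call(fetched) == normalise.call(on_disk) # prints true
```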
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: legitbot
  version: !ruby/object:Gem::Version
-   version: 1.8.0
+   version: 1.9.0
  platform: ruby
  authors:
  - Alexander Azarov
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-07-09 00:00:00.000000000 Z
+ date: 2023-08-08 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: fast_interval_tree
@@ -81,6 +81,7 @@ files:
  - lib/legitbot/duckduckgo.rb
  - lib/legitbot/facebook.rb
  - lib/legitbot/google.rb
+ - lib/legitbot/gptbot.rb
  - lib/legitbot/ias.rb
  - lib/legitbot/legitbot.rb
  - lib/legitbot/oracle.rb