legitbot 1.8.0 → 1.9.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: a7cde94cf9e8a396867e4e97a490c1b4da0b300652e619ac200477a5e3aed1d5
-   data.tar.gz: 1d505df51aa086231f85080fcf3a60291a660cfce3ce79bdeab0591ba9d27578
+   metadata.gz: bdc195bebd0678ef65ecedba5c34e0dbf36f94f12f6984ebe293c416d981211d
+   data.tar.gz: 04c60ecf2e0734438ab9b5e48740be6fde23b0242e9ad99e4108b38cc5963a82
  SHA512:
-   metadata.gz: 41f811fd8c20c9a442218e36a8e54e1e72731b8443c6141205b5f3b7accbdbaeb85491134d2ca3f4cb817d01dc774348c52dae705a7655d98345f975782b5c4a
-   data.tar.gz: 77912f09be50c5d868099a6ecc2402b582d17039a82b9a57ee790724bf3b99123d7b5db071928f944c814afcc226818530b953c62f31d6f4a5d1df6bfdcefc54
+   metadata.gz: ab30e1372e4f423917dd2eb6e1297ace9e44ecbc44ccd4cb10baf9b2f67ee51606e3133dc48b760a7b913036677d4cf21629f5a152a4a46a812996f7dc5b0ba5
+   data.tar.gz: '09655e429b32e4059d9aca5003c8ef317744136f3a8109cf34570a63eba4c8b56f745d96473b61668e72bf6af590bdd322036b23a2f1cca15655b2a10f8c45cd'
@@ -34,7 +34,7 @@ jobs:
  - name: Run linter
    run: bundle exec rubocop --auto-correct
  - name: Create Pull Request
-   uses: peter-evans/create-pull-request@v3
+   uses: peter-evans/create-pull-request@v5
    with:
      branch: update/lint-autocorrect
      delete-branch: true
data/README.md CHANGED
@@ -11,8 +11,8 @@ Suppose you have a Web request and you would like to check it is not diguised:
  bot = Legitbot.bot(userAgent, ip)
  ```

- `bot` will be `nil` if no bot signature was found in the `User-Agent`. Otherwise,
- it will be an object with methods
+ `bot` will be `nil` if no bot signature was found in the `User-Agent`.
+ Otherwise, it will be an object with methods

  ```ruby
  bot.detected_as # => :google
@@ -29,9 +29,9 @@ Rack::Attack.blocklist("fake Googlebot") do |req|
  end
  ```

- Or if you do not like all those ghoulish crawlers stealing your
- content, evaluating it and getting ready to invade your site with spammers,
- then block them all:
+ Or if you do not like all those ghoulish crawlers stealing your content,
+ evaluating it and getting ready to invade your site with spammers, then block
+ them all:

  ```ruby
  Rack::Attack.blocklist 'fake search engines' do |request|
@@ -43,27 +43,31 @@ end

  [Semantic versioning](https://semver.org/) with the following clarifications:

- * MINOR version is incremented when support for new bots is added.
- * PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).
+ - MINOR version is incremented when support for new bots is added.
+ - PATCH version is incremented when validation logic for a bot changes (IP list
+   updated, for example).

  ## Supported

- * [Ahrefs](https://ahrefs.com/robot)
- * [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
- * [Amazon AdBot](https://adbot.amazon.com/index.html)
- * [Applebot](https://support.apple.com/en-us/HT204683)
- * [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
- * [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
- * [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
- * [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
- * [Google crawlers](https://support.google.com/webmasters/answer/1061943)
- * [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
- * [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
- * [Petal search engine](http://aspiegel.com/petalbot)
- * [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
- * [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started), the list of IPs is in the [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
- * [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
- * [You.com](https://about.you.com/youbot/)
+ - [Ahrefs](https://ahrefs.com/robot)
+ - [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
+ - [Amazon AdBot](https://adbot.amazon.com/index.html)
+ - [Applebot](https://support.apple.com/en-us/HT204683)
+ - [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
+ - [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
+ - [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
+ - [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
+ - [Google crawlers](https://support.google.com/webmasters/answer/1061943)
+ - [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
+ - [OpenAI GPTBot](https://platform.openai.com/docs/gptbot)
+ - [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
+ - [Petal search engine](http://aspiegel.com/petalbot)
+ - [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
+ - [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started),
+   the list of IPs is in the
+   [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
+ - [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
+ - [You.com](https://about.you.com/youbot/)

  ## License

@@ -71,16 +75,18 @@ Apache 2.0

  ## Other projects

- * Play Framework variant in Scala: [play-legitbot](https://github.com/osinka/play-legitbot)
- * Article [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
- * [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
+ - Play Framework variant in Scala:
+   [play-legitbot](https://github.com/osinka/play-legitbot)
+ - Article
+   [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
+ - [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
    detects bots by `User-Agent`
- * [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and Rack
-   middleware to detect crawlers by few different request headers, including `User-Agent`
- * Project Honeypot's
-   [http:BL](https://www.projecthoneypot.org/httpbl_api.php) can not only
-   classify IP as a search engine, but also label them as suspicious and
-   reports the number of days since the last activity. My implementation of
+ - [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and
+   Rack middleware to detect crawlers by few different request headers, including
+   `User-Agent`
+ - Project Honeypot's [http:BL](https://www.projecthoneypot.org/httpbl_api.php)
+   can not only classify IP as a search engine, but also label them as suspicious
+   and reports the number of days since the last activity. My implementation of
    the protocol in Scala is [here](https://github.com/osinka/httpbl).
- * [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with built-in support
-   to validate bots.
+ - [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with
+   built-in support to validate bots.
@@ -3,7 +3,7 @@
  module Legitbot # :nodoc:
    # https://duckduckgo.com/duckduckbot
    class DuckDuckGo < BotMatch
-     # @fetch:url https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
+     # @fetch:url https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
      # @fetch:selector section.main article.content ul > li
      ip_ranges %w[
        20.185.79.15
@@ -0,0 +1,22 @@
+ # frozen_string_literal: true
+
+ module Legitbot # :nodoc:
+   # https://platform.openai.com/docs/gptbot
+   class GPTBot < BotMatch
+     # @fetch:url https://openai.com/gptbot-ranges.txt
+     ip_ranges %w[
+       20.9.164.0/24
+       20.15.240.64/28
+       20.15.240.80/28
+       20.15.240.96/28
+       20.15.240.176/28
+       20.15.241.0/28
+       20.15.242.128/28
+       20.15.242.144/28
+       20.15.242.192/28
+       40.83.2.64/28
+     ]
+   end
+
+   rule Legitbot::GPTBot, %w[GPTBot]
+ end
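The ranges added above are plain CIDR blocks, so membership can be checked with Ruby's standard `IPAddr`. The following is a minimal sketch only: `GPTBOT_RANGES` and `in_gptbot_ranges?` are hypothetical names that do not appear in legitbot, and the gem's own lookup may well work differently.

```ruby
require 'ipaddr'

# Ranges copied from the GPTBot hunk above.
# GPTBOT_RANGES is a hypothetical constant, not part of legitbot.
GPTBOT_RANGES = %w[
  20.9.164.0/24
  20.15.240.64/28
  20.15.240.80/28
  20.15.240.96/28
  20.15.240.176/28
  20.15.241.0/28
  20.15.242.128/28
  20.15.242.144/28
  20.15.242.192/28
  40.83.2.64/28
].map { |cidr| IPAddr.new(cidr) }.freeze

# Hypothetical helper: true when `ip` falls inside any listed range.
# Strings that are not valid addresses are treated as "not in range".
def in_gptbot_ranges?(ip)
  addr = IPAddr.new(ip)
  GPTBOT_RANGES.any? { |range| range.include?(addr) }
rescue IPAddr::InvalidAddressError
  false
end
```

A request whose `User-Agent` contains `GPTBot` but whose address falls outside every range would be the "disguised" case the README describes.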
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Legitbot
-   VERSION = '1.8.0'
+   VERSION = '1.9.1'
  end
data/lib/legitbot/you.rb CHANGED
@@ -5,6 +5,7 @@ module Legitbot # :nodoc:
    class You < BotMatch
      ip_ranges %w[
        20.59.40.22
+       52.226.199.170
      ]
    end

data/lib/legitbot.rb CHANGED
@@ -12,6 +12,7 @@ require_relative 'legitbot/bing'
  require_relative 'legitbot/duckduckgo'
  require_relative 'legitbot/facebook'
  require_relative 'legitbot/google'
+ require_relative 'legitbot/gptbot'
  require_relative 'legitbot/ias'
  require_relative 'legitbot/oracle'
  require_relative 'legitbot/petalbot'
@@ -25,8 +25,9 @@ module RuboCop
  params = fetch_params(node)
  return unless mandatory_params?(params)

- existing_ips = read_node_ips value
- new_ips = fetch_ips(**params)
+ existing_ips = normalise_list(read_node_ips(value))
+ new_ips = normalise_list(fetch_ips(**params))
+ return unless new_ips
  return if existing_ips == new_ips

  register_offense(value, new_ips, **params)
@@ -36,20 +37,45 @@ module RuboCop
  private

  def fetch_ips(url:, selector: nil, jsonpath: nil)
+   body = get_url url
+   return unless body
+   return parse_html(body, selector) if selector
+   return parse_json(body, jsonpath) if jsonpath
+
+   parse_text(body)
+ end
+
+ def get_url(url)
    response = Net::HTTP.get_response URI(url)
+   unless response.is_a?(Net::HTTPOK)
+     add_global_offense "Could not fetch IPs from #{url} , HTTP status code #{response.code}"
+     return
+   end
+
    response.value
+   response.body
+ end

-   if selector
-     document = Nokogiri::HTML response.body
-     document.css(selector).map(&:content).sort_by(&IPAddr.method(:new))
-   else
-     document = JSON.parse response.body
-     JsonPath.new(jsonpath).on(document).sort_by(&IPAddr.method(:new))
-   end
+ def parse_html(body, selector)
+   document = Nokogiri::HTML body
+   document.css(selector).map(&:content)
+ end
+
+ def parse_json(body, jsonpath)
+   document = JSON.parse body
+   JsonPath.new(jsonpath).on(document)
+ end
+
+ def parse_text(body)
+   body.lines.map(&:chomp)
  end

  def read_node_ips(value)
-   value.child_nodes.map(&:value).sort_by(&IPAddr.method(:new))
+   value.child_nodes.map(&:value)
+ end
+
+ def normalise_list(ips)
+   ips.sort_by(&IPAddr.method(:new))
  end

81
  def register_offense(node, new_ips, **params)
@@ -60,7 +86,7 @@ module RuboCop
  end

  def mandatory_params?(params)
-   params.include?(:url) && (params.include?(:selector) || params.include?(:jsonpath))
+   params.include?(:url)
  end

  def fetch_params(node)
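With `selector` and `jsonpath` no longer mandatory, a `@fetch:url` given on its own (such as the `gptbot-ranges.txt` URL in the new GPTBot class, presumably one range per line) falls through to `parse_text`. The sketch below copies the bodies of `parse_text` and `normalise_list` from the hunks above into standalone methods to show why the lists are ordered by `IPAddr` rather than as strings. `normalise_list_safe` is a hypothetical nil-tolerant variant that is not in the gem. It is included because, as the diff reads, `normalise_list` appears to receive `nil` when `fetch_ips` returns early, before the `return unless new_ips` guard gets a chance to run.

```ruby
require 'ipaddr'

# Same body as parse_text in the diff: one entry per line, newline stripped.
def parse_text(body)
  body.lines.map(&:chomp)
end

# Same body as normalise_list in the diff: order entries by numeric address.
def normalise_list(ips)
  ips.sort_by(&IPAddr.method(:new))
end

# Hypothetical variant (not in the gem): passes nil through untouched.
def normalise_list_safe(ips)
  ips&.sort_by(&IPAddr.method(:new))
end

# Sample body assumed for illustration; IPv4 only, since IPAddr does not
# order addresses of different families against each other.
body = "20.15.240.64/28\n20.9.164.0/24\n40.83.2.64/28\n"
fetched = normalise_list(parse_text(body))
```

A plain string sort would place `20.15.240.64/28` ahead of `20.9.164.0/24`, so comparing a string-sorted list with an address-sorted one could report a difference where none exists.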
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: legitbot
  version: !ruby/object:Gem::Version
-   version: 1.8.0
+   version: 1.9.1
  platform: ruby
  authors:
  - Alexander Azarov
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-07-09 00:00:00.000000000 Z
+ date: 2023-09-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: fast_interval_tree
@@ -81,6 +81,7 @@ files:
  - lib/legitbot/duckduckgo.rb
  - lib/legitbot/facebook.rb
  - lib/legitbot/google.rb
+ - lib/legitbot/gptbot.rb
  - lib/legitbot/ias.rb
  - lib/legitbot/legitbot.rb
  - lib/legitbot/oracle.rb