RubyGems - legitbot - Versions diffs - 1.8.0 → 1.9.1 - Mend

legitbot 1.8.0 → 1.9.1

Files changed (10)

checksums.yaml +4 -4
data/.github/workflows/autocorrect.yml +1 -1
data/README.md +40 -34
data/lib/legitbot/duckduckgo.rb +1 -1
data/lib/legitbot/gptbot.rb +22 -0
data/lib/legitbot/version.rb +1 -1
data/lib/legitbot/you.rb +1 -0
data/lib/legitbot.rb +1 -0
data/lib/rubocop/cop/custom/ip_ranges.rb +37 -11
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: a7cde94cf9e8a396867e4e97a490c1b4da0b300652e619ac200477a5e3aed1d5
-  data.tar.gz: 1d505df51aa086231f85080fcf3a60291a660cfce3ce79bdeab0591ba9d27578
+  metadata.gz: bdc195bebd0678ef65ecedba5c34e0dbf36f94f12f6984ebe293c416d981211d
+  data.tar.gz: 04c60ecf2e0734438ab9b5e48740be6fde23b0242e9ad99e4108b38cc5963a82
 SHA512:
-  metadata.gz: 41f811fd8c20c9a442218e36a8e54e1e72731b8443c6141205b5f3b7accbdbaeb85491134d2ca3f4cb817d01dc774348c52dae705a7655d98345f975782b5c4a
-  data.tar.gz: 77912f09be50c5d868099a6ecc2402b582d17039a82b9a57ee790724bf3b99123d7b5db071928f944c814afcc226818530b953c62f31d6f4a5d1df6bfdcefc54
+  metadata.gz: ab30e1372e4f423917dd2eb6e1297ace9e44ecbc44ccd4cb10baf9b2f67ee51606e3133dc48b760a7b913036677d4cf21629f5a152a4a46a812996f7dc5b0ba5
+  data.tar.gz: '09655e429b32e4059d9aca5003c8ef317744136f3a8109cf34570a63eba4c8b56f745d96473b61668e72bf6af590bdd322036b23a2f1cca15655b2a10f8c45cd'

data/.github/workflows/autocorrect.yml CHANGED

@@ -34,7 +34,7 @@ jobs:
     - name: Run linter
       run: bundle exec rubocop --auto-correct
     - name: Create Pull Request
-      uses: peter-evans/create-pull-request@v3
+      uses: peter-evans/create-pull-request@v5
       with:
         branch: update/lint-autocorrect
         delete-branch: true

data/README.md CHANGED

@@ -11,8 +11,8 @@ Suppose you have a Web request and you would like to check it is not diguised:
 bot = Legitbot.bot(userAgent, ip)
 ```
-`bot` will be `nil` if no bot signature was found in the `User-Agent`. Otherwise,
-it will be an object with methods
+`bot` will be `nil` if no bot signature was found in the `User-Agent`.
+Otherwise, it will be an object with methods
 ```ruby
 bot.detected_as # => :google
@@ -29,9 +29,9 @@ Rack::Attack.blocklist("fake Googlebot") do |req|
 end
 ```
-Or if you do not like all those ghoulish crawlers stealing your
-content, evaluating it and getting ready to invade your site with spammers,
-then block them all:
+Or if you do not like all those ghoulish crawlers stealing your content,
+evaluating it and getting ready to invade your site with spammers, then block
+them all:
 ```ruby
 Rack::Attack.blocklist 'fake search engines' do |request|
@@ -43,27 +43,31 @@ end
 [Semantic versioning](https://semver.org/) with the following clarifications:
-* MINOR version is incremented when support for new bots is added.
-* PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).
+- MINOR version is incremented when support for new bots is added.
+- PATCH version is incremented when validation logic for a bot changes (IP list
+  updated, for example).
 ## Supported
-* [Ahrefs](https://ahrefs.com/robot)
-* [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
-* [Amazon AdBot](https://adbot.amazon.com/index.html)
-* [Applebot](https://support.apple.com/en-us/HT204683)
-* [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
-* [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
-* [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
-* [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
-* [Google crawlers](https://support.google.com/webmasters/answer/1061943)
-* [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
-* [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
-* [Petal search engine](http://aspiegel.com/petalbot)
-* [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
-* [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started), the list of IPs is in the [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
-* [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
-* [You.com](https://about.you.com/youbot/)
+- [Ahrefs](https://ahrefs.com/robot)
+- [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
+- [Amazon AdBot](https://adbot.amazon.com/index.html)
+- [Applebot](https://support.apple.com/en-us/HT204683)
+- [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
+- [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
+- [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
+- [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
+- [Google crawlers](https://support.google.com/webmasters/answer/1061943)
+- [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
+- [OpenAI GPTBot](https://platform.openai.com/docs/gptbot)
+- [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
+- [Petal search engine](http://aspiegel.com/petalbot)
+- [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
+- [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started),
+  the list of IPs is in the
+  [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
+- [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
+- [You.com](https://about.you.com/youbot/)
 ## License
@@ -71,16 +75,18 @@ Apache 2.0
 ## Other projects
-* Play Framework variant in Scala: [play-legitbot](https://github.com/osinka/play-legitbot)
-* Article [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
-* [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
+- Play Framework variant in Scala:
+  [play-legitbot](https://github.com/osinka/play-legitbot)
+- Article
+  [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
+- [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
   detects bots by `User-Agent`
-* [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and Rack
-  middleware to detect crawlers by few different request headers, including `User-Agent`
-* Project Honeypot's
-  [http:BL](https://www.projecthoneypot.org/httpbl_api.php) can not only
-  classify IP as a search engine, but also label them as suspicious and
-  reports the number of days since the last activity. My implementation of
+- [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and
+  Rack middleware to detect crawlers by few different request headers, including
+  `User-Agent`
+- Project Honeypot's [http:BL](https://www.projecthoneypot.org/httpbl_api.php)
+  can not only classify IP as a search engine, but also label them as suspicious
+  and reports the number of days since the last activity. My implementation of
   the protocol in Scala is [here](https://github.com/osinka/httpbl).
-* [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with built-in support
-  to validate bots.
+- [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with
+  built-in support to validate bots.

data/lib/legitbot/duckduckgo.rb CHANGED

@@ -3,7 +3,7 @@
 module Legitbot # :nodoc:
   # https://duckduckgo.com/duckduckbot
   class DuckDuckGo < BotMatch
-    # @fetch:url https://help.duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
+    # @fetch:url https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
     # @fetch:selector section.main article.content ul > li
     ip_ranges %w[
       20.185.79.15

data/lib/legitbot/gptbot.rb ADDED

@@ -0,0 +1,22 @@
+# frozen_string_literal: true
+module Legitbot # :nodoc:
+  # https://platform.openai.com/docs/gptbot
+  class GPTBot < BotMatch
+    # @fetch:url https://openai.com/gptbot-ranges.txt
+    ip_ranges %w[
+      20.9.164.0/24
+      20.15.240.64/28
+      20.15.240.80/28
+      20.15.240.96/28
+      20.15.240.176/28
+      20.15.241.0/28
+      20.15.242.128/28
+      20.15.242.144/28
+      20.15.242.192/28
+      40.83.2.64/28
+    ]
+  end
+  rule Legitbot::GPTBot, %w[GPTBot]
+end
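
The new `GPTBot` class only declares a list of CIDR ranges; the matching itself happens inside legitbot and is not shown in this diff. As a rough illustration of what such a range check amounts to, here is a minimal sketch using only Ruby's standard `IPAddr`. The constant `GPTBOT_RANGES` and the helper `gptbot_ip?` are hypothetical names introduced for this example and are not part of the gem.

```ruby
require 'ipaddr'

# Ranges copied from the gptbot.rb hunk above.
# GPTBOT_RANGES is a hypothetical name, not a legitbot constant.
GPTBOT_RANGES = %w[
  20.9.164.0/24
  20.15.240.64/28
  20.15.240.80/28
  20.15.240.96/28
  20.15.240.176/28
  20.15.241.0/28
  20.15.242.128/28
  20.15.242.144/28
  20.15.242.192/28
  40.83.2.64/28
].map { |cidr| IPAddr.new(cidr) }.freeze

# Hypothetical helper: true when the address falls inside any listed range.
# Strings that are not valid IP addresses are treated as non-matching.
def gptbot_ip?(ip)
  addr = IPAddr.new(ip)
  GPTBOT_RANGES.any? { |range| range.include?(addr) }
rescue IPAddr::InvalidAddressError
  false
end

puts gptbot_ip?('20.9.164.17')
puts gptbot_ip?('8.8.8.8')
```

A linear scan is enough for ten ranges. The gem itself lists `fast_interval_tree` as a dependency, so its real lookup presumably differs from this sketch.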

data/lib/legitbot/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Legitbot
-  VERSION = '1.8.0'
+  VERSION = '1.9.1'
 end

data/lib/legitbot/you.rb CHANGED

@@ -5,6 +5,7 @@ module Legitbot # :nodoc:
   class You < BotMatch
     ip_ranges %w[
       20.59.40.22
+      52.226.199.170
     ]
   end

data/lib/legitbot.rb CHANGED

@@ -12,6 +12,7 @@ require_relative 'legitbot/bing'
 require_relative 'legitbot/duckduckgo'
 require_relative 'legitbot/facebook'
 require_relative 'legitbot/google'
+require_relative 'legitbot/gptbot'
 require_relative 'legitbot/ias'
 require_relative 'legitbot/oracle'
 require_relative 'legitbot/petalbot'

data/lib/rubocop/cop/custom/ip_ranges.rb CHANGED

@@ -25,8 +25,9 @@ module RuboCop
             params = fetch_params(node)
             return unless mandatory_params?(params)
-            existing_ips = read_node_ips value
-            new_ips = fetch_ips(**params)
+            existing_ips = normalise_list(read_node_ips(value))
+            new_ips = normalise_list(fetch_ips(**params))
+            return unless new_ips
             return if existing_ips == new_ips
             register_offense(value, new_ips, **params)
@@ -36,20 +37,45 @@ module RuboCop
         private
         def fetch_ips(url:, selector: nil, jsonpath: nil)
+          body = get_url url
+          return unless body
+          return parse_html(body, selector) if selector
+          return parse_json(body, jsonpath) if jsonpath
+          parse_text(body)
+        end
+        def get_url(url)
           response = Net::HTTP.get_response URI(url)
+          unless response.is_a?(Net::HTTPOK)
+            add_global_offense "Could not fetch IPs from #{url} , HTTP status code #{response.code}"
+            return
+          end
           response.value
+          response.body
+        end
-          if selector
-            document = Nokogiri::HTML response.body
-            document.css(selector).map(&:content).sort_by(&IPAddr.method(:new))
-          else
-            document = JSON.parse response.body
-            JsonPath.new(jsonpath).on(document).sort_by(&IPAddr.method(:new))
-          end
+        def parse_html(body, selector)
+          document = Nokogiri::HTML body
+          document.css(selector).map(&:content)
+        end
+        def parse_json(body, jsonpath)
+          document = JSON.parse body
+          JsonPath.new(jsonpath).on(document)
+        end
+        def parse_text(body)
+          body.lines.map(&:chomp)
         end
         def read_node_ips(value)
-          value.child_nodes.map(&:value).sort_by(&IPAddr.method(:new))
+          value.child_nodes.map(&:value)
+        end
+        def normalise_list(ips)
+          ips.sort_by(&IPAddr.method(:new))
         end
         def register_offense(node, new_ips, **params)
@@ -60,7 +86,7 @@ module RuboCop
         end
         def mandatory_params?(params)
-          params.include?(:url) && (params.include?(:selector) || params.include?(:jsonpath))
+          params.include?(:url)
         end
         def fetch_params(node)
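
The cop now accepts a bare `@fetch:url` (as used by `gptbot.rb`) and falls back to treating the response as plain text, one range per line. The sketch below replays that path outside RuboCop, with `parse_text` and `normalise_list` copied from the hunk above. The sample `body` string is made up for illustration and only stands in for a real HTTP response.

```ruby
require 'ipaddr'

# Copied from the diff: split a plain-text body into one entry per line.
def parse_text(body)
  body.lines.map(&:chomp)
end

# Copied from the diff: order entries by their parsed address so that
# two lists can be compared regardless of their original order.
def normalise_list(ips)
  ips.sort_by(&IPAddr.method(:new))
end

# Illustrative stand-in for a fetched response body (not real fetched data).
body = "40.83.2.64/28\n20.9.164.0/24\n20.15.240.64/28\n"

fetched  = normalise_list(parse_text(body))
existing = normalise_list(%w[20.9.164.0/24 20.15.240.64/28 40.83.2.64/28])

# The cop registers an offense only when the two normalised lists differ.
puts fetched == existing
```

One thing that stands out when reading the hunk: `normalise_list(fetch_ips(**params))` runs before `return unless new_ips`, so a `nil` from `fetch_ips` would apparently reach `sort_by` before the guard can take effect.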

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: legitbot
 version: !ruby/object:Gem::Version
-  version: 1.8.0
+  version: 1.9.1
 platform: ruby
 authors:
 - Alexander Azarov
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-07-09 00:00:00.000000000 Z
+date: 2023-09-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast_interval_tree
@@ -81,6 +81,7 @@ files:
 - lib/legitbot/duckduckgo.rb
 - lib/legitbot/facebook.rb
 - lib/legitbot/google.rb
+- lib/legitbot/gptbot.rb
 - lib/legitbot/ias.rb
 - lib/legitbot/legitbot.rb
 - lib/legitbot/oracle.rb