legitbot 1.8.0 → 1.9.0
This diff shows the changes between publicly available package versions as released to their public registry. It is provided for informational purposes only.
- checksums.yaml +4 -4
- data/README.md +40 -34
- data/lib/legitbot/duckduckgo.rb +1 -1
- data/lib/legitbot/gptbot.rb +21 -0
- data/lib/legitbot/version.rb +1 -1
- data/lib/legitbot.rb +1 -0
- data/lib/rubocop/cop/custom/ip_ranges.rb +37 -11
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 76098bb34095ff5b37ed3732b6f00baa6e8491b813f61faf0d717c0c35018885
+  data.tar.gz: 5ed3f6c8d09d019685e9a5ff33844de03b3bbf31b1a5156fc4e88264dbcc5d08
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4dcd231f388e8134347db6c22286dfbf3e9155f5b79714c70f83fef8a1d2c2d3b990ea0b2d4f7beab4a59bfe0f1a326558d02b393340b628c55aebfd10cfdbb7
+  data.tar.gz: b66d66abe8eeace6c74bdfacab6cbe3dafa18c9c7c2325df313bc8a6842e81e510040dd1f7c1f0ff32c0692f75e976ff79089ac31a34125391d76d3bd2c2d8fc
data/README.md
CHANGED
@@ -11,8 +11,8 @@ Suppose you have a Web request and you would like to check it is not diguised:
 bot = Legitbot.bot(userAgent, ip)
 ```
 
-`bot` will be `nil` if no bot signature was found in the `User-Agent`.
-it will be an object with methods
+`bot` will be `nil` if no bot signature was found in the `User-Agent`.
+Otherwise, it will be an object with methods
 
 ```ruby
 bot.detected_as # => :google
@@ -29,9 +29,9 @@ Rack::Attack.blocklist("fake Googlebot") do |req|
 end
 ```
 
-Or if you do not like all those ghoulish crawlers stealing your
-
-
+Or if you do not like all those ghoulish crawlers stealing your content,
+evaluating it and getting ready to invade your site with spammers, then block
+them all:
 
 ```ruby
 Rack::Attack.blocklist 'fake search engines' do |request|
@@ -43,27 +43,31 @@ end
 
 [Semantic versioning](https://semver.org/) with the following clarifications:
 
-
-
+- MINOR version is incremented when support for new bots is added.
+- PATCH version is incremented when validation logic for a bot changes (IP list
+  updated, for example).
 
 ## Supported
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+- [Ahrefs](https://ahrefs.com/robot)
+- [Alexa](https://support.alexa.com/hc/en-us/articles/360046707834-What-are-the-IP-addresses-for-Alexa-s-Certify-and-Site-Audit-crawlers-)
+- [Amazon AdBot](https://adbot.amazon.com/index.html)
+- [Applebot](https://support.apple.com/en-us/HT204683)
+- [Baidu spider](http://help.baidu.com/question?prod_en=master&class=498&id=1000973)
+- [Bingbot](https://blogs.bing.com/webmaster/2012/08/31/how-to-verify-that-bingbot-is-bingbot/)
+- [DuckDuckGo bot](https://duckduckgo.com/duckduckbot)
+- [Facebook crawler](https://developers.facebook.com/docs/sharing/webmasters/crawler)
+- [Google crawlers](https://support.google.com/webmasters/answer/1061943)
+- [IAS](https://integralads.com/ias-privacy-data-management/policies/site-indexing-policy/)
+- [OpenAI GPTBot](https://platform.openai.com/docs/gptbot)
+- [Oracle Data Cloud Crawler](https://www.oracle.com/corporate/acquisitions/grapeshot/crawler.html)
+- [Petal search engine](http://aspiegel.com/petalbot)
+- [Pinterest](https://help.pinterest.com/en/articles/about-pinterest-crawler-0)
+- [Twitterbot](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/getting-started),
+  the list of IPs is in the
+  [Troubleshooting page](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/guides/troubleshooting-cards)
+- [Yandex robots](https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml)
+- [You.com](https://about.you.com/youbot/)
 
 ## License
 
@@ -71,16 +75,18 @@ Apache 2.0
 
 ## Other projects
 
-
-
-
+- Play Framework variant in Scala:
+  [play-legitbot](https://github.com/osinka/play-legitbot)
+- Article
+  [When (Fake) Googlebots Attack Your Rails App](http://jessewolgamott.com/blog/2015/11/17/when-fake-googlebots-attack-your-rails-app/)
+- [Voight-Kampff](https://github.com/biola/Voight-Kampff) is a Ruby gem that
   detects bots by `User-Agent`
-
-  middleware to detect crawlers by few different request headers, including
-
-
-  classify IP as a search engine, but also label them as suspicious
-  reports the number of days since the last activity. My implementation of
+- [crawler_detect](https://github.com/loadkpi/crawler_detect) is a Ruby gem and
+  Rack middleware to detect crawlers by few different request headers, including
+  `User-Agent`
+- Project Honeypot's [http:BL](https://www.projecthoneypot.org/httpbl_api.php)
+  can not only classify IP as a search engine, but also label them as suspicious
+  and reports the number of days since the last activity. My implementation of
   the protocol in Scala is [here](https://github.com/osinka/httpbl).
-
-  to validate bots.
+- [CIDRAM](https://github.com/CIDRAM/CIDRAM) is a PHP routing manager with
+  built-in support to validate bots.
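The lookup contract the README describes (`nil` when no bot signature matches the `User-Agent`, otherwise an object with query methods) can be sketched without the gem installed. `BotHit`, `SIGNATURES`, and the `bot` helper below are illustrative stand-ins, not legitbot's internals, and the real gem also validates the source IP rather than trusting the `User-Agent` alone:

```ruby
# Illustrative stand-in for the README's lookup contract: nil for no
# match, otherwise an object exposing detected_as / valid? / fake?.
class BotHit
  attr_reader :detected_as

  def initialize(name, valid)
    @detected_as = name
    @valid = valid
  end

  def valid?
    @valid
  end

  def fake?
    !@valid
  end
end

# Hypothetical signature table; the real gem derives this from its matchers.
SIGNATURES = { 'Googlebot' => :google, 'GPTBot' => :gptbot }.freeze

def bot(user_agent, _ip)
  SIGNATURES.each do |needle, name|
    return BotHit.new(name, true) if user_agent.include?(needle)
  end
  nil
end

bot('Mozilla/5.0 (compatible; Googlebot/2.1)', '66.249.66.1')&.detected_as # => :google
bot('curl/8.1.2', '203.0.113.9')                                           # => nil
```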
data/lib/legitbot/duckduckgo.rb
CHANGED
@@ -3,7 +3,7 @@
 module Legitbot # :nodoc:
   # https://duckduckgo.com/duckduckbot
   class DuckDuckGo < BotMatch
-    # @fetch:url https://
+    # @fetch:url https://duckduckgo.com/duckduckgo-help-pages/results/duckduckbot/
     # @fetch:selector section.main article.content ul > li
     ip_ranges %w[
       20.185.79.15
data/lib/legitbot/gptbot.rb
ADDED
@@ -0,0 +1,21 @@
+# frozen_string_literal: true
+
+module Legitbot # :nodoc:
+  # https://platform.openai.com/docs/gptbot
+  class GPTBot < BotMatch
+    # @fetch:url https://openai.com/gptbot-ranges.txt
+    ip_ranges %w[
+      20.15.240.64/28
+      20.15.240.80/28
+      20.15.240.96/28
+      20.15.240.176/28
+      20.15.241.0/28
+      20.15.242.128/28
+      20.15.242.144/28
+      20.15.242.192/28
+      40.83.2.64/28
+    ]
+  end
+
+  rule Legitbot::GPTBot, %w[GPTBot]
+end
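Conceptually, a matcher like the new `GPTBot` class combines a `User-Agent` signature with the published CIDR list: the UA substring alone is never trusted. A stdlib-only sketch of that check — the ranges copy the list added above, while `GPTBOT_RANGES` and `valid_gptbot?` are our own illustrative names, not the gem's API:

```ruby
require 'ipaddr'

# The CIDR ranges added in this release, parsed once up front.
GPTBOT_RANGES = %w[
  20.15.240.64/28 20.15.240.80/28 20.15.240.96/28 20.15.240.176/28
  20.15.241.0/28 20.15.242.128/28 20.15.242.144/28 20.15.242.192/28
  40.83.2.64/28
].map { |cidr| IPAddr.new(cidr) }.freeze

# A User-Agent claiming to be GPTBot is genuine only if the client IP
# falls inside one of the published ranges.
def valid_gptbot?(user_agent, ip)
  return false unless user_agent.include?('GPTBot')

  addr = IPAddr.new(ip)
  GPTBOT_RANGES.any? { |range| range.include?(addr) }
end

valid_gptbot?('Mozilla/5.0; compatible; GPTBot/1.0', '20.15.240.70') # => true
valid_gptbot?('Mozilla/5.0; compatible; GPTBot/1.0', '203.0.113.9')  # => false
```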
data/lib/legitbot/version.rb
CHANGED
data/lib/legitbot.rb
CHANGED
@@ -12,6 +12,7 @@ require_relative 'legitbot/bing'
 require_relative 'legitbot/duckduckgo'
 require_relative 'legitbot/facebook'
 require_relative 'legitbot/google'
+require_relative 'legitbot/gptbot'
 require_relative 'legitbot/ias'
 require_relative 'legitbot/oracle'
 require_relative 'legitbot/petalbot'
data/lib/rubocop/cop/custom/ip_ranges.rb
CHANGED
@@ -25,8 +25,9 @@ module RuboCop
         params = fetch_params(node)
         return unless mandatory_params?(params)
 
-        existing_ips = read_node_ips
-        new_ips = fetch_ips(**params)
+        existing_ips = normalise_list(read_node_ips(value))
+        new_ips = normalise_list(fetch_ips(**params))
+        return unless new_ips
         return if existing_ips == new_ips
 
         register_offense(value, new_ips, **params)
@@ -36,20 +37,45 @@ module RuboCop
       private
 
       def fetch_ips(url:, selector: nil, jsonpath: nil)
+        body = get_url url
+        return unless body
+        return parse_html(body, selector) if selector
+        return parse_json(body, jsonpath) if jsonpath
+
+        parse_text(body)
+      end
+
+      def get_url(url)
         response = Net::HTTP.get_response URI(url)
+        unless response.is_a?(Net::HTTPOK)
+          add_global_offense "Could not fetch IPs from #{url} , HTTP status code #{response.code}"
+          return
+        end
+
         response.value
+        response.body
+      end
 
-
-
-
-
-
-
-
+      def parse_html(body, selector)
+        document = Nokogiri::HTML body
+        document.css(selector).map(&:content)
+      end
+
+      def parse_json(body, jsonpath)
+        document = JSON.parse body
+        JsonPath.new(jsonpath).on(document)
+      end
+
+      def parse_text(body)
+        body.lines.map(&:chomp)
       end
 
       def read_node_ips(value)
-        value.child_nodes.map(&:value)
+        value.child_nodes.map(&:value)
+      end
+
+      def normalise_list(ips)
+        ips.sort_by(&IPAddr.method(:new))
       end
 
       def register_offense(node, new_ips, **params)
@@ -60,7 +86,7 @@ module RuboCop
       end
 
       def mandatory_params?(params)
-        params.include?(:url)
+        params.include?(:url)
       end
 
       def fetch_params(node)
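The `normalise_list` step added above is what lets the cop compare the fetched list against the one already in the source order-insensitively: sorting by numeric address (via `IPAddr`) rather than by string makes two lists with the same members compare equal. A standalone sketch of just that helper, stdlib only, with the method name mirroring the diff:

```ruby
require 'ipaddr'

# Sort CIDR strings by their numeric address, as the cop now does before
# comparing existing and freshly fetched IP lists.
def normalise_list(ips)
  ips.sort_by(&IPAddr.method(:new))
end

normalise_list(%w[40.83.2.64/28 20.15.241.0/28 20.15.240.64/28])
# => ["20.15.240.64/28", "20.15.241.0/28", "40.83.2.64/28"]
```

Lexicographic sorting would misorder dotted quads (e.g. `9.x` after `80.x`), so sorting through `IPAddr`, which is `Comparable` on the numeric address, is the robust choice.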
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: legitbot
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.9.0
 platform: ruby
 authors:
 - Alexander Azarov
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-08-08 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: fast_interval_tree
@@ -81,6 +81,7 @@ files:
 - lib/legitbot/duckduckgo.rb
 - lib/legitbot/facebook.rb
 - lib/legitbot/google.rb
+- lib/legitbot/gptbot.rb
 - lib/legitbot/ias.rb
 - lib/legitbot/legitbot.rb
 - lib/legitbot/oracle.rb