html2rss 0.15.0 → 0.17.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35) hide show
  1. checksums.yaml +4 -4
  2. data/README.md +112 -44
  3. data/html2rss.gemspec +3 -2
  4. data/lib/html2rss/attribute_post_processors/sanitize_html.rb +8 -1
  5. data/lib/html2rss/auto_source/article.rb +37 -5
  6. data/lib/html2rss/auto_source/channel.rb +21 -28
  7. data/lib/html2rss/auto_source/cleanup.rb +0 -16
  8. data/lib/html2rss/auto_source/rss_builder.rb +1 -1
  9. data/lib/html2rss/auto_source/scraper/html.rb +21 -12
  10. data/lib/html2rss/auto_source/scraper/schema/item_list.rb +34 -0
  11. data/lib/html2rss/auto_source/scraper/schema/list_item.rb +25 -0
  12. data/lib/html2rss/auto_source/scraper/schema/thing.rb +104 -0
  13. data/lib/html2rss/auto_source/scraper/schema.rb +22 -34
  14. data/lib/html2rss/auto_source/scraper/semantic_html/extractor.rb +41 -41
  15. data/lib/html2rss/auto_source/scraper/semantic_html/image.rb +6 -6
  16. data/lib/html2rss/auto_source/scraper/semantic_html.rb +3 -2
  17. data/lib/html2rss/auto_source.rb +0 -7
  18. data/lib/html2rss/cli.rb +11 -4
  19. data/lib/html2rss/config/channel.rb +7 -1
  20. data/lib/html2rss/config/selectors.rb +2 -1
  21. data/lib/html2rss/config.rb +1 -0
  22. data/lib/html2rss/item.rb +7 -2
  23. data/lib/html2rss/request_service/browserless_strategy.rb +53 -0
  24. data/lib/html2rss/request_service/context.rb +46 -0
  25. data/lib/html2rss/request_service/faraday_strategy.rb +24 -0
  26. data/lib/html2rss/request_service/puppet_commander.rb +61 -0
  27. data/lib/html2rss/request_service/response.rb +27 -0
  28. data/lib/html2rss/request_service/strategy.rb +28 -0
  29. data/lib/html2rss/request_service.rb +97 -0
  30. data/lib/html2rss/rss_builder/stylesheet.rb +7 -0
  31. data/lib/html2rss/utils.rb +23 -26
  32. data/lib/html2rss/version.rb +1 -1
  33. data/lib/html2rss.rb +5 -5
  34. metadata +31 -11
  35. data/lib/html2rss/auto_source/scraper/schema/base.rb +0 -61
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d89191b35f643372cc18b880dab7535d18a10d9fd123897460ee16c5e990a5d9
4
- data.tar.gz: 71cb356f5261b2e6a3d2152afcb68f658e78d5fec5ff15bc67ed0d5bd153fc00
3
+ metadata.gz: bb6b3eb69655bdbb4511f74db9e1bcc766a98aa55d7afc2561a176c6973bda4f
4
+ data.tar.gz: 45193122489ba965b489c981696f71508030dc80f156e5b0f077932fc55caec3
5
5
  SHA512:
6
- metadata.gz: 46f048feae342844df1af51c741d681677192c1dc84452fae1002f5cca5b406c0698a426ec6e532572c4fb4f6fb896a966862d8d2599b8dd742a174707289aed
7
- data.tar.gz: 98d0316c64bb5a160d26d5efa59b25901b3a64e572795bbd840539fe69d84a4ea3c797bb16721edb73277d1b9bfb9238f9d40ea2b9bb4ebeffc81e8790a02062
6
+ metadata.gz: 294529cd8cb1d289e969f94c32757656d12c864a92c438fceb73ba9dbddd85cca822b146898cdeff928aeefcba75652b91c0a56ded66241bcf23014fea299196
7
+ data.tar.gz: a2b50e52a7f6ad7768a7092fcdf04c7000dd43d190ee12461f6773fe4b324e43be7b411c7aff69b3af1266aeca3fbb4678e5c43e3bac2b0c2a49d86589869b38
data/README.md CHANGED
@@ -10,21 +10,11 @@ With the _feed config_, you provide a URL to scrape and CSS selectors for extrac
10
10
 
11
11
  Support the development by sponsoring this project on GitHub. Thank you! 💓
12
12
 
13
- ## Installation
14
-
15
- | Install | `gem install html2rss` |
16
- | ------- | ---------------------- |
17
- | Usage | `html2rss help` |
13
+ ## Generating a feed on the CLI
18
14
 
19
- You can also install it as a dependency in your Ruby project:
15
+ [Install Ruby](https://www.ruby-lang.org/en/documentation/installation/) (latest version is recommended) on your machine and run `gem install html2rss` in your terminal.
20
16
 
21
- | 🤩 Like it? | Star it! ⭐️ |
22
- | -------------------------------: | -------------------- |
23
- | Add this line to your `Gemfile`: | `gem 'html2rss'` |
24
- | Then execute: | `bundle` |
25
- | In your code: | `require 'html2rss'` |
26
-
27
- ## Generating a feed on the CLI
17
+ After the installation has finished, `html2rss help` will print usage information.
28
18
 
29
19
  ### using automatic generation
30
20
 
@@ -59,6 +49,14 @@ Build the feed from this config with: `html2rss feed ./my_config_file.yml`.
59
49
 
60
50
  ## Generating a feed with Ruby
61
51
 
52
+ You can also install it as a dependency in your Ruby project:
53
+
54
+ | 🤩 Like it? | Star it! ⭐️ |
55
+ | -------------------------------: | -------------------- |
56
+ | Add this line to your `Gemfile`: | `gem 'html2rss'` |
57
+ | Then execute: | `bundle` |
58
+ | In your code: | `require 'html2rss'` |
59
+
62
60
  Here's a minimal working example using Ruby:
63
61
 
64
62
  ```ruby
@@ -117,7 +115,7 @@ channel:
117
115
  Command line usage example:
118
116
 
119
117
  ```sh
120
- bundle exec html2rss feed the_feed_config.yml id=42
118
+ html2rss feed the_feed_config.yml id=42
121
119
  ```
122
120
 
123
121
  <details><summary>See a Ruby example</summary>
@@ -154,9 +152,9 @@ Your `selectors` hash can contain arbitrary named selectors, but only a few will
154
152
  | `comments` | `comments` | A URL. |
155
153
  | `source` | ~~source~~ | Not yet supported. |
156
154
 
157
- ### The `selector` hash
155
+ ### Build RSS 2.0 item attributes by specifying selectors
158
156
 
159
- Every named selector in your `selectors` hash can have these attributes:
157
 + Every named selector (e.g. `title`, `description`, see table above) in your `selectors` hash can have these attributes:
160
158
 
161
159
  | name | value |
162
160
  | -------------- | -------------------------------------------------------- |
@@ -164,7 +162,7 @@ Every named selector in your `selectors` hash can have these attributes:
164
162
  | `extractor` | Name of the extractor. See notes below. |
165
163
  | `post_process` | A hash or array of hashes. See notes below. |
166
164
 
167
- ## Using extractors
165
+ #### Using extractors
168
166
 
169
167
  Extractors help with extracting the information from the selected HTML tag.
170
168
 
@@ -201,7 +199,7 @@ selectors:
201
199
 
202
200
  </details>
203
201
 
204
- ## Using post processors
202
+ ### Using post processors
205
203
 
206
204
  Extracted information can be further manipulated with post processors.
207
205
 
@@ -218,7 +216,7 @@ Extracted information can be further manipulated with post processors.
218
216
 
219
217
  ⚠️ Always make use of the `sanitize_html` post processor for HTML content. _Never trust the internet!_ ⚠️
220
218
 
221
- ### Chaining post processors
219
+ #### Chaining post processors
222
220
 
223
221
  Pass an array to `post_process` to chain the post processors.
224
222
 
@@ -244,14 +242,14 @@ selectors:
244
242
 
245
243
  </details>
246
244
 
247
- ### Post processor `gsub`
245
+ ##### Post processor `gsub`
248
246
 
249
247
  The post processor `gsub` makes use of Ruby's [`gsub`](https://apidock.com/ruby/String/gsub) method.
250
248
 
251
- | key | type | required | note |
252
- | ------------- | ------ | -------- | --------------------------- |
253
- | `pattern` | String | yes | Can be Regexp or String. |
254
- | `replacement` | String | yes | Can be a [backreference](). |
249
+ | key | type | required | note |
250
+ | ------------- | ------ | -------- | ------------------------ |
251
+ | `pattern` | String | yes | Can be Regexp or String. |
252
+ | `replacement` | String | yes | Can be a backreference. |
255
253
 
256
254
  <details><summary>See a Ruby example</summary>
257
255
 
@@ -283,7 +281,7 @@ selectors:
283
281
 
284
282
  </details>
285
283
 
286
- ## Adding `<category>` tags to an item
284
+ #### Adding `<category>` tags to an item
287
285
 
288
286
  The `categories` selector takes an array of selector names. Each value of those
289
287
  selectors will become a `<category>` on the RSS item.
@@ -326,7 +324,7 @@ selectors:
326
324
 
327
325
  </details>
328
326
 
329
- ## Custom item GUID
327
+ #### Custom item GUID
330
328
 
331
329
  By default, html2rss generates a GUID from the `title` or `description`.
332
330
 
@@ -371,7 +369,7 @@ selectors:
371
369
 
372
370
  </details>
373
371
 
374
- ## Adding an `<enclosure>` tag to an item
372
+ #### Adding an `<enclosure>` tag to an item
375
373
 
376
374
  An enclosure can be any file, e.g. a image, audio or video - think Podcast.
377
375
 
@@ -379,7 +377,7 @@ The `enclosure` selector needs to return a URL of the content to enclose. If the
379
377
 
380
378
  Since `html2rss` does no further inspection of the enclosure, its support comes with trade-offs:
381
379
 
382
- 1. The content-type is guessed from the file extension of the URL.
380
+ 1. The content-type is guessed from the file extension of the URL, unless one is specified in `content_type`.
383
381
  2. If the content-type guessing fails, it will default to `application/octet-stream`.
384
382
  3. The content-length will always be undetermined and therefore stated as `0` bytes.
385
383
 
@@ -392,7 +390,12 @@ Read the [RSS 2.0 spec](http://www.rssboard.org/rss-profile#element-channel-item
392
390
  Html2rss.feed(
393
391
  channel: {},
394
392
  selectors: {
395
- enclosure: { selector: 'audio', extractor: 'attribute', attribute: 'src' }
393
+ enclosure: {
394
+ selector: 'audio',
395
+ extractor: 'attribute',
396
+ attribute: 'src',
397
+ content_type: 'audio/mp3'
398
+ }
396
399
  }
397
400
  )
398
401
  ```
@@ -411,17 +414,16 @@ selectors:
411
414
  selector: "audio"
412
415
  extractor: "attribute"
413
416
  attribute: "src"
417
+ content_type: "audio/mp3"
414
418
  ```
415
419
 
416
420
  </details>
421
+
417
422
  ## Scraping and handling JSON responses
418
423
 
419
- By default, `html2rss` assumes the URL responds with HTML. However, it can also handle JSON responses. The JSON must return an Array or Hash.
424
+ By default, `html2rss` assumes the URL responds with HTML. However, it can also handle JSON responses. The JSON response must be an Array or Hash.
420
425
 
421
- | key | required | default | note |
422
- | ---------- | -------- | ------- | ---------------------------------------------------- |
423
- | `json` | optional | false | If set to `true`, the response is parsed as JSON. |
424
- | `jsonpath` | optional | $ | Use [JSONPath syntax]() to select nodes of interest. |
426
+ The JSON is converted to XML which you can query using CSS selectors.
425
427
 
426
428
  <details><summary>See a Ruby example</summary>
427
429
 
@@ -447,7 +449,73 @@ selectors:
447
449
 
448
450
  </details>
449
451
 
450
- ## Set any HTTP header in the request
452
+ ## Customization of how requests to the channel URL are sent
453
+
454
 + By default, html2rss issues a naive HTTP request and extracts information from the response. That is performant and works for many websites.
455
+
456
+ However, modern websites often do not render much HTML on the server, but evaluate JavaScript on the client to create the HTML. In such cases, the default strategy will not find the "juicy content".
457
+
458
+ ### Use Browserless.io
459
+
460
+ You can use _Browserless.io_ to run a Chrome browser and return the website's source code after the website generated it.
461
+ For this, you can either run your own Browserless.io instance (Docker image available -- [read their license](https://github.com/browserless/browserless/pkgs/container/chromium#licensing)!) or pay them for a hosted instance.
462
+
463
+ To run a local Browserless.io instance, you can use the following Docker command:
464
+
465
+ ```sh
466
+ docker run \
467
+ --rm \
468
+ -p 3000:3000 \
469
+ -e "CONCURRENT=10" \
470
+ -e "TOKEN=6R0W53R135510" \
471
+ ghcr.io/browserless/chromium
472
+ ```
473
+
474
+ To make html2rss use your instance,
475
+
476
+ 1. specify the environment variables accordingly, and
477
+ 2. use the `browserless` strategy for those websites.
478
+
479
+ When running locally with commands from above, you can skip setting the environment variables, as they are aligned with the default values.
480
+
481
+ ```sh
482
+ BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
483
+ html2rss auto --strategy=browserless https://example.com
484
+ ```
485
+
486
+ When using traditional feed configs, inside your channel config set `strategy: browserless`.
487
+
488
+ <details><summary>See a YAML feed config example</summary>
489
+
490
+ ```yml
491
+ channel:
492
+ url: https://www.imdb.com/user/ur67728460/ratings
493
+ time_zone: UTC
494
+ ttl: 1440
495
+ strategy: browserless
496
+ headers:
497
+ User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
498
+ selectors:
499
+ items:
500
+ selector: "li.ipc-metadata-list-summary-item"
501
+ title:
502
+ selector: ".ipc-title__text"
503
+ post_process:
504
+ - name: gsub
505
+ pattern: "/^(\\d+.)\\s/"
506
+ replacement: ""
507
+ - name: template
508
+ string: "%{self} rated with: %{user_rating}"
509
+ link:
510
+ selector: "a.ipc-title-link-wrapper"
511
+ extractor: "href"
512
+ user_rating:
513
+ selector: "[data-testid='ratingGroup--other-user-rating'] > .ipc-rating-star--rating"
514
+ ```
515
+
516
+ </details>
517
+
518
+ ### Set any HTTP header in the request
451
519
 
452
520
  To set HTTP request headers, you can add them to the channel's `headers` hash. This is useful for APIs that require an Authorization header.
453
521
 
@@ -595,24 +663,24 @@ Recommended further readings:
595
663
  - Fiddling with [`curl`](https://github.com/curl/curl) and [`pup`](https://github.com/ericchiang/pup) to find the selectors seems efficient (`curl URL | pup`).
596
664
  - [CSS selectors are versatile. Here's an overview.](https://www.w3.org/TR/selectors-4/#overview)
597
665
 
598
- ### Contributing
666
+ ## Contributing
599
667
 
600
668
  Find ideas what to contribute in:
601
669
 
602
670
  1. <https://github.com/orgs/html2rss/discussions>
603
671
  2. the issues tracker: <https://github.com/html2rss/html2rss/issues>
604
672
 
605
- #### Development Helpers
606
-
607
- 1. `bin/setup`: installs dependencies and sets up the development environment.
608
- 2. `bin/guard`: automatically runs rspec, rubocop and reek when a file changes.
609
- 3. for a modern Ruby development experience: install [`ruby-lsp`](https://github.com/Shopify/ruby-lsp) and integrate it to your IDE:
610
- a. [Ruby in Visual Studio Code](https://code.visualstudio.com/docs/languages/ruby)
611
-
612
- #### How to submit changes
673
+ To submit changes:
613
674
 
614
675
  1. Fork this repo ( <https://github.com/html2rss/html2rss/fork> )
615
676
  2. Create your feature branch (`git checkout -b my-new-feature`)
616
677
  3. Implement and commit your changes (`git commit -am 'feat: add XYZ'`)
617
678
  4. Push to the branch (`git push origin my-new-feature`)
618
679
  5. Create a new Pull Request using the Github web UI
680
+
681
+ ## Development Helpers
682
+
683
+ 1. `bin/setup`: installs dependencies and sets up the development environment.
684
 + 2. for a modern Ruby development experience: install [`ruby-lsp`](https://github.com/Shopify/ruby-lsp) and integrate it into your IDE.
685
+
686
+ For example: [Ruby in Visual Studio Code](https://code.visualstudio.com/docs/languages/ruby).
data/html2rss.gemspec CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |spec|
14
14
  spec.description = 'Supports JSON content, custom HTTP headers, and post-processing of extracted content.'
15
15
  spec.homepage = 'https://github.com/html2rss/html2rss'
16
16
  spec.license = 'MIT'
17
- spec.required_ruby_version = '>= 3.1'
17
+ spec.required_ruby_version = '>= 3.2'
18
18
 
19
19
  if spec.respond_to?(:metadata)
20
20
  spec.metadata['allowed_push_host'] = 'https://rubygems.org'
@@ -39,8 +39,9 @@ Gem::Specification.new do |spec|
39
39
  spec.add_dependency 'mime-types', '> 3.0'
40
40
  spec.add_dependency 'nokogiri', '>= 1.10', '< 2.0'
41
41
  spec.add_dependency 'parallel'
42
+ spec.add_dependency 'puppeteer-ruby'
42
43
  spec.add_dependency 'regexp_parser'
43
- spec.add_dependency 'reverse_markdown', '~> 2.0'
44
+ spec.add_dependency 'reverse_markdown', '~> 3.0'
44
45
  spec.add_dependency 'rss'
45
46
  spec.add_dependency 'sanitize', '~> 6.0'
46
47
  spec.add_dependency 'thor'
@@ -77,10 +77,17 @@ module Html2rss
77
77
  )
78
78
  end
79
79
 
80
+ ##
81
+ # @return [Hash]
82
+ # @see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referrer-Policy
80
83
  def add_attributes
81
84
  {
82
85
  'a' => { 'rel' => 'nofollow noopener noreferrer', 'target' => '_blank' },
83
- 'img' => { 'referrer-policy' => 'no-referrer' }
86
+ 'area' => { 'rel' => 'nofollow noopener noreferrer', 'target' => '_blank' },
87
+ 'img' => { 'referrerpolicy' => 'no-referrer' },
88
+ 'iframe' => { 'referrerpolicy' => 'no-referrer' },
89
+ 'video' => { 'referrerpolicy' => 'no-referrer' },
90
+ 'audio' => { 'referrerpolicy' => 'no-referrer' }
84
91
  }
85
92
  end
86
93
 
@@ -2,6 +2,7 @@
2
2
 
3
3
  require 'zlib'
4
4
  require 'sanitize'
5
+ require 'nokogiri'
5
6
 
6
7
  module Html2rss
7
8
  class AutoSource
@@ -14,6 +15,31 @@ module Html2rss
14
15
 
15
16
  PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze
16
17
 
18
+ ##
19
+ # Removes the specified pattern from the beginning of the text
20
+ # within a given range if the pattern occurs before the range's end.
21
+ #
22
+ # @param text [String]
23
+ # @param pattern [String]
24
+ # @param end_of_range [Integer] - Optional, defaults to half the size of the text
25
+ # @return [String]
26
+ def self.remove_pattern_from_start(text, pattern, end_of_range: (text.size * 0.5).to_i)
27
+ return text unless text.is_a?(String) && pattern.is_a?(String)
28
+
29
+ index = text.index(pattern)
30
+ return text if index.nil? || index >= end_of_range
31
+
32
+ text.gsub(/^(.{0,#{end_of_range}})#{Regexp.escape(pattern)}/, '\1')
33
+ end
34
+
35
+ ##
36
+ # Checks if the text contains HTML tags.
37
+ # @param text [String]
38
+ # @return [Boolean]
39
+ def self.contains_html?(text)
40
+ Nokogiri::HTML.fragment(text).children.any?(&:element?)
41
+ end
42
+
17
43
  # @param options [Hash<Symbol, String>]
18
44
  def initialize(**options)
19
45
  @to_h = {}
@@ -50,9 +76,15 @@ module Html2rss
50
76
  def description
51
77
  return @description if defined?(@description)
52
78
 
53
- return if url.to_s.empty? || @to_h[:description].to_s.empty?
79
+ return if (description = @to_h[:description]).to_s.empty?
80
+
81
+ @description = self.class.remove_pattern_from_start(description, title) if title
54
82
 
55
- @description ||= Html2rss::AttributePostProcessors::SanitizeHtml.get(@to_h[:description], url)
83
+ if self.class.contains_html?(@description) && url
84
+ @description = Html2rss::AttributePostProcessors::SanitizeHtml.get(description, url)
85
+ else
86
+ @description
87
+ end
56
88
  end
57
89
 
58
90
  # @return [Addressable::URI, nil]
@@ -72,11 +104,11 @@ module Html2rss
72
104
  end
73
105
 
74
106
  # Parses and returns the published_at time.
75
- # @return [Time, nil]
107
+ # @return [DateTime, nil]
76
108
  def published_at
77
- return if (string = @to_h[:published_at].to_s).strip.empty?
109
+ return if (string = @to_h[:published_at].to_s.strip).empty?
78
110
 
79
- @published_at ||= Time.parse(string)
111
+ @published_at ||= DateTime.parse(string)
80
112
  rescue ArgumentError
81
113
  nil
82
114
  end
@@ -24,52 +24,45 @@ module Html2rss
24
24
  attr_writer :articles
25
25
  attr_reader :stylesheets
26
26
 
27
- def url = extract_url
28
- def title = extract_title
29
- def language = extract_language
30
- def description = extract_description
31
- def image = extract_image
32
- def ttl = extract_ttl
33
- def last_build_date = headers['last-modified']
34
-
35
- def generator
36
- "html2rss V. #{::Html2rss::VERSION} (using auto_source scrapers: #{scraper_counts})"
27
+ def url = @url.normalize.to_s
28
+
29
+ def title
30
+ @title ||= if (title = parsed_body.at_css('head > title')&.text.to_s) && !title.empty?
31
+ title.gsub(/\s+/, ' ').strip
32
+ else
33
+ Utils.titleized_channel_url(@url)
34
+ end
37
35
  end
38
36
 
39
- private
40
-
41
- attr_reader :parsed_body, :headers
42
-
43
- def extract_url
44
- @url.normalize.to_s
45
- end
46
-
47
- def extract_title
48
- parsed_body.at_css('head > title')&.text
49
- end
37
+ def description = parsed_body.at_css('meta[name="description"]')&.[]('content')
38
+ def last_build_date = headers['last-modified']
50
39
 
51
- def extract_language
40
+ def language
52
41
  return parsed_body['lang'] if parsed_body.name == 'html' && parsed_body['lang']
53
42
 
54
43
  parsed_body.at_css('[lang]')&.[]('lang')
55
44
  end
56
45
 
57
- def extract_description
58
- parsed_body.at_css('meta[name="description"]')&.[]('content') || ''
59
- end
60
-
61
- def extract_image
46
+ def image
62
47
  url = parsed_body.at_css('meta[property="og:image"]')&.[]('content')
63
48
  Html2rss::Utils.sanitize_url(url) if url
64
49
  end
65
50
 
66
- def extract_ttl
51
+ def ttl
67
52
  ttl = headers['cache-control']&.match(/max-age=(\d+)/)&.[](1)
68
53
  return unless ttl
69
54
 
70
55
  ttl.to_i.fdiv(60).ceil
71
56
  end
72
57
 
58
+ def generator
59
+ "html2rss V. #{::Html2rss::VERSION} (using auto_source scrapers: #{scraper_counts})"
60
+ end
61
+
62
+ private
63
+
64
+ attr_reader :parsed_body, :headers
65
+
73
66
  def scraper_counts
74
67
  scraper_counts = +''
75
68
 
@@ -13,10 +13,7 @@ module Html2rss
13
13
 
14
14
  articles.select!(&:valid?)
15
15
 
16
- remove_short!(articles, :title)
17
-
18
16
  deduplicate_by!(articles, :url)
19
- deduplicate_by!(articles, :title)
20
17
 
21
18
  keep_only_http_urls!(articles)
22
19
  reject_different_domain!(articles, url) unless keep_different_domain
@@ -27,19 +24,6 @@ module Html2rss
27
24
 
28
25
  private
29
26
 
30
- ##
31
- # Removes articles with short values for a given key.
32
- #
33
- # @param articles [Array<Article>] The list of articles to process.
34
- # @param key [Symbol] The key to check for short values.
35
- # @param min_words [Integer] The minimum number of words required.
36
- def remove_short!(articles, key = :title, min_words: 2)
37
- articles.reject! do |article|
38
- value = article.public_send(key)
39
- value.nil? || value.to_s.split.size < min_words
40
- end
41
- end
42
-
43
27
  ##
44
28
  # Deduplicates articles by a given key.
45
29
  #
@@ -60,7 +60,7 @@ module Html2rss
60
60
 
61
61
  item_maker.title = article.title
62
62
  item_maker.description = article.description
63
- item_maker.pubDate = article.published_at
63
+ item_maker.pubDate = article.published_at&.rfc2822
64
64
  item_maker.link = article.url
65
65
  end
66
66
  end
@@ -1,7 +1,6 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require 'nokogiri'
4
- require 'set'
5
4
 
6
5
  module Html2rss
7
6
  class AutoSource
@@ -12,12 +11,14 @@ module Html2rss
12
11
  class Html
13
12
  include Enumerable
14
13
 
14
+ TAGS_TO_IGNORE = /(nav|footer|header)/i
15
+
15
16
  def self.articles?(parsed_body)
16
17
  new(parsed_body, url: '').any?
17
18
  end
18
19
 
19
20
  def self.parent_until_condition(node, condition)
20
- return nil if !node || node.parent.name == 'html'
21
+ return nil if !node || node.document? || node.parent.name == 'html'
21
22
  return node if condition.call(node)
22
23
 
23
24
  parent_until_condition(node.parent, condition)
@@ -32,7 +33,7 @@ module Html2rss
32
33
  def initialize(parsed_body, url:)
33
34
  @parsed_body = parsed_body
34
35
  @url = url
35
- @css_selectors = Hash.new(0)
36
+ @selectors = Hash.new(0)
36
37
  end
37
38
 
38
39
  attr_reader :parsed_body
@@ -48,9 +49,10 @@ module Html2rss
48
49
  frequent_selectors.each do |selector|
49
50
  parsed_body.xpath(selector).each do |selected_tag|
50
51
  article_tag = self.class.parent_until_condition(selected_tag, method(:article_condition))
51
- article_hash = SemanticHtml::Extractor.new(article_tag, url: @url).call
52
52
 
53
- yield article_hash if article_hash
53
+ if article_tag && (article_hash = SemanticHtml::Extractor.new(article_tag, url: @url).call)
54
+ yield article_hash
55
+ end
54
56
  end
55
57
  end
56
58
  end
@@ -58,25 +60,32 @@ module Html2rss
58
60
  ##
59
61
  # Find all the anchors in root.
60
62
  # @param root [Nokogiri::XML::Node] The root node to search for anchors
61
- # @return [Set<String>] The set of CSS selectors which exist at least min_frequency times
63
+ # @return [Set<String>] The set of XPath selectors which exist at least min_frequency times
62
64
  def frequent_selectors(root = @parsed_body.at_css('body'), min_frequency: 2)
63
65
  @frequent_selectors ||= begin
64
66
  root.traverse do |node|
65
67
  next if !node.element? || node.name != 'a'
66
68
 
67
- @css_selectors[self.class.simplify_xpath(node.path)] += 1
69
+ @selectors[self.class.simplify_xpath(node.path)] += 1
68
70
  end
69
71
 
70
- @css_selectors.keys
71
- .select { |selector| (@css_selectors[selector]).to_i >= min_frequency }
72
- .to_set
72
+ @selectors.keys
73
+ .select { |selector| (@selectors[selector]).to_i >= min_frequency }
74
+ .to_set
73
75
  end
74
76
  end
75
77
 
76
- private
77
-
78
78
  def article_condition(node)
79
+ # Ignore tags that are below a tag which is in TAGS_TO_IGNORE.
80
+ return false if node.path.match?(TAGS_TO_IGNORE)
81
+
82
+ # Ignore tags that are below a tag which has a class which matches TAGS_TO_IGNORE.
83
+ return false if self.class.parent_until_condition(node, proc do |current_node|
84
+ current_node.classes.any? { |klass| klass.match?(TAGS_TO_IGNORE) }
85
+ end)
86
+
79
87
  return true if %w[body html].include?(node.name)
88
+
80
89
  return true if node.parent.css('a').size > 1
81
90
 
82
91
  false
@@ -0,0 +1,34 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Html2rss
4
+ class AutoSource
5
+ module Scraper
6
+ class Schema
7
+ ##
8
+ # Handles schema.org ItemList objects, which contain
9
+ # 1. itemListElements, and/or
10
+ # 2. interesting attributes, i.e. description, url, image, itself.
11
+ #
12
+ # @see https://schema.org/ItemList
13
+ class ItemList < Thing
14
+ SUPPORTED_TYPES = Set['ItemList']
15
+
16
+ # @return [Array<Hash>] the scraped article hashes with DEFAULT_ATTRIBUTES
17
+ def call
18
+ hashes = [super]
19
+
20
+ return hashes if (elements = @schema_object[:itemListElement]).nil?
21
+
22
+ elements = [elements] unless elements.is_a?(Array)
23
+
24
+ elements.each do |schema_object|
25
+ hashes << ListItem.new(schema_object, url: @url).call
26
+ end
27
+
28
+ hashes
29
+ end
30
+ end
31
+ end
32
+ end
33
+ end
34
+ end
@@ -0,0 +1,25 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Html2rss
4
+ class AutoSource
5
+ module Scraper
6
+ class Schema
7
+ ##
8
+ #
9
+ # @see https://schema.org/ListItem
10
+ class ListItem < Thing
11
+ def id = (id = (schema_object.dig(:item, :@id) || super).to_s).empty? ? nil : id
12
+ def title = schema_object.dig(:item, :name) || super || (url ? Utils.titleized_url(url) : nil)
13
+ def description = schema_object.dig(:item, :description) || super
14
+
15
+ # @return [Addressable::URI, nil]
16
+ def url
17
+ url = schema_object.dig(:item, :url) || super
18
+
19
+ Utils.build_absolute_url_from_relative(url, @url) if url
20
+ end
21
+ end
22
+ end
23
+ end
24
+ end
25
+ end