RubyGems - html2rss - Versions diffs - 0.12.0 → 0.13.0 - Mend

html2rss 0.12.0 → 0.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32)

checksums.yaml +4 -4
data/README.md +38 -10
data/html2rss.gemspec +1 -0
data/lib/html2rss/attribute_post_processors/base.rb +9 -6
data/lib/html2rss/attribute_post_processors/gsub.rb +2 -2
data/lib/html2rss/attribute_post_processors/html_to_markdown.rb +2 -2
data/lib/html2rss/attribute_post_processors/markdown_to_html.rb +2 -2
data/lib/html2rss/attribute_post_processors/parse_time.rb +2 -2
data/lib/html2rss/attribute_post_processors/parse_uri.rb +2 -2
data/lib/html2rss/attribute_post_processors/sanitize_html.rb +13 -2
data/lib/html2rss/attribute_post_processors/substring.rb +3 -3
data/lib/html2rss/attribute_post_processors/template.rb +4 -4
data/lib/html2rss/auto_source/article.rb +95 -0
data/lib/html2rss/auto_source/channel.rb +79 -0
data/lib/html2rss/auto_source/cleanup.rb +76 -0
data/lib/html2rss/auto_source/reducer.rb +48 -0
data/lib/html2rss/auto_source/rss_builder.rb +68 -0
data/lib/html2rss/auto_source/scraper/schema/base.rb +61 -0
data/lib/html2rss/auto_source/scraper/schema.rb +122 -0
data/lib/html2rss/auto_source/scraper/semantic_html/extractor.rb +123 -0
data/lib/html2rss/auto_source/scraper/semantic_html/image.rb +54 -0
data/lib/html2rss/auto_source/scraper/semantic_html.rb +118 -0
data/lib/html2rss/auto_source/scraper.rb +33 -0
data/lib/html2rss/auto_source.rb +77 -0
data/lib/html2rss/cli.rb +10 -0
data/lib/html2rss/config/channel.rb +4 -2
data/lib/html2rss/config/selectors.rb +2 -2
data/lib/html2rss/item.rb +8 -2
data/lib/html2rss/utils.rb +5 -10
data/lib/html2rss/version.rb +1 -1
data/lib/html2rss.rb +21 -0
metadata +29 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ebe536d8051a64c6e2adf9fa8e1d9d1f9fa3743541c44ca85022d0603f9032b2
-  data.tar.gz: 7b3aaa213aaf6a37fb6e94fa72c9936ffd2391322297553b253b097edea300cc
+  metadata.gz: 7a2bf557dd65533533e07b4581e195f2d2b32ff906831526a4d7aed27a558d71
+  data.tar.gz: f42e5f03649a08219d310a2545413c371f851530c4d323fd68ef783b4b3b5e13
 SHA512:
-  metadata.gz: '03985002d050b996c1dc315cbe8e3fc79b6619447a048ad3d2dca86f792eab5c2356716cf6198a24efc61de7e7ddceba2780da49c3e68a3c9efe895eb7cf0cf1'
-  data.tar.gz: 8315473528f46a5ba28297af296b879a66ac00f86ba9eb117b4e6c9ec61c285e4090cfd999ff712368f5b988b1cbda460e268aa3ea8928912bcdb1960ae25a4a
+  metadata.gz: 724a1fa8ab15ae140278eb9b055f22e7aad12e94627795f7a2f13c78f5421607e39d6ba040821b4c47b69f963cc0180bf8e964ff0b896403cb6305ed1d67dbb5
+  data.tar.gz: a06c2e16b0b51c6b6d2184430efc2a4e8b2812fee413163aa2991567e7608141f1c18189fdded58c8c3383940c4790478cd631abc6a1470ad648b2030fdefaab

data/README.md CHANGED

@@ -26,26 +26,40 @@ You can also install it as a dependency in your Ruby project:
 ## Generating a feed on the CLI
-Create a file called `my_config_file.yml` with this example content:
+### using automatic scraping
+html2rss offers an automatic scrapting feature. Try it with:
+`html2rss auto https://unmatchedstyle.com/`
+### creating a feed config file and using it
+If the results are not to your satisfaction, you can create a feed config file.
+Create a file called `my_config_file.yml` with this sample content:
 ```yml
 channel:
-  url: https://stackoverflow.com/questions
+  url: https://unmatchedstyle.com
 selectors:
   items:
-    selector: "#hot-network-questions > ul > li"
+    selector: "article[id^='post-']"
   title:
-    selector: a
+    selector: h2
   link:
     selector: a
     extractor: href
+  description:
+    selector: ".post-content"
+    post_process:
+      - name: sanitize_html
 ```
-Build the RSS with: `html2rss feed ./my_config_file.yml`.
+Build the feed from this config with: `html2rss feed ./my_config_file.yml`.
 ## Generating a feed with Ruby
-Here's a minimal working example in Ruby:
+Here's a minimal working example using Ruby:
 ```ruby
 require 'html2rss'
@@ -481,7 +495,7 @@ feeds:
 Your feed configs go below `feeds`. Everything else is part of the global config.
-Find a full example of a `feeds.yml` at [`spec/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/feeds.test.yml).
+Find a full example of a `feeds.yml` at [`spec/fixtures/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/fixtures/feeds.test.yml).
 Now you can build your feeds like this:
@@ -583,8 +597,22 @@ Recommended further readings:
 ### Contributing
-1. Fork it ( <https://github.com/html2rss/html2rss/fork> )
+Find ideas what to contribute in:
+1. <https://github.com/orgs/html2rss/discussions>
+2. the issues tracker: <https://github.com/html2rss/html2rss/issues>
+#### Development Helpers
+1. `bin/setup`: installs dependencies and sets up the development environment.
+2. `bin/guard`: automatically runs rspec, rubocop and reek when a file changes.
+3. for a modern Ruby development experience: install [`ruby-lsp`](https://github.com/Shopify/ruby-lsp) and integrate it to your IDE:
+   a. [Ruby in Visual Studio Code](https://code.visualstudio.com/docs/languages/ruby)
+#### How to submit changes
+1. Fork this repo ( <https://github.com/html2rss/html2rss/fork> )
 2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
+3. Implement a commit your changes (`git commit -am 'feat: add XYZ'`)
 4. Push to the branch (`git push origin my-new-feature`)
-5. Create a new Pull Request
+5. Create a new Pull Request using the Github web UI

data/html2rss.gemspec CHANGED

@@ -38,6 +38,7 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'kramdown'
   spec.add_dependency 'mime-types', '> 3.0'
   spec.add_dependency 'nokogiri', '>= 1.10', '< 2.0'
+  spec.add_dependency 'parallel'
   spec.add_dependency 'regexp_parser'
   spec.add_dependency 'reverse_markdown', '~> 2.0'
   spec.add_dependency 'rss'

data/lib/html2rss/attribute_post_processors/base.rb CHANGED

@@ -26,17 +26,20 @@ module Html2rss
       # @param value [Object] the value to check
       # @param types [Array<Class>, Class] the expected type(s)
       # @param name [String] the name of the option being checked
+      # @param context [Item::Context] the context
       # @raise [InvalidType] if the value is not of the expected type(s)
-      def self.assert_type(value, types = [], name)
+      def self.assert_type(value, types = [], name, context:)
         types = [types] unless types.is_a?(Array)
         return if types.any? { |type| value.is_a?(type) }
-        error_message_template = 'The type of `%s` must be %s, but is: %s'
-        raise InvalidType, format(error_message_template, name, types.join(' or '), value.class), [], cause: nil
-      end
+        options = context[:options] if context.is_a?(Hash)
+        options ||= { file: File.basename(caller_locations(1, 1).first.absolute_path) }
-      # private_class_method :expect_options, :assert_type
+        raise InvalidType, format('The type of `%<name>s` must be %<types>s, but is: %<type>s in: %<options>s',
+                                  name:, types: types.join(' or '), type: value.class, options: options.inspect),
+              [], cause: nil
+      end
       ##
       # This method validates the arguments passed to the post processor. Must be implemented by subclasses.
@@ -51,7 +54,7 @@ module Html2rss
       def initialize(value, context)
         klass = self.class
         # TODO: get rid of Hash
-        klass.assert_type(context, [Item::Context, Hash], 'context')
+        klass.assert_type(context, [Item::Context, Hash], 'context', context:)
         klass.validate_args!(value, context)
         @value = value
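
Note on the `assert_type` change above: the new keyword argument `context:` lets the error message report which options were in play. The following is a minimal standalone sketch of that idea, not the gem's code. `InvalidType` here is a local stand-in class, and the `caller_locations` fallback from the diff is left out.

```ruby
# Local stand-in for the gem's InvalidType error (assumption for this sketch).
class InvalidType < StandardError; end

# Simplified version of the type check shown in the diff.
# Raises InvalidType when `value` is none of the expected types.
def assert_type(value, types, name, context:)
  types = [types] unless types.is_a?(Array)
  return if types.any? { |type| value.is_a?(type) }

  # Only Hash contexts carry :options in this sketch.
  options = context.is_a?(Hash) ? context[:options] : nil

  # Named references (%<name>s) keep the template readable.
  raise InvalidType, format('The type of `%<name>s` must be %<types>s, but is: %<type>s in: %<options>s',
                            name: name, types: types.join(' or '),
                            type: value.class, options: options.inspect)
end

# Passing an Integer where a String or Symbol is expected raises InvalidType.
begin
  assert_type(42, [String, Symbol], :value, context: { options: { start: 1 } })
rescue InvalidType => e
  puts e.message
end
```

The exact rendering of the options hash at the end of the message depends on the Ruby version's `Hash#inspect`.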

data/lib/html2rss/attribute_post_processors/gsub.rb CHANGED

@@ -27,9 +27,9 @@ module Html2rss
     # See the doc on [String#gsub](https://ruby-doc.org/core/String.html#method-i-gsub) for more info.
     class Gsub < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         expect_options(%i[replacement pattern], context)
-        assert_type context.dig(:options, :replacement), [String, Hash], :replacement
+        assert_type context.dig(:options, :replacement), [String, Hash], :replacement, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/html_to_markdown.rb CHANGED

@@ -27,8 +27,8 @@ module Html2rss
     # Would return:
     #    'Lorem **ipsum** dolor'
     class HtmlToMarkdown < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/markdown_to_html.rb CHANGED

@@ -33,8 +33,8 @@ module Html2rss
     #
     #    <p>Price: 12.34</p>
     class MarkdownToHtml < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/parse_time.rb CHANGED

@@ -27,8 +27,8 @@ module Html2rss
     # It uses `Time.parse`.
     class ParseTime < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
-        assert_type context[:config].time_zone, String, :time_zone
+        assert_type(value, String, :value, context:)
+        assert_type(context[:config].time_zone, String, :time_zone, context:)
       end
       ##

data/lib/html2rss/attribute_post_processors/parse_uri.rb CHANGED

@@ -25,8 +25,8 @@ module Html2rss
       def self.validate_args!(value, context)
         url_types = [String, URI::HTTP, Addressable::URI].freeze
-        assert_type(value, url_types, :value)
-        assert_type(context.config.url, url_types, :url)
+        assert_type(value, url_types, :value, context:)
+        assert_type(context.config.url, url_types, :url, context:)
         raise ArgumentError, 'The `value` option is missing or empty.' if value.to_s.empty?
       end

data/lib/html2rss/attribute_post_processors/sanitize_html.rb CHANGED

@@ -39,8 +39,19 @@ module Html2rss
     # Would return:
     #    '<p>Lorem <b>ipsum</b> dolor ...</p>'
     class SanitizeHtml < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
+      end
+      ##
+      # Shorthand method to get the sanitized HTML.
+      # @param html [String]
+      # @param url [String, Addressable::URI]
+      def self.get(html, url)
+        raise ArgumentError, 'url must be a String or Addressable::URI' if url.to_s.empty?
+        return nil if html.to_s.empty?
+        new(html, { config: Config::Channel.new({ url: }) }).get
       end
       ##

data/lib/html2rss/attribute_post_processors/substring.rb CHANGED

@@ -30,13 +30,13 @@ module Html2rss
     #    'bar'
     class Substring < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         options = context[:options]
-        assert_type options[:start], Integer, :start
+        assert_type options[:start], Integer, :start, context:
         end_index = options[:end]
-        assert_type end_index, Integer, :end if end_index
+        assert_type(end_index, Integer, :end, context:) if end_index
       end
       ##

data/lib/html2rss/attribute_post_processors/template.rb CHANGED

@@ -33,7 +33,7 @@ module Html2rss
     #    'Product (23,42€)'
     class Template < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         string = context[:options]&.dig(:string).to_s
         raise InvalidType, 'The `string` template is absent.' if string.empty?
@@ -74,9 +74,9 @@ module Html2rss
       # @return [String]
       # @deprecated Use %<id>s formatting instead. Will be removed in version 1.0.0. See README / Dynamic parameters.
       def format_string_with_methods
-        warn '[DEPRECATION] This method of using params is deprecated and \
-              support for it will be removed in version 1.0.0.\
-              Please use dynamic parameters (i.e. %<id>s, see README.md) instead.'
+        Log.warn '[DEPRECATION] This method of using params is deprecated and \
+                  support for it will be removed in version 1.0.0.\
+                  Please use dynamic parameters (i.e. %<id>s, see README.md) instead.'
         string % methods
       end

data/lib/html2rss/auto_source/article.rb ADDED

@@ -0,0 +1,95 @@
+# frozen_string_literal: true
+require 'zlib'
+require 'sanitize'
+module Html2rss
+  class AutoSource
+    ##
+    # Article is a simple data object representing an article extracted from a page.
+    # It is enumerable and responds to all keys specified in PROVIDED_KEYS.
+    class Article
+      include Enumerable
+      include Comparable
+      PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze
+      # @param options [Hash<Symbol, String>]
+      def initialize(**options)
+        @to_h = {}
+        options.each_pair { |key, value| @to_h[key] = value.freeze if value }
+        @to_h.freeze
+        return unless (unknown_keys = options.keys - PROVIDED_KEYS).any?
+        Log.warn "Article: unknown keys found: #{unknown_keys.join(', ')}"
+      end
+      # Checks if the article is valid based on the presence of URL, ID, and either title or description.
+      # @return [Boolean] True if the article is valid, otherwise false.
+      def valid?
+        !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
+      end
+      # @yield [key, value]
+      # @return [Enumerator] if no block is given
+      def each
+        return enum_for(:each) unless block_given?
+        PROVIDED_KEYS.each { |key| yield(key, public_send(key)) }
+      end
+      def id
+        @to_h[:id]
+      end
+      def title
+        @to_h[:title]
+      end
+      def description
+        return @description if defined?(@description)
+        return if url.to_s.empty? || @to_h[:description].to_s.empty?
+        @description ||= Html2rss::AttributePostProcessors::SanitizeHtml.get(@to_h[:description], url)
+      end
+      # @return [Addressable::URI, nil]
+      def url
+        @url ||= Html2rss::Utils.sanitize_url(@to_h[:url])
+      end
+      # @return [Addressable::URI, nil]
+      def image
+        @image ||= Html2rss::Utils.sanitize_url(@to_h[:image])
+      end
+      # Generates a unique identifier based on the URL and ID using CRC32.
+      # @return [String]
+      def guid
+        @guid ||= Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
+      end
+      # Parses and returns the published_at time.
+      # @return [Time, nil]
+      def published_at
+        return if (string = @to_h[:published_at].to_s).strip.empty?
+        @published_at ||= Time.parse(string)
+      rescue ArgumentError
+        nil
+      end
+      def scraper
+        @to_h[:scraper]
+      end
+      def <=>(other)
+        return nil unless other.is_a?(Article)
+        0 if other.all? { |key, value| value == public_send(key) ? public_send(key) <=> value : false }
+      end
+    end
+  end
+end
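
Note on `Article#guid` above: the identifier is a CRC32 checksum of the URL and ID joined by `#!/`, rendered in base 36. The sketch below reproduces only that computation, using Ruby's standard `zlib`. The helper name `article_guid` and the sample URL are hypothetical and not part of the gem.

```ruby
require 'zlib'

# Hypothetical helper mirroring the computation in Article#guid:
# join URL and ID, checksum with CRC32, render the integer in base 36.
def article_guid(url, id)
  Zlib.crc32([url, id].join('#!/')).to_s(36)
end

# The same inputs always yield the same short identifier.
first  = article_guid('https://example.com/posts/1', 'post-1')
second = article_guid('https://example.com/posts/1', 'post-1')
puts first == second

# A 32-bit checksum in base 36 never needs more than 7 characters.
puts first.length <= 7
```

CRC32 is a checksum, not a cryptographic hash, so collisions between unrelated articles are possible, if unlikely for a single feed.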

data/lib/html2rss/auto_source/channel.rb ADDED

@@ -0,0 +1,79 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Extracts channel information from
+    # 1. the HTML document's <head>.
+    # 2. the HTTP response
+    class Channel
+      ##
+      #
+      # @param parsed_body [Nokogiri::HTML::Document] The parsed HTML document.
+      # @param response [Faraday::Response] The URL of the HTML document.
+      def initialize(parsed_body, url:, response:, articles: [])
+        @parsed_body = parsed_body
+        @url = url
+        @response = response
+        @articles = articles
+      end
+      def url = extract_url
+      def title = extract_title
+      def language = extract_language
+      def description = extract_description
+      def image = extract_image
+      def ttl = extract_ttl
+      def last_build_date = response.headers['last-modified']
+      def generator
+        "html2rss V. #{::Html2rss::VERSION} (using auto_source scrapers: #{scraper_counts})"
+      end
+      private
+      attr_reader :parsed_body, :response
+      def extract_url
+        @url.normalize.to_s
+      end
+      def extract_title
+        parsed_body.at_css('head > title')&.text
+      end
+      def extract_language
+        return parsed_body['lang'] if parsed_body.name == 'html' && parsed_body['lang']
+        parsed_body.at_css('[lang]')&.[]('lang')
+      end
+      def extract_description
+        parsed_body.at_css('meta[name="description"]')&.[]('content') || ''
+      end
+      def extract_image
+        url = parsed_body.at_css('meta[property="og:image"]')&.[]('content')
+        Html2rss::Utils.sanitize_url(url) if url
+      end
+      def extract_ttl
+        ttl = response.headers['cache-control']&.match(/max-age=(\d+)/)&.[](1)
+        return unless ttl
+        ttl.to_i.fdiv(60).ceil
+      end
+      def scraper_counts
+        scraper_counts = +''
+        @articles.each_with_object(Hash.new(0)) { |article, counts| counts[article.scraper] += 1 }
+                 .each do |klass, count|
+          scraper_counts.concat("[#{klass.to_s.gsub('Html2rss::AutoSource::Scraper::', '')}=#{count}]")
+        end
+        scraper_counts
+      end
+    end
+  end
+end
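
Note on `extract_ttl` above: the channel TTL is derived from the `max-age` directive of the `cache-control` response header, converted from seconds to whole minutes and rounded up. Here is a small sketch of that conversion with plain strings in place of a response object. The method name `ttl_in_minutes` is an assumption made for illustration.

```ruby
# Hypothetical standalone version of the TTL computation in Channel#extract_ttl.
# Takes the raw cache-control header value (or nil) and returns minutes or nil.
def ttl_in_minutes(cache_control)
  seconds = cache_control&.match(/max-age=(\d+)/)&.[](1)
  return unless seconds

  # fdiv keeps the fraction so that ceil rounds partial minutes up.
  seconds.to_i.fdiv(60).ceil
end

p ttl_in_minutes('public, max-age=3600') # => 60
p ttl_in_minutes('max-age=90')           # => 2
p ttl_in_minutes('no-store')             # => nil
```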

data/lib/html2rss/auto_source/cleanup.rb ADDED

@@ -0,0 +1,76 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Cleanup is responsible for cleaning up the extracted articles.
+    # :reek:MissingSafeMethod { enabled: false }
+    # It applies various strategies to filter and refine the article list.
+    class Cleanup
+      class << self
+        def call(articles, url:, keep_different_domain: false)
+          Log.debug "Cleanup: start with #{articles.size} articles"
+          articles.select!(&:valid?)
+          remove_short!(articles, :title)
+          deduplicate_by!(articles, :url)
+          deduplicate_by!(articles, :title)
+          keep_only_http_urls!(articles)
+          reject_different_domain!(articles, url) unless keep_different_domain
+          Log.debug "Cleanup: end with #{articles.size} articles"
+          articles
+        end
+        private
+        ##
+        # Removes articles with short values for a given key.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param key [Symbol] The key to check for short values.
+        # @param min_words [Integer] The minimum number of words required.
+        def remove_short!(articles, key = :title, min_words: 2)
+          articles.reject! do |article|
+            value = article.public_send(key)
+            value.nil? || value.to_s.split.size < min_words
+          end
+        end
+        ##
+        # Deduplicates articles by a given key.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param key [Symbol] The key to deduplicate by.
+        def deduplicate_by!(articles, key)
+          seen = {}
+          articles.reject! do |article|
+            value = article.public_send(key)
+            value.nil? || seen.key?(value).tap { seen[value] = true }
+          end
+        end
+        ##
+        # Keeps only articles with HTTP or HTTPS URLs.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        def keep_only_http_urls!(articles)
+          articles.select! { |article| %w[http https].include?(article.url&.scheme) }
+        end
+        ##
+        # Rejects articles that have a URL not on the same domain as the source.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param base_url [Addressable::URI] The source URL to compare against.
+        def reject_different_domain!(articles, base_url)
+          base_host = base_url.host
+          articles.select! { |article| article.url&.host == base_host }
+        end
+      end
+    end
+  end
+end
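
Note on `deduplicate_by!` above: it keeps the first article seen for each value of a key and drops both later duplicates and articles whose value is `nil`. The sketch below shows the same `seen`-hash pattern on a throwaway `Entry` struct, which is a stand-in for `Article` and not part of the gem.

```ruby
# Stand-in for Article (assumption for this sketch).
Entry = Struct.new(:url, :title, keyword_init: true)

# Same pattern as in the diff: `tap` returns whether the value was already
# seen, and only afterwards marks it as seen.
def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end

entries = [
  Entry.new(url: 'https://example.com/a', title: 'First'),
  Entry.new(url: 'https://example.com/b', title: 'Second'),
  Entry.new(url: 'https://example.com/a', title: 'Duplicate of first'),
  Entry.new(url: nil, title: 'No URL')
]

deduplicate_by!(entries, :url)
p entries.map(&:title) # => ["First", "Second"]
```

Like `Array#reject!`, this mutates the array in place and returns `nil` when nothing was removed, so the sketch reads the array itself instead of the return value.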

data/lib/html2rss/auto_source/reducer.rb ADDED

@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Reducer is responsible for reducing the list of articles.
+    # It keeps only the longest attributes of articles with the same URL.
+    # It also filters out invalid articles.
+    class Reducer
+      class << self
+        def call(articles, **_options)
+          Log.debug "Reducer: inited with #{articles.size} articles"
+          reduce_by_keeping_longest_values(articles, keep: [:scraper]) { |article| article.url&.path }
+        end
+        private
+        # @param articles [Array<Article>]
+        # @return [Array<Article>] reduced articles
+        def reduce_by_keeping_longest_values(articles, keep:, &)
+          grouped_by_block = articles.group_by(&)
+          grouped_by_block.each_with_object([]) do |(_key, grouped_articles), result|
+            memo_object = {}
+            grouped_articles.each do |article_hash|
+              keep_longest_values(memo_object, article_hash, keep:)
+            end
+            result << Article.new(**memo_object)
+          end
+        end
+        def keep_longest_values(memo_object, article_hash, keep:)
+          article_hash.each do |key, value|
+            next if value.eql?(memo_object[key])
+            if keep.include?(key)
+              memo_object[key] ||= []
+              memo_object[key] << value
+            elsif value && value.to_s.size > memo_object[key].to_s.size
+              memo_object[key] = value
+            end
+          end
+        end
+      end
+    end
+  end
+end
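
Note on `keep_longest_values` above: when several scrapers find the same article, the reducer keeps the longest value for each attribute, while keys listed in `keep` (here `:scraper`) are collected into an array. The sketch below applies the same merge rule to plain hashes. The method name `merge_keeping_longest` and the sample data are assumptions, and the sketch skips the grouping by URL path and the final `Article.new` step.

```ruby
# Hypothetical standalone version of the merge rule in Reducer.
# For each key: collect values for keys in `keep`, else keep the longest value.
def merge_keeping_longest(hashes, keep: [])
  hashes.each_with_object({}) do |hash, memo|
    hash.each do |key, value|
      next if value.eql?(memo[key])

      if keep.include?(key)
        (memo[key] ||= []) << value
      elsif value && value.to_s.size > memo[key].to_s.size
        memo[key] = value
      end
    end
  end
end

from_schema   = { title: 'Short', description: nil, scraper: :schema }
from_semantic = { title: 'A longer title', description: 'Some text', scraper: :semantic_html }

merged = merge_keeping_longest([from_schema, from_semantic], keep: [:scraper])
p merged[:title]   # => "A longer title"
p merged[:scraper] # => [:schema, :semantic_html]
```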

data/lib/html2rss/auto_source/rss_builder.rb ADDED

@@ -0,0 +1,68 @@
+# frozen_string_literal: true
+require 'rss'
+module Html2rss
+  class AutoSource
+    ##
+    # Converts the autosourced channel and articles to an RSS feed.
+    class RssBuilder
+      def self.add_guid(article, maker)
+        maker.guid.tap do |guid|
+          guid.content = article.guid
+          guid.isPermaLink = false
+        end
+      end
+      def self.add_image(article, maker)
+        url = article.image || return
+        maker.enclosure.tap do |enclosure|
+          enclosure.url = url
+          enclosure.type = Html2rss::Utils.guess_content_type_from_url(url)
+          enclosure.length = 0
+        end
+      end
+      def initialize(channel:, articles:)
+        @channel = channel
+        @articles = articles
+      end
+      def call
+        RSS::Maker.make('2.0') do |maker|
+          make_channel(maker.channel)
+          make_items(maker)
+        end
+      end
+      private
+      attr_reader :channel, :articles
+      def make_channel(maker)
+        %i[language title description ttl].each do |key|
+          maker.public_send(:"#{key}=", channel.public_send(key))
+        end
+        maker.link = channel.url
+        maker.generator = channel.generator
+        maker.updated = channel.last_build_date
+      end
+      def make_items(maker)
+        articles.each do |article|
+          maker.items.new_item do |item_maker|
+            RssBuilder.add_guid(article, item_maker)
+            RssBuilder.add_image(article, item_maker)
+            item_maker.title = article.title
+            item_maker.description = article.description
+            item_maker.pubDate = article.published_at
+            item_maker.link = article.url
+          end
+        end
+      end
+    end
+  end
+end