RubyGems - html2rss - Versions diffs - 0.12.0 → 0.13.0 - Mend

html2rss 0.12.0 → 0.13.0

Files changed (32)

checksums.yaml +4 -4
data/README.md +38 -10
data/html2rss.gemspec +1 -0
data/lib/html2rss/attribute_post_processors/base.rb +9 -6
data/lib/html2rss/attribute_post_processors/gsub.rb +2 -2
data/lib/html2rss/attribute_post_processors/html_to_markdown.rb +2 -2
data/lib/html2rss/attribute_post_processors/markdown_to_html.rb +2 -2
data/lib/html2rss/attribute_post_processors/parse_time.rb +2 -2
data/lib/html2rss/attribute_post_processors/parse_uri.rb +2 -2
data/lib/html2rss/attribute_post_processors/sanitize_html.rb +13 -2
data/lib/html2rss/attribute_post_processors/substring.rb +3 -3
data/lib/html2rss/attribute_post_processors/template.rb +4 -4
data/lib/html2rss/auto_source/article.rb +95 -0
data/lib/html2rss/auto_source/channel.rb +79 -0
data/lib/html2rss/auto_source/cleanup.rb +76 -0
data/lib/html2rss/auto_source/reducer.rb +48 -0
data/lib/html2rss/auto_source/rss_builder.rb +68 -0
data/lib/html2rss/auto_source/scraper/schema/base.rb +61 -0
data/lib/html2rss/auto_source/scraper/schema.rb +122 -0
data/lib/html2rss/auto_source/scraper/semantic_html/extractor.rb +123 -0
data/lib/html2rss/auto_source/scraper/semantic_html/image.rb +54 -0
data/lib/html2rss/auto_source/scraper/semantic_html.rb +118 -0
data/lib/html2rss/auto_source/scraper.rb +33 -0
data/lib/html2rss/auto_source.rb +77 -0
data/lib/html2rss/cli.rb +10 -0
data/lib/html2rss/config/channel.rb +4 -2
data/lib/html2rss/config/selectors.rb +2 -2
data/lib/html2rss/item.rb +8 -2
data/lib/html2rss/utils.rb +5 -10
data/lib/html2rss/version.rb +1 -1
data/lib/html2rss.rb +21 -0
metadata +29 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: ebe536d8051a64c6e2adf9fa8e1d9d1f9fa3743541c44ca85022d0603f9032b2
-  data.tar.gz: 7b3aaa213aaf6a37fb6e94fa72c9936ffd2391322297553b253b097edea300cc
+  metadata.gz: 7a2bf557dd65533533e07b4581e195f2d2b32ff906831526a4d7aed27a558d71
+  data.tar.gz: f42e5f03649a08219d310a2545413c371f851530c4d323fd68ef783b4b3b5e13
 SHA512:
-  metadata.gz: '03985002d050b996c1dc315cbe8e3fc79b6619447a048ad3d2dca86f792eab5c2356716cf6198a24efc61de7e7ddceba2780da49c3e68a3c9efe895eb7cf0cf1'
-  data.tar.gz: 8315473528f46a5ba28297af296b879a66ac00f86ba9eb117b4e6c9ec61c285e4090cfd999ff712368f5b988b1cbda460e268aa3ea8928912bcdb1960ae25a4a
+  metadata.gz: 724a1fa8ab15ae140278eb9b055f22e7aad12e94627795f7a2f13c78f5421607e39d6ba040821b4c47b69f963cc0180bf8e964ff0b896403cb6305ed1d67dbb5
+  data.tar.gz: a06c2e16b0b51c6b6d2184430efc2a4e8b2812fee413163aa2991567e7608141f1c18189fdded58c8c3383940c4790478cd631abc6a1470ad648b2030fdefaab

data/README.md CHANGED

@@ -26,26 +26,40 @@ You can also install it as a dependency in your Ruby project:
 ## Generating a feed on the CLI
-Create a file called `my_config_file.yml` with this example content:
+### using automatic scraping
+html2rss offers an automatic scraping feature. Try it with:
+`html2rss auto https://unmatchedstyle.com/`
+### creating a feed config file and using it
+If the results are not to your satisfaction, you can create a feed config file.
+Create a file called `my_config_file.yml` with this sample content:
 ```yml
 channel:
-  url: https://stackoverflow.com/questions
+  url: https://unmatchedstyle.com
 selectors:
   items:
-    selector: "#hot-network-questions > ul > li"
+    selector: "article[id^='post-']"
   title:
-    selector: a
+    selector: h2
   link:
     selector: a
     extractor: href
+  description:
+    selector: ".post-content"
+    post_process:
+      - name: sanitize_html
 ```
-Build the RSS with: `html2rss feed ./my_config_file.yml`.
+Build the feed from this config with: `html2rss feed ./my_config_file.yml`.
 ## Generating a feed with Ruby
-Here's a minimal working example in Ruby:
+Here's a minimal working example using Ruby:
 ```ruby
 require 'html2rss'
@@ -481,7 +495,7 @@ feeds:
 Your feed configs go below `feeds`. Everything else is part of the global config.
-Find a full example of a `feeds.yml` at [`spec/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/feeds.test.yml).
+Find a full example of a `feeds.yml` at [`spec/fixtures/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/fixtures/feeds.test.yml).
 Now you can build your feeds like this:
@@ -583,8 +597,22 @@ Recommended further readings:
 ### Contributing
-1. Fork it ( <https://github.com/html2rss/html2rss/fork> )
+Find ideas for what to contribute in:
+1. <https://github.com/orgs/html2rss/discussions>
+2. the issues tracker: <https://github.com/html2rss/html2rss/issues>
+#### Development Helpers
+1. `bin/setup`: installs dependencies and sets up the development environment.
+2. `bin/guard`: automatically runs rspec, rubocop and reek when a file changes.
+3. for a modern Ruby development experience: install [`ruby-lsp`](https://github.com/Shopify/ruby-lsp) and integrate it to your IDE:
+   a. [Ruby in Visual Studio Code](https://code.visualstudio.com/docs/languages/ruby)
+#### How to submit changes
+1. Fork this repo ( <https://github.com/html2rss/html2rss/fork> )
 2. Create your feature branch (`git checkout -b my-new-feature`)
-3. Commit your changes (`git commit -am 'Add some feature'`)
+3. Implement and commit your changes (`git commit -am 'feat: add XYZ'`)
 4. Push to the branch (`git push origin my-new-feature`)
-5. Create a new Pull Request
+5. Create a new Pull Request using the GitHub web UI

data/html2rss.gemspec CHANGED

@@ -38,6 +38,7 @@ Gem::Specification.new do |spec|
   spec.add_dependency 'kramdown'
   spec.add_dependency 'mime-types', '> 3.0'
   spec.add_dependency 'nokogiri', '>= 1.10', '< 2.0'
+  spec.add_dependency 'parallel'
   spec.add_dependency 'regexp_parser'
   spec.add_dependency 'reverse_markdown', '~> 2.0'
   spec.add_dependency 'rss'

data/lib/html2rss/attribute_post_processors/base.rb CHANGED

@@ -26,17 +26,20 @@ module Html2rss
       # @param value [Object] the value to check
       # @param types [Array<Class>, Class] the expected type(s)
       # @param name [String] the name of the option being checked
+      # @param context [Item::Context] the context
       # @raise [InvalidType] if the value is not of the expected type(s)
-      def self.assert_type(value, types = [], name)
+      def self.assert_type(value, types = [], name, context:)
         types = [types] unless types.is_a?(Array)
         return if types.any? { |type| value.is_a?(type) }
-        error_message_template = 'The type of `%s` must be %s, but is: %s'
-        raise InvalidType, format(error_message_template, name, types.join(' or '), value.class), [], cause: nil
-      end
+        options = context[:options] if context.is_a?(Hash)
+        options ||= { file: File.basename(caller_locations(1, 1).first.absolute_path) }
-      # private_class_method :expect_options, :assert_type
+        raise InvalidType, format('The type of `%<name>s` must be %<types>s, but is: %<type>s in: %<options>s',
+                                  name:, types: types.join(' or '), type: value.class, options: options.inspect),
+              [], cause: nil
+      end
       ##
       # This method validates the arguments passed to the post processor. Must be implemented by subclasses.
@@ -51,7 +54,7 @@ module Html2rss
       def initialize(value, context)
         klass = self.class
         # TODO: get rid of Hash
-        klass.assert_type(context, [Item::Context, Hash], 'context')
+        klass.assert_type(context, [Item::Context, Hash], 'context', context:)
         klass.validate_args!(value, context)
         @value = value
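
The new `context:` keyword lets `assert_type` name the offending options in its error message. A minimal standalone sketch of that behaviour follows; it assumes a stand-in `InvalidType` class, defines the method at top level rather than on the gem's `Base` class, and simplifies the fallback (the gem falls back to the caller's file name, this sketch to an empty hash).

```ruby
# Hypothetical stand-in for the gem's own error class.
class InvalidType < StandardError; end

# Sketch of the 0.13.0 signature: an optional positional argument sits between
# two required ones, and `context:` is a required keyword.
def assert_type(value, types = [], name, context:)
  types = [types] unless types.is_a?(Array)
  return if types.any? { |type| value.is_a?(type) }

  # Only Hash contexts are inspected for options, as in the diff above.
  options = context[:options] if context.is_a?(Hash)
  options ||= {} # assumption: simplified fallback for this sketch

  raise InvalidType, format('The type of `%<name>s` must be %<types>s, but is: %<type>s in: %<options>s',
                            name: name, types: types.join(' or '), type: value.class, options: options.inspect)
end

begin
  assert_type(42, [String, Symbol], :value, context: { options: { start: 0 } })
rescue InvalidType => e
  puts e.message
end
```

Passing the context to every call site is what produces the many two-line changes in the post processor files that follow.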

data/lib/html2rss/attribute_post_processors/gsub.rb CHANGED

@@ -27,9 +27,9 @@ module Html2rss
     # See the doc on [String#gsub](https://ruby-doc.org/core/String.html#method-i-gsub) for more info.
     class Gsub < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         expect_options(%i[replacement pattern], context)
-        assert_type context.dig(:options, :replacement), [String, Hash], :replacement
+        assert_type context.dig(:options, :replacement), [String, Hash], :replacement, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/html_to_markdown.rb CHANGED

@@ -27,8 +27,8 @@ module Html2rss
     # Would return:
     #    'Lorem **ipsum** dolor'
     class HtmlToMarkdown < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/markdown_to_html.rb CHANGED

@@ -33,8 +33,8 @@ module Html2rss
     #
     #    <p>Price: 12.34</p>
     class MarkdownToHtml < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
       end
       ##

data/lib/html2rss/attribute_post_processors/parse_time.rb CHANGED

@@ -27,8 +27,8 @@ module Html2rss
     # It uses `Time.parse`.
     class ParseTime < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
-        assert_type context[:config].time_zone, String, :time_zone
+        assert_type(value, String, :value, context:)
+        assert_type(context[:config].time_zone, String, :time_zone, context:)
       end
       ##

data/lib/html2rss/attribute_post_processors/parse_uri.rb CHANGED

@@ -25,8 +25,8 @@ module Html2rss
       def self.validate_args!(value, context)
         url_types = [String, URI::HTTP, Addressable::URI].freeze
-        assert_type(value, url_types, :value)
-        assert_type(context.config.url, url_types, :url)
+        assert_type(value, url_types, :value, context:)
+        assert_type(context.config.url, url_types, :url, context:)
         raise ArgumentError, 'The `value` option is missing or empty.' if value.to_s.empty?
       end

data/lib/html2rss/attribute_post_processors/sanitize_html.rb CHANGED

@@ -39,8 +39,19 @@ module Html2rss
     # Would return:
     #    '<p>Lorem <b>ipsum</b> dolor ...</p>'
     class SanitizeHtml < Base
-      def self.validate_args!(value, _context)
-        assert_type value, String, :value
+      def self.validate_args!(value, context)
+        assert_type value, String, :value, context:
+      end
+      ##
+      # Shorthand method to get the sanitized HTML.
+      # @param html [String]
+      # @param url [String, Addressable::URI]
+      def self.get(html, url)
+        raise ArgumentError, 'url must be a String or Addressable::URI' if url.to_s.empty?
+        return nil if html.to_s.empty?
+        new(html, { config: Config::Channel.new({ url: }) }).get
       end
       ##

data/lib/html2rss/attribute_post_processors/substring.rb CHANGED

@@ -30,13 +30,13 @@ module Html2rss
     #    'bar'
     class Substring < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         options = context[:options]
-        assert_type options[:start], Integer, :start
+        assert_type options[:start], Integer, :start, context:
         end_index = options[:end]
-        assert_type end_index, Integer, :end if end_index
+        assert_type(end_index, Integer, :end, context:) if end_index
       end
       ##

data/lib/html2rss/attribute_post_processors/template.rb CHANGED

@@ -33,7 +33,7 @@ module Html2rss
     #    'Product (23,42€)'
     class Template < Base
       def self.validate_args!(value, context)
-        assert_type value, String, :value
+        assert_type value, String, :value, context:
         string = context[:options]&.dig(:string).to_s
         raise InvalidType, 'The `string` template is absent.' if string.empty?
@@ -74,9 +74,9 @@ module Html2rss
       # @return [String]
       # @deprecated Use %<id>s formatting instead. Will be removed in version 1.0.0. See README / Dynamic parameters.
       def format_string_with_methods
-        warn '[DEPRECATION] This method of using params is deprecated and \
-              support for it will be removed in version 1.0.0.\
-              Please use dynamic parameters (i.e. %<id>s, see README.md) instead.'
+        Log.warn '[DEPRECATION] This method of using params is deprecated and \
+                  support for it will be removed in version 1.0.0.\
+                  Please use dynamic parameters (i.e. %<id>s, see README.md) instead.'
         string % methods
       end

data/lib/html2rss/auto_source/article.rb ADDED

@@ -0,0 +1,95 @@
+# frozen_string_literal: true
+require 'zlib'
+require 'sanitize'
+module Html2rss
+  class AutoSource
+    ##
+    # Article is a simple data object representing an article extracted from a page.
+    # It is enumerable and responds to all keys specified in PROVIDED_KEYS.
+    class Article
+      include Enumerable
+      include Comparable
+      PROVIDED_KEYS = %i[id title description url image guid published_at scraper].freeze
+      # @param options [Hash<Symbol, String>]
+      def initialize(**options)
+        @to_h = {}
+        options.each_pair { |key, value| @to_h[key] = value.freeze if value }
+        @to_h.freeze
+        return unless (unknown_keys = options.keys - PROVIDED_KEYS).any?
+        Log.warn "Article: unknown keys found: #{unknown_keys.join(', ')}"
+      end
+      # Checks if the article is valid based on the presence of URL, ID, and either title or description.
+      # @return [Boolean] True if the article is valid, otherwise false.
+      def valid?
+        !url.to_s.empty? && (!title.to_s.empty? || !description.to_s.empty?) && !id.to_s.empty?
+      end
+      # @yield [key, value]
+      # @return [Enumerator] if no block is given
+      def each
+        return enum_for(:each) unless block_given?
+        PROVIDED_KEYS.each { |key| yield(key, public_send(key)) }
+      end
+      def id
+        @to_h[:id]
+      end
+      def title
+        @to_h[:title]
+      end
+      def description
+        return @description if defined?(@description)
+        return if url.to_s.empty? || @to_h[:description].to_s.empty?
+        @description ||= Html2rss::AttributePostProcessors::SanitizeHtml.get(@to_h[:description], url)
+      end
+      # @return [Addressable::URI, nil]
+      def url
+        @url ||= Html2rss::Utils.sanitize_url(@to_h[:url])
+      end
+      # @return [Addressable::URI, nil]
+      def image
+        @image ||= Html2rss::Utils.sanitize_url(@to_h[:image])
+      end
+      # Generates a unique identifier based on the URL and ID using CRC32.
+      # @return [String]
+      def guid
+        @guid ||= Zlib.crc32([url, id].join('#!/')).to_s(36).encode('utf-8')
+      end
+      # Parses and returns the published_at time.
+      # @return [Time, nil]
+      def published_at
+        return if (string = @to_h[:published_at].to_s).strip.empty?
+        @published_at ||= Time.parse(string)
+      rescue ArgumentError
+        nil
+      end
+      def scraper
+        @to_h[:scraper]
+      end
+      def <=>(other)
+        return nil unless other.is_a?(Article)
+        0 if other.all? { |key, value| value == public_send(key) ? public_send(key) <=> value : false }
+      end
+    end
+  end
+end
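
`Article#guid` derives a compact identifier from the URL and ID. The expression can be tried in isolation with only the standard library; `guid_for` is a hypothetical helper name used for this sketch, not part of the gem.

```ruby
require 'zlib'

# Hypothetical helper mirroring the expression in Article#guid:
# join URL and ID with '#!/', take the CRC32 checksum, render it in base 36.
def guid_for(url, id)
  Zlib.crc32([url, id].join('#!/')).to_s(36)
end

first  = guid_for('https://example.com/posts/1', 'post-1')
second = guid_for('https://example.com/posts/1', 'post-2')

puts first            # a short lowercase alphanumeric string
puts first == second  # false: the two inputs differ in a single byte
```

CRC32 is a checksum, not a cryptographic hash, so collisions are possible across large sets of articles; the sketch only shows that the value is stable for identical input.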

data/lib/html2rss/auto_source/channel.rb ADDED

@@ -0,0 +1,79 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Extracts channel information from
+    # 1. the HTML document's <head>.
+    # 2. the HTTP response
+    class Channel
+      ##
+      #
+      # @param parsed_body [Nokogiri::HTML::Document] The parsed HTML document.
+      # @param response [Faraday::Response] The HTTP response for the HTML document.
+      def initialize(parsed_body, url:, response:, articles: [])
+        @parsed_body = parsed_body
+        @url = url
+        @response = response
+        @articles = articles
+      end
+      def url = extract_url
+      def title = extract_title
+      def language = extract_language
+      def description = extract_description
+      def image = extract_image
+      def ttl = extract_ttl
+      def last_build_date = response.headers['last-modified']
+      def generator
+        "html2rss V. #{::Html2rss::VERSION} (using auto_source scrapers: #{scraper_counts})"
+      end
+      private
+      attr_reader :parsed_body, :response
+      def extract_url
+        @url.normalize.to_s
+      end
+      def extract_title
+        parsed_body.at_css('head > title')&.text
+      end
+      def extract_language
+        return parsed_body['lang'] if parsed_body.name == 'html' && parsed_body['lang']
+        parsed_body.at_css('[lang]')&.[]('lang')
+      end
+      def extract_description
+        parsed_body.at_css('meta[name="description"]')&.[]('content') || ''
+      end
+      def extract_image
+        url = parsed_body.at_css('meta[property="og:image"]')&.[]('content')
+        Html2rss::Utils.sanitize_url(url) if url
+      end
+      def extract_ttl
+        ttl = response.headers['cache-control']&.match(/max-age=(\d+)/)&.[](1)
+        return unless ttl
+        ttl.to_i.fdiv(60).ceil
+      end
+      def scraper_counts
+        scraper_counts = +''
+        @articles.each_with_object(Hash.new(0)) { |article, counts| counts[article.scraper] += 1 }
+                 .each do |klass, count|
+          scraper_counts.concat("[#{klass.to_s.gsub('Html2rss::AutoSource::Scraper::', '')}=#{count}]")
+        end
+        scraper_counts
+      end
+    end
+  end
+end
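
`Channel#extract_ttl` turns the `max-age` directive of the `Cache-Control` header (seconds) into the RSS `ttl` value (minutes), rounding up. The following sketch reproduces that conversion on a plain string; `ttl_minutes` is a hypothetical name, and the header value is passed in directly instead of being read from a Faraday response.

```ruby
# Hypothetical helper mirroring Channel#extract_ttl.
# @param cache_control [String, nil] raw Cache-Control header value
# @return [Integer, nil] ttl in minutes, or nil when no max-age is present
def ttl_minutes(cache_control)
  seconds = cache_control&.match(/max-age=(\d+)/)&.[](1)
  return unless seconds

  # fdiv keeps the fraction so that ceil can round partial minutes up.
  seconds.to_i.fdiv(60).ceil
end

p ttl_minutes('public, max-age=3600') # => 60
p ttl_minutes('max-age=90')           # => 2
p ttl_minutes('no-store')             # => nil
```

Rounding up means a 90-second cache lifetime is advertised as a 2-minute `ttl`.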

data/lib/html2rss/auto_source/cleanup.rb ADDED

@@ -0,0 +1,76 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Cleanup is responsible for cleaning up the extracted articles.
+    # :reek:MissingSafeMethod { enabled: false }
+    # It applies various strategies to filter and refine the article list.
+    class Cleanup
+      class << self
+        def call(articles, url:, keep_different_domain: false)
+          Log.debug "Cleanup: start with #{articles.size} articles"
+          articles.select!(&:valid?)
+          remove_short!(articles, :title)
+          deduplicate_by!(articles, :url)
+          deduplicate_by!(articles, :title)
+          keep_only_http_urls!(articles)
+          reject_different_domain!(articles, url) unless keep_different_domain
+          Log.debug "Cleanup: end with #{articles.size} articles"
+          articles
+        end
+        private
+        ##
+        # Removes articles with short values for a given key.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param key [Symbol] The key to check for short values.
+        # @param min_words [Integer] The minimum number of words required.
+        def remove_short!(articles, key = :title, min_words: 2)
+          articles.reject! do |article|
+            value = article.public_send(key)
+            value.nil? || value.to_s.split.size < min_words
+          end
+        end
+        ##
+        # Deduplicates articles by a given key.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param key [Symbol] The key to deduplicate by.
+        def deduplicate_by!(articles, key)
+          seen = {}
+          articles.reject! do |article|
+            value = article.public_send(key)
+            value.nil? || seen.key?(value).tap { seen[value] = true }
+          end
+        end
+        ##
+        # Keeps only articles with HTTP or HTTPS URLs.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        def keep_only_http_urls!(articles)
+          articles.select! { |article| %w[http https].include?(article.url&.scheme) }
+        end
+        ##
+        # Rejects articles that have a URL not on the same domain as the source.
+        #
+        # @param articles [Array<Article>] The list of articles to process.
+        # @param base_url [Addressable::URI] The source URL to compare against.
+        def reject_different_domain!(articles, base_url)
+          base_host = base_url.host
+          articles.select! { |article| article.url&.host == base_host }
+        end
+      end
+    end
+  end
+end
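
Two of the `Cleanup` steps, dropping short titles and deduplicating by a key, can be exercised without the gem. This sketch copies the bodies of `remove_short!` and `deduplicate_by!` to top-level methods and uses a hypothetical `SketchArticle` struct in place of `Article`.

```ruby
# Hypothetical stand-in for Html2rss::AutoSource::Article.
SketchArticle = Struct.new(:title, :url, keyword_init: true)

# Removes entries whose value for `key` is nil or has fewer than `min_words` words.
def remove_short!(articles, key = :title, min_words: 2)
  articles.reject! do |article|
    value = article.public_send(key)
    value.nil? || value.to_s.split.size < min_words
  end
end

# Keeps the first entry per value of `key`; entries with a nil value are dropped.
def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    # key? is evaluated first; tap then records the value and returns that result.
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end

articles = [
  SketchArticle.new(title: 'First real headline', url: '/a'),
  SketchArticle.new(title: 'Duplicate of first',  url: '/a'),
  SketchArticle.new(title: 'More',                url: '/b'),
  SketchArticle.new(title: 'Second real headline', url: '/c')
]

remove_short!(articles, :title)
deduplicate_by!(articles, :url)
p articles.map(&:url) # => ["/a", "/c"]
```

Both methods mutate the array in place, which matches how `Cleanup.call` chains them and returns `articles` at the end.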

data/lib/html2rss/auto_source/reducer.rb ADDED

@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+module Html2rss
+  class AutoSource
+    ##
+    # Reducer is responsible for reducing the list of articles.
+    # It keeps only the longest attributes of articles with the same URL.
+    # It also filters out invalid articles.
+    class Reducer
+      class << self
+        def call(articles, **_options)
+          Log.debug "Reducer: inited with #{articles.size} articles"
+          reduce_by_keeping_longest_values(articles, keep: [:scraper]) { |article| article.url&.path }
+        end
+        private
+        # @param articles [Array<Article>]
+        # @return [Array<Article>] reduced articles
+        def reduce_by_keeping_longest_values(articles, keep:, &)
+          grouped_by_block = articles.group_by(&)
+          grouped_by_block.each_with_object([]) do |(_key, grouped_articles), result|
+            memo_object = {}
+            grouped_articles.each do |article_hash|
+              keep_longest_values(memo_object, article_hash, keep:)
+            end
+            result << Article.new(**memo_object)
+          end
+        end
+        def keep_longest_values(memo_object, article_hash, keep:)
+          article_hash.each do |key, value|
+            next if value.eql?(memo_object[key])
+            if keep.include?(key)
+              memo_object[key] ||= []
+              memo_object[key] << value
+            elsif value && value.to_s.size > memo_object[key].to_s.size
+              memo_object[key] = value
+            end
+          end
+        end
+      end
+    end
+  end
+end
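
The merge rule in `Reducer` is: group by a key, keep the longest value per attribute, and collect the values of the `keep` attributes into an array. The sketch below applies the same rule to plain hashes; `reduce_by_path` and the `:path` attribute are assumptions made for illustration, since the gem groups `Article` objects by `article.url&.path`.

```ruby
# Mirrors Reducer.keep_longest_values, operating on plain hashes.
def keep_longest_values(memo, attributes, keep:)
  attributes.each do |key, value|
    next if value.eql?(memo[key])

    if keep.include?(key)
      # Attributes listed in `keep` are accumulated rather than compared.
      (memo[key] ||= []) << value
    elsif value && value.to_s.size > memo[key].to_s.size
      memo[key] = value
    end
  end
  memo
end

# Hypothetical driver: groups rows by :path and merges each group into one hash.
def reduce_by_path(rows)
  rows.group_by { |row| row[:path] }.map do |_path, group|
    group.each_with_object({}) { |row, memo| keep_longest_values(memo, row, keep: [:scraper]) }
  end
end

rows = [
  { path: '/a', title: 'Short',               description: nil,    scraper: :schema },
  { path: '/a', title: 'A much longer title', description: 'Text', scraper: :semantic_html },
  { path: '/b', title: 'Other',               scraper: :schema }
]

reduce_by_path(rows).each { |merged| p merged }
```

Note that "longest" is measured on `to_s`, so the rule favours more text regardless of which scraper produced it.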

data/lib/html2rss/auto_source/rss_builder.rb ADDED

@@ -0,0 +1,68 @@
+# frozen_string_literal: true
+require 'rss'
+module Html2rss
+  class AutoSource
+    ##
+    # Converts the autosourced channel and articles to an RSS feed.
+    class RssBuilder
+      def self.add_guid(article, maker)
+        maker.guid.tap do |guid|
+          guid.content = article.guid
+          guid.isPermaLink = false
+        end
+      end
+      def self.add_image(article, maker)
+        url = article.image || return
+        maker.enclosure.tap do |enclosure|
+          enclosure.url = url
+          enclosure.type = Html2rss::Utils.guess_content_type_from_url(url)
+          enclosure.length = 0
+        end
+      end
+      def initialize(channel:, articles:)
+        @channel = channel
+        @articles = articles
+      end
+      def call
+        RSS::Maker.make('2.0') do |maker|
+          make_channel(maker.channel)
+          make_items(maker)
+        end
+      end
+      private
+      attr_reader :channel, :articles
+      def make_channel(maker)
+        %i[language title description ttl].each do |key|
+          maker.public_send(:"#{key}=", channel.public_send(key))
+        end
+        maker.link = channel.url
+        maker.generator = channel.generator
+        maker.updated = channel.last_build_date
+      end
+      def make_items(maker)
+        articles.each do |article|
+          maker.items.new_item do |item_maker|
+            RssBuilder.add_guid(article, item_maker)
+            RssBuilder.add_image(article, item_maker)
+            item_maker.title = article.title
+            item_maker.description = article.description
+            item_maker.pubDate = article.published_at
+            item_maker.link = article.url
+          end
+        end
+      end
+    end
+  end
+end