RubyGems - feedstock - Versions diffs - 0.2.0 → 0.4.0 - Mend

feedstock 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
-  data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
+  metadata.gz: f1a02c229edb1b2d7c98904d6263aab47cfe5ef4d605c5a3c78ec412c1bb2083
+  data.tar.gz: f0c35d3a675eeb01cbbc73952b85f484df631459a1542161d2b198b8c3b1ccf8
 SHA512:
-  metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
-  data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
+  metadata.gz: b31805cfc5c8aedaabf286f2a76b269df4883daf24850121a94edb349a219c1d7177e161df20a434d07b17eed3bd129c7529d94093559fc5520fe19bc0dc2b45
+  data.tar.gz: b47eac95bda32a4a5a7a7a4a904d4baeb6b1055a18e39f09d1d2ed38c1c7b0a6ab9d0672db4f614b210d5bb1969afbd3de962292d6ccf750895e00f6a4b13d6c

data/README.md CHANGED

@@ -5,25 +5,30 @@
 [gem-badge]: https://badge.fury.io/rb/feedstock.svg
 [gem-link]: https://rubygems.org/gems/feedstock
-Feedstock is a Ruby library for extracting information from a webpage and
-converting it into an Atom feed.
+Feedstock is a Ruby library for extracting information from an HTML/XML document
+and inserting it into an ERB template. Its primary purpose is to create a feed
+for a webpage that doesn't offer one.
 ## Rationale
-Feeds are great. But sometimes a website doesn't provide a feed or doesn't
-provide a feed for the specific content that you want. That's where Feedstock
-can help.
+I love RSS feeds.
-Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
-URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
-extract and transform the data it finds on the webpage.
+That's why I think it's a shame not every website has a feed. However, even when
+a website does have a feed, sometimes it doesn't include quite the mix
+information that I want. I made Feedstock to solve those two problems.
+Feedstock is a Ruby library that you can use to create an Atom or RSS feed. It
+requires a URL to a document and a hash of rules. The rules tell Feedstock how
+to extract and transform the data found on the webpage. That data is stuffed
+into a hash and then run through an ERB template. Feedstock comes with a
+template but you can use your own, too.
 ## Example
-The [feeds.inqk.net repository][example] includes an example of how the Feedstock
-library can be used in practice.
+The [feeds.inqk.net repository][example] includes an example of how the
+Feedstock library can be used in practice.
-[example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+[example]: https://github.com/pyrmont/feeds.inqk.net/
 "An example of using the Feedstock library"
 ## Installation
@@ -36,169 +41,42 @@ $ gem install feedstock
 ## Usage
-Feedstock extracts information from a given document using a collection of
-_rules_.
-A collection of rules is expressed as a hash. The hash has two mandatory keys
-and one optional key.
-### Info
-The `:info` key is mandatory. It must be associated with a hash. In this
-README, this hash is referred to as the _info hash_.
-#### Keys
-The keys in the info hash should be symbols, not strings. When used with the
-default template, Feedstock will use the key as the name of the XML entity in
-the resulting feed. For example, if the key is `:id`, the XML entity in the
-resulting feed will be `<id>`.
-#### Values
-The value associated with each key in the info hash can be either a string or a
-hash.
-##### String
-If the value is a string, this defines a path to a node in the document. The
-path is expressed using CSS's selector syntax. Although a CSS selector can match
-more than one node, when used in the info hash, a path will only match the first
-matching node in the document.
-##### Hash
-If the value is a hash, this is a _data hash_. A data hash defines the rules
-that Feedstock uses to extract data. It must contain one of two keys:
-- `:literal`: The value associated with this key is used for the content of the
-  XML entity. This can be useful for elements that are not on the page or that
-  don't change.
-- `:path`: The path to the node in the document expressed in CSS's selector
-  syntax.  As noted above, if the value of a key in the info hash is a string,
-  this is treated as a path. The reason to use a data hash with a `:path` key
-  is when using one or more of the keys below. In the info hash, a path matches
-  only the first matching node in the document.
-The following keys may also be defined in a data hash:
-- `:content`: The default is `nil`. The `:content` key can be set to
-  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
-  value is `"inner_html"`, Feedstock will extract the content of the node as
-  HTML. If the value is an attribute hash, Feedstock will extract the value of
-  that attribute. This is important for links, where the link itself is
-  typically the content of the `href` attribute rather than the content of the
-  `<a>` element. For all other values, the plaintext content of the node is
-  extracted.
-- `:processor`: The default is `nil`. The `:processor` key can be set to a
-  lambda function that takes two arguments. The first is the extracted content,
-  the second is the rule being processed. The content extracted by Feedstock for
-  the given path is processed by the processor.
-- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
-  the prefix is appended to the beginning of the content extracted.
-- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
-  the suffix is appended to the end of the content extracted.
-- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
-  If the value is `"datetime"`, the content is parsed by the [Timeliness
-  library][Timeliness] to return a string. If the value is `"cdata"`, the
-  content is wrapped in `<![CDATA[` and `]]>` tags.
+Feedstock extracts information from a document at a given _URL_ using a
+collection of _rules_. The feed is generated by calling `Feedstock.feed` as
+below:
-[Timeliness]: https://github.com/adzap/timeliness "The official repository for
-the Timeliness library"
+```ruby
+# Define the URL
+url = "https://example.org"
-#### Formatting Order
+# Define the rules
+rules = { info: { id: url,
+                  title: Feedstock::Extract.new(selector: "div.title"),
+                  updated: Feedstock::Extract.new(selector: "span.date") },
-The order for formatting content is: extract, process, wrapping.
+          entry: { id: Feedstock::Extract.new(selector: "a", content: { attribute: "href" }),
+                   title: Feedstock::Extract.new(selector: "h2"),
+                   updated: Feedstock::Extract.new(selector: "span.date"),
+                   author: Feedstock::Extract.new(selector: "span.byline"),
+                   link: Feedstock::Extract.new(selector: "a", content: { attribute: "href" }),
+                   summary: Feedstock::Extract.new(selector: "div.summary") },
-### Entry
+          entries: Feedstock::Extract.new(selector: "div.story") }
-The `:entry` key is mandatory. It must be associated with a hash. In this
-README, this hash is referred to as the _entry hash_.
+# Using the default format and template
+Feedstock.feed url, rules
-#### Keys
-The keys in the entry hash should be symbols, not strings. When used with the
-default template, Feedstock will use the key as the name of the XML entity in
-the resulting feed. For example, if the key is `"id"`, the XML entity in the
-resulting feed will be `<id>`.
-#### Values
-The value associated with each key in the entry hash can be either a string or a
-hash.
-##### String
-If the value is a string, this defines a path to a node in the document. The
-path is expressed using CSS's selector syntax. Unlike with the info hash, a
-the CSS selector will match all nodes.
-##### Hash
-If the value is a hash, this is a _data hash_. A data hash defines the
-rules that Feedstock uses to extract data. It must contain one of two keys:
-- `:literal`: The value associated with this key is used for the content of the
-  XML entity. This can be useful for elements that are not on the page or that
-  don't change.
-- `:path`: The path to the node in the document expressed in CSS's selector
-  syntax. Unlike with the info hash, the CSS selector will match all nodes.
-The following keys may also be defined in a data hash:
-- `:content`: The default is `nil`. The `:content` key can be set to
-  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
-  value is `"inner_html"`, Feedstock will extract the content of the node as
-  HTML. If the value is an attribute hash, Feedstock will extract the value of
-  that attribute. This is important for links, where the link itself is
-  typically the content of the `href` attribute rather than the content of the
-  `<a>` element. For all other values, the plaintext content of the node is
-  extracted.
-- `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
-  use the content provided by either `:literal` or `:path` repeatedly. Since
-  the value of `:literal` implies `:repeat`, it is not necessary to specify it
-  expressly.
-- `:processor`: The default is `nil`. The `:processor` key can be set to a
-  lambda function that takes two arguments. The first is the extracted content,
-  the second is the rule being processed. The content extracted by Feedstock for
-  the given path is processed by the processor.
-- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
-  the prefix is appended to the beginning of the content extracted.
-- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
-  the suffix is appended to the end of the content extracted.
-- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
-  If the value is `"datetime"`, the content is parsed by the [Timeliness
-  library][Timeliness] to return a string. If the value is `"cdata"`, the
-  content is wrapped in `<![CDATA[` and `]]>` tags.
-### Entries
-The `:entries` key is optional. It can be associated with a hash. In this
-README, this hash is referred to as the _entries hash_.
-The entries hash is offered as a convenience. It allows a user to simplify
-the paths used in the entry hash by omitting a reference to the node
-containing the entries.
+# Using the XML format and a user-specified template
+Feedstock.feed url, rules, :xml, "podcast.xml"
+```
-If an entries hash is provided, it must contain the following key:
+More information is available in [api.md].
-- `:path`: The path to the node in the document expressed in CSS's selector
-  syntax. This path is used as the root for the paths in the entry hash.
+[api.md]: https://github.com/pyrmont/feedstock/blob/master/api.md
 ## Bugs
-Found a bug? I'd love to know about it. The best way is to report them in the
+Found a bug? I'd love to know about it. The best way is to report it in the
 [Issues section][ghi] on GitHub.
 [ghi]: https://github.com/pyrmont/feedstock/issues
@@ -211,7 +89,6 @@ Feedstock uses [Semantic Versioning 2.0.0][sv2].
 ## Licence
-Feedstock is released into the public domain. See [LICENSE.md][lc] for more
-details.
+Feedstock is released into the public domain. See [LICENSE][] for more details.
-[lc]: https://github.com/pyrmont/feedstock/blob/master/LICENSE.md
+[LICENSE]: https://github.com/pyrmont/feedstock/blob/master/LICENSE
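
Note on the README change above: the 0.2.0 hash-based rules are replaced by `Feedstock::Extract` objects, and the `lib/feedstock.rb` changes in this diff suggest that a bare string in a rule is now treated as a literal value rather than as a CSS path. The following is a minimal sketch of how an old rule might be translated, assuming that reading of the diff. `ExtractSketch` and `convert_rule` are hypothetical names used only for illustration; they are not part of the gem, and the mapping of `:repeat` to `absolute` is inferred from the branches each replaces.

```ruby
# Hypothetical stand-in for Feedstock::Extract (same members as in the diff),
# so that this sketch runs without the gem installed.
ExtractSketch = Struct.new(:selector, :absolute, :content, :processor, :prefix,
                           :suffix, :type, :filter, keyword_init: true)

# Hypothetical helper: translate one 0.2.0-style rule into a 0.4.0-style rule.
def convert_rule(old)
  # In 0.2.0 a bare string was a CSS path; in 0.4.0 a bare string is a literal,
  # so the path has to be wrapped in an extract object.
  return ExtractSketch.new(selector: old) if old.is_a?(String)

  # In 0.2.0 a literal lived under :literal; in 0.4.0 it is the string itself.
  return old[:literal] if old.key?(:literal)

  ExtractSketch.new(selector: old[:path],
                    absolute: old[:repeat],   # inferred: same branch in the code
                    content: old[:content],
                    processor: old[:processor],
                    prefix: old[:prepend],    # the 0.2.0 code read :prepend
                    suffix: old[:append],     # and :append
                    type: old[:type])
end

title = convert_rule("div.title")
feed_id = convert_rule({ literal: "https://example.org" })
link = convert_rule({ path: "a", content: { attribute: "href" }, repeat: true })

puts title.selector
puts feed_id
puts link.absolute
```

With the real gem, `ExtractSketch.new(...)` would presumably become `Feedstock::Extract.new(...)`, as in the README example in this diff.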

data/feedstock.gemspec CHANGED

@@ -9,12 +9,15 @@ Gem::Specification.new do |s|
   s.email = ["mike@inqk.net"]
   s.summary = "A library for creating RSS feeds from webpages"
   s.description = <<-desc.strip.gsub(/\s+/, " ")
-    Feedstock is a library for extracting information from a webpage and
-    transforming it into an Atom feed.
+    Feedstock is a Ruby library for extracting information from an HTML/XML
+    document and inserting it into an ERB template.
   desc
   s.homepage = "https://github.com/pyrmont/feedstock/"
   s.licenses = "Unlicense"
   s.required_ruby_version = ">= 2.7"
+  s.metadata = {
+    "documentation_uri" => "https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md"
+  }
   s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
                 "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]

data/lib/feedstock/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Feedstock
-  VERSION = "0.2.0"
+  VERSION = "0.4.0"
 end

data/lib/feedstock.rb CHANGED

@@ -6,148 +6,156 @@ require "open-uri"
 require "timeliness"
 module Feedstock
-  def self.feed(url, rules, template_file = "#{__dir__}/../default.xml")
-    rules   = normalise_rules rules
-    page    = download_page url
-    info    = extract_info page, rules
-    entries = extract_entries page, rules
-    feed    = create_feed info, entries, template_file
-    feed
-  end
+  class Extract < Struct.new("Extract", :selector, :absolute, :content, :processor, :prefix,
+                             :suffix, :type, :filter, keyword_init: true); end
-  def self.create_feed(info, entries, template_file)
-    template = ERB.new File.read(template_file), trim_mode: "-"
-    template.result_with_hash info: info, entries: entries
-  end
+  class << self
+    def data(url, rules, format = :html)
+      page    = download_page url, format
-  def self.download_page(url)
-    Nokogiri::HTML URI.open(url)
-  end
+      info    = extract_info page, rules
+      entries = extract_entries page, rules
-  def self.extract_entries(page, rules)
-    if rules[:entries]
-      extract_entries_wrapped page, rules
-    else
-      extract_entries_unwrapped page, rules
+      { info: info, entries: entries }
     end
-  end
-  def self.extract_entries_unwrapped(page, rules)
-    static  = Hash.new
-    entries = Array.new
+    def feed(url, rules, format = :html, template_file = "#{__dir__}/../default.xml")
+      info, entries = data(url, rules, format).values_at(:info, :entries)
+      create_feed info, entries, template_file
+    end
+    private def create_feed(info, entries, template_file)
+      template = ERB.new File.read(template_file), trim_mode: "-"
+      template.result_with_hash info: info, entries: entries
+    end
-    rules[:entry].each do |name, rule|
-      if rule[:literal]
-        static[name.to_s] = rule[:literal]
-      elsif rule[:repeat]
-        static[name.to_s] = format_content page.at_css(rule[:path]), rule
+    private def download_page(url, format)
+      case format
+      when :html
+        Nokogiri::HTML URI.open(url)
+      when :xml
+        Nokogiri::XML URI.open(url)
       else
-        page.css(rule[:path]).each.with_index do |match, i|
-          entries[i] = Hash.new if entries[i].nil?
-          entries[i].merge!({ name.to_s => format_content(match, rule) })
-        end
+        raise "Format not recognised"
       end
     end
-    unless static.empty?
-      entries.each{ |entry| entry.merge!(static) }
+    private def extract_content(node, rule)
+      case rule.content
+      in { attribute: attribute }
+        node[attribute]
+      in "inner_html"
+        node.inner_html
+      in "html" | "xml"
+        node.to_s
+      else
+        node.content.strip
+      end
     end
-    entries
-  end
+    private def extract_entries(page, rules)
+      if rules[:entries]
+        extract_entries_wrapped page, rules
+      else
+        extract_entries_unwrapped page, rules
+      end
+    end
-  def self.extract_entries_wrapped(page, rules)
-    entries = Array.new
+    private def extract_entries_unwrapped(page, rules)
+      static  = Hash.new
+      entries = Array.new
-    page.css(rules[:entries][:path]).each.with_index do |node, i|
       rules[:entry].each do |name, rule|
-        entries[i] = Hash.new if entries[i].nil?
-        content = if rule[:literal]
-                    rule[:literal]
-                  elsif rule[:repeat]
-                    format_content page.at_css(rule[:path]), rule
-                  else
-                    format_content node.at_css(rule[:path]), rule
-                  end
+        if rule.is_a? String
+          static[name.to_s] = rule
+        elsif rule.absolute
+          static[name.to_s] = format_content page.at_css(rule.selector), rule
+        else
+          page.css(rule.selector).each.with_index do |match, i|
+            entries[i] = Hash.new if entries[i].nil?
+            entries[i].merge!({ name.to_s => format_content(match, rule) })
+          end
+        end
+      end
-        entries[i].merge!({ name.to_s => content })
+      unless static.empty?
+        entries.each{ |entry| entry.merge!(static) }
       end
+      entries
     end
-    entries
-  end
+    private def extract_entries_wrapped(page, rules)
+      entries = Array.new
-  def self.extract_info(page, rules)
-    info = Hash.new
+      page.css(rules[:entries].selector).each.with_index do |parent, i|
+        rules[:entry].each do |name, rule|
+          entries[i] = Hash.new if entries[i].nil?
-    rules[:info].each do |name, rule|
-      if rule[:literal]
-        info[name.to_s] = rule[:literal]
-      else
-        info[name.to_s] = format_content page.at_css(rule[:path]), rule
+          content = if rule.is_a? String
+                      rule
+                    elsif rule.absolute
+                      format_content page.at_css(rule.selector), rule
+                    elsif rule.selector.empty?
+                      format_content parent, rule
+                    else
+                      format_content parent.at_css(rule.selector), rule
+                    end
+          entries[i].merge!({ name.to_s => content })
+        end
       end
-    end
-    info
-  end
-  def self.format_content(match, rule)
-    return "" if match.nil?
-    text      = extract_content match, rule
-    processed = process_content text, rule
-    wrapped   = wrap_content processed, rule
+      return entries unless rules[:entries].filter.is_a? Proc
-    case rule[:type]
-    when "cdata"
-      "<![CDATA[#{wrapped}]]>"
-    when "datetime"
-      "#{Timeliness.parse(wrapped)&.iso8601}"
-    else
-      wrapped
+      entries.filter(&rules[:entries].filter)
     end
-  end
-  def self.normalise_rules(rules)
-    rules.keys.each do |category|
-      case category
-      when :info, :entry
-        rules[category].each do |name, rule|
-          rules[category][name] = { :path => rule } unless rule.is_a? Hash
+    private def extract_info(page, rules)
+      info = Hash.new
+      rules[:info].each do |name, rule|
+        if rule.is_a? String
+          info[name.to_s] = rule
+        else
+          info[name.to_s] = format_content page.at_css(rule.selector), rule
         end
-      when :entries
-        rule = rules[category]
-        rules[category] = { :path => rule } unless rule.is_a? Hash
       end
+      info
     end
-    rules
-  end
+    private def format_content(match, rule)
+      return "" if match.nil?
+      text      = extract_content match, rule
+      processed = process_content text, rule
+      wrapped   = wrap_content processed, rule
-  def self.extract_content(node, rule)
-    case rule[:content]
-    in { attribute: attribute }
-      node[attribute]
-    in "inner_html"
-      node.inner_html
-    else
-      node.content.strip
+      case rule.type
+      when "cdata"
+        "<![CDATA[#{wrapped}]]>"
+      when "datetime"
+        "#{Timeliness.parse(wrapped)&.iso8601}"
+      else
+        wrapped
+      end
     end
-  end
-  def self.process_content(content, rule)
-    if rule[:processor]
-      rule[:processor].call content, rule
-    else
-      content
+    private def process_content(content, rule)
+      if rule.processor
+        rule.processor.call content, rule
+      else
+        content
+      end
     end
-  end
-  def self.wrap_content(content, rule)
-    return content unless rule[:prepend] || rule[:append]
+    private def wrap_content(content, rule)
+      return content unless (rule.prefix || rule.suffix)
-    "#{rule[:prepend]}#{content}#{rule[:append]}"
+      "#{rule.prefix}#{content}#{rule.suffix}"
+    end
   end
 end
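
Note on the formatting order in the file above: content is extracted, then processed, then wrapped with the prefix and suffix, then formatted by type. The following is a minimal sketch of that order under stated assumptions, not the gem's code. `RuleSketch`, `NodeSketch`, `pick_content` and `format_sketch` are hypothetical names; `NodeSketch` only imitates the handful of Nokogiri node methods the diff calls, and the `"datetime"` branch is left out because it depends on the Timeliness library.

```ruby
# Hypothetical stand-in for Feedstock::Extract, reduced to the members used here.
RuleSketch = Struct.new(:selector, :content, :processor, :prefix, :suffix,
                        :type, keyword_init: true)

# Hypothetical stand-in for a Nokogiri node: attribute lookup, inner HTML,
# full markup and text content.
class NodeSketch
  attr_reader :inner_html, :content

  def initialize(attributes:, inner_html:, markup:, content:)
    @attributes = attributes
    @inner_html = inner_html
    @markup = markup
    @content = content
  end

  def [](name)
    @attributes[name]
  end

  def to_s
    @markup
  end
end

# Mirrors the pattern match in extract_content, including the "html" | "xml"
# branch that is new in this diff.
def pick_content(node, content)
  case content
  in { attribute: attribute }
    node[attribute]
  in "inner_html"
    node.inner_html
  in "html" | "xml"
    node.to_s
  else
    node.content.strip
  end
end

# Extract, then process, then wrap, then apply the type.
def format_sketch(node, rule)
  text = pick_content(node, rule.content)
  processed = rule.processor ? rule.processor.call(text, rule) : text
  wrapped = if rule.prefix || rule.suffix
              "#{rule.prefix}#{processed}#{rule.suffix}"
            else
              processed
            end
  rule.type == "cdata" ? "<![CDATA[#{wrapped}]]>" : wrapped
end

node = NodeSketch.new(attributes: { "href" => "/stories/1" },
                      inner_html: "<b>Story</b>",
                      markup: '<a href="/stories/1"><b>Story</b></a>',
                      content: "  Story \n")

link = format_sketch(node, RuleSketch.new(selector: "a",
                                          content: { attribute: "href" },
                                          prefix: "https://example.org"))
title = format_sketch(node, RuleSketch.new(selector: "a",
                                           processor: ->(text, _rule) { text.upcase },
                                           type: "cdata"))

puts link
puts title
```

Because the processor runs before wrapping, a prefix or suffix is never passed through the processor.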

metadata CHANGED

@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: feedstock
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.4.0
 platform: ruby
 authors:
 - Michael Camilleri
-autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-05 00:00:00.000000000 Z
+date: 2025-02-21 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -80,8 +79,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: Feedstock is a library for extracting information from a webpage and
-  transforming it into an Atom feed.
+description: Feedstock is a Ruby library for extracting information from an HTML/XML
+  document and inserting it into an ERB template.
 email:
 - mike@inqk.net
 executables: []
@@ -99,8 +98,8 @@ homepage: https://github.com/pyrmont/feedstock/
 licenses:
 - Unlicense
 metadata:
+  documentation_uri: https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md
   allowed_push_host: https://rubygems.org
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -115,8 +114,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.2.3
-signing_key:
+rubygems_version: 3.6.2
 specification_version: 4
 summary: A library for creating RSS feeds from webpages
 test_files: []