RubyGems - feedstock - Versions diffs - 0.1.1 → 0.2.0 - Mend

feedstock 0.1.1 → 0.2.0

Files changed (6)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 60dc0bcb05928b59220fe1ed6ac24487428ceef5279f454bce047d3b3a94a56d
-  data.tar.gz: 91d7a161cdd3aedaf2316082b9f5bbae2fa48dcfd9d599cd65ee746673afc2b8
+  metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
+  data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
 SHA512:
-  metadata.gz: 4513cbec520821710ed756544b1a1f7797498a5769395333e19a96c39a466cf83291863944973020ad33a9adb21b58ae2370c3bf08a72843683df1025432fc7c
-  data.tar.gz: 327995486920781894858903f4d510b958ee85b04a9aebf7139a4934f73c8f5af446b8eda6a06bf0e158692c4de278ebfc007b7f729f6bca953c67ea8eaff432
+  metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
+  data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b

data/README.md CHANGED

@@ -1,5 +1,10 @@
 # Feedstock
+[![Gem Version][gem-badge]][gem-link]
+[gem-badge]: https://badge.fury.io/rb/feedstock.svg
+[gem-link]: https://rubygems.org/gems/feedstock
 Feedstock is a Ruby library for extracting information from a webpage and
 converting it into an Atom feed.
@@ -13,6 +18,14 @@ Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
 URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
 extract and transform the data it finds on the webpage.
+## Example
+The [feeds.inqk.net repository][example] includes an example of how the Feedstock
+library can be used in practice.
+[example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+"An example of using the Feedstock library"
 ## Installation
 Feedstock is available as a gem:
@@ -31,20 +44,20 @@ and one optional key.
 ### Info
-The `"info"` key is mandatory. It must be associated with a hash. This document
-refers to this hash as the 'info hash'.
+The `:info` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _info hash_.
 #### Keys
-The keys in the info hash are strings (not symbols). When used with the default
-template, Feedstock will use the key as the name of the XML entity in the
-resulting feed. For example, if the key is `"id"`, the XML entity in the
+The keys in the info hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `:id`, the XML entity in the
 resulting feed will be `<id>`.
 #### Values
 The value associated with each key in the info hash can be either a string or a
-hash.
+hash.
 ##### String
@@ -55,58 +68,69 @@ matching node in the document.
 ##### Hash
-If the value is a hash, this is the 'data hash'. The data hash defines the
-rules that Feedstock uses to extract data. It must contain one of two keys:
+If the value is a hash, this is a _data hash_. A data hash defines the rules
+that Feedstock uses to extract data. It must contain one of two keys:
-- `"literal"`: The value associated with this key is used for the content of the
+- `:literal`: The value associated with this key is used for the content of the
   XML entity. This can be useful for elements that are not on the page or that
   don't change.
-- `"path"`: The path to the node in the document expressed in CSS's selector
+- `:path`: The path to the node in the document expressed in CSS's selector
   syntax.  As noted above, if the value of a key in the info hash is a string,
-  this is treated as a path. The reason to use a data hash with a `"path"` key
+  this is treated as a path. The reason to use a data hash with a `:path` key
   is when using one or more of the keys below. In the info hash, a path matches
   only the first matching node in the document.
 The following keys may also be defined in a data hash:
-- `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
-  will extract the content of the attribute rather than the content of the node.
-  This is important for links, where the link itself is typically the content of
-  the `href` attribute rather than the content of the `<a>` element.
-- `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
+- `:content`: The default is `nil`. The `:content` key can be set to
+  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+  value is `"inner_html"`, Feedstock will extract the content of the node as
+  HTML. If the value is an attribute hash, Feedstock will extract the value of
+  that attribute. This is important for links, where the link itself is
+  typically the content of the `href` attribute rather than the content of the
+  `<a>` element. For all other values, the plaintext content of the node is
+  extracted.
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+  lambda function that takes two arguments. The first is the extracted content,
+  the second is the rule being processed. The content extracted by Feedstock for
+  the given path is processed by the processor.
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
   the prefix is appended to the beginning of the content extracted.
-- `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
   the suffix is appended to the end of the content extracted.
-- `"type"`: The default is `nil`. This causes Feedstock to extract only the text
-  in a node (stripping out all HTML). However, a user may specify `"datetime"`
-  or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
-  library][Timeliness] (this is bundled with Feedstock) to return a string.
-  `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-  tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+  If the value is `"datetime"`, the content is parsed by the [Timeliness
+  library][Timeliness] to return a string. If the value is `"cdata"`, the
+  content is wrapped in `<![CDATA[` and `]]>` tags.
 [Timeliness]: https://github.com/adzap/timeliness "The official repository for
 the Timeliness library"
+#### Formatting Order
+The order for formatting content is: extract, process, wrapping.
 ### Entry
-The `"entry"` key is mandatory. It must be associated with a hash. This document
-refers to this hash as the 'entry hash'.
+The `:entry` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _entry hash_.
 #### Keys
-The keys in the entry hash are strings (not symbols). When used with the default
-template, Feedstock will use the key as the name of the XML entity in the
-resulting feed. For example, if the key is `"id"`, the XML entity in the
+The keys in the entry hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `"id"`, the XML entity in the
 resulting feed will be `<id>`.
 #### Values
 The value associated with each key in the entry hash can be either a string or a
-hash.
+hash.
 ##### String
@@ -116,53 +140,60 @@ the CSS selector will match all nodes.
 ##### Hash
-If the value is a hash, we call this the "data hash". The data hash defines the
+If the value is a hash, this is a _data hash_. A data hash defines the
 rules that Feedstock uses to extract data. It must contain one of two keys:
-- `"literal"`: The value associated with this key is used for the content of the
+- `:literal`: The value associated with this key is used for the content of the
   XML entity. This can be useful for elements that are not on the page or that
   don't change.
-- `"path"`: The path to the node in the document expressed in CSS's selector
-  syntax. Unlike with the info hash, the CSS selector will match all nodes.
+- `:path`: The path to the node in the document expressed in CSS's selector
+  syntax. Unlike with the info hash, the CSS selector will match all nodes.
 The following keys may also be defined in a data hash:
-- `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
-  will extract the content of the attribute rather than the content of the node.
-  This is important for links, where the link itself is typically the content of
-  the `href` attribute rather than the content of the `<a>` element.
+- `:content`: The default is `nil`. The `:content` key can be set to
+  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+  value is `"inner_html"`, Feedstock will extract the content of the node as
+  HTML. If the value is an attribute hash, Feedstock will extract the value of
+  that attribute. This is important for links, where the link itself is
+  typically the content of the `href` attribute rather than the content of the
+  `<a>` element. For all other values, the plaintext content of the node is
+  extracted.
+- `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
+  use the content provided by either `:literal` or `:path` repeatedly. Since
+  the value of `:literal` implies `:repeat`, it is not necessary to specify it
+  expressly.
-- `"infix"`: The default is `nil`. If the entries hash has been provided (see
-  below), then the string value of the infix is inserted between the content of
-  each matching node. If the entries hash not been provided, this is ignored.
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+  lambda function that takes two arguments. The first is the extracted content,
+  the second is the rule being processed. The content extracted by Feedstock for
+  the given path is processed by the processor.
-- `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
   the prefix is appended to the beginning of the content extracted.
-- `"repeat"`: The default is `nil`. If repeat is set to `true`, Feedstock will
-  use the content provided by either `"literal"` or `"path"` repeatedly. Since
-  the value of `"literal"` implies `"repeat"`, it is not necessary to specify it
-  expressly.
-- `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
   the suffix is appended to the end of the content extracted.
-- `"type"`: The default is `nil`. This causes Feedstock to extract only the text
-  in a node (stripping out all HTML). However, a user may specify `"datetime"`
-  or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
-  library][Timeliness] (this is bundled with Feedstock) to return a string.
-  `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-  tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+  If the value is `"datetime"`, the content is parsed by the [Timeliness
+  library][Timeliness] to return a string. If the value is `"cdata"`, the
+  content is wrapped in `<![CDATA[` and `]]>` tags.
 ### Entries
-The `"entries"` key is optional. It can be associated with a hash. This document
-refers to this hash as the 'entries hash'.
+The `:entries` key is optional. It can be associated with a hash. In this
+README, this hash is referred to as the _entries hash_.
+The entries hash is offered as a convenience. It allows a user to simplify
+the paths used in the entry hash by omitting a reference to the node
+containing the entries.
 If an entries hash is provided, it must contain the following key:
-- `"path"`: The path to the node in the document expressed in CSS's selector
+- `:path`: The path to the node in the document expressed in CSS's selector
   syntax. This path is used as the root for the paths in the entry hash.
 ## Bugs
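
Note on the README changes above: rules written for 0.1.1 use string keys, while 0.2.0 looks keys up as symbols, so an unconverted rules hash would no longer match. The following is a minimal sketch of one way to convert an old rules hash. The helper name `symbolize_rules` and the sample selectors are hypothetical and are not part of Feedstock. The sketch only converts keys; a rule that used the old `"attribute"` key would still need rewriting by hand to the new `content: { attribute: ... }` form, and the `"infix"` key is no longer documented.

```ruby
# Hypothetical helper (not part of Feedstock): recursively convert the
# string keys of a 0.1.1-style rules hash into symbols for 0.2.0.
def symbolize_rules(value)
  case value
  when Hash
    value.each_with_object({}) do |(key, inner), out|
      out[key.to_sym] = symbolize_rules(inner)
    end
  else
    # Strings such as CSS paths or "cdata" are values, not keys; keep them.
    value
  end
end

# A 0.1.1-style rules hash with illustrative (made-up) selectors.
old_rules = {
  "info"    => { "id" => { "literal" => "urn:example" }, "title" => "h1" },
  "entries" => "div.post",
  "entry"   => { "title" => { "path" => "h2", "type" => "cdata" } }
}

new_rules = symbolize_rules(old_rules)
p new_rules[:entry][:title][:path]
```

Values such as `"cdata"` and `"datetime"` are left as strings because the 0.2.0 code in this diff still compares `rule[:type]` against string literals.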

data/feedstock.gemspec CHANGED

@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
   desc
   s.homepage = "https://github.com/pyrmont/feedstock/"
   s.licenses = "Unlicense"
-  s.required_ruby_version = ">= 2.5"
+  s.required_ruby_version = ">= 2.7"
   s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
                 "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]

data/lib/feedstock.rb CHANGED

@@ -26,7 +26,7 @@ module Feedstock
   end
   def self.extract_entries(page, rules)
-    if rules["entries"]
+    if rules[:entries]
       extract_entries_wrapped page, rules
     else
       extract_entries_unwrapped page, rules
@@ -37,15 +37,15 @@ module Feedstock
     static  = Hash.new
     entries = Array.new
-    rules["entry"].each do |name, rule|
-      if rule["literal"]
-        static[name] = rule["literal"]
-      elsif rule["repeat"]
-        static[name] = format_content page.at_css(rule["path"]), rule
+    rules[:entry].each do |name, rule|
+      if rule[:literal]
+        static[name.to_s] = rule[:literal]
+      elsif rule[:repeat]
+        static[name.to_s] = format_content page.at_css(rule[:path]), rule
       else
-        page.css(rule["path"]).each.with_index do |match, i|
+        page.css(rule[:path]).each.with_index do |match, i|
           entries[i] = Hash.new if entries[i].nil?
-          entries[i].merge!({ name => format_content(match, rule) })
+          entries[i].merge!({ name.to_s => format_content(match, rule) })
         end
       end
     end
@@ -60,19 +60,19 @@ module Feedstock
   def self.extract_entries_wrapped(page, rules)
     entries = Array.new
-    page.css(rules["entries"]["path"]).each.with_index do |node, i|
-      rules["entry"].each do |name, rule|
+    page.css(rules[:entries][:path]).each.with_index do |node, i|
+      rules[:entry].each do |name, rule|
         entries[i] = Hash.new if entries[i].nil?
-        content = if rule["literal"]
-                    rule["literal"]
-                  elsif rule["repeat"]
-                    format_content page.at_css(rule["path"]), rule
+        content = if rule[:literal]
+                    rule[:literal]
+                  elsif rule[:repeat]
+                    format_content page.at_css(rule[:path]), rule
                   else
-                    format_content node.at_css(rule["path"]), rule
+                    format_content node.at_css(rule[:path]), rule
                   end
-        entries[i].merge!({ name => content })
+        entries[i].merge!({ name.to_s => content })
       end
     end
@@ -82,11 +82,11 @@ module Feedstock
   def self.extract_info(page, rules)
     info = Hash.new
-    rules["info"].each do |name, rule|
-      if rule["literal"]
-        info[name] = rule["literal"]
+    rules[:info].each do |name, rule|
+      if rule[:literal]
+        info[name.to_s] = rule[:literal]
       else
-        info[name] = format_content page.at_css(rule["path"]), rule
+        info[name.to_s] = format_content page.at_css(rule[:path]), rule
       end
     end
@@ -96,41 +96,58 @@ module Feedstock
   def self.format_content(match, rule)
     return "" if match.nil?
-    text = if rule["attribute"]
-             match[rule["attribute"]]
-           else
-             match.content.strip
-           end
+    text      = extract_content match, rule
+    processed = process_content text, rule
+    wrapped   = wrap_content processed, rule
-    case rule["type"]
+    case rule[:type]
     when "cdata"
-      "<![CDATA[#{wrap_content(match.inner_html, rule)}]]>"
+      "<![CDATA[#{wrapped}]]>"
     when "datetime"
-      "#{Timeliness.parse(wrap_content(text, rule))&.iso8601}"
+      "#{Timeliness.parse(wrapped)&.iso8601}"
     else
-      wrap_content text, rule
+      wrapped
     end
   end
   def self.normalise_rules(rules)
     rules.keys.each do |category|
       case category
-      when "info", "entry"
+      when :info, :entry
         rules[category].each do |name, rule|
-          rules[category][name] = { "path" => rule } unless rule.is_a? Hash
+          rules[category][name] = { :path => rule } unless rule.is_a? Hash
         end
-      when "entries"
+      when :entries
         rule = rules[category]
-        rules[category] = { "path" => rule } unless rule.is_a? Hash
+        rules[category] = { :path => rule } unless rule.is_a? Hash
       end
     end
     rules
   end
+  def self.extract_content(node, rule)
+    case rule[:content]
+    in { attribute: attribute }
+      node[attribute]
+    in "inner_html"
+      node.inner_html
+    else
+      node.content.strip
+    end
+  end
+  def self.process_content(content, rule)
+    if rule[:processor]
+      rule[:processor].call content, rule
+    else
+      content
+    end
+  end
   def self.wrap_content(content, rule)
-    return content unless rule["prepend"] || rule["append"]
+    return content unless rule[:prepend] || rule[:append]
-    "#{rule["prepend"]}#{content}#{rule["append"]}"
+    "#{rule[:prepend]}#{content}#{rule[:append]}"
   end
 end
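
Note on the `format_content` changes above: the new code runs three steps in order (extract, process, wrap) and uses `case ... in` pattern matching, which is consistent with the Ruby requirement moving to 2.7 in the gemspec. The sketch below copies the three helpers as standalone methods so they can run without Nokogiri. `FakeNode`, `format_text`, and the sample rules are hypothetical stand-ins, and the `:type` handling (`"cdata"`, `"datetime"`) is left out. The code in this diff reads `:prepend` and `:append`, so the sketch uses those keys.

```ruby
# Hypothetical stand-in for a Nokogiri node; it implements only the three
# methods the helpers below call.
class FakeNode
  attr_reader :inner_html, :content

  def initialize(attributes:, inner_html:, content:)
    @attributes = attributes
    @inner_html = inner_html
    @content = content
  end

  def [](name)
    @attributes[name]
  end
end

# Step 1: choose what to pull out of the node.
def extract_content(node, rule)
  case rule[:content]
  in { attribute: attribute }
    node[attribute]      # e.g. the href of a link
  in "inner_html"
    node.inner_html      # keep the markup
  else
    node.content.strip   # plain text, the default
  end
end

# Step 2: pass the extracted content through an optional lambda.
def process_content(content, rule)
  rule[:processor] ? rule[:processor].call(content, rule) : content
end

# Step 3: add optional text before and after the processed content.
def wrap_content(content, rule)
  return content unless rule[:prepend] || rule[:append]

  "#{rule[:prepend]}#{content}#{rule[:append]}"
end

# Hypothetical wrapper that chains the three steps in order.
def format_text(node, rule)
  wrap_content(process_content(extract_content(node, rule), rule), rule)
end

node = FakeNode.new(attributes: { "href" => "/posts/1" },
                    inner_html: "<b>Hello</b>",
                    content: "  Hello  ")

link_rule  = { content: { attribute: "href" }, prepend: "https://example.org" }
title_rule = { processor: ->(text, _rule) { text.upcase } }

puts format_text(node, link_rule)
puts format_text(node, title_rule)
```

On Ruby 2.7 the `case ... in` form prints an "experimental" warning; from Ruby 3.0 it does not.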

data/lib/feedstock/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Feedstock
-  VERSION = "0.1.1"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: feedstock
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.2.0
 platform: ruby
 authors:
 - Michael Camilleri
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-04 00:00:00.000000000 Z
+date: 2021-02-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,14 +108,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '2.5'
+      version: '2.7'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.2
+rubygems_version: 3.2.3
 signing_key:
 specification_version: 4
 summary: A library for creating RSS feeds from webpages