feedstock 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 60dc0bcb05928b59220fe1ed6ac24487428ceef5279f454bce047d3b3a94a56d
-   data.tar.gz: 91d7a161cdd3aedaf2316082b9f5bbae2fa48dcfd9d599cd65ee746673afc2b8
+   metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
+   data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
  SHA512:
-   metadata.gz: 4513cbec520821710ed756544b1a1f7797498a5769395333e19a96c39a466cf83291863944973020ad33a9adb21b58ae2370c3bf08a72843683df1025432fc7c
-   data.tar.gz: 327995486920781894858903f4d510b958ee85b04a9aebf7139a4934f73c8f5af446b8eda6a06bf0e158692c4de278ebfc007b7f729f6bca953c67ea8eaff432
+   metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
+   data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
data/README.md CHANGED
@@ -1,5 +1,10 @@
  # Feedstock

+ [![Gem Version][gem-badge]][gem-link]
+
+ [gem-badge]: https://badge.fury.io/rb/feedstock.svg
+ [gem-link]: https://rubygems.org/gems/feedstock
+
  Feedstock is a Ruby library for extracting information from a webpage and
  converting it into an Atom feed.

@@ -13,6 +18,14 @@ Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
  URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
  extract and transform the data it finds on the webpage.

+ ## Example
+
+ The [feeds.inqk.net repository][example] includes an example of how the Feedstock
+ library can be used in practice.
+
+ [example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+ "An example of using the Feedstock library"
+
  ## Installation

  Feedstock is available as a gem:
@@ -31,20 +44,20 @@ and one optional key.

  ### Info

- The `"info"` key is mandatory. It must be associated with a hash. This document
- refers to this hash as the 'info hash'.
+ The `:info` key is mandatory. It must be associated with a hash. In this
+ README, this hash is referred to as the _info hash_.

  #### Keys

- The keys in the info hash are strings (not symbols). When used with the default
- template, Feedstock will use the key as the name of the XML entity in the
- resulting feed. For example, if the key is `"id"`, the XML entity in the
+ The keys in the info hash should be symbols, not strings. When used with the
+ default template, Feedstock will use the key as the name of the XML entity in
+ the resulting feed. For example, if the key is `:id`, the XML entity in the
  resulting feed will be `<id>`.

  #### Values

  The value associated with each key in the info hash can be either a string or a
- hash.
+ hash.

  ##### String

@@ -55,58 +68,69 @@ matching node in the document.

  ##### Hash

- If the value is a hash, this is the 'data hash'. The data hash defines the
- rules that Feedstock uses to extract data. It must contain one of two keys:
+ If the value is a hash, this is a _data hash_. A data hash defines the rules
+ that Feedstock uses to extract data. It must contain one of two keys:

- - `"literal"`: The value associated with this key is used for the content of the
+ - `:literal`: The value associated with this key is used for the content of the
  XML entity. This can be useful for elements that are not on the page or that
  don't change.

- - `"path"`: The path to the node in the document expressed in CSS's selector
+ - `:path`: The path to the node in the document expressed in CSS's selector
  syntax. As noted above, if the value of a key in the info hash is a string,
- this is treated as a path. The reason to use a data hash with a `"path"` key
+ this is treated as a path. The reason to use a data hash with a `:path` key
  is when using one or more of the keys below. In the info hash, a path matches
  only the first matching node in the document.

  The following keys may also be defined in a data hash:

- - `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
- will extract the content of the attribute rather than the content of the node.
- This is important for links, where the link itself is typically the content of
- the `href` attribute rather than the content of the `<a>` element.
-
- - `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
+ - `:content`: The default is `nil`. The `:content` key can be set to
+ `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+ value is `"inner_html"`, Feedstock will extract the content of the node as
+ HTML. If the value is an attribute hash, Feedstock will extract the value of
+ that attribute. This is important for links, where the link itself is
+ typically the content of the `href` attribute rather than the content of the
+ `<a>` element. For all other values, the plaintext content of the node is
+ extracted.
+
+ - `:processor`: The default is `nil`. The `:processor` key can be set to a
+ lambda function that takes two arguments. The first is the extracted content,
+ the second is the rule being processed. The content extracted by Feedstock for
+ the given path is processed by the processor.
+
+ - `:prefix`: The default is `nil`. If a prefix is provided, the string value of
  the prefix is appended to the beginning of the content extracted.

- - `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+ - `:suffix`: The default is `nil`. If a suffix is provided, the string value of
  the suffix is appended to the end of the content extracted.

- - `"type"`: The default is `nil`. This causes Feedstock to extract only the text
- in a node (stripping out all HTML). However, a user may specify `"datetime"`
- or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
- library][Timeliness] (this is bundled with Feedstock) to return a string.
- `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
- tags.
+ - `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+ If the value is `"datetime"`, the content is parsed by the [Timeliness
+ library][Timeliness] to return a string. If the value is `"cdata"`, the
+ content is wrapped in `<![CDATA[` and `]]>` tags.

  [Timeliness]: https://github.com/adzap/timeliness "The official repository for
  the Timeliness library"

+ #### Formatting Order
+
+ The order for formatting content is: extract, process, wrapping.
+
  ### Entry

- The `"entry"` key is mandatory. It must be associated with a hash. This document
- refers to this hash as the 'entry hash'.
+ The `:entry` key is mandatory. It must be associated with a hash. In this
+ README, this hash is referred to as the _entry hash_.

  #### Keys

- The keys in the entry hash are strings (not symbols). When used with the default
- template, Feedstock will use the key as the name of the XML entity in the
- resulting feed. For example, if the key is `"id"`, the XML entity in the
+ The keys in the entry hash should be symbols, not strings. When used with the
+ default template, Feedstock will use the key as the name of the XML entity in
+ the resulting feed. For example, if the key is `"id"`, the XML entity in the
  resulting feed will be `<id>`.

  #### Values

  The value associated with each key in the entry hash can be either a string or a
- hash.
+ hash.

  ##### String

@@ -116,53 +140,60 @@ the CSS selector will match all nodes.

  ##### Hash

- If the value is a hash, we call this the "data hash". The data hash defines the
+ If the value is a hash, this is a _data hash_. A data hash defines the
  rules that Feedstock uses to extract data. It must contain one of two keys:

- - `"literal"`: The value associated with this key is used for the content of the
+ - `:literal`: The value associated with this key is used for the content of the
  XML entity. This can be useful for elements that are not on the page or that
  don't change.

- - `"path"`: The path to the node in the document expressed in CSS's selector
- syntax. Unlike with the info hash, the CSS selector will match all nodes.
+ - `:path`: The path to the node in the document expressed in CSS's selector
+ syntax. Unlike with the info hash, the CSS selector will match all nodes.

  The following keys may also be defined in a data hash:

- - `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
- will extract the content of the attribute rather than the content of the node.
- This is important for links, where the link itself is typically the content of
- the `href` attribute rather than the content of the `<a>` element.
+ - `:content`: The default is `nil`. The `:content` key can be set to
+ `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+ value is `"inner_html"`, Feedstock will extract the content of the node as
+ HTML. If the value is an attribute hash, Feedstock will extract the value of
+ that attribute. This is important for links, where the link itself is
+ typically the content of the `href` attribute rather than the content of the
+ `<a>` element. For all other values, the plaintext content of the node is
+ extracted.
+
+ - `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
+ use the content provided by either `:literal` or `:path` repeatedly. Since
+ the value of `:literal` implies `:repeat`, it is not necessary to specify it
+ expressly.

- - `"infix"`: The default is `nil`. If the entries hash has been provided (see
- below), then the string value of the infix is inserted between the content of
- each matching node. If the entries hash not been provided, this is ignored.
+ - `:processor`: The default is `nil`. The `:processor` key can be set to a
+ lambda function that takes two arguments. The first is the extracted content,
+ the second is the rule being processed. The content extracted by Feedstock for
+ the given path is processed by the processor.

- - `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
+ - `:prefix`: The default is `nil`. If a prefix is provided, the string value of
  the prefix is appended to the beginning of the content extracted.

- - `"repeat"`: The default is `nil`. If repeat is set to `true`, Feedstock will
- use the content provided by either `"literal"` or `"path"` repeatedly. Since
- the value of `"literal"` implies `"repeat"`, it is not necessary to specify it
- expressly.
-
- - `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+ - `:suffix`: The default is `nil`. If a suffix is provided, the string value of
  the suffix is appended to the end of the content extracted.

- - `"type"`: The default is `nil`. This causes Feedstock to extract only the text
- in a node (stripping out all HTML). However, a user may specify `"datetime"`
- or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
- library][Timeliness] (this is bundled with Feedstock) to return a string.
- `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
- tags.
+ - `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+ If the value is `"datetime"`, the content is parsed by the [Timeliness
+ library][Timeliness] to return a string. If the value is `"cdata"`, the
+ content is wrapped in `<![CDATA[` and `]]>` tags.

  ### Entries

- The `"entries"` key is optional. It can be associated with a hash. This document
- refers to this hash as the 'entries hash'.
+ The `:entries` key is optional. It can be associated with a hash. In this
+ README, this hash is referred to as the _entries hash_.
+
+ The entries hash is offered as a convenience. It allows a user to simplify
+ the paths used in the entry hash by omitting a reference to the node
+ containing the entries.

  If an entries hash is provided, it must contain the following key:

- - `"path"`: The path to the node in the document expressed in CSS's selector
+ - `:path`: The path to the node in the document expressed in CSS's selector
  syntax. This path is used as the root for the paths in the entry hash.

  ## Bugs
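The README changes above switch every rule key from a string to a symbol. As a rough sketch of what that means for callers upgrading from 0.1.1, the helper below converts a string-keyed rules hash to symbol keys. `symbolize_rule_keys` and the sample rules are hypothetical and not part of Feedstock. The sketch does not translate the removed `"attribute"` key into the new `:content` form; that would need to be done by hand.

```ruby
# Hypothetical helper, not part of Feedstock: recursively converts the keys
# of a rules hash to symbols. Values (selectors, lambdas) are left untouched.
def symbolize_rule_keys(value)
  return value unless value.is_a?(Hash)

  value.each_with_object({}) do |(key, inner), result|
    result[key.to_sym] = symbolize_rule_keys(inner)
  end
end

# Made-up rules written in the 0.1.1 style, with string keys.
OLD_RULES = {
  "info" => {
    "id" => { "literal" => "https://example.org/" },
    "title" => "head title"
  },
  "entries" => { "path" => "article" },
  "entry" => {
    "title" => "h2",
    "updated" => { "path" => "time", "type" => "datetime" }
  }
}.freeze

# The same rules with symbol keys, as the 0.2.0 README asks for.
NEW_RULES = symbolize_rule_keys(OLD_RULES)
```

Only keys are converted: string values such as `"datetime"` stay strings, which matches the values the 0.2.0 README documents for `:type` and `:content`.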
data/feedstock.gemspec CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
    desc
    s.homepage = "https://github.com/pyrmont/feedstock/"
    s.licenses = "Unlicense"
-   s.required_ruby_version = ">= 2.5"
+   s.required_ruby_version = ">= 2.7"

    s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
                  "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
data/lib/feedstock.rb CHANGED
@@ -26,7 +26,7 @@ module Feedstock
    end

    def self.extract_entries(page, rules)
-     if rules["entries"]
+     if rules[:entries]
        extract_entries_wrapped page, rules
      else
        extract_entries_unwrapped page, rules
@@ -37,15 +37,15 @@ module Feedstock
      static = Hash.new
      entries = Array.new

-     rules["entry"].each do |name, rule|
-       if rule["literal"]
-         static[name] = rule["literal"]
-       elsif rule["repeat"]
-         static[name] = format_content page.at_css(rule["path"]), rule
+     rules[:entry].each do |name, rule|
+       if rule[:literal]
+         static[name.to_s] = rule[:literal]
+       elsif rule[:repeat]
+         static[name.to_s] = format_content page.at_css(rule[:path]), rule
        else
-         page.css(rule["path"]).each.with_index do |match, i|
+         page.css(rule[:path]).each.with_index do |match, i|
            entries[i] = Hash.new if entries[i].nil?
-           entries[i].merge!({ name => format_content(match, rule) })
+           entries[i].merge!({ name.to_s => format_content(match, rule) })
          end
        end
      end
@@ -60,19 +60,19 @@ module Feedstock
    def self.extract_entries_wrapped(page, rules)
      entries = Array.new

-     page.css(rules["entries"]["path"]).each.with_index do |node, i|
-       rules["entry"].each do |name, rule|
+     page.css(rules[:entries][:path]).each.with_index do |node, i|
+       rules[:entry].each do |name, rule|
          entries[i] = Hash.new if entries[i].nil?

-         content = if rule["literal"]
-                     rule["literal"]
-                   elsif rule["repeat"]
-                     format_content page.at_css(rule["path"]), rule
+         content = if rule[:literal]
+                     rule[:literal]
+                   elsif rule[:repeat]
+                     format_content page.at_css(rule[:path]), rule
                    else
-                     format_content node.at_css(rule["path"]), rule
+                     format_content node.at_css(rule[:path]), rule
                    end

-         entries[i].merge!({ name => content })
+         entries[i].merge!({ name.to_s => content })
        end
      end

@@ -82,11 +82,11 @@ module Feedstock
    def self.extract_info(page, rules)
      info = Hash.new

-     rules["info"].each do |name, rule|
-       if rule["literal"]
-         info[name] = rule["literal"]
+     rules[:info].each do |name, rule|
+       if rule[:literal]
+         info[name.to_s] = rule[:literal]
        else
-         info[name] = format_content page.at_css(rule["path"]), rule
+         info[name.to_s] = format_content page.at_css(rule[:path]), rule
        end
      end

@@ -96,41 +96,58 @@ module Feedstock
    def self.format_content(match, rule)
      return "" if match.nil?

-     text = if rule["attribute"]
-              match[rule["attribute"]]
-            else
-              match.content.strip
-            end
+     text = extract_content match, rule
+     processed = process_content text, rule
+     wrapped = wrap_content processed, rule

-     case rule["type"]
+     case rule[:type]
      when "cdata"
-       "<![CDATA[#{wrap_content(match.inner_html, rule)}]]>"
+       "<![CDATA[#{wrapped}]]>"
      when "datetime"
-       "#{Timeliness.parse(wrap_content(text, rule))&.iso8601}"
+       "#{Timeliness.parse(wrapped)&.iso8601}"
      else
-       wrap_content text, rule
+       wrapped
      end
    end

    def self.normalise_rules(rules)
      rules.keys.each do |category|
        case category
-       when "info", "entry"
+       when :info, :entry
          rules[category].each do |name, rule|
-           rules[category][name] = { "path" => rule } unless rule.is_a? Hash
+           rules[category][name] = { :path => rule } unless rule.is_a? Hash
          end
-       when "entries"
+       when :entries
          rule = rules[category]
-         rules[category] = { "path" => rule } unless rule.is_a? Hash
+         rules[category] = { :path => rule } unless rule.is_a? Hash
        end
      end

      rules
    end

+   def self.extract_content(node, rule)
+     case rule[:content]
+     in { attribute: attribute }
+       node[attribute]
+     in "inner_html"
+       node.inner_html
+     else
+       node.content.strip
+     end
+   end
+
+   def self.process_content(content, rule)
+     if rule[:processor]
+       rule[:processor].call content, rule
+     else
+       content
+     end
+   end
+
    def self.wrap_content(content, rule)
-     return content unless rule["prepend"] || rule["append"]
+     return content unless rule[:prepend] || rule[:append]

-     "#{rule["prepend"]}#{content}#{rule["append"]}"
+     "#{rule[:prepend]}#{content}#{rule[:append]}"
    end
  end
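The refactored `format_content` above splits formatting into extract, process and wrap steps. The following is a standalone sketch of that flow, assuming Ruby 2.7 or later for the `case`/`in` pattern match. `FakeNode` is a hypothetical stand-in for a Nokogiri node and the sample rules are made up. The `"datetime"` branch is omitted so that the sketch has no Timeliness dependency. Like the code in this diff, the sketch reads `:prepend` and `:append`, whereas the README documents `:prefix` and `:suffix`.

```ruby
# Hypothetical stand-in for a Nokogiri node so the sketch runs on its own.
class FakeNode
  attr_reader :content, :inner_html

  def initialize(content:, inner_html:, attributes: {})
    @content = content
    @inner_html = inner_html
    @attributes = attributes
  end

  # Attribute lookup by name, in the style of node["href"].
  def [](name)
    @attributes[name]
  end
end

# Step 1: pick the raw content according to rule[:content].
def extract_content(node, rule)
  case rule[:content]
  in { attribute: attribute }
    node[attribute]
  in "inner_html"
    node.inner_html
  else
    node.content.strip
  end
end

# Step 2: hand the content to the optional two-argument processor.
def process_content(content, rule)
  rule[:processor] ? rule[:processor].call(content, rule) : content
end

# Step 3: add the optional leading and trailing strings.
def wrap_content(content, rule)
  "#{rule[:prepend]}#{content}#{rule[:append]}"
end

# Runs the three steps in order, then applies the "cdata" type if requested.
def format_content(node, rule)
  return "" if node.nil?

  wrapped = wrap_content(process_content(extract_content(node, rule), rule), rule)
  rule[:type] == "cdata" ? "<![CDATA[#{wrapped}]]>" : wrapped
end

node = FakeNode.new(content: "  Hello  ", inner_html: "<b>Hello</b>",
                    attributes: { "href" => "/posts/1" })

format_content(node, { path: "a" })
# => "Hello"
format_content(node, { content: { attribute: "href" }, prepend: "https://example.org" })
# => "https://example.org/posts/1"
format_content(node, { content: "inner_html", type: "cdata" })
# => "<![CDATA[<b>Hello</b>]]>"
```

One consequence visible in the diff: in 0.1.1 the `"cdata"` type always used `inner_html`, while in 0.2.0 a rule gets HTML inside the CDATA section only when it also sets `:content` to `"inner_html"`.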
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Feedstock
-   VERSION = "0.1.1"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: feedstock
  version: !ruby/object:Gem::Version
-   version: 0.1.1
+   version: 0.2.0
  platform: ruby
  authors:
  - Michael Camilleri
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-04 00:00:00.000000000 Z
+ date: 2021-02-05 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: nokogiri
@@ -108,14 +108,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
-       version: '2.5'
+       version: '2.7'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
+ rubygems_version: 3.2.3
  signing_key:
  specification_version: 4
  summary: A library for creating RSS feeds from webpages