feedstock 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +88 -57
- data/feedstock.gemspec +1 -1
- data/lib/feedstock.rb +52 -35
- data/lib/feedstock/version.rb +1 -1
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
+  data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
+  data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
data/README.md
CHANGED
@@ -1,5 +1,10 @@
 # Feedstock

+[![Gem Version][gem-badge]][gem-link]
+
+[gem-badge]: https://badge.fury.io/rb/feedstock.svg
+[gem-link]: https://rubygems.org/gems/feedstock
+
 Feedstock is a Ruby library for extracting information from a webpage and
 converting it into an Atom feed.

@@ -13,6 +18,14 @@ Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
 URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
 extract and transform the data it finds on the webpage.

+## Example
+
+The [feeds.inqk.net repository][example] includes an example of how the Feedstock
+library can be used in practice.
+
+[example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+"An example of using the Feedstock library"
+
 ## Installation

 Feedstock is available as a gem:
@@ -31,20 +44,20 @@ and one optional key.

 ### Info

-The
-
+The `:info` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _info hash_.

 #### Keys

-The keys in the info hash
-template, Feedstock will use the key as the name of the XML entity in
-resulting feed. For example, if the key is
+The keys in the info hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `:id`, the XML entity in the
 resulting feed will be `<id>`.

 #### Values

 The value associated with each key in the info hash can be either a string or a
-hash.
+hash.

 ##### String

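The Values text in this hunk says a value may be either a string or a hash. Below is a minimal sketch of how a bare string can be folded into the hash form, modelled on the `{ :path => rule } unless rule.is_a? Hash` lines in the `normalise_rules` change later in this diff. The helper name `to_data_hash`, the URL, and the selector are invented for illustration and are not part of Feedstock.

```ruby
# Hypothetical helper (not part of Feedstock): mirrors the
# `{ :path => rule } unless rule.is_a? Hash` idiom from normalise_rules.
def to_data_hash(value)
  value.is_a?(Hash) ? value : { path: value }
end

# Illustrative info hash with symbol keys; the URL and selector are invented.
info = {
  id: { literal: "https://example.org/feed" }, # already a data hash
  title: "head > title"                        # bare string, treated as a path
}

normalised = info.transform_values { |value| to_data_hash(value) }
```

After this step every value is a hash, so later code can read `rule[:literal]` or `rule[:path]` without checking the value's type first.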
@@ -55,58 +68,69 @@ matching node in the document.

 ##### Hash

-If the value is a hash, this is
-
+If the value is a hash, this is a _data hash_. A data hash defines the rules
+that Feedstock uses to extract data. It must contain one of two keys:

--
+- `:literal`: The value associated with this key is used for the content of the
   XML entity. This can be useful for elements that are not on the page or that
   don't change.

--
+- `:path`: The path to the node in the document expressed in CSS's selector
   syntax. As noted above, if the value of a key in the info hash is a string,
-  this is treated as a path. The reason to use a data hash with a
+  this is treated as a path. The reason to use a data hash with a `:path` key
   is when using one or more of the keys below. In the info hash, a path matches
   only the first matching node in the document.

 The following keys may also be defined in a data hash:

--
-
-
-  the
-
-
+- `:content`: The default is `nil`. The `:content` key can be set to
+  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+  value is `"inner_html"`, Feedstock will extract the content of the node as
+  HTML. If the value is an attribute hash, Feedstock will extract the value of
+  that attribute. This is important for links, where the link itself is
+  typically the content of the `href` attribute rather than the content of the
+  `<a>` element. For all other values, the plaintext content of the node is
+  extracted.
+
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+  lambda function that takes two arguments. The first is the extracted content,
+  the second is the rule being processed. The content extracted by Feedstock for
+  the given path is processed by the processor.
+
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
   the prefix is appended to the beginning of the content extracted.

--
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
   the suffix is appended to the end of the content extracted.

--
-
-
-
-  `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-  tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+  If the value is `"datetime"`, the content is parsed by the [Timeliness
+  library][Timeliness] to return a string. If the value is `"cdata"`, the
+  content is wrapped in `<![CDATA[` and `]]>` tags.

 [Timeliness]: https://github.com/adzap/timeliness "The official repository for
 the Timeliness library"

+#### Formatting Order
+
+The order for formatting content is: extract, process, wrapping.
+
 ### Entry

-The
-
+The `:entry` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _entry hash_.

 #### Keys

-The keys in the entry hash
-template, Feedstock will use the key as the name of the XML entity in
-resulting feed. For example, if the key is `"id"`, the XML entity in the
+The keys in the entry hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `"id"`, the XML entity in the
 resulting feed will be `<id>`.

 #### Values

 The value associated with each key in the entry hash can be either a string or a
-hash.
+hash.

 ##### String

@@ -116,53 +140,60 @@ the CSS selector will match all nodes.

 ##### Hash

-If the value is a hash,
+If the value is a hash, this is a _data hash_. A data hash defines the
 rules that Feedstock uses to extract data. It must contain one of two keys:

--
+- `:literal`: The value associated with this key is used for the content of the
   XML entity. This can be useful for elements that are not on the page or that
   don't change.

--
-  syntax. Unlike with the info hash, the CSS selector will match all nodes.
+- `:path`: The path to the node in the document expressed in CSS's selector
+  syntax. Unlike with the info hash, the CSS selector will match all nodes.

 The following keys may also be defined in a data hash:

--
-
-
-  the
+- `:content`: The default is `nil`. The `:content` key can be set to
+  `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+  value is `"inner_html"`, Feedstock will extract the content of the node as
+  HTML. If the value is an attribute hash, Feedstock will extract the value of
+  that attribute. This is important for links, where the link itself is
+  typically the content of the `href` attribute rather than the content of the
+  `<a>` element. For all other values, the plaintext content of the node is
+  extracted.
+
+- `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
+  use the content provided by either `:literal` or `:path` repeatedly. Since
+  the value of `:literal` implies `:repeat`, it is not necessary to specify it
+  expressly.

--
-
-
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+  lambda function that takes two arguments. The first is the extracted content,
+  the second is the rule being processed. The content extracted by Feedstock for
+  the given path is processed by the processor.

--
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
   the prefix is appended to the beginning of the content extracted.

--
-  use the content provided by either `"literal"` or `"path"` repeatedly. Since
-  the value of `"literal"` implies `"repeat"`, it is not necessary to specify it
-  expressly.
-
-- `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
   the suffix is appended to the end of the content extracted.

--
-
-
-
-  `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-  tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+  If the value is `"datetime"`, the content is parsed by the [Timeliness
+  library][Timeliness] to return a string. If the value is `"cdata"`, the
+  content is wrapped in `<![CDATA[` and `]]>` tags.

 ### Entries

-The
-
+The `:entries` key is optional. It can be associated with a hash. In this
+README, this hash is referred to as the _entries hash_.
+
+The entries hash is offered as a convenience. It allows a user to simplify
+the paths used in the entry hash by omitting a reference to the node
+containing the entries.

 If an entries hash is provided, it must contain the following key:

--
+- `:path`: The path to the node in the document expressed in CSS's selector
   syntax. This path is used as the root for the paths in the entry hash.

 ## Bugs
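As an illustration of the README changes above, here is a sketch of a rules hash written in the 0.2.0 style, with symbol keys throughout. Every selector and URL is invented for the example, and the hash is only constructed, not passed to the library, because the entry-point call is not shown in this diff.

```ruby
# Illustrative rules hash following the 0.2.0 README. All selectors and
# URLs are hypothetical; nothing here calls into Feedstock itself.
rules = {
  info: {
    id: { literal: "https://example.org/news" },
    title: "head > title" # a bare string is treated as a path
  },
  entry: {
    title: "h2",
    # Take the href attribute rather than the text of the <a> element.
    link: { path: "h2 a", content: { attribute: "href" } },
    updated: { path: "time", type: "datetime" },
    summary: { path: "p.summary", content: "inner_html", type: "cdata" },
    # Reuse one value for every entry.
    author: { path: "footer .site-author", repeat: true },
    # A processor is a lambda taking the extracted content and the rule.
    category: { path: "span.tag", processor: ->(content, _rule) { content.downcase } }
  },
  # Optional: root node for the entry paths above.
  entries: { path: "article.post" }
}
```

With an entries hash present, the paths in the entry hash can stay short (`"h2"`, `"time"`) because they are resolved relative to each node matched by `article.post`.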
data/feedstock.gemspec
CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
   desc
   s.homepage = "https://github.com/pyrmont/feedstock/"
   s.licenses = "Unlicense"
-  s.required_ruby_version = ">= 2.
+  s.required_ruby_version = ">= 2.7"

   s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
                 "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
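The diff does not state why the new requirement is Ruby 2.7 or later. One plausible reason (an assumption, not something the source confirms) is that `extract_content`, added to `lib/feedstock.rb` later in this diff, uses `case`/`in` pattern matching, which first shipped in Ruby 2.7. A small sketch of that construct follows; `content_mode` is a hypothetical helper name.

```ruby
# Hypothetical helper showing the case/in construct used by extract_content.
# Pattern matching needs Ruby 2.7 or later (2.7 prints an "experimental" warning).
def content_mode(rule)
  case rule[:content]
  in { attribute: attribute } # matches a hash that has an :attribute key
    "attribute:#{attribute}"
  in "inner_html"
    "inner_html"
  else                        # nil or anything else falls back to plain text
    "text"
  end
end
```

The `else` branch matters here: without it, a value that matches no pattern raises `NoMatchingPatternError`.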
data/lib/feedstock.rb
CHANGED
@@ -26,7 +26,7 @@ module Feedstock
   end

   def self.extract_entries(page, rules)
-    if rules[
+    if rules[:entries]
       extract_entries_wrapped page, rules
     else
       extract_entries_unwrapped page, rules
@@ -37,15 +37,15 @@ module Feedstock
     static = Hash.new
     entries = Array.new

-    rules[
-      if rule[
-        static[name] = rule[
-      elsif rule[
-        static[name] = format_content page.at_css(rule[
+    rules[:entry].each do |name, rule|
+      if rule[:literal]
+        static[name.to_s] = rule[:literal]
+      elsif rule[:repeat]
+        static[name.to_s] = format_content page.at_css(rule[:path]), rule
       else
-        page.css(rule[
+        page.css(rule[:path]).each.with_index do |match, i|
           entries[i] = Hash.new if entries[i].nil?
-          entries[i].merge!({ name => format_content(match, rule) })
+          entries[i].merge!({ name.to_s => format_content(match, rule) })
         end
       end
     end
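The loop in this hunk builds one hash per entry by position and, as of 0.2.0, stores each name as a string (`name.to_s`) even though the rule keys are symbols. Below is a minimal sketch of that index-wise merge, using plain arrays in place of the Nokogiri matches. The `matches` data is invented for the example.

```ruby
# Invented stand-in for "the values each rule matched on the page".
matches = {
  title: ["First post", "Second post"],
  link: ["/posts/1", "/posts/2"]
}

entries = []
matches.each do |name, values|
  values.each.with_index do |value, i|
    entries[i] = {} if entries[i].nil?        # create the i-th entry on first use
    entries[i].merge!({ name.to_s => value }) # symbol rule key becomes a string key
  end
end
```

Because entries are aligned purely by index, a selector that matches fewer nodes than the others leaves later entries without that key.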
@@ -60,19 +60,19 @@ module Feedstock
   def self.extract_entries_wrapped(page, rules)
     entries = Array.new

-    page.css(rules[
-      rules[
+    page.css(rules[:entries][:path]).each.with_index do |node, i|
+      rules[:entry].each do |name, rule|
         entries[i] = Hash.new if entries[i].nil?

-        content = if rule[
-                    rule[
-                  elsif rule[
-                    format_content page.at_css(rule[
+        content = if rule[:literal]
+                    rule[:literal]
+                  elsif rule[:repeat]
+                    format_content page.at_css(rule[:path]), rule
                   else
-                    format_content node.at_css(rule[
+                    format_content node.at_css(rule[:path]), rule
                   end

-        entries[i].merge!({ name => content })
+        entries[i].merge!({ name.to_s => content })
       end
     end

@@ -82,11 +82,11 @@ module Feedstock
   def self.extract_info(page, rules)
     info = Hash.new

-    rules[
-      if rule[
-        info[name] = rule[
+    rules[:info].each do |name, rule|
+      if rule[:literal]
+        info[name.to_s] = rule[:literal]
       else
-        info[name] = format_content page.at_css(rule[
+        info[name.to_s] = format_content page.at_css(rule[:path]), rule
       end
     end

@@ -96,41 +96,58 @@ module Feedstock
   def self.format_content(match, rule)
     return "" if match.nil?

-    text
-
-
-      match.content.strip
-    end
+    text = extract_content match, rule
+    processed = process_content text, rule
+    wrapped = wrap_content processed, rule

-    case rule[
+    case rule[:type]
     when "cdata"
-      "<![CDATA[#{
+      "<![CDATA[#{wrapped}]]>"
     when "datetime"
-      "#{Timeliness.parse(
+      "#{Timeliness.parse(wrapped)&.iso8601}"
     else
-
+      wrapped
     end
   end

   def self.normalise_rules(rules)
     rules.keys.each do |category|
       case category
-      when
+      when :info, :entry
         rules[category].each do |name, rule|
-          rules[category][name] = {
+          rules[category][name] = { :path => rule } unless rule.is_a? Hash
         end
-      when
+      when :entries
         rule = rules[category]
-        rules[category] = {
+        rules[category] = { :path => rule } unless rule.is_a? Hash
       end
     end

     rules
   end

+  def self.extract_content(node, rule)
+    case rule[:content]
+    in { attribute: attribute }
+      node[attribute]
+    in "inner_html"
+      node.inner_html
+    else
+      node.content.strip
+    end
+  end
+
+  def self.process_content(content, rule)
+    if rule[:processor]
+      rule[:processor].call content, rule
+    else
+      content
+    end
+  end
+
   def self.wrap_content(content, rule)
-    return content unless rule[
+    return content unless rule[:prepend] || rule[:append]

-    "#{rule[
+    "#{rule[:prepend]}#{content}#{rule[:append]}"
   end
 end
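The new `format_content` runs three steps in order (extract, process, wrap) before applying `:type`. The sketch below imitates that order with a stand-in node so it runs without Nokogiri. `FakeNode` and `format_sketch` are hypothetical names, and the `"datetime"` branch is left out because it depends on the Timeliness gem. Note that the code in this hunk reads `rule[:prepend]` and `rule[:append]`, whereas the README hunks above name the keys `:prefix` and `:suffix`; the sketch follows the code.

```ruby
# Stand-in for a parsed node; the real code receives Nokogiri nodes.
FakeNode = Struct.new(:content)

# Hypothetical condensed version of the extract -> process -> wrap order.
def format_sketch(node, rule)
  return "" if node.nil?

  text = node.content.strip                                               # extract
  processed = rule[:processor] ? rule[:processor].call(text, rule) : text # process
  wrapped = "#{rule[:prepend]}#{processed}#{rule[:append]}"               # wrap
  rule[:type] == "cdata" ? "<![CDATA[#{wrapped}]]>" : wrapped             # type
end

rule = {
  processor: ->(content, _rule) { content.upcase },
  prepend: "[",
  append: "]",
  type: "cdata"
}
result = format_sketch(FakeNode.new("  hello "), rule)
# result is "<![CDATA[[HELLO]]]>"
```

Because the processor runs before wrapping, it only ever sees the extracted text, never the prepended or appended strings.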
data/lib/feedstock/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: feedstock
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Michael Camilleri
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-
+date: 2021-02-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,14 +108,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '2.
+      version: '2.7'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.2.3
 signing_key:
 specification_version: 4
 summary: A library for creating RSS feeds from webpages