feedstock 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +88 -57
- data/feedstock.gemspec +1 -1
- data/lib/feedstock.rb +52 -35
- data/lib/feedstock/version.rb +1 -1
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
+  data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
+  data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
data/README.md
CHANGED
@@ -1,5 +1,10 @@
 # Feedstock
 
+[![Gem Version][gem-badge]][gem-link]
+
+[gem-badge]: https://badge.fury.io/rb/feedstock.svg
+[gem-link]: https://rubygems.org/gems/feedstock
+
 Feedstock is a Ruby library for extracting information from a webpage and
 converting it into an Atom feed.
 
@@ -13,6 +18,14 @@ Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
 URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
 extract and transform the data it finds on the webpage.
 
+## Example
+
+The [feeds.inqk.net repository][example] includes an example of how the Feedstock
+library can be used in practice.
+
+[example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+"An example of using the Feedstock library"
+
 ## Installation
 
 Feedstock is available as a gem:
@@ -31,20 +44,20 @@ and one optional key.
 
 ### Info
 
-The
-
+The `:info` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _info hash_.
 
 #### Keys
 
-The keys in the info hash
-template, Feedstock will use the key as the name of the XML entity in
-resulting feed. For example, if the key is
+The keys in the info hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `:id`, the XML entity in the
 resulting feed will be `<id>`.
 
 #### Values
 
 The value associated with each key in the info hash can be either a string or a
-hash.
+hash.
 
 ##### String
 
@@ -55,58 +68,69 @@ matching node in the document.
 
 ##### Hash
 
-If the value is a hash, this is
-
+If the value is a hash, this is a _data hash_. A data hash defines the rules
+that Feedstock uses to extract data. It must contain one of two keys:
 
--
+- `:literal`: The value associated with this key is used for the content of the
 XML entity. This can be useful for elements that are not on the page or that
 don't change.
 
--
+- `:path`: The path to the node in the document expressed in CSS's selector
 syntax. As noted above, if the value of a key in the info hash is a string,
-this is treated as a path. The reason to use a data hash with a
+this is treated as a path. The reason to use a data hash with a `:path` key
 is when using one or more of the keys below. In the info hash, a path matches
 only the first matching node in the document.
 
 The following keys may also be defined in a data hash:
 
--
-
-
-the
-
-
+- `:content`: The default is `nil`. The `:content` key can be set to
+`"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+value is `"inner_html"`, Feedstock will extract the content of the node as
+HTML. If the value is an attribute hash, Feedstock will extract the value of
+that attribute. This is important for links, where the link itself is
+typically the content of the `href` attribute rather than the content of the
+`<a>` element. For all other values, the plaintext content of the node is
+extracted.
+
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+lambda function that takes two arguments. The first is the extracted content,
+the second is the rule being processed. The content extracted by Feedstock for
+the given path is processed by the processor.
+
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
 the prefix is appended to the beginning of the content extracted.
 
--
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
 the suffix is appended to the end of the content extracted.
 
--
-
-
-
-`"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+If the value is `"datetime"`, the content is parsed by the [Timeliness
+library][Timeliness] to return a string. If the value is `"cdata"`, the
+content is wrapped in `<![CDATA[` and `]]>` tags.
 
 [Timeliness]: https://github.com/adzap/timeliness "The official repository for
 the Timeliness library"
 
+#### Formatting Order
+
+The order for formatting content is: extract, process, wrapping.
+
 ### Entry
 
-The
-
+The `:entry` key is mandatory. It must be associated with a hash. In this
+README, this hash is referred to as the _entry hash_.
 
 #### Keys
 
-The keys in the entry hash
-template, Feedstock will use the key as the name of the XML entity in
-resulting feed. For example, if the key is `"id"`, the XML entity in the
+The keys in the entry hash should be symbols, not strings. When used with the
+default template, Feedstock will use the key as the name of the XML entity in
+the resulting feed. For example, if the key is `"id"`, the XML entity in the
 resulting feed will be `<id>`.
 
 #### Values
 
 The value associated with each key in the entry hash can be either a string or a
-hash.
+hash.
 
 ##### String
 
@@ -116,53 +140,60 @@ the CSS selector will match all nodes.
 
 ##### Hash
 
-If the value is a hash,
+If the value is a hash, this is a _data hash_. A data hash defines the
 rules that Feedstock uses to extract data. It must contain one of two keys:
 
--
+- `:literal`: The value associated with this key is used for the content of the
 XML entity. This can be useful for elements that are not on the page or that
 don't change.
 
--
-syntax. Unlike with the info hash, the CSS selector will match all nodes.
+- `:path`: The path to the node in the document expressed in CSS's selector
+syntax. Unlike with the info hash, the CSS selector will match all nodes.
 
 The following keys may also be defined in a data hash:
 
--
-
-
-the
+- `:content`: The default is `nil`. The `:content` key can be set to
+`"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
+value is `"inner_html"`, Feedstock will extract the content of the node as
+HTML. If the value is an attribute hash, Feedstock will extract the value of
+that attribute. This is important for links, where the link itself is
+typically the content of the `href` attribute rather than the content of the
+`<a>` element. For all other values, the plaintext content of the node is
+extracted.
+
+- `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
+use the content provided by either `:literal` or `:path` repeatedly. Since
+the value of `:literal` implies `:repeat`, it is not necessary to specify it
+expressly.
 
--
-
-
+- `:processor`: The default is `nil`. The `:processor` key can be set to a
+lambda function that takes two arguments. The first is the extracted content,
+the second is the rule being processed. The content extracted by Feedstock for
+the given path is processed by the processor.
 
--
+- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
 the prefix is appended to the beginning of the content extracted.
 
--
-use the content provided by either `"literal"` or `"path"` repeatedly. Since
-the value of `"literal"` implies `"repeat"`, it is not necessary to specify it
-expressly.
-
-- `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
+- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
 the suffix is appended to the end of the content extracted.
 
--
-
-
-
-`"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
-tags.
+- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
+If the value is `"datetime"`, the content is parsed by the [Timeliness
+library][Timeliness] to return a string. If the value is `"cdata"`, the
+content is wrapped in `<![CDATA[` and `]]>` tags.
 
 ### Entries
 
-The
-
+The `:entries` key is optional. It can be associated with a hash. In this
+README, this hash is referred to as the _entries hash_.
+
+The entries hash is offered as a convenience. It allows a user to simplify
+the paths used in the entry hash by omitting a reference to the node
+containing the entries.
 
 If an entries hash is provided, it must contain the following key:
 
--
+- `:path`: The path to the node in the document expressed in CSS's selector
 syntax. This path is used as the root for the paths in the entry hash.
 
 ## Bugs
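The rules described in this README diff can be sketched as a plain Ruby hash with symbol keys. This is only an illustration: the selectors, URL, and feed values are hypothetical, and `normalise` is a local stand-in modelled on the `normalise_rules` method shown in the `lib/feedstock.rb` diff, not a call into the gem.

```ruby
# Hypothetical rules for an imaginary blog page; every selector is made up.
rules = {
  info: {
    id: { literal: "https://example.org/blog" }, # fixed value, not scraped
    title: "head title"                          # string shorthand for a path
  },
  entry: {
    title: "h2.post-title",
    link: { path: "h2.post-title a", content: { attribute: "href" } },
    updated: { path: "time", type: "datetime" },
    author: { literal: "Example Author" }
  },
  entries: "article.post"                        # root node for each entry
}

# Local stand-in: expands string shorthand into { path: ... } data hashes.
def normalise(rules)
  rules.each_key do |category|
    case category
    when :info, :entry
      rules[category].each do |name, rule|
        rules[category][name] = { path: rule } unless rule.is_a?(Hash)
      end
    when :entries
      rule = rules[category]
      rules[category] = { path: rule } unless rule.is_a?(Hash)
    end
  end
  rules
end

normalised = normalise(rules)
```

After normalisation every value is a data hash, so later code can read `rule[:path]` or `rule[:literal]` without checking for strings.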
data/feedstock.gemspec
CHANGED
@@ -14,7 +14,7 @@ Gem::Specification.new do |s|
   desc
   s.homepage = "https://github.com/pyrmont/feedstock/"
   s.licenses = "Unlicense"
-  s.required_ruby_version = ">= 2.
+  s.required_ruby_version = ">= 2.7"
 
   s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
                 "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
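The raised minimum Ruby version lines up with the `case`/`in` pattern matching that `extract_content` uses in the `lib/feedstock.rb` diff, since that syntax first shipped (as an experimental feature) in Ruby 2.7. The link between the two changes is an inference from this diff, not something the diff states. A minimal sketch of the syntax, using a hypothetical `describe` helper:

```ruby
# Hypothetical helper: classifies a :content setting with case/in,
# which needs Ruby 2.7 or later.
def describe(content)
  case content
  in { attribute: attribute } # hash pattern, binds the attribute name
    "attribute #{attribute}"
  in "inner_html"             # value pattern, compared with ===
    "inner HTML"
  else                        # nil and anything else fall through here
    "plain text"
  end
end

from_hash = describe({ attribute: "href" })
from_string = describe("inner_html")
from_nil = describe(nil)
```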
data/lib/feedstock.rb
CHANGED
@@ -26,7 +26,7 @@ module Feedstock
   end
 
   def self.extract_entries(page, rules)
-    if rules[
+    if rules[:entries]
       extract_entries_wrapped page, rules
     else
       extract_entries_unwrapped page, rules
@@ -37,15 +37,15 @@ module Feedstock
     static = Hash.new
     entries = Array.new
 
-    rules[
-      if rule[
-        static[name] = rule[
-      elsif rule[
-        static[name] = format_content page.at_css(rule[
+    rules[:entry].each do |name, rule|
+      if rule[:literal]
+        static[name.to_s] = rule[:literal]
+      elsif rule[:repeat]
+        static[name.to_s] = format_content page.at_css(rule[:path]), rule
       else
-        page.css(rule[
+        page.css(rule[:path]).each.with_index do |match, i|
           entries[i] = Hash.new if entries[i].nil?
-          entries[i].merge!({ name => format_content(match, rule) })
+          entries[i].merge!({ name.to_s => format_content(match, rule) })
         end
       end
     end
@@ -60,19 +60,19 @@ module Feedstock
   def self.extract_entries_wrapped(page, rules)
     entries = Array.new
 
-    page.css(rules[
-      rules[
+    page.css(rules[:entries][:path]).each.with_index do |node, i|
+      rules[:entry].each do |name, rule|
         entries[i] = Hash.new if entries[i].nil?
 
-        content = if rule[
-                    rule[
-                  elsif rule[
-                    format_content page.at_css(rule[
+        content = if rule[:literal]
+                    rule[:literal]
+                  elsif rule[:repeat]
+                    format_content page.at_css(rule[:path]), rule
                   else
-                    format_content node.at_css(rule[
+                    format_content node.at_css(rule[:path]), rule
                   end
 
-        entries[i].merge!({ name => content })
+        entries[i].merge!({ name.to_s => content })
       end
     end
 
@@ -82,11 +82,11 @@ module Feedstock
   def self.extract_info(page, rules)
     info = Hash.new
 
-    rules[
-      if rule[
-        info[name] = rule[
+    rules[:info].each do |name, rule|
+      if rule[:literal]
+        info[name.to_s] = rule[:literal]
       else
-        info[name] = format_content page.at_css(rule[
+        info[name.to_s] = format_content page.at_css(rule[:path]), rule
       end
     end
 
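One effect of the `name.to_s` calls added in these hunks is that rule names written as symbols end up as string keys in the extracted hashes. A minimal sketch of that conversion, restricted to `:literal` rules so that no page is needed; `collect_literals` and the sample values are hypothetical stand-ins, not part of Feedstock:

```ruby
# Hypothetical stand-in for the :literal branch of extract_info.
def collect_literals(rules)
  info = {}
  rules[:info].each do |name, rule|
    # Symbol rule names become string keys in the result.
    info[name.to_s] = rule[:literal] if rule[:literal]
  end
  info
end

info = collect_literals(
  { info: { id: { literal: "urn:example:feed" }, title: { literal: "Example" } } }
)
```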
@@ -96,41 +96,58 @@ module Feedstock
   def self.format_content(match, rule)
     return "" if match.nil?
 
-    text
-
-
-             match.content.strip
-           end
+    text = extract_content match, rule
+    processed = process_content text, rule
+    wrapped = wrap_content processed, rule
 
-    case rule[
+    case rule[:type]
     when "cdata"
-      "<![CDATA[#{
+      "<![CDATA[#{wrapped}]]>"
     when "datetime"
-      "#{Timeliness.parse(
+      "#{Timeliness.parse(wrapped)&.iso8601}"
     else
-
+      wrapped
     end
   end
 
   def self.normalise_rules(rules)
     rules.keys.each do |category|
       case category
-      when
+      when :info, :entry
         rules[category].each do |name, rule|
-          rules[category][name] = {
+          rules[category][name] = { :path => rule } unless rule.is_a? Hash
         end
-      when
+      when :entries
         rule = rules[category]
-        rules[category] = {
+        rules[category] = { :path => rule } unless rule.is_a? Hash
       end
     end
 
     rules
   end
 
+  def self.extract_content(node, rule)
+    case rule[:content]
+    in { attribute: attribute }
+      node[attribute]
+    in "inner_html"
+      node.inner_html
+    else
+      node.content.strip
+    end
+  end
+
+  def self.process_content(content, rule)
+    if rule[:processor]
+      rule[:processor].call content, rule
+    else
+      content
+    end
+  end
+
   def self.wrap_content(content, rule)
-    return content unless rule[
+    return content unless rule[:prepend] || rule[:append]
 
-    "#{rule[
+    "#{rule[:prepend]}#{content}#{rule[:append]}"
   end
 end
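Taken together, the new helpers form an extract, process, wrap pipeline, which `format_content` finishes by applying `:type`. The sketch below mirrors that flow with a `FakeNode` class standing in for a Nokogiri node, so it runs without any gems. `FakeNode`, the sample data, and the omission of the `"datetime"` branch are assumptions made for illustration. The sketch reads the `:prepend` and `:append` keys used by `wrap_content` in this diff, which differ from the `:prefix` and `:suffix` names in the README diff.

```ruby
# Hypothetical stand-in for a Nokogiri node; only the methods used below.
class FakeNode
  attr_reader :content, :inner_html

  def initialize(content, inner_html, attributes)
    @content = content
    @inner_html = inner_html
    @attributes = attributes
  end

  def [](name)
    @attributes[name]
  end
end

def extract_content(node, rule)
  case rule[:content]
  in { attribute: attribute }
    node[attribute]
  in "inner_html"
    node.inner_html
  else
    node.content.strip
  end
end

def process_content(content, rule)
  rule[:processor] ? rule[:processor].call(content, rule) : content
end

def wrap_content(content, rule)
  return content unless rule[:prepend] || rule[:append]

  "#{rule[:prepend]}#{content}#{rule[:append]}"
end

# Order: extract, then process, then wrap, then apply the "cdata" type.
def format_content(node, rule)
  return "" if node.nil?

  text = extract_content(node, rule)
  processed = process_content(text, rule)
  wrapped = wrap_content(processed, rule)
  rule[:type] == "cdata" ? "<![CDATA[#{wrapped}]]>" : wrapped
end

node = FakeNode.new(" Hello ", "<b>Hello</b>", { "href" => "/posts/1" })

link = format_content(node, { content: { attribute: "href" }, prepend: "https://example.org" })
title = format_content(node, { processor: ->(text, _rule) { text.upcase } })
body = format_content(node, { content: "inner_html", type: "cdata" })
```

Because the processor runs before wrapping, a lambda sees only the extracted content and never the prepended or appended strings.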
data/lib/feedstock/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: feedstock
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - Michael Camilleri
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-02-
+date: 2021-02-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri
@@ -108,14 +108,14 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: '2.
+      version: '2.7'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.2.3
 signing_key:
 specification_version: 4
 summary: A library for creating RSS feeds from webpages