feedstock 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
4
- data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
3
+ metadata.gz: b478a8e9dd24f3ac78e99189c959b247828f7a2980a979cc320a3b6f8c5306d1
4
+ data.tar.gz: 957fb142e5abef9289ca92f40ecf6f7425106a40271da4967c80c27f2b69c4eb
5
5
  SHA512:
6
- metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
7
- data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
6
+ metadata.gz: '0895e3a795d26151fc74a79107e0873afe8f0d99fe955cc4fffd94342d335075852ae71abbbfc61d47a637fc06e00c73c17f0ad023e7a712e2dd5e015e393554'
7
+ data.tar.gz: b4ef2dbf847a910813d4187b71208c50f773613cca75173152ed6f9343b3de220c98917aa074060851c89e72433f14866cf16539cf357725f8f3c5587d406062
data/README.md CHANGED
@@ -5,23 +5,28 @@
5
5
  [gem-badge]: https://badge.fury.io/rb/feedstock.svg
6
6
  [gem-link]: https://rubygems.org/gems/feedstock
7
7
 
8
- Feedstock is a Ruby library for extracting information from a webpage and
9
- converting it into an Atom feed.
8
+ Feedstock is a Ruby library for extracting information from an HTML/XML document
9
+ and inserting it into an ERB template. Its primary purpose is to create a feed
10
+ for a webpage that doesn't offer one.
10
11
 
11
12
  ## Rationale
12
13
 
13
- Feeds are great. But sometimes a website doesn't provide a feed or doesn't
14
- provide a feed for the specific content that you want. That's where Feedstock
15
- can help.
14
+ I love RSS feeds.
16
15
 
17
- Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
18
- URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
19
- extract and transform the data it finds on the webpage.
16
+ That's why I think it's a shame not every website has a feed. However, even when
17
+ a website does have a feed, sometimes it doesn't include quite the mix of
18
+ information that I want. I made Feedstock to solve those two problems.
19
+
20
+ Feedstock is a Ruby library that you can use to create an Atom or RSS feed. It
21
+ requires a URL to a document and a hash of rules. The rules tell Feedstock how
22
+ to extract and transform the data found on the webpage. That data is stuffed
23
+ into a hash and then run through an ERB template. Feedstock comes with a
24
+ template but you can use your own, too.
20
25
 
21
26
  ## Example
22
27
 
23
- The [feeds.inqk.net repository][example] includes an example of how the Feedstock
24
- library can be used in practice.
28
+ The [feeds.inqk.net repository][example] includes an example of how the
29
+ Feedstock library can be used in practice.
25
30
 
26
31
  [example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
27
32
  "An example of using the Feedstock library"
@@ -36,169 +41,42 @@ $ gem install feedstock
36
41
 
37
42
  ## Usage
38
43
 
39
- Feedstock extracts information from a given document using a collection of
40
- _rules_.
41
-
42
- A collection of rules is expressed as a hash. The hash has two mandatory keys
43
- and one optional key.
44
-
45
- ### Info
46
-
47
- The `:info` key is mandatory. It must be associated with a hash. In this
48
- README, this hash is referred to as the _info hash_.
49
-
50
- #### Keys
51
-
52
- The keys in the info hash should be symbols, not strings. When used with the
53
- default template, Feedstock will use the key as the name of the XML entity in
54
- the resulting feed. For example, if the key is `:id`, the XML entity in the
55
- resulting feed will be `<id>`.
56
-
57
- #### Values
58
-
59
- The value associated with each key in the info hash can be either a string or a
60
- hash.
61
-
62
- ##### String
63
-
64
- If the value is a string, this defines a path to a node in the document. The
65
- path is expressed using CSS's selector syntax. Although a CSS selector can match
66
- more than one node, when used in the info hash, a path will only match the first
67
- matching node in the document.
68
-
69
- ##### Hash
70
-
71
- If the value is a hash, this is a _data hash_. A data hash defines the rules
72
- that Feedstock uses to extract data. It must contain one of two keys:
73
-
74
- - `:literal`: The value associated with this key is used for the content of the
75
- XML entity. This can be useful for elements that are not on the page or that
76
- don't change.
77
-
78
- - `:path`: The path to the node in the document expressed in CSS's selector
79
- syntax. As noted above, if the value of a key in the info hash is a string,
80
- this is treated as a path. The reason to use a data hash with a `:path` key
81
- is when using one or more of the keys below. In the info hash, a path matches
82
- only the first matching node in the document.
83
-
84
- The following keys may also be defined in a data hash:
85
-
86
- - `:content`: The default is `nil`. The `:content` key can be set to
87
- `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
88
- value is `"inner_html"`, Feedstock will extract the content of the node as
89
- HTML. If the value is an attribute hash, Feedstock will extract the value of
90
- that attribute. This is important for links, where the link itself is
91
- typically the content of the `href` attribute rather than the content of the
92
- `<a>` element. For all other values, the plaintext content of the node is
93
- extracted.
94
-
95
- - `:processor`: The default is `nil`. The `:processor` key can be set to a
96
- lambda function that takes two arguments. The first is the extracted content,
97
- the second is the rule being processed. The content extracted by Feedstock for
98
- the given path is processed by the processor.
99
-
100
- - `:prefix`: The default is `nil`. If a prefix is provided, the string value of
101
- the prefix is appended to the beginning of the content extracted.
102
-
103
- - `:suffix`: The default is `nil`. If a suffix is provided, the string value of
104
- the suffix is appended to the end of the content extracted.
105
-
106
- - `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
107
- If the value is `"datetime"`, the content is parsed by the [Timeliness
108
- library][Timeliness] to return a string. If the value is `"cdata"`, the
109
- content is wrapped in `<![CDATA[` and `]]>` tags.
110
-
111
- [Timeliness]: https://github.com/adzap/timeliness "The official repository for
112
- the Timeliness library"
113
-
114
- #### Formatting Order
115
-
116
- The order for formatting content is: extract, process, wrapping.
117
-
118
- ### Entry
119
-
120
- The `:entry` key is mandatory. It must be associated with a hash. In this
121
- README, this hash is referred to as the _entry hash_.
122
-
123
- #### Keys
124
-
125
- The keys in the entry hash should be symbols, not strings. When used with the
126
- default template, Feedstock will use the key as the name of the XML entity in
127
- the resulting feed. For example, if the key is `:id`, the XML entity in the
128
- resulting feed will be `<id>`.
129
-
130
- #### Values
131
-
132
- The value associated with each key in the entry hash can be either a string or a
133
- hash.
134
-
135
- ##### String
136
-
137
- If the value is a string, this defines a path to a node in the document. The
138
- path is expressed using CSS's selector syntax. Unlike with the info hash,
139
- the CSS selector will match all nodes.
140
-
141
- ##### Hash
142
-
143
- If the value is a hash, this is a _data hash_. A data hash defines the
144
- rules that Feedstock uses to extract data. It must contain one of two keys:
145
-
146
- - `:literal`: The value associated with this key is used for the content of the
147
- XML entity. This can be useful for elements that are not on the page or that
148
- don't change.
149
-
150
- - `:path`: The path to the node in the document expressed in CSS's selector
151
- syntax. Unlike with the info hash, the CSS selector will match all nodes.
152
-
153
- The following keys may also be defined in a data hash:
154
-
155
- - `:content`: The default is `nil`. The `:content` key can be set to
156
- `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
157
- value is `"inner_html"`, Feedstock will extract the content of the node as
158
- HTML. If the value is an attribute hash, Feedstock will extract the value of
159
- that attribute. This is important for links, where the link itself is
160
- typically the content of the `href` attribute rather than the content of the
161
- `<a>` element. For all other values, the plaintext content of the node is
162
- extracted.
163
-
164
- - `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
165
- use the content provided by either `:literal` or `:path` repeatedly. Since
166
- the value of `:literal` implies `:repeat`, it is not necessary to specify it
167
- expressly.
168
-
169
- - `:processor`: The default is `nil`. The `:processor` key can be set to a
170
- lambda function that takes two arguments. The first is the extracted content,
171
- the second is the rule being processed. The content extracted by Feedstock for
172
- the given path is processed by the processor.
173
-
174
- - `:prefix`: The default is `nil`. If a prefix is provided, the string value of
175
- the prefix is appended to the beginning of the content extracted.
176
-
177
- - `:suffix`: The default is `nil`. If a suffix is provided, the string value of
178
- the suffix is appended to the end of the content extracted.
179
-
180
- - `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
181
- If the value is `"datetime"`, the content is parsed by the [Timeliness
182
- library][Timeliness] to return a string. If the value is `"cdata"`, the
183
- content is wrapped in `<![CDATA[` and `]]>` tags.
184
-
185
- ### Entries
186
-
187
- The `:entries` key is optional. It can be associated with a hash. In this
188
- README, this hash is referred to as the _entries hash_.
189
-
190
- The entries hash is offered as a convenience. It allows a user to simplify
191
- the paths used in the entry hash by omitting a reference to the node
192
- containing the entries.
44
+ Feedstock extracts information from a document at a given _URL_ using a
45
+ collection of _rules_. The feed is generated by calling `Feedstock.feed` as
46
+ below:
47
+
48
+ ```ruby
49
+ # Define the URL
50
+ url = "https://example.org"
51
+
52
+ # Define the rules
53
+ rules = { info: { id: url,
54
+ title: "div.title",
55
+ updated: "span.date" },
56
+ entries: "div.story",
57
+ entry: { id: { path: "a",
58
+ content: { attribute: "href" } },
59
+ title: "h2",
60
+ updated: "span.date",
61
+ author: "span.byline",
62
+ link: { path: "a",
63
+ content: { attribute: "href" } },
64
+ summary: "div.summary" } }
65
+
66
+ # Using the default format and template
67
+ Feedstock.feed url, rules
68
+
69
+ # Using the XML format and a user-specified template
70
+ Feedstock.feed url, rules, :xml, "podcast.xml"
71
+ ```
193
72
 
194
- If an entries hash is provided, it must contain the following key:
73
+ More information is available in [api.md].
195
74
 
196
- - `:path`: The path to the node in the document expressed in CSS's selector
197
- syntax. This path is used as the root for the paths in the entry hash.
75
+ [api.md]: https://github.com/pyrmont/feedstock/blob/master/api.md
198
76
 
199
77
  ## Bugs
200
78
 
201
- Found a bug? I'd love to know about it. The best way is to report them in the
79
+ Found a bug? I'd love to know about it. The best way is to report it in the
202
80
  [Issues section][ghi] on GitHub.
203
81
 
204
82
  [ghi]: https://github.com/pyrmont/feedstock/issues
@@ -211,7 +89,6 @@ Feedstock uses [Semantic Versioning 2.0.0][sv2].
211
89
 
212
90
  ## Licence
213
91
 
214
- Feedstock is released into the public domain. See [LICENSE.md][lc] for more
215
- details.
92
+ Feedstock is released into the public domain. See [LICENSE][] for more details.
216
93
 
217
- [lc]: https://github.com/pyrmont/feedstock/blob/master/LICENSE.md
94
+ [LICENSE]: https://github.com/pyrmont/feedstock/blob/master/LICENSE
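
The new README example leans on shorthand rules such as `title: "h2"` and `entries: "div.story"`. Going by `normalise_rules` in `lib/feedstock.rb`, a bare string appears to be expanded into a data hash with a `:path` key before extraction. The sketch below restates that expansion in standalone Ruby so it can be run without the gem. The method name `expand_shorthand` is hypothetical, and unlike the library method it returns a new hash rather than modifying its argument.

```ruby
# Standalone restatement of the shorthand expansion seen in normalise_rules.
# expand_shorthand is a hypothetical name; it does not call Feedstock.
def expand_shorthand(rules)
  rules.each_with_object({}) do |(category, value), expanded|
    expanded[category] =
      case category
      when :info, :entry
        # Each named rule is either a bare selector string or a data hash.
        value.transform_values { |rule| rule.is_a?(Hash) ? rule : { path: rule } }
      when :entries
        # The entries rule itself may be a bare selector string.
        value.is_a?(Hash) ? value : { path: value }
      else
        # Any other category is left untouched.
        value
      end
  end
end

rules = { info: { title: "div.title" },
          entries: "div.story",
          entry: { title: "h2",
                   link: { path: "a", content: { attribute: "href" } } } }

expanded = expand_shorthand(rules)
puts expanded.inspect
```

Because every bare string is treated as a selector, a fixed value would presumably need the explicit `{ literal: ... }` form to be passed through unchanged.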
data/feedstock.gemspec CHANGED
@@ -9,12 +9,15 @@ Gem::Specification.new do |s|
9
9
  s.email = ["mike@inqk.net"]
10
10
  s.summary = "A library for creating RSS feeds from webpages"
11
11
  s.description = <<-desc.strip.gsub(/\s+/, " ")
12
- Feedstock is a library for extracting information from a webpage and
13
- transforming it into an Atom feed.
12
+ Feedstock is a Ruby library for extracting information from an HTML/XML
13
+ document and inserting it into an ERB template.
14
14
  desc
15
15
  s.homepage = "https://github.com/pyrmont/feedstock/"
16
16
  s.licenses = "Unlicense"
17
17
  s.required_ruby_version = ">= 2.7"
18
+ s.metadata = {
19
+ "documentation_uri" => "https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md"
20
+ }
18
21
 
19
22
  s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
20
23
  "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
data/lib/feedstock.rb CHANGED
@@ -6,148 +6,162 @@ require "open-uri"
6
6
  require "timeliness"
7
7
 
8
8
  module Feedstock
9
- def self.feed(url, rules, template_file = "#{__dir__}/../default.xml")
10
- rules = normalise_rules rules
11
- page = download_page url
12
- info = extract_info page, rules
13
- entries = extract_entries page, rules
14
- feed = create_feed info, entries, template_file
15
-
16
- feed
17
- end
18
-
19
- def self.create_feed(info, entries, template_file)
20
- template = ERB.new File.read(template_file), trim_mode: "-"
21
- template.result_with_hash info: info, entries: entries
22
- end
9
+ class << self
10
+ def feed(url, rules, format = :html, template_file = "#{__dir__}/../default.xml")
11
+ page = download_page url, format
12
+ rules = normalise_rules rules
23
13
 
24
- def self.download_page(url)
25
- Nokogiri::HTML URI.open(url)
26
- end
14
+ info = extract_info page, rules
15
+ entries = extract_entries page, rules
27
16
 
28
- def self.extract_entries(page, rules)
29
- if rules[:entries]
30
- extract_entries_wrapped page, rules
31
- else
32
- extract_entries_unwrapped page, rules
17
+ create_feed info, entries, template_file
33
18
  end
34
- end
35
19
 
36
- def self.extract_entries_unwrapped(page, rules)
37
- static = Hash.new
38
- entries = Array.new
20
+ private def create_feed(info, entries, template_file)
21
+ template = ERB.new File.read(template_file), trim_mode: "-"
22
+ template.result_with_hash info: info, entries: entries
23
+ end
39
24
 
40
- rules[:entry].each do |name, rule|
41
- if rule[:literal]
42
- static[name.to_s] = rule[:literal]
43
- elsif rule[:repeat]
44
- static[name.to_s] = format_content page.at_css(rule[:path]), rule
25
+ private def download_page(url, format)
26
+ case format
27
+ when :html
28
+ Nokogiri::HTML URI.open(url)
29
+ when :xml
30
+ Nokogiri::XML URI.open(url)
45
31
  else
46
- page.css(rule[:path]).each.with_index do |match, i|
47
- entries[i] = Hash.new if entries[i].nil?
48
- entries[i].merge!({ name.to_s => format_content(match, rule) })
49
- end
32
+ raise "Format not recognised"
50
33
  end
51
34
  end
52
35
 
53
- unless static.empty?
54
- entries.each{ |entry| entry.merge!(static) }
36
+ private def extract_content(node, rule)
37
+ case rule[:content]
38
+ in { attribute: attribute }
39
+ node[attribute]
40
+ in "inner_html"
41
+ node.inner_html
42
+ in "html" | "xml"
43
+ node.to_s
44
+ else
45
+ node.content.strip
46
+ end
55
47
  end
56
48
 
57
- entries
58
- end
49
+ private def extract_entries(page, rules)
50
+ if rules[:entries]
51
+ extract_entries_wrapped page, rules
52
+ else
53
+ extract_entries_unwrapped page, rules
54
+ end
55
+ end
59
56
 
60
- def self.extract_entries_wrapped(page, rules)
61
- entries = Array.new
57
+ private def extract_entries_unwrapped(page, rules)
58
+ static = Hash.new
59
+ entries = Array.new
62
60
 
63
- page.css(rules[:entries][:path]).each.with_index do |node, i|
64
61
  rules[:entry].each do |name, rule|
65
- entries[i] = Hash.new if entries[i].nil?
66
-
67
- content = if rule[:literal]
68
- rule[:literal]
69
- elsif rule[:repeat]
70
- format_content page.at_css(rule[:path]), rule
71
- else
72
- format_content node.at_css(rule[:path]), rule
73
- end
62
+ if rule[:literal]
63
+ static[name.to_s] = rule[:literal]
64
+ elsif rule[:repeat]
65
+ static[name.to_s] = format_content page.at_css(rule[:path]), rule
66
+ else
67
+ page.css(rule[:path]).each.with_index do |match, i|
68
+ entries[i] = Hash.new if entries[i].nil?
69
+ entries[i].merge!({ name.to_s => format_content(match, rule) })
70
+ end
71
+ end
72
+ end
74
73
 
75
- entries[i].merge!({ name.to_s => content })
74
+ unless static.empty?
75
+ entries.each{ |entry| entry.merge!(static) }
76
76
  end
77
+
78
+ entries
77
79
  end
78
80
 
79
- entries
80
- end
81
+ private def extract_entries_wrapped(page, rules)
82
+ entries = Array.new
81
83
 
82
- def self.extract_info(page, rules)
83
- info = Hash.new
84
+ page.css(rules[:entries][:path]).each.with_index do |node, i|
85
+ rules[:entry].each do |name, rule|
86
+ entries[i] = Hash.new if entries[i].nil?
84
87
 
85
- rules[:info].each do |name, rule|
86
- if rule[:literal]
87
- info[name.to_s] = rule[:literal]
88
- else
89
- info[name.to_s] = format_content page.at_css(rule[:path]), rule
90
- end
91
- end
88
+ content = if rule[:literal]
89
+ rule[:literal]
90
+ elsif rule[:repeat]
91
+ format_content page.at_css(rule[:path]), rule
92
+ else
93
+ format_content node.at_css(rule[:path]), rule
94
+ end
92
95
 
93
- info
94
- end
96
+ entries[i].merge!({ name.to_s => content })
97
+ end
98
+ end
95
99
 
96
- def self.format_content(match, rule)
97
- return "" if match.nil?
98
100
 
99
- text = extract_content match, rule
100
- processed = process_content text, rule
101
- wrapped = wrap_content processed, rule
101
+ return entries unless rules[:entries][:filter].is_a? Proc
102
102
 
103
- case rule[:type]
104
- when "cdata"
105
- "<![CDATA[#{wrapped}]]>"
106
- when "datetime"
107
- "#{Timeliness.parse(wrapped)&.iso8601}"
108
- else
109
- wrapped
103
+ entries.filter(&rules[:entries][:filter])
110
104
  end
111
- end
112
105
 
113
- def self.normalise_rules(rules)
114
- rules.keys.each do |category|
115
- case category
116
- when :info, :entry
117
- rules[category].each do |name, rule|
118
- rules[category][name] = { :path => rule } unless rule.is_a? Hash
106
+ private def extract_info(page, rules)
107
+ info = Hash.new
108
+
109
+ rules[:info].each do |name, rule|
110
+ if rule[:literal]
111
+ info[name.to_s] = rule[:literal]
112
+ else
113
+ info[name.to_s] = format_content page.at_css(rule[:path]), rule
119
114
  end
120
- when :entries
121
- rule = rules[category]
122
- rules[category] = { :path => rule } unless rule.is_a? Hash
123
115
  end
116
+
117
+ info
124
118
  end
125
119
 
126
- rules
127
- end
120
+ private def format_content(match, rule)
121
+ return "" if match.nil?
122
+
123
+ text = extract_content match, rule
124
+ processed = process_content text, rule
125
+ wrapped = wrap_content processed, rule
128
126
 
129
- def self.extract_content(node, rule)
130
- case rule[:content]
131
- in { attribute: attribute }
132
- node[attribute]
133
- in "inner_html"
134
- node.inner_html
135
- else
136
- node.content.strip
127
+ case rule[:type]
128
+ when "cdata"
129
+ "<![CDATA[#{wrapped}]]>"
130
+ when "datetime"
131
+ "#{Timeliness.parse(wrapped)&.iso8601}"
132
+ else
133
+ wrapped
134
+ end
137
135
  end
138
- end
139
136
 
140
- def self.process_content(content, rule)
141
- if rule[:processor]
142
- rule[:processor].call content, rule
143
- else
144
- content
137
+ private def normalise_rules(rules)
138
+ rules.keys.each do |category|
139
+ case category
140
+ when :info, :entry
141
+ rules[category].each do |name, rule|
142
+ rules[category][name] = { :path => rule } unless rule.is_a? Hash
143
+ end
144
+ when :entries
145
+ rule = rules[category]
146
+ rules[category] = { :path => rule } unless rule.is_a? Hash
147
+ end
148
+ end
149
+
150
+ rules
145
151
  end
146
- end
147
152
 
148
- def self.wrap_content(content, rule)
149
- return content unless rule[:prepend] || rule[:append]
153
+ private def process_content(content, rule)
154
+ if rule[:processor]
155
+ rule[:processor].call content, rule
156
+ else
157
+ content
158
+ end
159
+ end
150
160
 
151
- "#{rule[:prepend]}#{content}#{rule[:append]}"
161
+ private def wrap_content(content, rule)
162
+ return content unless rule[:prepend] || rule[:append]
163
+
164
+ "#{rule[:prepend]}#{content}#{rule[:append]}"
165
+ end
152
166
  end
153
167
  end
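
One behavioural addition in this hunk is the `:filter` key read from `rules[:entries]` at the end of `extract_entries_wrapped`. The sketch below imitates that final step on hand-written data, assuming entries are hashes keyed by the rule name as a string, which is how the method builds them. The sample entries and the `keep_unsponsored` lambda are made up for illustration, and nothing here calls the gem.

```ruby
# Entries shaped as extract_entries_wrapped builds them: one hash per matched
# node, keyed by the rule name converted to a string. Sample data is invented.
entries = [
  { "title" => "Release notes", "author" => "Staff" },
  { "title" => "Sponsored: buy now", "author" => "Ads" },
  { "title" => "Interview", "author" => "Staff" }
]

# Hypothetical filter; in a rules hash it would sit at rules[:entries][:filter].
keep_unsponsored = ->(entry) { !entry["title"].start_with?("Sponsored") }

rules = { entries: { path: "div.story", filter: keep_unsponsored } }

# Mirrors the tail of extract_entries_wrapped: return everything unless a Proc
# was supplied, otherwise keep the entries for which the Proc is truthy.
filter = rules[:entries][:filter]
kept = filter.is_a?(Proc) ? entries.filter(&filter) : entries

puts kept.map { |entry| entry["title"] }.inspect
```

As written in the diff, the filter is only consulted on the wrapped path, so it seems to have no effect unless an `:entries` rule is given. It also runs after `format_content`, so the Proc sees the processed and wrapped values rather than raw nodes.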
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Feedstock
4
- VERSION = "0.2.0"
4
+ VERSION = "0.3.0"
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: feedstock
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.0
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Michael Camilleri
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2021-02-05 00:00:00.000000000 Z
11
+ date: 2021-02-06 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -80,8 +80,8 @@ dependencies:
80
80
  - - ">="
81
81
  - !ruby/object:Gem::Version
82
82
  version: '0'
83
- description: Feedstock is a library for extracting information from a webpage and
84
- transforming it into an Atom feed.
83
+ description: Feedstock is a Ruby library for extracting information from an HTML/XML
84
+ document and inserting it into an ERB template.
85
85
  email:
86
86
  - mike@inqk.net
87
87
  executables: []
@@ -99,6 +99,7 @@ homepage: https://github.com/pyrmont/feedstock/
99
99
  licenses:
100
100
  - Unlicense
101
101
  metadata:
102
+ documentation_uri: https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md
102
103
  allowed_push_host: https://rubygems.org
103
104
  post_install_message:
104
105
  rdoc_options: []