feedstock 0.2.0 → 0.4.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: bd230949cb75ce2edb5a9ea48c7902370a9932a2b91d0e12c5e30208cf917157
- data.tar.gz: c4ba1a7f4af881899edc582b38074070fd3f97cbbe7fde37ecdc0f42b7152eb0
+ metadata.gz: f1a02c229edb1b2d7c98904d6263aab47cfe5ef4d605c5a3c78ec412c1bb2083
+ data.tar.gz: f0c35d3a675eeb01cbbc73952b85f484df631459a1542161d2b198b8c3b1ccf8
  SHA512:
- metadata.gz: eaae22277fe1a4084e7560bf1dbf8946e7bf3e76956260dbb3f9fb0883ff72b34b708e1e1f50dfcb941d6eb1952ad2254917079e8137397fafef8844b767fbbc
- data.tar.gz: 15d91c5390ffdda38e58e7fb13201486f9605ac20cf9a60195edb27c54dafbfb31b5a7ea38a3cc0daa2716f9ec1e9fc208c5a769a837dc9ea88a1c131df72a6b
+ metadata.gz: b31805cfc5c8aedaabf286f2a76b269df4883daf24850121a94edb349a219c1d7177e161df20a434d07b17eed3bd129c7529d94093559fc5520fe19bc0dc2b45
+ data.tar.gz: b47eac95bda32a4a5a7a7a4a904d4baeb6b1055a18e39f09d1d2ed38c1c7b0a6ab9d0672db4f614b210d5bb1969afbd3de962292d6ccf750895e00f6a4b13d6c
data/README.md CHANGED
@@ -5,25 +5,30 @@
  [gem-badge]: https://badge.fury.io/rb/feedstock.svg
  [gem-link]: https://rubygems.org/gems/feedstock
 
- Feedstock is a Ruby library for extracting information from a webpage and
- converting it into an Atom feed.
+ Feedstock is a Ruby library for extracting information from an HTML/XML document
+ and inserting it into an ERB template. Its primary purpose is to create a feed
+ for a webpage that doesn't offer one.
 
  ## Rationale
 
- Feeds are great. But sometimes a website doesn't provide a feed or doesn't
- provide a feed for the specific content that you want. That's where Feedstock
- can help.
+ I love RSS feeds.
 
- Feedstock is a Ruby library that you can use to create an Atom feed. It takes a
- URL to the webpage to check and a hash of rules. The rules tell Feedstock how to
- extract and transform the data it finds on the webpage.
+ That's why I think it's a shame not every website has a feed. However, even when
+ a website does have a feed, sometimes it doesn't include quite the mix
+ information that I want. I made Feedstock to solve those two problems.
+
+ Feedstock is a Ruby library that you can use to create an Atom or RSS feed. It
+ requires a URL to a document and a hash of rules. The rules tell Feedstock how
+ to extract and transform the data found on the webpage. That data is stuffed
+ into a hash and then run through an ERB template. Feedstock comes with a
+ template but you can use your own, too.
 
  ## Example
 
- The [feeds.inqk.net repository][example] includes an example of how the Feedstock
- library can be used in practice.
+ The [feeds.inqk.net repository][example] includes an example of how the
+ Feedstock library can be used in practice.
 
- [example]: https://github.com/pyrmont/feeds.inqk.net/tree/4a95a438f8d3a707db7946238181ab76c029ee77/src/input
+ [example]: https://github.com/pyrmont/feeds.inqk.net/
  "An example of using the Feedstock library"
 
  ## Installation
@@ -36,169 +41,42 @@ $ gem install feedstock
 
  ## Usage
 
- Feedstock extracts information from a given document using a collection of
- _rules_.
-
- A collection of rules is expressed as a hash. The hash has two mandatory keys
- and one optional key.
-
- ### Info
-
- The `:info` key is mandatory. It must be associated with a hash. In this
- README, this hash is referred to as the _info hash_.
-
- #### Keys
-
- The keys in the info hash should be symbols, not strings. When used with the
- default template, Feedstock will use the key as the name of the XML entity in
- the resulting feed. For example, if the key is `:id`, the XML entity in the
- resulting feed will be `<id>`.
-
- #### Values
-
- The value associated with each key in the info hash can be either a string or a
- hash.
-
- ##### String
-
- If the value is a string, this defines a path to a node in the document. The
- path is expressed using CSS's selector syntax. Although a CSS selector can match
- more than one node, when used in the info hash, a path will only match the first
- matching node in the document.
-
- ##### Hash
-
- If the value is a hash, this is a _data hash_. A data hash defines the rules
- that Feedstock uses to extract data. It must contain one of two keys:
-
- `:literal`: The value associated with this key is used for the content of the
- XML entity. This can be useful for elements that are not on the page or that
- don't change.
-
- `:path`: The path to the node in the document expressed in CSS's selector
- syntax. As noted above, if the value of a key in the info hash is a string,
- this is treated as a path. The reason to use a data hash with a `:path` key
- is when using one or more of the keys below. In the info hash, a path matches
- only the first matching node in the document.
-
- The following keys may also be defined in a data hash:
-
- `:content`: The default is `nil`. The `:content` key can be set to
- `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
- value is `"inner_html"`, Feedstock will extract the content of the node as
- HTML. If the value is an attribute hash, Feedstock will extract the value of
- that attribute. This is important for links, where the link itself is
- typically the content of the `href` attribute rather than the content of the
- `<a>` element. For all other values, the plaintext content of the node is
- extracted.
-
- `:processor`: The default is `nil`. The `:processor` key can be set to a
- lambda function that takes two arguments. The first is the extracted content,
- the second is the rule being processed. The content extracted by Feedstock for
- the given path is processed by the processor.
-
- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
- the prefix is appended to the beginning of the content extracted.
-
- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
- the suffix is appended to the end of the content extracted.
-
- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
- If the value is `"datetime"`, the content is parsed by the [Timeliness
- library][Timeliness] to return a string. If the value is `"cdata"`, the
- content is wrapped in `<![CDATA[` and `]]>` tags.
+ Feedstock extracts information from a document at a given _URL_ using a
+ collection of _rules_. The feed is generated by calling `Feedstock.feed` as
+ below:
 
- [Timeliness]: https://github.com/adzap/timeliness "The official repository for
- the Timeliness library"
+ ```ruby
+ # Define the URL
+ url = "https://example.org"
 
- #### Formatting Order
+ # Define the rules
+ rules = { info: { id: url,
+ title: Feedstock::Extract.new(selector: "div.title"),
+ updated: Feedstock::Extract.new(selector: "span.date") },
 
- The order for formatting content is: extract, process, wrapping.
+ entry: { id: Feedstock::Extract.new(selector: "a", content: { attribute: "href" }),
+ title: Feedstock::Extract.new(selector: "h2"),
+ updated: Feedstock::Extract.new(selector: "span.date"),
+ author: Feedstock::Extract.new(selector: "span.byline"),
+ link: Feedstock::Extract.new(selector: "a", content: { attribute: "href" }),
+ summary: Feedstock::Extract.new(selector: "div.summary") },
 
- ### Entry
+ entries: Feedstock::Extract.new(selector: "div.story") }
 
- The `:entry` key is mandatory. It must be associated with a hash. In this
- README, this hash is referred to as the _entry hash_.
+ # Using the default format and template
+ Feedstock.feed url, rules
 
- #### Keys
-
- The keys in the entry hash should be symbols, not strings. When used with the
- default template, Feedstock will use the key as the name of the XML entity in
- the resulting feed. For example, if the key is `"id"`, the XML entity in the
- resulting feed will be `<id>`.
-
- #### Values
-
- The value associated with each key in the entry hash can be either a string or a
- hash.
-
- ##### String
-
- If the value is a string, this defines a path to a node in the document. The
- path is expressed using CSS's selector syntax. Unlike with the info hash, a
- the CSS selector will match all nodes.
-
- ##### Hash
-
- If the value is a hash, this is a _data hash_. A data hash defines the
- rules that Feedstock uses to extract data. It must contain one of two keys:
-
- `:literal`: The value associated with this key is used for the content of the
- XML entity. This can be useful for elements that are not on the page or that
- don't change.
-
- `:path`: The path to the node in the document expressed in CSS's selector
- syntax. Unlike with the info hash, the CSS selector will match all nodes.
-
- The following keys may also be defined in a data hash:
-
- `:content`: The default is `nil`. The `:content` key can be set to
- `"inner_html"` or a _hash_ of the form `{attribute: "<attribute>"}`. If the
- value is `"inner_html"`, Feedstock will extract the content of the node as
- HTML. If the value is an attribute hash, Feedstock will extract the value of
- that attribute. This is important for links, where the link itself is
- typically the content of the `href` attribute rather than the content of the
- `<a>` element. For all other values, the plaintext content of the node is
- extracted.
-
- `:repeat`: The default is `nil`. If repeat is set to `true`, Feedstock will
- use the content provided by either `:literal` or `:path` repeatedly. Since
- the value of `:literal` implies `:repeat`, it is not necessary to specify it
- expressly.
-
- `:processor`: The default is `nil`. The `:processor` key can be set to a
- lambda function that takes two arguments. The first is the extracted content,
- the second is the rule being processed. The content extracted by Feedstock for
- the given path is processed by the processor.
-
- `:prefix`: The default is `nil`. If a prefix is provided, the string value of
- the prefix is appended to the beginning of the content extracted.
-
- `:suffix`: The default is `nil`. If a suffix is provided, the string value of
- the suffix is appended to the end of the content extracted.
-
- `:type`: The default is `nil`. A user may specify `"datetime"` or `"cdata"`.
- If the value is `"datetime"`, the content is parsed by the [Timeliness
- library][Timeliness] to return a string. If the value is `"cdata"`, the
- content is wrapped in `<![CDATA[` and `]]>` tags.
-
- ### Entries
-
- The `:entries` key is optional. It can be associated with a hash. In this
- README, this hash is referred to as the _entries hash_.
-
- The entries hash is offered as a convenience. It allows a user to simplify
- the paths used in the entry hash by omitting a reference to the node
- containing the entries.
+ # Using the XML format and a user-specified template
+ Feedstock.feed url, rules, :xml, "podcast.xml"
+ ```
 
- If an entries hash is provided, it must contain the following key:
+ More information is available in [api.md].
 
- `:path`: The path to the node in the document expressed in CSS's selector
- syntax. This path is used as the root for the paths in the entry hash.
+ [api.md]: https://github.com/pyrmont/feedstock/blob/master/api.md
 
  ## Bugs
 
- Found a bug? I'd love to know about it. The best way is to report them in the
+ Found a bug? I'd love to know about it. The best way is to report it in the
  [Issues section][ghi] on GitHub.
 
  [ghi]: https://github.com/pyrmont/feedstock/issues
@@ -211,7 +89,6 @@ Feedstock uses [Semantic Versioning 2.0.0][sv2].
 
  ## Licence
 
- Feedstock is released into the public domain. See [LICENSE.md][lc] for more
- details.
+ Feedstock is released into the public domain. See [LICENSE][] for more details.
 
- [lc]: https://github.com/pyrmont/feedstock/blob/master/LICENSE.md
+ [LICENSE]: https://github.com/pyrmont/feedstock/blob/master/LICENSE
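The new README text says the extracted data is put into a hash and run through an ERB template, and that a custom template can be used instead of the bundled one. The following is a minimal sketch of what such a template might look like. It is rendered with the same two ERB calls that `create_feed` uses in `lib/feedstock.rb` (`trim_mode: "-"` and `result_with_hash`). The template text, the RSS layout and the sample `info`/`entries` values are assumptions made for illustration; they are not the gem's `default.xml`, which this diff does not show.

```ruby
require "erb"

# Hypothetical template text; not the gem's bundled default.xml.
template_text = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title><%= info["title"] %></title>
      <link><%= info["id"] %></link>
      <%- entries.each do |entry| -%>
      <item>
        <title><%= entry["title"] %></title>
        <link><%= entry["link"] %></link>
      </item>
      <%- end -%>
    </channel>
  </rss>
XML

# Hand-written sample data in the shape the extraction code builds:
# an info hash and an array of entry hashes, both keyed by strings
# (the extractors store each value under name.to_s).
info = { "id" => "https://example.org", "title" => "Example" }
entries = [
  { "title" => "First story", "link" => "https://example.org/1" },
  { "title" => "Second story", "link" => "https://example.org/2" }
]

# The same rendering calls as create_feed in lib/feedstock.rb.
template = ERB.new template_text, trim_mode: "-"
feed = template.result_with_hash info: info, entries: entries

puts feed
```

With the gem itself, a template like this would be saved to a file and its path passed as the fourth argument to `Feedstock.feed`, as in the README's `"podcast.xml"` call. Note that ERB does not escape values, so a real template may need to escape or CDATA-wrap extracted text.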
data/feedstock.gemspec CHANGED
@@ -9,12 +9,15 @@ Gem::Specification.new do |s|
  s.email = ["mike@inqk.net"]
  s.summary = "A library for creating RSS feeds from webpages"
  s.description = <<-desc.strip.gsub(/\s+/, " ")
- Feedstock is a library for extracting information from a webpage and
- transforming it into an Atom feed.
+ Feedstock is a Ruby library for extracting information from an HTML/XML
+ document and inserting it into an ERB template.
  desc
  s.homepage = "https://github.com/pyrmont/feedstock/"
  s.licenses = "Unlicense"
  s.required_ruby_version = ">= 2.7"
+ s.metadata = {
+ "documentation_uri" => "https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md"
+ }
 
  s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
  "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Feedstock
- VERSION = "0.2.0"
+ VERSION = "0.4.0"
  end
data/lib/feedstock.rb CHANGED
@@ -6,148 +6,156 @@ require "open-uri"
  require "timeliness"
 
  module Feedstock
- def self.feed(url, rules, template_file = "#{__dir__}/../default.xml")
- rules = normalise_rules rules
- page = download_page url
- info = extract_info page, rules
- entries = extract_entries page, rules
- feed = create_feed info, entries, template_file
-
- feed
- end
+ class Extract < Struct.new("Extract", :selector, :absolute, :content, :processor, :prefix,
+ :suffix, :type, :filter, keyword_init: true); end
 
- def self.create_feed(info, entries, template_file)
- template = ERB.new File.read(template_file), trim_mode: "-"
- template.result_with_hash info: info, entries: entries
- end
+ class << self
+ def data(url, rules, format = :html)
+ page = download_page url, format
 
- def self.download_page(url)
- Nokogiri::HTML URI.open(url)
- end
+ info = extract_info page, rules
+ entries = extract_entries page, rules
 
- def self.extract_entries(page, rules)
- if rules[:entries]
- extract_entries_wrapped page, rules
- else
- extract_entries_unwrapped page, rules
+ { info: info, entries: entries }
  end
- end
 
- def self.extract_entries_unwrapped(page, rules)
- static = Hash.new
- entries = Array.new
+ def feed(url, rules, format = :html, template_file = "#{__dir__}/../default.xml")
+ info, entries = data(url, rules, format).values_at(:info, :entries)
+
+ create_feed info, entries, template_file
+ end
+
+ private def create_feed(info, entries, template_file)
+ template = ERB.new File.read(template_file), trim_mode: "-"
+ template.result_with_hash info: info, entries: entries
+ end
 
- rules[:entry].each do |name, rule|
- if rule[:literal]
- static[name.to_s] = rule[:literal]
- elsif rule[:repeat]
- static[name.to_s] = format_content page.at_css(rule[:path]), rule
+ private def download_page(url, format)
+ case format
+ when :html
+ Nokogiri::HTML URI.open(url)
+ when :xml
+ Nokogiri::XML URI.open(url)
  else
- page.css(rule[:path]).each.with_index do |match, i|
- entries[i] = Hash.new if entries[i].nil?
- entries[i].merge!({ name.to_s => format_content(match, rule) })
- end
+ raise "Format not recognised"
  end
  end
 
- unless static.empty?
- entries.each{ |entry| entry.merge!(static) }
+ private def extract_content(node, rule)
+ case rule.content
+ in { attribute: attribute }
+ node[attribute]
+ in "inner_html"
+ node.inner_html
+ in "html" | "xml"
+ node.to_s
+ else
+ node.content.strip
+ end
  end
 
- entries
- end
+ private def extract_entries(page, rules)
+ if rules[:entries]
+ extract_entries_wrapped page, rules
+ else
+ extract_entries_unwrapped page, rules
+ end
+ end
 
- def self.extract_entries_wrapped(page, rules)
- entries = Array.new
+ private def extract_entries_unwrapped(page, rules)
+ static = Hash.new
+ entries = Array.new
 
- page.css(rules[:entries][:path]).each.with_index do |node, i|
  rules[:entry].each do |name, rule|
- entries[i] = Hash.new if entries[i].nil?
-
- content = if rule[:literal]
- rule[:literal]
- elsif rule[:repeat]
- format_content page.at_css(rule[:path]), rule
- else
- format_content node.at_css(rule[:path]), rule
- end
+ if rule.is_a? String
+ static[name.to_s] = rule
+ elsif rule.absolute
+ static[name.to_s] = format_content page.at_css(rule.selector), rule
+ else
+ page.css(rule.selector).each.with_index do |match, i|
+ entries[i] = Hash.new if entries[i].nil?
+ entries[i].merge!({ name.to_s => format_content(match, rule) })
+ end
+ end
+ end
 
- entries[i].merge!({ name.to_s => content })
+ unless static.empty?
+ entries.each{ |entry| entry.merge!(static) }
  end
+
+ entries
  end
 
- entries
- end
+ private def extract_entries_wrapped(page, rules)
+ entries = Array.new
 
- def self.extract_info(page, rules)
- info = Hash.new
+ page.css(rules[:entries].selector).each.with_index do |parent, i|
+ rules[:entry].each do |name, rule|
+ entries[i] = Hash.new if entries[i].nil?
 
- rules[:info].each do |name, rule|
- if rule[:literal]
- info[name.to_s] = rule[:literal]
- else
- info[name.to_s] = format_content page.at_css(rule[:path]), rule
+ content = if rule.is_a? String
+ rule
+ elsif rule.absolute
+ format_content page.at_css(rule.selector), rule
+ elsif rule.selector.empty?
+ format_content parent, rule
+ else
+ format_content parent.at_css(rule.selector), rule
+ end
+
+ entries[i].merge!({ name.to_s => content })
+ end
  end
- end
 
- info
- end
-
- def self.format_content(match, rule)
- return "" if match.nil?
 
- text = extract_content match, rule
- processed = process_content text, rule
- wrapped = wrap_content processed, rule
+ return entries unless rules[:entries].filter.is_a? Proc
 
- case rule[:type]
- when "cdata"
- "<![CDATA[#{wrapped}]]>"
- when "datetime"
- "#{Timeliness.parse(wrapped)&.iso8601}"
- else
- wrapped
+ entries.filter(&rules[:entries].filter)
  end
- end
 
- def self.normalise_rules(rules)
- rules.keys.each do |category|
- case category
- when :info, :entry
- rules[category].each do |name, rule|
- rules[category][name] = { :path => rule } unless rule.is_a? Hash
+ private def extract_info(page, rules)
+ info = Hash.new
+
+ rules[:info].each do |name, rule|
+ if rule.is_a? String
+ info[name.to_s] = rule
+ else
+ info[name.to_s] = format_content page.at_css(rule.selector), rule
  end
- when :entries
- rule = rules[category]
- rules[category] = { :path => rule } unless rule.is_a? Hash
  end
+
+ info
  end
 
- rules
- end
+ private def format_content(match, rule)
+ return "" if match.nil?
+
+ text = extract_content match, rule
+ processed = process_content text, rule
+ wrapped = wrap_content processed, rule
 
- def self.extract_content(node, rule)
- case rule[:content]
- in { attribute: attribute }
- node[attribute]
- in "inner_html"
- node.inner_html
- else
- node.content.strip
+ case rule.type
+ when "cdata"
+ "<![CDATA[#{wrapped}]]>"
+ when "datetime"
+ "#{Timeliness.parse(wrapped)&.iso8601}"
+ else
+ wrapped
+ end
  end
- end
 
- def self.process_content(content, rule)
- if rule[:processor]
- rule[:processor].call content, rule
- else
- content
+ private def process_content(content, rule)
+ if rule.processor
+ rule.processor.call content, rule
+ else
+ content
+ end
  end
- end
 
- def self.wrap_content(content, rule)
- return content unless rule[:prepend] || rule[:append]
+ private def wrap_content(content, rule)
+ return content unless (rule.prefix || rule.suffix)
 
- "#{rule[:prepend]}#{content}#{rule[:append]}"
+ "#{rule.prefix}#{content}#{rule.suffix}"
+ end
  end
  end
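Read side by side, the removed and added lines in `lib/feedstock.rb` suggest a mapping from 0.2.0 rules to 0.4.0 rules. A bare string used to be a CSS path (via the removed `normalise_rules`) and is now taken as a literal value. `:literal` becomes a plain string, `:path` becomes `selector`, and `:repeat` is handled where `absolute` is now checked. The sketch below expresses that reading as code. `upgrade_rule` and `upgrade_rules` are hypothetical helpers, not part of Feedstock, and `Extract` is a local stand-in with the same members as `Feedstock::Extract` so the block runs without the gem.

```ruby
# Local stand-in with the same members as Feedstock::Extract in the diff,
# so this sketch runs without the gem installed.
Extract = Struct.new(:selector, :absolute, :content, :processor, :prefix,
                     :suffix, :type, :filter, keyword_init: true)

# Hypothetical helper (not part of Feedstock): convert one 0.2.0-style rule
# into what appears to be its 0.4.0 equivalent.
def upgrade_rule(rule)
  # 0.2.0 treated a bare string as a CSS path; 0.4.0 treats it as a literal,
  # so an old string rule has to become an Extract with a selector.
  return Extract.new(selector: rule) if rule.is_a? String

  # An old :literal rule becomes a plain string.
  return rule[:literal] if rule[:literal]

  Extract.new(
    selector: rule[:path],
    absolute: rule[:repeat], # checked where :repeat used to be
    content: rule[:content],
    processor: rule[:processor],
    # The 0.2.0 README documented :prefix/:suffix, but its wrap_content
    # read :prepend/:append. Accept either spelling here.
    prefix: rule[:prefix] || rule[:prepend],
    suffix: rule[:suffix] || rule[:append],
    type: rule[:type]
  )
end

# Hypothetical helper: upgrade a whole rules hash (:info, :entry, :entries).
def upgrade_rules(rules)
  rules.to_h do |category, value|
    if category == :entries
      [category, upgrade_rule(value)]
    else
      [category, value.transform_values { |rule| upgrade_rule(rule) }]
    end
  end
end

old_rules = {
  info: { id: { literal: "https://example.org" }, title: "div.title" },
  entry: { title: "h2",
           link: { path: "a", content: { attribute: "href" } },
           author: { path: "span.site-author", repeat: true } },
  entries: "div.story"
}

new_rules = upgrade_rules(old_rules)
p new_rules[:info][:id]
p new_rules[:entry][:link]
```

One thing the helper cannot fix: a processor lambda still receives `(content, rule)`, but `rule` is now an `Extract` rather than a hash, so processors that inspect the rule may need adjusting by hand.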
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: feedstock
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.4.0
  platform: ruby
  authors:
  - Michael Camilleri
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2021-02-05 00:00:00.000000000 Z
+ date: 2025-02-21 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri
@@ -80,8 +79,8 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- description: Feedstock is a library for extracting information from a webpage and
- transforming it into an Atom feed.
+ description: Feedstock is a Ruby library for extracting information from an HTML/XML
+ document and inserting it into an ERB template.
  email:
  - mike@inqk.net
  executables: []
@@ -99,8 +98,8 @@ homepage: https://github.com/pyrmont/feedstock/
  licenses:
  - Unlicense
  metadata:
+ documentation_uri: https://github.com/pyrmont/feedstock/blob/v0.3.0/api.md
  allowed_push_host: https://rubygems.org
- post_install_message:
  rdoc_options: []
  require_paths:
  - lib
@@ -115,8 +114,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.2.3
- signing_key:
+ rubygems_version: 3.6.2
  specification_version: 4
  summary: A library for creating RSS feeds from webpages
  test_files: []