feedstock 0.1.0 → 0.1.1

Sign up to get free protection for your applications and to get access to all the features.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 1c87214f9b1fca1686eb03c1364b62f1d8c716e9998c695edb13273e4bc4f3bd
4
- data.tar.gz: 34dce9c661854f0225e9036102b3c52cf205fd750a454bc42e54b538670fe2d1
3
+ metadata.gz: 60dc0bcb05928b59220fe1ed6ac24487428ceef5279f454bce047d3b3a94a56d
4
+ data.tar.gz: 91d7a161cdd3aedaf2316082b9f5bbae2fa48dcfd9d599cd65ee746673afc2b8
5
5
  SHA512:
6
- metadata.gz: 0bcd9b33a1216473daedd4cefab2506e7cf2a763345c6b94d5fb37e0cf6876142cf1269a4229c53d8b5d1118663ad6f4350787e818aefee52d441b3b27fe1581
7
- data.tar.gz: da15a4cfd0c7df236ce98fa6e580c268b05ad5ee4d9750fe8beaa0c88a48711790dd4e73dd2c91d5fee870b3e0846816e64b1c4220f7c579b3f3f78b8083f330
6
+ metadata.gz: 4513cbec520821710ed756544b1a1f7797498a5769395333e19a96c39a466cf83291863944973020ad33a9adb21b58ae2370c3bf08a72843683df1025432fc7c
7
+ data.tar.gz: 327995486920781894858903f4d510b958ee85b04a9aebf7139a4934f73c8f5af446b8eda6a06bf0e158692c4de278ebfc007b7f729f6bca953c67ea8eaff432
data/README.md CHANGED
@@ -21,6 +21,150 @@ Feedstock is available as a gem:
21
21
  $ gem install feedstock
22
22
  ```
23
23
 
24
+ ## Usage
25
+
26
+ Feedstock extracts information from a given document using a collection of
27
+ _rules_.
28
+
29
+ A collection of rules is expressed as a hash. The hash has two mandatory keys
30
+ and one optional key.
31
+
32
+ ### Info
33
+
34
+ The `"info"` key is mandatory. It must be associated with a hash. This document
35
+ refers to this hash as the 'info hash'.
36
+
37
+ #### Keys
38
+
39
+ The keys in the info hash are strings (not symbols). When used with the default
40
+ template, Feedstock will use the key as the name of the XML entity in the
41
+ resulting feed. For example, if the key is `"id"`, the XML entity in the
42
+ resulting feed will be `<id>`.
43
+
44
+ #### Values
45
+
46
+ The value associated with each key in the info hash can be either a string or a
47
+ hash.
48
+
49
+ ##### String
50
+
51
+ If the value is a string, this defines a path to a node in the document. The
52
+ path is expressed using CSS's selector syntax. Although a CSS selector can match
53
+ more than one node, when used in the info hash, a path will only match the first
54
+ matching node in the document.
55
+
56
+ ##### Hash
57
+
58
+ If the value is a hash, this is the 'data hash'. The data hash defines the
59
+ rules that Feedstock uses to extract data. It must contain one of two keys:
60
+
61
+ - `"literal"`: The value associated with this key is used for the content of the
62
+ XML entity. This can be useful for elements that are not on the page or that
63
+ don't change.
64
+
65
+ - `"path"`: The path to the node in the document expressed in CSS's selector
66
+ syntax. As noted above, if the value of a key in the info hash is a string,
67
+ this is treated as a path. The reason to use a data hash with a `"path"` key
68
+ is when using one or more of the keys below. In the info hash, a path matches
69
+ only the first matching node in the document.
70
+
71
+ The following keys may also be defined in a data hash:
72
+
73
+ - `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
74
+ will extract the content of the attribute rather than the content of the node.
75
+ This is important for links, where the link itself is typically the content of
76
+ the `href` attribute rather than the content of the `<a>` element.
77
+
78
+ - `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
79
+ the prefix is appended to the beginning of the content extracted.
80
+
81
+ - `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
82
+ the suffix is appended to the end of the content extracted.
83
+
84
+ - `"type"`: The default is `nil`. This causes Feedstock to extract only the text
85
+ in a node (stripping out all HTML). However, a user may specify `"datetime"`
86
+ or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
87
+ library][Timeliness] (this is bundled with Feedstock) to return a string.
88
+ `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
89
+ tags.
90
+
91
+ [Timeliness]: https://github.com/adzap/timeliness "The official repository for
92
+ the Timeliness library"
93
+
94
+ ### Entry
95
+
96
+ The `"entry"` key is mandatory. It must be associated with a hash. This document
97
+ refers to this hash as the 'entry hash'.
98
+
99
+ #### Keys
100
+
101
+ The keys in the entry hash are strings (not symbols). When used with the default
102
+ template, Feedstock will use the key as the name of the XML entity in the
103
+ resulting feed. For example, if the key is `"id"`, the XML entity in the
104
+ resulting feed will be `<id>`.
105
+
106
+ #### Values
107
+
108
+ The value associated with each key in the entry hash can be either a string or a
109
+ hash.
110
+
111
+ ##### String
112
+
113
+ If the value is a string, this defines a path to a node in the document. The
114
+ path is expressed using CSS's selector syntax. Unlike with the info hash, a
115
+ the CSS selector will match all nodes.
116
+
117
+ ##### Hash
118
+
119
+ If the value is a hash, we call this the "data hash". The data hash defines the
120
+ rules that Feedstock uses to extract data. It must contain one of two keys:
121
+
122
+ - `"literal"`: The value associated with this key is used for the content of the
123
+ XML entity. This can be useful for elements that are not on the page or that
124
+ don't change.
125
+
126
+ - `"path"`: The path to the node in the document expressed in CSS's selector
127
+ syntax. Unlike with the info hash, the CSS selector will match all nodes.
128
+
129
+ The following keys may also be defined in a data hash:
130
+
131
+ - `"attribute"`: The default is `nil`. If an attribute is provided, Feedstock
132
+ will extract the content of the attribute rather than the content of the node.
133
+ This is important for links, where the link itself is typically the content of
134
+ the `href` attribute rather than the content of the `<a>` element.
135
+
136
+ - `"infix"`: The default is `nil`. If the entries hash has been provided (see
137
+ below), then the string value of the infix is inserted between the content of
138
+ each matching node. If the entries hash not been provided, this is ignored.
139
+
140
+ - `"prefix"`: The default is `nil`. If a prefix is provided, the string value of
141
+ the prefix is appended to the beginning of the content extracted.
142
+
143
+ - `"repeat"`: The default is `nil`. If repeat is set to `true`, Feedstock will
144
+ use the content provided by either `"literal"` or `"path"` repeatedly. Since
145
+ the value of `"literal"` implies `"repeat"`, it is not necessary to specify it
146
+ expressly.
147
+
148
+ - `"suffix"`: The default is `nil`. If a suffix is provided, the string value of
149
+ the suffix is appended to the end of the content extracted.
150
+
151
+ - `"type"`: The default is `nil`. This causes Feedstock to extract only the text
152
+ in a node (stripping out all HTML). However, a user may specify `"datetime"`
153
+ or `"cdata"`. `"datetime"` content is parsed by [the Timeliness
154
+ library][Timeliness] (this is bundled with Feedstock) to return a string.
155
+ `"cdata"` content includes any HTML and is wrapped in `<![CDATA[` and `]]>`
156
+ tags.
157
+
158
+ ### Entries
159
+
160
+ The `"entries"` key is optional. It can be associated with a hash. This document
161
+ refers to this hash as the 'entries hash'.
162
+
163
+ If an entries hash is provided, it must contain the following key:
164
+
165
+ - `"path"`: The path to the node in the document expressed in CSS's selector
166
+ syntax. This path is used as the root for the paths in the entry hash.
167
+
24
168
  ## Bugs
25
169
 
26
170
  Found a bug? I'd love to know about it. The best way is to report them in the
@@ -30,12 +174,13 @@ Found a bug? I'd love to know about it. The best way is to report them in the
30
174
 
31
175
  ## Versioning
32
176
 
33
- Tenter uses [Semantic Versioning 2.0.0][sv2].
177
+ Feedstock uses [Semantic Versioning 2.0.0][sv2].
34
178
 
35
179
  [sv2]: http://semver.org/
36
180
 
37
181
  ## Licence
38
182
 
39
- Tenter is released into the public domain. See [LICENSE.md][lc] for more details.
183
+ Feedstock is released into the public domain. See [LICENSE.md][lc] for more
184
+ details.
40
185
 
41
186
  [lc]: https://github.com/pyrmont/feedstock/blob/master/LICENSE.md
data/feedstock.gemspec CHANGED
@@ -19,13 +19,13 @@ Gem::Specification.new do |s|
19
19
  s.files = Dir["Gemfile", "default.xml", "LICENSE", "README.md",
20
20
  "feedstock.gemspec", "lib/feedstock.rb", "lib/**/*.rb"]
21
21
  s.require_paths = ["lib"]
22
-
22
+
23
23
  s.metadata["allowed_push_host"] = "https://rubygems.org"
24
24
 
25
25
  s.add_runtime_dependency "nokogiri"
26
26
  s.add_runtime_dependency "timeliness"
27
27
 
28
- s.add_development_dependency "minitest"
28
+ s.add_development_dependency "minitest"
29
29
  s.add_development_dependency "rake"
30
30
  s.add_development_dependency "warning"
31
31
  end
data/lib/feedstock.rb CHANGED
@@ -22,14 +22,22 @@ module Feedstock
22
22
  end
23
23
 
24
24
  def self.download_page(url)
25
- Nokogiri::HTML open(url)
25
+ Nokogiri::HTML URI.open(url)
26
26
  end
27
27
 
28
28
  def self.extract_entries(page, rules)
29
+ if rules["entries"]
30
+ extract_entries_wrapped page, rules
31
+ else
32
+ extract_entries_unwrapped page, rules
33
+ end
34
+ end
35
+
36
+ def self.extract_entries_unwrapped(page, rules)
29
37
  static = Hash.new
30
38
  entries = Array.new
31
39
 
32
- rules['entries'].each do |name, rule|
40
+ rules["entry"].each do |name, rule|
33
41
  if rule["literal"]
34
42
  static[name] = rule["literal"]
35
43
  elsif rule["repeat"]
@@ -49,6 +57,28 @@ module Feedstock
49
57
  entries
50
58
  end
51
59
 
60
+ def self.extract_entries_wrapped(page, rules)
61
+ entries = Array.new
62
+
63
+ page.css(rules["entries"]["path"]).each.with_index do |node, i|
64
+ rules["entry"].each do |name, rule|
65
+ entries[i] = Hash.new if entries[i].nil?
66
+
67
+ content = if rule["literal"]
68
+ rule["literal"]
69
+ elsif rule["repeat"]
70
+ format_content page.at_css(rule["path"]), rule
71
+ else
72
+ format_content node.at_css(rule["path"]), rule
73
+ end
74
+
75
+ entries[i].merge!({ name => content })
76
+ end
77
+ end
78
+
79
+ entries
80
+ end
81
+
52
82
  def self.extract_info(page, rules)
53
83
  info = Hash.new
54
84
 
@@ -84,8 +114,14 @@ module Feedstock
84
114
 
85
115
  def self.normalise_rules(rules)
86
116
  rules.keys.each do |category|
87
- rules[category].each do |name, rule|
88
- rules[category][name] = { "path" => rule } unless rule.is_a? Hash
117
+ case category
118
+ when "info", "entry"
119
+ rules[category].each do |name, rule|
120
+ rules[category][name] = { "path" => rule } unless rule.is_a? Hash
121
+ end
122
+ when "entries"
123
+ rule = rules[category]
124
+ rules[category] = { "path" => rule } unless rule.is_a? Hash
89
125
  end
90
126
  end
91
127
 
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Feedstock
4
- VERSION = "0.1.0"
4
+ VERSION = "0.1.1"
5
5
  end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: feedstock
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.1
5
5
  platform: ruby
6
6
  authors:
7
7
  - Michael Camilleri
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2019-09-18 00:00:00.000000000 Z
11
+ date: 2021-02-04 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
@@ -115,7 +115,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
115
115
  - !ruby/object:Gem::Version
116
116
  version: '0'
117
117
  requirements: []
118
- rubygems_version: 3.0.3
118
+ rubygems_version: 3.1.2
119
119
  signing_key:
120
120
  specification_version: 4
121
121
  summary: A library for creating RSS feeds from webpages