html2rss 0.9.0 → 0.11.0
- checksums.yaml +4 -4
- data/README.md +323 -270
- data/exe/html2rss +6 -0
- data/html2rss.gemspec +18 -23
- data/lib/html2rss/attribute_post_processors/gsub.rb +30 -8
- data/lib/html2rss/attribute_post_processors/html_to_markdown.rb +7 -2
- data/lib/html2rss/attribute_post_processors/html_transformers/transform_urls_to_absolute_ones.rb +27 -0
- data/lib/html2rss/attribute_post_processors/html_transformers/wrap_img_in_a.rb +41 -0
- data/lib/html2rss/attribute_post_processors/markdown_to_html.rb +11 -2
- data/lib/html2rss/attribute_post_processors/parse_time.rb +11 -4
- data/lib/html2rss/attribute_post_processors/parse_uri.rb +12 -2
- data/lib/html2rss/attribute_post_processors/sanitize_html.rb +40 -44
- data/lib/html2rss/attribute_post_processors/substring.rb +14 -4
- data/lib/html2rss/attribute_post_processors/template.rb +36 -12
- data/lib/html2rss/attribute_post_processors.rb +28 -5
- data/lib/html2rss/cli.rb +29 -0
- data/lib/html2rss/config/channel.rb +117 -0
- data/lib/html2rss/config/selectors.rb +91 -0
- data/lib/html2rss/config.rb +71 -82
- data/lib/html2rss/item.rb +122 -46
- data/lib/html2rss/item_extractors/attribute.rb +20 -7
- data/lib/html2rss/item_extractors/href.rb +20 -4
- data/lib/html2rss/item_extractors/html.rb +18 -6
- data/lib/html2rss/item_extractors/static.rb +18 -7
- data/lib/html2rss/item_extractors/text.rb +17 -5
- data/lib/html2rss/item_extractors.rb +75 -10
- data/lib/html2rss/object_to_xml_converter.rb +56 -0
- data/lib/html2rss/rss_builder/channel.rb +21 -0
- data/lib/html2rss/rss_builder/item.rb +83 -0
- data/lib/html2rss/rss_builder/stylesheet.rb +37 -0
- data/lib/html2rss/rss_builder.rb +96 -0
- data/lib/html2rss/utils.rb +94 -19
- data/lib/html2rss/version.rb +5 -1
- data/lib/html2rss.rb +57 -20
- metadata +53 -165
- data/.gitignore +0 -12
- data/.rspec +0 -4
- data/.rubocop.yml +0 -164
- data/.travis.yml +0 -25
- data/.yardopts +0 -6
- data/CHANGELOG.md +0 -221
- data/Gemfile +0 -8
- data/Gemfile.lock +0 -139
- data/bin/console +0 -15
- data/bin/setup +0 -8
- data/lib/html2rss/feed_builder.rb +0 -81
- data/lib/html2rss/item_extractors/current_time.rb +0 -21
- data/support/logo.png +0 -0
data/README.md
CHANGED
@@ -1,38 +1,51 @@
|
|
1
|
-

|
2
2
|
|
3
|
-
[](http://rubygems.org/gems/html2rss/)
|
5
|
-
[](https://coveralls.io/github/gildesmarais/html2rss?branch=master)
|
6
|
-
[](https://www.rubydoc.info/gems/html2rss)
|
7
|
-

|
8
|
-
[](https://liberapay.com/gildesmarais/donate)
|
3
|
+
[](http://rubygems.org/gems/html2rss/) [](https://www.rubydoc.info/gems/html2rss) 
|
9
4
|
|
10
|
-
|
11
|
-
[Head over to `html2rss-web`!](https://github.com/gildesmarais/html2rss-web)
|
5
|
+
`html2rss` is a Ruby gem that generates RSS 2.0 feeds from a _feed config_.
|
12
6
|
|
13
|
-
|
7
|
+
With the _feed config_, you provide a URL to scrape and CSS selectors for extracting information (like title, URL, etc.). The gem builds the RSS feed accordingly. [Extractors](#using-extractors) and chainable [post processors](#using-post-processors) make information extraction, processing, and sanitizing a breeze. The gem also supports [scraping JSON](#scraping-and-handling-json-responses) responses and [setting HTTP request headers](#set-any-http-header-in-the-request).
|
14
8
|
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
make information extraction, processing and sanitizing a breeze.
|
19
|
-
[Scraping JSON](#scraping-and-handling-json-responses) responses and
|
20
|
-
[setting HTTP request headers](#set-any-http-header-in-the-request) is
|
21
|
-
supported, too.
|
9
|
+
**Looking for a ready-to-use app to serve generated feeds via HTTP?** [Check out `html2rss-web`](https://github.com/html2rss/html2rss-web)!
|
10
|
+
|
11
|
+
Support the development by sponsoring this project on GitHub. Thank you! 💓
|
22
12
|
|
23
13
|
## Installation
|
24
14
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
29
|
-
|
15
|
+
| Install | `gem install html2rss` |
|
16
|
+
| ------- | ---------------------- |
|
17
|
+
| Usage | `html2rss help` |
|
18
|
+
|
19
|
+
You can also install it as a dependency in your Ruby project:
|
30
20
|
|
31
|
-
|
21
|
+
| 🤩 Like it? | Star it! ⭐️ |
|
22
|
+
| -------------------------------: | -------------------- |
|
23
|
+
| Add this line to your `Gemfile`: | `gem 'html2rss'` |
|
24
|
+
| Then execute: | `bundle` |
|
25
|
+
| In your code: | `require 'html2rss'` |
|
32
26
|
|
33
|
-
##
|
27
|
+
## Generating a feed on the CLI
|
34
28
|
|
35
|
-
|
29
|
+
Create a file called `my_config_file.yml` with this example content:
|
30
|
+
|
31
|
+
```yml
|
32
|
+
channel:
|
33
|
+
  url: https://stackoverflow.com/questions
|
34
|
+
selectors:
|
35
|
+
  items:
|
36
|
+
    selector: "#hot-network-questions > ul > li"
|
37
|
+
  title:
|
38
|
+
    selector: a
|
39
|
+
  link:
|
40
|
+
    selector: a
|
41
|
+
    extractor: href
|
42
|
+
```
|
43
|
+
|
44
|
+
Build the RSS with: `html2rss feed ./my_config_file.yml`.
|
45
|
+
|
46
|
+
## Generating a feed with Ruby
|
47
|
+
|
48
|
+
Here's a minimal working example in Ruby:
|
36
49
|
|
37
50
|
```ruby
|
38
51
|
require 'html2rss'
|
@@ -50,54 +63,86 @@ rss =
|
|
50
63
|
puts rss
|
51
64
|
```
|
52
65
|
|
53
|
-
|
54
|
-
|
66
|
+
## The _feed config_ and its options
|
67
|
+
|
68
|
+
A _feed config_ consists of a `channel` and a `selectors` hash. The contents of both hashes are explained below.
|
69
|
+
|
70
|
+
Good to know:
|
71
|
+
|
72
|
+
- You'll find extensive example feed configs at [`spec/*.test.yml`](https://github.com/html2rss/html2rss/tree/master/spec).
|
73
|
+
- See [`html2rss-configs`](https://github.com/html2rss/html2rss-configs) for ready-made feed configs!
|
74
|
+
- If you've created feed configs, you're invited to send a PR to [`html2rss-configs`](https://github.com/html2rss/html2rss-configs) to make your config available to the public.
|
55
75
|
|
56
|
-
|
76
|
+
Alright, let's move on.
|
57
77
|
|
58
78
|
### The `channel`
|
59
79
|
|
60
|
-
| attribute |
|
61
|
-
| ------------- |
|
62
|
-
| `url` | required | String | | |
|
63
|
-
| `title` | optional
|
64
|
-
| `description` | optional
|
65
|
-
| `ttl` | optional
|
66
|
-
| `time_zone` | optional
|
67
|
-
| `language` | optional
|
68
|
-
| `author` | optional
|
69
|
-
| `headers` | optional
|
70
|
-
| `json` | optional
|
80
|
+
| attribute | | type | default | remark |
|
81
|
+
| ------------- | ------------ | ------- | -------------- | ------------------------------------------ |
|
82
|
+
| `url` | **required** | String | | |
|
83
|
+
| `title` | optional | String | auto-generated | |
|
84
|
+
| `description` | optional | String | auto-generated | |
|
85
|
+
| `ttl` | optional | Integer | `360` | TTL in _minutes_ |
|
86
|
+
| `time_zone` | optional | String | `'UTC'` | TimeZone name |
|
87
|
+
| `language` | optional | String | `'en'` | Language code |
|
88
|
+
| `author` | optional | String | | Format: `email (Name)` |
|
89
|
+
| `headers` | optional | Hash | `{}` | Set HTTP request headers. See notes below. |
|
90
|
+
| `json` | optional | Boolean | `false` | Handle JSON response. See notes below. |
|
91
|
+
|
92
|
+
#### Dynamic parameters in `channel` attributes
|
93
|
+
|
94
|
+
Sometimes there are structurally similar pages with different URLs. In such cases, you can add _dynamic parameters_ to the channel's attributes.
|
95
|
+
|
96
|
+
Example of a dynamic `id` parameter in the channel URLs:
|
97
|
+
|
98
|
+
```yml
|
99
|
+
channel:
|
100
|
+
  url: "http://domainname.tld/whatever/%<id>s.html"
|
101
|
+
```
|
102
|
+
|
103
|
+
Command line usage example:
|
104
|
+
|
105
|
+
```sh
|
106
|
+
bundle exec html2rss feed the_feed_config.yml id=42
|
107
|
+
```
|
108
|
+
|
109
|
+
<details><summary>See a Ruby example</summary>
|
110
|
+
|
111
|
+
```ruby
|
112
|
+
config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
|
113
|
+
Html2rss.feed(config)
|
114
|
+
```
|
115
|
+
|
116
|
+
</details>
|
117
|
+
|
118
|
+
See the more complex formatting options of the [`sprintf` method](https://ruby-doc.org/core/Kernel.html#method-i-sprintf).
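
As a minimal sketch of how such a dynamic parameter can be filled in, the snippet below uses Ruby's own `format` (an alias of `sprintf`) with named references. The variable names are illustrative assumptions; how `html2rss` applies the parameters internally is not shown here.

```ruby
# Kernel#format replaces named references such as %<id>s with values from a hash.
url_template = 'http://domainname.tld/whatever/%<id>s.html'
params = { id: 42 } # e.g. the `id=42` argument from the command line example

url = format(url_template, params)
puts url # => http://domainname.tld/whatever/42.html

# The usual sprintf flags work with named references too, e.g. zero padding.
padded = format('page-%<id>05d', id: 42)
puts padded # => page-00042
```

If the template references a name that is missing from the hash, `format` raises a `KeyError`.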
|
71
119
|
|
72
120
|
### The `selectors`
|
73
121
|
|
74
|
-
|
75
|
-
|
76
|
-
|
77
|
-
|
78
|
-
|
79
|
-
|
80
|
-
|
81
|
-
|
82
|
-
|
83
|
-
|
84
|
-
|
85
|
-
|
|
86
|
-
|
|
87
|
-
| `
|
88
|
-
| `
|
89
|
-
| `
|
90
|
-
| `
|
91
|
-
| `
|
92
|
-
| `
|
93
|
-
| `
|
94
|
-
| `guid` | `guid` | Generated from the `title`. |
|
95
|
-
| `comments` | `comments` | A URL. |
|
96
|
-
| `source` | ~~source~~ | Not yet supported. |
|
122
|
+
First, you must give an **`items`** selector hash, which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the `items` selector, all other keys are scoped to each item of the collection.
|
123
|
+
|
124
|
+
To build a [valid RSS 2.0 item](http://www.rssboard.org/rss-profile#element-channel-item), you need at least a `title` **or** a `description`. You can have both.
|
125
|
+
|
126
|
+
Having an `items` and a `title` selector is enough to build a simple feed.
|
127
|
+
|
128
|
+
Your `selectors` hash can contain arbitrary named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):
|
129
|
+
|
130
|
+
| RSS 2.0 tag | name in `html2rss` | remark |
|
131
|
+
| ------------- | ------------------ | ------------------------------------------- |
|
132
|
+
| `title` | `title` | |
|
133
|
+
| `description` | `description` | Supports HTML. |
|
134
|
+
| `link` | `link` | A URL. |
|
135
|
+
| `author` | `author` | |
|
136
|
+
| `category` | `categories` | See notes below. |
|
137
|
+
| `guid` | `guid` | Default title/description. See notes below. |
|
138
|
+
| `enclosure` | `enclosure` | See notes below. |
|
139
|
+
| `pubDate` | `updated` | An instance of `Time`. |
|
140
|
+
| `comments` | `comments` | A URL. |
|
141
|
+
| `source` | ~~source~~ | Not yet supported. |
|
97
142
|
|
98
143
|
### The `selector` hash
|
99
144
|
|
100
|
-
|
145
|
+
Every named selector in your `selectors` hash can have these attributes:
|
101
146
|
|
102
147
|
| name | value |
|
103
148
|
| -------------- | -------------------------------------------------------- |
|
@@ -105,26 +150,6 @@ Your selector hash can have these attributes:
|
|
105
150
|
| `extractor` | Name of the extractor. See notes below. |
|
106
151
|
| `post_process` | A hash or array of hashes. See notes below. |
|
107
152
|
|
108
|
-
#### Reverse ordering of items
|
109
|
-
|
110
|
-
The `items` selector hash can have an `order` attribute.
|
111
|
-
If the value is `reverse` the order of items in the RSS will be reversed.
|
112
|
-
|
113
|
-
<details>
|
114
|
-
<summary>See a YAML feed config example</summary>
|
115
|
-
|
116
|
-
```yml
|
117
|
-
channel:
|
118
|
-
# ... omitted
|
119
|
-
selectors:
|
120
|
-
items:
|
121
|
-
selector: 'ul > li'
|
122
|
-
order: 'reverse'
|
123
|
-
# ... omitted
|
124
|
-
```
|
125
|
-
|
126
|
-
</details>
|
127
|
-
|
128
153
|
## Using extractors
|
129
154
|
|
130
155
|
Extractors help with extracting the information from the selected HTML tag.
|
@@ -134,13 +159,11 @@ Extractors help with extracting the information from the selected HTML tag.
|
|
134
159
|
- The `href` extractor returns a URL from the tag's `href` attribute and corrects relative ones to absolute ones.
|
135
160
|
- The `attribute` extractor returns the value of that tag's attribute.
|
136
161
|
- The `static` extractor returns the configured static value (it doesn't extract anything).
|
137
|
-
- [See file list of extractors](https://github.com/
|
162
|
+
- [See file list of extractors](https://github.com/html2rss/html2rss/tree/master/lib/html2rss/item_extractors).
|
138
163
|
|
139
|
-
Extractors
|
140
|
-
👉 [Read their docs for usage examples](https://www.rubydoc.info/gems/html2rss/Html2rss/ItemExtractors).
|
164
|
+
Extractors might need extra attributes on the selector hash. 👉 [Read their docs for usage examples](https://www.rubydoc.info/gems/html2rss/Html2rss/ItemExtractors).
|
141
165
|
|
142
|
-
<details>
|
143
|
-
<summary>See a Ruby example</summary>
|
166
|
+
<details><summary>See a Ruby example</summary>
|
144
167
|
|
145
168
|
```ruby
|
146
169
|
Html2rss.feed(
|
@@ -150,17 +173,16 @@ Html2rss.feed(
|
|
150
173
|
|
151
174
|
</details>
|
152
175
|
|
153
|
-
<details>
|
154
|
-
<summary>See a YAML feed config example</summary>
|
176
|
+
<details><summary>See a YAML feed config example</summary>
|
155
177
|
|
156
178
|
```yml
|
157
179
|
channel:
|
158
|
-
|
180
|
+
# ... omitted
|
159
181
|
selectors:
|
160
|
-
|
182
|
+
# ... omitted
|
161
183
|
link:
|
162
|
-
selector:
|
163
|
-
extractor:
|
184
|
+
selector: "a"
|
185
|
+
extractor: "href"
|
164
186
|
```
|
165
187
|
|
166
188
|
</details>
|
@@ -182,48 +204,11 @@ Extracted information can be further manipulated with post processors.
|
|
182
204
|
|
183
205
|
⚠️ Always make use of the `sanitize_html` post processor for HTML content. _Never trust the internet!_ ⚠️
|
184
206
|
|
185
|
-
- [See file list of post processors](https://github.com/gildesmarais/html2rss/tree/master/lib/html2rss/attribute_post_processors).
|
186
|
-
|
187
|
-
👉 [Read their docs for usage examples.](https://www.rubydoc.info/gems/html2rss/Html2rss/AttributePostProcessors)
|
188
|
-
|
189
|
-
<details>
|
190
|
-
<summary>See a Ruby example</summary>
|
191
|
-
|
192
|
-
```ruby
|
193
|
-
Html2rss.feed(
|
194
|
-
channel: {},
|
195
|
-
selectors: {
|
196
|
-
description: {
|
197
|
-
selector: '.content', post_process: { name: 'sanitize_html' }
|
198
|
-
}
|
199
|
-
}
|
200
|
-
)
|
201
|
-
```
|
202
|
-
|
203
|
-
</details>
|
204
|
-
|
205
|
-
<details>
|
206
|
-
<summary>See a YAML feed config example</summary>
|
207
|
-
|
208
|
-
```yml
|
209
|
-
channel:
|
210
|
-
# ... omitted
|
211
|
-
selectors:
|
212
|
-
# ... omitted
|
213
|
-
description:
|
214
|
-
selector: '.content'
|
215
|
-
post_process:
|
216
|
-
- name: sanitize_html
|
217
|
-
```
|
218
|
-
|
219
|
-
</details>
|
220
|
-
|
221
207
|
### Chaining post processors
|
222
208
|
|
223
209
|
Pass an array to `post_process` to chain the post processors.
|
224
210
|
|
225
|
-
<details>
|
226
|
-
<summary>YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML</summary>
|
211
|
+
<details><summary>YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML</summary>
|
227
212
|
|
228
213
|
```yml
|
229
214
|
channel:
|
@@ -243,7 +228,44 @@ selectors:
|
|
243
228
|
- name: markdown_to_html
|
244
229
|
```
|
245
230
|
|
246
|
-
|
231
|
+
</details>
|
232
|
+
|
233
|
+
### Post processor `gsub`
|
234
|
+
|
235
|
+
The post processor `gsub` makes use of Ruby's [`gsub`](https://apidock.com/ruby/String/gsub) method.
|
236
|
+
|
237
|
+
| key | type | required | note |
|
238
|
+
| ------------- | ------ | -------- | --------------------------- |
|
239
|
+
| `pattern` | String | yes | Can be Regexp or String. |
|
240
|
+
| `replacement` | String | yes | Can be a [backreference](). |
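
For orientation, here is a small sketch of plain Ruby `String#gsub`, which this post processor builds on. It shows a String pattern and a Regexp pattern with backreferences. The sample title is made up, and how `html2rss` turns a configured `pattern` String into a Regexp is not covered here.

```ruby
title = 'html2rss 0.11.0 released'

# A String pattern replaces every literal occurrence.
plain = title.gsub('released', 'is out')
puts plain # => html2rss 0.11.0 is out

# A Regexp pattern can capture groups; '\1' and '\2' in the
# replacement are backreferences to those groups.
shortened = title.gsub(/(\d+)\.(\d+)\.(\d+)/, 'v\1.\2')
puts shortened # => html2rss v0.11 released
```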
|
241
|
+
|
242
|
+
<details><summary>See a Ruby example</summary>
|
243
|
+
|
244
|
+
```ruby
|
245
|
+
Html2rss.feed(
|
246
|
+
channel: {},
|
247
|
+
selectors: {
|
248
|
+
title: { selector: 'a', post_process: [{ name: 'gsub', pattern: 'foo', replacement: 'bar' }] }
|
249
|
+
}
|
250
|
+
)
|
251
|
+
```
|
252
|
+
|
253
|
+
</details>
|
254
|
+
|
255
|
+
<details><summary>See a YAML feed config example</summary>
|
256
|
+
|
257
|
+
```yml
|
258
|
+
channel:
|
259
|
+
  # ... omitted
|
260
|
+
selectors:
|
261
|
+
  # ... omitted
|
262
|
+
  title:
|
263
|
+
    selector: "a"
|
264
|
+
    post_process:
|
265
|
+
      - name: "gsub"
|
266
|
+
        pattern: "foo"
|
267
|
+
        replacement: "bar"
|
268
|
+
```
|
247
269
|
|
248
270
|
</details>
|
249
271
|
|
@@ -290,65 +312,74 @@ selectors:
|
|
290
312
|
|
291
313
|
</details>
|
292
314
|
|
293
|
-
##
|
294
|
-
|
295
|
-
An enclosure can be any file, e.g. a image, audio or video.
|
296
|
-
|
297
|
-
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
|
315
|
+
## Custom item GUID
|
298
316
|
|
299
|
-
|
317
|
+
By default, html2rss generates a GUID from the `title` or `description`.
|
300
318
|
|
301
|
-
|
302
|
-
|
303
|
-
3. The content-length will always be undetermined and thus stated as `0` bytes.
|
319
|
+
If this does not work well, you can choose other attributes from which the GUID is built.
|
320
|
+
The principle is the same as for the categories: pass an array of selector names.
|
304
321
|
|
305
|
-
|
322
|
+
In all cases, the GUID is a SHA1-hashed string.
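
The idea can be sketched with Ruby's standard `Digest::SHA1`. The helper name `sketch_guid`, the sample item, and the way the values are joined are assumptions made for illustration; the exact input `html2rss` hashes may differ.

```ruby
require 'digest/sha1'

# Hypothetical helper: build a GUID from the values of the given selector names.
def sketch_guid(item, selector_names)
  values = selector_names.map { |name| item.fetch(name).to_s }
  Digest::SHA1.hexdigest(values.join)
end

item = { title: 'Headline', link: 'https://example.com/article' }

guid_from_link = sketch_guid(item, %i[link])
guid_from_title = sketch_guid(item, %i[title])

puts guid_from_link.length # => 40
```

The same input always yields the same GUID, so feed readers can recognise an item they have already seen, while a change in any chosen attribute produces a new GUID.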
|
306
323
|
|
307
|
-
<details>
|
308
|
-
<summary>See a Ruby example</summary>
|
324
|
+
<details><summary>See a Ruby example</summary>
|
309
325
|
|
310
326
|
```ruby
|
311
327
|
Html2rss.feed(
|
312
328
|
channel: {},
|
313
329
|
selectors: {
|
314
|
-
|
330
|
+
title: {
|
331
|
+
# ... omitted
|
332
|
+
selector: 'h1'
|
333
|
+
},
|
334
|
+
link: { selector: 'a', extractor: 'href' },
|
335
|
+
guid: %i[link]
|
315
336
|
}
|
316
337
|
)
|
317
338
|
```
|
318
339
|
|
319
340
|
</details>
|
320
341
|
|
321
|
-
<details>
|
322
|
-
<summary>See a YAML feed config example</summary>
|
342
|
+
<details><summary>See a YAML feed config example</summary>
|
323
343
|
|
324
344
|
```yml
|
325
345
|
channel:
|
326
346
|
# ... omitted
|
327
347
|
selectors:
|
328
|
-
|
329
|
-
|
330
|
-
selector: "
|
331
|
-
|
332
|
-
|
348
|
+
# ... omitted
|
349
|
+
title:
|
350
|
+
selector: "h1"
|
351
|
+
link:
|
352
|
+
selector: "a"
|
353
|
+
extractor: "href"
|
354
|
+
guid:
|
355
|
+
- link
|
333
356
|
```
|
334
357
|
|
335
358
|
</details>
|
336
359
|
|
337
|
-
##
|
360
|
+
## Adding an `<enclosure>` tag to an item
|
338
361
|
|
339
|
-
|
362
|
+
An enclosure can be any file, e.g. an image, audio, or video (think podcasts).
|
340
363
|
|
341
|
-
|
364
|
+
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
|
365
|
+
|
366
|
+
Since `html2rss` does no further inspection of the enclosure, its support comes with trade-offs:
|
367
|
+
|
368
|
+
1. The content-type is guessed from the file extension of the URL.
|
369
|
+
2. If the content-type guessing fails, it will default to `application/octet-stream`.
|
370
|
+
3. The content-length will always be undetermined and therefore stated as `0` bytes.
|
371
|
+
|
372
|
+
Read the [RSS 2.0 spec](http://www.rssboard.org/rss-profile#element-channel-item-enclosure) for further information on enclosing content.
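
The trade-offs listed above can be sketched in a few lines of standard-library Ruby. The helper `sketch_enclosure`, the tiny `HYPOTHETICAL_MIME_TYPES` table, and the example URLs are assumptions for illustration only; the gem's real lookup is likely more complete.

```ruby
require 'uri'

# Hypothetical, deliberately tiny extension-to-content-type table.
HYPOTHETICAL_MIME_TYPES = {
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4',
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png'
}.freeze

# Hypothetical helper: resolve the URL against the channel URL, guess the
# content type from the file extension, and state the length as 0.
def sketch_enclosure(channel_url, extracted_url)
  absolute = URI.join(channel_url, extracted_url)
  extension = File.extname(absolute.path).downcase

  {
    url: absolute.to_s,
    type: HYPOTHETICAL_MIME_TYPES.fetch(extension, 'application/octet-stream'),
    length: 0
  }
end

known = sketch_enclosure('https://example.com/podcast/', '/media/episode-1.mp3')
unknown = sketch_enclosure('https://example.com/podcast/', 'files/archive.bin')

puts known[:url]    # => https://example.com/media/episode-1.mp3
puts known[:type]   # => audio/mpeg
puts unknown[:url]  # => https://example.com/podcast/files/archive.bin
puts unknown[:type] # => application/octet-stream
```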
|
342
373
|
|
343
374
|
<details>
|
344
375
|
<summary>See a Ruby example</summary>
|
345
376
|
|
346
377
|
```ruby
|
347
378
|
Html2rss.feed(
|
348
|
-
channel: {
|
349
|
-
|
350
|
-
|
351
|
-
|
379
|
+
channel: {},
|
380
|
+
selectors: {
|
381
|
+
enclosure: { selector: 'audio', extractor: 'attribute', attribute: 'src' }
|
382
|
+
}
|
352
383
|
)
|
353
384
|
```
|
354
385
|
|
@@ -357,130 +388,88 @@ Html2rss.feed(
|
|
357
388
|
<details>
|
358
389
|
<summary>See a YAML feed config example</summary>
|
359
390
|
|
360
|
-
```
|
391
|
+
```yml
|
361
392
|
channel:
|
362
|
-
|
363
|
-
json: true
|
393
|
+
# ... omitted
|
364
394
|
selectors:
|
365
395
|
# ... omitted
|
396
|
+
enclosure:
|
397
|
+
selector: "audio"
|
398
|
+
extractor: "attribute"
|
399
|
+
attribute: "src"
|
366
400
|
```
|
367
401
|
|
368
402
|
</details>
|
403
|
+
## Scraping and handling JSON responses
|
369
404
|
|
370
|
-
|
371
|
-
<summary>See example of a converted JSON object</summary>
|
405
|
+
By default, `html2rss` assumes the URL responds with HTML. However, it can also handle JSON responses. The top-level JSON value must be an Array or a Hash (a JSON object).
|
372
406
|
|
373
|
-
|
407
|
+
| key | required | default | note |
|
408
|
+
| ---------- | -------- | ------- | ---------------------------------------------------- |
|
409
|
+
| `json` | optional | false | If set to `true`, the response is parsed as JSON. |
|
410
|
+
| `jsonpath` | optional | $ | Use [JSONPath syntax]() to select nodes of interest. |
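
To make the "Array or Hash" requirement concrete, here is a sketch using only Ruby's standard `json` library. The sample response body is hypothetical, and the way `html2rss` exposes the parsed data to your selectors is not shown here.

```ruby
require 'json'

# Hypothetical response body; the top-level value is a Hash (JSON object).
response_body = '{"data": [{"title": "Headline", "url": "https://example.com"}]}'

parsed = JSON.parse(response_body)

# Anything other than an Array or Hash at the top level would not qualify.
unless parsed.is_a?(Array) || parsed.is_a?(Hash)
  raise ArgumentError, 'expected a JSON array or object at the top level'
end

titles = parsed.fetch('data').map { |entry| entry.fetch('title') }
puts titles.inspect # => ["Headline"]
```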
|
374
411
|
|
375
|
-
|
376
|
-
{
|
377
|
-
"data": [{ "title": "Headline", "url": "https://example.com" }]
|
378
|
-
}
|
379
|
-
```
|
412
|
+
<details><summary>See a Ruby example</summary>
|
380
413
|
|
381
|
-
|
382
|
-
|
383
|
-
|
384
|
-
|
385
|
-
|
386
|
-
<datum>
|
387
|
-
<title>Headline</title>
|
388
|
-
<url>https://example.com</url>
|
389
|
-
</datum>
|
390
|
-
</data>
|
391
|
-
</hash>
|
414
|
+
```ruby
|
415
|
+
Html2rss.feed(
|
416
|
+
channel: { url: 'http://domainname.tld/whatever.json', json: true },
|
417
|
+
selectors: { title: { selector: 'foo' } }
|
418
|
+
)
|
392
419
|
```
|
393
420
|
|
394
|
-
Your items selector would be `data > datum`, the item's `link` selector would be `url`.
|
395
|
-
|
396
|
-
Find further information in [ActiveSupport's `Hash.to_xml` documentation](https://apidock.com/rails/Hash/to_xml).
|
397
|
-
|
398
421
|
</details>
|
399
422
|
|
400
|
-
<details>
|
401
|
-
<summary>See example of a converted JSON array</summary>
|
423
|
+
<details><summary>See a YAML feed config example</summary>
|
402
424
|
|
403
|
-
|
404
|
-
|
405
|
-
|
406
|
-
|
407
|
-
|
408
|
-
|
409
|
-
|
410
|
-
|
411
|
-
```xml
|
412
|
-
<objects>
|
413
|
-
<object>
|
414
|
-
<title>Headline</title>
|
415
|
-
<url>https://example.com</url>
|
416
|
-
</object>
|
417
|
-
</objects>
|
425
|
+
```yml
|
426
|
+
channel:
|
427
|
+
  url: "http://domainname.tld/whatever.json"
|
428
|
+
  json: true
|
429
|
+
selectors:
|
430
|
+
  title:
|
431
|
+
    selector: "foo"
|
418
432
|
```
|
419
433
|
|
420
|
-
Your items selector would be `objects > object`, the item's `link` selector would be `url`.
|
421
|
-
|
422
|
-
Find further information in [ActiveSupport's `Array.to_xml` documentation](https://apidock.com/rails/Array/to_xml).
|
423
|
-
|
424
434
|
</details>
|
425
435
|
|
426
436
|
## Set any HTTP header in the request
|
427
437
|
|
428
|
-
|
429
|
-
Use this to e.g. have Cookie or Authorization information sent or to spoof the User-Agent.
|
438
|
+
To set HTTP request headers, you can add them to the channel's `headers` hash. This is useful for APIs that require an Authorization header.
|
430
439
|
|
431
|
-
|
432
|
-
<summary>See a Ruby example</summary>
|
433
|
-
|
434
|
-
```ruby
|
435
|
-
Html2rss.feed(
|
436
|
-
channel: {
|
437
|
-
url: 'https://example.com',
|
438
|
-
headers: {
|
439
|
-
"User-Agent": "html2rss-request",
|
440
|
-
"X-Something": "Foobar",
|
441
|
-
"Authorization": "Token deadbea7",
|
442
|
-
"Cookie": "monster=MeWantCookie"
|
443
|
-
}
|
444
|
-
},
|
445
|
-
selectors: {}
|
446
|
-
)
|
447
|
-
```
|
448
|
-
|
449
|
-
</details>
|
450
|
-
|
451
|
-
<details>
|
452
|
-
<summary>See a YAML feed config example</summary>
|
453
|
-
|
454
|
-
```yaml
|
440
|
+
```yml
|
455
441
|
channel:
|
456
|
-
url: https://example.com
|
442
|
+
url: "https://example.com/api/resource"
|
457
443
|
headers:
|
458
|
-
|
459
|
-
"X-Something": "Foobar"
|
460
|
-
"Authorization": "Token deadbea7"
|
461
|
-
"Cookie": "monster=MeWantCookie"
|
444
|
+
Authorization: "Bearer YOUR_TOKEN"
|
462
445
|
selectors:
|
463
|
-
|
446
|
+
# ... omitted
|
464
447
|
```
|
465
448
|
|
466
|
-
|
449
|
+
Or for setting a User-Agent:
|
467
450
|
|
468
|
-
|
451
|
+
```yml
|
452
|
+
channel:
|
453
|
+
url: "https://example.com"
|
454
|
+
headers:
|
455
|
+
User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
456
|
+
selectors:
|
457
|
+
# ... omitted
|
458
|
+
```
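
A hash like the `headers` examples above maps directly onto HTTP request headers. The sketch below uses Ruby's standard `Net::HTTP` purely for illustration. That choice is an assumption and not necessarily the HTTP client `html2rss` itself uses. The code only builds the request object and never sends it.

```ruby
require 'net/http'
require 'uri'

# Header names and values as they might appear in a channel's `headers` hash.
headers = {
  'Authorization' => 'Bearer YOUR_TOKEN',
  'User-Agent' => 'html2rss-request'
}

uri = URI('https://example.com/api/resource')

# Build (but do not send) a GET request carrying the configured headers.
request = Net::HTTP::Get.new(uri, headers)

# Header lookup is case-insensitive.
puts request['authorization'] # => Bearer YOUR_TOKEN
puts request['user-agent']    # => html2rss-request
```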
|
469
459
|
|
470
460
|
## Usage with a YAML config file
|
471
461
|
|
472
462
|
This step is not required to work with this gem. If you're using
|
473
|
-
[`html2rss-web`](https://github.com/
|
463
|
+
[`html2rss-web`](https://github.com/html2rss/html2rss-web)
|
474
464
|
and want to create your private feed configs, keep on reading!
|
475
465
|
|
476
|
-
First, create
|
477
|
-
This file will contain your global config and feed configs.
|
466
|
+
First, create a YAML file, e.g. `feeds.yml`. This file will contain your global config and multiple feed configs under the key `feeds`.
|
478
467
|
|
479
468
|
Example:
|
480
469
|
|
481
470
|
```yml
|
482
471
|
headers:
|
483
|
-
|
472
|
+
"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
|
484
473
|
feeds:
|
485
474
|
myfeed:
|
486
475
|
channel:
|
@@ -492,7 +481,12 @@ feeds:
|
|
492
481
|
|
493
482
|
Your feed configs go below `feeds`. Everything else is part of the global config.
|
494
483
|
|
495
|
-
|
484
|
+
Find a full example of a `feeds.yml` at [`spec/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/feeds.test.yml).
|
485
|
+
|
486
|
+
Now you can build your feeds like this:
|
487
|
+
|
488
|
+
<details>
|
489
|
+
<summary>Build feeds in Ruby</summary>
|
496
490
|
|
497
491
|
```ruby
|
498
492
|
require 'html2rss'
|
@@ -501,37 +495,96 @@ myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
|
|
501
495
|
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
|
502
496
|
```
|
503
497
|
|
504
|
-
|
498
|
+
</details>
|
505
499
|
|
506
|
-
|
500
|
+
<details>
|
501
|
+
<summary>Build feeds on the command line</summary>
|
507
502
|
|
508
|
-
|
509
|
-
|
510
|
-
|
511
|
-
|
503
|
+
```sh
|
504
|
+
html2rss feed feeds.yml myfeed
|
505
|
+
html2rss feed feeds.yml myotherfeed
|
506
|
+
```
|
507
|
+
|
508
|
+
</details>
|
512
509
|
|
513
|
-
##
|
510
|
+
## Display the RSS feed nicely in a web browser
|
514
511
|
|
515
|
-
|
516
|
-
|
512
|
+
To display RSS feeds nicely in a web browser, you can:
|
513
|
+
|
514
|
+
- add a plain old CSS stylesheet, or
|
515
|
+
- use XSLT (e**X**tensible **S**tylesheet **L**anguage **T**ransformations).
|
516
|
+
|
517
|
+
A web browser will apply these stylesheets and show the contents as described.
|
518
|
+
|
519
|
+
In a CSS stylesheet, you'd use `element` selectors to apply styles.
|
520
|
+
|
521
|
+
If you want to do more, then you need to create an XSLT. XSLT allows you
|
522
|
+
to use an HTML template and to freely design the information of the RSS,
|
523
|
+
including using JavaScript and external resources.
|
524
|
+
|
525
|
+
You can add as many stylesheets and types as you like. Just add them to your global configuration.
|
517
526
|
|
518
527
|
<details>
|
519
|
-
<summary>
|
520
|
-
|
521
|
-
|
522
|
-
|
523
|
-
|
524
|
-
|
525
|
-
|
526
|
-
|
527
|
-
|
528
|
-
|
529
|
-
|
530
|
-
|
531
|
-
|
528
|
+
<summary>Ruby: a stylesheet config example</summary>
|
529
|
+
|
530
|
+
```ruby
|
531
|
+
config = Html2rss::Config.new(
|
532
|
+
{ channel: {}, selectors: {} }, # omitted
|
533
|
+
{
|
534
|
+
stylesheets: [
|
535
|
+
{
|
536
|
+
href: '/relative/base/path/to/style.xsl',
|
537
|
+
media: :all,
|
538
|
+
type: 'text/xsl'
|
539
|
+
},
|
540
|
+
{
|
541
|
+
href: 'http://example.com/rss.css',
|
542
|
+
media: :all,
|
543
|
+
type: 'text/css'
|
544
|
+
}
|
545
|
+
]
|
546
|
+
}
|
547
|
+
)
|
548
|
+
|
549
|
+
Html2rss.feed(config)
|
550
|
+
```
|
532
551
|
|
533
552
|
</details>
|
534
553
|
|
535
|
-
|
554
|
+
<details>
|
555
|
+
<summary>YAML: a stylesheet config example</summary>
|
556
|
+
|
557
|
+
```yml
|
558
|
+
stylesheets:
|
559
|
+
- href: "/relative/base/path/to/style.xsl"
|
560
|
+
media: "all"
|
561
|
+
type: "text/xsl"
|
562
|
+
- href: "http://example.com/rss.css"
|
563
|
+
media: "all"
|
564
|
+
type: "text/css"
|
565
|
+
feeds:
|
566
|
+
# ... omitted
|
567
|
+
```
|
568
|
+
|
569
|
+
</details>
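
Browsers pick up stylesheets for XML documents through `xml-stylesheet` processing instructions placed before the root element. As a rough sketch, the entries of a `stylesheets` array could be rendered like this. The helper name `sketch_stylesheet_instruction` and the `/rss.xsl` path are hypothetical, and the markup `html2rss` actually emits may differ in detail.

```ruby
stylesheets = [
  { href: '/rss.xsl', media: :all, type: 'text/xsl' },
  { href: 'http://example.com/rss.css', media: :all, type: 'text/css' }
]

# Hypothetical helper: render one stylesheet hash as a processing instruction.
# String#encode(xml: :attr) escapes the value and wraps it in double quotes.
def sketch_stylesheet_instruction(stylesheet)
  attributes = stylesheet.map { |name, value| "#{name}=#{value.to_s.encode(xml: :attr)}" }
  "<?xml-stylesheet #{attributes.join(' ')}?>"
end

instructions = stylesheets.map { |stylesheet| sketch_stylesheet_instruction(stylesheet) }
puts instructions
# => <?xml-stylesheet href="/rss.xsl" media="all" type="text/xsl"?>
# => <?xml-stylesheet href="http://example.com/rss.css" media="all" type="text/css"?>
```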
|
570
|
+
|
571
|
+
Recommended further readings:
|
572
|
+
|
573
|
+
- [How to format RSS with CSS on lifewire.com](https://www.lifewire.com/how-to-format-rss-3469302)
|
574
|
+
- [XSLT: Extensible Stylesheet Language Transformations on MDN](https://developer.mozilla.org/en-US/docs/Web/XSLT)
|
575
|
+
- [The XSLT used by html2rss-web](https://github.com/html2rss/html2rss-web/blob/master/public/rss.xsl)
|
576
|
+
|
577
|
+
## Gotchas and tips & tricks
|
578
|
+
|
579
|
+
- Check that the channel URL does not redirect to a mobile page with a different markup structure.
|
580
|
+
- Do not rely on your web browser's developer console. `html2rss` does not execute JavaScript.
|
581
|
+
- Fiddling with [`curl`](https://github.com/curl/curl) and [`pup`](https://github.com/ericchiang/pup) to find the selectors seems efficient (`curl URL | pup`).
|
582
|
+
- [CSS selectors are versatile. Here's an overview.](https://www.w3.org/TR/selectors-4/#overview)
|
583
|
+
|
584
|
+
### Contributing
|
536
585
|
|
537
|
-
|
586
|
+
1. Fork it ( <https://github.com/html2rss/html2rss/fork> )
|
587
|
+
2. Create your feature branch (`git checkout -b my-new-feature`)
|
588
|
+
3. Commit your changes (`git commit -am 'Add some feature'`)
|
589
|
+
4. Push to the branch (`git push origin my-new-feature`)
|
590
|
+
5. Create a new Pull Request
|