html2rss 0.17.0 → 0.18.0
- checksums.yaml +4 -4
- data/README.md +48 -656
- data/exe/html2rss +1 -1
- data/html2rss.gemspec +5 -2
- data/lib/html2rss/articles/deduplicator.rb +49 -0
- data/lib/html2rss/auto_source/cleanup.rb +33 -5
- data/lib/html2rss/auto_source/scraper/html.rb +118 -43
- data/lib/html2rss/auto_source/scraper/json_state.rb +377 -0
- data/lib/html2rss/auto_source/scraper/microdata.rb +399 -0
- data/lib/html2rss/auto_source/scraper/schema/category_extractor.rb +102 -0
- data/lib/html2rss/auto_source/scraper/schema/item_list.rb +2 -2
- data/lib/html2rss/auto_source/scraper/schema/list_item.rb +3 -3
- data/lib/html2rss/auto_source/scraper/schema/thing.rb +48 -8
- data/lib/html2rss/auto_source/scraper/schema.rb +12 -8
- data/lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb +199 -0
- data/lib/html2rss/auto_source/scraper/semantic_html.rb +84 -79
- data/lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb +261 -0
- data/lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb +134 -0
- data/lib/html2rss/auto_source/scraper/wordpress_api.rb +179 -0
- data/lib/html2rss/auto_source/scraper.rb +142 -8
- data/lib/html2rss/auto_source.rb +119 -47
- data/lib/html2rss/blocked_surface.rb +64 -0
- data/lib/html2rss/category_extractor.rb +82 -0
- data/lib/html2rss/cli.rb +170 -23
- data/lib/html2rss/config/class_methods.rb +189 -0
- data/lib/html2rss/config/dynamic_params.rb +68 -0
- data/lib/html2rss/config/multiple_feeds_config.rb +50 -0
- data/lib/html2rss/config/request_headers.rb +130 -0
- data/lib/html2rss/config/schema.rb +208 -0
- data/lib/html2rss/config/validator.rb +108 -0
- data/lib/html2rss/config.rb +112 -61
- data/lib/html2rss/error.rb +6 -0
- data/lib/html2rss/html_extractor/date_extractor.rb +19 -0
- data/lib/html2rss/html_extractor/enclosure_extractor.rb +101 -0
- data/lib/html2rss/html_extractor/image_extractor.rb +49 -0
- data/lib/html2rss/html_extractor.rb +136 -0
- data/lib/html2rss/html_navigator.rb +46 -0
- data/lib/html2rss/json_feed_builder/item.rb +94 -0
- data/lib/html2rss/json_feed_builder.rb +58 -0
- data/lib/html2rss/rendering/audio_renderer.rb +31 -0
- data/lib/html2rss/rendering/description_builder.rb +88 -0
- data/lib/html2rss/rendering/image_renderer.rb +31 -0
- data/lib/html2rss/rendering/media_renderer.rb +33 -0
- data/lib/html2rss/rendering/pdf_renderer.rb +28 -0
- data/lib/html2rss/rendering/video_renderer.rb +31 -0
- data/lib/html2rss/rendering.rb +14 -0
- data/lib/html2rss/request_controls.rb +128 -0
- data/lib/html2rss/request_service/browserless_strategy.rb +103 -7
- data/lib/html2rss/request_service/budget.rb +39 -0
- data/lib/html2rss/request_service/context.rb +64 -20
- data/lib/html2rss/request_service/faraday_strategy.rb +135 -5
- data/lib/html2rss/request_service/policy.rb +248 -0
- data/lib/html2rss/request_service/puppet_commander.rb +212 -13
- data/lib/html2rss/request_service/response.rb +42 -2
- data/lib/html2rss/request_service/response_guard.rb +62 -0
- data/lib/html2rss/request_service.rb +31 -15
- data/lib/html2rss/request_session/rel_next_pager.rb +70 -0
- data/lib/html2rss/request_session/runtime_input.rb +57 -0
- data/lib/html2rss/request_session/runtime_policy.rb +76 -0
- data/lib/html2rss/request_session.rb +118 -0
- data/lib/html2rss/rss_builder/article.rb +166 -0
- data/lib/html2rss/rss_builder/channel.rb +96 -11
- data/lib/html2rss/rss_builder/enclosure.rb +48 -0
- data/lib/html2rss/rss_builder/stylesheet.rb +4 -4
- data/lib/html2rss/rss_builder.rb +72 -71
- data/lib/html2rss/selectors/config.rb +122 -0
- data/lib/html2rss/selectors/extractors/attribute.rb +50 -0
- data/lib/html2rss/selectors/extractors/href.rb +53 -0
- data/lib/html2rss/selectors/extractors/html.rb +48 -0
- data/lib/html2rss/selectors/extractors/static.rb +41 -0
- data/lib/html2rss/selectors/extractors/text.rb +46 -0
- data/lib/html2rss/selectors/extractors.rb +52 -0
- data/lib/html2rss/selectors/object_to_xml_converter.rb +61 -0
- data/lib/html2rss/selectors/post_processors/base.rb +74 -0
- data/lib/html2rss/selectors/post_processors/gsub.rb +85 -0
- data/lib/html2rss/selectors/post_processors/html_to_markdown.rb +45 -0
- data/lib/html2rss/selectors/post_processors/html_transformers/transform_urls_to_absolute_ones.rb +35 -0
- data/lib/html2rss/selectors/post_processors/html_transformers/wrap_img_in_a.rb +47 -0
- data/lib/html2rss/selectors/post_processors/markdown_to_html.rb +52 -0
- data/lib/html2rss/selectors/post_processors/parse_time.rb +73 -0
- data/lib/html2rss/selectors/post_processors/parse_uri.rb +40 -0
- data/lib/html2rss/selectors/post_processors/sanitize_html.rb +150 -0
- data/lib/html2rss/selectors/post_processors/substring.rb +74 -0
- data/lib/html2rss/selectors/post_processors/template.rb +73 -0
- data/lib/html2rss/selectors/post_processors.rb +43 -0
- data/lib/html2rss/selectors.rb +294 -0
- data/lib/html2rss/url.rb +262 -0
- data/lib/html2rss/version.rb +1 -1
- data/lib/html2rss.rb +129 -70
- data/lib/tasks/config_schema.rake +17 -0
- data/schema/html2rss-config.schema.json +469 -0
- metadata +115 -38
- data/lib/html2rss/attribute_post_processors/base.rb +0 -74
- data/lib/html2rss/attribute_post_processors/gsub.rb +0 -64
- data/lib/html2rss/attribute_post_processors/html_to_markdown.rb +0 -43
- data/lib/html2rss/attribute_post_processors/html_transformers/transform_urls_to_absolute_ones.rb +0 -27
- data/lib/html2rss/attribute_post_processors/html_transformers/wrap_img_in_a.rb +0 -41
- data/lib/html2rss/attribute_post_processors/markdown_to_html.rb +0 -50
- data/lib/html2rss/attribute_post_processors/parse_time.rb +0 -46
- data/lib/html2rss/attribute_post_processors/parse_uri.rb +0 -46
- data/lib/html2rss/attribute_post_processors/sanitize_html.rb +0 -115
- data/lib/html2rss/attribute_post_processors/substring.rb +0 -72
- data/lib/html2rss/attribute_post_processors/template.rb +0 -101
- data/lib/html2rss/attribute_post_processors.rb +0 -44
- data/lib/html2rss/auto_source/article.rb +0 -127
- data/lib/html2rss/auto_source/channel.rb +0 -78
- data/lib/html2rss/auto_source/reducer.rb +0 -48
- data/lib/html2rss/auto_source/rss_builder.rb +0 -70
- data/lib/html2rss/auto_source/scraper/semantic_html/extractor.rb +0 -136
- data/lib/html2rss/auto_source/scraper/semantic_html/image.rb +0 -54
- data/lib/html2rss/config/channel.rb +0 -125
- data/lib/html2rss/config/selectors.rb +0 -103
- data/lib/html2rss/item.rb +0 -186
- data/lib/html2rss/item_extractors/attribute.rb +0 -50
- data/lib/html2rss/item_extractors/href.rb +0 -52
- data/lib/html2rss/item_extractors/html.rb +0 -46
- data/lib/html2rss/item_extractors/static.rb +0 -39
- data/lib/html2rss/item_extractors/text.rb +0 -44
- data/lib/html2rss/item_extractors.rb +0 -88
- data/lib/html2rss/object_to_xml_converter.rb +0 -56
- data/lib/html2rss/rss_builder/item.rb +0 -83
- data/lib/html2rss/utils.rb +0 -113
data/README.md
CHANGED
|
@@ -1,686 +1,78 @@
|
|
|
1
1
|

|
|
2
2
|
|
|
3
|
-
[](http://rubygems.org/gems/html2rss
|
|
3
|
+
[](http://rubygems.org/gems/html2rss) [](https://www.rubydoc.info/gems/html2rss)  [](https://github.com/html2rss/html2rss/actions)
|
|
4
4
|
|
|
5
|
-
`html2rss` is a Ruby gem that generates RSS 2.0 feeds from websites
|
|
5
|
+
`html2rss` is a Ruby gem that generates RSS 2.0 feeds from websites by scraping HTML or JSON content with **CSS selectors** or **auto-detection**.
|
|
6
6
|
|
|
7
|
-
|
|
7
|
+
This gem is the core of the [html2rss-web](https://github.com/html2rss/html2rss-web) application.
|
|
8
8
|
|
|
9
|
-
|
|
9
|
+
Most people looking for a first working feed should start with `html2rss-web`, run it with Docker, and open one of the included feeds from their own instance before moving to custom configs or the gem APIs.
|
|
10
10
|
|
|
11
|
-
|
|
11
|
+
## Documentation
|
|
12
12
|
|
|
13
|
-
|
|
13
|
+
Detailed usage guides, reference docs, and the feed directory live on the project website:
|
|
14
14
|
|
|
15
|
-
[
|
|
15
|
+
- [Ruby gem documentation](https://html2rss.github.io/ruby-gem)
|
|
16
|
+
- [Web application](https://html2rss.github.io/web-application)
|
|
17
|
+
- [Feed directory](https://html2rss.github.io/feed-directory)
|
|
18
|
+
- [Contributing guide](https://html2rss.github.io/get-involved/contributing)
|
|
19
|
+
- [GitHub Discussions](https://github.com/orgs/html2rss/discussions)
|
|
20
|
+
- [Sponsor on GitHub](https://github.com/sponsors/gildesmarais)
|
|
16
21
|
|
|
17
|
-
|
|
22
|
+
### 💻 Try in Browser
|
|
18
23
|
|
|
19
|
-
|
|
24
|
+
You can develop html2rss directly in your browser using GitHub Codespaces:
|
|
20
25
|
|
|
21
|
-
|
|
26
|
+
[](https://github.com/codespaces/new?repo=html2rss/html2rss)
|
|
22
27
|
|
|
23
|
-
|
|
28
|
+
The Codespace comes pre-configured with Ruby 3.4 (compatible with Ruby 4.0), all dependencies, and VS Code extensions ready to go!
|
|
24
29
|
|
|
25
|
-
|
|
30
|
+
## 🤝 Contributing
|
|
26
31
|
|
|
27
|
-
|
|
32
|
+
Please see the [contributing guide](https://html2rss.github.io/get-involved/contributing) for details on how to contribute.
|
|
28
33
|
|
|
29
|
-
|
|
34
|
+
## 🏗️ Architecture
|
|
30
35
|
|
|
31
|
-
|
|
32
|
-
channel:
|
|
33
|
-
url: https://unmatchedstyle.com
|
|
34
|
-
selectors:
|
|
35
|
-
items:
|
|
36
|
-
selector: "article[id^='post-']"
|
|
37
|
-
title:
|
|
38
|
-
selector: h2
|
|
39
|
-
link:
|
|
40
|
-
selector: a
|
|
41
|
-
extractor: href
|
|
42
|
-
description:
|
|
43
|
-
selector: ".post-content"
|
|
44
|
-
post_process:
|
|
45
|
-
- name: sanitize_html
|
|
46
|
-
```
|
|
47
|
-
|
|
48
|
-
Build the feed from this config with: `html2rss feed ./my_config_file.yml`.
|
|
49
|
-
|
|
50
|
-
## Generating a feed with Ruby
|
|
51
|
-
|
|
52
|
-
You can also install it as a dependency in your Ruby project:
|
|
53
|
-
|
|
54
|
-
| 🤩 Like it? | Star it! ⭐️ |
|
|
55
|
-
| -------------------------------: | -------------------- |
|
|
56
|
-
| Add this line to your `Gemfile`: | `gem 'html2rss'` |
|
|
57
|
-
| Then execute: | `bundle` |
|
|
58
|
-
| In your code: | `require 'html2rss'` |
|
|
59
|
-
|
|
60
|
-
Here's a minimal working example using Ruby:
|
|
36
|
+
### Core Components
|
|
61
37
|
|
|
62
|
-
|
|
63
|
-
|
|
38
|
+
1. **Config** - Loads and validates configuration (YAML/hash)
|
|
39
|
+
2. **RequestService** - Fetches pages using Faraday or Browserless
|
|
40
|
+
3. **Selectors** - Extracts content via CSS selectors with extractors/post-processors
|
|
41
|
+
4. **AutoSource** - Auto-detects content using Schema.org, JSON state blobs, semantic HTML, and structural patterns
|
|
42
|
+
5. **RssBuilder** - Assembles Article objects and renders RSS 2.0
|
|
64
43
|
|
|
65
|
-
|
|
66
|
-
Html2rss.feed(
|
|
67
|
-
channel: { url: 'https://stackoverflow.com/questions' },
|
|
68
|
-
selectors: {
|
|
69
|
-
items: { selector: '#hot-network-questions > ul > li' },
|
|
70
|
-
title: { selector: 'a' },
|
|
71
|
-
link: { selector: 'a', extractor: 'href' }
|
|
72
|
-
}
|
|
73
|
-
)
|
|
44
|
+
### Data Flow
|
|
74
45
|
|
|
75
|
-
|
|
46
|
+
```text
|
|
47
|
+
Config -> Request -> Extraction -> Processing -> Building -> Output
|
|
76
48
|
```
|
|
77
49
|
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
A _feed config_ consists of a `channel` and a `selectors` hash. The contents of both hashes are explained below.
|
|
81
|
-
|
|
82
|
-
Good to know:
|
|
83
|
-
|
|
84
|
-
- You'll find extensive example feed configs at [`spec/*.test.yml`](https://github.com/html2rss/html2rss/tree/master/spec).
|
|
85
|
-
- See [`html2rss-configs`](https://github.com/html2rss/html2rss-configs) for ready-made feed configs!
|
|
86
|
-
- If you've created feed configs, you're invited to send a PR to [`html2rss-configs`](https://github.com/html2rss/html2rss-configs) to make your config available to the public.
|
|
87
|
-
|
|
88
|
-
Alright, let's move on.
|
|
89
|
-
|
|
90
|
-
### The `channel`
|
|
91
|
-
|
|
92
|
-
| attribute | | type | default | remark |
|
|
93
|
-
| ------------- | ------------ | ------- | -------------- | ------------------------------------------ |
|
|
94
|
-
| `url` | **required** | String | | |
|
|
95
|
-
| `title` | optional | String | auto-generated | |
|
|
96
|
-
| `description` | optional | String | auto-generated | |
|
|
97
|
-
| `ttl` | optional | Integer | `360` | TTL in _minutes_ |
|
|
98
|
-
| `time_zone` | optional | String | `'UTC'` | TimeZone name |
|
|
99
|
-
| `language` | optional | String | `'en'` | Language code |
|
|
100
|
-
| `author` | optional | String | | Format: `email (Name)` |
|
|
101
|
-
| `headers` | optional | Hash | `{}` | Set HTTP request headers. See notes below. |
|
|
102
|
-
| `json` | optional | Boolean | `false` | Handle JSON response. See notes below. |
|
|
103
|
-
|
|
104
|
-
#### Dynamic parameters in `channel` attributes
|
|
105
|
-
|
|
106
|
-
Sometimes there are structurally similar pages with different URLs. In such cases, you can add _dynamic parameters_ to the channel's attributes.
|
|
107
|
-
|
|
108
|
-
Example of a dynamic `id` parameter in the channel URLs:
|
|
109
|
-
|
|
110
|
-
```yml
|
|
111
|
-
channel:
|
|
112
|
-
url: "http://domainname.tld/whatever/%<id>s.html"
|
|
113
|
-
```
|
|
114
|
-
|
|
115
|
-
Command line usage example:
|
|
116
|
-
|
|
117
|
-
```sh
|
|
118
|
-
html2rss feed the_feed_config.yml id=42
|
|
119
|
-
```
|
|
120
|
-
|
|
121
|
-
<details><summary>See a Ruby example</summary>
|
|
122
|
-
|
|
123
|
-
```ruby
|
|
124
|
-
config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
|
|
125
|
-
Html2rss.feed(config)
|
|
126
|
-
```
|
|
127
|
-
|
|
128
|
-
</details>
|
|
129
|
-
|
|
130
|
-
See the more complex formatting options of the [`sprintf` method](https://ruby-doc.org/core/Kernel.html#method-i-sprintf).
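As a rough illustration of the `%<id>s` placeholder syntax, the sketch below calls Ruby's `Kernel#format` directly. It is not html2rss's own implementation; the `expand_url` helper name is hypothetical and exists only for this example.

```ruby
# Hypothetical helper (not part of html2rss): fills named placeholders
# such as %<id>s in a template using Ruby's Kernel#format.
def expand_url(template, params)
  format(template, params)
end

url = expand_url('http://domainname.tld/whatever/%<id>s.html', id: 42)
puts url # prints http://domainname.tld/whatever/42.html
```

Because the placeholder ends in a format directive (`s` here), other `sprintf` directives such as `%<id>05d` for zero-padding would work the same way.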
|
|
131
|
-
|
|
132
|
-
### The `selectors`
|
|
133
|
-
|
|
134
|
-
First, you must give an **`items`** selector hash, which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the `items` selector, all other keys are scoped to each item of the collection.
|
|
135
|
-
|
|
136
|
-
To build a [valid RSS 2.0 item](http://www.rssboard.org/rss-profile#element-channel-item), you need at least a `title` **or** a `description`. You can have both.
|
|
137
|
-
|
|
138
|
-
Having an `items` and a `title` selector is enough to build a simple feed.
|
|
139
|
-
|
|
140
|
-
Your `selectors` hash can contain arbitrary named selectors, but only a few will make it into the RSS feed (due to the RSS 2.0 specification):
|
|
141
|
-
|
|
142
|
-
| RSS 2.0 tag | name in `html2rss` | remark |
|
|
143
|
-
| ------------- | ------------------ | ------------------------------------------- |
|
|
144
|
-
| `title` | `title` | |
|
|
145
|
-
| `description` | `description` | Supports HTML. |
|
|
146
|
-
| `link` | `link` | A URL. |
|
|
147
|
-
| `author` | `author` | |
|
|
148
|
-
| `category` | `categories` | See notes below. |
|
|
149
|
-
| `guid` | `guid` | Default title/description. See notes below. |
|
|
150
|
-
| `enclosure` | `enclosure` | See notes below. |
|
|
151
|
-
| `pubDate` | `updated` | An instance of `Time`. |
|
|
152
|
-
| `comments` | `comments` | A URL. |
|
|
153
|
-
| `source` | ~~source~~ | Not yet supported. |
|
|
154
|
-
|
|
155
|
-
### Build RSS 2.0 item attributes by specifying selectors
|
|
156
|
-
|
|
157
|
-
Every named selector (i.e. `title`, `description`, see table above) in your `selectors` hash can have these attributes:
|
|
158
|
-
|
|
159
|
-
| name | value |
|
|
160
|
-
| -------------- | -------------------------------------------------------- |
|
|
161
|
-
| `selector` | The CSS selector to select the tag with the information. |
|
|
162
|
-
| `extractor` | Name of the extractor. See notes below. |
|
|
163
|
-
| `post_process` | A hash or array of hashes. See notes below. |
|
|
164
|
-
|
|
165
|
-
#### Using extractors
|
|
166
|
-
|
|
167
|
-
Extractors help with extracting the information from the selected HTML tag.
|
|
168
|
-
|
|
169
|
-
- The default extractor is `text`, which returns the tag's inner text.
|
|
170
|
-
- The `html` extractor returns the tag's outer HTML.
|
|
171
|
-
- The `href` extractor returns a URL from the tag's `href` attribute and corrects relative ones to absolute ones.
|
|
172
|
-
- The `attribute` extractor returns the value of that tag's attribute.
|
|
173
|
-
- The `static` extractor returns the configured static value (it doesn't extract anything).
|
|
174
|
-
- [See file list of extractors](https://github.com/html2rss/html2rss/tree/master/lib/html2rss/item_extractors).
|
|
175
|
-
|
|
176
|
-
Extractors might need extra attributes on the selector hash. 👉 [Read their docs for usage examples](https://www.rubydoc.info/gems/html2rss/Html2rss/ItemExtractors).
|
|
177
|
-
|
|
178
|
-
<details><summary>See a Ruby example</summary>
|
|
179
|
-
|
|
180
|
-
```ruby
|
|
181
|
-
Html2rss.feed(
|
|
182
|
-
channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
|
|
183
|
-
)
|
|
184
|
-
```
|
|
185
|
-
|
|
186
|
-
</details>
|
|
187
|
-
|
|
188
|
-
<details><summary>See a YAML feed config example</summary>
|
|
189
|
-
|
|
190
|
-
```yml
|
|
191
|
-
channel:
|
|
192
|
-
# ... omitted
|
|
193
|
-
selectors:
|
|
194
|
-
# ... omitted
|
|
195
|
-
link:
|
|
196
|
-
selector: "a"
|
|
197
|
-
extractor: "href"
|
|
198
|
-
```
|
|
199
|
-
|
|
200
|
-
</details>
|
|
201
|
-
|
|
202
|
-
### Using post processors
|
|
203
|
-
|
|
204
|
-
Extracted information can be further manipulated with post processors.
|
|
205
|
-
|
|
206
|
-
| name | |
|
|
207
|
-
| ------------------ | ------------------------------------------------------------------------------------- |
|
|
208
|
-
| `gsub` | Allows global substitution operations on Strings (Regexp or simple pattern). |
|
|
209
|
-
| `html_to_markdown` | HTML to Markdown, using [reverse_markdown](https://github.com/xijo/reverse_markdown). |
|
|
210
|
-
| `markdown_to_html` | converts Markdown to HTML, using [kramdown](https://github.com/gettalong/kramdown). |
|
|
211
|
-
| `parse_time` | Parses a String containing a time in a time zone. |
|
|
212
|
-
| `parse_uri` | Parses a String as URL. |
|
|
213
|
-
| `sanitize_html` | Strips unsafe and unneeded HTML and adds security-related attributes. |
|
|
214
|
-
| `substring` | Cuts a part off of a String, starting at a position. |
|
|
215
|
-
| `template` | Based on a template, it creates a new String filled with other selectors values. |
|
|
216
|
-
|
|
217
|
-
⚠️ Always make use of the `sanitize_html` post processor for HTML content. _Never trust the internet!_ ⚠️
|
|
218
|
-
|
|
219
|
-
#### Chaining post processors
|
|
220
|
-
|
|
221
|
-
Pass an array to `post_process` to chain the post processors.
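Chaining means each post processor receives the previous one's output. A minimal sketch of that idea, assuming each processor can be modelled as a callable that takes and returns a String (the two lambdas below are hypothetical stand-ins, not html2rss classes):

```ruby
# Hypothetical stand-ins for two chained post processors.
processors = [
  ->(value) { "# #{value}" },                           # plays the role of `template`
  ->(value) { value.sub(/\A# (.*)\z/, '<h1>\1</h1>') }  # plays the role of `markdown_to_html`
]

# Feed the extracted value through every processor in order.
result = processors.reduce('Title') { |value, processor| processor.call(value) }
puts result # prints <h1>Title</h1>
```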
|
|
222
|
-
|
|
223
|
-
<details><summary>YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML</summary>
|
|
224
|
-
|
|
225
|
-
```yml
|
|
226
|
-
channel:
|
|
227
|
-
# ... omitted
|
|
228
|
-
selectors:
|
|
229
|
-
# ... omitted
|
|
230
|
-
price:
|
|
231
|
-
selector: '.price'
|
|
232
|
-
description:
|
|
233
|
-
selector: '.section'
|
|
234
|
-
post_process:
|
|
235
|
-
- name: template
|
|
236
|
-
string: |
|
|
237
|
-
# %{self}
|
|
238
|
-
|
|
239
|
-
Price: %{price}
|
|
240
|
-
- name: markdown_to_html
|
|
241
|
-
```
|
|
242
|
-
|
|
243
|
-
</details>
|
|
244
|
-
|
|
245
|
-
##### Post processor `gsub`
|
|
246
|
-
|
|
247
|
-
The post processor `gsub` makes use of Ruby's [`gsub`](https://apidock.com/ruby/String/gsub) method.
|
|
248
|
-
|
|
249
|
-
| key | type | required | note |
|
|
250
|
-
| ------------- | ------ | -------- | ------------------------ |
|
|
251
|
-
| `pattern` | String | yes | Can be Regexp or String. |
|
|
252
|
-
| `replacement` | String | yes | Can be a backreference. |
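To show what a Regexp pattern and a backreference look like in practice, here is a sketch that uses Ruby's `String#gsub` on its own. How html2rss turns the configured `pattern` String into a Regexp is not covered here; the sample titles are made up.

```ruby
title = '12. The Matrix'

# Regexp pattern with capture groups; '\1' and '\2' in the replacement
# are backreferences to those groups.
swapped = title.gsub(/^(\d+)\.\s+(.*)$/, '\2 (#\1)')

# A plain String pattern replaces every literal occurrence.
plain = 'foo and foo'.gsub('foo', 'bar')

puts swapped # prints The Matrix (#12)
puts plain   # prints bar and bar
```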
|
|
253
|
-
|
|
254
|
-
<details><summary>See a Ruby example</summary>
|
|
255
|
-
|
|
256
|
-
```ruby
|
|
257
|
-
Html2rss.feed(
|
|
258
|
-
channel: {},
|
|
259
|
-
selectors: {
|
|
260
|
-
title: { selector: 'a', post_process: [{ name: 'gsub', pattern: 'foo', replacement: 'bar' }] }
|
|
261
|
-
}
|
|
262
|
-
)
|
|
263
|
-
```
|
|
264
|
-
|
|
265
|
-
</details>
|
|
266
|
-
|
|
267
|
-
<details><summary>See a YAML feed config example</summary>
|
|
268
|
-
|
|
269
|
-
```yml
|
|
270
|
-
channel:
|
|
271
|
-
# ... omitted
|
|
272
|
-
selectors:
|
|
273
|
-
# ... omitted
|
|
274
|
-
title:
|
|
275
|
-
selector: "a"
|
|
276
|
-
post_process:
|
|
277
|
-
- name: "gsub"
|
|
278
|
-
pattern: "foo"
|
|
279
|
-
replacement: "bar"
|
|
280
|
-
```
|
|
281
|
-
|
|
282
|
-
</details>
|
|
283
|
-
|
|
284
|
-
#### Adding `<category>` tags to an item
|
|
285
|
-
|
|
286
|
-
The `categories` selector takes an array of selector names. Each value of those
|
|
287
|
-
selectors will become a `<category>` on the RSS item.
|
|
288
|
-
|
|
289
|
-
<details>
|
|
290
|
-
<summary>See a Ruby example</summary>
|
|
291
|
-
|
|
292
|
-
```ruby
|
|
293
|
-
Html2rss.feed(
|
|
294
|
-
channel: {},
|
|
295
|
-
selectors: {
|
|
296
|
-
genre: {
|
|
297
|
-
# ... omitted
|
|
298
|
-
selector: '.genre'
|
|
299
|
-
},
|
|
300
|
-
branch: { selector: '.branch' },
|
|
301
|
-
categories: %i[genre branch]
|
|
302
|
-
}
|
|
303
|
-
)
|
|
304
|
-
```
|
|
305
|
-
|
|
306
|
-
</details>
|
|
307
|
-
|
|
308
|
-
<details>
|
|
309
|
-
<summary>See a YAML feed config example</summary>
|
|
310
|
-
|
|
311
|
-
```yml
|
|
312
|
-
channel:
|
|
313
|
-
# ... omitted
|
|
314
|
-
selectors:
|
|
315
|
-
# ... omitted
|
|
316
|
-
genre:
|
|
317
|
-
selector: ".genre"
|
|
318
|
-
branch:
|
|
319
|
-
selector: ".branch"
|
|
320
|
-
categories:
|
|
321
|
-
- genre
|
|
322
|
-
- branch
|
|
323
|
-
```
|
|
324
|
-
|
|
325
|
-
</details>
|
|
326
|
-
|
|
327
|
-
#### Custom item GUID
|
|
328
|
-
|
|
329
|
-
By default, html2rss generates a GUID from the `title` or `description`.
|
|
330
|
-
|
|
331
|
-
If this does not work well, you can choose other attributes from which the GUID is built.
|
|
332
|
-
The principle is the same as for the categories: pass an array of selector names.
|
|
333
|
-
|
|
334
|
-
In all cases, the GUID is a SHA1-encoded string.
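A minimal sketch of deriving such a GUID with Ruby's standard library. The `guid_for` helper is hypothetical, and the way the selected values are joined before hashing is an assumption, not html2rss's confirmed behaviour.

```ruby
require 'digest/sha1'

# Hypothetical helper: joins the values of the chosen selectors and
# returns their SHA1 hex digest. The joining rule is an assumption.
def guid_for(item, keys)
  Digest::SHA1.hexdigest(keys.map { |key| item.fetch(key).to_s }.join)
end

item = { title: 'Hello', link: 'https://example.com/hello' }
guid = guid_for(item, %i[link])
puts guid.length # prints 40 (a SHA1 hex digest is 40 characters long)
```

Hashing the `link` rather than the `title` keeps the GUID stable when a headline is edited after publication.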
|
|
335
|
-
|
|
336
|
-
<details><summary>See a Ruby example</summary>
|
|
337
|
-
|
|
338
|
-
```ruby
|
|
339
|
-
Html2rss.feed(
|
|
340
|
-
channel: {},
|
|
341
|
-
selectors: {
|
|
342
|
-
title: {
|
|
343
|
-
# ... omitted
|
|
344
|
-
selector: 'h1'
|
|
345
|
-
},
|
|
346
|
-
link: { selector: 'a', extractor: 'href' },
|
|
347
|
-
guid: %i[link]
|
|
348
|
-
}
|
|
349
|
-
)
|
|
350
|
-
```
|
|
351
|
-
|
|
352
|
-
</details>
|
|
353
|
-
|
|
354
|
-
<details><summary>See a YAML feed config example</summary>
|
|
355
|
-
|
|
356
|
-
```yml
|
|
357
|
-
channel:
|
|
358
|
-
# ... omitted
|
|
359
|
-
selectors:
|
|
360
|
-
# ... omitted
|
|
361
|
-
title:
|
|
362
|
-
selector: "h1"
|
|
363
|
-
link:
|
|
364
|
-
selector: "a"
|
|
365
|
-
extractor: "href"
|
|
366
|
-
guid:
|
|
367
|
-
- link
|
|
368
|
-
```
|
|
369
|
-
|
|
370
|
-
</details>
|
|
371
|
-
|
|
372
|
-
#### Adding an `<enclosure>` tag to an item
|
|
373
|
-
|
|
374
|
-
An enclosure can be any file, e.g. an image, audio or video - think Podcast.
|
|
375
|
-
|
|
376
|
-
The `enclosure` selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.
|
|
377
|
-
|
|
378
|
-
Since `html2rss` does no further inspection of the enclosure, its support comes with trade-offs:
|
|
379
|
-
|
|
380
|
-
1. The content-type is guessed from the file extension of the URL, unless one is specified in `content_type`.
|
|
381
|
-
2. If the content-type guessing fails, it will default to `application/octet-stream`.
|
|
382
|
-
3. The content-length will always be undetermined and therefore stated as `0` bytes.
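The three trade-offs above can be sketched as follows. This is an illustration only: the `CONTENT_TYPES` table and the `build_enclosure` helper are hypothetical, and a real implementation would more likely rely on a MIME type library.

```ruby
require 'uri'

# Hypothetical lookup table used only for this sketch.
CONTENT_TYPES = {
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4',
  '.jpg' => 'image/jpeg',
  '.pdf' => 'application/pdf'
}.freeze

# Hypothetical helper: an explicit content_type wins; otherwise the type is
# guessed from the file extension, falling back to application/octet-stream.
# The length is always reported as 0 because the file is never inspected.
def build_enclosure(url, content_type: nil)
  extension = File.extname(URI(url).path).downcase
  type = content_type || CONTENT_TYPES.fetch(extension, 'application/octet-stream')
  { url: url, type: type, length: 0 }
end

puts build_enclosure('https://example.com/episode.mp3?download=1')[:type] # prints audio/mpeg
puts build_enclosure('https://example.com/file.xyz')[:type]               # prints application/octet-stream
```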
|
|
383
|
-
|
|
384
|
-
Read the [RSS 2.0 spec](http://www.rssboard.org/rss-profile#element-channel-item-enclosure) for further information on enclosing content.
|
|
385
|
-
|
|
386
|
-
<details>
|
|
387
|
-
<summary>See a Ruby example</summary>
|
|
388
|
-
|
|
389
|
-
```ruby
|
|
390
|
-
Html2rss.feed(
|
|
391
|
-
channel: {},
|
|
392
|
-
selectors: {
|
|
393
|
-
enclosure: {
|
|
394
|
-
selector: 'audio',
|
|
395
|
-
extractor: 'attribute',
|
|
396
|
-
attribute: 'src',
|
|
397
|
-
content_type: 'audio/mp3'
|
|
398
|
-
}
|
|
399
|
-
}
|
|
400
|
-
)
|
|
401
|
-
```
|
|
402
|
-
|
|
403
|
-
</details>
|
|
404
|
-
|
|
405
|
-
<details>
|
|
406
|
-
<summary>See a YAML feed config example</summary>
|
|
407
|
-
|
|
408
|
-
```yml
|
|
409
|
-
channel:
|
|
410
|
-
# ... omitted
|
|
411
|
-
selectors:
|
|
412
|
-
# ... omitted
|
|
413
|
-
enclosure:
|
|
414
|
-
selector: "audio"
|
|
415
|
-
extractor: "attribute"
|
|
416
|
-
attribute: "src"
|
|
417
|
-
content_type: "audio/mp3"
|
|
418
|
-
```
|
|
419
|
-
|
|
420
|
-
</details>
|
|
421
|
-
|
|
422
|
-
## Scraping and handling JSON responses
|
|
423
|
-
|
|
424
|
-
By default, `html2rss` assumes the URL responds with HTML. However, it can also handle JSON responses. The JSON response must be an Array or Hash.
|
|
425
|
-
|
|
426
|
-
The JSON is converted to XML which you can query using CSS selectors.
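To make the JSON-to-XML idea concrete, here is a small sketch using only Ruby's standard library. The `to_xml` converter is hypothetical: the element names html2rss actually emits (for example for array entries or the root) are not shown in this document and may differ.

```ruby
require 'json'
require 'cgi'

# Hypothetical converter: Hash keys become element names, Array entries
# become <item> elements, and scalar values become escaped text nodes.
def to_xml(value, name = 'root')
  inner =
    case value
    when Hash then value.map { |key, child| to_xml(child, key) }.join
    when Array then value.map { |child| to_xml(child, 'item') }.join
    else CGI.escapeHTML(value.to_s)
    end
  "<#{name}>#{inner}</#{name}>"
end

json = '{"posts":[{"foo":"First"},{"foo":"Second"}]}'
xml = to_xml(JSON.parse(json))
puts xml
# prints <root><posts><item><foo>First</foo></item><item><foo>Second</foo></item></posts></root>
```

With a document shaped like this, a selector such as `foo` would match each `<foo>` element, which is why plain element names work as CSS selectors for JSON keys.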
|
|
427
|
-
|
|
428
|
-
<details><summary>See a Ruby example</summary>
|
|
429
|
-
|
|
430
|
-
```ruby
|
|
431
|
-
Html2rss.feed(
|
|
432
|
-
channel: { url: 'http://domainname.tld/whatever.json', json: true },
|
|
433
|
-
selectors: { title: { selector: 'foo' } }
|
|
434
|
-
)
|
|
435
|
-
```
|
|
436
|
-
|
|
437
|
-
</details>
|
|
438
|
-
|
|
439
|
-
<details><summary>See a YAML feed config example</summary>
|
|
440
|
-
|
|
441
|
-
```yml
|
|
442
|
-
channel:
|
|
443
|
-
url: "http://domainname.tld/whatever.json"
|
|
444
|
-
json: true
|
|
445
|
-
selectors:
|
|
446
|
-
title:
|
|
447
|
-
selector: "foo"
|
|
448
|
-
```
|
|
449
|
-
|
|
450
|
-
</details>
|
|
451
|
-
|
|
452
|
-
## Customization of how requests to the channel URL are sent
|
|
453
|
-
|
|
454
|
-
By default, html2rss issues a naive HTTP request and extracts information from the response. That is performant and works for many websites.
|
|
455
|
-
|
|
456
|
-
However, modern websites often do not render much HTML on the server, but evaluate JavaScript on the client to create the HTML. In such cases, the default strategy will not find the "juicy content".
|
|
457
|
-
|
|
458
|
-
### Use Browserless.io
|
|
459
|
-
|
|
460
|
-
You can use _Browserless.io_ to run a Chrome browser and return the website's source code after the website generated it.
|
|
461
|
-
For this, you can either run your own Browserless.io instance (Docker image available -- [read their license](https://github.com/browserless/browserless/pkgs/container/chromium#licensing)!) or pay them for a hosted instance.
|
|
462
|
-
|
|
463
|
-
To run a local Browserless.io instance, you can use the following Docker command:
|
|
464
|
-
|
|
465
|
-
```sh
|
|
466
|
-
docker run \
|
|
467
|
-
--rm \
|
|
468
|
-
-p 3000:3000 \
|
|
469
|
-
-e "CONCURRENT=10" \
|
|
470
|
-
-e "TOKEN=6R0W53R135510" \
|
|
471
|
-
ghcr.io/browserless/chromium
|
|
472
|
-
```
|
|
473
|
-
|
|
474
|
-
To make html2rss use your instance,
|
|
475
|
-
|
|
476
|
-
1. specify the environment variables accordingly, and
|
|
477
|
-
2. use the `browserless` strategy for those websites.
|
|
478
|
-
|
|
479
|
-
When running locally with commands from above, you can skip setting the environment variables, as they are aligned with the default values.
|
|
480
|
-
|
|
481
|
-
```sh
|
|
482
|
-
BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \
|
|
483
|
-
html2rss auto --strategy=browserless https://example.com
|
|
484
|
-
```
|
|
485
|
-
|
|
486
|
-
When using traditional feed configs, inside your channel config set `strategy: browserless`.
|
|
487
|
-
|
|
488
|
-
<details><summary>See a YAML feed config example</summary>
|
|
489
|
-
|
|
490
|
-
```yml
|
|
491
|
-
channel:
|
|
492
|
-
url: https://www.imdb.com/user/ur67728460/ratings
|
|
493
|
-
time_zone: UTC
|
|
494
|
-
ttl: 1440
|
|
495
|
-
strategy: browserless
|
|
496
|
-
headers:
|
|
497
|
-
User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
|
498
|
-
selectors:
|
|
499
|
-
items:
|
|
500
|
-
selector: "li.ipc-metadata-list-summary-item"
|
|
501
|
-
title:
|
|
502
|
-
selector: ".ipc-title__text"
|
|
503
|
-
post_process:
|
|
504
|
-
- name: gsub
|
|
505
|
-
pattern: "/^(\\d+.)\\s/"
|
|
506
|
-
replacement: ""
|
|
507
|
-
- name: template
|
|
508
|
-
string: "%{self} rated with: %{user_rating}"
|
|
509
|
-
link:
|
|
510
|
-
selector: "a.ipc-title-link-wrapper"
|
|
511
|
-
extractor: "href"
|
|
512
|
-
user_rating:
|
|
513
|
-
selector: "[data-testid='ratingGroup--other-user-rating'] > .ipc-rating-star--rating"
|
|
514
|
-
```
|
|
515
|
-
|
|
516
|
-
</details>
|
|
517
|
-
|
|
518
|
-
### Set any HTTP header in the request
|
|
519
|
-
|
|
520
|
-
To set HTTP request headers, you can add them to the channel's `headers` hash. This is useful for APIs that require an Authorization header.
|
|
521
|
-
|
|
522
|
-
```yml
|
|
523
|
-
channel:
|
|
524
|
-
url: "https://example.com/api/resource"
|
|
525
|
-
headers:
|
|
526
|
-
Authorization: "Bearer YOUR_TOKEN"
|
|
527
|
-
selectors:
|
|
528
|
-
# ... omitted
|
|
529
|
-
```
|
|
530
|
-
|
|
531
|
-
Or for setting a User-Agent:
|
|
532
|
-
|
|
533
|
-
```yml
|
|
534
|
-
channel:
|
|
535
|
-
url: "https://example.com"
|
|
536
|
-
headers:
|
|
537
|
-
User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
|
538
|
-
selectors:
|
|
539
|
-
# ... omitted
|
|
540
|
-
```
|
|
541
|
-
|
|
542
|
-
## Usage with a YAML config file
|
|
543
|
-
|
|
544
|
-
This step is not required to work with this gem. If you're using
|
|
545
|
-
[`html2rss-web`](https://github.com/html2rss/html2rss-web)
|
|
546
|
-
and want to create your private feed configs, keep on reading!
|
|
547
|
-
|
|
548
|
-
First, create a YAML file, e.g. `feeds.yml`. This file will contain your global config and multiple feed configs under the key `feeds`.
|
|
549
|
-
|
|
550
|
-
Example:
|
|
551
|
-
|
|
552
|
-
```yml
|
|
553
|
-
headers:
|
|
554
|
-
"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
|
|
555
|
-
feeds:
|
|
556
|
-
myfeed:
|
|
557
|
-
channel:
|
|
558
|
-
selectors:
|
|
559
|
-
myotherfeed:
|
|
560
|
-
channel:
|
|
561
|
-
selectors:
|
|
562
|
-
```
|
|
563
|
-
|
|
564
|
-
Your feed configs go below `feeds`. Everything else is part of the global config.
|
|
565
|
-
|
|
566
|
-
Find a full example of a `feeds.yml` at [`spec/fixtures/feeds.test.yml`](https://github.com/html2rss/html2rss/blob/master/spec/fixtures/feeds.test.yml).
|
|
567
|
-
|
|
568
|
-
Now you can build your feeds like this:
|
|
569
|
-
|
|
570
|
-
<details>
|
|
571
|
-
<summary>Build feeds in Ruby</summary>
|
|
572
|
-
|
|
573
|
-
```ruby
|
|
574
|
-
require 'html2rss'
|
|
575
|
-
|
|
576
|
-
myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
|
|
577
|
-
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
|
|
578
|
-
```
|
|
579
|
-
|
|
580
|
-
</details>
|
|
581
|
-
|
|
582
|
-
<details>
|
|
583
|
-
<summary>Build feeds on the command line</summary>
|
|
584
|
-
|
|
585
|
-
```sh
|
|
586
|
-
html2rss feed feeds.yml myfeed
|
|
587
|
-
html2rss feed feeds.yml myotherfeed
|
|
588
|
-
```
|
|
589
|
-
|
|
590
|
-
</details>
591      -
592      - ## Display the RSS feed nicely in a web browser
593      -
594      - To display RSS feeds nicely in a web browser, you can:
595      -
596      - - add a plain old CSS stylesheet, or
597      - - use XSLT (e**X**tensible **S**tylesheet **L**anguage **T**ransformations).
598      -
599      - A web browser will apply these stylesheets and show the contents as described.
600      -
601      - In a CSS stylesheet, you'd use `element` selectors to apply styles.
602      -
603      - If you want to do more, then you need to create a XSLT. XSLT allows you
604      - to use a HTML template and to freely design the information of the RSS,
605      - including using JavaScript and external resources.
606      -
607      - You can add as many stylesheets and types as you like. Just add them to your global configuration.
608      -
609      - <details>
610      - <summary>Ruby: a stylesheet config example</summary>
611      -
612      - ```ruby
613      - config = Html2rss::Config.new(
614      -   { channel: {}, selectors: {} }, # omitted
615      -   {
616      -     stylesheets: [
617      -       {
618      -         href: '/relative/base/path/to/style.xls',
619      -         media: :all,
620      -         type: 'text/xsl'
621      -       },
622      -       {
623      -         href: 'http://example.com/rss.css',
624      -         media: :all,
625      -         type: 'text/css'
626      -       }
627      -     ]
628      -   }
629      - )
630      -
631      - Html2rss.feed(config)
632      - ```
633      -
634      - </details>
635      -
636      - <details>
637      - <summary>YAML: a stylesheet config example</summary>
638      -
639      - ```yml
640      - stylesheets:
641      -   - href: "/relative/base/path/to/style.xls"
642      -     media: "all"
643      -     type: "text/xsl"
644      -   - href: "http://example.com/rss.css"
645      -     media: "all"
646      -     type: "text/css"
647      - feeds:
648      -   # ... omitted
649      - ```
650      -
651      - </details>
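
For context on how a browser picks up such stylesheets: XML documents conventionally reference CSS or XSLT through `xml-stylesheet` processing instructions placed before the root element. The Ruby sketch below renders the `href`, `type`, and `media` values from the removed examples as such instructions. It assumes that this is roughly what the `stylesheets` config translates to. The helper `stylesheet_instruction` is hypothetical and not html2rss's API, and the exact output of html2rss is not shown in this diff.

```ruby
# Hypothetical helper (not html2rss's API): renders one stylesheet entry as an
# xml-stylesheet processing instruction. String#encode(xml: :attr) escapes the
# value and wraps it in double quotes.
def stylesheet_instruction(href:, type:, media: 'all')
  attributes = { href: href, type: type, media: media }
  rendered = attributes.map { |name, value| "#{name}=#{value.to_s.encode(xml: :attr)}" }
  "<?xml-stylesheet #{rendered.join(' ')}?>"
end

# The two entries from the removed README example.
stylesheets = [
  { href: '/relative/base/path/to/style.xls', media: :all, type: 'text/xsl' },
  { href: 'http://example.com/rss.css', media: :all, type: 'text/css' }
]

stylesheets.each { |sheet| puts stylesheet_instruction(**sheet) }
```

A browser that supports XSLT or CSS for XML reads these instructions and applies the referenced stylesheet when displaying the feed.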
652      -
653      - Recommended further readings:
654      -
655      - - [How to format RSS with CSS on lifewire.com](https://www.lifewire.com/how-to-format-rss-3469302)
656      - - [XSLT: Extensible Stylesheet Language Transformations on MDN](https://developer.mozilla.org/en-US/docs/Web/XSLT)
657      - - [The XSLT used by html2rss-web](https://github.com/html2rss/html2rss-web/blob/master/public/rss.xsl)
658      -
659      - ## Gotchas and tips & tricks
     50  + ### Config schema workflow
660  51
661      -
662      - - Do not rely on your web browser's developer console. `html2rss` does not execute JavaScript.
663      - - Fiddling with [`curl`](https://github.com/curl/curl) and [`pup`](https://github.com/ericchiang/pup) to find the selectors seems efficient (`curl URL | pup`).
664      - - [CSS selectors are versatile. Here's an overview.](https://www.w3.org/TR/selectors-4/#overview)
     52  + The config schema is generated from the runtime `dry-validation` contracts and exported for client-side tooling.
665  53
666      -
     54  + - Ruby API: `Html2rss::Config.json_schema`
     55  + - CLI: `html2rss schema`
     56  + - CLI options:
     57  +   - `html2rss schema --write tmp/html2rss-config.schema.json`
     58  +   - `html2rss schema --no-pretty`
     59  + - Runtime validation API: `Html2rss::Config.validate(config_hash)`
     60  + - Runtime validation CLI: `html2rss validate config.yml`
     61  + - Packaged JSON file: `schema/html2rss-config.schema.json`
667  62
668      -
     63  + If you are an editor integration, automation script, or AI tool, prefer these stable discovery points:
669  64
670      -
671      -
     65  + - call `html2rss schema` to read the current exported schema
     66  + - read `schema/html2rss-config.schema.json` when working from the repository or installed gem
     67  + - use `Html2rss::Config.schema_path` if you already have Ruby loaded
     68  + - use `Html2rss::Config.validate` or `html2rss validate config.yml` when you need authoritative runtime validation of selector references
672  69
673      -
     70  + Run `bundle exec rake config:schema` before committing to regenerate `schema/html2rss-config.schema.json` and keep the checked-in JSON Schema in sync with the validators. The exported schema covers client-side validation, while runtime validation remains authoritative for dynamic cross-field checks such as selector-key references.
674  71
675      -
676      - 2. Create your feature branch (`git checkout -b my-new-feature`)
677      - 3. Implement a commit your changes (`git commit -am 'feat: add XYZ'`)
678      - 4. Push to the branch (`git push origin my-new-feature`)
679      - 5. Create a new Pull Request using the Github web UI
     72  + ## 📄 License
680  73
681      -
     74  + This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
682  75
683      -
684      - 2. for a modern Ruby development experience: install [`ruby-lsp`](https://github.com/Shopify/ruby-lsp) and integrate it to your IDE.
     76  + ## 💖 Sponsoring
685  77
686      -
     78  + If you find `html2rss` useful, please consider [sponsoring the project](https://github.com/sponsors/gildesmarais).
|