selma 0.4.15-arm-linux
- checksums.yaml +7 -0
- data/LICENSE.txt +21 -0
- data/README.md +334 -0
- data/lib/selma/3.2/selma.so +0 -0
- data/lib/selma/3.3/selma.so +0 -0
- data/lib/selma/3.4/selma.so +0 -0
- data/lib/selma/4.0/selma.so +0 -0
- data/lib/selma/config.rb +12 -0
- data/lib/selma/extension.rb +14 -0
- data/lib/selma/html/element.rb +11 -0
- data/lib/selma/html.rb +8 -0
- data/lib/selma/rewriter.rb +6 -0
- data/lib/selma/sanitizer/config/basic.rb +58 -0
- data/lib/selma/sanitizer/config/default.rb +82 -0
- data/lib/selma/sanitizer/config/relaxed.rb +99 -0
- data/lib/selma/sanitizer/config/restricted.rb +13 -0
- data/lib/selma/sanitizer/config.rb +67 -0
- data/lib/selma/sanitizer.rb +8 -0
- data/lib/selma/selector.rb +6 -0
- data/lib/selma/version.rb +5 -0
- data/lib/selma.rb +13 -0
- metadata +99 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: be5d989930ee24e511c1da6855dbb31976d65f7a0e0bb0afd14e70ad6c11e0cd
|
|
4
|
+
data.tar.gz: ae3fbab1d118865f3351cd1aefccbf1e95441a23ed77b45b84ef73cca5127df2
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: c828bc9168daa3af844c7f477fdf1840817924859db7a5a6b4ba9c97e81d6f597e64de259d7fe3e2e4a91692a5ba8da0c81c75d241814f191c5ee5ed7888c20f
|
|
7
|
+
data.tar.gz: 2c672ccff8536720f47f588d48acb20beb2432c4dd4398b1e1afe8ee35e08e4f34d742941eaffad64ac209a49243cb45148159a809cb7a25e61d6c1750f637bd
|
data/LICENSE.txt
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
The MIT License (MIT)
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2022 Garen J. Torikian
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in
|
|
13
|
+
all copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
|
21
|
+
THE SOFTWARE.
|
data/README.md
ADDED
|
@@ -0,0 +1,334 @@
|
|
|
1
|
+
# Selma
|
|
2
|
+
|
|
3
|
+
Selma **sel**ects and **ma**tches HTML nodes using CSS rules. (It can also reject/delete nodes, but then the name isn't as cool.) It's mostly an idiomatic wrapper around Cloudflare's [lol-html](https://github.com/cloudflare/lol-html) project.
|
|
4
|
+
|
|
5
|
+

|
|
6
|
+
|
|
7
|
+
Selma's strength (aside from being backed by Rust) is that HTML content is parsed _once_ and can be manipulated multiple times.
|
|
8
|
+
|
|
9
|
+
## Installation
|
|
10
|
+
|
|
11
|
+
Add this line to your application's Gemfile:
|
|
12
|
+
|
|
13
|
+
```ruby
|
|
14
|
+
gem 'selma'
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
And then execute:
|
|
18
|
+
|
|
19
|
+
$ bundle install
|
|
20
|
+
|
|
21
|
+
Or install it yourself as:
|
|
22
|
+
|
|
23
|
+
$ gem install selma
|
|
24
|
+
|
|
25
|
+
## Usage
|
|
26
|
+
|
|
27
|
+
Selma can perform two different actions, either independently or together:
|
|
28
|
+
|
|
29
|
+
- Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
|
|
30
|
+
- Select HTML using CSS rules, and manipulate elements and text nodes along the way.
|
|
31
|
+
|
|
32
|
+
It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
|
|
33
|
+
|
|
34
|
+
```ruby
|
|
35
|
+
sanitizer_config = {
|
|
36
|
+
elements: ["b", "em", "i", "strong", "u"],
|
|
37
|
+
}
|
|
38
|
+
sanitizer = Selma::Sanitizer.new(sanitizer_config)
|
|
39
|
+
rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
|
|
40
|
+
# removes any element that is not ["b", "em", "i", "strong", "u"];
|
|
41
|
+
# then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
|
|
42
|
+
rewriter.rewrite(html)
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
Here's a look at each individual part.
|
|
46
|
+
|
|
47
|
+
### Sanitization config
|
|
48
|
+
|
|
49
|
+
Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
|
|
50
|
+
|
|
51
|
+
```ruby
|
|
52
|
+
Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
The configuration for the sanitization process is based on the following key-value hash allowlist:
|
|
56
|
+
|
|
57
|
+
```ruby
|
|
58
|
+
# Whether or not to allow HTML comments.
|
|
59
|
+
allow_comments: false,
|
|
60
|
+
|
|
61
|
+
# Whether or not to allow well-formed HTML doctype declarations such as
|
|
62
|
+
# "<!DOCTYPE html>" when sanitizing a document.
|
|
63
|
+
allow_doctype: false,
|
|
64
|
+
|
|
65
|
+
# HTML elements to allow. By default, no elements are allowed (which means
|
|
66
|
+
# that all HTML will be stripped).
|
|
67
|
+
elements: ["a", "b", "img", ],
|
|
68
|
+
|
|
69
|
+
# HTML attributes to allow in specific elements. The key is the name of the element,
|
|
70
|
+
# and the value is an array of allowed attributes. By default, no attributes
|
|
71
|
+
# are allowed.
|
|
72
|
+
attributes: {
|
|
73
|
+
"a" => ["href"],
|
|
74
|
+
"img" => ["src"],
|
|
75
|
+
},
|
|
76
|
+
|
|
77
|
+
# URL handling protocols to allow in specific attributes. By default, no
|
|
78
|
+
# protocols are allowed. Use :relative in place of a protocol if you want
|
|
79
|
+
# to allow relative URLs sans protocol. Set to `:all` to allow any protocol.
|
|
80
|
+
protocols: {
|
|
81
|
+
"a" => { "href" => ["http", "https", "mailto", :relative] },
|
|
82
|
+
"img" => { "src" => ["http", "https"] },
|
|
83
|
+
},
|
|
84
|
+
|
|
85
|
+
# An Array of element names whose contents will be removed. The contents
|
|
86
|
+
# of all other filtered elements will be left behind.
|
|
87
|
+
remove_contents: ["iframe", "math", "noembed", "noframes", "noscript"],
|
|
88
|
+
|
|
89
|
+
# Elements which, when removed, should have their contents surrounded by
|
|
90
|
+
# whitespace.
|
|
91
|
+
whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
|
|
92
|
+
```
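
To make the `protocols` allowlist more concrete, here is a minimal plain-Ruby sketch of how a URL could be checked against one allowlist entry. It is an illustration only, not Selma's implementation (the real check happens inside the Rust extension). `protocol_allowed?` is a hypothetical helper, and treating a bare `:all` as "allow everything" is an assumption based on the comment above.

```ruby
require "uri"

# Hypothetical helper, not part of Selma: checks a URL against a
# protocols allowlist entry such as ["http", "https", :relative].
def protocol_allowed?(url, allowed)
  return true if allowed == :all # assumption: :all replaces the whole list

  scheme = begin
    URI.parse(url).scheme
  rescue URI::InvalidURIError
    return false # treat unparseable URLs as disallowed
  end

  # A URL without a scheme is a relative URL.
  return allowed.include?(:relative) if scheme.nil?

  allowed.include?(scheme.downcase)
end

allowed = ["http", "https", "mailto", :relative]
protocol_allowed?("https://example.com", allowed) # scheme is on the list
protocol_allowed?("/docs/page", allowed)          # relative URL, :relative is on the list
protocol_allowed?("javascript:alert(1)", allowed) # scheme is not on the list
```

Leaving `:relative` out of an entry would, under this sketch, reject links such as `/docs/page` even when `http` and `https` are allowed.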
|
|
93
|
+
|
|
94
|
+
### Defining handlers
|
|
95
|
+
|
|
96
|
+
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
|
|
97
|
+
|
|
98
|
+
- `selector`, a method which MUST return an instance of `Selma::Selector`, defining the CSS rules to match
|
|
99
|
+
- `handle_element`, a method that's called on each matched element
|
|
100
|
+
- `handle_text_chunk`, a method that's called on each matched text node
|
|
101
|
+
|
|
102
|
+
Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
|
|
103
|
+
|
|
104
|
+
```ruby
|
|
105
|
+
class MatchAttribute
|
|
106
|
+
SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
|
|
107
|
+
|
|
108
|
+
def selector
|
|
109
|
+
SELECTOR
|
|
110
|
+
end
|
|
111
|
+
|
|
112
|
+
def handle_element(element)
|
|
113
|
+
if element.tag_name == "a"
|
|
114
|
+
element["href"] = rename_http(element["href"])
|
|
115
|
+
elsif element.tag_name == "img"
|
|
116
|
+
element["src"] = rename_http(element["src"])
|
|
117
|
+
end
|
|
118
|
+
end
|
|
119
|
+
|
|
120
|
+
private def rename_http(link)
|
|
121
|
+
link.sub("http", "https")
|
|
122
|
+
end
|
|
123
|
+
end
|
|
124
|
+
|
|
125
|
+
rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
|
|
126
|
+
```
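
`String#sub` with a plain-string pattern replaces the first occurrence of `http` wherever it appears in the string. The selector above only matches values that start with `http:`, so that first occurrence is the scheme, but a slightly stricter variant can anchor the match explicitly. A minimal sketch, where `upgrade_scheme` is a hypothetical name and not part of Selma:

```ruby
# Hypothetical stricter variant of `rename_http`: rewrites only a leading
# "http:" scheme and leaves every other string untouched.
def upgrade_scheme(link)
  link.sub(/\Ahttp:/i, "https:")
end

upgrade_scheme("http://example.com/http-guide") # => "https://example.com/http-guide"
upgrade_scheme("https://example.com")           # => "https://example.com"
```

Anchoring with `\A` keeps the helper safe to reuse in a handler whose selector is broader than the one shown here.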
|
|
127
|
+
|
|
128
|
+
The `Selma::Selector` object has three possible kwargs:
|
|
129
|
+
|
|
130
|
+
- `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
|
|
131
|
+
- `match_text_within`: any text_chunk which matches this CSS rule will be passed on to `handle_text_chunk`
|
|
132
|
+
- `ignore_text_within`: this is an array of element names whose text contents will be ignored
|
|
133
|
+
|
|
134
|
+
Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
|
|
135
|
+
|
|
136
|
+
```ruby
|
|
137
|
+
class MatchText
|
|
138
|
+
SELECTOR = Selma::Selector.new(match_text_within: "*", ignore_text_within: ["pre", "code"])
|
|
139
|
+
|
|
140
|
+
def selector
|
|
141
|
+
SELECTOR
|
|
142
|
+
end
|
|
143
|
+
|
|
144
|
+
def handle_text_chunk(text)
|
|
145
|
+
text.replace(text.to_s.sub(/@.+/) { |mention| "<a href=\"www.yetto.app/#{mention}\">" }, as: :html)
|
|
146
|
+
end
|
|
147
|
+
end
|
|
148
|
+
|
|
149
|
+
rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
|
|
150
|
+
```
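
When the replacement depends on what was matched, the block form of `String#sub` or `String#gsub` is the safer idiom, because a `Regexp.last_match` interpolated into an argument is evaluated before the substitution runs. Below is a plain-Ruby sketch of the mention-linking step on its own. `link_mentions`, the `\w+` mention pattern, and the `https://` prefix are illustrative assumptions; only the host is taken from the example above.

```ruby
# Hypothetical helper: turns "@name" mentions into links. The block runs
# once per match, so the replacement is built from the text actually matched.
def link_mentions(text)
  text.gsub(/@(\w+)/) do
    name = Regexp.last_match(1)
    %(<a href="https://www.yetto.app/#{name}">@#{name}</a>)
  end
end

link_mentions("ping @gjtorikian please")
# => "ping <a href=\"https://www.yetto.app/gjtorikian\">@gjtorikian</a> please"
```

Inside a handler, a string built this way would presumably be handed to the text chunk's `replace` with `as: :html`, so that the markup is inserted as HTML rather than escaped as text.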
|
|
151
|
+
|
|
152
|
+
#### `element` methods
|
|
153
|
+
|
|
154
|
+
The `element` argument in `handle_element` has the following methods:
|
|
155
|
+
|
|
156
|
+
- `tag_name`: Gets the element's name
|
|
157
|
+
- `tag_name=`: Sets the element's name
|
|
158
|
+
- `self_closing?`: A bool which identifies whether or not the element is self-closing
|
|
159
|
+
- `[]`: Get an attribute
|
|
160
|
+
- `[]=`: Set an attribute
|
|
161
|
+
- `remove_attribute`: Remove an attribute
|
|
162
|
+
- `has_attribute?`: A bool which identifies whether or not the element has an attribute
|
|
163
|
+
- `attributes`: List all the attributes
|
|
164
|
+
- `ancestors`: List all of an element's ancestors as an array of strings
|
|
165
|
+
- `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
166
|
+
- `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
167
|
+
- `prepend(content, as: content_type)`: prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
168
|
+
- `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
169
|
+
- `set_inner_content(content, as: content_type)`: Replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
170
|
+
- `remove`: Removes the element and its inner content.
|
|
171
|
+
- `remove_and_keep_content`: Removes the element, but keeps its content. I.e. remove start and end tags of the element.
|
|
172
|
+
- `removed?`: A bool which identifies if the element has been removed or replaced with some content.
|
|
173
|
+
|
|
174
|
+
#### `text_chunk` methods
|
|
175
|
+
|
|
176
|
+
- `to_s` / `.content`: Gets the text node's content
|
|
177
|
+
- `text_type`: identifies the type of text in the text node
|
|
178
|
+
- `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
179
|
+
- `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
180
|
+
- `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
|
181
|
+
|
|
182
|
+
## Security
|
|
183
|
+
|
|
184
|
+
Theoretically, a malicious user can provide a very large document for processing, which can exhaust the memory of the host machine. To set a limit on how much string content is processed at once, you can provide `memory` options:
|
|
185
|
+
|
|
186
|
+
```ruby
|
|
187
|
+
Selma::Rewriter.new(options: { memory: { max_allowed_memory_usage: 1_000_000 } }) # ~1MB
|
|
188
|
+
```
|
|
189
|
+
|
|
190
|
+
The structure of the `memory` options looks like this:
|
|
191
|
+
|
|
192
|
+
```ruby
|
|
193
|
+
{
|
|
194
|
+
memory: {
|
|
195
|
+
max_allowed_memory_usage: 1000,
|
|
196
|
+
preallocated_parsing_buffer_size: 100,
|
|
197
|
+
}
|
|
198
|
+
}
|
|
199
|
+
```
|
|
200
|
+
|
|
201
|
+
Note that `preallocated_parsing_buffer_size` must always be less than `max_allowed_memory_usage`. See [the `lol_html` project documentation](https://docs.rs/lol_html/1.2.1/lol_html/struct.MemorySettings.html) to learn more about the default values.
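
The constraint between the two settings can be checked before the options ever reach the rewriter. The following is a minimal sketch of such a pre-flight check. `validate_memory_options!` is a hypothetical helper and not part of Selma, and how Selma itself reports a violation is not covered here.

```ruby
# Hypothetical pre-flight check, not part of Selma: enforces the documented
# rule that the preallocated buffer must be smaller than the memory cap.
def validate_memory_options!(memory)
  max = memory[:max_allowed_memory_usage]
  buffer = memory[:preallocated_parsing_buffer_size]
  return memory if max.nil? || buffer.nil? # nothing to compare

  unless buffer < max
    raise ArgumentError,
      "preallocated_parsing_buffer_size (#{buffer}) must be less than " \
      "max_allowed_memory_usage (#{max})"
  end

  memory
end

# Build the options hash only once the memory settings have been validated.
options = {
  memory: validate_memory_options!({
    max_allowed_memory_usage: 1_000_000,
    preallocated_parsing_buffer_size: 1_024,
  }),
}
```

The resulting `options` hash has the same shape as the structure shown above and could be passed as the `options:` kwarg.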
|
|
202
|
+
|
|
203
|
+
## Benchmarks
|
|
204
|
+
|
|
205
|
+
When you run `bundle exec rake benchmark`, two different benchmarks are calculated. Here are those results on my machine.
|
|
206
|
+
|
|
207
|
+
### Benchmarks for just the sanitization process
|
|
208
|
+
|
|
209
|
+
Comparing Selma against popular Ruby sanitization gems:
|
|
210
|
+
|
|
211
|
+
<!-- prettier-ignore-start -->
|
|
212
|
+
<details>
|
|
213
|
+
<pre>
|
|
214
|
+
input size = 25309 bytes, 0.03 MB
|
|
215
|
+
|
|
216
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
217
|
+
Warming up --------------------------------------
|
|
218
|
+
sanitize-sm 15.000 i/100ms
|
|
219
|
+
selma-sm 127.000 i/100ms
|
|
220
|
+
Calculating -------------------------------------
|
|
221
|
+
sanitize-sm 157.643 (± 1.9%) i/s - 4.740k in 30.077172s
|
|
222
|
+
selma-sm 1.278k (± 1.5%) i/s - 38.354k in 30.019722s
|
|
223
|
+
|
|
224
|
+
Comparison:
|
|
225
|
+
selma-sm: 1277.9 i/s
|
|
226
|
+
sanitize-sm: 157.6 i/s - 8.11x slower
|
|
227
|
+
|
|
228
|
+
input size = 86686 bytes, 0.09 MB
|
|
229
|
+
|
|
230
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
231
|
+
Warming up --------------------------------------
|
|
232
|
+
sanitize-md 4.000 i/100ms
|
|
233
|
+
selma-md 33.000 i/100ms
|
|
234
|
+
Calculating -------------------------------------
|
|
235
|
+
sanitize-md 40.034 (± 5.0%) i/s - 1.200k in 30.043322s
|
|
236
|
+
selma-md 332.959 (± 2.1%) i/s - 9.999k in 30.045733s
|
|
237
|
+
|
|
238
|
+
Comparison:
|
|
239
|
+
selma-md: 333.0 i/s
|
|
240
|
+
sanitize-md: 40.0 i/s - 8.32x slower
|
|
241
|
+
|
|
242
|
+
input size = 7172510 bytes, 7.17 MB
|
|
243
|
+
|
|
244
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
245
|
+
Warming up --------------------------------------
|
|
246
|
+
sanitize-lg 1.000 i/100ms
|
|
247
|
+
selma-lg 1.000 i/100ms
|
|
248
|
+
Calculating -------------------------------------
|
|
249
|
+
sanitize-lg 0.141 (± 0.0%) i/s - 5.000 in 35.426127s
|
|
250
|
+
selma-lg 3.963 (± 0.0%) i/s - 119.000 in 30.037386s
|
|
251
|
+
|
|
252
|
+
Comparison:
|
|
253
|
+
selma-lg: 4.0 i/s
|
|
254
|
+
sanitize-lg: 0.1 i/s - 28.03x slower
|
|
255
|
+
|
|
256
|
+
</pre>
|
|
257
|
+
</details>
|
|
258
|
+
<!-- prettier-ignore-end -->
|
|
259
|
+
|
|
260
|
+
### Benchmarks for just the rewriting process
|
|
261
|
+
|
|
262
|
+
Comparing Selma against popular Ruby HTML parsing gems:
|
|
263
|
+
|
|
264
|
+
<!-- prettier-ignore-start -->
|
|
265
|
+
<details>
|
|
266
|
+
<pre>
|
|
267
|
+
input size = 25309 bytes, 0.03 MB
|
|
268
|
+
|
|
269
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
270
|
+
Warming up --------------------------------------
|
|
271
|
+
nokogiri-sm 79.000 i/100ms
|
|
272
|
+
nokolexbor-sm 295.000 i/100ms
|
|
273
|
+
selma-sm 237.000 i/100ms
|
|
274
|
+
Calculating -------------------------------------
|
|
275
|
+
nokogiri-sm 800.531 (± 2.2%) i/s - 24.016k in 30.016056s
|
|
276
|
+
nokolexbor-sm 3.033k (± 3.6%) i/s - 91.155k in 30.094884s
|
|
277
|
+
selma-sm 2.386k (± 1.6%) i/s - 71.574k in 30.001701s
|
|
278
|
+
|
|
279
|
+
Comparison:
|
|
280
|
+
nokolexbor-sm: 3033.1 i/s
|
|
281
|
+
selma-sm: 2386.3 i/s - 1.27x slower
|
|
282
|
+
nokogiri-sm: 800.5 i/s - 3.79x slower
|
|
283
|
+
|
|
284
|
+
input size = 86686 bytes, 0.09 MB
|
|
285
|
+
|
|
286
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
287
|
+
Warming up --------------------------------------
|
|
288
|
+
nokogiri-md 8.000 i/100ms
|
|
289
|
+
nokolexbor-md 43.000 i/100ms
|
|
290
|
+
selma-md 38.000 i/100ms
|
|
291
|
+
Calculating -------------------------------------
|
|
292
|
+
nokogiri-md 85.013 (± 8.2%) i/s - 2.024k in 52.257472s
|
|
293
|
+
nokolexbor-md 416.074 (±11.1%) i/s - 12.341k in 30.111613s
|
|
294
|
+
selma-md 361.471 (± 4.7%) i/s - 10.830k in 30.033997s
|
|
295
|
+
|
|
296
|
+
Comparison:
|
|
297
|
+
nokolexbor-md: 416.1 i/s
|
|
298
|
+
selma-md: 361.5 i/s - same-ish: difference falls within error
|
|
299
|
+
nokogiri-md: 85.0 i/s - 4.89x slower
|
|
300
|
+
|
|
301
|
+
input size = 7172510 bytes, 7.17 MB
|
|
302
|
+
|
|
303
|
+
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
|
|
304
|
+
Warming up --------------------------------------
|
|
305
|
+
nokogiri-lg 1.000 i/100ms
|
|
306
|
+
nokolexbor-lg 1.000 i/100ms
|
|
307
|
+
selma-lg 1.000 i/100ms
|
|
308
|
+
Calculating -------------------------------------
|
|
309
|
+
nokogiri-lg 0.805 (± 0.0%) i/s - 25.000 in 31.148730s
|
|
310
|
+
nokolexbor-lg 2.194 (± 0.0%) i/s - 66.000 in 30.278108s
|
|
311
|
+
selma-lg 5.541 (± 0.0%) i/s - 166.000 in 30.037197s
|
|
312
|
+
|
|
313
|
+
Comparison:
|
|
314
|
+
selma-lg: 5.5 i/s
|
|
315
|
+
nokolexbor-lg: 2.2 i/s - 2.53x slower
|
|
316
|
+
nokogiri-lg: 0.8 i/s - 6.88x slower
|
|
317
|
+
|
|
318
|
+
</pre>
|
|
319
|
+
</details>
|
|
320
|
+
<!-- prettier-ignore-end -->
|
|
321
|
+
|
|
322
|
+
## Contributing
|
|
323
|
+
|
|
324
|
+
Bug reports and pull requests are welcome on GitHub at https://github.com/gjtorikian/selma. This project is a safe, welcoming space for collaboration.
|
|
325
|
+
|
|
326
|
+
## Acknowledgements
|
|
327
|
+
|
|
328
|
+
- https://github.com/flavorjones/ruby-c-extensions-explained#strategy-3-precompiled and [Nokogiri](https://github.com/sparklemotion/nokogiri) for hints on how to ship precompiled cross-platform gems
|
|
329
|
+
- @vmg for his work at GitHub on goomba, from which some design patterns were learned
|
|
330
|
+
- [sanitize](https://github.com/rgrove/sanitize) for a comprehensive configuration API and test suite
|
|
331
|
+
|
|
332
|
+
## License
|
|
333
|
+
|
|
334
|
+
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
|
Binary file
|
|
Binary file
|
|
Binary file
|
|
Binary file
|
data/lib/selma/extension.rb
ADDED
|
@@ -0,0 +1,14 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
begin
|
|
4
|
+
# native precompiled gems package shared libraries in <gem_dir>/lib/selma/<ruby_version>
|
|
5
|
+
# load the precompiled extension file
|
|
6
|
+
ruby_version = /\d+\.\d+/.match(RUBY_VERSION)
|
|
7
|
+
require_relative "#{ruby_version}/selma"
|
|
8
|
+
rescue LoadError
|
|
9
|
+
# fall back to the extension compiled upon installation.
|
|
10
|
+
# use "require" instead of "require_relative" because non-native gems will place C extension files
|
|
11
|
+
# in Gem::BasicSpecification#extension_dir after compilation (during normal installation), which
|
|
12
|
+
# is in $LOAD_PATH but not necessarily relative to this file (see nokogiri#2300)
|
|
13
|
+
require "selma/selma"
|
|
14
|
+
end
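
The loader above picks a directory named after the interpreter's `major.minor` version, which is why the gem ships separate `3.2`, `3.3`, `3.4`, and `4.0` builds of `selma.so`. The version extraction can be tried in isolation with the sketch below; `precompiled_dir_for` is a hypothetical helper introduced only for illustration.

```ruby
# Mirrors the version lookup above with an explicit version string so it can
# be tried on any interpreter. Hypothetical helper, not part of the gem.
def precompiled_dir_for(version)
  version[/\d+\.\d+/] # "major.minor", e.g. "3.3" for "3.3.0"
end

precompiled_dir_for("3.3.0")      # => "3.3"
precompiled_dir_for("4.0.0")      # => "4.0"
precompiled_dir_for(RUBY_VERSION) # directory the running interpreter would use
```

If no directory exists for the running interpreter, `require_relative` raises `LoadError` and the `rescue` branch falls back to `require "selma/selma"`.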
|
data/lib/selma/sanitizer/config/basic.rb
ADDED
|
@@ -0,0 +1,58 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Selma
|
|
4
|
+
class Sanitizer
|
|
5
|
+
module Config
|
|
6
|
+
BASIC = freeze_config(
|
|
7
|
+
elements: [
|
|
8
|
+
"a",
|
|
9
|
+
"abbr",
|
|
10
|
+
"blockquote",
|
|
11
|
+
"b",
|
|
12
|
+
"br",
|
|
13
|
+
"cite",
|
|
14
|
+
"code",
|
|
15
|
+
"dd",
|
|
16
|
+
"dfn",
|
|
17
|
+
"dl",
|
|
18
|
+
"dt",
|
|
19
|
+
"em",
|
|
20
|
+
"i",
|
|
21
|
+
"kbd",
|
|
22
|
+
"li",
|
|
23
|
+
"mark",
|
|
24
|
+
"ol",
|
|
25
|
+
"p",
|
|
26
|
+
"pre",
|
|
27
|
+
"q",
|
|
28
|
+
"s",
|
|
29
|
+
"samp",
|
|
30
|
+
"small",
|
|
31
|
+
"strike",
|
|
32
|
+
"strong",
|
|
33
|
+
"sub",
|
|
34
|
+
"sup",
|
|
35
|
+
"time",
|
|
36
|
+
"u",
|
|
37
|
+
"ul",
|
|
38
|
+
"var",
|
|
39
|
+
],
|
|
40
|
+
|
|
41
|
+
attributes: {
|
|
42
|
+
"a" => ["href"],
|
|
43
|
+
"abbr" => ["title"],
|
|
44
|
+
"blockquote" => ["cite"],
|
|
45
|
+
"dfn" => ["title"],
|
|
46
|
+
"q" => ["cite"],
|
|
47
|
+
"time" => ["datetime", "pubdate"],
|
|
48
|
+
},
|
|
49
|
+
|
|
50
|
+
protocols: {
|
|
51
|
+
"a" => { "href" => ["ftp", "http", "https", "mailto", :relative] },
|
|
52
|
+
"blockquote" => { "cite" => ["http", "https", :relative] },
|
|
53
|
+
"q" => { "cite" => ["http", "https", :relative] },
|
|
54
|
+
},
|
|
55
|
+
)
|
|
56
|
+
end
|
|
57
|
+
end
|
|
58
|
+
end
|
|
data/lib/selma/sanitizer/config/default.rb
ADDED
|
@@ -0,0 +1,82 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Selma
|
|
4
|
+
class Sanitizer
|
|
5
|
+
module Config
|
|
6
|
+
# although there are many more protocol types, e.g., ftp, xmpp, etc.,
|
|
7
|
+
# these are the only ones that are allowed by default
|
|
8
|
+
VALID_PROTOCOLS = ["http", "https", "mailto", :relative]
|
|
9
|
+
|
|
10
|
+
DEFAULT = freeze_config(
|
|
11
|
+
# Whether or not to allow HTML comments. Allowing comments is strongly
|
|
12
|
+
# discouraged, since IE allows script execution within conditional
|
|
13
|
+
# comments.
|
|
14
|
+
allow_comments: false,
|
|
15
|
+
|
|
16
|
+
# Whether or not to allow well-formed HTML doctype declarations such as
|
|
17
|
+
# "<!DOCTYPE html>" when sanitizing a document.
|
|
18
|
+
allow_doctype: false,
|
|
19
|
+
|
|
20
|
+
# HTML attributes to allow in specific elements. By default, no attributes
|
|
21
|
+
# are allowed. Use the symbol :data to indicate that arbitrary HTML5
|
|
22
|
+
# data-* attributes should be allowed.
|
|
23
|
+
attributes: {},
|
|
24
|
+
|
|
25
|
+
# HTML elements to allow. By default, no elements are allowed (which means
|
|
26
|
+
# that all HTML will be stripped).
|
|
27
|
+
elements: [],
|
|
28
|
+
|
|
29
|
+
# URL handling protocols to allow in specific attributes. By default, no
|
|
30
|
+
# protocols are allowed. Use :relative in place of a protocol if you want
|
|
31
|
+
# to allow relative URLs sans protocol. Set to `:all` to allow any protocol.
|
|
32
|
+
protocols: {},
|
|
33
|
+
|
|
34
|
+
# An Array of element names whose contents will be removed. The contents
|
|
35
|
+
# of all other filtered elements will be left behind.
|
|
36
|
+
remove_contents: [
|
|
37
|
+
"iframe",
|
|
38
|
+
"math",
|
|
39
|
+
"noembed",
|
|
40
|
+
"noframes",
|
|
41
|
+
"noscript",
|
|
42
|
+
"plaintext",
|
|
43
|
+
"script",
|
|
44
|
+
"style",
|
|
45
|
+
"svg",
|
|
46
|
+
"xmp",
|
|
47
|
+
],
|
|
48
|
+
|
|
49
|
+
# Elements which, when removed, should have their contents surrounded by
|
|
50
|
+
# whitespace.
|
|
51
|
+
whitespace_elements: [
|
|
52
|
+
"address",
|
|
53
|
+
"article",
|
|
54
|
+
"aside",
|
|
55
|
+
"blockquote",
|
|
56
|
+
"br",
|
|
57
|
+
"dd",
|
|
58
|
+
"div",
|
|
59
|
+
"dl",
|
|
60
|
+
"dt",
|
|
61
|
+
"footer",
|
|
62
|
+
"h1",
|
|
63
|
+
"h2",
|
|
64
|
+
"h3",
|
|
65
|
+
"h4",
|
|
66
|
+
"h5",
|
|
67
|
+
"h6",
|
|
68
|
+
"header",
|
|
69
|
+
"hgroup",
|
|
70
|
+
"hr",
|
|
71
|
+
"li",
|
|
72
|
+
"nav",
|
|
73
|
+
"ol",
|
|
74
|
+
"p",
|
|
75
|
+
"pre",
|
|
76
|
+
"section",
|
|
77
|
+
"ul",
|
|
78
|
+
],
|
|
79
|
+
)
|
|
80
|
+
end
|
|
81
|
+
end
|
|
82
|
+
end
|
|
data/lib/selma/sanitizer/config/relaxed.rb
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module Selma
|
|
4
|
+
class Sanitizer
|
|
5
|
+
module Config
|
|
6
|
+
RELAXED = freeze_config(
|
|
7
|
+
elements: BASIC[:elements] + [
|
|
8
|
+
"address",
|
|
9
|
+
"article",
|
|
10
|
+
"aside",
|
|
11
|
+
"bdi",
|
|
12
|
+
"bdo",
|
|
13
|
+
"body",
|
|
14
|
+
"caption",
|
|
15
|
+
"col",
|
|
16
|
+
"colgroup",
|
|
17
|
+
"data",
|
|
18
|
+
"del",
|
|
19
|
+
"details",
|
|
20
|
+
"div",
|
|
21
|
+
"figcaption",
|
|
22
|
+
"figure",
|
|
23
|
+
"footer",
|
|
24
|
+
"h1",
|
|
25
|
+
"h2",
|
|
26
|
+
"h3",
|
|
27
|
+
"h4",
|
|
28
|
+
"h5",
|
|
29
|
+
"h6",
|
|
30
|
+
"head",
|
|
31
|
+
"header",
|
|
32
|
+
"hgroup",
|
|
33
|
+
"hr",
|
|
34
|
+
"html",
|
|
35
|
+
"img",
|
|
36
|
+
"ins",
|
|
37
|
+
"main",
|
|
38
|
+
"nav",
|
|
39
|
+
"rp",
|
|
40
|
+
"rt",
|
|
41
|
+
"ruby",
|
|
42
|
+
"section",
|
|
43
|
+
"span",
|
|
44
|
+
"style",
|
|
45
|
+
"summary",
|
|
46
|
+
"sup",
|
|
47
|
+
"table",
|
|
48
|
+
"tbody",
|
|
49
|
+
"td",
|
|
50
|
+
"tfoot",
|
|
51
|
+
"th",
|
|
52
|
+
"thead",
|
|
53
|
+
"title",
|
|
54
|
+
"tr",
|
|
55
|
+
"wbr",
|
|
56
|
+
],
|
|
57
|
+
|
|
58
|
+
allow_doctype: true,
|
|
59
|
+
|
|
60
|
+
attributes: merge(
|
|
61
|
+
BASIC[:attributes],
|
|
62
|
+
:all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
|
|
63
|
+
"a" => ["href", "hreflang", "name", "rel"],
|
|
64
|
+
"col" => ["span", "width"],
|
|
65
|
+
"colgroup" => ["span", "width"],
|
|
66
|
+
"data" => ["value"],
|
|
67
|
+
"del" => ["cite", "datetime"],
|
|
68
|
+
"img" => ["align", "alt", "border", "height", "src", "srcset", "width"],
|
|
69
|
+
"ins" => ["cite", "datetime"],
|
|
70
|
+
"li" => ["value"],
|
|
71
|
+
"ol" => ["reversed", "start", "type"],
|
|
72
|
+
"style" => ["media", "scoped", "type"],
|
|
73
|
+
"table" => [
|
|
74
|
+
"align",
|
|
75
|
+
"bgcolor",
|
|
76
|
+
"border",
|
|
77
|
+
"cellpadding",
|
|
78
|
+
"cellspacing",
|
|
79
|
+
"frame",
|
|
80
|
+
"rules",
|
|
81
|
+
"sortable",
|
|
82
|
+
"summary",
|
|
83
|
+
"width",
|
|
84
|
+
],
|
|
85
|
+
"td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
|
|
86
|
+
"th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
|
|
87
|
+
"ul" => ["type"],
|
|
88
|
+
),
|
|
89
|
+
|
|
90
|
+
protocols: merge(
|
|
91
|
+
BASIC[:protocols],
|
|
92
|
+
"del" => { "cite" => ["http", "https", :relative] },
|
|
93
|
+
"img" => { "src" => ["http", "https", :relative] },
|
|
94
|
+
"ins" => { "cite" => ["http", "https", :relative] },
|
|
95
|
+
),
|
|
96
|
+
)
|
|
97
|
+
end
|
|
98
|
+
end
|
|
99
|
+
end
|
|
data/lib/selma/sanitizer/config.rb
ADDED
|
@@ -0,0 +1,67 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "set"
|
|
4
|
+
|
|
5
|
+
module Selma
|
|
6
|
+
class Sanitizer
|
|
7
|
+
module Config
|
|
8
|
+
class << self
|
|
9
|
+
# Deeply freezes and returns the given configuration Hash.
|
|
10
|
+
def freeze_config(config)
|
|
11
|
+
case config
|
|
12
|
+
when Hash
|
|
13
|
+
config.each_value { |c| freeze_config(c) }
|
|
14
|
+
when Array, Set
|
|
15
|
+
config.each { |c| freeze_config(c) }
|
|
16
|
+
end
|
|
17
|
+
|
|
18
|
+
config.freeze
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
# Returns a new Hash containing the result of deeply merging *other_config*
|
|
22
|
+
# into *config*. Does not modify *config* or *other_config*.
|
|
23
|
+
#
|
|
24
|
+
# This is the safest way to use a built-in config as the basis for
|
|
25
|
+
# your own custom config.
|
|
26
|
+
def merge(config, other_config = {})
|
|
27
|
+
raise ArgumentError, "config must be a Hash" unless config.is_a?(Hash)
|
|
28
|
+
raise ArgumentError, "other_config must be a Hash" unless other_config.is_a?(Hash)
|
|
29
|
+
|
|
30
|
+
merged = {}
|
|
31
|
+
keys = Set.new(config.keys + other_config.keys).to_a
|
|
32
|
+
|
|
33
|
+
keys.each do |key|
|
|
34
|
+
oldval = config[key]
|
|
35
|
+
|
|
36
|
+
if other_config.key?(key)
|
|
37
|
+
newval = other_config[key]
|
|
38
|
+
|
|
39
|
+
merged[key] = if oldval.is_a?(Hash) && newval.is_a?(Hash)
|
|
40
|
+
oldval.empty? ? newval.dup : merge(oldval, newval)
|
|
41
|
+
elsif newval.is_a?(Array) && key != :transformers
|
|
42
|
+
Set.new(newval).to_a
|
|
43
|
+
else
|
|
44
|
+
can_dupe?(newval) ? newval.dup : newval
|
|
45
|
+
end
|
|
46
|
+
else
|
|
47
|
+
merged[key] = can_dupe?(oldval) ? oldval.dup : oldval
|
|
48
|
+
end
|
|
49
|
+
end
|
|
50
|
+
|
|
51
|
+
merged
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
# Returns `true` if `dup` may be safely called on _value_, `false`
|
|
55
|
+
# otherwise.
|
|
56
|
+
def can_dupe?(value)
|
|
57
|
+
!(value == true || value == false || value.nil? || value.is_a?(Method) || value.is_a?(Numeric) || value.is_a?(Symbol))
|
|
58
|
+
end
|
|
59
|
+
end
|
|
60
|
+
end
|
|
61
|
+
end
|
|
62
|
+
end
|
|
63
|
+
|
|
64
|
+
require "selma/sanitizer/config/basic"
|
|
65
|
+
require "selma/sanitizer/config/default"
|
|
66
|
+
require "selma/sanitizer/config/relaxed"
|
|
67
|
+
require "selma/sanitizer/config/restricted"
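
One consequence of the `merge` rules above is easy to miss: nested hashes are merged key by key, but an array in `other_config` replaces the array in `config` rather than being appended to it. The condensed standalone sketch below mirrors those rules so they can be tried without the gem. `deep_merge` is a hypothetical name, and the sketch omits the `:transformers` special case and the `can_dupe?` checks.

```ruby
# Condensed standalone sketch of the merge rules above. Hypothetical helper:
# it skips the :transformers special case and the dup-safety checks.
def deep_merge(config, other)
  (config.keys | other.keys).each_with_object({}) do |key, merged|
    oldval = config[key]
    newval = other.fetch(key, oldval)

    merged[key] =
      if oldval.is_a?(Hash) && newval.is_a?(Hash)
        deep_merge(oldval, newval) # nested hashes merge key by key
      elsif newval.is_a?(Array)
        newval.uniq                # arrays are replaced and deduplicated
      else
        newval
      end
  end
end

base = { elements: ["a", "b"], attributes: { "a" => ["href"] } }
custom = deep_merge(base, { elements: ["a", "em", "em"], attributes: { "img" => ["src"] } })
# custom[:elements]   => ["a", "em"]
# custom[:attributes] => { "a" => ["href"], "img" => ["src"] }
```

This is why a config that extends the attributes of an element already present in the base has to restate the existing entries, as `RELAXED` does when it lists `"href"` again for `"a"`.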
|
data/lib/selma.rb
ADDED
|
@@ -0,0 +1,13 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
if ENV.fetch("DEBUG", false)
|
|
4
|
+
require "amazing_print"
|
|
5
|
+
require "debug"
|
|
6
|
+
end
|
|
7
|
+
|
|
8
|
+
require_relative "selma/extension"
|
|
9
|
+
|
|
10
|
+
require_relative "selma/sanitizer"
|
|
11
|
+
require_relative "selma/html"
|
|
12
|
+
require_relative "selma/rewriter"
|
|
13
|
+
require_relative "selma/selector"
|
metadata
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
|
2
|
+
name: selma
|
|
3
|
+
version: !ruby/object:Gem::Version
|
|
4
|
+
version: 0.4.15
|
|
5
|
+
platform: arm-linux
|
|
6
|
+
authors:
|
|
7
|
+
- Garen J. Torikian
|
|
8
|
+
autorequire:
|
|
9
|
+
bindir: exe
|
|
10
|
+
cert_chain: []
|
|
11
|
+
date: 2026-01-06 00:00:00.000000000 Z
|
|
12
|
+
dependencies:
|
|
13
|
+
- !ruby/object:Gem::Dependency
|
|
14
|
+
name: rake
|
|
15
|
+
requirement: !ruby/object:Gem::Requirement
|
|
16
|
+
requirements:
|
|
17
|
+
- - "~>"
|
|
18
|
+
- !ruby/object:Gem::Version
|
|
19
|
+
version: '13.0'
|
|
20
|
+
type: :development
|
|
21
|
+
prerelease: false
|
|
22
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
23
|
+
requirements:
|
|
24
|
+
- - "~>"
|
|
25
|
+
- !ruby/object:Gem::Version
|
|
26
|
+
version: '13.0'
|
|
27
|
+
- !ruby/object:Gem::Dependency
|
|
28
|
+
name: rake-compiler
|
|
29
|
+
requirement: !ruby/object:Gem::Requirement
|
|
30
|
+
requirements:
|
|
31
|
+
- - "~>"
|
|
32
|
+
- !ruby/object:Gem::Version
|
|
33
|
+
version: '1.2'
|
|
34
|
+
type: :development
|
|
35
|
+
prerelease: false
|
|
36
|
+
version_requirements: !ruby/object:Gem::Requirement
|
|
37
|
+
requirements:
|
|
38
|
+
- - "~>"
|
|
39
|
+
- !ruby/object:Gem::Version
|
|
40
|
+
version: '1.2'
|
|
41
|
+
description:
|
|
42
|
+
email:
|
|
43
|
+
- gjtorikian@gmail.com
|
|
44
|
+
executables: []
|
|
45
|
+
extensions: []
|
|
46
|
+
extra_rdoc_files: []
|
|
47
|
+
files:
|
|
48
|
+
- LICENSE.txt
|
|
49
|
+
- README.md
|
|
50
|
+
- lib/selma.rb
|
|
51
|
+
- lib/selma/3.2/selma.so
|
|
52
|
+
- lib/selma/3.3/selma.so
|
|
53
|
+
- lib/selma/3.4/selma.so
|
|
54
|
+
- lib/selma/4.0/selma.so
|
|
55
|
+
- lib/selma/config.rb
|
|
56
|
+
- lib/selma/extension.rb
|
|
57
|
+
- lib/selma/html.rb
|
|
58
|
+
- lib/selma/html/element.rb
|
|
59
|
+
- lib/selma/rewriter.rb
|
|
60
|
+
- lib/selma/sanitizer.rb
|
|
61
|
+
- lib/selma/sanitizer/config.rb
|
|
62
|
+
- lib/selma/sanitizer/config/basic.rb
|
|
63
|
+
- lib/selma/sanitizer/config/default.rb
|
|
64
|
+
- lib/selma/sanitizer/config/relaxed.rb
|
|
65
|
+
- lib/selma/sanitizer/config/restricted.rb
|
|
66
|
+
- lib/selma/selector.rb
|
|
67
|
+
- lib/selma/version.rb
|
|
68
|
+
homepage:
|
|
69
|
+
licenses:
|
|
70
|
+
- MIT
|
|
71
|
+
metadata:
|
|
72
|
+
allowed_push_host: https://rubygems.org
|
|
73
|
+
funding_uri: https://github.com/sponsors/gjtorikian/
|
|
74
|
+
source_code_uri: https://github.com/gjtorikian/selma
|
|
75
|
+
rubygems_mfa_required: 'true'
|
|
76
|
+
post_install_message:
|
|
77
|
+
rdoc_options: []
|
|
78
|
+
require_paths:
|
|
79
|
+
- lib
|
|
80
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
|
81
|
+
requirements:
|
|
82
|
+
- - ">="
|
|
83
|
+
- !ruby/object:Gem::Version
|
|
84
|
+
version: '3.2'
|
|
85
|
+
- - "<"
|
|
86
|
+
- !ruby/object:Gem::Version
|
|
87
|
+
version: 4.1.dev
|
|
88
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
|
89
|
+
requirements:
|
|
90
|
+
- - ">="
|
|
91
|
+
- !ruby/object:Gem::Version
|
|
92
|
+
version: '3.4'
|
|
93
|
+
requirements: []
|
|
94
|
+
rubygems_version: 3.5.23
|
|
95
|
+
signing_key:
|
|
96
|
+
specification_version: 4
|
|
97
|
+
summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
|
|
98
|
+
parser.
|
|
99
|
+
test_files: []
|