selma 0.4.15-arm-linux

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: be5d989930ee24e511c1da6855dbb31976d65f7a0e0bb0afd14e70ad6c11e0cd
4
+ data.tar.gz: ae3fbab1d118865f3351cd1aefccbf1e95441a23ed77b45b84ef73cca5127df2
5
+ SHA512:
6
+ metadata.gz: c828bc9168daa3af844c7f477fdf1840817924859db7a5a6b4ba9c97e81d6f597e64de259d7fe3e2e4a91692a5ba8da0c81c75d241814f191c5ee5ed7888c20f
7
+ data.tar.gz: 2c672ccff8536720f47f588d48acb20beb2432c4dd4398b1e1afe8ee35e08e4f34d742941eaffad64ac209a49243cb45148159a809cb7a25e61d6c1750f637bd
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2022 Garen J. Torikian
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,334 @@
1
+ # Selma
2
+
3
+ Selma **sel**ects and **ma**tches HTML nodes using CSS rules. (It can also reject/delete nodes, but then the name isn't as cool.) It's mostly an idiomatic wrapper around Cloudflare's [lol-html](https://github.com/cloudflare/lol-html) project.
4
+
5
+ ![Principal Skinner asking Selma after their date: 'Isn't it nice we hate the same things?'](https://user-images.githubusercontent.com/64050/207155384-14e8bd40-780c-466f-bfff-31a8a8fc3d25.jpg)
6
+
7
+ Selma's strength (aside from being backed by Rust) is that HTML content is parsed _once_ and can be manipulated multiple times.
8
+
9
+ ## Installation
10
+
11
+ Add this line to your application's Gemfile:
12
+
13
+ ```ruby
14
+ gem 'selma'
15
+ ```
16
+
17
+ And then execute:
18
+
19
+ $ bundle install
20
+
21
+ Or install it yourself as:
22
+
23
+ $ gem install selma
24
+
25
+ ## Usage
26
+
27
+ Selma can perform two different actions, either independently or together:
28
+
29
+ - Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
30
+ - Select HTML using CSS rules, and manipulate elements and text nodes along the way.
31
+
32
+ It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
33
+
34
+ ```ruby
35
+ sanitizer_config = {
36
+ elements: ["b", "em", "i", "strong", "u"],
37
+ }
38
+ sanitizer = Selma::Sanitizer.new(sanitizer_config)
39
+ rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
40
+ # removes any element that is not ["b", "em", "i", "strong", "u"];
41
+ # then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
42
+ rewriter.rewrite(html)
43
+ ```
44
+
45
+ Here's a look at each individual part.
46
+
47
+ ### Sanitization config
48
+
49
+ Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
50
+
51
+ ```ruby
52
+ Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
53
+ ```
54
+
55
+ The configuration for the sanitization process is based on the following key-value hash allowlist:
56
+
57
+ ```ruby
58
+ # Whether or not to allow HTML comments.
59
+ allow_comments: false,
60
+
61
+ # Whether or not to allow well-formed HTML doctype declarations such as
62
+ # "<!DOCTYPE html>" when sanitizing a document.
63
+ allow_doctype: false,
64
+
65
+ # HTML elements to allow. By default, no elements are allowed (which means
66
+ # that all HTML will be stripped).
67
+ elements: ["a", "b", "img", ],
68
+
69
+ # HTML attributes to allow in specific elements. The key is the name of the element,
70
+ # and the value is an array of allowed attributes. By default, no attributes
71
+ # are allowed.
72
+ attributes: {
73
+ "a" => ["href"],
74
+ "img" => ["src"],
75
+ },
76
+
77
+ # URL handling protocols to allow in specific attributes. By default, no
78
+ # protocols are allowed. Use :relative in place of a protocol if you want
79
+ # to allow relative URLs sans protocol. Set to `:all` to allow any protocol.
80
+ protocols: {
81
+ "a" => { "href" => ["http", "https", "mailto", :relative] },
82
+ "img" => { "src" => ["http", "https"] },
83
+ },
84
+
85
+ # An Array of element names whose contents will be removed. The contents
86
+ # of all other filtered elements will be left behind.
87
+ remove_contents: ["iframe", "math", "noembed", "noframes", "noscript"],
88
+
89
+ # Elements which, when removed, should have their contents surrounded by
90
+ # whitespace.
91
+ whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
92
+ ```
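To make the `protocols` allowlist semantics above concrete, here is a minimal sketch of how a URL might be checked against one attribute's list, with `:relative` standing in for URLs that carry no scheme. The `protocol_allowed?` helper is hypothetical and is not Selma's implementation (the real check happens inside the native extension); it only illustrates the rule as the comments describe it.

```ruby
# Hypothetical helper, NOT part of Selma: illustrates the allowlist rule only.
# `allowed` is one attribute's list, e.g. ["http", "https", "mailto", :relative].
def protocol_allowed?(url, allowed)
  # Capture a leading URL scheme such as "http" or "mailto", if one is present.
  scheme = url[/\A\s*([a-zA-Z][a-zA-Z0-9+.-]*):/, 1]

  if scheme.nil?
    # No scheme means a relative URL, which is allowed only via :relative.
    allowed.include?(:relative)
  else
    allowed.include?(scheme.downcase)
  end
end

href_protocols = ["http", "https", "mailto", :relative]

protocol_allowed?("https://example.com", href_protocols) # allowed scheme
protocol_allowed?("javascript:alert(1)", href_protocols) # scheme not listed
protocol_allowed?("/about", href_protocols)              # relative URL
```

Under this reading, dropping `:relative` from the list would reject `/about` while still accepting absolute `http` and `https` URLs.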
93
+
94
+ ### Defining handlers
95
+
96
+ The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
97
+
98
+ - `selector`, a method which MUST return an instance of `Selma::Selector`, defining the CSS selectors to match
99
+ - `handle_element`, a method that's called on each matched element
100
+ - `handle_text_chunk`, a method that's called on each matched text node
101
+
102
+ Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
103
+
104
+ ```ruby
105
+ class MatchAttribute
106
+ SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
107
+
108
+ def selector
109
+ SELECTOR
110
+ end
111
+
112
+ def handle_element(element)
113
+ if element.tag_name == "a"
114
+ element["href"] = rename_http(element["href"])
115
+ elsif element.tag_name == "img"
116
+ element["src"] = rename_http(element["src"])
117
+ end
118
+ end
119
+
120
+ private def rename_http(link)
121
+ link.sub("http", "https")
122
+ end
123
+ end
124
+
125
+ rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
126
+ ```
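A note on the `rename_http` helper in the example above: `String#sub` with a plain string replaces the first occurrence of `"http"` wherever it appears. That is safe here only because the selector already restricts matches to values starting with `http:`. The sketch below compares it with an anchored variant that makes the assumption explicit; both method names are illustrative and not part of Selma.

```ruby
# Same behaviour as the README helper: replaces the first "http" found.
def rename_http_loose(link)
  link.sub("http", "https")
end

# Anchored variant: only rewrites a leading "http:" scheme.
def rename_http_anchored(link)
  link.sub(/\Ahttp:/, "https:")
end

rename_http_loose("http://example.com")     # rewrites the scheme
rename_http_anchored("http://example.com")  # rewrites the scheme

# The two differ on input that is already https:
rename_http_loose("https://example.com")    # corrupts the scheme
rename_http_anchored("https://example.com") # leaves the link unchanged
```

The anchored form is the safer choice if the helper is ever reused with a broader selector.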
127
+
128
+ The `Selma::Selector` object has three possible kwargs:
129
+
130
+ - `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
131
+ - `match_text_within`: any text chunk inside an element which matches this CSS rule will be passed on to `handle_text_chunk`
132
+ - `ignore_text_within`: this is an array of element names whose text contents will be ignored
133
+
134
+ Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
135
+
136
+ ```ruby
137
+ class MatchText
138
+ SELECTOR = Selma::Selector.new(match_text_within: "*", ignore_text_within: ["pre", "code"])
139
+
140
+ def selector
141
+ SELECTOR
142
+ end
143
+
144
+ def handle_text_chunk(text)
145
+ text.replace(text.to_s.sub(/@.+/) { |match| "<a href=\"www.yetto.app/#{match}\">" }, as: :html)
146
+ end
147
+ end
148
+
149
+ rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
150
+ ```
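One Ruby detail matters when building replacement markup inside a text handler like this one. A replacement string passed to `String#sub` is evaluated before the match happens, so interpolating `Regexp.last_match` into it does not refer to the current match. The block form is evaluated per match and avoids the problem. The sketch below uses plain strings only; `link_mention` and the `example.com` URL are placeholders, not part of Selma.

```ruby
# Illustrative helper: turns the first "@name" into a link using the block
# form of String#sub, where Regexp.last_match refers to the current match.
def link_mention(text)
  text.sub(/@(\w+)/) do
    name = Regexp.last_match(1)
    %(<a href="https://example.com/#{name}">@#{name}</a>)
  end
end

link_mention("thanks @selma!")
link_mention("no mention here") # returned unchanged
```

In a handler, the resulting string would then be handed to `text.replace(..., as: :html)` so that the inserted markup is treated as HTML rather than escaped text.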
151
+
152
+ #### `element` methods
153
+
154
+ The `element` argument in `handle_element` has the following methods:
155
+
156
+ - `tag_name`: Gets the element's name
157
+ - `tag_name=`: Sets the element's name
158
+ - `self_closing?`: A bool which identifies whether or not the element is self-closing
159
+ - `[]`: Get an attribute
160
+ - `[]=`: Set an attribute
161
+ - `remove_attribute`: Remove an attribute
162
+ - `has_attribute?`: A bool which identifies whether or not the element has an attribute
163
+ - `attributes`: List all the attributes
164
+ - `ancestors`: List all of an element's ancestors as an array of strings
165
+ - `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
166
+ - `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
167
+ - `prepend(content, as: content_type)`: prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
168
+ - `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
169
+ - `set_inner_content(content, as: content_type)`: Replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
170
+ - `remove`: Removes the element and its inner content.
171
+ - `remove_and_keep_content`: Removes the element, but keeps its content. I.e. remove start and end tags of the element.
172
+ - `removed?`: A bool which identifies if the element has been removed or replaced with some content.
173
+
174
+ #### `text_chunk` methods
175
+
176
+ - `to_s` / `.content`: Gets the text node's content
177
+ - `text_type`: identifies the type of text in the text node
178
+ - `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
179
+ - `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
180
+ - `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
181
+
182
+ ## Security
183
+
184
+ Theoretically, a malicious user can provide a very large document for processing, which can exhaust the memory of the host machine. To set a limit on how much string content is processed at once, you can provide `memory` options:
185
+
186
+ ```ruby
187
+ Selma::Rewriter.new(options: { memory: { max_allowed_memory_usage: 1_000_000 } }) # ~1MB
188
+ ```
189
+
190
+ The structure of the `memory` options looks like this:
191
+
192
+ ```ruby
193
+ {
194
+ memory: {
195
+ max_allowed_memory_usage: 1000,
196
+ preallocated_parsing_buffer_size: 100,
197
+ }
198
+ }
199
+ ```
200
+
201
+ Note that `preallocated_parsing_buffer_size` must always be less than `max_allowed_memory_usage`. See [the `lol_html` project documentation](https://docs.rs/lol_html/1.2.1/lol_html/struct.MemorySettings.html) to learn more about the default values.
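The constraint between the two memory settings can be checked before the options are handed to `Selma::Rewriter.new`. The following is a minimal sketch under that assumption; `validate_memory_options!` is a hypothetical helper and is not provided by Selma.

```ruby
# Hypothetical pre-flight check, NOT part of Selma: enforces that
# preallocated_parsing_buffer_size is strictly less than
# max_allowed_memory_usage whenever both values are given.
def validate_memory_options!(options)
  memory = options.fetch(:memory, {})
  max = memory[:max_allowed_memory_usage]
  prealloc = memory[:preallocated_parsing_buffer_size]

  # Nothing to compare if either value is left at its default (nil).
  return options if max.nil? || prealloc.nil?

  unless prealloc < max
    raise ArgumentError,
      "preallocated_parsing_buffer_size (#{prealloc}) must be less than " \
      "max_allowed_memory_usage (#{max})"
  end

  options
end

options = {
  memory: {
    max_allowed_memory_usage: 1_000_000,
    preallocated_parsing_buffer_size: 1_000,
  },
}
validate_memory_options!(options) # returns the same hash when valid
```

Failing early in Ruby gives a clearer message than whatever error the native layer might report for an inconsistent pair of values.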
202
+
203
+ ## Benchmarks
204
+
205
+ When you run `bundle exec rake benchmark`, two different benchmarks are run. Here are those results on my machine.
206
+
207
+ ### Benchmarks for just the sanitization process
208
+
209
+ Comparing Selma against popular Ruby sanitization gems:
210
+
211
+ <!-- prettier-ignore-start -->
212
+ <details>
213
+ <pre>
214
+ input size = 25309 bytes, 0.03 MB
215
+
216
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
217
+ Warming up --------------------------------------
218
+ sanitize-sm 15.000 i/100ms
219
+ selma-sm 127.000 i/100ms
220
+ Calculating -------------------------------------
221
+ sanitize-sm 157.643 (± 1.9%) i/s - 4.740k in 30.077172s
222
+ selma-sm 1.278k (± 1.5%) i/s - 38.354k in 30.019722s
223
+
224
+ Comparison:
225
+ selma-sm: 1277.9 i/s
226
+ sanitize-sm: 157.6 i/s - 8.11x slower
227
+
228
+ input size = 86686 bytes, 0.09 MB
229
+
230
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
231
+ Warming up --------------------------------------
232
+ sanitize-md 4.000 i/100ms
233
+ selma-md 33.000 i/100ms
234
+ Calculating -------------------------------------
235
+ sanitize-md 40.034 (± 5.0%) i/s - 1.200k in 30.043322s
236
+ selma-md 332.959 (± 2.1%) i/s - 9.999k in 30.045733s
237
+
238
+ Comparison:
239
+ selma-md: 333.0 i/s
240
+ sanitize-md: 40.0 i/s - 8.32x slower
241
+
242
+ input size = 7172510 bytes, 7.17 MB
243
+
244
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
245
+ Warming up --------------------------------------
246
+ sanitize-lg 1.000 i/100ms
247
+ selma-lg 1.000 i/100ms
248
+ Calculating -------------------------------------
249
+ sanitize-lg 0.141 (± 0.0%) i/s - 5.000 in 35.426127s
250
+ selma-lg 3.963 (± 0.0%) i/s - 119.000 in 30.037386s
251
+
252
+ Comparison:
253
+ selma-lg: 4.0 i/s
254
+ sanitize-lg: 0.1 i/s - 28.03x slower
255
+
256
+ </pre>
257
+ </details>
258
+ <!-- prettier-ignore-end -->
259
+
260
+ ### Benchmarks for just the rewriting process
261
+
262
+ Comparing Selma against popular Ruby HTML parsing gems:
263
+
264
+ <!-- prettier-ignore-start -->
265
+ <details>
266
+ <pre>
267
+ input size = 25309 bytes, 0.03 MB
268
+
269
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
270
+ Warming up --------------------------------------
271
+ nokogiri-sm 79.000 i/100ms
272
+ nokolexbor-sm 295.000 i/100ms
273
+ selma-sm 237.000 i/100ms
274
+ Calculating -------------------------------------
275
+ nokogiri-sm 800.531 (± 2.2%) i/s - 24.016k in 30.016056s
276
+ nokolexbor-sm 3.033k (± 3.6%) i/s - 91.155k in 30.094884s
277
+ selma-sm 2.386k (± 1.6%) i/s - 71.574k in 30.001701s
278
+
279
+ Comparison:
280
+ nokolexbor-sm: 3033.1 i/s
281
+ selma-sm: 2386.3 i/s - 1.27x slower
282
+ nokogiri-sm: 800.5 i/s - 3.79x slower
283
+
284
+ input size = 86686 bytes, 0.09 MB
285
+
286
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
287
+ Warming up --------------------------------------
288
+ nokogiri-md 8.000 i/100ms
289
+ nokolexbor-md 43.000 i/100ms
290
+ selma-md 38.000 i/100ms
291
+ Calculating -------------------------------------
292
+ nokogiri-md 85.013 (± 8.2%) i/s - 2.024k in 52.257472s
293
+ nokolexbor-md 416.074 (±11.1%) i/s - 12.341k in 30.111613s
294
+ selma-md 361.471 (± 4.7%) i/s - 10.830k in 30.033997s
295
+
296
+ Comparison:
297
+ nokolexbor-md: 416.1 i/s
298
+ selma-md: 361.5 i/s - same-ish: difference falls within error
299
+ nokogiri-md: 85.0 i/s - 4.89x slower
300
+
301
+ input size = 7172510 bytes, 7.17 MB
302
+
303
+ ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]
304
+ Warming up --------------------------------------
305
+ nokogiri-lg 1.000 i/100ms
306
+ nokolexbor-lg 1.000 i/100ms
307
+ selma-lg 1.000 i/100ms
308
+ Calculating -------------------------------------
309
+ nokogiri-lg 0.805 (± 0.0%) i/s - 25.000 in 31.148730s
310
+ nokolexbor-lg 2.194 (± 0.0%) i/s - 66.000 in 30.278108s
311
+ selma-lg 5.541 (± 0.0%) i/s - 166.000 in 30.037197s
312
+
313
+ Comparison:
314
+ selma-lg: 5.5 i/s
315
+ nokolexbor-lg: 2.2 i/s - 2.53x slower
316
+ nokogiri-lg: 0.8 i/s - 6.88x slower
317
+
318
+ </pre>
319
+ </details>
320
+ <!-- prettier-ignore-end -->
321
+
322
+ ## Contributing
323
+
324
+ Bug reports and pull requests are welcome on GitHub at https://github.com/gjtorikian/selma. This project is a safe, welcoming space for collaboration.
325
+
326
+ ## Acknowledgements
327
+
328
+ - https://github.com/flavorjones/ruby-c-extensions-explained#strategy-3-precompiled and [Nokogiri](https://github.com/sparklemotion/nokogiri) for hints on how to ship precompiled cross-platform gems
329
+ - @vmg for his work at GitHub on goomba, from which some design patterns were learned
330
+ - [sanitize](https://github.com/rgrove/sanitize) for a comprehensive configuration API and test suite
331
+
332
+ ## License
333
+
334
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
Binary file
Binary file
Binary file
Binary file
@@ -0,0 +1,12 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ module Config
5
+ OPTIONS = {
6
+ memory: {
7
+ max_allowed_memory_usage: nil,
8
+ preallocated_parsing_buffer_size: nil,
9
+ },
10
+ }
11
+ end
12
+ end
@@ -0,0 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ begin
4
+ # native precompiled gems package shared libraries in <gem_dir>/lib/selma/<ruby_version>
5
+ # load the precompiled extension file
6
+ ruby_version = /\d+\.\d+/.match(RUBY_VERSION)
7
+ require_relative "#{ruby_version}/selma"
8
+ rescue LoadError
9
+ # fall back to the extension compiled upon installation.
10
+ # use "require" instead of "require_relative" because non-native gems will place C extension files
11
+ # in Gem::BasicSpecification#extension_dir after compilation (during normal installation), which
12
+ # is in $LOAD_PATH but not necessarily relative to this file (see nokogiri#2300)
13
+ require "selma/selma"
14
+ end
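The loader above picks the precompiled library by reducing `RUBY_VERSION` to its `major.minor` prefix, which lines up with the `lib/selma/3.2` through `lib/selma/4.0` directories shipped in this gem. A small sketch of that reduction follows; `extension_dir_for` is an illustrative name and does not exist in Selma.

```ruby
# Illustrative helper: reduces a full Ruby version string to the
# "major.minor" directory name used for precompiled extensions.
def extension_dir_for(ruby_version)
  ruby_version[/\d+\.\d+/]
end

extension_dir_for("3.3.0")      # => "3.3"
extension_dir_for("4.0.1")      # => "4.0"
extension_dir_for(RUBY_VERSION) # directory for the running interpreter
```

If no directory exists for the running interpreter, `require_relative` raises `LoadError` and the loader falls back to `require "selma/selma"`.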
@@ -0,0 +1,11 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class HTML
5
+ class Element
6
+ def available?
7
+ !removed?
8
+ end
9
+ end
10
+ end
11
+ end
data/lib/selma/html.rb ADDED
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative "html/element"
4
+
5
+ module Selma
6
+ class HTML
7
+ end
8
+ end
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Rewriter
5
+ end
6
+ end
@@ -0,0 +1,58 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Sanitizer
5
+ module Config
6
+ BASIC = freeze_config(
7
+ elements: [
8
+ "a",
9
+ "abbr",
10
+ "blockquote",
11
+ "b",
12
+ "br",
13
+ "cite",
14
+ "code",
15
+ "dd",
16
+ "dfn",
17
+ "dl",
18
+ "dt",
19
+ "em",
20
+ "i",
21
+ "kbd",
22
+ "li",
23
+ "mark",
24
+ "ol",
25
+ "p",
26
+ "pre",
27
+ "q",
28
+ "s",
29
+ "samp",
30
+ "small",
31
+ "strike",
32
+ "strong",
33
+ "sub",
34
+ "sup",
35
+ "time",
36
+ "u",
37
+ "ul",
38
+ "var",
39
+ ],
40
+
41
+ attributes: {
42
+ "a" => ["href"],
43
+ "abbr" => ["title"],
44
+ "blockquote" => ["cite"],
45
+ "dfn" => ["title"],
46
+ "q" => ["cite"],
47
+ "time" => ["datetime", "pubdate"],
48
+ },
49
+
50
+ protocols: {
51
+ "a" => { "href" => ["ftp", "http", "https", "mailto", :relative] },
52
+ "blockquote" => { "cite" => ["http", "https", :relative] },
53
+ "q" => { "cite" => ["http", "https", :relative] },
54
+ },
55
+ )
56
+ end
57
+ end
58
+ end
@@ -0,0 +1,82 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Sanitizer
5
+ module Config
6
+ # although there are many more protocol types, eg., ftp, xmpp, etc.,
7
+ # these are the only ones that are allowed by default
8
+ VALID_PROTOCOLS = ["http", "https", "mailto", :relative]
9
+
10
+ DEFAULT = freeze_config(
11
+ # Whether or not to allow HTML comments. Allowing comments is strongly
12
+ # discouraged, since IE allows script execution within conditional
13
+ # comments.
14
+ allow_comments: false,
15
+
16
+ # Whether or not to allow well-formed HTML doctype declarations such as
17
+ # "<!DOCTYPE html>" when sanitizing a document.
18
+ allow_doctype: false,
19
+
20
+ # HTML attributes to allow in specific elements. By default, no attributes
21
+ # are allowed. Use the symbol :data to indicate that arbitrary HTML5
22
+ # data-* attributes should be allowed.
23
+ attributes: {},
24
+
25
+ # HTML elements to allow. By default, no elements are allowed (which means
26
+ # that all HTML will be stripped).
27
+ elements: [],
28
+
29
+ # URL handling protocols to allow in specific attributes. By default, no
30
+ # protocols are allowed. Use :relative in place of a protocol if you want
31
+ # to allow relative URLs sans protocol. Set to `:all` to allow any protocol.
32
+ protocols: {},
33
+
34
+ # An Array of element names whose contents will be removed. The contents
35
+ # of all other filtered elements will be left behind.
36
+ remove_contents: [
37
+ "iframe",
38
+ "math",
39
+ "noembed",
40
+ "noframes",
41
+ "noscript",
42
+ "plaintext",
43
+ "script",
44
+ "style",
45
+ "svg",
46
+ "xmp",
47
+ ],
48
+
49
+ # Elements which, when removed, should have their contents surrounded by
50
+ # whitespace.
51
+ whitespace_elements: [
52
+ "address",
53
+ "article",
54
+ "aside",
55
+ "blockquote",
56
+ "br",
57
+ "dd",
58
+ "div",
59
+ "dl",
60
+ "dt",
61
+ "footer",
62
+ "h1",
63
+ "h2",
64
+ "h3",
65
+ "h4",
66
+ "h5",
67
+ "h6",
68
+ "header",
69
+ "hgroup",
70
+ "hr",
71
+ "li",
72
+ "nav",
73
+ "ol",
74
+ "p",
75
+ "pre",
76
+ "section",
77
+ "ul",
78
+ ],
79
+ )
80
+ end
81
+ end
82
+ end
@@ -0,0 +1,99 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Sanitizer
5
+ module Config
6
+ RELAXED = freeze_config(
7
+ elements: BASIC[:elements] + [
8
+ "address",
9
+ "article",
10
+ "aside",
11
+ "bdi",
12
+ "bdo",
13
+ "body",
14
+ "caption",
15
+ "col",
16
+ "colgroup",
17
+ "data",
18
+ "del",
19
+ "details",
20
+ "div",
21
+ "figcaption",
22
+ "figure",
23
+ "footer",
24
+ "h1",
25
+ "h2",
26
+ "h3",
27
+ "h4",
28
+ "h5",
29
+ "h6",
30
+ "head",
31
+ "header",
32
+ "hgroup",
33
+ "hr",
34
+ "html",
35
+ "img",
36
+ "ins",
37
+ "main",
38
+ "nav",
39
+ "rp",
40
+ "rt",
41
+ "ruby",
42
+ "section",
43
+ "span",
44
+ "style",
45
+ "summary",
46
+ "sup",
47
+ "table",
48
+ "tbody",
49
+ "td",
50
+ "tfoot",
51
+ "th",
52
+ "thead",
53
+ "title",
54
+ "tr",
55
+ "wbr",
56
+ ],
57
+
58
+ allow_doctype: true,
59
+
60
+ attributes: merge(
61
+ BASIC[:attributes],
62
+ :all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
63
+ "a" => ["href", "hreflang", "name", "rel"],
64
+ "col" => ["span", "width"],
65
+ "colgroup" => ["span", "width"],
66
+ "data" => ["value"],
67
+ "del" => ["cite", "datetime"],
68
+ "img" => ["align", "alt", "border", "height", "src", "srcset", "width"],
69
+ "ins" => ["cite", "datetime"],
70
+ "li" => ["value"],
71
+ "ol" => ["reversed", "start", "type"],
72
+ "style" => ["media", "scoped", "type"],
73
+ "table" => [
74
+ "align",
75
+ "bgcolor",
76
+ "border",
77
+ "cellpadding",
78
+ "cellspacing",
79
+ "frame",
80
+ "rules",
81
+ "sortable",
82
+ "summary",
83
+ "width",
84
+ ],
85
+ "td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
86
+ "th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
87
+ "ul" => ["type"],
88
+ ),
89
+
90
+ protocols: merge(
91
+ BASIC[:protocols],
92
+ "del" => { "cite" => ["http", "https", :relative] },
93
+ "img" => { "src" => ["http", "https", :relative] },
94
+ "ins" => { "cite" => ["http", "https", :relative] },
95
+ ),
96
+ )
97
+ end
98
+ end
99
+ end
@@ -0,0 +1,13 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Sanitizer
5
+ module Config
6
+ RESTRICTED = freeze_config(
7
+ elements: ["b", "em", "i", "strong", "u"],
8
+
9
+ whitespace_elements: DEFAULT[:whitespace_elements],
10
+ )
11
+ end
12
+ end
13
+ end
@@ -0,0 +1,67 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "set"
4
+
5
+ module Selma
6
+ class Sanitizer
7
+ module Config
8
+ class << self
9
+ # Deeply freezes and returns the given configuration Hash.
10
+ def freeze_config(config)
11
+ case config
12
+ when Hash
13
+ config.each_value { |c| freeze_config(c) }
14
+ when Array, Set
15
+ config.each { |c| freeze_config(c) }
16
+ end
17
+
18
+ config.freeze
19
+ end
20
+
21
+ # Returns a new Hash containing the result of deeply merging *other_config*
22
+ # into *config*. Does not modify *config* or *other_config*.
23
+ #
24
+ # This is the safest way to use a built-in config as the basis for
25
+ # your own custom config.
26
+ def merge(config, other_config = {})
27
+ raise ArgumentError, "config must be a Hash" unless config.is_a?(Hash)
28
+ raise ArgumentError, "other_config must be a Hash" unless other_config.is_a?(Hash)
29
+
30
+ merged = {}
31
+ keys = Set.new(config.keys + other_config.keys).to_a
32
+
33
+ keys.each do |key|
34
+ oldval = config[key]
35
+
36
+ if other_config.key?(key)
37
+ newval = other_config[key]
38
+
39
+ merged[key] = if oldval.is_a?(Hash) && newval.is_a?(Hash)
40
+ oldval.empty? ? newval.dup : merge(oldval, newval)
41
+ elsif newval.is_a?(Array) && key != :transformers
42
+ Set.new(newval).to_a
43
+ else
44
+ can_dupe?(newval) ? newval.dup : newval
45
+ end
46
+ else
47
+ merged[key] = can_dupe?(oldval) ? oldval.dup : oldval
48
+ end
49
+ end
50
+
51
+ merged
52
+ end
53
+
54
+ # Returns `true` if `dup` may be safely called on _value_, `false`
55
+ # otherwise.
56
+ def can_dupe?(value)
57
+ !(value == true || value == false || value.nil? || value.is_a?(Method) || value.is_a?(Numeric) || value.is_a?(Symbol))
58
+ end
59
+ end
60
+ end
61
+ end
62
+ end
63
+
64
+ require "selma/sanitizer/config/basic"
65
+ require "selma/sanitizer/config/default"
66
+ require "selma/sanitizer/config/relaxed"
67
+ require "selma/sanitizer/config/restricted"
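The `merge` method above merges nested hashes key by key, but an array in `other_config` replaces the old array (after de-duplication) rather than being appended to it. That is why `RELAXED` builds its element list with `BASIC[:elements] + [...]`. The standalone sketch below mirrors that behaviour in simplified form so it can run without the gem; `deep_merge_sketch` is an illustration only and omits the `dup` handling and the `:transformers` special case.

```ruby
# Simplified illustration of the merge semantics; NOT the gem's method.
def deep_merge_sketch(config, other_config)
  (config.keys | other_config.keys).each_with_object({}) do |key, merged|
    oldval = config[key]

    # Keys that appear only in the base config are carried over as-is.
    unless other_config.key?(key)
      merged[key] = oldval
      next
    end

    newval = other_config[key]
    merged[key] =
      if oldval.is_a?(Hash) && newval.is_a?(Hash)
        deep_merge_sketch(oldval, newval) # nested hashes merge recursively
      elsif newval.is_a?(Array)
        newval.uniq                       # arrays replace, de-duplicated
      else
        newval
      end
  end
end

base = { elements: ["a", "b"], attributes: { "a" => ["href"] } }
custom = deep_merge_sketch(
  base,
  { elements: ["a", "img", "img"], attributes: { "img" => ["src"] } },
)

custom[:elements]   # => ["a", "img"]
custom[:attributes] # => { "a" => ["href"], "img" => ["src"] }
```

To extend a built-in element list rather than replace it, concatenate explicitly before merging, as `RELAXED` does.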
@@ -0,0 +1,8 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "selma/sanitizer/config"
4
+
5
+ module Selma
6
+ class Sanitizer
7
+ end
8
+ end
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ class Selector
5
+ end
6
+ end
@@ -0,0 +1,5 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Selma
4
+ VERSION = "0.4.15"
5
+ end
data/lib/selma.rb ADDED
@@ -0,0 +1,13 @@
1
+ # frozen_string_literal: true
2
+
3
+ if ENV.fetch("DEBUG", false)
4
+ require "amazing_print"
5
+ require "debug"
6
+ end
7
+
8
+ require_relative "selma/extension"
9
+
10
+ require_relative "selma/sanitizer"
11
+ require_relative "selma/html"
12
+ require_relative "selma/rewriter"
13
+ require_relative "selma/selector"
metadata ADDED
@@ -0,0 +1,99 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: selma
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.4.15
5
+ platform: arm-linux
6
+ authors:
7
+ - Garen J. Torikian
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2026-01-06 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rake
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '13.0'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '13.0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake-compiler
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '1.2'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '1.2'
41
+ description:
42
+ email:
43
+ - gjtorikian@gmail.com
44
+ executables: []
45
+ extensions: []
46
+ extra_rdoc_files: []
47
+ files:
48
+ - LICENSE.txt
49
+ - README.md
50
+ - lib/selma.rb
51
+ - lib/selma/3.2/selma.so
52
+ - lib/selma/3.3/selma.so
53
+ - lib/selma/3.4/selma.so
54
+ - lib/selma/4.0/selma.so
55
+ - lib/selma/config.rb
56
+ - lib/selma/extension.rb
57
+ - lib/selma/html.rb
58
+ - lib/selma/html/element.rb
59
+ - lib/selma/rewriter.rb
60
+ - lib/selma/sanitizer.rb
61
+ - lib/selma/sanitizer/config.rb
62
+ - lib/selma/sanitizer/config/basic.rb
63
+ - lib/selma/sanitizer/config/default.rb
64
+ - lib/selma/sanitizer/config/relaxed.rb
65
+ - lib/selma/sanitizer/config/restricted.rb
66
+ - lib/selma/selector.rb
67
+ - lib/selma/version.rb
68
+ homepage:
69
+ licenses:
70
+ - MIT
71
+ metadata:
72
+ allowed_push_host: https://rubygems.org
73
+ funding_uri: https://github.com/sponsors/gjtorikian/
74
+ source_code_uri: https://github.com/gjtorikian/selma
75
+ rubygems_mfa_required: 'true'
76
+ post_install_message:
77
+ rdoc_options: []
78
+ require_paths:
79
+ - lib
80
+ required_ruby_version: !ruby/object:Gem::Requirement
81
+ requirements:
82
+ - - ">="
83
+ - !ruby/object:Gem::Version
84
+ version: '3.2'
85
+ - - "<"
86
+ - !ruby/object:Gem::Version
87
+ version: 4.1.dev
88
+ required_rubygems_version: !ruby/object:Gem::Requirement
89
+ requirements:
90
+ - - ">="
91
+ - !ruby/object:Gem::Version
92
+ version: '3.4'
93
+ requirements: []
94
+ rubygems_version: 3.5.23
95
+ signing_key:
96
+ specification_version: 4
97
+ summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
98
+ parser.
99
+ test_files: []