selma 0.0.5-aarch64-linux → 0.0.7-aarch64-linux
- checksums.yaml +4 -4
- data/README.md +72 -23
- data/ext/selma/extconf.rb +1 -1
- data/lib/selma/3.1/selma.so +0 -0
- data/lib/selma/3.2/selma.so +0 -0
- data/lib/selma/sanitizer/config/basic.rb +33 -2
- data/lib/selma/sanitizer/config/default.rb +40 -4
- data/lib/selma/sanitizer/config/relaxed.rb +69 -8
- data/lib/selma/version.rb +1 -1
- data/selma.gemspec +1 -1
- metadata +10 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6adb6d21580f0bdf84d55e5820ce078d39676276970a5fcd124bdf5c83efe916
+  data.tar.gz: 3c8e20b7eeb2f3081512369389bffa34f991131b12fcc4916c9beddf2bb514cf
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d6c17eed8ee8c52a03b1e6c95844e5f4c97a45576cc1e40813cb94944809a66afe89678f4b1e19aa7efaab9bc9a280761b5e5fed8da71b2d31db26ee47305ccf
+  data.tar.gz: 390d48b3a06c4bf98d6690ff682bfc1b139d54ca2e7cf3eef4c30f89835a5044356db2aaeb4696897df9d612d3782bb92979fb01476343d46a47daf090f2ce72
|
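The SHA256 and SHA512 entries above can be compared against the `metadata.gz` and `data.tar.gz` files unpacked from a downloaded `.gem` archive. A minimal sketch using Ruby's standard `digest` library; the `verify_digest` helper and the file path in the usage comment are illustrative assumptions, not part of RubyGems:

```ruby
require "digest"

# Hypothetical helper: returns true when the file's hex digest matches the
# expected value (for example, one copied from checksums.yaml).
def verify_digest(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Usage (the path is an assumption; unpack the .gem archive first):
#   verify_digest("data.tar.gz", "3c8e20b7eeb2f3081512369389bffa34f991131b12fcc4916c9beddf2bb514cf")
#   verify_digest("data.tar.gz", sha512_from_checksums, algorithm: Digest::SHA512)
```

Passing the digest class as a keyword keeps one helper usable for both sections of `checksums.yaml`.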
data/README.md
CHANGED
@@ -24,23 +24,29 @@ Or install it yourself as:
|
|
24
24
|
|
25
25
|
## Usage
|
26
26
|
|
27
|
-
Selma can perform two different actions:
|
27
|
+
Selma can perform two different actions, either independently or together:
|
28
28
|
|
29
29
|
- Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
|
30
|
-
- Select HTML using CSS rules, and manipulate elements and text
|
30
|
+
- Select HTML using CSS rules, and manipulate elements and text nodes along the way.
|
31
31
|
|
32
|
-
The basic API for Selma looks like this:
|
32
|
+
It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
|
33
33
|
|
34
34
|
```ruby
|
35
|
-
|
36
|
-
|
35
|
+
sanitizer_config = {
|
36
|
+
elements: ["b", "em", "i", "strong", "u"],
|
37
|
+
}
|
38
|
+
sanitizer = Selma::Sanitizer.new(sanitizer_config)
|
39
|
+
rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
|
40
|
+
# removes any element that is not ["b", "em", "i", "strong", "u"];
|
41
|
+
# then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
|
42
|
+
rewriter.rewrite(html)
|
37
43
|
```
|
38
44
|
|
39
|
-
|
45
|
+
Here's a look at each individual part.
|
40
46
|
|
41
47
|
### Sanitization config
|
42
48
|
|
43
|
-
Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you want to disable HTML sanitization (for some reason), pass `nil`:
|
49
|
+
Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
|
44
50
|
|
45
51
|
```ruby
|
46
52
|
Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
|
@@ -87,22 +93,22 @@ whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
|
|
87
93
|
|
88
94
|
### Defining handlers
|
89
95
|
|
90
|
-
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods:
|
96
|
+
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
|
91
97
|
|
92
98
|
- `selector`, a method which MUST return an instance of `Selma::Selector` which defines the CSS classes to match
|
93
99
|
- `handle_element`, a method that's called on each matched element
|
94
|
-
- `handle_text_chunk`, a method that's called on each matched text node
|
100
|
+
- `handle_text_chunk`, a method that's called on each matched text node
|
95
101
|
|
96
102
|
Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
|
97
103
|
|
98
104
|
```ruby
|
99
105
|
class MatchAttribute
|
100
|
-
SELECTOR = Selma::Selector(match_element: "
|
106
|
+
SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
|
101
107
|
|
102
108
|
def handle_element(element)
|
103
|
-
if element.tag_name == "a"
|
109
|
+
if element.tag_name == "a"
|
104
110
|
element["href"] = rename_http(element["href"])
|
105
|
-
elsif element.tag_name == "img"
|
111
|
+
elsif element.tag_name == "img"
|
106
112
|
element["src"] = rename_http(element["src"])
|
107
113
|
end
|
108
114
|
end
|
@@ -118,10 +124,10 @@ rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
|
|
118
124
|
The `Selma::Selector` object has three possible kwargs:
|
119
125
|
|
120
126
|
- `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
|
121
|
-
- `match_text_within`: any
|
127
|
+
- `match_text_within`: any text_chunk which matches this CSS rule will be passed on to `handle_text_chunk`
|
122
128
|
- `ignore_text_within`: this is an array of element names whose text contents will be ignored
|
123
129
|
|
124
|
-
|
130
|
+
Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
|
125
131
|
|
126
132
|
```ruby
|
127
133
|
|
@@ -144,20 +150,63 @@ rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
|
|
144
150
|
|
145
151
|
The `element` argument in `handle_element` has the following methods:
|
146
152
|
|
147
|
-
- `tag_name`:
|
148
|
-
- `
|
149
|
-
- `
|
150
|
-
- `
|
151
|
-
- `
|
152
|
-
- `
|
153
|
-
- `
|
153
|
+
- `tag_name`: Gets the element's name
|
154
|
+
- `tag_name=`: Sets the element's name
|
155
|
+
- `self_closing?`: A bool which identifies whether or not the element is self-closing
|
156
|
+
- `[]`: Get an attribute
|
157
|
+
- `[]=`: Set an attribute
|
158
|
+
- `remove_attribute`: Remove an attribute
|
159
|
+
- `has_attribute?`: A bool which identifies whether or not the element has an attribute
|
160
|
+
- `attributes`: List all the attributes
|
161
|
+
- `ancestors`: List all of an element's ancestors as an array of strings
|
154
162
|
- `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
155
163
|
- `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
156
|
-
- `
|
164
|
+
- `prepend(content, as: content_type)`: prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
165
|
+
- `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
166
|
+
- `set_inner_content(content, as: content_type)`: Replaces the inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
167
|
+
|
168
|
+
#### `text_chunk` methods
|
169
|
+
|
170
|
+
- `to_s` / `.content`: Gets the text node's content
|
171
|
+
- `text_type`: identifies the type of text in the text node
|
172
|
+
- `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
173
|
+
- `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
174
|
+
- `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
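To show how some of the `element` methods listed above fit together, here is a sketch of a `handle_element` method run against a stand-in object. `FakeElement` and `UpgradeLinks` are hypothetical names used only for illustration. `FakeElement` mimics a few of the documented methods (`tag_name`, `[]`, `[]=`, `has_attribute?`, `remove_attribute`) so the example runs without the gem; a real handler would also define `selector` and receive its element from Selma.

```ruby
# Hypothetical stand-in for the element Selma passes to `handle_element`.
class FakeElement
  attr_accessor :tag_name

  def initialize(tag_name, attributes = {})
    @tag_name = tag_name
    @attributes = attributes.dup
  end

  def [](name)
    @attributes[name]
  end

  def []=(name, value)
    @attributes[name] = value
  end

  def has_attribute?(name)
    @attributes.key?(name)
  end

  def remove_attribute(name)
    @attributes.delete(name)
  end
end

# Hypothetical handler: upgrades http links and drops inline click handlers.
class UpgradeLinks
  def handle_element(element)
    if element.tag_name == "a" && element.has_attribute?("href")
      element["href"] = element["href"].sub(/\Ahttp:/, "https:")
    end
    element.remove_attribute("onclick") if element.has_attribute?("onclick")
  end
end

element = FakeElement.new("a", "href" => "http://example.com", "onclick" => "x()")
UpgradeLinks.new.handle_element(element)
element["href"] # => "https://example.com"
```

Because the handler only relies on the documented method names, the same `handle_element` body should carry over to a real element, assuming those methods behave as the list describes.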
157
175
|
|
158
176
|
## Benchmarks
|
159
177
|
|
160
|
-
|
178
|
+
<details>
|
179
|
+
<pre>
|
180
|
+
ruby test/benchmark.rb
|
181
|
+
ruby test/benchmark.rb
|
182
|
+
Warming up --------------------------------------
|
183
|
+
sanitize-document-huge
|
184
|
+
1.000 i/100ms
|
185
|
+
selma-document-huge 1.000 i/100ms
|
186
|
+
Calculating -------------------------------------
|
187
|
+
sanitize-document-huge
|
188
|
+
0.257 (± 0.0%) i/s - 2.000 in 7.783398s
|
189
|
+
selma-document-huge 4.602 (± 0.0%) i/s - 23.000 in 5.002870s
|
190
|
+
Warming up --------------------------------------
|
191
|
+
sanitize-document-medium
|
192
|
+
2.000 i/100ms
|
193
|
+
selma-document-medium
|
194
|
+
22.000 i/100ms
|
195
|
+
Calculating -------------------------------------
|
196
|
+
sanitize-document-medium
|
197
|
+
28.676 (± 3.5%) i/s - 144.000 in 5.024669s
|
198
|
+
selma-document-medium
|
199
|
+
121.500 (±22.2%) i/s - 594.000 in 5.135410s
|
200
|
+
Warming up --------------------------------------
|
201
|
+
sanitize-document-small
|
202
|
+
10.000 i/100ms
|
203
|
+
selma-document-small 20.000 i/100ms
|
204
|
+
Calculating -------------------------------------
|
205
|
+
sanitize-document-small
|
206
|
+
107.280 (± 0.9%) i/s - 540.000 in 5.033850s
|
207
|
+
selma-document-small 118.867 (±31.1%) i/s - 540.000 in 5.080726s
|
208
|
+
</pre>
|
209
|
+
</details>
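Reading the iterations-per-second figures from the output above, the relative throughput can be worked out directly. A small sketch; the numbers are copied from the benchmark run, and `RESULTS` and `speedup` are illustrative names:

```ruby
# Iterations per second copied from the benchmark output above.
RESULTS = {
  "huge"   => { sanitize: 0.257,   selma: 4.602 },
  "medium" => { sanitize: 28.676,  selma: 121.500 },
  "small"  => { sanitize: 107.280, selma: 118.867 },
}.freeze

# Ratio of Selma's throughput to Sanitize's, rounded to one decimal place.
def speedup(result)
  (result[:selma] / result[:sanitize]).round(1)
end

RESULTS.each do |size, result|
  puts "document-#{size}: selma runs at about #{speedup(result)}x the throughput of sanitize"
end
```

This gives roughly 17.9x, 4.2x, and 1.1x for the huge, medium, and small documents. The medium and small Selma runs report deviations of ±22.2% and ±31.1%, so those two ratios should be read as rough estimates.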
|
161
210
|
|
162
211
|
## Contributing
|
163
212
|
|
data/ext/selma/extconf.rb
CHANGED
data/lib/selma/3.1/selma.so
CHANGED
Binary file
|
Binary file
|
data/lib/selma/sanitizer/config/basic.rb
CHANGED
@@ -4,8 +4,39 @@ module Selma
|
|
4
4
|
class Sanitizer
|
5
5
|
module Config
|
6
6
|
BASIC = freeze_config(
|
7
|
-
elements: [
|
8
|
-
|
7
|
+
elements: [
|
8
|
+
"a",
|
9
|
+
"abbr",
|
10
|
+
"blockquote",
|
11
|
+
"b",
|
12
|
+
"br",
|
13
|
+
"cite",
|
14
|
+
"code",
|
15
|
+
"dd",
|
16
|
+
"dfn",
|
17
|
+
"dl",
|
18
|
+
"dt",
|
19
|
+
"em",
|
20
|
+
"i",
|
21
|
+
"kbd",
|
22
|
+
"li",
|
23
|
+
"mark",
|
24
|
+
"ol",
|
25
|
+
"p",
|
26
|
+
"pre",
|
27
|
+
"q",
|
28
|
+
"s",
|
29
|
+
"samp",
|
30
|
+
"small",
|
31
|
+
"strike",
|
32
|
+
"strong",
|
33
|
+
"sub",
|
34
|
+
"sup",
|
35
|
+
"time",
|
36
|
+
"u",
|
37
|
+
"ul",
|
38
|
+
"var",
|
39
|
+
],
|
9
40
|
|
10
41
|
attributes: {
|
11
42
|
"a" => ["href"],
|
data/lib/selma/sanitizer/config/default.rb
CHANGED
@@ -33,13 +33,49 @@ module Selma
|
|
33
33
|
|
34
34
|
# An Array of element names whose contents will be removed. The contents
|
35
35
|
# of all other filtered elements will be left behind.
|
36
|
-
remove_contents: [
|
37
|
-
|
36
|
+
remove_contents: [
|
37
|
+
"iframe",
|
38
|
+
"math",
|
39
|
+
"noembed",
|
40
|
+
"noframes",
|
41
|
+
"noscript",
|
42
|
+
"plaintext",
|
43
|
+
"script",
|
44
|
+
"style",
|
45
|
+
"svg",
|
46
|
+
"xmp",
|
47
|
+
],
|
38
48
|
|
39
49
|
# Elements which, when removed, should have their contents surrounded by
|
40
50
|
# whitespace.
|
41
|
-
whitespace_elements: [
|
42
|
-
|
51
|
+
whitespace_elements: [
|
52
|
+
"address",
|
53
|
+
"article",
|
54
|
+
"aside",
|
55
|
+
"blockquote",
|
56
|
+
"br",
|
57
|
+
"dd",
|
58
|
+
"div",
|
59
|
+
"dl",
|
60
|
+
"dt",
|
61
|
+
"footer",
|
62
|
+
"h1",
|
63
|
+
"h2",
|
64
|
+
"h3",
|
65
|
+
"h4",
|
66
|
+
"h5",
|
67
|
+
"h6",
|
68
|
+
"header",
|
69
|
+
"hgroup",
|
70
|
+
"hr",
|
71
|
+
"li",
|
72
|
+
"nav",
|
73
|
+
"ol",
|
74
|
+
"p",
|
75
|
+
"pre",
|
76
|
+
"section",
|
77
|
+
"ul",
|
78
|
+
],
|
43
79
|
)
|
44
80
|
end
|
45
81
|
end
|
data/lib/selma/sanitizer/config/relaxed.rb
CHANGED
@@ -4,12 +4,60 @@ module Selma
|
|
4
4
|
class Sanitizer
|
5
5
|
module Config
|
6
6
|
RELAXED = freeze_config(
|
7
|
-
elements: BASIC[:elements] + [
|
8
|
-
|
7
|
+
elements: BASIC[:elements] + [
|
8
|
+
"address",
|
9
|
+
"article",
|
10
|
+
"aside",
|
11
|
+
"bdi",
|
12
|
+
"bdo",
|
13
|
+
"body",
|
14
|
+
"caption",
|
15
|
+
"col",
|
16
|
+
"colgroup",
|
17
|
+
"data",
|
18
|
+
"del",
|
19
|
+
"div",
|
20
|
+
"figcaption",
|
21
|
+
"figure",
|
22
|
+
"footer",
|
23
|
+
"h1",
|
24
|
+
"h2",
|
25
|
+
"h3",
|
26
|
+
"h4",
|
27
|
+
"h5",
|
28
|
+
"h6",
|
29
|
+
"head",
|
30
|
+
"header",
|
31
|
+
"hgroup",
|
32
|
+
"hr",
|
33
|
+
"html",
|
34
|
+
"img",
|
35
|
+
"ins",
|
36
|
+
"main",
|
37
|
+
"nav",
|
38
|
+
"rp",
|
39
|
+
"rt",
|
40
|
+
"ruby",
|
41
|
+
"section",
|
42
|
+
"span",
|
43
|
+
"style",
|
44
|
+
"summary",
|
45
|
+
"sup",
|
46
|
+
"table",
|
47
|
+
"tbody",
|
48
|
+
"td",
|
49
|
+
"tfoot",
|
50
|
+
"th",
|
51
|
+
"thead",
|
52
|
+
"title",
|
53
|
+
"tr",
|
54
|
+
"wbr",
|
55
|
+
],
|
9
56
|
|
10
57
|
allow_doctype: true,
|
11
58
|
|
12
|
-
attributes: merge(
|
59
|
+
attributes: merge(
|
60
|
+
BASIC[:attributes],
|
13
61
|
:all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
|
14
62
|
"a" => ["href", "hreflang", "name", "rel"],
|
15
63
|
"col" => ["span", "width"],
|
@@ -21,16 +69,29 @@ module Selma
|
|
21
69
|
"li" => ["value"],
|
22
70
|
"ol" => ["reversed", "start", "type"],
|
23
71
|
"style" => ["media", "scoped", "type"],
|
24
|
-
"table" => [
|
25
|
-
|
72
|
+
"table" => [
|
73
|
+
"align",
|
74
|
+
"bgcolor",
|
75
|
+
"border",
|
76
|
+
"cellpadding",
|
77
|
+
"cellspacing",
|
78
|
+
"frame",
|
79
|
+
"rules",
|
80
|
+
"sortable",
|
81
|
+
"summary",
|
82
|
+
"width",
|
83
|
+
],
|
26
84
|
"td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
|
27
85
|
"th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
|
28
|
-
"ul" => ["type"]
|
86
|
+
"ul" => ["type"],
|
87
|
+
),
|
29
88
|
|
30
|
-
protocols: merge(
|
89
|
+
protocols: merge(
|
90
|
+
BASIC[:protocols],
|
31
91
|
"del" => { "cite" => ["http", "https", :relative] },
|
32
92
|
"img" => { "src" => ["http", "https", :relative] },
|
33
|
-
"ins" => { "cite" => ["http", "https", :relative] }
|
93
|
+
"ins" => { "cite" => ["http", "https", :relative] },
|
94
|
+
),
|
34
95
|
)
|
35
96
|
end
|
36
97
|
end
|
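In the hunk above, `RELAXED` now builds `attributes` and `protocols` by calling `merge` with the `BASIC` values first and the overrides after. The implementation of `merge` is not part of this diff. The sketch below assumes a recursive merge in which nested hashes are combined and any other value from the override replaces the base value; `deep_merge` is a hypothetical name and the `"q"` entry is made up for illustration.

```ruby
# Hypothetical recursive merge: nested hashes are merged key by key,
# every other value (arrays included) is replaced by the override.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# "a" => ["href"] appears in the BASIC hunk; "q" is an illustrative extra entry.
basic_attributes = { "a" => ["href"], "q" => ["cite"] }

relaxed_attributes = deep_merge(
  basic_attributes,
  :all => ["class", "id"],
  "a" => ["href", "hreflang", "name", "rel"],
)

relaxed_attributes["a"] # => ["href", "hreflang", "name", "rel"]
relaxed_attributes["q"] # => ["cite"]
```

Under this assumption, keys that only exist in `BASIC` survive into `RELAXED`, which is what passing `BASIC[:attributes]` as the first argument would be expected to achieve.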
data/lib/selma/version.rb
CHANGED
data/selma.gemspec
CHANGED
@@ -24,7 +24,7 @@ Gem::Specification.new do |spec|
   spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
 
   spec.require_paths = ["lib"]
-  spec.extensions = ["ext/selma/
+  spec.extensions = ["ext/selma/extconf.rb"]
 
   spec.metadata = {
     "allowed_push_host" => "https://rubygems.org",
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: selma
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.7
 platform: aarch64-linux
 authors:
 - Garen J. Torikian
-autorequire:
+autorequire:
 bindir: exe
 cert_chain: []
-date:
+date: 2023-01-09 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rb_sys
@@ -66,7 +66,7 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '1.2'
-description:
+description:
 email:
 - gjtorikian@gmail.com
 executables: []
@@ -91,6 +91,7 @@ files:
 - ext/selma/src/wrapped_struct.rs
 - lib/selma.rb
 - lib/selma/3.1/selma.so
+- lib/selma/3.2/selma.so
 - lib/selma/extension.rb
 - lib/selma/html.rb
 - lib/selma/rewriter.rb
@@ -103,7 +104,7 @@ files:
 - lib/selma/selector.rb
 - lib/selma/version.rb
 - selma.gemspec
-homepage:
+homepage:
 licenses:
 - MIT
 metadata:
@@ -111,7 +112,7 @@ metadata:
   funding_uri: https://github.com/sponsors/gjtorikian/
   source_code_uri: https://github.com/gjtorikian/selma
   rubygems_mfa_required: 'true'
-post_install_message:
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -122,15 +123,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
       version: '3.1'
   - - "<"
     - !ruby/object:Gem::Version
-      version: 3.
+      version: 3.3.dev
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: 3.3.22
 requirements: []
-rubygems_version: 3.3
-signing_key:
+rubygems_version: 3.4.3
+signing_key:
 specification_version: 4
 summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
   parser.
|
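The `required_ruby_version` entry in the metadata above now caps the supported Ruby below `3.3.dev`. One way to see which interpreter versions that admits is RubyGems' own `Gem::Requirement`. The operator for the lower bound is not visible in the hunk, so `>=` is an assumption here:

```ruby
require "rubygems"

# Assumes the lower bound is ">= 3.1"; only the version '3.1' appears in the hunk.
requirement = Gem::Requirement.new(">= 3.1", "< 3.3.dev")

%w[3.0.5 3.1.0 3.2.0 3.3.0].each do |version|
  supported = requirement.satisfied_by?(Gem::Version.new(version))
  puts "Ruby #{version}: #{supported ? "supported" : "not supported"}"
end
```

Under that assumption, 3.1.x and 3.2.x satisfy the requirement while 3.0.x and 3.3.0 do not, which lines up with `lib/selma/3.2/selma.so` being added to the file list.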