selma 0.0.5-x86_64-linux → 0.0.7-x86_64-linux
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +72 -23
- data/ext/selma/extconf.rb +1 -1
- data/lib/selma/3.1/selma.so +0 -0
- data/lib/selma/3.2/selma.so +0 -0
- data/lib/selma/sanitizer/config/basic.rb +33 -2
- data/lib/selma/sanitizer/config/default.rb +40 -4
- data/lib/selma/sanitizer/config/relaxed.rb +69 -8
- data/lib/selma/version.rb +1 -1
- data/selma.gemspec +1 -1
- metadata +10 -9
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: bbd21b1ab2dfc59fc7c16475c993a497cdd025ca136270d92114bbab596602fe
|
4
|
+
data.tar.gz: 5a5b5884d5ed3b714ca80c8acfbfaf8b3b044054e834ae9fffaacce9946ac7a2
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 8ddc5f4256ffaa7bf8756c872a46caa139925bf4e9fee6c2cd447fe96e85276ddb7af24ae2038ed4eb9c30a21052cf09dd298ff44bad6f8d36c6f6a88690164e
|
7
|
+
data.tar.gz: e3eb10cc9b5db8eb9a1a11e2a8bb3eb92338d627d351068945ffd1bfce9d08960946288ae6513df8f9b7ee74271583b38ba67d4761ecce7dda664c1ae2eeef1b
|
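A downloaded archive can be checked against the SHA256 values listed above with something like the following. This is a minimal sketch: the helper name `checksum_matches?` and the file path in the usage comment are hypothetical, and only Ruby's standard `digest` library is assumed.

```ruby
require "digest"

# Hypothetical helper (not part of Selma or RubyGems): returns true when the
# SHA256 hex digest of the file at `path` equals `expected_hex`.
# The comparison ignores the letter case of the expected value.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Example usage, assuming data.tar.gz was unpacked from the 0.0.7 gem into the
# current directory (the path is an assumption):
#
#   checksum_matches?(
#     "data.tar.gz",
#     "5a5b5884d5ed3b714ca80c8acfbfaf8b3b044054e834ae9fffaacce9946ac7a2"
#   )
```

The SHA512 entries could be checked the same way by swapping in `Digest::SHA512`.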
data/README.md
CHANGED
@@ -24,23 +24,29 @@ Or install it yourself as:
|
|
24
24
|
|
25
25
|
## Usage
|
26
26
|
|
27
|
-
Selma can perform two different actions:
|
27
|
+
Selma can perform two different actions, either independently or together:
|
28
28
|
|
29
29
|
- Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
|
30
|
-
- Select HTML using CSS rules, and manipulate elements and text
|
30
|
+
- Select HTML using CSS rules, and manipulate elements and text nodes along the way.
|
31
31
|
|
32
|
-
The basic API for Selma looks like this:
|
32
|
+
It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
|
33
33
|
|
34
34
|
```ruby
|
35
|
-
|
36
|
-
|
35
|
+
sanitizer_config = {
|
36
|
+
elements: ["b", "em", "i", "strong", "u"],
|
37
|
+
}
|
38
|
+
sanitizer = Selma::Sanitizer.new(sanitizer_config)
|
39
|
+
rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
|
40
|
+
# removes any element that is not ["b", "em", "i", "strong", "u"];
|
41
|
+
# then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
|
42
|
+
rewriter.rewrite(html)
|
37
43
|
```
|
38
44
|
|
39
|
-
|
45
|
+
Here's a look at each individual part.
|
40
46
|
|
41
47
|
### Sanitization config
|
42
48
|
|
43
|
-
Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you want to disable HTML sanitization (for some reason), pass `nil`:
|
49
|
+
Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
|
44
50
|
|
45
51
|
```ruby
|
46
52
|
Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
|
@@ -87,22 +93,22 @@ whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
|
|
87
93
|
|
88
94
|
### Defining handlers
|
89
95
|
|
90
|
-
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods:
|
96
|
+
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
|
91
97
|
|
92
98
|
- `selector`, a method which MUST return an instance of `Selma::Selector`, which defines the CSS rules to match
|
93
99
|
- `handle_element`, a method that's called on each matched element
|
94
|
-
- `handle_text_chunk`, a method that's called on each matched text node
|
100
|
+
- `handle_text_chunk`, a method that's called on each matched text node
|
95
101
|
|
96
102
|
Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
|
97
103
|
|
98
104
|
```ruby
|
99
105
|
class MatchAttribute
|
100
|
-
SELECTOR = Selma::Selector(match_element: "
|
106
|
+
SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
|
101
107
|
|
102
108
|
def handle_element(element)
|
103
|
-
if element.tag_name == "a"
|
109
|
+
if element.tag_name == "a"
|
104
110
|
element["href"] = rename_http(element["href"])
|
105
|
-
elsif element.tag_name == "img"
|
111
|
+
elsif element.tag_name == "img"
|
106
112
|
element["src"] = rename_http(element["src"])
|
107
113
|
end
|
108
114
|
end
|
@@ -118,10 +124,10 @@ rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
|
|
118
124
|
The `Selma::Selector` object has three possible kwargs:
|
119
125
|
|
120
126
|
- `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
|
121
|
-
- `match_text_within`: any
|
127
|
+
- `match_text_within`: any text_chunk which matches this CSS rule will be passed on to `handle_text_chunk`
|
122
128
|
- `ignore_text_within`: this is an array of element names whose text contents will be ignored
|
123
129
|
|
124
|
-
|
130
|
+
Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
|
125
131
|
|
126
132
|
```ruby
|
127
133
|
|
@@ -144,20 +150,63 @@ rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
|
|
144
150
|
|
145
151
|
The `element` argument in `handle_element` has the following methods:
|
146
152
|
|
147
|
-
- `tag_name`:
|
148
|
-
- `
|
149
|
-
- `
|
150
|
-
- `
|
151
|
-
- `
|
152
|
-
- `
|
153
|
-
- `
|
153
|
+
- `tag_name`: Gets the element's name
|
154
|
+
- `tag_name=`: Sets the element's name
|
155
|
+
- `self_closing?`: A bool which identifies whether or not the element is self-closing
|
156
|
+
- `[]`: Get an attribute
|
157
|
+
- `[]=`: Set an attribute
|
158
|
+
- `remove_attribute`: Remove an attribute
|
159
|
+
- `has_attribute?`: A bool which identifies whether or not the element has an attribute
|
160
|
+
- `attributes`: List all the attributes
|
161
|
+
- `ancestors`: List all of an element's ancestors as an array of strings
|
154
162
|
- `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
155
163
|
- `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
156
|
-
- `
|
164
|
+
- `prepend(content, as: content_type)`: prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
165
|
+
- `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
166
|
+
- `set_inner_content(content, as: content_type)`: Replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
167
|
+
|
168
|
+
#### `text_chunk` methods
|
169
|
+
|
170
|
+
- `to_s` / `.content`: Gets the text node's content
|
171
|
+
- `text_type`: identifies the type of text in the text node
|
172
|
+
- `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
173
|
+
- `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
|
174
|
+
- `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
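
The attribute methods listed for `element` can be tried out without the native extension by using a stand-in object. The sketch below is only an illustration: `FakeElement` and `UpgradeLinks` are hypothetical names, `FakeElement` is not Selma's class, and it mimics only a handful of the documented methods.

```ruby
# Stand-in for the element object passed to `handle_element`.
# It is NOT Selma's implementation; it only mimics a few documented methods.
class FakeElement
  attr_accessor :tag_name

  def initialize(tag_name, attributes = {})
    @tag_name = tag_name
    @attributes = attributes.dup
  end

  # Get an attribute
  def [](name)
    @attributes[name]
  end

  # Set an attribute
  def []=(name, value)
    @attributes[name] = value
  end

  def has_attribute?(name)
    @attributes.key?(name)
  end

  def remove_attribute(name)
    @attributes.delete(name)
  end

  def attributes
    @attributes.dup
  end
end

# Hypothetical handler: upgrades http URLs to https and drops `onclick`.
class UpgradeLinks
  def handle_element(element)
    attr = element.tag_name == "img" ? "src" : "href"
    if element.has_attribute?(attr)
      element[attr] = element[attr].sub(/\Ahttp:/, "https:")
    end
    element.remove_attribute("onclick")
  end
end

link = FakeElement.new("a", "href" => "http://example.com", "onclick" => "x()")
UpgradeLinks.new.handle_element(link)
puts link["href"]
```

A handler written this way would still need a `selector` method returning a `Selma::Selector` before it could be passed to `Selma::Rewriter`.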
|
157
175
|
|
158
176
|
## Benchmarks
|
159
177
|
|
160
|
-
|
178
|
+
<details>
|
179
|
+
<pre>
|
180
|
+
ruby test/benchmark.rb
|
181
|
+
ruby test/benchmark.rb
|
182
|
+
Warming up --------------------------------------
|
183
|
+
sanitize-document-huge
|
184
|
+
1.000 i/100ms
|
185
|
+
selma-document-huge 1.000 i/100ms
|
186
|
+
Calculating -------------------------------------
|
187
|
+
sanitize-document-huge
|
188
|
+
0.257 (± 0.0%) i/s - 2.000 in 7.783398s
|
189
|
+
selma-document-huge 4.602 (± 0.0%) i/s - 23.000 in 5.002870s
|
190
|
+
Warming up --------------------------------------
|
191
|
+
sanitize-document-medium
|
192
|
+
2.000 i/100ms
|
193
|
+
selma-document-medium
|
194
|
+
22.000 i/100ms
|
195
|
+
Calculating -------------------------------------
|
196
|
+
sanitize-document-medium
|
197
|
+
28.676 (± 3.5%) i/s - 144.000 in 5.024669s
|
198
|
+
selma-document-medium
|
199
|
+
121.500 (±22.2%) i/s - 594.000 in 5.135410s
|
200
|
+
Warming up --------------------------------------
|
201
|
+
sanitize-document-small
|
202
|
+
10.000 i/100ms
|
203
|
+
selma-document-small 20.000 i/100ms
|
204
|
+
Calculating -------------------------------------
|
205
|
+
sanitize-document-small
|
206
|
+
107.280 (± 0.9%) i/s - 540.000 in 5.033850s
|
207
|
+
selma-document-small 118.867 (±31.1%) i/s - 540.000 in 5.080726s
|
208
|
+
</pre>
|
209
|
+
</details>
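
To read the figures above as relative throughput, the iterations-per-second values can be divided pairwise. The snippet below is a small sketch that copies the numbers from the output; the constant and method names are made up for illustration. The medium and small Selma runs report wide error margins (±22.2% and ±31.1%), so those ratios are rough.

```ruby
# Iterations per second copied from the benchmark output above.
RESULTS = {
  "huge"   => { sanitize: 0.257,   selma: 4.602 },
  "medium" => { sanitize: 28.676,  selma: 121.500 },
  "small"  => { sanitize: 107.280, selma: 118.867 },
}.freeze

# Ratio of Selma's throughput to Sanitize's, rounded to one decimal place.
def speedups(results)
  results.transform_values { |r| (r[:selma] / r[:sanitize]).round(1) }
end

speedups(RESULTS).each { |doc, ratio| puts "#{doc}: #{ratio}x" }
# huge: 17.9x
# medium: 4.2x
# small: 1.1x
```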
|
161
210
|
|
162
211
|
## Contributing
|
163
212
|
|
data/ext/selma/extconf.rb
CHANGED
data/lib/selma/3.1/selma.so
CHANGED
Binary file
|
Binary file
|
data/lib/selma/sanitizer/config/basic.rb
CHANGED
@@ -4,8 +4,39 @@ module Selma
|
|
4
4
|
class Sanitizer
|
5
5
|
module Config
|
6
6
|
BASIC = freeze_config(
|
7
|
-
elements: [
|
8
|
-
|
7
|
+
elements: [
|
8
|
+
"a",
|
9
|
+
"abbr",
|
10
|
+
"blockquote",
|
11
|
+
"b",
|
12
|
+
"br",
|
13
|
+
"cite",
|
14
|
+
"code",
|
15
|
+
"dd",
|
16
|
+
"dfn",
|
17
|
+
"dl",
|
18
|
+
"dt",
|
19
|
+
"em",
|
20
|
+
"i",
|
21
|
+
"kbd",
|
22
|
+
"li",
|
23
|
+
"mark",
|
24
|
+
"ol",
|
25
|
+
"p",
|
26
|
+
"pre",
|
27
|
+
"q",
|
28
|
+
"s",
|
29
|
+
"samp",
|
30
|
+
"small",
|
31
|
+
"strike",
|
32
|
+
"strong",
|
33
|
+
"sub",
|
34
|
+
"sup",
|
35
|
+
"time",
|
36
|
+
"u",
|
37
|
+
"ul",
|
38
|
+
"var",
|
39
|
+
],
|
9
40
|
|
10
41
|
attributes: {
|
11
42
|
"a" => ["href"],
|
data/lib/selma/sanitizer/config/default.rb
CHANGED
@@ -33,13 +33,49 @@ module Selma
|
|
33
33
|
|
34
34
|
# An Array of element names whose contents will be removed. The contents
|
35
35
|
# of all other filtered elements will be left behind.
|
36
|
-
remove_contents: [
|
37
|
-
|
36
|
+
remove_contents: [
|
37
|
+
"iframe",
|
38
|
+
"math",
|
39
|
+
"noembed",
|
40
|
+
"noframes",
|
41
|
+
"noscript",
|
42
|
+
"plaintext",
|
43
|
+
"script",
|
44
|
+
"style",
|
45
|
+
"svg",
|
46
|
+
"xmp",
|
47
|
+
],
|
38
48
|
|
39
49
|
# Elements which, when removed, should have their contents surrounded by
|
40
50
|
# whitespace.
|
41
|
-
whitespace_elements: [
|
42
|
-
|
51
|
+
whitespace_elements: [
|
52
|
+
"address",
|
53
|
+
"article",
|
54
|
+
"aside",
|
55
|
+
"blockquote",
|
56
|
+
"br",
|
57
|
+
"dd",
|
58
|
+
"div",
|
59
|
+
"dl",
|
60
|
+
"dt",
|
61
|
+
"footer",
|
62
|
+
"h1",
|
63
|
+
"h2",
|
64
|
+
"h3",
|
65
|
+
"h4",
|
66
|
+
"h5",
|
67
|
+
"h6",
|
68
|
+
"header",
|
69
|
+
"hgroup",
|
70
|
+
"hr",
|
71
|
+
"li",
|
72
|
+
"nav",
|
73
|
+
"ol",
|
74
|
+
"p",
|
75
|
+
"pre",
|
76
|
+
"section",
|
77
|
+
"ul",
|
78
|
+
],
|
43
79
|
)
|
44
80
|
end
|
45
81
|
end
|
data/lib/selma/sanitizer/config/relaxed.rb
CHANGED
@@ -4,12 +4,60 @@ module Selma
|
|
4
4
|
class Sanitizer
|
5
5
|
module Config
|
6
6
|
RELAXED = freeze_config(
|
7
|
-
elements: BASIC[:elements] + [
|
8
|
-
|
7
|
+
elements: BASIC[:elements] + [
|
8
|
+
"address",
|
9
|
+
"article",
|
10
|
+
"aside",
|
11
|
+
"bdi",
|
12
|
+
"bdo",
|
13
|
+
"body",
|
14
|
+
"caption",
|
15
|
+
"col",
|
16
|
+
"colgroup",
|
17
|
+
"data",
|
18
|
+
"del",
|
19
|
+
"div",
|
20
|
+
"figcaption",
|
21
|
+
"figure",
|
22
|
+
"footer",
|
23
|
+
"h1",
|
24
|
+
"h2",
|
25
|
+
"h3",
|
26
|
+
"h4",
|
27
|
+
"h5",
|
28
|
+
"h6",
|
29
|
+
"head",
|
30
|
+
"header",
|
31
|
+
"hgroup",
|
32
|
+
"hr",
|
33
|
+
"html",
|
34
|
+
"img",
|
35
|
+
"ins",
|
36
|
+
"main",
|
37
|
+
"nav",
|
38
|
+
"rp",
|
39
|
+
"rt",
|
40
|
+
"ruby",
|
41
|
+
"section",
|
42
|
+
"span",
|
43
|
+
"style",
|
44
|
+
"summary",
|
45
|
+
"sup",
|
46
|
+
"table",
|
47
|
+
"tbody",
|
48
|
+
"td",
|
49
|
+
"tfoot",
|
50
|
+
"th",
|
51
|
+
"thead",
|
52
|
+
"title",
|
53
|
+
"tr",
|
54
|
+
"wbr",
|
55
|
+
],
|
9
56
|
|
10
57
|
allow_doctype: true,
|
11
58
|
|
12
|
-
attributes: merge(
|
59
|
+
attributes: merge(
|
60
|
+
BASIC[:attributes],
|
13
61
|
:all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
|
14
62
|
"a" => ["href", "hreflang", "name", "rel"],
|
15
63
|
"col" => ["span", "width"],
|
@@ -21,16 +69,29 @@ module Selma
|
|
21
69
|
"li" => ["value"],
|
22
70
|
"ol" => ["reversed", "start", "type"],
|
23
71
|
"style" => ["media", "scoped", "type"],
|
24
|
-
"table" => [
|
25
|
-
|
72
|
+
"table" => [
|
73
|
+
"align",
|
74
|
+
"bgcolor",
|
75
|
+
"border",
|
76
|
+
"cellpadding",
|
77
|
+
"cellspacing",
|
78
|
+
"frame",
|
79
|
+
"rules",
|
80
|
+
"sortable",
|
81
|
+
"summary",
|
82
|
+
"width",
|
83
|
+
],
|
26
84
|
"td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
|
27
85
|
"th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
|
28
|
-
"ul" => ["type"]
|
86
|
+
"ul" => ["type"],
|
87
|
+
),
|
29
88
|
|
30
|
-
protocols: merge(
|
89
|
+
protocols: merge(
|
90
|
+
BASIC[:protocols],
|
31
91
|
"del" => { "cite" => ["http", "https", :relative] },
|
32
92
|
"img" => { "src" => ["http", "https", :relative] },
|
33
|
-
"ins" => { "cite" => ["http", "https", :relative] }
|
93
|
+
"ins" => { "cite" => ["http", "https", :relative] },
|
94
|
+
),
|
34
95
|
)
|
35
96
|
end
|
36
97
|
end
|
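The RELAXED config above builds on BASIC by passing `BASIC[:attributes]` and `BASIC[:protocols]` into `merge`. The body of `merge` is not part of this diff, so the following is only a sketch of one plausible behavior, assuming nested hashes are combined key by key and any other value is replaced. `deep_merge`, `BASIC_ATTRIBUTES`, and the `"blockquote"` entry are hypothetical stand-ins.

```ruby
# Stand-in values; the real BASIC config contains more entries than shown here.
BASIC_ATTRIBUTES = {
  "a" => ["href"],
  "blockquote" => ["cite"], # illustrative entry, assumed for this sketch
}.freeze

# Hypothetical deep merge (not Selma's implementation): nested hashes are
# merged key by key; any other value from `overrides` replaces the one in `base`.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_val, new_val|
    if old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge(old_val, new_val)
    else
      new_val
    end
  end
end

relaxed_attributes = deep_merge(
  BASIC_ATTRIBUTES,
  "a" => ["href", "hreflang", "name", "rel"],
  "ul" => ["type"],
)

puts relaxed_attributes.keys.sort.join(", ")
```

Under this assumption the `"a"` list from the second argument replaces the base list rather than being appended to it, which is why the RELAXED entry repeats `"href"`. Separately, `BASIC[:elements] + [...]` is plain array concatenation, so a name present in both lists (such as `"sup"` above) would appear twice unless it is deduplicated later.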
data/lib/selma/version.rb
CHANGED
data/selma.gemspec
CHANGED
@@ -24,7 +24,7 @@ Gem::Specification.new do |spec|
|
|
24
24
|
spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
|
25
25
|
|
26
26
|
spec.require_paths = ["lib"]
|
27
|
-
spec.extensions = ["ext/selma/
|
27
|
+
spec.extensions = ["ext/selma/extconf.rb"]
|
28
28
|
|
29
29
|
spec.metadata = {
|
30
30
|
"allowed_push_host" => "https://rubygems.org",
|
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: selma
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.0.
|
4
|
+
version: 0.0.7
|
5
5
|
platform: x86_64-linux
|
6
6
|
authors:
|
7
7
|
- Garen J. Torikian
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: exe
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2023-01-09 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: rb_sys
|
@@ -66,7 +66,7 @@ dependencies:
|
|
66
66
|
- - "~>"
|
67
67
|
- !ruby/object:Gem::Version
|
68
68
|
version: '1.2'
|
69
|
-
description:
|
69
|
+
description:
|
70
70
|
email:
|
71
71
|
- gjtorikian@gmail.com
|
72
72
|
executables: []
|
@@ -91,6 +91,7 @@ files:
|
|
91
91
|
- ext/selma/src/wrapped_struct.rs
|
92
92
|
- lib/selma.rb
|
93
93
|
- lib/selma/3.1/selma.so
|
94
|
+
- lib/selma/3.2/selma.so
|
94
95
|
- lib/selma/extension.rb
|
95
96
|
- lib/selma/html.rb
|
96
97
|
- lib/selma/rewriter.rb
|
@@ -103,7 +104,7 @@ files:
|
|
103
104
|
- lib/selma/selector.rb
|
104
105
|
- lib/selma/version.rb
|
105
106
|
- selma.gemspec
|
106
|
-
homepage:
|
107
|
+
homepage:
|
107
108
|
licenses:
|
108
109
|
- MIT
|
109
110
|
metadata:
|
@@ -111,7 +112,7 @@ metadata:
|
|
111
112
|
funding_uri: https://github.com/sponsors/gjtorikian/
|
112
113
|
source_code_uri: https://github.com/gjtorikian/selma
|
113
114
|
rubygems_mfa_required: 'true'
|
114
|
-
post_install_message:
|
115
|
+
post_install_message:
|
115
116
|
rdoc_options: []
|
116
117
|
require_paths:
|
117
118
|
- lib
|
@@ -122,15 +123,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
|
|
122
123
|
version: '3.1'
|
123
124
|
- - "<"
|
124
125
|
- !ruby/object:Gem::Version
|
125
|
-
version: 3.
|
126
|
+
version: 3.3.dev
|
126
127
|
required_rubygems_version: !ruby/object:Gem::Requirement
|
127
128
|
requirements:
|
128
129
|
- - ">="
|
129
130
|
- !ruby/object:Gem::Version
|
130
131
|
version: 3.3.22
|
131
132
|
requirements: []
|
132
|
-
rubygems_version: 3.3
|
133
|
-
signing_key:
|
133
|
+
rubygems_version: 3.4.3
|
134
|
+
signing_key:
|
134
135
|
specification_version: 4
|
135
136
|
summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
|
136
137
|
parser.
|