selma 0.0.5-arm64-darwin → 0.0.7-arm64-darwin

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6a8ae1537672de0e7cca2dcecf7c19d470e9f481130fa9c3dad37eef2bc91507
4
- data.tar.gz: 4d897c25589773b5a5bcf987d7b420aa82dd69bcb4d832686f87ea14e40610a7
3
+ metadata.gz: 57a3cdbf3d988c3e6982a605a05dd52bf3793ec81d38c652ee67618130bebb98
4
+ data.tar.gz: db2172c9762317a5e2bd58ef99faeabfb014d09ed3f38b2530a9a25cc157e2dc
5
5
  SHA512:
6
- metadata.gz: 183446b3cf5e97ef5f61e96da4db1198a8468f7971d6ae61dd9f0130cd019bde29b7975c1a46ff822b055543ed3657597bd02dc0c0a9234e24cefaa93e3e39fb
7
- data.tar.gz: d105ad92ef51135b6d8fe675febe539eb41368bde7f08e10aba9f118d6c8950328d766c874202f759becb1ae75236ebf4e91c756ea1ba12c688941742801401d
6
+ metadata.gz: 4da3b4ff30074908591b86b3faba1f0461dee4aacd24bf4d54fe1aaaeeef0b0b1008d0c6a36e043e729ec072fa2edef7eda2548202cef9ed84c89b6ddd2b21b1
7
+ data.tar.gz: 6556dc9353d76e7e56ad103209de91a94c06b139b57e405ed9ffc2a18585840598766fa1a7d89aefe42c2902bc4b1d53584fca2e9559414bb545e721cdc0929d
data/README.md CHANGED
@@ -24,23 +24,29 @@ Or install it yourself as:
24
24
 
25
25
  ## Usage
26
26
 
27
- Selma can perform two different actions:
27
+ Selma can perform two different actions, either independently or together:
28
28
 
29
29
  - Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
30
- - Select HTML using CSS rules, and manipulate elements and text
30
+ - Select HTML using CSS rules, and manipulate elements and text nodes along the way.
31
31
 
32
- The basic API for Selma looks like this:
32
+ It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
33
33
 
34
34
  ```ruby
35
- rewriter = Selma::Rewriter.new(sanitizer: sanitizer_config, handlers: [MatchAttribute.new, TextRewrite.new])
36
- rewriter(html)
35
+ sanitizer_config = {
36
+ elements: ["b", "em", "i", "strong", "u"],
37
+ }
38
+ sanitizer = Selma::Sanitizer.new(sanitizer_config)
39
+ rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
40
+ # removes any element that is not ["b", "em", "i", "strong", "u"];
41
+ # then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
42
+ rewriter.rewrite(html)
37
43
  ```
38
44
 
39
- Let's take a look at each part individually.
45
+ Here's a look at each individual part.
40
46
 
41
47
  ### Sanitization config
42
48
 
43
- Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you want to disable HTML sanitization (for some reason), pass `nil`:
49
+ Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
44
50
 
45
51
  ```ruby
46
52
  Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
@@ -87,22 +93,22 @@ whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
87
93
 
88
94
  ### Defining handlers
89
95
 
90
- The real power in Selma comes in its use of handlers. A handler is simply an object with various methods:
96
+ The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
91
97
 
92
98
  - `selector`, a method which MUST return an instance of `Selma::Selector`, which defines the CSS rules to match
93
99
  - `handle_element`, a method that's called on each matched element
94
- - `handle_text_chunk`, a method that's called on each matched text node; this MUST return a string
100
+ - `handle_text_chunk`, a method that's called on each matched text node
95
101
 
96
102
  Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
97
103
 
98
104
  ```ruby
99
105
  class MatchAttribute
100
- SELECTOR = Selma::Selector(match_element: "a, img")
106
+ SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
101
107
 
102
108
  def handle_element(element)
103
- if element.tag_name == "a" && element["href"] =~ /^http:/
109
+ if element.tag_name == "a"
104
110
  element["href"] = rename_http(element["href"])
105
- elsif element.tag_name == "img" && element["src"] =~ /^http:/
111
+ elsif element.tag_name == "img"
106
112
  element["src"] = rename_http(element["src"])
107
113
  end
108
114
  end
@@ -118,10 +124,10 @@ rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
118
124
  The `Selma::Selector` object has three possible kwargs:
119
125
 
120
126
  - `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
121
- - `match_text_within`: any element which matches this CSS rule will be passed on to `handle_text_chunk`
127
+ - `match_text_within`: the text within any element which matches this CSS rule will be passed on to `handle_text_chunk`
122
128
  - `ignore_text_within`: this is an array of element names whose text contents will be ignored
123
129
 
124
- You've seen an example of `match_element`; here's one for `match_text` which changes strings in various elements which are _not_ `pre` or `code`:
130
+ Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
125
131
 
126
132
  ```ruby
127
133
 
@@ -144,20 +150,63 @@ rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
144
150
 
145
151
  The `element` argument in `handle_element` has the following methods:
146
152
 
147
- - `tag_name`: The element's name
148
- - `[]`: get an attribute
149
- - `[]=`: set an attribute
150
- - `remove_attribute`: remove an attribute
151
- - `attributes`: list all the attributes
152
- - `ancestors`: list all the ancestors
153
- - `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
153
+ - `tag_name`: Gets the element's name
154
+ - `tag_name=`: Sets the element's name
155
+ - `self_closing?`: A bool which identifies whether or not the element is self-closing
156
+ - `[]`: Get an attribute
157
+ - `[]=`: Set an attribute
158
+ - `remove_attribute`: Remove an attribute
159
+ - `has_attribute?`: A bool which identifies whether or not the element has an attribute
160
+ - `attributes`: List all the attributes
161
+ - `ancestors`: List all of an element's ancestors as an array of strings
154
162
  - `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
155
163
  - `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
156
- - `set_inner_content`: replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
164
+ - `prepend(content, as: content_type)`: Prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
165
+ - `append(content, as: content_type)`: Appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
166
+ - `set_inner_content(content, as: content_type)`: Replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
167
+
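As a rough illustration of the `element` methods listed above, here is a sketch of a handler that uses `has_attribute?`, `[]=`, and `append`. `ExternalLinkHandler` and `FakeElement` are hypothetical names. `FakeElement` is only a stand-in so the handler logic can run without the native extension, and passing the attribute name to `has_attribute?` is an assumption, since the list does not spell out its signature.

```ruby
# Hypothetical handler: tags links with `rel` and appends a visible marker.
class ExternalLinkHandler
  # Resolved lazily, so this file loads even before `require "selma"`.
  def selector
    Selma::Selector.new(match_element: "a[href]")
  end

  def handle_element(element)
    # Only set `rel` when the markup has not set one already.
    element["rel"] = "noopener" unless element.has_attribute?("rel")
    # Insert plain text right before the element's end tag.
    element.append(" (external)", as: :text)
  end
end

# Stand-in for the element Selma passes to `handle_element`.
# It is NOT part of Selma; it only records what the handler does.
class FakeElement
  attr_reader :attributes, :appended

  def initialize(attributes = {})
    @attributes = attributes
    @appended = []
  end

  def [](name)
    @attributes[name]
  end

  def []=(name, value)
    @attributes[name] = value
  end

  def has_attribute?(name)
    @attributes.key?(name)
  end

  def append(content, as:)
    @appended << [content, as]
  end
end

element = FakeElement.new("href" => "https://example.com")
ExternalLinkHandler.new.handle_element(element)
puts element.attributes.inspect
puts element.appended.inspect
```

With the real gem, the same handler would be passed in through `handlers: [ExternalLinkHandler.new]`, as in the basic API example.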
168
+ #### `text_chunk` methods
169
+
170
+ - `to_s` / `.content`: Gets the text node's content
171
+ - `text_type`: identifies the type of text in the text node
172
+ - `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
173
+ - `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
174
+ - `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
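The `text_chunk` methods above can be sketched in the same way. The handler below reads the text with `to_s` and swaps it out with `replace`. `SpellingHandler` and `FakeTextChunk` are hypothetical names, and `FakeTextChunk` is just a stand-in (not part of Selma) so the logic can be exercised without the native extension.

```ruby
# Hypothetical handler: normalises one spelling inside paragraphs,
# skipping text inside `pre` and `code`.
class SpellingHandler
  # Resolved lazily, so this file loads even before `require "selma"`.
  def selector
    Selma::Selector.new(match_text_within: "p", ignore_text_within: ["pre", "code"])
  end

  def handle_text_chunk(text_chunk)
    updated = text_chunk.to_s.gsub("colour", "color")
    # `as: :text` asks for the replacement to be treated as plain text.
    text_chunk.replace(updated, as: :text)
  end
end

# Stand-in for the text chunk Selma passes to `handle_text_chunk`.
class FakeTextChunk
  def initialize(content)
    @content = content
  end

  def to_s
    @content
  end

  def replace(content, as:)
    @content = content
  end
end

chunk = FakeTextChunk.new("My favourite colour")
SpellingHandler.new.handle_text_chunk(chunk)
puts chunk.to_s
```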
157
175
 
158
176
  ## Benchmarks
159
177
 
160
- TBD
178
+ <details>
179
+ <pre>
180
+ ruby test/benchmark.rb
182
+ Warming up --------------------------------------
183
+ sanitize-document-huge
184
+ 1.000 i/100ms
185
+ selma-document-huge 1.000 i/100ms
186
+ Calculating -------------------------------------
187
+ sanitize-document-huge
188
+ 0.257 (± 0.0%) i/s - 2.000 in 7.783398s
189
+ selma-document-huge 4.602 (± 0.0%) i/s - 23.000 in 5.002870s
190
+ Warming up --------------------------------------
191
+ sanitize-document-medium
192
+ 2.000 i/100ms
193
+ selma-document-medium
194
+ 22.000 i/100ms
195
+ Calculating -------------------------------------
196
+ sanitize-document-medium
197
+ 28.676 (± 3.5%) i/s - 144.000 in 5.024669s
198
+ selma-document-medium
199
+ 121.500 (±22.2%) i/s - 594.000 in 5.135410s
200
+ Warming up --------------------------------------
201
+ sanitize-document-small
202
+ 10.000 i/100ms
203
+ selma-document-small 20.000 i/100ms
204
+ Calculating -------------------------------------
205
+ sanitize-document-small
206
+ 107.280 (± 0.9%) i/s - 540.000 in 5.033850s
207
+ selma-document-small 118.867 (±31.1%) i/s - 540.000 in 5.080726s
208
+ </pre>
209
+ </details>
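To put the figures above in perspective, the throughput ratios can be computed directly from the reported iterations per second. The snippet below is only arithmetic on those numbers. Note that the medium and small runs report wide error margins (±22.2% and ±31.1% for Selma), so those two ratios are rough at best.

```ruby
# Iterations per second, copied from the benchmark output above.
RESULTS = {
  "huge"   => { sanitize: 0.257,   selma: 4.602 },
  "medium" => { sanitize: 28.676,  selma: 121.500 },
  "small"  => { sanitize: 107.280, selma: 118.867 },
}.freeze

# Ratio of Selma's throughput to Sanitize's, rounded to one decimal place.
SPEEDUPS = RESULTS.transform_values do |ips|
  (ips[:selma] / ips[:sanitize]).round(1)
end

SPEEDUPS.each do |document, ratio|
  puts "#{document}: selma ran ~#{ratio}x as many iterations per second"
end
```

On these numbers the gap is largest for the huge document (about 17.9x) and nearly disappears for the small one (about 1.1x).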
161
210
 
162
211
  ## Contributing
163
212
 
data/ext/selma/extconf.rb CHANGED
@@ -3,4 +3,4 @@ require "rb_sys/mkmf"
3
3
 
4
4
  require_relative "_util"
5
5
 
6
- create_rust_makefile("selma")
6
+ create_rust_makefile("selma/selma")
Binary file
Binary file
@@ -4,8 +4,39 @@ module Selma
4
4
  class Sanitizer
5
5
  module Config
6
6
  BASIC = freeze_config(
7
- elements: ["a", "abbr", "blockquote", "b", "br", "cite", "code", "dd", "dfn", "dl", "dt", "em", "i", "kbd",
8
- "li", "mark", "ol", "p", "pre", "q", "s", "samp", "small", "strike", "strong", "sub", "sup", "time", "u", "ul", "var",],
7
+ elements: [
8
+ "a",
9
+ "abbr",
10
+ "blockquote",
11
+ "b",
12
+ "br",
13
+ "cite",
14
+ "code",
15
+ "dd",
16
+ "dfn",
17
+ "dl",
18
+ "dt",
19
+ "em",
20
+ "i",
21
+ "kbd",
22
+ "li",
23
+ "mark",
24
+ "ol",
25
+ "p",
26
+ "pre",
27
+ "q",
28
+ "s",
29
+ "samp",
30
+ "small",
31
+ "strike",
32
+ "strong",
33
+ "sub",
34
+ "sup",
35
+ "time",
36
+ "u",
37
+ "ul",
38
+ "var",
39
+ ],
9
40
 
10
41
  attributes: {
11
42
  "a" => ["href"],
@@ -33,13 +33,49 @@ module Selma
33
33
 
34
34
  # An Array of element names whose contents will be removed. The contents
35
35
  # of all other filtered elements will be left behind.
36
- remove_contents: ["iframe", "math", "noembed", "noframes", "noscript", "plaintext", "script", "style", "svg",
37
- "xmp",],
36
+ remove_contents: [
37
+ "iframe",
38
+ "math",
39
+ "noembed",
40
+ "noframes",
41
+ "noscript",
42
+ "plaintext",
43
+ "script",
44
+ "style",
45
+ "svg",
46
+ "xmp",
47
+ ],
38
48
 
39
49
  # Elements which, when removed, should have their contents surrounded by
40
50
  # whitespace.
41
- whitespace_elements: ["address", "article", "aside", "blockquote", "br", "dd", "div", "dl", "dt", "footer",
42
- "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li", "nav", "ol", "p", "pre", "section", "ul",],
51
+ whitespace_elements: [
52
+ "address",
53
+ "article",
54
+ "aside",
55
+ "blockquote",
56
+ "br",
57
+ "dd",
58
+ "div",
59
+ "dl",
60
+ "dt",
61
+ "footer",
62
+ "h1",
63
+ "h2",
64
+ "h3",
65
+ "h4",
66
+ "h5",
67
+ "h6",
68
+ "header",
69
+ "hgroup",
70
+ "hr",
71
+ "li",
72
+ "nav",
73
+ "ol",
74
+ "p",
75
+ "pre",
76
+ "section",
77
+ "ul",
78
+ ],
43
79
  )
44
80
  end
45
81
  end
@@ -4,12 +4,60 @@ module Selma
4
4
  class Sanitizer
5
5
  module Config
6
6
  RELAXED = freeze_config(
7
- elements: BASIC[:elements] + ["address", "article", "aside", "bdi", "bdo", "body", "caption", "col",
8
- "colgroup", "data", "del", "div", "figcaption", "figure", "footer", "h1", "h2", "h3", "h4", "h5", "h6", "head", "header", "hgroup", "hr", "html", "img", "ins", "main", "nav", "rp", "rt", "ruby", "section", "span", "style", "summary", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "title", "tr", "wbr",],
7
+ elements: BASIC[:elements] + [
8
+ "address",
9
+ "article",
10
+ "aside",
11
+ "bdi",
12
+ "bdo",
13
+ "body",
14
+ "caption",
15
+ "col",
16
+ "colgroup",
17
+ "data",
18
+ "del",
19
+ "div",
20
+ "figcaption",
21
+ "figure",
22
+ "footer",
23
+ "h1",
24
+ "h2",
25
+ "h3",
26
+ "h4",
27
+ "h5",
28
+ "h6",
29
+ "head",
30
+ "header",
31
+ "hgroup",
32
+ "hr",
33
+ "html",
34
+ "img",
35
+ "ins",
36
+ "main",
37
+ "nav",
38
+ "rp",
39
+ "rt",
40
+ "ruby",
41
+ "section",
42
+ "span",
43
+ "style",
44
+ "summary",
45
+ "sup",
46
+ "table",
47
+ "tbody",
48
+ "td",
49
+ "tfoot",
50
+ "th",
51
+ "thead",
52
+ "title",
53
+ "tr",
54
+ "wbr",
55
+ ],
9
56
 
10
57
  allow_doctype: true,
11
58
 
12
- attributes: merge(BASIC[:attributes],
59
+ attributes: merge(
60
+ BASIC[:attributes],
13
61
  :all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
14
62
  "a" => ["href", "hreflang", "name", "rel"],
15
63
  "col" => ["span", "width"],
@@ -21,16 +69,29 @@ module Selma
21
69
  "li" => ["value"],
22
70
  "ol" => ["reversed", "start", "type"],
23
71
  "style" => ["media", "scoped", "type"],
24
- "table" => ["align", "bgcolor", "border", "cellpadding", "cellspacing", "frame", "rules", "sortable",
25
- "summary", "width",],
72
+ "table" => [
73
+ "align",
74
+ "bgcolor",
75
+ "border",
76
+ "cellpadding",
77
+ "cellspacing",
78
+ "frame",
79
+ "rules",
80
+ "sortable",
81
+ "summary",
82
+ "width",
83
+ ],
26
84
  "td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
27
85
  "th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
28
- "ul" => ["type"]),
86
+ "ul" => ["type"],
87
+ ),
29
88
 
30
- protocols: merge(BASIC[:protocols],
89
+ protocols: merge(
90
+ BASIC[:protocols],
31
91
  "del" => { "cite" => ["http", "https", :relative] },
32
92
  "img" => { "src" => ["http", "https", :relative] },
33
- "ins" => { "cite" => ["http", "https", :relative] }),
93
+ "ins" => { "cite" => ["http", "https", :relative] },
94
+ ),
34
95
  )
35
96
  end
36
97
  end
data/lib/selma/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Selma
4
- VERSION = "0.0.5"
4
+ VERSION = "0.0.7"
5
5
  end
data/selma.gemspec CHANGED
@@ -24,7 +24,7 @@ Gem::Specification.new do |spec|
24
24
  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
25
25
 
26
26
  spec.require_paths = ["lib"]
27
- spec.extensions = ["ext/selma/Cargo.toml"]
27
+ spec.extensions = ["ext/selma/extconf.rb"]
28
28
 
29
29
  spec.metadata = {
30
30
  "allowed_push_host" => "https://rubygems.org",
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: selma
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.5
4
+ version: 0.0.7
5
5
  platform: arm64-darwin
6
6
  authors:
7
7
  - Garen J. Torikian
8
- autorequire:
8
+ autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2022-12-27 00:00:00.000000000 Z
11
+ date: 2023-01-09 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rb_sys
@@ -66,7 +66,7 @@ dependencies:
66
66
  - - "~>"
67
67
  - !ruby/object:Gem::Version
68
68
  version: '1.2'
69
- description:
69
+ description:
70
70
  email:
71
71
  - gjtorikian@gmail.com
72
72
  executables: []
@@ -91,6 +91,7 @@ files:
91
91
  - ext/selma/src/wrapped_struct.rs
92
92
  - lib/selma.rb
93
93
  - lib/selma/3.1/selma.bundle
94
+ - lib/selma/3.2/selma.bundle
94
95
  - lib/selma/extension.rb
95
96
  - lib/selma/html.rb
96
97
  - lib/selma/rewriter.rb
@@ -103,7 +104,7 @@ files:
103
104
  - lib/selma/selector.rb
104
105
  - lib/selma/version.rb
105
106
  - selma.gemspec
106
- homepage:
107
+ homepage:
107
108
  licenses:
108
109
  - MIT
109
110
  metadata:
@@ -111,7 +112,7 @@ metadata:
111
112
  funding_uri: https://github.com/sponsors/gjtorikian/
112
113
  source_code_uri: https://github.com/gjtorikian/selma
113
114
  rubygems_mfa_required: 'true'
114
- post_install_message:
115
+ post_install_message:
115
116
  rdoc_options: []
116
117
  require_paths:
117
118
  - lib
@@ -122,15 +123,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
122
123
  version: '3.1'
123
124
  - - "<"
124
125
  - !ruby/object:Gem::Version
125
- version: 3.2.dev
126
+ version: 3.3.dev
126
127
  required_rubygems_version: !ruby/object:Gem::Requirement
127
128
  requirements:
128
129
  - - ">="
129
130
  - !ruby/object:Gem::Version
130
131
  version: 3.3.22
131
132
  requirements: []
132
- rubygems_version: 3.3.22
133
- signing_key:
133
+ rubygems_version: 3.4.3
134
+ signing_key:
134
135
  specification_version: 4
135
136
  summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
136
137
  parser.
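
The change of the upper `required_ruby_version` bound from `3.2.dev` to `3.3.dev` in the metadata above can be checked with RubyGems' own `Gem::Requirement`. This is a small sketch: the `>= 3.1` lower bound is an assumption, since only the version `'3.1'` and the `<` operator are visible in the hunk, and `supported?` is a hypothetical helper.

```ruby
require "rubygems"

# ">= 3.1" is assumed; only the "< 3.3.dev" bound is fully visible in the diff.
REQUIREMENT = Gem::Requirement.new(">= 3.1", "< 3.3.dev")

# Hypothetical helper: does the given Ruby version satisfy the requirement?
def supported?(ruby_version)
  REQUIREMENT.satisfied_by?(Gem::Version.new(ruby_version))
end

["3.0.5", "3.1.3", "3.2.0", "3.3.0"].each do |version|
  puts "#{version}: #{supported?(version)}"
end
```

Under that assumption, Ruby 3.2 now falls inside the supported range, which matches the new `lib/selma/3.2/selma.bundle` entry in the file list.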