selma 0.0.5-arm64-darwin → 0.0.7-arm64-darwin

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 6a8ae1537672de0e7cca2dcecf7c19d470e9f481130fa9c3dad37eef2bc91507
4
- data.tar.gz: 4d897c25589773b5a5bcf987d7b420aa82dd69bcb4d832686f87ea14e40610a7
3
+ metadata.gz: 57a3cdbf3d988c3e6982a605a05dd52bf3793ec81d38c652ee67618130bebb98
4
+ data.tar.gz: db2172c9762317a5e2bd58ef99faeabfb014d09ed3f38b2530a9a25cc157e2dc
5
5
  SHA512:
6
- metadata.gz: 183446b3cf5e97ef5f61e96da4db1198a8468f7971d6ae61dd9f0130cd019bde29b7975c1a46ff822b055543ed3657597bd02dc0c0a9234e24cefaa93e3e39fb
7
- data.tar.gz: d105ad92ef51135b6d8fe675febe539eb41368bde7f08e10aba9f118d6c8950328d766c874202f759becb1ae75236ebf4e91c756ea1ba12c688941742801401d
6
+ metadata.gz: 4da3b4ff30074908591b86b3faba1f0461dee4aacd24bf4d54fe1aaaeeef0b0b1008d0c6a36e043e729ec072fa2edef7eda2548202cef9ed84c89b6ddd2b21b1
7
+ data.tar.gz: 6556dc9353d76e7e56ad103209de91a94c06b139b57e405ed9ffc2a18585840598766fa1a7d89aefe42c2902bc4b1d53584fca2e9559414bb545e721cdc0929d
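The `checksums.yaml` entries above are plain hex digests of the gem's inner archives, so they can be checked by hand. A minimal sketch, assuming you have already unpacked the `.gem` file and have `data.tar.gz` on disk; the helper name `sha256_matches?` is hypothetical, and the demonstration runs against a throwaway file rather than the real archive:

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA256 hex digest equals `expected_hex`.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demonstration on a temporary file standing in for data.tar.gz.
sample = Tempfile.new("data.tar.gz")
sample.write("example payload")
sample.flush

expected = Digest::SHA256.hexdigest("example payload")
CHECK_OK  = sha256_matches?(sample.path, expected)   # digest of the same bytes
CHECK_BAD = sha256_matches?(sample.path, "0" * 64)   # deliberately wrong digest
sample.close!

puts "matching digest accepted: #{CHECK_OK}"
puts "wrong digest accepted: #{CHECK_BAD}"
```

For a real check, pass the path of the extracted `data.tar.gz` and the `SHA256` value listed for the version you installed.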
data/README.md CHANGED
@@ -24,23 +24,29 @@ Or install it yourself as:
24
24
 
25
25
  ## Usage
26
26
 
27
- Selma can perform two different actions:
27
+ Selma can perform two different actions, either independently or together:
28
28
 
29
29
  - Sanitize HTML, through a [Sanitize](https://github.com/rgrove/sanitize)-like allowlist syntax; and
30
- - Select HTML using CSS rules, and manipulate elements and text
30
+ - Select HTML using CSS rules, and manipulate elements and text nodes along the way.
31
31
 
32
- The basic API for Selma looks like this:
32
+ It does this through two kwargs: `sanitizer` and `handlers`. The basic API for Selma looks like this:
33
33
 
34
34
  ```ruby
35
- rewriter = Selma::Rewriter.new(sanitizer: sanitizer_config, handlers: [MatchAttribute.new, TextRewrite.new])
36
- rewriter(html)
35
+ sanitizer_config = {
36
+ elements: ["b", "em", "i", "strong", "u"],
37
+ }
38
+ sanitizer = Selma::Sanitizer.new(sanitizer_config)
39
+ rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
40
+ # removes any element that is not ["b", "em", "i", "strong", "u"];
41
+ # then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
42
+ rewriter.rewrite(html)
37
43
  ```
38
44
 
39
- Let's take a look at each part individually.
45
+ Here's a look at each individual part.
40
46
 
41
47
  ### Sanitization config
42
48
 
43
- Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you want to disable HTML sanitization (for some reason), pass `nil`:
49
+ Selma sanitizes by default. That is, even if the `sanitizer` kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass `nil`:
44
50
 
45
51
  ```ruby
46
52
  Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
@@ -87,22 +93,22 @@ whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
87
93
 
88
94
  ### Defining handlers
89
95
 
90
- The real power in Selma comes in its use of handlers. A handler is simply an object with various methods:
96
+ The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
91
97
 
92
98
  - `selector`, a method which MUST return an instance of `Selma::Selector` which defines the CSS classes to match
93
99
  - `handle_element`, a method that's called on each matched element
94
- - `handle_text_chunk`, a method that's called on each matched text node; this MUST return a string
100
+ - `handle_text_chunk`, a method that's called on each matched text node
95
101
 
96
102
  Here's an example which rewrites the `href` attribute on `a` and the `src` attribute on `img` to be `https` rather than `http`.
97
103
 
98
104
  ```ruby
99
105
  class MatchAttribute
100
- SELECTOR = Selma::Selector(match_element: "a, img")
106
+ SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))
101
107
 
102
108
  def handle_element(element)
103
- if element.tag_name == "a" && element["href"] =~ /^http:/
109
+ if element.tag_name == "a"
104
110
  element["href"] = rename_http(element["href"])
105
- elsif element.tag_name == "img" && element["src"] =~ /^http:/
111
+ elsif element.tag_name == "img"
106
112
  element["src"] = rename_http(element["src"])
107
113
  end
108
114
  end
@@ -118,10 +124,10 @@ rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
118
124
  The `Selma::Selector` object has three possible kwargs:
119
125
 
120
126
  - `match_element`: any element which matches this CSS rule will be passed on to `handle_element`
121
- - `match_text_within`: any element which matches this CSS rule will be passed on to `handle_text_chunk`
127
+ - `match_text_within`: the text inside any element which matches this CSS rule will be passed on to `handle_text_chunk`
122
128
  - `ignore_text_within`: this is an array of element names whose text contents will be ignored
123
129
 
124
- You've seen an example of `match_element`; here's one for `match_text` which changes strings in various elements which are _not_ `pre` or `code`:
130
+ Here's an example for `handle_text_chunk` which changes strings in various elements which are _not_ `pre` or `code`:
125
131
 
126
132
  ```ruby
127
133
 
@@ -144,20 +150,63 @@ rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
144
150
 
145
151
  The `element` argument in `handle_element` has the following methods:
146
152
 
147
- - `tag_name`: The element's name
148
- - `[]`: get an attribute
149
- - `[]=`: set an attribute
150
- - `remove_attribute`: remove an attribute
151
- - `attributes`: list all the attributes
152
- - `ancestors`: list all the ancestors
153
- - `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
153
+ - `tag_name`: Gets the element's name
154
+ - `tag_name=`: Sets the element's name
155
+ - `self_closing?`: A bool which identifies whether or not the element is self-closing
156
+ - `[]`: Get an attribute
157
+ - `[]=`: Set an attribute
158
+ - `remove_attribute`: Remove an attribute
159
+ - `has_attribute?`: A bool which identifies whether or not the element has an attribute
160
+ - `attributes`: List all the attributes
161
+ - `ancestors`: List all of an element's ancestors as an array of strings
154
162
  - `before(content, as: content_type)`: Inserts `content` before the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
155
163
  - `after(content, as: content_type)`: Inserts `content` after the element. `content_type` is either `:text` or `:html` and determines how the content will be applied.
156
- - `set_inner_content`: replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
164
+ - `prepend(content, as: content_type)`: prepends `content` to the element's inner content, i.e. inserts content right after the element's start tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
165
+ - `append(content, as: content_type)`: appends `content` to the element's inner content, i.e. inserts content right before the element's end tag. `content_type` is either `:text` or `:html` and determines how the content will be applied.
166
+ - `set_inner_content(content, as: content_type)`: Replaces inner content of the element with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
167
+
168
+ #### `text_chunk` methods
169
+
170
+ - `to_s` / `.content`: Gets the text node's content
171
+ - `text_type`: identifies the type of text in the text node
172
+ - `before(content, as: content_type)`: Inserts `content` before the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
173
+ - `after(content, as: content_type)`: Inserts `content` after the text. `content_type` is either `:text` or `:html` and determines how the content will be applied.
174
+ - `replace(content, as: content_type)`: Replaces the text node with `content`. `content_type` is either `:text` or `:html` and determines how the content will be applied.
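The `text_chunk` methods listed above suggest that a text handler now edits the chunk in place rather than returning a string. The following is only a sketch of that shape: `FakeTextChunk` and `ShoutHandler` are hypothetical names, and the fake chunk imitates just `to_s` and `replace` so the example runs without the gem installed. A real handler would also define `selector` returning a `Selma::Selector`.

```ruby
# Hypothetical stand-in for the text chunk Selma passes to a handler.
# It only mimics the `to_s` and `replace` methods from the list above.
class FakeTextChunk
  attr_reader :replacement, :replacement_type

  def initialize(content)
    @content = content
  end

  def to_s
    @content
  end

  # Records what the handler asked for instead of rewriting real HTML.
  def replace(content, as: :text)
    @replacement = content
    @replacement_type = as
  end
end

# A handler is a plain object; `selector` is omitted here (assumption:
# a real handler needs it) so the sketch stays self-contained.
class ShoutHandler
  def handle_text_chunk(text_chunk)
    text_chunk.replace(text_chunk.to_s.upcase, as: :text)
  end
end

chunk = FakeTextChunk.new("hello")
ShoutHandler.new.handle_text_chunk(chunk)
puts chunk.replacement
```

Passing `as: :html` instead would ask for the replacement to be treated as markup rather than escaped text, per the descriptions above.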
157
175
 
158
176
  ## Benchmarks
159
177
 
160
- TBD
178
+ <details>
179
+ <pre>
180
+ ruby test/benchmark.rb
181
+ ruby test/benchmark.rb
182
+ Warming up --------------------------------------
183
+ sanitize-document-huge
184
+ 1.000 i/100ms
185
+ selma-document-huge 1.000 i/100ms
186
+ Calculating -------------------------------------
187
+ sanitize-document-huge
188
+ 0.257 (± 0.0%) i/s - 2.000 in 7.783398s
189
+ selma-document-huge 4.602 (± 0.0%) i/s - 23.000 in 5.002870s
190
+ Warming up --------------------------------------
191
+ sanitize-document-medium
192
+ 2.000 i/100ms
193
+ selma-document-medium
194
+ 22.000 i/100ms
195
+ Calculating -------------------------------------
196
+ sanitize-document-medium
197
+ 28.676 (± 3.5%) i/s - 144.000 in 5.024669s
198
+ selma-document-medium
199
+ 121.500 (±22.2%) i/s - 594.000 in 5.135410s
200
+ Warming up --------------------------------------
201
+ sanitize-document-small
202
+ 10.000 i/100ms
203
+ selma-document-small 20.000 i/100ms
204
+ Calculating -------------------------------------
205
+ sanitize-document-small
206
+ 107.280 (± 0.9%) i/s - 540.000 in 5.033850s
207
+ selma-document-small 118.867 (±31.1%) i/s - 540.000 in 5.080726s
208
+ </pre>
209
+ </details>
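To make the benchmark output easier to read, the iterations-per-second figures can be turned into rough speedup ratios. This sketch only copies the numbers printed above; the constant names are illustrative, and given the reported error margins (up to ±31.1% on the small document) the ratios should be read as approximate.

```ruby
# Iterations per second, copied from the benchmark output above.
RESULTS = {
  "huge"   => { sanitize: 0.257,   selma: 4.602 },
  "medium" => { sanitize: 28.676,  selma: 121.500 },
  "small"  => { sanitize: 107.280, selma: 118.867 },
}.freeze

# Ratio of selma throughput to sanitize throughput, rounded to one decimal.
SPEEDUPS = RESULTS.transform_values { |r| (r[:selma] / r[:sanitize]).round(1) }

SPEEDUPS.each do |document, ratio|
  puts "document-#{document}: selma ran about #{ratio}x as many iterations per second"
end
```

On these figures the gap is widest on the huge document (roughly 18x) and nearly disappears on the small one (roughly 1.1x).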
161
210
 
162
211
  ## Contributing
163
212
 
data/ext/selma/extconf.rb CHANGED
@@ -3,4 +3,4 @@ require "rb_sys/mkmf"
3
3
 
4
4
  require_relative "_util"
5
5
 
6
- create_rust_makefile("selma")
6
+ create_rust_makefile("selma/selma")
Binary file
Binary file
@@ -4,8 +4,39 @@ module Selma
4
4
  class Sanitizer
5
5
  module Config
6
6
  BASIC = freeze_config(
7
- elements: ["a", "abbr", "blockquote", "b", "br", "cite", "code", "dd", "dfn", "dl", "dt", "em", "i", "kbd",
8
- "li", "mark", "ol", "p", "pre", "q", "s", "samp", "small", "strike", "strong", "sub", "sup", "time", "u", "ul", "var",],
7
+ elements: [
8
+ "a",
9
+ "abbr",
10
+ "blockquote",
11
+ "b",
12
+ "br",
13
+ "cite",
14
+ "code",
15
+ "dd",
16
+ "dfn",
17
+ "dl",
18
+ "dt",
19
+ "em",
20
+ "i",
21
+ "kbd",
22
+ "li",
23
+ "mark",
24
+ "ol",
25
+ "p",
26
+ "pre",
27
+ "q",
28
+ "s",
29
+ "samp",
30
+ "small",
31
+ "strike",
32
+ "strong",
33
+ "sub",
34
+ "sup",
35
+ "time",
36
+ "u",
37
+ "ul",
38
+ "var",
39
+ ],
9
40
 
10
41
  attributes: {
11
42
  "a" => ["href"],
@@ -33,13 +33,49 @@ module Selma
33
33
 
34
34
  # An Array of element names whose contents will be removed. The contents
35
35
  # of all other filtered elements will be left behind.
36
- remove_contents: ["iframe", "math", "noembed", "noframes", "noscript", "plaintext", "script", "style", "svg",
37
- "xmp",],
36
+ remove_contents: [
37
+ "iframe",
38
+ "math",
39
+ "noembed",
40
+ "noframes",
41
+ "noscript",
42
+ "plaintext",
43
+ "script",
44
+ "style",
45
+ "svg",
46
+ "xmp",
47
+ ],
38
48
 
39
49
  # Elements which, when removed, should have their contents surrounded by
40
50
  # whitespace.
41
- whitespace_elements: ["address", "article", "aside", "blockquote", "br", "dd", "div", "dl", "dt", "footer",
42
- "h1", "h2", "h3", "h4", "h5", "h6", "header", "hgroup", "hr", "li", "nav", "ol", "p", "pre", "section", "ul",],
51
+ whitespace_elements: [
52
+ "address",
53
+ "article",
54
+ "aside",
55
+ "blockquote",
56
+ "br",
57
+ "dd",
58
+ "div",
59
+ "dl",
60
+ "dt",
61
+ "footer",
62
+ "h1",
63
+ "h2",
64
+ "h3",
65
+ "h4",
66
+ "h5",
67
+ "h6",
68
+ "header",
69
+ "hgroup",
70
+ "hr",
71
+ "li",
72
+ "nav",
73
+ "ol",
74
+ "p",
75
+ "pre",
76
+ "section",
77
+ "ul",
78
+ ],
43
79
  )
44
80
  end
45
81
  end
@@ -4,12 +4,60 @@ module Selma
4
4
  class Sanitizer
5
5
  module Config
6
6
  RELAXED = freeze_config(
7
- elements: BASIC[:elements] + ["address", "article", "aside", "bdi", "bdo", "body", "caption", "col",
8
- "colgroup", "data", "del", "div", "figcaption", "figure", "footer", "h1", "h2", "h3", "h4", "h5", "h6", "head", "header", "hgroup", "hr", "html", "img", "ins", "main", "nav", "rp", "rt", "ruby", "section", "span", "style", "summary", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "title", "tr", "wbr",],
7
+ elements: BASIC[:elements] + [
8
+ "address",
9
+ "article",
10
+ "aside",
11
+ "bdi",
12
+ "bdo",
13
+ "body",
14
+ "caption",
15
+ "col",
16
+ "colgroup",
17
+ "data",
18
+ "del",
19
+ "div",
20
+ "figcaption",
21
+ "figure",
22
+ "footer",
23
+ "h1",
24
+ "h2",
25
+ "h3",
26
+ "h4",
27
+ "h5",
28
+ "h6",
29
+ "head",
30
+ "header",
31
+ "hgroup",
32
+ "hr",
33
+ "html",
34
+ "img",
35
+ "ins",
36
+ "main",
37
+ "nav",
38
+ "rp",
39
+ "rt",
40
+ "ruby",
41
+ "section",
42
+ "span",
43
+ "style",
44
+ "summary",
45
+ "sup",
46
+ "table",
47
+ "tbody",
48
+ "td",
49
+ "tfoot",
50
+ "th",
51
+ "thead",
52
+ "title",
53
+ "tr",
54
+ "wbr",
55
+ ],
9
56
 
10
57
  allow_doctype: true,
11
58
 
12
- attributes: merge(BASIC[:attributes],
59
+ attributes: merge(
60
+ BASIC[:attributes],
13
61
  :all => ["class", "dir", "hidden", "id", "lang", "style", "tabindex", "title", "translate"],
14
62
  "a" => ["href", "hreflang", "name", "rel"],
15
63
  "col" => ["span", "width"],
@@ -21,16 +69,29 @@ module Selma
21
69
  "li" => ["value"],
22
70
  "ol" => ["reversed", "start", "type"],
23
71
  "style" => ["media", "scoped", "type"],
24
- "table" => ["align", "bgcolor", "border", "cellpadding", "cellspacing", "frame", "rules", "sortable",
25
- "summary", "width",],
72
+ "table" => [
73
+ "align",
74
+ "bgcolor",
75
+ "border",
76
+ "cellpadding",
77
+ "cellspacing",
78
+ "frame",
79
+ "rules",
80
+ "sortable",
81
+ "summary",
82
+ "width",
83
+ ],
26
84
  "td" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "valign", "width"],
27
85
  "th" => ["abbr", "align", "axis", "colspan", "headers", "rowspan", "scope", "sorted", "valign", "width"],
28
- "ul" => ["type"]),
86
+ "ul" => ["type"],
87
+ ),
29
88
 
30
- protocols: merge(BASIC[:protocols],
89
+ protocols: merge(
90
+ BASIC[:protocols],
31
91
  "del" => { "cite" => ["http", "https", :relative] },
32
92
  "img" => { "src" => ["http", "https", :relative] },
33
- "ins" => { "cite" => ["http", "https", :relative] }),
93
+ "ins" => { "cite" => ["http", "https", :relative] },
94
+ ),
34
95
  )
35
96
  end
36
97
  end
data/lib/selma/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Selma
4
- VERSION = "0.0.5"
4
+ VERSION = "0.0.7"
5
5
  end
data/selma.gemspec CHANGED
@@ -24,7 +24,7 @@ Gem::Specification.new do |spec|
24
24
  spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
25
25
 
26
26
  spec.require_paths = ["lib"]
27
- spec.extensions = ["ext/selma/Cargo.toml"]
27
+ spec.extensions = ["ext/selma/extconf.rb"]
28
28
 
29
29
  spec.metadata = {
30
30
  "allowed_push_host" => "https://rubygems.org",
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: selma
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.5
4
+ version: 0.0.7
5
5
  platform: arm64-darwin
6
6
  authors:
7
7
  - Garen J. Torikian
8
- autorequire:
8
+ autorequire:
9
9
  bindir: exe
10
10
  cert_chain: []
11
- date: 2022-12-27 00:00:00.000000000 Z
11
+ date: 2023-01-09 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: rb_sys
@@ -66,7 +66,7 @@ dependencies:
66
66
  - - "~>"
67
67
  - !ruby/object:Gem::Version
68
68
  version: '1.2'
69
- description:
69
+ description:
70
70
  email:
71
71
  - gjtorikian@gmail.com
72
72
  executables: []
@@ -91,6 +91,7 @@ files:
91
91
  - ext/selma/src/wrapped_struct.rs
92
92
  - lib/selma.rb
93
93
  - lib/selma/3.1/selma.bundle
94
+ - lib/selma/3.2/selma.bundle
94
95
  - lib/selma/extension.rb
95
96
  - lib/selma/html.rb
96
97
  - lib/selma/rewriter.rb
@@ -103,7 +104,7 @@ files:
103
104
  - lib/selma/selector.rb
104
105
  - lib/selma/version.rb
105
106
  - selma.gemspec
106
- homepage:
107
+ homepage:
107
108
  licenses:
108
109
  - MIT
109
110
  metadata:
@@ -111,7 +112,7 @@ metadata:
111
112
  funding_uri: https://github.com/sponsors/gjtorikian/
112
113
  source_code_uri: https://github.com/gjtorikian/selma
113
114
  rubygems_mfa_required: 'true'
114
- post_install_message:
115
+ post_install_message:
115
116
  rdoc_options: []
116
117
  require_paths:
117
118
  - lib
@@ -122,15 +123,15 @@ required_ruby_version: !ruby/object:Gem::Requirement
122
123
  version: '3.1'
123
124
  - - "<"
124
125
  - !ruby/object:Gem::Version
125
- version: 3.2.dev
126
+ version: 3.3.dev
126
127
  required_rubygems_version: !ruby/object:Gem::Requirement
127
128
  requirements:
128
129
  - - ">="
129
130
  - !ruby/object:Gem::Version
130
131
  version: 3.3.22
131
132
  requirements: []
132
- rubygems_version: 3.3.22
133
- signing_key:
133
+ rubygems_version: 3.4.3
134
+ signing_key:
134
135
  specification_version: 4
135
136
  summary: Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html
136
137
  parser.
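
The `required_ruby_version` change in the metadata above widens the upper bound from `3.2.dev` to `3.3.dev`, which matches the newly added `lib/selma/3.2/selma.bundle`. A small sketch of what that means in practice, using RubyGems' own `Gem::Requirement`; the `>=` operator on the `3.1` lower bound is an assumption, since that line falls outside the hunk, and the sample versions are arbitrary.

```ruby
require "rubygems"

# Lower bound operator (">=") is assumed; only the "<" bound is shown in the diff.
OLD_RANGE = Gem::Requirement.new(">= 3.1", "< 3.2.dev") # 0.0.5
NEW_RANGE = Gem::Requirement.new(">= 3.1", "< 3.3.dev") # 0.0.7

%w[3.0.5 3.1.3 3.2.0 3.3.0].each do |v|
  version = Gem::Version.new(v)
  puts "Ruby #{v}: 0.0.5 #{OLD_RANGE.satisfied_by?(version)}, 0.0.7 #{NEW_RANGE.satisfied_by?(version)}"
end
```

Under these assumptions, Ruby 3.2.0 is rejected by the 0.0.5 range and accepted by the 0.0.7 range, while 3.0.x and 3.3.0 fall outside both.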