loofah 2.2.3 → 2.21.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +269 -31
- data/README.md +109 -124
- data/lib/loofah/concerns.rb +207 -0
- data/lib/loofah/elements.rb +85 -79
- data/lib/loofah/helpers.rb +37 -20
- data/lib/loofah/{html → html4}/document.rb +6 -7
- data/lib/loofah/html4/document_fragment.rb +15 -0
- data/lib/loofah/html5/document.rb +17 -0
- data/lib/loofah/html5/document_fragment.rb +15 -0
- data/lib/loofah/html5/libxml2_workarounds.rb +10 -8
- data/lib/loofah/html5/safelist.rb +1055 -0
- data/lib/loofah/html5/scrub.rb +153 -58
- data/lib/loofah/metahelpers.rb +11 -6
- data/lib/loofah/scrubber.rb +22 -15
- data/lib/loofah/scrubbers.rb +66 -55
- data/lib/loofah/version.rb +6 -0
- data/lib/loofah/xml/document.rb +2 -0
- data/lib/loofah/xml/document_fragment.rb +4 -7
- data/lib/loofah.rb +131 -38
- metadata +28 -216
- data/.gemtest +0 -0
- data/Gemfile +0 -22
- data/Manifest.txt +0 -40
- data/Rakefile +0 -79
- data/benchmark/benchmark.rb +0 -149
- data/benchmark/fragment.html +0 -96
- data/benchmark/helper.rb +0 -73
- data/benchmark/www.slashdot.com.html +0 -2560
- data/lib/loofah/html/document_fragment.rb +0 -40
- data/lib/loofah/html5/whitelist.rb +0 -186
- data/lib/loofah/instance_methods.rb +0 -127
- data/test/assets/msword.html +0 -63
- data/test/assets/testdata_sanitizer_tests1.dat +0 -502
- data/test/helper.rb +0 -18
- data/test/html5/test_sanitizer.rb +0 -382
- data/test/integration/test_ad_hoc.rb +0 -204
- data/test/integration/test_helpers.rb +0 -43
- data/test/integration/test_html.rb +0 -72
- data/test/integration/test_scrubbers.rb +0 -400
- data/test/integration/test_xml.rb +0 -55
- data/test/unit/test_api.rb +0 -142
- data/test/unit/test_encoding.rb +0 -20
- data/test/unit/test_helpers.rb +0 -62
- data/test/unit/test_scrubber.rb +0 -229
- data/test/unit/test_scrubbers.rb +0 -14
data/README.md
CHANGED
@@ -1,81 +1,74 @@
|
|
1
1
|
# Loofah
|
2
2
|
|
3
3
|
* https://github.com/flavorjones/loofah
|
4
|
-
* Docs: http://rubydoc.info/github/flavorjones/loofah/
|
4
|
+
* Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
|
5
5
|
* Mailing list: [loofah-talk@googlegroups.com](https://groups.google.com/forum/#!forum/loofah-talk)
|
6
6
|
|
7
7
|
## Status
|
8
8
|
|
9
|
-
|
10
|
-
|
11
|
-
| Concourse | [![Concourse CI](https://ci.nokogiri.org/api/v1/teams/nokogiri-core/pipelines/loofah/jobs/ruby-2.5/badge)](https://ci.nokogiri.org/teams/nokogiri-core/pipelines/loofah?groups=master) |
|
12
|
-
| Code Climate | [![Code Climate](https://codeclimate.com/github/flavorjones/loofah.svg)](https://codeclimate.com/github/flavorjones/loofah) |
|
13
|
-
| Version Eye | [![Version Eye](https://www.versioneye.com/ruby/loofah/badge.png)](https://www.versioneye.com/ruby/loofah) |
|
9
|
+
[![ci](https://github.com/flavorjones/loofah/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/flavorjones/loofah/actions/workflows/ci.yml)
|
10
|
+
[![Tidelift dependencies](https://tidelift.com/badges/package/rubygems/loofah)](https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=rubygems-loofah&utm_medium=referral&utm_campaign=readme)
|
14
11
|
|
15
12
|
|
16
13
|
## Description
|
17
14
|
|
18
|
-
Loofah is a general library for manipulating and transforming HTML/XML
|
19
|
-
documents and fragments. It's built on top of Nokogiri and libxml2, so
|
20
|
-
it's fast and has a nice API.
|
15
|
+
Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.
|
21
16
|
|
22
|
-
Loofah
|
23
|
-
nice HTML sanitizers, which are based on HTML5lib's whitelist, so it
|
24
|
-
most likely won't make your codes less secure. (These statements have
|
25
|
-
not been evaluated by Netexperts.)
|
17
|
+
Loofah also includes some HTML sanitizers based on `html5lib`'s safelist, which are a specific application of the general transformation functionality.
|
26
18
|
|
27
|
-
|
28
|
-
[`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).
|
19
|
+
Active Record extensions for HTML sanitization are available in the [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).
|
29
20
|
|
30
21
|
|
31
22
|
## Features
|
32
23
|
|
33
|
-
* Easily write custom
|
34
|
-
* Common HTML sanitizing
|
24
|
+
* Easily write custom transformations for HTML and XML
|
25
|
+
* Common HTML sanitizing transformations are built-in:
|
35
26
|
* _Strip_ unsafe tags, leaving behind only the inner text.
|
36
27
|
* _Prune_ unsafe tags and their subtrees, removing all traces that they ever existed.
|
37
28
|
* _Escape_ unsafe tags and their subtrees, leaving behind lots of <tt>&lt;</tt> and <tt>&gt;</tt> entities.
|
38
29
|
* _Whitewash_ the markup, removing all attributes and namespaced nodes.
|
39
|
-
*
|
30
|
+
* Other common HTML transformations are built-in:
|
40
31
|
* Add the _nofollow_ attribute to all hyperlinks.
|
41
|
-
*
|
32
|
+
* Remove _unprintable_ characters from text nodes.
|
33
|
+
* Format markup as plain text, with (or without) sensible whitespace handling around block elements.
|
42
34
|
* Replace Rails's `strip_tags` and `sanitize` view helper methods.
|
43
35
|
|
44
36
|
|
45
37
|
## Compare and Contrast
|
46
38
|
|
47
|
-
Loofah is
|
48
|
-
guarantees well-formed and valid markup (the other is Sanitize, which
|
49
|
-
also uses Nokogiri).
|
39
|
+
Loofah is both:
|
50
40
|
|
51
|
-
|
41
|
+
- a general framework for transforming XML, XHTML, and HTML documents
|
42
|
+
- a specific toolkit for HTML sanitization
|
52
43
|
|
53
|
-
|
54
|
-
commonly-used libraries (ActionView, Sanitize, HTML5lib and HTMLfilter):
|
44
|
+
### General document transformation
|
55
45
|
|
56
|
-
|
46
|
+
Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.
|
57
47
|
|
58
|
-
|
59
|
-
|
60
|
-
|
48
|
+
|
49
|
+
### HTML sanitization
|
50
|
+
|
51
|
+
Another Ruby library that provides HTML sanitization is [`rgrove/sanitize`](https://github.com/rgrove/sanitize), which is also built on top of Nokogiri and provides a bit more flexibility on the tags and attributes being scrubbed.
|
52
|
+
|
53
|
+
You may also want to look at [`rails/rails-html-sanitizer`](https://github.com/rails/rails-html-sanitizer) which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.
|
61
54
|
|
62
55
|
|
63
56
|
## The Basics
|
64
57
|
|
65
|
-
Loofah wraps [Nokogiri](http://nokogiri.org) in a loving
|
66
|
-
embrace. Nokogiri is an excellent HTML/XML parser. If you don't know
|
67
|
-
how Nokogiri works, you might want to pause for a moment and go check
|
68
|
-
it out. I'll wait.
|
58
|
+
Loofah wraps [Nokogiri](http://nokogiri.org) in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.
|
69
59
|
|
70
|
-
Loofah
|
60
|
+
Loofah implements the following classes:
|
71
61
|
|
72
|
-
* `Loofah::
|
73
|
-
* `Loofah::
|
74
|
-
* `Loofah::
|
62
|
+
* `Loofah::HTML5::Document`
|
63
|
+
* `Loofah::HTML5::DocumentFragment`
|
64
|
+
* `Loofah::HTML4::Document` (aliased as `Loofah::HTML::Document` for now)
|
65
|
+
* `Loofah::HTML4::DocumentFragment` (aliased as `Loofah::HTML::DocumentFragment` for now)
|
66
|
+
* `Loofah::XML::Document`
|
67
|
+
* `Loofah::XML::DocumentFragment`
|
75
68
|
|
76
|
-
|
69
|
+
These document and fragment classes are subclasses of the similarly-named Nokogiri classes `Nokogiri::HTML5::Document` et al.
|
77
70
|
|
78
|
-
|
71
|
+
Loofah also implements `Loofah::Scrubber`, which represents the document transformation, either by wrapping
|
79
72
|
a block,
|
80
73
|
|
81
74
|
``` ruby
|
@@ -89,50 +82,49 @@ or by implementing a method.
|
|
89
82
|
|
90
83
|
### Side Note: Fragments vs Documents
|
91
84
|
|
92
|
-
Generally speaking, unless you expect to have a DOCTYPE and a single
|
93
|
-
|
94
|
-
HTML
|
95
|
-
|
85
|
+
Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a *document*, you have a *fragment*. For HTML, another rule of thumb is that *documents* have `html` and `body` tags, and *fragments* usually do not.
|
86
|
+
|
87
|
+
**HTML fragments** should be parsed with `Loofah.html5_fragment` or `Loofah.html4_fragment`. The result won't be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration, `head` elements will be silently ignored, and multiple root nodes are allowed.
|
88
|
+
|
89
|
+
**HTML documents** should be parsed with `Loofah.html5_document` or `Loofah.html4_document`. The result will have a DOCTYPE declaration, along with `html`, `head` and `body` tags.
|
90
|
+
|
91
|
+
**XML fragments** should be parsed with `Loofah.xml_fragment`. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.
|
96
92
|
|
97
|
-
|
98
|
-
be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration,
|
99
|
-
`head` elements will be silently ignored, and multiple root nodes are
|
100
|
-
allowed.
|
93
|
+
**XML documents** should be parsed with `Loofah.xml_document`. The result will have a DOCTYPE declaration and a single root node.
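As a rough illustration of the HTML rule of thumb above, the following sketch guesses whether a string of HTML is a document or a fragment by looking for a leading DOCTYPE or `html` tag. The helper name `looks_like_document?` is hypothetical and is not part of Loofah; it is plain Ruby with no dependencies, and real-world input may need a more careful check.

```ruby
# Hypothetical helper, not part of Loofah: guess whether markup is a full
# HTML document (leading DOCTYPE or <html> tag) or a fragment.
def looks_like_document?(markup)
  markup.match?(/\A\s*(?:<!DOCTYPE\b|<html[\s>])/i)
end

looks_like_document?("<!DOCTYPE html><html><body>hi</body></html>") # treated as a document
looks_like_document?("<p>one</p><p>two</p>")                        # treated as a fragment
```

A caller could use the result to choose between `Loofah.html5_document` and `Loofah.html5_fragment`.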
|
101
94
|
|
102
|
-
XML fragments should be parsed with Loofah.xml_fragment. The result
|
103
|
-
won't have a DOCTYPE declaration, and multiple root nodes are allowed.
|
104
95
|
|
105
|
-
|
106
|
-
have a DOCTYPE declaration, along with `html`, `head` and `body` tags.
|
96
|
+
### Side Note: HTML4 vs HTML5
|
107
97
|
|
108
|
-
|
109
|
-
will have a DOCTYPE declaration and a single root node.
|
98
|
+
⚠ _HTML5 functionality is not available on JRuby, or with versions of Nokogiri `< 1.14.0`._
|
110
99
|
|
100
|
+
Currently, Loofah's methods `Loofah.document` and `Loofah.fragment` are aliases to `.html4_document` and `.html4_fragment`, which use Nokogiri's HTML4 parser. (Similarly, `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment` are aliased to `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`.)
|
111
101
|
|
112
|
-
|
102
|
+
**Please note** that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].
|
113
103
|
|
114
|
-
|
115
|
-
Nokogiri::HTML::DocumentFragment, so you get all the markup
|
116
|
-
fixer-uppery and API goodness of Nokogiri.
|
104
|
+
**We strongly recommend that you explicitly use `.html5_document` or `.html5_fragment`** unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call `.html4_document` or `.html4_fragment` to avoid breakage in a future version.
|
117
105
|
|
118
|
-
|
119
|
-
|
106
|
+
[1]: [[feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/2227)
|
107
|
+
|
108
|
+
|
109
|
+
### `Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`
|
110
|
+
|
111
|
+
These classes are subclasses of `Nokogiri::HTML5::Document` and `Nokogiri::HTML5::DocumentFragment`.
|
112
|
+
|
113
|
+
The module methods `Loofah.html5_document` and `Loofah.html5_fragment` will parse an HTML document and an HTML fragment, respectively.
|
120
114
|
|
121
115
|
``` ruby
|
122
|
-
Loofah.
|
123
|
-
Loofah.
|
116
|
+
Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document) # => true
|
117
|
+
Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true
|
124
118
|
```
|
125
119
|
|
126
|
-
Loofah injects a `scrub!` method, which takes either a symbol (for
|
127
|
-
built-in scrubbers) or a Loofah::Scrubber object (for custom
|
128
|
-
scrubbers), and modifies the document in-place.
|
120
|
+
Loofah injects a `scrub!` method, which takes either a symbol (for built-in scrubbers) or a `Loofah::Scrubber` object (for custom scrubbers), and modifies the document in-place.
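One way to picture how a single `scrub!` argument can be either a symbol or a scrubber object is a lookup table keyed by symbol, with a type check as a fallback. The sketch below is plain Ruby under that assumption; `SketchScrubber`, `PruneSketch`, `BUILT_INS`, and `resolve` are illustrative names, not Loofah's API.

```ruby
# Hypothetical stand-ins for a scrubber base class and one built-in scrubber.
class SketchScrubber; end
class PruneSketch < SketchScrubber; end

# Symbols map to scrubber classes; anything else must already be a scrubber.
BUILT_INS = { prune: PruneSketch }.freeze

def resolve(scrubber)
  # A known symbol is swapped for a fresh instance of its class.
  scrubber = BUILT_INS[scrubber].new if BUILT_INS[scrubber]
  unless scrubber.is_a?(SketchScrubber)
    raise ArgumentError, "not a scrubber or a scrubber name: #{scrubber.inspect}"
  end

  scrubber
end

resolve(:prune)             # a PruneSketch instance
resolve(SketchScrubber.new) # the same object, passed through unchanged
```

Resolving the argument once, up front, means the traversal code only ever has to deal with scrubber objects.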
|
129
121
|
|
130
122
|
Loofah overrides `to_s` to return HTML:
|
131
123
|
|
132
124
|
``` ruby
|
133
125
|
unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
|
134
126
|
|
135
|
-
doc = Loofah.
|
127
|
+
doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
|
136
128
|
doc.to_s # => "ohai! <div>div is safe</div> "
|
137
129
|
```
|
138
130
|
|
@@ -142,36 +134,41 @@ and `text` to return plain text:
|
|
142
134
|
doc.text # => "ohai! div is safe "
|
143
135
|
```
|
144
136
|
|
145
|
-
Also, `to_text` is available, which does the right thing with
|
146
|
-
|
137
|
+
Also, `to_text` is available, which does the right thing with whitespace around block-level and line break elements.
|
138
|
+
|
139
|
+
``` ruby
|
140
|
+
doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
|
141
|
+
doc.text # => "TitleContentNext line" # probably not what you want
|
142
|
+
doc.to_text # => "\nTitle\n\nContent\nNext line\n" # better
|
143
|
+
```
|
144
|
+
|
145
|
+
### `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`
|
146
|
+
|
147
|
+
These classes are subclasses of `Nokogiri::HTML4::Document` and `Nokogiri::HTML4::DocumentFragment`.
|
148
|
+
|
149
|
+
The module methods `Loofah.html4_document` and `Loofah.html4_fragment` will parse an HTML document and an HTML fragment, respectively.
|
147
150
|
|
148
151
|
``` ruby
|
149
|
-
|
150
|
-
|
151
|
-
doc.to_text # => "\nTitle\n\nContent\n" # better
|
152
|
+
Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document) # => true
|
153
|
+
Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true
|
152
154
|
```
|
153
155
|
|
154
|
-
### Loofah::XML::Document and Loofah::XML::DocumentFragment
|
156
|
+
### `Loofah::XML::Document` and `Loofah::XML::DocumentFragment`
|
155
157
|
|
156
|
-
These classes are subclasses of Nokogiri::XML::Document and
|
157
|
-
Nokogiri::XML::DocumentFragment, so you get all the markup
|
158
|
-
fixer-uppery and API goodness of Nokogiri.
|
158
|
+
These classes are subclasses of `Nokogiri::XML::Document` and `Nokogiri::XML::DocumentFragment`.
|
159
159
|
|
160
|
-
The module methods Loofah.xml_document and Loofah.xml_fragment will
|
161
|
-
parse an XML document and an XML fragment, respectively.
|
160
|
+
The module methods `Loofah.xml_document` and `Loofah.xml_fragment` will parse an XML document and an XML fragment, respectively.
|
162
161
|
|
163
162
|
``` ruby
|
164
163
|
Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document) # => true
|
165
164
|
Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true
|
166
165
|
```
|
167
166
|
|
168
|
-
### Nodes and
|
167
|
+
### Nodes and Node Sets
|
169
168
|
|
170
|
-
Nokogiri
|
171
|
-
method, which makes it easy to scrub subtrees.
|
169
|
+
Nokogiri's `Node` and `NodeSet` classes also get a `scrub!` method, which makes it easy to scrub subtrees.
|
172
170
|
|
173
|
-
The following code will apply the `employee_scrubber` only to the
|
174
|
-
`employee` nodes (and their subtrees) in the document:
|
171
|
+
The following code will apply the `employee_scrubber` only to the `employee` nodes (and their subtrees) in the document:
|
175
172
|
|
176
173
|
``` ruby
|
177
174
|
Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)
|
@@ -183,7 +180,7 @@ And this code will only scrub the first `employee` node and its subtree:
|
|
183
180
|
Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)
|
184
181
|
```
|
185
182
|
|
186
|
-
### Loofah::Scrubber
|
183
|
+
### `Loofah::Scrubber`
|
187
184
|
|
188
185
|
A Scrubber wraps up a block (or method) that is run on a document node:
|
189
186
|
|
@@ -197,14 +194,11 @@ end
|
|
197
194
|
This can then be run on a document:
|
198
195
|
|
199
196
|
``` ruby
|
200
|
-
Loofah.
|
197
|
+
Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
|
201
198
|
# => "<div>foo</div><p>bar</p>"
|
202
199
|
```
|
203
200
|
|
204
|
-
Scrubbers can be run on a document in either a top-down traversal (the
|
205
|
-
default) or bottom-up. Top-down scrubbers can optionally return
|
206
|
-
Scrubber::STOP to terminate the traversal of a subtree. Read below and
|
207
|
-
in the Loofah::Scrubber class for more detailed usage.
|
201
|
+
Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return `Scrubber::STOP` to terminate the traversal of a subtree. Read below and in the `Loofah::Scrubber` class for more detailed usage.
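The two traversal orders, and the effect of stopping a top-down walk at a subtree, can be sketched without any parser at all. This is a minimal illustration in plain Ruby, not Loofah's implementation: `TreeNode`, `STOP`, `walk_top_down`, and `walk_bottom_up` are names invented for the example.

```ruby
# Hypothetical stand-in for a DOM node; not a Nokogiri class.
TreeNode = Struct.new(:name, :children)

# Plays the role of Scrubber::STOP in this sketch.
STOP = :stop

# Top-down: visit the node first, then its children, unless the block returned STOP.
def walk_top_down(node, &visit)
  return if visit.call(node) == STOP

  node.children.each { |child| walk_top_down(child, &visit) }
end

# Bottom-up: visit the children first, then the node itself.
def walk_bottom_up(node, &visit)
  node.children.each { |child| walk_bottom_up(child, &visit) }
  visit.call(node)
end

tree = TreeNode.new("div", [
  TreeNode.new("script", [TreeNode.new("text", [])]),
  TreeNode.new("p", [TreeNode.new("span", [])]),
])

top_down = []
walk_top_down(tree) do |node|
  top_down << node.name
  STOP if node.name == "script" # do not descend into the script subtree
end

bottom_up = []
walk_bottom_up(tree) { |node| bottom_up << node.name }

p top_down  # parents before children, script's child skipped
p bottom_up # children before parents
```

Stopping only makes sense top-down: in a bottom-up walk the children have already been visited by the time the parent is reached.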
|
208
202
|
|
209
203
|
Here's an XML example:
|
210
204
|
|
@@ -219,12 +213,12 @@ end
|
|
219
213
|
Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)
|
220
214
|
```
|
221
215
|
|
222
|
-
|
216
|
+
### Built-In HTML Scrubbers
|
223
217
|
|
224
|
-
Loofah comes with a set of sanitizing scrubbers that use
|
225
|
-
whitelist algorithm:
|
218
|
+
Loofah comes with a set of sanitizing scrubbers that use `html5lib`'s safelist algorithm:
|
226
219
|
|
227
220
|
``` ruby
|
221
|
+
doc = Loofah.html5_document(input)
|
228
222
|
doc.scrub!(:strip) # replaces unknown/unsafe tags with their inner text
|
229
223
|
doc.scrub!(:prune) # removes unknown/unsafe tags and their children
|
230
224
|
doc.scrub!(:escape) # escapes unknown/unsafe tags, like this: &lt;script&gt;
|
@@ -232,14 +226,14 @@ doc.scrub!(:whitewash) # removes unknown/unsafe/namespaced tags and their chi
|
|
232
226
|
# and strips all node attributes
|
233
227
|
```
|
234
228
|
|
235
|
-
Loofah also comes with some common transformation tasks:
|
229
|
+
Loofah also comes with some common transformation tasks:
|
236
230
|
|
237
231
|
``` ruby
|
238
232
|
doc.scrub!(:nofollow) # adds rel="nofollow" attribute to links
|
239
233
|
doc.scrub!(:unprintable) # removes unprintable characters from text nodes
|
240
234
|
```
|
241
235
|
|
242
|
-
See Loofah::Scrubbers for more details and example usage.
|
236
|
+
See `Loofah::Scrubbers` for more details and example usage.
|
243
237
|
|
244
238
|
|
245
239
|
### Chaining Scrubbers
|
@@ -247,7 +241,7 @@ See Loofah::Scrubbers for more details and example usage.
|
|
247
241
|
You can chain scrubbers:
|
248
242
|
|
249
243
|
``` ruby
|
250
|
-
Loofah.
|
244
|
+
Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
|
251
245
|
.scrub!(:prune) \
|
252
246
|
.scrub!(span2div).to_s
|
253
247
|
# => "<div>hello</div> "
|
@@ -255,21 +249,26 @@ Loofah.fragment("<span>hello</span> <script>alert('OHAI')</script>") \
|
|
255
249
|
|
256
250
|
### Shorthand
|
257
251
|
|
258
|
-
The class methods Loofah.
|
259
|
-
|
252
|
+
The class methods `Loofah.scrub_html5_fragment` and `Loofah.scrub_html5_document` (and the corresponding HTML4 methods) are shorthand.
|
253
|
+
|
254
|
+
These methods:
|
260
255
|
|
261
256
|
``` ruby
|
262
|
-
Loofah.
|
263
|
-
Loofah.
|
257
|
+
Loofah.scrub_html5_fragment(unsafe_html, :prune)
|
258
|
+
Loofah.scrub_html5_document(unsafe_html, :prune)
|
259
|
+
Loofah.scrub_html4_fragment(unsafe_html, :prune)
|
260
|
+
Loofah.scrub_html4_document(unsafe_html, :prune)
|
264
261
|
Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
|
265
262
|
Loofah.scrub_xml_document(bad_xml, custom_scrubber)
|
266
263
|
```
|
267
264
|
|
268
|
-
|
265
|
+
do the same thing as (and are arguably semantically clearer than):
|
269
266
|
|
270
267
|
``` ruby
|
271
|
-
Loofah.
|
272
|
-
Loofah.
|
268
|
+
Loofah.html5_fragment(unsafe_html).scrub!(:prune)
|
269
|
+
Loofah.html5_document(unsafe_html).scrub!(:prune)
|
270
|
+
Loofah.html4_fragment(unsafe_html).scrub!(:prune)
|
271
|
+
Loofah.html4_document(unsafe_html).scrub!(:prune)
|
273
272
|
Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
|
274
273
|
Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
|
275
274
|
```
|
@@ -277,10 +276,9 @@ Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
|
|
277
276
|
|
278
277
|
### View Helpers
|
279
278
|
|
280
|
-
Loofah has two "view helpers": Loofah::Helpers.sanitize and
|
281
|
-
|
282
|
-
|
283
|
-
These are no longer required automatically. You must require `loofah/helpers`.
|
279
|
+
Loofah has two "view helpers": `Loofah::Helpers.sanitize` and `Loofah::Helpers.strip_tags`, both of which are drop-in replacements for the Rails Action View helpers of the same name.
|
280
|
+
|
281
|
+
These are not required automatically. You must require `loofah/helpers` to use them.
|
284
282
|
|
285
283
|
|
286
284
|
## Requirements
|
@@ -306,7 +304,9 @@ And the mailing list is on Google Groups:
|
|
306
304
|
* Mail: loofah-talk@googlegroups.com
|
307
305
|
* Archive: https://groups.google.com/forum/#!forum/loofah-talk
|
308
306
|
|
309
|
-
|
307
|
+
Consider subscribing to [Tidelift][tidelift] which provides license assurances and timely security notifications for your open source dependencies, including Loofah. [Tidelift][tidelift] subscriptions also help the Loofah maintainers fund our [automated testing](https://ci.nokogiri.org) which in turn allows us to ship releases, bugfixes, and security updates more often.
|
308
|
+
|
309
|
+
[tidelift]: https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=undefined&utm_medium=referral&utm_campaign=enterprise
|
310
310
|
|
311
311
|
|
312
312
|
## Security
|
@@ -314,26 +314,12 @@ And the IRC channel is \#loofah on freenode.
|
|
314
314
|
See [`SECURITY.md`](SECURITY.md) for vulnerability reporting details.
|
315
315
|
|
316
316
|
|
317
|
-
### "Secure by Default"
|
318
|
-
|
319
|
-
Some tools may incorrectly report Loofah as a potential security
|
320
|
-
vulnerability.
|
321
|
-
|
322
|
-
Loofah depends on Nokogiri, and it's _possible_ to use Nokogiri in a
|
323
|
-
dangerous way (by enabling its DTDLOAD option and disabling its NONET
|
324
|
-
option). This specifically allows the opportunity for an XML External
|
325
|
-
Entity (XXE) vulnerability if the XML data is untrusted.
|
326
|
-
|
327
|
-
However, Loofah __never enables this Nokogiri configuration__; Loofah
|
328
|
-
never enables DTDLOAD, and it never disables NONET, thereby protecting
|
329
|
-
you by default from this XXE vulnerability.
|
330
|
-
|
331
|
-
|
332
317
|
## Related Links
|
333
318
|
|
319
|
+
* loofah-activerecord: https://github.com/flavorjones/loofah-activerecord
|
334
320
|
* Nokogiri: http://nokogiri.org
|
335
321
|
* libxml2: http://xmlsoft.org
|
336
|
-
* html5lib: https://
|
322
|
+
* html5lib: https://github.com/html5lib/
|
337
323
|
|
338
324
|
|
339
325
|
## Authors
|
@@ -354,15 +340,14 @@ And a big shout-out to Corey Innis for the name, and feedback on the API.
|
|
354
340
|
|
355
341
|
## Thank You
|
356
342
|
|
357
|
-
The following people have generously
|
343
|
+
The following people have generously funded Loofah:
|
358
344
|
|
359
345
|
* Bill Harding
|
360
346
|
|
361
347
|
|
362
348
|
## Historical Note
|
363
349
|
|
364
|
-
This library was
|
365
|
-
name that nobody could spell properly.
|
350
|
+
This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.
|
366
351
|
|
367
352
|
|
368
353
|
## License
|
data/lib/loofah/concerns.rb
ADDED
@@ -0,0 +1,207 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
module Loofah
|
4
|
+
#
|
5
|
+
# Mixes +scrub!+ into Document, DocumentFragment, Node and NodeSet.
|
6
|
+
#
|
7
|
+
# Traverse the document or fragment, invoking the +scrubber+ on each node.
|
8
|
+
#
|
9
|
+
# +scrubber+ must either be one of the symbols representing the built-in scrubbers (see
|
10
|
+
# Scrubbers), or a Scrubber instance.
|
11
|
+
#
|
12
|
+
# span2div = Loofah::Scrubber.new do |node|
|
13
|
+
# node.name = "div" if node.name == "span"
|
14
|
+
# end
|
15
|
+
# Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
|
16
|
+
# # => "<div>foo</div><p>bar</p>"
|
17
|
+
#
|
18
|
+
# or
|
19
|
+
#
|
20
|
+
# unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
|
21
|
+
# Loofah.html5_fragment(unsafe_html).scrub!(:strip).to_s
|
22
|
+
# # => "ohai! <div>div is safe</div> "
|
23
|
+
#
|
24
|
+
# Note that this method is called implicitly from the shortcuts Loofah.scrub_html5_fragment et
|
25
|
+
# al.
|
26
|
+
#
|
27
|
+
# Please see Scrubber for more information on implementation and traversal, and README.rdoc for
|
28
|
+
# more example usage.
|
29
|
+
#
|
30
|
+
module ScrubBehavior
|
31
|
+
module Node # :nodoc:
|
32
|
+
def scrub!(scrubber)
|
33
|
+
#
|
34
|
+
# yes. this should be three separate methods. but nokogiri decorates (or not) based on
|
35
|
+
# whether the module name has already been included. and since documents get decorated just
|
36
|
+
# like their constituent nodes, we need to jam all the logic into a single module.
|
37
|
+
#
|
38
|
+
scrubber = ScrubBehavior.resolve_scrubber(scrubber)
|
39
|
+
case self
|
40
|
+
when Nokogiri::XML::Document
|
41
|
+
scrubber.traverse(root) if root
|
42
|
+
when Nokogiri::XML::DocumentFragment
|
43
|
+
children.scrub!(scrubber)
|
44
|
+
else
|
45
|
+
scrubber.traverse(self)
|
46
|
+
end
|
47
|
+
self
|
48
|
+
end
|
49
|
+
end
|
50
|
+
|
51
|
+
module NodeSet # :nodoc:
|
52
|
+
def scrub!(scrubber)
|
53
|
+
each { |node| node.scrub!(scrubber) }
|
54
|
+
self
|
55
|
+
end
|
56
|
+
end
|
57
|
+
|
58
|
+
class << self
|
59
|
+
def resolve_scrubber(scrubber) # :nodoc:
|
60
|
+
scrubber = Scrubbers::MAP[scrubber].new if Scrubbers::MAP[scrubber]
|
61
|
+
unless scrubber.is_a?(Loofah::Scrubber)
|
62
|
+
raise Loofah::ScrubberNotFound, "not a Scrubber or a scrubber name: #{scrubber.inspect}"
|
63
|
+
end
|
64
|
+
|
65
|
+
scrubber
|
66
|
+
end
|
67
|
+
end
|
68
|
+
end
|
69
|
+
|
70
|
+
#
|
71
|
+
# Overrides +text+ in Document and DocumentFragment classes, and mixes in +to_text+.
|
72
|
+
#
|
73
|
+
module TextBehavior
|
74
|
+
#
|
75
|
+
# Returns a plain-text version of the markup contained by the document, with HTML entities
|
76
|
+
# encoded.
|
77
|
+
#
|
78
|
+
# This method is significantly faster than #to_text, but isn't clever about whitespace around
|
79
|
+
# block elements.
|
80
|
+
#
|
81
|
+
# Loofah.html5_document("<h1>Title</h1><div>Content</div>").text
|
82
|
+
# # => "TitleContent"
|
83
|
+
#
|
84
|
+
# By default, the returned text will have HTML entities escaped. If you want unescaped
|
85
|
+
# entities, and you understand that the result is unsafe to render in a browser, then you can
|
86
|
+
# pass an argument as shown:
|
87
|
+
#
|
88
|
+
# frag = Loofah.html5_fragment("&lt;script&gt;alert('EVIL');&lt;/script&gt;")
|
89
|
+
# # ok for browser:
|
90
|
+
# frag.text # => "&lt;script&gt;alert('EVIL');&lt;/script&gt;"
|
91
|
+
# # decidedly not ok for browser:
|
92
|
+
# frag.text(:encode_special_chars => false) # => "<script>alert('EVIL');</script>"
|
93
|
+
#
|
94
|
+
def text(options = {})
|
95
|
+
result = if serialize_root
|
96
|
+
serialize_root.children.reject(&:comment?).map(&:inner_text).join("")
|
97
|
+
else
|
98
|
+
""
|
99
|
+
end
|
100
|
+
if options[:encode_special_chars] == false
|
101
|
+
result # possibly dangerous if rendered in a browser
|
102
|
+
else
|
103
|
+
encode_special_chars(result)
|
104
|
+
end
|
105
|
+
end
|
106
|
+
|
107
|
+
alias_method :inner_text, :text
|
108
|
+
alias_method :to_str, :text
|
109
|
+
|
110
|
+
#
|
111
|
+
# Returns a plain-text version of the markup contained by the fragment, with HTML entities
|
112
|
+
# encoded.
|
113
|
+
#
|
114
|
+
# This method is slower than #text, but is clever about whitespace around block elements and
|
115
|
+
# line break elements.
|
116
|
+
#
|
117
|
+
# Loofah.html5_document("<h1>Title</h1><div>Content<br>Next line</div>").to_text
|
118
|
+
# # => "\nTitle\n\nContent\nNext line\n"
|
119
|
+
#
|
120
|
+
def to_text(options = {})
|
121
|
+
Loofah.remove_extraneous_whitespace(dup.scrub!(:newline_block_elements).text(options))
|
122
|
+
end
|
123
|
+
end
|
124
|
+
|
125
|
+
module DocumentDecorator # :nodoc:
|
126
|
+
def initialize(*args, &block)
|
127
|
+
super
|
128
|
+
decorators(Nokogiri::XML::Node) << ScrubBehavior::Node
|
129
|
+
decorators(Nokogiri::XML::NodeSet) << ScrubBehavior::NodeSet
|
130
|
+
end
|
131
|
+
end
|
132
|
+
|
133
|
+
module HtmlDocumentBehavior # :nodoc:
|
134
|
+
module ClassMethods
|
135
|
+
def parse(*args, &block)
|
136
|
+
remove_comments_before_html_element(super)
|
137
|
+
end
|
138
|
+
|
139
|
+
private
|
140
|
+
|
141
|
+
# remove comments that exist outside of the HTML element.
|
142
|
+
#
|
143
|
+
# these comments are allowed by the HTML spec:
|
144
|
+
#
|
145
|
+
# https://www.w3.org/TR/html401/struct/global.html#h-7.1
|
146
|
+
#
|
147
|
+
# but are not scrubbed by Loofah because these nodes don't meet
|
148
|
+
# the contract that scrubbers expect of a node (e.g., it can be
|
149
|
+
# replaced, sibling and children nodes can be created).
|
150
|
+
def remove_comments_before_html_element(doc)
|
151
|
+
doc.children.each do |child|
|
152
|
+
child.unlink if child.comment?
|
153
|
+
end
|
154
|
+
doc
|
155
|
+
end
|
156
|
+
end
|
157
|
+
|
158
|
+
class << self
|
159
|
+
def included(base)
|
160
|
+
base.extend(ClassMethods)
|
161
|
+
end
|
162
|
+
end
|
163
|
+
|
164
|
+
def serialize_root
|
165
|
+
at_xpath("/html/body")
|
166
|
+
end
|
167
|
+
end
|
168
|
+
|
169
|
+
module HtmlFragmentBehavior # :nodoc:
|
170
|
+
module ClassMethods
|
171
|
+
def parse(tags, encoding = nil)
|
172
|
+
doc = document_klass.new
|
173
|
+
|
174
|
+
encoding ||= tags.respond_to?(:encoding) ? tags.encoding.name : "UTF-8"
|
175
|
+
doc.encoding = encoding
|
176
|
+
|
177
|
+
new(doc, tags)
|
178
|
+
end
|
179
|
+
|
180
|
+
def document_klass
|
181
|
+
@document_klass ||= if Loofah.html5_support? && self == Loofah::HTML5::DocumentFragment
|
182
|
+
Loofah::HTML5::Document
|
183
|
+
elsif self == Loofah::HTML4::DocumentFragment
|
184
|
+
Loofah::HTML4::Document
|
185
|
+
else
|
186
|
+
raise ArgumentError, "unexpected class: #{self}"
|
187
|
+
end
|
188
|
+
end
|
189
|
+
end
|
190
|
+
|
191
|
+
class << self
|
192
|
+
def included(base)
|
193
|
+
base.extend(ClassMethods)
|
194
|
+
end
|
195
|
+
end
|
196
|
+
|
197
|
+
def to_s
|
198
|
+
serialize_root.children.to_s
|
199
|
+
end
|
200
|
+
|
201
|
+
alias_method :serialize, :to_s
|
202
|
+
|
203
|
+
def serialize_root
|
204
|
+
at_xpath("./body") || self
|
205
|
+
end
|
206
|
+
end
|
207
|
+
end
|
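The difference between the two `text` modes documented in `TextBehavior` above comes down to whether markup-significant characters are escaped before the string is returned. A minimal sketch of that idea follows, using Ruby's standard `CGI.escapeHTML` as a stand-in. That is an assumption made for illustration: Loofah itself calls `encode_special_chars`, which may escape a slightly different set of characters.

```ruby
require "cgi"

raw = "<script>alert('EVIL');</script>"

# Escaped text is inert when placed into an HTML page.
escaped = CGI.escapeHTML(raw)

# Unescaping recovers the original string, which a browser would treat as markup.
restored = CGI.unescapeHTML(escaped)

puts escaped
puts restored
```

Only the escaped form is suitable for rendering in a browser; the unescaped form should be treated like any other untrusted markup.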