loofah 2.2.3 → 2.21.1

Files changed (46)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +269 -31
  3. data/README.md +109 -124
  4. data/lib/loofah/concerns.rb +207 -0
  5. data/lib/loofah/elements.rb +85 -79
  6. data/lib/loofah/helpers.rb +37 -20
  7. data/lib/loofah/{html → html4}/document.rb +6 -7
  8. data/lib/loofah/html4/document_fragment.rb +15 -0
  9. data/lib/loofah/html5/document.rb +17 -0
  10. data/lib/loofah/html5/document_fragment.rb +15 -0
  11. data/lib/loofah/html5/libxml2_workarounds.rb +10 -8
  12. data/lib/loofah/html5/safelist.rb +1055 -0
  13. data/lib/loofah/html5/scrub.rb +153 -58
  14. data/lib/loofah/metahelpers.rb +11 -6
  15. data/lib/loofah/scrubber.rb +22 -15
  16. data/lib/loofah/scrubbers.rb +66 -55
  17. data/lib/loofah/version.rb +6 -0
  18. data/lib/loofah/xml/document.rb +2 -0
  19. data/lib/loofah/xml/document_fragment.rb +4 -7
  20. data/lib/loofah.rb +131 -38
  21. metadata +28 -216
  22. data/.gemtest +0 -0
  23. data/Gemfile +0 -22
  24. data/Manifest.txt +0 -40
  25. data/Rakefile +0 -79
  26. data/benchmark/benchmark.rb +0 -149
  27. data/benchmark/fragment.html +0 -96
  28. data/benchmark/helper.rb +0 -73
  29. data/benchmark/www.slashdot.com.html +0 -2560
  30. data/lib/loofah/html/document_fragment.rb +0 -40
  31. data/lib/loofah/html5/whitelist.rb +0 -186
  32. data/lib/loofah/instance_methods.rb +0 -127
  33. data/test/assets/msword.html +0 -63
  34. data/test/assets/testdata_sanitizer_tests1.dat +0 -502
  35. data/test/helper.rb +0 -18
  36. data/test/html5/test_sanitizer.rb +0 -382
  37. data/test/integration/test_ad_hoc.rb +0 -204
  38. data/test/integration/test_helpers.rb +0 -43
  39. data/test/integration/test_html.rb +0 -72
  40. data/test/integration/test_scrubbers.rb +0 -400
  41. data/test/integration/test_xml.rb +0 -55
  42. data/test/unit/test_api.rb +0 -142
  43. data/test/unit/test_encoding.rb +0 -20
  44. data/test/unit/test_helpers.rb +0 -62
  45. data/test/unit/test_scrubber.rb +0 -229
  46. data/test/unit/test_scrubbers.rb +0 -14
data/README.md CHANGED
@@ -1,81 +1,74 @@
1
1
  # Loofah
2
2
 
3
3
  * https://github.com/flavorjones/loofah
4
- * Docs: http://rubydoc.info/github/flavorjones/loofah/master/frames
4
+ * Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
5
5
  * Mailing list: [loofah-talk@googlegroups.com](https://groups.google.com/forum/#!forum/loofah-talk)
6
6
 
7
7
  ## Status
8
8
 
9
- |System|Status|
10
- |--|--|
11
- | Concourse | [![Concourse CI](https://ci.nokogiri.org/api/v1/teams/nokogiri-core/pipelines/loofah/jobs/ruby-2.5/badge)](https://ci.nokogiri.org/teams/nokogiri-core/pipelines/loofah?groups=master) |
12
- | Code Climate | [![Code Climate](https://codeclimate.com/github/flavorjones/loofah.svg)](https://codeclimate.com/github/flavorjones/loofah) |
13
- | Version Eye | [![Version Eye](https://www.versioneye.com/ruby/loofah/badge.png)](https://www.versioneye.com/ruby/loofah) |
9
+ [![ci](https://github.com/flavorjones/loofah/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/flavorjones/loofah/actions/workflows/ci.yml)
10
+ [![Tidelift dependencies](https://tidelift.com/badges/package/rubygems/loofah)](https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=rubygems-loofah&utm_medium=referral&utm_campaign=readme)
14
11
 
15
12
 
16
13
  ## Description
17
14
 
18
- Loofah is a general library for manipulating and transforming HTML/XML
19
- documents and fragments. It's built on top of Nokogiri and libxml2, so
20
- it's fast and has a nice API.
15
+ Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.
21
16
 
22
- Loofah excels at HTML sanitization (XSS prevention). It includes some
23
- nice HTML sanitizers, which are based on HTML5lib's whitelist, so it
24
- most likely won't make your codes less secure. (These statements have
25
- not been evaluated by Netexperts.)
17
+ Loofah also includes some HTML sanitizers based on `html5lib`'s safelist, which are a specific application of the general transformation functionality.
26
18
 
27
- ActiveRecord extensions for sanitization are available in the
28
- [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).
19
+ Active Record extensions for HTML sanitization are available in the [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).
29
20
 
30
21
 
31
22
  ## Features
32
23
 
33
- * Easily write custom scrubbers for HTML/XML leveraging the sweetness of Nokogiri (and HTML5lib's whitelists).
34
- * Common HTML sanitizing tasks are built-in:
24
+ * Easily write custom transformations for HTML and XML
25
+ * Common HTML sanitizing transformations are built-in:
35
26
  * _Strip_ unsafe tags, leaving behind only the inner text.
36
27
  * _Prune_ unsafe tags and their subtrees, removing all traces that they ever existed.
37
28
  * _Escape_ unsafe tags and their subtrees, leaving behind lots of <tt>&lt;</tt> and <tt>&gt;</tt> entities.
38
29
  * _Whitewash_ the markup, removing all attributes and namespaced nodes.
39
- * Common HTML transformation tasks are built-in:
30
+ * Other common HTML transformations are built-in:
40
31
  * Add the _nofollow_ attribute to all hyperlinks.
41
- * Format markup as plain text, with or without sensible whitespace handling around block elements.
32
+ * Remove _unprintable_ characters from text nodes.
33
+ * Format markup as plain text, with (or without) sensible whitespace handling around block elements.
42
34
  * Replace Rails's `strip_tags` and `sanitize` view helper methods.
43
35
 
44
36
 
45
37
  ## Compare and Contrast
46
38
 
47
- Loofah is one of two known Ruby XSS/sanitization solutions that
48
- guarantees well-formed and valid markup (the other is Sanitize, which
49
- also uses Nokogiri).
39
+ Loofah is both:
50
40
 
51
- Loofah works on XML, XHTML and HTML documents.
41
+ - a general framework for transforming XML, XHTML, and HTML documents
42
+ - a specific toolkit for HTML sanitization
52
43
 
53
- Also, it's pretty fast. Here is a benchmark comparing Loofah to other
54
- commonly-used libraries (ActionView, Sanitize, HTML5lib and HTMLfilter):
44
+ ### General document transformation
55
45
 
56
- * https://gist.github.com/170193
46
+ Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.
57
47
 
58
- Lastly, Loofah is extensible. It's super-easy to write your own custom
59
- scrubbers for whatever document manipulation you need. You don't like
60
- the built-in scrubbers? Build your own, like a boss.
48
+
49
+ ### HTML sanitization
50
+
51
+ Another Ruby library that provides HTML sanitization is [`rgrove/sanitize`](https://github.com/rgrove/sanitize), which is also built on top of Nokogiri and provides a bit more flexibility on the tags and attributes being scrubbed.
52
+
53
+ You may also want to look at [`rails/rails-html-sanitizer`](https://github.com/rails/rails-html-sanitizer) which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.
61
54
 
62
55
 
63
56
  ## The Basics
64
57
 
65
- Loofah wraps [Nokogiri](http://nokogiri.org) in a loving
66
- embrace. Nokogiri is an excellent HTML/XML parser. If you don't know
67
- how Nokogiri works, you might want to pause for a moment and go check
68
- it out. I'll wait.
58
+ Loofah wraps [Nokogiri](http://nokogiri.org) in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.
69
59
 
70
- Loofah presents the following classes:
60
+ Loofah implements the following classes:
71
61
 
72
- * `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment`
73
- * `Loofah::XML::Document` and `Loofah::XML::DocumentFragment`
74
- * `Loofah::Scrubber`
62
+ * `Loofah::HTML5::Document`
63
+ * `Loofah::HTML5::DocumentFragment`
64
+ * `Loofah::HTML4::Document` (aliased as `Loofah::HTML::Document` for now)
65
+ * `Loofah::HTML4::DocumentFragment` (aliased as `Loofah::HTML::DocumentFragment` for now)
66
+ * `Loofah::XML::Document`
67
+ * `Loofah::XML::DocumentFragment`
75
68
 
76
- The documents and fragments are subclasses of the similar Nokogiri classes.
69
+ These document and fragment classes are subclasses of the similarly-named Nokogiri classes `Nokogiri::HTML5::Document` et al.
77
70
 
78
- The Scrubber represents the document manipulation, either by wrapping
71
+ Loofah also implements `Loofah::Scrubber`, which represents the document transformation, either by wrapping
79
72
  a block,
80
73
 
81
74
  ``` ruby
@@ -89,50 +82,49 @@ or by implementing a method.
89
82
 
90
83
  ### Side Note: Fragments vs Documents
91
84
 
92
- Generally speaking, unless you expect to have a DOCTYPE and a single
93
- root node, you don't have a *document*, you have a *fragment*. For
94
- HTML, another rule of thumb is that *documents* have `html` and `body`
95
- tags, and *fragments* usually do not.
85
+ Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a *document*, you have a *fragment*. For HTML, another rule of thumb is that *documents* have `html` and `body` tags, and *fragments* usually do not.
86
+
87
+ **HTML fragments** should be parsed with `Loofah.html5_fragment` or `Loofah.html4_fragment`. The result won't be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration, `head` elements will be silently ignored, and multiple root nodes are allowed.
88
+
89
+ **HTML documents** should be parsed with `Loofah.html5_document` or `Loofah.html4_document`. The result will have a DOCTYPE declaration, along with `html`, `head` and `body` tags.
90
+
91
+ **XML fragments** should be parsed with `Loofah.xml_fragment`. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.
96
92
 
97
- HTML fragments should be parsed with Loofah.fragment. The result won't
98
- be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration,
99
- `head` elements will be silently ignored, and multiple root nodes are
100
- allowed.
93
+ **XML documents** should be parsed with `Loofah.xml_document`. The result will have a DOCTYPE declaration and a single root node.
101
94
 
102
- XML fragments should be parsed with Loofah.xml_fragment. The result
103
- won't have a DOCTYPE declaration, and multiple root nodes are allowed.
104
95
 
105
- HTML documents should be parsed with Loofah.document. The result will
106
- have a DOCTYPE declaration, along with `html`, `head` and `body` tags.
96
+ ### Side Note: HTML4 vs HTML5
107
97
 
108
- XML documents should be parsed with Loofah.xml_document. The result
109
- will have a DOCTYPE declaration and a single root node.
98
+ _HTML5 functionality is not available on JRuby, or with versions of Nokogiri `< 1.14.0`._
110
99
 
100
+ Currently, Loofah's methods `Loofah.document` and `Loofah.fragment` are aliases to `.html4_document` and `.html4_fragment`, which use Nokogiri's HTML4 parser. (Similarly, `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment` are aliased to `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`.)
111
101
 
112
- ### Loofah::HTML::Document and Loofah::HTML::DocumentFragment
102
+ **Please note** that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].
113
103
 
114
- These classes are subclasses of Nokogiri::HTML::Document and
115
- Nokogiri::HTML::DocumentFragment, so you get all the markup
116
- fixer-uppery and API goodness of Nokogiri.
104
+ **We strongly recommend that you explicitly use `.html5_document` or `.html5_fragment`** unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call `.html4_document` or `.html4_fragment` to avoid breakage in a future version.
117
105
 
118
- The module methods Loofah.document and Loofah.fragment will parse an
119
- HTML document and an HTML fragment, respectively.
106
+ [1]: [[feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/2227)
107
+
108
+
109
+ ### `Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`
110
+
111
+ These classes are subclasses of `Nokogiri::HTML5::Document` and `Nokogiri::HTML5::DocumentFragment`.
112
+
113
+ The module methods `Loofah.html5_document` and `Loofah.html5_fragment` will parse an HTML document and an HTML fragment, respectively.
120
114
 
121
115
  ``` ruby
122
- Loofah.document(unsafe_html).is_a?(Nokogiri::HTML::Document) # => true
123
- Loofah.fragment(unsafe_html).is_a?(Nokogiri::HTML::DocumentFragment) # => true
116
+ Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document) # => true
117
+ Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true
124
118
  ```
125
119
 
126
- Loofah injects a `scrub!` method, which takes either a symbol (for
127
- built-in scrubbers) or a Loofah::Scrubber object (for custom
128
- scrubbers), and modifies the document in-place.
120
+ Loofah injects a `scrub!` method, which takes either a symbol (for built-in scrubbers) or a `Loofah::Scrubber` object (for custom scrubbers), and modifies the document in-place.
129
121
 
130
122
  Loofah overrides `to_s` to return HTML:
131
123
 
132
124
  ``` ruby
133
125
  unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
134
126
 
135
- doc = Loofah.fragment(unsafe_html).scrub!(:prune)
127
+ doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
136
128
  doc.to_s # => "ohai! <div>div is safe</div> "
137
129
  ```
138
130
 
@@ -142,36 +134,41 @@ and `text` to return plain text:
142
134
  doc.text # => "ohai! div is safe "
143
135
  ```
144
136
 
145
- Also, `to_text` is available, which does the right thing with
146
- whitespace around block-level elements.
137
+ Also, `to_text` is available, which does the right thing with whitespace around block-level and line break elements.
138
+
139
+ ``` ruby
140
+ doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
141
+ doc.text # => "TitleContentNext line" # probably not what you want
142
+ doc.to_text # => "\nTitle\n\nContent\nNext line\n" # better
143
+ ```
144
+
145
+ ### `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`
146
+
147
+ These classes are subclasses of `Nokogiri::HTML4::Document` and `Nokogiri::HTML4::DocumentFragment`.
148
+
149
+ The module methods `Loofah.html4_document` and `Loofah.html4_fragment` will parse an HTML document and an HTML fragment, respectively.
147
150
 
148
151
  ``` ruby
149
- doc = Loofah.fragment("<h1>Title</h1><div>Content</div>")
150
- doc.text # => "TitleContent" # probably not what you want
151
- doc.to_text # => "\nTitle\n\nContent\n" # better
152
+ Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document) # => true
153
+ Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true
152
154
  ```
153
155
 
154
- ### Loofah::XML::Document and Loofah::XML::DocumentFragment
156
+ ### `Loofah::XML::Document` and `Loofah::XML::DocumentFragment`
155
157
 
156
- These classes are subclasses of Nokogiri::XML::Document and
157
- Nokogiri::XML::DocumentFragment, so you get all the markup
158
- fixer-uppery and API goodness of Nokogiri.
158
+ These classes are subclasses of `Nokogiri::XML::Document` and `Nokogiri::XML::DocumentFragment`.
159
159
 
160
- The module methods Loofah.xml_document and Loofah.xml_fragment will
161
- parse an XML document and an XML fragment, respectively.
160
+ The module methods `Loofah.xml_document` and `Loofah.xml_fragment` will parse an XML document and an XML fragment, respectively.
162
161
 
163
162
  ``` ruby
164
163
  Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document) # => true
165
164
  Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true
166
165
  ```
167
166
 
168
- ### Nodes and NodeSets
167
+ ### Nodes and Node Sets
169
168
 
170
- Nokogiri::XML::Node and Nokogiri::XML::NodeSet also get a `scrub!`
171
- method, which makes it easy to scrub subtrees.
169
+ Nokogiri's `Node` and `NodeSet` classes also get a `scrub!` method, which makes it easy to scrub subtrees.
172
170
 
173
- The following code will apply the `employee_scrubber` only to the
174
- `employee` nodes (and their subtrees) in the document:
171
+ The following code will apply the `employee_scrubber` only to the `employee` nodes (and their subtrees) in the document:
175
172
 
176
173
  ``` ruby
177
174
  Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)
@@ -183,7 +180,7 @@ And this code will only scrub the first `employee` node and its subtree:
183
180
  Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)
184
181
  ```
185
182
 
186
- ### Loofah::Scrubber
183
+ ### `Loofah::Scrubber`
187
184
 
188
185
  A Scrubber wraps up a block (or method) that is run on a document node:
189
186
 
@@ -197,14 +194,11 @@ end
197
194
  This can then be run on a document:
198
195
 
199
196
  ``` ruby
200
- Loofah.fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
197
+ Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
201
198
  # => "<div>foo</div><p>bar</p>"
202
199
  ```
203
200
 
204
- Scrubbers can be run on a document in either a top-down traversal (the
205
- default) or bottom-up. Top-down scrubbers can optionally return
206
- Scrubber::STOP to terminate the traversal of a subtree. Read below and
207
- in the Loofah::Scrubber class for more detailed usage.
201
+ Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return `Scrubber::STOP` to terminate the traversal of a subtree. Read below and in the `Loofah::Scrubber` class for more detailed usage.
208
202
 
209
203
  Here's an XML example:
210
204
 
@@ -219,12 +213,12 @@ end
219
213
  Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)
220
214
  ```
221
215
 
222
- === Built-In HTML Scrubbers
216
+ ### Built-In HTML Scrubbers
223
217
 
224
- Loofah comes with a set of sanitizing scrubbers that use HTML5lib's
225
- whitelist algorithm:
218
+ Loofah comes with a set of sanitizing scrubbers that use `html5lib`'s safelist algorithm:
226
219
 
227
220
  ``` ruby
221
+ doc = Loofah.html5_document(input)
228
222
  doc.scrub!(:strip) # replaces unknown/unsafe tags with their inner text
229
223
  doc.scrub!(:prune) # removes unknown/unsafe tags and their children
230
224
  doc.scrub!(:escape) # escapes unknown/unsafe tags, like this: &lt;script&gt;
@@ -232,14 +226,14 @@ doc.scrub!(:whitewash) # removes unknown/unsafe/namespaced tags and their chi
232
226
  # and strips all node attributes
233
227
  ```
234
228
 
235
- Loofah also comes with some common transformation tasks:
229
+ Loofah also comes with some common transformation tasks:
236
230
 
237
231
  ``` ruby
238
232
  doc.scrub!(:nofollow) # adds rel="nofollow" attribute to links
239
233
  doc.scrub!(:unprintable) # removes unprintable characters from text nodes
240
234
  ```
241
235
 
242
- See Loofah::Scrubbers for more details and example usage.
236
+ See `Loofah::Scrubbers` for more details and example usage.
243
237
 
244
238
 
245
239
  ### Chaining Scrubbers
@@ -247,7 +241,7 @@ See Loofah::Scrubbers for more details and example usage.
247
241
  You can chain scrubbers:
248
242
 
249
243
  ``` ruby
250
- Loofah.fragment("<span>hello</span> <script>alert('OHAI')</script>") \
244
+ Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
251
245
  .scrub!(:prune) \
252
246
  .scrub!(span2div).to_s
253
247
  # => "<div>hello</div> "
@@ -255,21 +249,26 @@ Loofah.fragment("<span>hello</span> <script>alert('OHAI')</script>") \
255
249
 
256
250
  ### Shorthand
257
251
 
258
- The class methods Loofah.scrub_fragment and Loofah.scrub_document are
259
- shorthand.
252
+ The class methods `Loofah.scrub_html5_fragment` and `Loofah.scrub_html5_document` (and the corresponding HTML4 methods) are shorthand.
253
+
254
+ These methods:
260
255
 
261
256
  ``` ruby
262
- Loofah.scrub_fragment(unsafe_html, :prune)
263
- Loofah.scrub_document(unsafe_html, :prune)
257
+ Loofah.scrub_html5_fragment(unsafe_html, :prune)
258
+ Loofah.scrub_html5_document(unsafe_html, :prune)
259
+ Loofah.scrub_html4_fragment(unsafe_html, :prune)
260
+ Loofah.scrub_html4_document(unsafe_html, :prune)
264
261
  Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
265
262
  Loofah.scrub_xml_document(bad_xml, custom_scrubber)
266
263
  ```
267
264
 
268
- are the same thing as (and arguably semantically clearer than):
265
+ do the same thing as (and are arguably semantically clearer than):
269
266
 
270
267
  ``` ruby
271
- Loofah.fragment(unsafe_html).scrub!(:prune)
272
- Loofah.document(unsafe_html).scrub!(:prune)
268
+ Loofah.html5_fragment(unsafe_html).scrub!(:prune)
269
+ Loofah.html5_document(unsafe_html).scrub!(:prune)
270
+ Loofah.html4_fragment(unsafe_html).scrub!(:prune)
271
+ Loofah.html4_document(unsafe_html).scrub!(:prune)
273
272
  Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
274
273
  Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
275
274
  ```
@@ -277,10 +276,9 @@ Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
277
276
 
278
277
  ### View Helpers
279
278
 
280
- Loofah has two "view helpers": Loofah::Helpers.sanitize and
281
- Loofah::Helpers.strip_tags, both of which are drop-in replacements for
282
- the Rails ActionView helpers of the same name.
283
- These are no longer required automatically. You must require `loofah/helpers`.
279
+ Loofah has two "view helpers": `Loofah::Helpers.sanitize` and `Loofah::Helpers.strip_tags`, both of which are drop-in replacements for the Rails Action View helpers of the same name.
280
+
281
+ These are not required automatically. You must require `loofah/helpers` to use them.
284
282
 
285
283
 
286
284
  ## Requirements
@@ -306,7 +304,9 @@ And the mailing list is on Google Groups:
306
304
  * Mail: loofah-talk@googlegroups.com
307
305
  * Archive: https://groups.google.com/forum/#!forum/loofah-talk
308
306
 
309
- And the IRC channel is \#loofah on freenode.
307
+ Consider subscribing to [Tidelift][tidelift] which provides license assurances and timely security notifications for your open source dependencies, including Loofah. [Tidelift][tidelift] subscriptions also help the Loofah maintainers fund our [automated testing](https://ci.nokogiri.org) which in turn allows us to ship releases, bugfixes, and security updates more often.
308
+
309
+ [tidelift]: https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=undefined&utm_medium=referral&utm_campaign=enterprise
310
310
 
311
311
 
312
312
  ## Security
@@ -314,26 +314,12 @@ And the IRC channel is \#loofah on freenode.
314
314
  See [`SECURITY.md`](SECURITY.md) for vulnerability reporting details.
315
315
 
316
316
 
317
- ### "Secure by Default"
318
-
319
- Some tools may incorrectly report Loofah as a potential security
320
- vulnerability.
321
-
322
- Loofah depends on Nokogiri, and it's _possible_ to use Nokogiri in a
323
- dangerous way (by enabling its DTDLOAD option and disabling its NONET
324
- option). This specifically allows the opportunity for an XML External
325
- Entity (XXE) vulnerability if the XML data is untrusted.
326
-
327
- However, Loofah __never enables this Nokogiri configuration__; Loofah
328
- never enables DTDLOAD, and it never disables NONET, thereby protecting
329
- you by default from this XXE vulnerability.
330
-
331
-
332
317
  ## Related Links
333
318
 
319
+ * loofah-activerecord: https://github.com/flavorjones/loofah-activerecord
334
320
  * Nokogiri: http://nokogiri.org
335
321
  * libxml2: http://xmlsoft.org
336
- * html5lib: https://code.google.com/p/html5lib
322
+ * html5lib: https://github.com/html5lib/
337
323
 
338
324
 
339
325
  ## Authors
@@ -354,15 +340,14 @@ And a big shout-out to Corey Innis for the name, and feedback on the API.
354
340
 
355
341
  ## Thank You
356
342
 
357
- The following people have generously donated via the [Pledgie](http://pledgie.com) badge on the [Loofah github page](https://github.com/flavorjones/loofah):
343
+ The following people have generously funded Loofah:
358
344
 
359
345
  * Bill Harding
360
346
 
361
347
 
362
348
  ## Historical Note
363
349
 
364
- This library was formerly known as Dryopteris, which was a very bad
365
- name that nobody could spell properly.
350
+ This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.
366
351
 
367
352
 
368
353
  ## License
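The README's "HTML4 vs HTML5" note says the HTML5 parser is unavailable on JRuby and with Nokogiri versions older than 1.14.0. A minimal sketch of that rule in plain Ruby follows, written so it runs without Loofah installed. The helper names `html5_parser_available?` and `fragment_parser_name` are hypothetical and not part of Loofah; the library's own check is `Loofah.html5_support?`, which appears later in this diff.

```ruby
require "rubygems" # provides Gem::Version (already loaded in modern Ruby)

# Hypothetical helper, not a Loofah API: applies the two conditions the
# README gives for HTML5 support (not JRuby, Nokogiri >= 1.14.0).
def html5_parser_available?(nokogiri_version, ruby_engine = RUBY_ENGINE)
  return false if ruby_engine == "jruby"

  Gem::Version.new(nokogiri_version) >= Gem::Version.new("1.14.0")
end

# Hypothetical helper: names the explicit Loofah parser method to call,
# rather than relying on the `Loofah.fragment` alias, which may change
# parsers in a future release.
def fragment_parser_name(nokogiri_version, ruby_engine = RUBY_ENGINE)
  if html5_parser_available?(nokogiri_version, ruby_engine)
    :html5_fragment
  else
    :html4_fragment
  end
end
```

With Loofah loaded, the returned symbol could be dispatched as `Loofah.public_send(fragment_parser_name(Nokogiri::VERSION), html)`. Naming the parser explicitly keeps behavior stable if the unversioned aliases are later repointed.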
data/lib/loofah/concerns.rb ADDED
@@ -0,0 +1,207 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Loofah
4
+ #
5
+ # Mixes +scrub!+ into Document, DocumentFragment, Node and NodeSet.
6
+ #
7
+ # Traverse the document or fragment, invoking the +scrubber+ on each node.
8
+ #
9
+ # +scrubber+ must either be one of the symbols representing the built-in scrubbers (see
10
+ # Scrubbers), or a Scrubber instance.
11
+ #
12
+ # span2div = Loofah::Scrubber.new do |node|
13
+ # node.name = "div" if node.name == "span"
14
+ # end
15
+ # Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
16
+ # # => "<div>foo</div><p>bar</p>"
17
+ #
18
+ # or
19
+ #
20
+ # unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
21
+ # Loofah.html5_fragment(unsafe_html).scrub!(:strip).to_s
22
+ # # => "ohai! <div>div is safe</div> "
23
+ #
24
+ # Note that this method is called implicitly from the shortcuts Loofah.scrub_html5_fragment et
25
+ # al.
26
+ #
27
+ # Please see Scrubber for more information on implementation and traversal, and README.md for
28
+ # more example usage.
29
+ #
30
+ module ScrubBehavior
31
+ module Node # :nodoc:
32
+ def scrub!(scrubber)
33
+ #
34
+ # yes. this should be three separate methods. but nokogiri decorates (or not) based on
35
+ # whether the module name has already been included. and since documents get decorated just
36
+ # like their constituent nodes, we need to jam all the logic into a single module.
37
+ #
38
+ scrubber = ScrubBehavior.resolve_scrubber(scrubber)
39
+ case self
40
+ when Nokogiri::XML::Document
41
+ scrubber.traverse(root) if root
42
+ when Nokogiri::XML::DocumentFragment
43
+ children.scrub!(scrubber)
44
+ else
45
+ scrubber.traverse(self)
46
+ end
47
+ self
48
+ end
49
+ end
50
+
51
+ module NodeSet # :nodoc:
52
+ def scrub!(scrubber)
53
+ each { |node| node.scrub!(scrubber) }
54
+ self
55
+ end
56
+ end
57
+
58
+ class << self
59
+ def resolve_scrubber(scrubber) # :nodoc:
60
+ scrubber = Scrubbers::MAP[scrubber].new if Scrubbers::MAP[scrubber]
61
+ unless scrubber.is_a?(Loofah::Scrubber)
62
+ raise Loofah::ScrubberNotFound, "not a Scrubber or a scrubber name: #{scrubber.inspect}"
63
+ end
64
+
65
+ scrubber
66
+ end
67
+ end
68
+ end
69
+
70
+ #
71
+ # Overrides +text+ in Document and DocumentFragment classes, and mixes in +to_text+.
72
+ #
73
+ module TextBehavior
74
+ #
75
+ # Returns a plain-text version of the markup contained by the document, with HTML entities
76
+ # encoded.
77
+ #
78
+ # This method is significantly faster than #to_text, but isn't clever about whitespace around
79
+ # block elements.
80
+ #
81
+ # Loofah.html5_document("<h1>Title</h1><div>Content</div>").text
82
+ # # => "TitleContent"
83
+ #
84
+ # By default, the returned text will have HTML entities escaped. If you want unescaped
85
+ # entities, and you understand that the result is unsafe to render in a browser, then you can
86
+ # pass an argument as shown:
87
+ #
88
+ # frag = Loofah.html5_fragment("&lt;script&gt;alert('EVIL');&lt;/script&gt;")
89
+ # # ok for browser:
90
+ # frag.text # => "&lt;script&gt;alert('EVIL');&lt;/script&gt;"
91
+ # # decidedly not ok for browser:
92
+ # frag.text(:encode_special_chars => false) # => "<script>alert('EVIL');</script>"
93
+ #
94
+ def text(options = {})
95
+ result = if serialize_root
96
+ serialize_root.children.reject(&:comment?).map(&:inner_text).join("")
97
+ else
98
+ ""
99
+ end
100
+ if options[:encode_special_chars] == false
101
+ result # possibly dangerous if rendered in a browser
102
+ else
103
+ encode_special_chars(result)
104
+ end
105
+ end
106
+
107
+ alias_method :inner_text, :text
108
+ alias_method :to_str, :text
109
+
110
+ #
111
+ # Returns a plain-text version of the markup contained by the fragment, with HTML entities
112
+ # encoded.
113
+ #
114
+ # This method is slower than #text, but is clever about whitespace around block elements and
115
+ # line break elements.
116
+ #
117
+ # Loofah.html5_document("<h1>Title</h1><div>Content<br>Next line</div>").to_text
118
+ # # => "\nTitle\n\nContent\nNext line\n"
119
+ #
120
+ def to_text(options = {})
121
+ Loofah.remove_extraneous_whitespace(dup.scrub!(:newline_block_elements).text(options))
122
+ end
123
+ end
124
+
125
+ module DocumentDecorator # :nodoc:
126
+ def initialize(*args, &block)
127
+ super
128
+ decorators(Nokogiri::XML::Node) << ScrubBehavior::Node
129
+ decorators(Nokogiri::XML::NodeSet) << ScrubBehavior::NodeSet
130
+ end
131
+ end
132
+
133
+ module HtmlDocumentBehavior # :nodoc:
134
+ module ClassMethods
135
+ def parse(*args, &block)
136
+ remove_comments_before_html_element(super)
137
+ end
138
+
139
+ private
140
+
141
+ # remove comments that exist outside of the HTML element.
142
+ #
143
+ # these comments are allowed by the HTML spec:
144
+ #
145
+ # https://www.w3.org/TR/html401/struct/global.html#h-7.1
146
+ #
147
+ # but are not scrubbed by Loofah because these nodes don't meet
148
+ # the contract that scrubbers expect of a node (e.g., it can be
149
+ # replaced, sibling and children nodes can be created).
150
+ def remove_comments_before_html_element(doc)
151
+ doc.children.each do |child|
152
+ child.unlink if child.comment?
153
+ end
154
+ doc
155
+ end
156
+ end
157
+
158
+ class << self
159
+ def included(base)
160
+ base.extend(ClassMethods)
161
+ end
162
+ end
163
+
164
+ def serialize_root
165
+ at_xpath("/html/body")
166
+ end
167
+ end
168
+
169
+ module HtmlFragmentBehavior # :nodoc:
170
+ module ClassMethods
171
+ def parse(tags, encoding = nil)
172
+ doc = document_klass.new
173
+
174
+ encoding ||= tags.respond_to?(:encoding) ? tags.encoding.name : "UTF-8"
175
+ doc.encoding = encoding
176
+
177
+ new(doc, tags)
178
+ end
179
+
180
+ def document_klass
181
+ @document_klass ||= if Loofah.html5_support? && self == Loofah::HTML5::DocumentFragment
182
+ Loofah::HTML5::Document
183
+ elsif self == Loofah::HTML4::DocumentFragment
184
+ Loofah::HTML4::Document
185
+ else
186
+ raise ArgumentError, "unexpected class: #{self}"
187
+ end
188
+ end
189
+ end
190
+
191
+ class << self
192
+ def included(base)
193
+ base.extend(ClassMethods)
194
+ end
195
+ end
196
+
197
+ def to_s
198
+ serialize_root.children.to_s
199
+ end
200
+
201
+ alias_method :serialize, :to_s
202
+
203
+ def serialize_root
204
+ at_xpath("./body") || self
205
+ end
206
+ end
207
+ end
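The `resolve_scrubber` method in the file above accepts either a symbol naming a built-in scrubber or a scrubber instance, and raises for anything else. The pattern can be sketched in plain Ruby as below. Every class and constant here (`SketchScrubber`, `SketchStrip`, `SketchPrune`, `SketchScrubberNotFound`, `SKETCH_MAP`) is a hypothetical stand-in for `Loofah::Scrubber`, the built-in scrubbers, `Loofah::ScrubberNotFound`, and `Scrubbers::MAP`; none of them is Loofah code.

```ruby
# Hypothetical stand-ins for Loofah::Scrubber and two built-in scrubbers.
class SketchScrubber; end
class SketchStrip < SketchScrubber; end
class SketchPrune < SketchScrubber; end

# Hypothetical stand-in for Loofah::ScrubberNotFound.
class SketchScrubberNotFound < StandardError; end

# Plays the role of Scrubbers::MAP: symbol name => scrubber class.
SKETCH_MAP = { strip: SketchStrip, prune: SketchPrune }.freeze

# Returns a scrubber instance for a known symbol, passes an existing
# scrubber instance through unchanged, and raises for anything else.
def resolve_sketch_scrubber(scrubber)
  klass = SKETCH_MAP[scrubber]
  scrubber = klass.new if klass
  unless scrubber.is_a?(SketchScrubber)
    raise SketchScrubberNotFound, "not a Scrubber or a scrubber name: #{scrubber.inspect}"
  end

  scrubber
end
```

Resolving once at the top of `scrub!` means the traversal code only ever deals with scrubber objects, so a misspelled symbol fails immediately instead of partway through a document.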