loofah 0.4.2 → 2.25.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. checksums.yaml +7 -0
  2. data/CHANGELOG.md +604 -0
  3. data/MIT-LICENSE.txt +3 -1
  4. data/README.md +410 -0
  5. data/SECURITY.md +18 -0
  6. data/lib/loofah/concerns.rb +207 -0
  7. data/lib/loofah/elements.rb +98 -0
  8. data/lib/loofah/helpers.rb +91 -4
  9. data/lib/loofah/html4/document.rb +17 -0
  10. data/lib/loofah/html4/document_fragment.rb +15 -0
  11. data/lib/loofah/html5/document.rb +17 -0
  12. data/lib/loofah/html5/document_fragment.rb +15 -0
  13. data/lib/loofah/html5/libxml2_workarounds.rb +28 -0
  14. data/lib/loofah/html5/safelist.rb +1058 -0
  15. data/lib/loofah/html5/scrub.rb +211 -40
  16. data/lib/loofah/metahelpers.rb +18 -0
  17. data/lib/loofah/scrubber.rb +31 -13
  18. data/lib/loofah/scrubbers.rb +262 -31
  19. data/lib/loofah/version.rb +6 -0
  20. data/lib/loofah/xml/document.rb +2 -0
  21. data/lib/loofah/xml/document_fragment.rb +6 -9
  22. data/lib/loofah.rb +131 -52
  23. metadata +79 -158
  24. data/CHANGELOG.rdoc +0 -92
  25. data/DEPRECATED.rdoc +0 -12
  26. data/Manifest.txt +0 -34
  27. data/README.rdoc +0 -330
  28. data/Rakefile +0 -61
  29. data/TODO.rdoc +0 -4
  30. data/benchmark/benchmark.rb +0 -149
  31. data/benchmark/fragment.html +0 -96
  32. data/benchmark/helper.rb +0 -73
  33. data/benchmark/www.slashdot.com.html +0 -2560
  34. data/init.rb +0 -1
  35. data/lib/loofah/active_record.rb +0 -62
  36. data/lib/loofah/html/document.rb +0 -22
  37. data/lib/loofah/html/document_fragment.rb +0 -46
  38. data/lib/loofah/html5/whitelist.rb +0 -174
  39. data/lib/loofah/instance_methods.rb +0 -77
  40. data/lib/loofah/xss_foliate.rb +0 -212
  41. data/test/helper.rb +0 -8
  42. data/test/html5/test_sanitizer.rb +0 -248
  43. data/test/test_active_record.rb +0 -146
  44. data/test/test_ad_hoc.rb +0 -272
  45. data/test/test_api.rb +0 -128
  46. data/test/test_helpers.rb +0 -28
  47. data/test/test_scrubber.rb +0 -227
  48. data/test/test_scrubbers.rb +0 -144
  49. data/test/test_xss_foliate.rb +0 -171
  50. data.tar.gz.sig +0 -0
  51. metadata.gz.sig +0 -2
data/README.md ADDED
@@ -0,0 +1,410 @@
1
+ # Loofah
2
+
3
+ * https://github.com/flavorjones/loofah
4
+ * Docs: http://rubydoc.info/github/flavorjones/loofah/main/frames
5
+ * Mailing list: [loofah-talk@googlegroups.com](https://groups.google.com/forum/#!forum/loofah-talk)
6
+
7
+ ## Status
8
+
9
+ [![ci](https://github.com/flavorjones/loofah/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/flavorjones/loofah/actions/workflows/ci.yml)
10
+ [![Tidelift dependencies](https://tidelift.com/badges/package/rubygems/loofah)](https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=rubygems-loofah&utm_medium=referral&utm_campaign=readme)
11
+
12
+
13
+ ## Description
14
+
15
+ Loofah is a general library for manipulating and transforming HTML/XML documents and fragments, built on top of Nokogiri.
16
+
17
+ Loofah also includes some HTML sanitizers based on `html5lib`'s safelist, which are a specific application of the general transformation functionality.
18
+
19
+ Active Record extensions for HTML sanitization are available in the [`loofah-activerecord` gem](https://github.com/flavorjones/loofah-activerecord).
20
+
21
+
22
+ ## Features
23
+
24
+ * Easily write custom transformations for HTML and XML
25
+ * Common HTML sanitizing transformations are built-in:
26
+ * _Strip_ unsafe tags, leaving behind only the inner text.
27
+ * _Prune_ unsafe tags and their subtrees, removing all traces that they ever existed.
28
+ * _Escape_ unsafe tags and their subtrees, leaving behind lots of <tt>&lt;</tt> and <tt>&gt;</tt> entities.
29
+ * _Whitewash_ the markup, removing all attributes and namespaced nodes.
30
+ * Other common HTML transformations are built-in:
31
+ * Add the _nofollow_ attribute to all hyperlinks.
32
+ * Add the _target=\_blank_ attribute to all hyperlinks.
33
+ * Remove _unprintable_ characters from text nodes.
34
+ * Some specialized HTML transformations are also built-in:
35
+ * Where `<br><br>` exists inside a `p` tag, close the `p` and open a new one.
36
+ * Format markup as plain text, with (or without) sensible whitespace handling around block elements.
37
+ * Replace Rails's `strip_tags` and `sanitize` view helper methods.
38
+
39
+
40
+ ## Compare and Contrast
41
+
42
+ Loofah is both:
43
+
44
+ - a general framework for transforming XML, XHTML, and HTML documents
45
+ - a specific toolkit for HTML sanitization
46
+
47
+ ### General document transformation
48
+
49
+ Loofah tries to make it easy to write your own custom scrubbers for whatever document transformation you need. You don't like the built-in scrubbers? Build your own, like a boss.
50
+
51
+
52
+ ### HTML sanitization
53
+
54
+ Another Ruby library that provides HTML sanitization is [`rgrove/sanitize`](https://github.com/rgrove/sanitize), another library built on top of Nokogiri, which provides a bit more flexibility on the tags and attributes being scrubbed.
55
+
56
+ You may also want to look at [`rails/rails-html-sanitizer`](https://github.com/rails/rails-html-sanitizer) which is built on top of Loofah and provides some useful extensions and additional flexibility in the HTML sanitization.
57
+
58
+
59
+ ## The Basics
60
+
61
+ Loofah wraps [Nokogiri](http://nokogiri.org) in a loving embrace. Nokogiri is a stable, well-maintained parser for XML, HTML4, and HTML5.
62
+
63
+ Loofah implements the following classes:
64
+
65
+ * `Loofah::HTML5::Document`
66
+ * `Loofah::HTML5::DocumentFragment`
67
+ * `Loofah::HTML4::Document` (aliased as `Loofah::HTML::Document` for now)
68
+ * `Loofah::HTML4::DocumentFragment` (aliased as `Loofah::HTML::DocumentFragment` for now)
69
+ * `Loofah::XML::Document`
70
+ * `Loofah::XML::DocumentFragment`
71
+
72
+ These document and fragment classes are subclasses of the similarly-named Nokogiri classes `Nokogiri::HTML5::Document` et al.
73
+
74
+ Loofah also implements `Loofah::Scrubber`, which represents the document transformation, either by wrapping
75
+ a block,
76
+
77
+ ``` ruby
78
+ span2div = Loofah::Scrubber.new do |node|
79
+ node.name = "div" if node.name == "span"
80
+ end
81
+ ```
82
+
83
+ or by implementing a method.
84
+
85
+
86
+ ### Side Note: Fragments vs Documents
87
+
88
+ Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don't have a *document*, you have a *fragment*. For HTML, another rule of thumb is that *documents* have `html` and `body` tags, and *fragments* usually do not.
89
+
90
+ **HTML fragments** should be parsed with `Loofah.html5_fragment` or `Loofah.html4_fragment`. The result won't be wrapped in `html` or `body` tags, won't have a DOCTYPE declaration, `head` elements will be silently ignored, and multiple root nodes are allowed.
91
+
92
+ **HTML documents** should be parsed with `Loofah.html5_document` or `Loofah.html4_document`. The result will have a DOCTYPE declaration, along with `html`, `head` and `body` tags.
93
+
94
+ **XML fragments** should be parsed with `Loofah.xml_fragment`. The result won't have a DOCTYPE declaration, and multiple root nodes are allowed.
95
+
96
+ **XML documents** should be parsed with `Loofah.xml_document`. The result will have a DOCTYPE declaration and a single root node.
97
+
98
+
99
+ ### Side Note: HTML4 vs HTML5
100
+
101
+ ⚠ _HTML5 functionality is not available on JRuby, or with versions of Nokogiri `< 1.14.0`._
102
+
103
+ Currently, Loofah's methods `Loofah.document` and `Loofah.fragment` are aliases to `.html4_document` and `.html4_fragment`, which use Nokogiri's HTML4 parser. (Similarly, `Loofah::HTML::Document` and `Loofah::HTML::DocumentFragment` are aliased to `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`.)
104
+
105
+ **Please note** that in a future version of Loofah, these methods and classes may switch to using Nokogiri's HTML5 parser and classes on platforms that support it [1].
106
+
107
+ **We strongly recommend that you explicitly use `.html5_document` or `.html5_fragment`** unless you know of a compelling reason not to. If you are sure that you need to use the HTML4 parser, you should explicitly call `.html4_document` or `.html4_fragment` to avoid breakage in a future version.
108
+
109
+ [1]: [[feature request] HTML5 parser for JRuby implementation · Issue #2227 · sparklemotion/nokogiri](https://github.com/sparklemotion/nokogiri/issues/2227)
110
+
111
+
112
+ ### `Loofah::HTML5::Document` and `Loofah::HTML5::DocumentFragment`
113
+
114
+ These classes are subclasses of `Nokogiri::HTML5::Document` and `Nokogiri::HTML5::DocumentFragment`.
115
+
116
+ The module methods `Loofah.html5_document` and `Loofah.html5_fragment` will parse an HTML document and an HTML fragment, respectively.
117
+
118
+ ``` ruby
119
+ Loofah.html5_document(unsafe_html).is_a?(Nokogiri::HTML5::Document) # => true
120
+ Loofah.html5_fragment(unsafe_html).is_a?(Nokogiri::HTML5::DocumentFragment) # => true
121
+ ```
122
+
123
+ Loofah injects a `scrub!` method, which takes either a symbol (for built-in scrubbers) or a `Loofah::Scrubber` object (for custom scrubbers), and modifies the document in-place.
124
+
125
+ Loofah overrides `to_s` to return HTML:
126
+
127
+ ``` ruby
128
+ unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
129
+
130
+ doc = Loofah.html5_fragment(unsafe_html).scrub!(:prune)
131
+ doc.to_s # => "ohai! <div>div is safe</div> "
132
+ ```
133
+
134
+ and `text` to return plain text:
135
+
136
+ ``` ruby
137
+ doc.text # => "ohai! div is safe "
138
+ ```
139
+
140
+ Also, `to_text` is available, which does the right thing with whitespace around block-level and line break elements.
141
+
142
+ ``` ruby
143
+ doc = Loofah.html5_fragment("<h1>Title</h1><div>Content<br>Next line</div>")
144
+ doc.text # => "TitleContentNext line" # probably not what you want
145
+ doc.to_text # => "\nTitle\n\nContent\nNext line\n" # better
146
+ ```
147
+
148
+ ### `Loofah::HTML4::Document` and `Loofah::HTML4::DocumentFragment`
149
+
150
+ These classes are subclasses of `Nokogiri::HTML4::Document` and `Nokogiri::HTML4::DocumentFragment`.
151
+
152
+ The module methods `Loofah.html4_document` and `Loofah.html4_fragment` will parse an HTML document and an HTML fragment, respectively.
153
+
154
+ ``` ruby
155
+ Loofah.html4_document(unsafe_html).is_a?(Nokogiri::HTML4::Document) # => true
156
+ Loofah.html4_fragment(unsafe_html).is_a?(Nokogiri::HTML4::DocumentFragment) # => true
157
+ ```
158
+
159
+ ### `Loofah::XML::Document` and `Loofah::XML::DocumentFragment`
160
+
161
+ These classes are subclasses of `Nokogiri::XML::Document` and `Nokogiri::XML::DocumentFragment`.
162
+
163
+ The module methods `Loofah.xml_document` and `Loofah.xml_fragment` will parse an XML document and an XML fragment, respectively.
164
+
165
+ ``` ruby
166
+ Loofah.xml_document(bad_xml).is_a?(Nokogiri::XML::Document) # => true
167
+ Loofah.xml_fragment(bad_xml).is_a?(Nokogiri::XML::DocumentFragment) # => true
168
+ ```
169
+
170
+ ### Nodes and Node Sets
171
+
172
+ Nokogiri's `Node` and `NodeSet` classes also get a `scrub!` method, which makes it easy to scrub subtrees.
173
+
174
+ The following code will apply the `employee_scrubber` only to the `employee` nodes (and their subtrees) in the document:
175
+
176
+ ``` ruby
177
+ Loofah.xml_document(bad_xml).xpath("//employee").scrub!(employee_scrubber)
178
+ ```
179
+
180
+ And this code will only scrub the first `employee` node and its subtree:
181
+
182
+ ``` ruby
183
+ Loofah.xml_document(bad_xml).at_xpath("//employee").scrub!(employee_scrubber)
184
+ ```
185
+
186
+ ### `Loofah::Scrubber`
187
+
188
+ A Scrubber wraps up a block (or method) that is run on a document node:
189
+
190
+ ``` ruby
191
+ # change all <span> tags to <div> tags
192
+ span2div = Loofah::Scrubber.new do |node|
193
+ node.name = "div" if node.name == "span"
194
+ end
195
+ ```
196
+
197
+ This can then be run on a document:
198
+
199
+ ``` ruby
200
+ Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
201
+ # => "<div>foo</div><p>bar</p>"
202
+ ```
203
+
204
+ Scrubbers can be run on a document in either a top-down traversal (the default) or bottom-up. Top-down scrubbers can optionally return `Scrubber::STOP` to terminate the traversal of a subtree. Read below and in the `Loofah::Scrubber` class for more detailed usage.
205
+
206
+ Here's an XML example:
207
+
208
+ ``` ruby
209
+ # remove all <employee> tags that have a "deceased" attribute set to true
210
+ bring_out_your_dead = Loofah::Scrubber.new do |node|
211
+ if node.name == "employee" and node["deceased"] == "true"
212
+ node.remove
213
+ Loofah::Scrubber::STOP # don't bother with the rest of the subtree
214
+ end
215
+ end
216
+ Loofah.xml_document(File.read('plague.xml')).scrub!(bring_out_your_dead)
217
+ ```
218
+
219
+ ### Built-In HTML Scrubbers
220
+
221
+ Loofah comes with a set of sanitizing scrubbers that use `html5lib`'s safelist algorithm:
222
+
223
+ ``` ruby
224
+ doc = Loofah.html5_document(input)
225
+ doc.scrub!(:strip) # replaces unknown/unsafe tags with their inner text
226
+ doc.scrub!(:prune) # removes unknown/unsafe tags and their children
227
+ doc.scrub!(:escape) # escapes unknown/unsafe tags, like this: &lt;script&gt;
228
+ doc.scrub!(:whitewash) # removes unknown/unsafe/namespaced tags and their children,
229
+ # and strips all node attributes
230
+ ```
231
+
232
+ Loofah also comes with built-in scrubbers for some common transformation tasks:
233
+
234
+ ``` ruby
235
+ doc.scrub!(:nofollow) # adds rel="nofollow" attribute to links
236
+ doc.scrub!(:noopener) # adds rel="noopener" attribute to links
237
+ doc.scrub!(:noreferrer) # adds rel="noreferrer" attribute to links
238
+ doc.scrub!(:unprintable) # removes unprintable characters from text nodes
239
+ doc.scrub!(:targetblank) # adds target="_blank" attribute to links
240
+ doc.scrub!(:double_breakpoint) # where `<br><br>` appears in a `p` tag, close the `p` and open a new one
241
+ ```
242
+
243
+ See `Loofah::Scrubbers` for more details and example usage.
244
+
245
+
246
+ ### Chaining Scrubbers
247
+
248
+ You can chain scrubbers:
249
+
250
+ ``` ruby
251
+ Loofah.html5_fragment("<span>hello</span> <script>alert('OHAI')</script>") \
252
+ .scrub!(:prune) \
253
+ .scrub!(span2div).to_s
254
+ # => "<div>hello</div> "
255
+ ```
256
+
257
+ ### Shorthand
258
+
259
+ The class methods `Loofah.scrub_html5_fragment` and `Loofah.scrub_html5_document` (and the corresponding HTML4 methods) are shorthand.
260
+
261
+ These methods:
262
+
263
+ ``` ruby
264
+ Loofah.scrub_html5_fragment(unsafe_html, :prune)
265
+ Loofah.scrub_html5_document(unsafe_html, :prune)
266
+ Loofah.scrub_html4_fragment(unsafe_html, :prune)
267
+ Loofah.scrub_html4_document(unsafe_html, :prune)
268
+ Loofah.scrub_xml_fragment(bad_xml, custom_scrubber)
269
+ Loofah.scrub_xml_document(bad_xml, custom_scrubber)
270
+ ```
271
+
272
+ do the same thing as (and are arguably semantically clearer than):
273
+
274
+ ``` ruby
275
+ Loofah.html5_fragment(unsafe_html).scrub!(:prune)
276
+ Loofah.html5_document(unsafe_html).scrub!(:prune)
277
+ Loofah.html4_fragment(unsafe_html).scrub!(:prune)
278
+ Loofah.html4_document(unsafe_html).scrub!(:prune)
279
+ Loofah.xml_fragment(bad_xml).scrub!(custom_scrubber)
280
+ Loofah.xml_document(bad_xml).scrub!(custom_scrubber)
281
+ ```
282
+
283
+
284
+ ### View Helpers
285
+
286
+ Loofah has two "view helpers": `Loofah::Helpers.sanitize` and `Loofah::Helpers.strip_tags`, both of which are drop-in replacements for the Rails Action View helpers of the same name.
287
+
288
+ These are not required automatically. You must require `loofah/helpers` to use them.
289
+
290
+
291
+ ## Requirements
292
+
293
+ * Nokogiri >= 1.5.9
294
+
295
+
296
+ ## Installation
297
+
298
+ Unsurprisingly:
299
+
300
+ > gem install loofah
301
+
302
+ Requirements:
303
+
304
+ * Ruby >= 2.5
305
+
306
+
307
+ ## Support
308
+
309
+ The bug tracker is available here:
310
+
311
+ * https://github.com/flavorjones/loofah/issues
312
+
313
+ And the mailing list is on Google Groups:
314
+
315
+ * Mail: loofah-talk@googlegroups.com
316
+ * Archive: https://groups.google.com/forum/#!forum/loofah-talk
317
+
318
+ Consider subscribing to [Tidelift][tidelift] which provides license assurances and timely security notifications for your open source dependencies, including Loofah. [Tidelift][tidelift] subscriptions also help the Loofah maintainers fund our [automated testing](https://ci.nokogiri.org) which in turn allows us to ship releases, bugfixes, and security updates more often.
319
+
320
+ [tidelift]: https://tidelift.com/subscription/pkg/rubygems-loofah?utm_source=undefined&utm_medium=referral&utm_campaign=enterprise
321
+
322
+
323
+ ## Security
324
+
325
+ See [`SECURITY.md`](SECURITY.md) for vulnerability reporting details.
326
+
327
+
328
+ ## Related Links
329
+
330
+ * loofah-activerecord: https://github.com/flavorjones/loofah-activerecord
331
+ * Nokogiri: http://nokogiri.org
332
+ * libxml2: http://xmlsoft.org
333
+ * html5lib: https://github.com/html5lib/
334
+
335
+
336
+ ## Authors
337
+
338
+ * [Mike Dalessio](http://mike.daless.io) ([@flavorjones](https://twitter.com/flavorjones))
339
+ * Bryan Helmkamp
340
+
341
+ Featuring code contributed by:
342
+
343
+ * [@flavorjones](https://github.com/flavorjones)
344
+ * [@brynary](https://github.com/brynary)
345
+ * [@olleolleolle](https://github.com/olleolleolle)
346
+ * [@JuanitoFatas](https://github.com/JuanitoFatas)
347
+ * [@kaspth](https://github.com/kaspth)
348
+ * [@tenderlove](https://github.com/tenderlove)
349
+ * [@ktdreyer](https://github.com/ktdreyer)
350
+ * [@orien](https://github.com/orien)
351
+ * [@asok](https://github.com/asok)
352
+ * [@junaruga](https://github.com/junaruga)
353
+ * [@MothOnMars](https://github.com/MothOnMars)
354
+ * [@nick-desteffen](https://github.com/nick-desteffen)
355
+ * [@NikoRoberts](https://github.com/NikoRoberts)
356
+ * [@trans](https://github.com/trans)
357
+ * [@andreynering](https://github.com/andreynering)
358
+ * [@aried3r](https://github.com/aried3r)
359
+ * [@baopham](https://github.com/baopham)
360
+ * [@batter](https://github.com/batter)
361
+ * [@brendon](https://github.com/brendon)
362
+ * [@cjba7](https://github.com/cjba7)
363
+ * [@christiankisssner](https://github.com/christiankisssner)
364
+ * [@dacort](https://github.com/dacort)
365
+ * [@danfstucky](https://github.com/danfstucky)
366
+ * [@david-a-wheeler](https://github.com/david-a-wheeler)
367
+ * [@dharamgollapudi](https://github.com/dharamgollapudi)
368
+ * [@georgeclaghorn](https://github.com/georgeclaghorn)
369
+ * [@gogainda](https://github.com/gogainda)
370
+ * [@jaredbeck](https://github.com/jaredbeck)
371
+ * [@ThatHurleyGuy](https://github.com/ThatHurleyGuy)
372
+ * [@jstorimer](https://github.com/jstorimer)
373
+ * [@jbarnette](https://github.com/jbarnette)
374
+ * [@queso](https://github.com/queso)
375
+ * [@technicalpickles](https://github.com/technicalpickles)
376
+ * [@kyoshidajp](https://github.com/kyoshidajp)
377
+ * [@kristianfreeman](https://github.com/kristianfreeman)
378
+ * [@louim](https://github.com/louim)
379
+ * [@mrpasquini](https://github.com/mrpasquini)
380
+ * [@olivierlacan](https://github.com/olivierlacan)
381
+ * [@pauldix](https://github.com/pauldix)
382
+ * [@sampokuokkanen](https://github.com/sampokuokkanen)
383
+ * [@stefannibrasil](https://github.com/stefannibrasil)
384
+ * [@tastycode](https://github.com/tastycode)
385
+ * [@vipulnsward](https://github.com/vipulnsward)
386
+ * [@joncalhoun](https://github.com/joncalhoun)
387
+ * [@ahorek](https://github.com/ahorek)
388
+ * [@rmacklin](https://github.com/rmacklin)
389
+ * [@y-yagi](https://github.com/y-yagi)
390
+ * [@lazyatom](https://github.com/lazyatom)
391
+
392
+ And a big shout-out to Corey Innis for the name, and feedback on the API.
393
+
394
+
395
+ ## Thank You
396
+
397
+ The following people have generously funded Loofah with financial sponsorship:
398
+
399
+ * Bill Harding
400
+ * [Sentry](https://sentry.io/) @getsentry
401
+
402
+
403
+ ## Historical Note
404
+
405
+ This library was once named "Dryopteris", which was a very bad name that nobody could spell properly.
406
+
407
+
408
+ ## License
409
+
410
+ Distributed under the MIT License. See `MIT-LICENSE.txt` for details.
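
Reviewer note on the README above: the `Loofah::Scrubber` section says a top-down scrubber may return `Scrubber::STOP` to end traversal of a subtree. The sketch below illustrates that control flow only. It is plain Ruby with no Loofah or Nokogiri dependency; `SketchNode`, `STOP`, and `traverse_top_down` are hypothetical stand-ins invented for this note, not the library's classes or its actual traversal code.

```ruby
# Stand-in tree node; NOT a Nokogiri or Loofah class.
SketchNode = Struct.new(:name, :attrs, :children)

# Hypothetical sentinel playing the role of Loofah::Scrubber::STOP.
STOP = :stop

# Visit a node before its children; skip the subtree when the block returns STOP.
def traverse_top_down(node, &block)
  return if block.call(node) == STOP

  node.children.each { |child| traverse_top_down(child, &block) }
end

tree = SketchNode.new("staff", {}, [
  SketchNode.new("employee", { "deceased" => "true" }, [SketchNode.new("name", {}, [])]),
  SketchNode.new("employee", {}, [SketchNode.new("name", {}, [])]),
])

visited = []
traverse_top_down(tree) do |node|
  visited << node.name
  # Returning STOP means the children of this node are never visited.
  STOP if node.name == "employee" && node.attrs["deceased"] == "true"
end

p visited # => ["staff", "employee", "employee", "name"]
```

Only one `name` node is visited: the one under the first `employee` is skipped because the block returned `STOP` for its parent.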
data/SECURITY.md ADDED
@@ -0,0 +1,18 @@
1
+ # Security and Vulnerability Reporting
2
+
3
+ The Loofah core contributors take security very seriously and investigate all reported vulnerabilities.
4
+
5
+ If you would like to report a vulnerability or have a security concern regarding Loofah, please [report it via HackerOne](https://hackerone.com/loofah/reports/new).
6
+
7
+ Your report will be acknowledged within 24 hours, and you'll receive a more detailed response within 72 hours indicating next steps in handling your report.
8
+
9
+ If you have not received a reply to your submission within 48 hours, there are a few steps you can take:
10
+
11
+ * Contact the current security coordinator (Mike Dalessio <mike.dalessio@gmail.com>)
12
+ * Email the Loofah user group at loofah-talk@googlegroups.com (archive at https://groups.google.com/forum/#!forum/loofah-talk)
13
+
14
+ Please note, the user group list is a public area. When escalating in that venue, please do not discuss your issue. Simply say that you're trying to get a hold of someone from the core team.
15
+
16
+ The information you share with the Loofah core contributors as part of this process will be kept confidential within the team, unless or until we need to share information upstream with our dependent libraries' core teams, at which point we will notify you.
17
+
18
+ If a vulnerability is first reported by you, we will credit you with the discovery in the public disclosure.
data/lib/loofah/concerns.rb ADDED
@@ -0,0 +1,207 @@
1
+ # frozen_string_literal: true
2
+
3
+ module Loofah
4
+ #
5
+ # Mixes +scrub!+ into Document, DocumentFragment, Node and NodeSet.
6
+ #
7
+ # Traverse the document or fragment, invoking the +scrubber+ on each node.
8
+ #
9
+ # +scrubber+ must either be one of the symbols representing the built-in scrubbers (see
10
+ # Scrubbers), or a Scrubber instance.
11
+ #
12
+ # span2div = Loofah::Scrubber.new do |node|
13
+ # node.name = "div" if node.name == "span"
14
+ # end
15
+ # Loofah.html5_fragment("<span>foo</span><p>bar</p>").scrub!(span2div).to_s
16
+ # # => "<div>foo</div><p>bar</p>"
17
+ #
18
+ # or
19
+ #
20
+ # unsafe_html = "ohai! <div>div is safe</div> <script>but script is not</script>"
21
+ # Loofah.html5_fragment(unsafe_html).scrub!(:prune).to_s
22
+ # # => "ohai! <div>div is safe</div> "
23
+ #
24
+ # Note that this method is called implicitly from the shortcuts Loofah.scrub_html5_fragment et
25
+ # al.
26
+ #
27
+ # Please see Scrubber for more information on implementation and traversal, and README.md for
28
+ # more example usage.
29
+ #
30
+ module ScrubBehavior
31
+ module Node # :nodoc:
32
+ def scrub!(scrubber)
33
+ #
34
+ # yes. this should be three separate methods. but nokogiri decorates (or not) based on
35
+ # whether the module name has already been included. and since documents get decorated just
36
+ # like their constituent nodes, we need to jam all the logic into a single module.
37
+ #
38
+ scrubber = ScrubBehavior.resolve_scrubber(scrubber)
39
+ case self
40
+ when Nokogiri::XML::Document
41
+ scrubber.traverse(root) if root
42
+ when Nokogiri::XML::DocumentFragment
43
+ children.scrub!(scrubber)
44
+ else
45
+ scrubber.traverse(self)
46
+ end
47
+ self
48
+ end
49
+ end
50
+
51
+ module NodeSet # :nodoc:
52
+ def scrub!(scrubber)
53
+ each { |node| node.scrub!(scrubber) }
54
+ self
55
+ end
56
+ end
57
+
58
+ class << self
59
+ def resolve_scrubber(scrubber) # :nodoc:
60
+ scrubber = Scrubbers::MAP[scrubber].new if Scrubbers::MAP[scrubber]
61
+ unless scrubber.is_a?(Loofah::Scrubber)
62
+ raise Loofah::ScrubberNotFound, "not a Scrubber or a scrubber name: #{scrubber.inspect}"
63
+ end
64
+
65
+ scrubber
66
+ end
67
+ end
68
+ end
69
+
70
+ #
71
+ # Overrides +text+ in Document and DocumentFragment classes, and mixes in +to_text+.
72
+ #
73
+ module TextBehavior
74
+ #
75
+ # Returns a plain-text version of the markup contained by the document, with HTML entities
76
+ # encoded.
77
+ #
78
+ # This method is significantly faster than #to_text, but isn't clever about whitespace around
79
+ # block elements.
80
+ #
81
+ # Loofah.html5_document("<h1>Title</h1><div>Content</div>").text
82
+ # # => "TitleContent"
83
+ #
84
+ # By default, the returned text will have HTML entities escaped. If you want unescaped
85
+ # entities, and you understand that the result is unsafe to render in a browser, then you can
86
+ # pass an argument as shown:
87
+ #
88
+ # frag = Loofah.html5_fragment("&lt;script&gt;alert('EVIL');&lt;/script&gt;")
89
+ # # ok for browser:
90
+ # frag.text # => "&lt;script&gt;alert('EVIL');&lt;/script&gt;"
91
+ # # decidedly not ok for browser:
92
+ # frag.text(:encode_special_chars => false) # => "<script>alert('EVIL');</script>"
93
+ #
94
+ def text(options = {})
95
+ result = if serialize_root
96
+ serialize_root.children.reject(&:comment?).map(&:inner_text).join("")
97
+ else
98
+ ""
99
+ end
100
+ if options[:encode_special_chars] == false
101
+ result # possibly dangerous if rendered in a browser
102
+ else
103
+ encode_special_chars(result)
104
+ end
105
+ end
106
+
107
+ alias_method :inner_text, :text
108
+ alias_method :to_str, :text
109
+
110
+ #
111
+ # Returns a plain-text version of the markup contained by the fragment, with HTML entities
112
+ # encoded.
113
+ #
114
+ # This method is slower than #text, but is clever about whitespace around block elements and
115
+ # line break elements.
116
+ #
117
+ # Loofah.html5_document("<h1>Title</h1><div>Content<br>Next line</div>").to_text
118
+ # # => "\nTitle\n\nContent\nNext line\n"
119
+ #
120
+ def to_text(options = {})
121
+ Loofah.remove_extraneous_whitespace(dup.scrub!(:newline_block_elements).text(options))
122
+ end
123
+ end
124
+
125
+ module DocumentDecorator # :nodoc:
126
+ def initialize(*args, &block)
127
+ super
128
+ decorators(Nokogiri::XML::Node) << ScrubBehavior::Node
129
+ decorators(Nokogiri::XML::NodeSet) << ScrubBehavior::NodeSet
130
+ end
131
+ end
132
+
133
+ module HtmlDocumentBehavior # :nodoc:
134
+ module ClassMethods
135
+ def parse(*args, &block)
136
+ remove_comments_before_html_element(super)
137
+ end
138
+
139
+ private
140
+
141
+ # remove comments that exist outside of the HTML element.
142
+ #
143
+ # these comments are allowed by the HTML spec:
144
+ #
145
+ # https://www.w3.org/TR/html401/struct/global.html#h-7.1
146
+ #
147
+ # but are not scrubbed by Loofah because these nodes don't meet
148
+ # the contract that scrubbers expect of a node (e.g., it can be
149
+ # replaced, sibling and children nodes can be created).
150
+ def remove_comments_before_html_element(doc)
151
+ doc.children.each do |child|
152
+ child.unlink if child.comment?
153
+ end
154
+ doc
155
+ end
156
+ end
157
+
158
+ class << self
159
+ def included(base)
160
+ base.extend(ClassMethods)
161
+ end
162
+ end
163
+
164
+ def serialize_root
165
+ at_xpath("/html/body")
166
+ end
167
+ end
168
+
169
+ module HtmlFragmentBehavior # :nodoc:
170
+ module ClassMethods
171
+ def parse(tags, encoding = nil)
172
+ doc = document_klass.new
173
+
174
+ encoding ||= tags.respond_to?(:encoding) ? tags.encoding.name : "UTF-8"
175
+ doc.encoding = encoding
176
+
177
+ new(doc, tags)
178
+ end
179
+
180
+ def document_klass
181
+ @document_klass ||= if Loofah.html5_support? && self == Loofah::HTML5::DocumentFragment
182
+ Loofah::HTML5::Document
183
+ elsif self == Loofah::HTML4::DocumentFragment
184
+ Loofah::HTML4::Document
185
+ else
186
+ raise ArgumentError, "unexpected class: #{self}"
187
+ end
188
+ end
189
+ end
190
+
191
+ class << self
192
+ def included(base)
193
+ base.extend(ClassMethods)
194
+ end
195
+ end
196
+
197
+ def to_s
198
+ serialize_root.children.to_s
199
+ end
200
+
201
+ alias_method :serialize, :to_s
202
+
203
+ def serialize_root
204
+ at_xpath("./body") || self
205
+ end
206
+ end
207
+ end
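
Reviewer note on `ScrubBehavior.resolve_scrubber` above: it accepts either a symbol that names a built-in scrubber or an existing scrubber instance, and raises for anything else. The sketch below reproduces that lookup pattern in plain Ruby so it can be run without Loofah installed. Every name in it (`SketchScrubber`, `SketchPrune`, `SketchStrip`, `SKETCH_MAP`, `SketchScrubberNotFound`, `resolve_sketch_scrubber`) is a hypothetical stand-in for this note and should not be read as the library's API.

```ruby
# Hypothetical stand-ins; none of these are Loofah's real classes.
class SketchScrubber; end
class SketchPrune < SketchScrubber; end
class SketchStrip < SketchScrubber; end
class SketchScrubberNotFound < StandardError; end

# Symbol-to-class registry, playing the role that Scrubbers::MAP plays in the diff.
SKETCH_MAP = { prune: SketchPrune, strip: SketchStrip }.freeze

def resolve_sketch_scrubber(scrubber)
  # A known name is swapped for a fresh instance of the mapped class.
  scrubber = SKETCH_MAP[scrubber].new if SKETCH_MAP[scrubber]

  # Anything that is still not a scrubber instance is rejected.
  unless scrubber.is_a?(SketchScrubber)
    raise SketchScrubberNotFound, "not a scrubber or a scrubber name: #{scrubber.inspect}"
  end

  scrubber
end

by_name = resolve_sketch_scrubber(:prune) # built from the registry
custom = SketchScrubber.new
passed_through = resolve_sketch_scrubber(custom) # returned unchanged

error_message = begin
  resolve_sketch_scrubber(:bogus)
  nil
rescue SketchScrubberNotFound => e
  e.message
end
```

Resolving once at the entry point means the traversal code that follows only ever deals with scrubber objects, which appears to be why `scrub!` calls `resolve_scrubber` before its `case` statement.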