dandruff 0.8.0

data/README.md ADDED
@@ -0,0 +1,1196 @@
1
+ # Dandruff
2
+
3
+ **Because your markup deserves a good shampoo.**
4
+
5
+ If you're scratching your head because of XSS and your HTML is flaking with `<script>alert('gotcha')</script>`, it's time to wash that mess out!
6
+
7
+
8
+ ```ruby
9
+ clean_html = Dandruff.scrub(dirty_html)
10
+ ```
11
+
12
+ ## Introduction
13
+
14
+ Dandruff is a Ruby HTML sanitizer providing comprehensive XSS protection with an idiomatic, developer-friendly API. It's built on the battle-tested security foundations of [DOMPurify](https://github.com/cure53/DOMPurify), bringing proven XSS defense to the Ruby ecosystem. Whether you're sanitizing user comments, rendering rich content, or processing HTML emails, Dandruff can help you keep your markup clean and secure.
15
+
16
+ ### Key Features
17
+
18
+ - **Comprehensive XSS Protection** - Defends against XSS, mXSS, DOM clobbering, and protocol injection
19
+ - **Flexible Configuration** - Fine-grained control over tags, attributes, and sanitization behavior
20
+ - **Content Type Profiles** - Pre-configured settings for HTML, SVG, MathML, and HTML email
21
+ - **Hook System** - Extend sanitization with custom processing logic
22
+ - **Developer-Friendly API** - Intuitive Ruby idioms with block-based configuration
23
+ - **Performance Optimized** - Efficient multi-pass sanitization with configurable limits
24
+ - **Battle-Tested** - Based on DOMPurify's proven security model
25
+
26
+ ## 🚀 Quickstart
27
+
28
+ Install the gem:
29
+
30
+ ```bash
31
+ gem install dandruff
32
+ ```
33
+
34
+ or with Bundler:
35
+
36
+ ```ruby
37
+ # in your Gemfile
38
+ gem 'dandruff'
39
+
40
+ # then run from your shell:
41
+ #   bundle install
42
+ ```
43
+
44
+ Then in your controller or wherever you need to sanitize HTML:
45
+
46
+ ```ruby
47
+ safe_comment = Dandruff.scrub(params[:comment], allowed_tags: ['p', 'strong', 'em'])
48
+ ```
49
+
50
+ ## ⚙️ Configuration
51
+
52
+ Dandruff offers four ways to configure sanitization: block-based configuration, direct configuration, per-call options on an instance, and options passed to the `Dandruff.scrub` class method.
53
+
54
+ ### Configuration Styles
55
+
56
+ ```ruby
57
+ # 1. Block-based configuration (recommended for instances)
58
+ dandruff = Dandruff.new do |config|
59
+ config.allowed_tags = ['p', 'strong', 'em']
60
+ config.allowed_attributes = ['class', 'href']
61
+ end
62
+
63
+ # 2. Direct configuration
64
+ dandruff = Dandruff.new
65
+ dandruff.set_config(
66
+ allowed_tags: ['p', 'strong'],
67
+ allowed_attributes: ['class']
68
+ )
69
+
70
+ # 3. Per-call configuration
71
+ dandruff = Dandruff.new
72
+ dandruff.scrub(html, allowed_tags: ['p'], allowed_attributes: ['class'])
73
+
74
+ # 4. Class method with configuration
75
+ clean = Dandruff.scrub(html, allowed_tags: ['p', 'strong'])
76
+ ```
77
+
78
+ ### Common Configuration Patterns
79
+
80
+ #### Restrict to specific tags and attributes
81
+
82
+ ```ruby
83
+ dandruff = Dandruff.new do |config|
84
+ config.allowed_tags = ['p', 'strong', 'em', 'a']
85
+ config.allowed_attributes = ['href', 'title']
86
+ end
87
+ ```
88
+
89
+ #### Extend defaults instead of replacing
90
+
91
+ ```ruby
92
+ dandruff = Dandruff.new do |config|
93
+ config.additional_tags = ['custom-element']
94
+ config.additional_attributes = ['data-custom-id']
95
+ end
96
+ ```
97
+
98
+ #### Block specific tags or attributes
99
+
100
+ ```ruby
101
+ dandruff = Dandruff.new do |config|
102
+ config.forbidden_tags = ['script', 'iframe']
103
+ config.forbidden_attributes = ['onclick', 'onerror']
104
+ end
105
+ ```
106
+
107
+ → See [Configuration Reference](#configuration-reference) for all available options.
108
+
109
+ ## 📖 Usage
110
+
111
+ ### Simple Use Cases
112
+
113
+ #### Sanitize User Comments
114
+
115
+ ```ruby
116
+ # Basic text formatting only
117
+ dandruff = Dandruff.new do |config|
118
+ config.allowed_tags = ['p', 'br', 'strong', 'em', 'a']
119
+ config.allowed_attributes = ['href']
120
+ config.forbidden_attributes = ['onclick', 'onerror']
121
+ end
122
+
123
+ comment = params[:comment]
124
+ safe_comment = dandruff.scrub(comment)
125
+ ```
126
+
127
+ #### Sanitize Markdown-Generated HTML
128
+
129
+ ```ruby
130
+ # Allow rich formatting from Markdown
131
+ dandruff = Dandruff.new do |config|
132
+ config.allowed_tags = [
133
+ 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
134
+ 'p', 'br', 'strong', 'em', 'code', 'pre',
135
+ 'ul', 'ol', 'li', 'blockquote', 'a'
136
+ ]
137
+ config.allowed_attributes = ['href', 'title']
138
+ end
139
+
140
+ html = markdown_renderer.render(params[:content])
141
+ safe_html = dandruff.scrub(html)
142
+ ```
143
+
144
+ #### Sanitize Blog Post Content
145
+
146
+ ```ruby
147
+ # Rich content with images
148
+ dandruff = Dandruff.new do |config|
149
+ config.allowed_tags = [
150
+ 'p', 'br', 'strong', 'em', 'ul', 'ol', 'li',
151
+ 'h2', 'h3', 'blockquote', 'code', 'pre',
152
+ 'a', 'img'
153
+ ]
154
+ config.allowed_attributes = ['href', 'title', 'src', 'alt', 'class']
155
+ config.allow_data_uri = false # Block data URIs for images
156
+ end
157
+
158
+ post_html = params[:post][:content]
159
+ safe_html = dandruff.scrub(post_html)
160
+ ```
161
+
162
+ ### Intermediate Use Cases
163
+
164
+ #### Using Profiles for Content Types
165
+
166
+ Profiles are pre-configured sets of tags and attributes for common content types:
167
+
168
+ ```ruby
169
+ # HTML content profile
170
+ dandruff = Dandruff.new do |config|
171
+ config.use_profiles = { html: true }
172
+ end
173
+
174
+ # SVG graphics
175
+ dandruff = Dandruff.new do |config|
176
+ config.use_profiles = { html: true, svg: true }
177
+ end
178
+
179
+ # Mathematical content
180
+ dandruff = Dandruff.new do |config|
181
+ config.use_profiles = { html: true, math_ml: true }
182
+ end
183
+
184
+ # Combine multiple profiles
185
+ dandruff = Dandruff.new do |config|
186
+ config.use_profiles = { html: true, svg: true, math_ml: true }
187
+ end
188
+ ```
189
+
190
+ #### HTML Email Sanitization
191
+
192
+ HTML emails require special handling with legacy attributes:
193
+
194
+ ```ruby
195
+ dandruff = Dandruff.new do |config|
196
+ config.use_profiles = { html_email: true }
197
+ end
198
+
199
+ email_html = message.html_part.body.to_s
200
+ safe_email = dandruff.scrub(email_html)
201
+ ```
202
+
203
+ The `html_email` profile:
204
+ - Allows document structure tags (`head`, `meta`, `style`)
205
+ - Permits legacy presentation attributes (`bgcolor`, `cellpadding`, `align`, etc.)
206
+ - Uses per-tag attribute restrictions for security
207
+ - Allows style tags with content sanitization
208
+ - Excludes form elements and scripts
209
+
210
+ #### Per-Tag Attribute Control
211
+
212
+ Restrict which attributes are allowed on specific tags for maximum security:
213
+
214
+ ```ruby
215
+ dandruff = Dandruff.new do |config|
216
+ config.allowed_attributes_per_tag = {
217
+ 'a' => ['href', 'title', 'target'],
218
+ 'img' => ['src', 'alt', 'width', 'height'],
219
+ 'table' => ['border', 'cellpadding', 'cellspacing'],
220
+ 'td' => ['colspan', 'rowspan'],
221
+ 'th' => ['colspan', 'rowspan', 'scope']
222
+ }
223
+ end
224
+
225
+ # Only specified attributes allowed on each tag
226
+ html = '<a href="/page" onclick="alert()">Link</a>'
227
+ dandruff.scrub(html)
228
+ # => '<a href="/page">Link</a>' (onclick removed)
229
+
230
+ html = '<img src="pic.jpg" href="/bad">'
231
+ dandruff.scrub(html)
232
+ # => '<img src="pic.jpg">' (href removed from img)
233
+ ```
234
+
235
+ **Security benefit:** Prevents attribute confusion attacks where dangerous attributes appear on unexpected elements.
236
+
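The lookup behind per-tag restrictions can be pictured with a small standalone sketch. It does not call Dandruff; the constant and method names are hypothetical, and the fallback to a global list for tags without their own entry is an assumption made for illustration.

```ruby
require 'set'

# Hypothetical global allowlist, used when a tag has no per-tag entry (assumption).
GLOBAL_ATTRIBUTES = Set['class', 'title'].freeze

# Hypothetical per-tag allowlists, mirroring the shape of allowed_attributes_per_tag.
PER_TAG_ATTRIBUTES = {
  'a'   => Set['href', 'title', 'target'],
  'img' => Set['src', 'alt', 'width', 'height']
}.freeze

# True when the attribute is allowed on this tag; per-tag entries win over the global list.
def attribute_allowed?(tag, attr)
  allowed = PER_TAG_ATTRIBUTES.fetch(tag.downcase, GLOBAL_ATTRIBUTES)
  allowed.include?(attr.downcase)
end

# Keeps only the attributes that pass the check for the given tag.
def filter_attributes(tag, attrs)
  attrs.select { |name, _value| attribute_allowed?(tag, name) }
end

p filter_attributes('a', { 'href' => '/page', 'onclick' => 'alert()' })
p filter_attributes('img', { 'src' => 'pic.jpg', 'href' => '/bad' })
```

Because the check is keyed on the tag first, an attribute that is harmless on one element (`href` on `a`) is still dropped where it has no business (`href` on `img`).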
237
+ ### Complex Use Cases
238
+
239
+ #### Custom URI Validation
240
+
241
+ ```ruby
242
+ # Only allow HTTPS URLs from your domain
243
+ dandruff = Dandruff.new do |config|
244
+ config.allowed_uri_regexp = /\Ahttps:\/\/(www\.)?example\.com\// # \A anchors at the start of the string; ^ would also match after a newline
245
+ end
246
+
247
+ html = '<a href="https://example.com/safe">OK</a><a href="http://evil.com">Bad</a>'
248
+ dandruff.scrub(html)
249
+ # => '<a href="https://example.com/safe">OK</a><a>Bad</a>'
250
+ ```
251
+
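When writing a pattern for `allowed_uri_regexp`, keep in mind that Ruby's `^` matches at the start of every line, not only at the start of the string. The standalone sketch below (plain Ruby, no Dandruff calls, hypothetical names) shows how a multi-line value can slip past a `^`-anchored pattern while a `\A`-anchored one rejects it. Whether Dandruff normalizes attribute values before matching is not stated here, so treat this as a general Ruby precaution.

```ruby
# Standalone illustration of regexp anchoring; it does not call Dandruff.
LINE_ANCHORED   = %r{^https://(www\.)?example\.com/}
STRING_ANCHORED = %r{\Ahttps://(www\.)?example\.com/}

# Hypothetical attribute value: the second line looks like an allowed URL.
SNEAKY_VALUE = "javascript:alert(1)\nhttps://example.com/"

# True when the pattern matches the value.
def uri_matches?(value, pattern)
  pattern.match?(value)
end

puts uri_matches?(SNEAKY_VALUE, LINE_ANCHORED)                  # true: ^ matches after the newline
puts uri_matches?(SNEAKY_VALUE, STRING_ANCHORED)                # false: \A only matches at the very start
puts uri_matches?('https://example.com/safe', STRING_ANCHORED)  # true
```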
252
+ #### Hook-Based Customization
253
+
254
+ Hooks allow you to extend Dandruff's behavior:
255
+
256
+ ```ruby
257
+ dandruff = Dandruff.new
258
+
259
+ # Custom attribute handling
260
+ dandruff.add_hook(:upon_sanitize_attribute) do |node, data, config|
261
+ tag_name = data[:tag_name]
262
+ attr_name = data[:attr_name]
263
+
264
+ # Allow specific custom data attributes
265
+ if attr_name.start_with?('data-safe-')
266
+ data[:keep_attr] = true
267
+ end
268
+
269
+ # Force lowercase on certain attributes
270
+ if attr_name == 'id'
271
+ node[attr_name] = node[attr_name].downcase
272
+ end
273
+ end
274
+
275
+ # Element processing
276
+ dandruff.add_hook(:upon_sanitize_element) do |node, data, config|
277
+ # Log each element as it is processed
278
+ puts "Processing #{data[:tag_name]} element"
279
+ end
280
+
281
+ html = '<div data-safe-user-id="123" DATA-KEY="ABC" id="MyID">Content</div>'
282
+ dandruff.scrub(html)
283
+ # => '<div data-safe-user-id="123" id="myid">Content</div>'
284
+ ```
285
+
286
+ Available hooks:
287
+ - `:before_sanitize_elements` - Before processing elements
288
+ - `:after_sanitize_elements` - After processing elements
289
+ - `:before_sanitize_attributes` - Before processing attributes on an element
290
+ - `:after_sanitize_attributes` - After processing attributes on an element
291
+ - `:upon_sanitize_element` - When processing each element
292
+ - `:upon_sanitize_attribute` - When processing each attribute
293
+
294
+ #### Template Safety
295
+
296
+ Remove template expressions when sanitizing user-submitted content:
297
+
298
+ ```ruby
299
+ dandruff = Dandruff.new do |config|
300
+ config.safe_for_templates = true
301
+ end
302
+
303
+ html = '<div>{{user.name}} - <%= admin_link %> - ${secret}</div>'
304
+ dandruff.scrub(html)
305
+ # => '<div> - - </div>' (template expressions removed)
306
+ ```
307
+
308
+ Removes:
309
+ - Mustache/Handlebars: `{{ }}`
310
+ - ERB: `<% %>`, `<%= %>`
311
+ - Template literals: `${ }`
312
+
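As a rough picture of what stripping these three syntaxes involves, here is a regex-based sketch in plain Ruby. It is not Dandruff's implementation; the pattern list and method name are hypothetical, and a real sanitizer would apply this to text nodes and attribute values rather than to raw markup.

```ruby
# Hypothetical patterns for the three template syntaxes listed above.
TEMPLATE_PATTERNS = [
  /\{\{.*?\}\}/m, # Mustache / Handlebars: {{ ... }}
  /<%.*?%>/m,     # ERB: <% ... %> and <%= ... %>
  /\$\{.*?\}/m    # template literals: ${ ... }
].freeze

# Removes every template expression, applying each pattern in turn.
def strip_template_expressions(text)
  TEMPLATE_PATTERNS.reduce(text) { |result, pattern| result.gsub(pattern, '') }
end

p strip_template_expressions('{{user.name}} - <%= admin_link %> - ${secret}')
```

The non-greedy `.*?` keeps each match as short as possible, so two expressions on the same line are removed separately instead of swallowing the text between them.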
313
+ #### Multi-Pass Sanitization
314
+
315
+ Protect against mutation-based XSS (mXSS):
316
+
317
+ ```ruby
318
+ dandruff = Dandruff.new do |config|
319
+ config.scrub_until_stable = true # default
320
+ config.mutation_max_passes = 2 # default
321
+ end
322
+
323
+ # Dandruff will re-sanitize until output is stable
324
+ # or max passes reached, preventing mXSS attacks
325
+ ```
326
+
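The "re-run until stable" idea is a fixed-point loop. The sketch below is plain Ruby with a hypothetical helper name and a deliberately naive string-replacing pass (not real HTML parsing, and not Dandruff's code). It assumes the pass limit counts total passes.

```ruby
# Re-runs a sanitizing pass until the output stops changing or the pass limit is hit.
def scrub_until_stable(input, max_passes: 2)
  current = input
  max_passes.times do
    result = yield(current)
    return result if result == current # stable: another pass would change nothing
    current = result
  end
  current
end

# Toy pass for illustration only: naive string removal.
naive_pass = ->(html) { html.gsub('<script>', '') }

payload = '<scr<script>ipt>alert(1)'
one_pass = scrub_until_stable(payload, max_passes: 1, &naive_pass)
two_pass = scrub_until_stable(payload, max_passes: 2, &naive_pass)

puts one_pass # <script>alert(1)
puts two_pass # alert(1)
```

A single pass removes the inner `<script>` and in doing so assembles a new one from the surrounding fragments; the second pass catches it. Real mXSS arises from parser and serializer round-trips rather than string splicing, but the loop structure is the same.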
327
+ #### Return DOM Instead of String
328
+
329
+ For further processing with Nokogiri:
330
+
331
+ ```ruby
332
+ dandruff = Dandruff.new do |config|
333
+ config.return_dom = true
334
+ end
335
+
336
+ doc = dandruff.scrub(html)
337
+ # => Nokogiri::HTML::Document
338
+
339
+ # Or return a fragment
340
+ dandruff = Dandruff.new do |config|
341
+ config.return_dom_fragment = true
342
+ end
343
+
344
+ fragment = dandruff.scrub(html)
345
+ # => Nokogiri::HTML::DocumentFragment
346
+ ```
347
+
348
+ ## 📚 Reference
349
+
350
+ ### Configuration Reference
351
+
352
+ Complete list of configuration options with defaults and security implications:
353
+
354
+ | Option | Type | Default | Description |
355
+ |--------|------|---------|-------------|
356
+ | `allowed_tags` | `Array<String>` | `nil` (use defaults) | Exact allowlist of elements. When set, only these tags pass. |
357
+ | `additional_tags` | `Array<String>` | `[]` | Extends default safe set. |
358
+ | `forbidden_tags` | `Array<String>` | `['base','link','meta','annotation-xml','noscript']` | Always removed even if allowed elsewhere. |
359
+ | `allowed_attributes` | `Array<String>` | `nil` (use defaults) | Exact allowlist of attributes. |
360
+ | `allowed_attributes_per_tag` | `Hash<String, Array<String>>` | `nil` | Per-tag attribute restrictions. Takes precedence over `allowed_attributes`. |
361
+ | `additional_attributes` | `Array<String>` | `[]` | Extends default safe attributes. |
362
+ | `forbidden_attributes` | `Array<String>` | `nil` | Attributes always removed. |
363
+ | `allow_data_attributes` | `Boolean` | `true` | Controls `data-*` attributes. |
364
+ | `allow_aria_attributes` | `Boolean` | `true` | Controls `aria-*` attributes for accessibility. |
365
+ | `allow_data_uri` | `Boolean` | `true` | Controls `data:` URIs. Set to `false` to block them for untrusted content. |
366
+ | `allow_unknown_protocols` | `Boolean` | `false` | If true, permits non-standard schemes (⚠️ security risk). |
367
+ | `allowed_uri_regexp` | `Regexp` | `nil` | Custom regexp to validate URI attributes. |
368
+ | `additional_uri_safe_attributes` | `Array<String>` | `[]` | Extra attributes treated as URI-like. |
369
+ | `allow_style_tags` | `Boolean` | `true` | `<style>` tags with content scanning. |
370
+ | `sanitize_dom` | `Boolean` | `true` | Removes DOM clobbering `id`/`name` values. |
371
+ | `safe_for_templates` | `Boolean` | `false` | Strips template expressions (`{{ }}`, `<%= %>`, `${ }`). |
372
+ | `safe_for_xml` | `Boolean` | `true` | Removes comments/PI in XML-ish content. |
373
+ | `whole_document` | `Boolean` | `false` | Parse as full document instead of fragment. |
374
+ | `allow_document_elements` | `Boolean` | `false` | Retain `html/head/body` tags. |
375
+ | `minimal_profile` | `Boolean` | `false` | Use smaller HTML-only allowlist (no SVG/MathML). |
376
+ | `force_body` | `Boolean` | `false` | Forces body context when parsing fragments. |
377
+ | `return_dom` | `Boolean` | `false` | Return Nokogiri DOM instead of string. |
378
+ | `return_dom_fragment` | `Boolean` | `false` | Return Nokogiri fragment instead of string. |
379
+ | `sanitize_until_stable` | `Boolean` | `true` | Re-sanitize until stable to mitigate mXSS. |
380
+ | `mutation_max_passes` | `Integer` | `2` | Max passes for stabilization. Higher = more secure, slower. |
381
+ | `keep_content` | `Boolean` | `true` | If false, removes contents of stripped elements. |
382
+ | `in_place` | `Boolean` | `false` | Attempts to sanitize in place (experimental). |
383
+ | `use_profiles` | `Hash` | `{}` | Enable content type profiles: `:html`, `:svg`, `:svg_filters`, `:math_ml`, `:html_email`. |
384
+ | `namespace` | `String` | `'http://www.w3.org/1999/xhtml'` | Namespace for XHTML handling. |
385
+ | `parser_media_type` | `String` | `'text/html'` | Parser media type; set to `application/xhtml+xml` for XHTML. |
386
+
387
+ ### API Reference
388
+
389
+ #### Core Methods
390
+
391
+ ##### `Dandruff.new(config = nil, &block)` → `Sanitizer`
392
+
393
+ Creates a new Dandruff instance.
394
+
395
+ **Parameters:**
396
+ - `config` (Hash, Config) - Optional configuration hash or Config object
397
+ - `block` - Optional block for configuration
398
+
399
+ **Returns:** Sanitizer instance
400
+
401
+ **Example:**
402
+ ```ruby
403
+ dandruff = Dandruff.new do |config|
404
+ config.allowed_tags = ['p', 'strong']
405
+ end
406
+ ```
407
+
408
+ ##### `dandruff.scrub(dirty_html, config = {})` → `String` or `Nokogiri::XML::Document`
409
+
410
+ Sanitizes an HTML string or a Nokogiri node.
411
+
412
+ **Parameters:**
413
+ - `dirty_html` (String, Nokogiri::XML::Node) - Input to sanitize
414
+ - `config` (Hash) - Optional configuration override
415
+
416
+ **Returns:** Sanitized HTML string or Nokogiri document (based on config)
417
+
418
+ **Example:**
419
+ ```ruby
420
+ clean = dandruff.scrub('<script>xss</script><p>Safe</p>')
421
+ # => "<p>Safe</p>"
422
+ ```
423
+
424
+ ##### `Dandruff.scrub(dirty_html, config = {})` → `String`
425
+
426
+ Class method for one-off sanitization.
427
+
428
+ **Parameters:**
429
+ - `dirty_html` (String) - Input to sanitize
430
+ - `config` (Hash) - Configuration options
431
+
432
+ **Returns:** Sanitized HTML string
433
+
434
+ **Example:**
435
+ ```ruby
436
+ clean = Dandruff.scrub(html, allowed_tags: ['p'])
437
+ ```
438
+
439
+ #### Configuration Methods
440
+
441
+ ##### `dandruff.configure { |config| ... }` → `Sanitizer`
442
+
443
+ Configures the dandruff instance using a block.
444
+
445
+ **Example:**
446
+ ```ruby
447
+ dandruff.configure do |config|
448
+ config.allowed_tags = ['p', 'strong']
449
+ end
450
+ ```
451
+
452
+ ##### `dandruff.set_config(config_hash)` → `Config`
453
+
454
+ Sets configuration directly with a hash.
455
+
456
+ **Example:**
457
+ ```ruby
458
+ dandruff.set_config(allowed_tags: ['p'], allowed_attributes: ['class'])
459
+ ```
460
+
461
+ ##### `dandruff.clear_config` → `Config`
462
+
463
+ Resets to default configuration.
464
+
465
+ #### Hook Methods
466
+
467
+ ##### `dandruff.add_hook(entry_point, &block)` → `void`
468
+
469
+ Adds a hook function.
470
+
471
+ **Parameters:**
472
+ - `entry_point` (Symbol) - Hook name (`:before_sanitize_elements`, `:upon_sanitize_attribute`, etc.)
473
+ - `block` (Proc) - Hook function receiving `(node, data, config)`
474
+
475
+ **Example:**
476
+ ```ruby
477
+ dandruff.add_hook(:upon_sanitize_attribute) do |node, data, config|
478
+ data[:keep_attr] = true if data[:attr_name] == 'data-safe'
479
+ end
480
+ ```
481
+
482
+ ##### `dandruff.remove_hook(entry_point, hook_function = nil)` → `Proc` or `nil`
483
+
484
+ Removes the given hook, or the last hook registered for the entry point when `hook_function` is omitted.
485
+
486
+ ##### `dandruff.remove_all_hooks` → `Hash`
487
+
488
+ Removes all hooks.
489
+
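To make the add/remove semantics described above concrete, here is a minimal registry sketch in plain Ruby. `HookRegistry` and its `run` method are hypothetical names for illustration; this is not Dandruff's internal implementation, only one way the documented behavior could be modeled.

```ruby
# Minimal hook registry sketch (hypothetical class, not part of Dandruff).
class HookRegistry
  def initialize
    # Each entry point maps to an ordered list of hook procs.
    @hooks = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a block under an entry point.
  def add_hook(entry_point, &block)
    raise ArgumentError, 'a block is required' unless block

    @hooks[entry_point] << block
    nil
  end

  # Removes the given hook, or the most recently added one when none is given.
  # Returns the removed proc, or nil if nothing was removed.
  def remove_hook(entry_point, hook = nil)
    list = @hooks[entry_point]
    return list.pop if hook.nil?

    list.delete(hook)
  end

  # Drops every registered hook.
  def remove_all_hooks
    @hooks.clear
  end

  # Calls each hook in registration order; hooks communicate by mutating `data`.
  def run(entry_point, node, data, config = {})
    @hooks[entry_point].each { |hook| hook.call(node, data, config) }
    data
  end
end

registry = HookRegistry.new
registry.add_hook(:upon_sanitize_attribute) do |_node, data, _config|
  data[:keep_attr] = true if data[:attr_name].start_with?('data-safe-')
end

result = registry.run(:upon_sanitize_attribute, nil, { attr_name: 'data-safe-user-id', keep_attr: false })
puts result[:keep_attr] # true
```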
490
+ #### Utility Methods
491
+
492
+ ##### `dandruff.supported?` → `Boolean`
493
+
494
+ Checks if required dependencies (Nokogiri) are available.
495
+
496
+ ##### `dandruff.removed` → `Array`
497
+
498
+ Returns the list of elements and attributes removed during the last sanitization.
499
+
500
+ **Returns:** Array of removal records
501
+
502
+ ### Profiles Reference
503
+
504
+ #### HTML Profile
505
+
506
+ **Enable:** `use_profiles: { html: true }`
507
+
508
+ **Includes:** All standard HTML5 semantic elements, media elements, form controls, and text formatting.
509
+
510
+ **Use for:** Standard web content, blog posts, documentation
511
+
512
+ ```ruby
513
+ dandruff = Dandruff.new do |config|
514
+ config.use_profiles = { html: true }
515
+ end
516
+ ```
517
+
518
+ #### SVG Profile
519
+
520
+ **Enable:** `use_profiles: { svg: true }`
521
+
522
+ **Includes:** SVG elements for vector graphics (shapes, paths, gradients, basic filters)
523
+
524
+ **Use for:** Inline SVG graphics, icons, diagrams
525
+
526
+ ```ruby
527
+ dandruff = Dandruff.new do |config|
528
+ config.use_profiles = { html: true, svg: true }
529
+ end
530
+ ```
531
+
532
+ #### SVG Filters Profile
533
+
534
+ **Enable:** `use_profiles: { svg_filters: true }`
535
+
536
+ **Includes:** Advanced SVG filter primitives (blur, color manipulation, lighting)
537
+
538
+ **Use for:** SVG with visual effects
539
+
540
+ ```ruby
541
+ dandruff = Dandruff.new do |config|
542
+ config.use_profiles = { svg: true, svg_filters: true }
543
+ end
544
+ ```
545
+
546
+ #### MathML Profile
547
+
548
+ **Enable:** `use_profiles: { math_ml: true }`
549
+
550
+ **Includes:** MathML elements for mathematical notation
551
+
552
+ **Use for:** Scientific documents, mathematical content
553
+
554
+ ```ruby
555
+ dandruff = Dandruff.new do |config|
556
+ config.use_profiles = { html: true, math_ml: true }
557
+ end
558
+ ```
559
+
560
+ #### HTML Email Profile
561
+
562
+ **Enable:** `use_profiles: { html_email: true }`
563
+
564
+ **Includes:**
565
+ - HTML elements + document structure (`head`, `meta`, `style`)
566
+ - Legacy presentation tags (`font`, `center`)
567
+ - Legacy attributes (`bgcolor`, `cellpadding`, `valign`, etc.)
568
+ - Per-tag attribute restrictions (automatic)
569
+
570
+ **Excludes:** Forms, scripts, interactive elements
571
+
572
+ **Special settings:**
573
+ - Allows style tags (required for email)
574
+ - Disables DOM clobbering protection (emails are sandboxed)
575
+ - Parses as whole document
576
+
577
+ **Use for:** HTML email rendering
578
+
579
+ ```ruby
580
+ dandruff = Dandruff.new do |config|
581
+ config.use_profiles = { html_email: true }
582
+ end
583
+ ```
584
+
585
+ ## 🔒 Security
586
+
587
+ ### Threat Model
588
+
589
+ Dandruff defends against multiple attack vectors:
590
+
591
+ #### XSS (Cross-Site Scripting)
592
+
593
+ **Attack:** Injecting scripts via HTML tags or attributes
594
+
595
+ **Protection:**
596
+ - Removes `<script>`, `<iframe>`, `<object>`, `<embed>` tags
597
+ - Blocks event handlers (`onclick`, `onerror`, `onload`, etc.)
598
+ - Validates URI attributes to prevent `javascript:` and `vbscript:` protocols
599
+
600
+ ```ruby
601
+ # Attack blocked
602
+ dandruff.scrub('<script>alert("xss")</script>')
603
+ # => ""
604
+
605
+ dandruff.scrub('<img src="javascript:alert(1)">')
606
+ # => "<img>"
607
+
608
+ dandruff.scrub('<a onclick="alert(1)">Click</a>')
609
+ # => "<a>Click</a>"
610
+ ```
611
+
612
+ #### mXSS (Mutation-Based XSS)
613
+
614
+ **Attack:** HTML mutations during parsing that create XSS
615
+
616
+ **Protection:**
617
+ - Multi-pass sanitization (validates output is stable)
618
+ - Namespace confusion prevention (SVG/MathML)
619
+ - Proper HTML5 parsing
620
+
621
+ ```ruby
622
+ # mXSS prevented through multi-pass sanitization
623
+ dandruff = Dandruff.new do |config|
624
+ config.scrub_until_stable = true # default
625
+ config.mutation_max_passes = 2
626
+ end
627
+ ```
628
+
629
+ #### DOM Clobbering
630
+
631
+ **Attack:** Using `id`/`name` attributes to override built-in DOM properties
632
+
633
+ **Protection:**
634
+ - Blocks dangerous id/name values (`document`, `location`, `alert`, `window`, etc.)
635
+ - Can be disabled for sandboxed contexts like email
636
+
637
+ ```ruby
638
+ # DOM clobbering blocked
639
+ dandruff.scrub('<form name="document">')
640
+ # => "<form></form>" (name removed)
641
+
642
+ dandruff.scrub('<img id="location">')
643
+ # => "<img>" (id removed)
644
+ ```
645
+
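The check behind this protection can be sketched as a lookup against a set of reserved names. The code below is plain Ruby with hypothetical names; the set contains only the four names mentioned above, whereas a real list would be considerably longer.

```ruby
require 'set'

# Illustrative subset of names that shadow built-in DOM properties.
CLOBBERABLE_NAMES = Set['document', 'location', 'alert', 'window'].freeze

# Attributes whose values become named properties in the DOM.
CLOBBER_ATTRIBUTES = %w[id name].freeze

# True when an id/name value collides with a reserved name.
# Comparing case-insensitively is a conservative choice made for this sketch.
def clobbering_attribute?(attr_name, value)
  CLOBBER_ATTRIBUTES.include?(attr_name.downcase) &&
    CLOBBERABLE_NAMES.include?(value.strip.downcase)
end

puts clobbering_attribute?('name', 'document') # true
puts clobbering_attribute?('id', 'main-nav')   # false
```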
646
+ #### Protocol Injection
647
+
648
+ **Attack:** Using dangerous URI protocols to execute code
649
+
650
+ **Protection:**
651
+ - Blocks `javascript:`, `vbscript:`, `data:text/html` protocols
652
+ - Validates against allowlist of safe protocols
653
+ - Custom protocol validation with `allowed_uri_regexp`
654
+
655
+ ```ruby
656
+ dandruff.scrub('<a href="javascript:alert(1)">Click</a>')
657
+ # => "<a>Click</a>"
658
+
659
+ dandruff.scrub('<link href="vbscript:msgbox(1)">')
660
+ # => (link removed)
661
+ ```
662
+
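A scheme allowlist check can be sketched in a few lines of plain Ruby. The list of safe schemes and the method name are assumptions for illustration (the schemes Dandruff actually allows are not listed here), and this is not the library's code.

```ruby
# Hypothetical allowlist of URI schemes.
SAFE_SCHEMES = %w[http https mailto tel].freeze

# True when the value has no scheme (relative URL) or an allowlisted one.
def safe_uri?(value)
  # Browsers ignore ASCII whitespace and control characters inside a scheme,
  # so strip them before looking for one ("java\tscript:" is still javascript:).
  cleaned = value.gsub(/[\x00-\x20]/, '')
  scheme = cleaned[/\A([a-zA-Z][a-zA-Z0-9+.\-]*):/, 1]
  return true if scheme.nil?

  SAFE_SCHEMES.include?(scheme.downcase)
end

puts safe_uri?('https://example.com/page') # true
puts safe_uri?('/relative/path')           # true
puts safe_uri?("java\tscript:alert(1)")    # false
puts safe_uri?('data:text/html,<b>x</b>')  # false
```

Checking against an allowlist rather than a list of known-bad schemes means an unfamiliar scheme is rejected by default.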
663
+ #### CSS Injection
664
+
665
+ **Attack:** Using CSS to execute code or exfiltrate data
666
+
667
+ **Protection:**
668
+ - Parses and validates inline `style` attributes
669
+ - Removes dangerous CSS properties and values
670
+ - Scans `<style>` tag content for unsafe patterns
671
+
672
+ ```ruby
673
+ dandruff.scrub('<div style="expression(alert(1))"></div>')
674
+ # => "<div></div>" (dangerous style removed)
675
+
676
+ dandruff.scrub('<div style="background: url(javascript:alert(1))"></div>')
677
+ # => "<div></div>" (dangerous style removed)
678
+ ```
679
+
680
+ ### Security Best Practices
681
+
682
+ #### 1. Use Allowlists, Not Blocklists
683
+
684
+ ```ruby
685
+ # ✅ Good - explicitly allow safe tags
686
+ config.allowed_tags = ['p', 'strong', 'em', 'a']
687
+
688
+ # ❌ Avoid - trying to block everything dangerous is error-prone
689
+ config.forbidden_tags = ['script', 'iframe', ...] # incomplete!
690
+ ```
691
+
692
+ #### 2. Restrict URI Protocols
693
+
694
+ ```ruby
695
+ # ✅ Good - only allow HTTPS
696
+ config.allowed_uri_regexp = /\Ahttps:/
697
+
698
+ # ⚠️ Caution - allowing unknown protocols is risky
699
+ config.allow_unknown_protocols = true # avoid unless necessary
700
+ ```
701
+
702
+ #### 3. Disable Data URIs Unless Needed
703
+
704
+ ```ruby
705
+ # ✅ Good for user-generated content
706
+ config.allow_data_uri = false
707
+
708
+ # ⚠️ Only enable for trusted content
709
+ config.allow_data_uri = true # only if you need it
710
+ ```
711
+
712
+ #### 4. Remove Event Handlers
713
+
714
+ ```ruby
715
+ # ✅ Good - block all event handlers
716
+ config.forbidden_attributes = [
717
+ 'onclick', 'onload', 'onerror', 'onmouseover',
718
+ 'onfocus', 'onblur', 'onchange', 'onsubmit'
719
+ ]
720
+ ```
721
+
722
+ #### 5. Keep DOM Sanitization Enabled
723
+
724
+ ```ruby
725
+ # ✅ Good - default setting
726
+ config.scrub_dom = true
727
+
728
+ # ⚠️ Only disable for sandboxed contexts (e.g., email rendering)
729
+ config.scrub_dom = false # use with caution
730
+ ```
731
+
732
+ #### 6. Use Per-Tag Attribute Control
733
+
734
+ ```ruby
735
+ # ✅ Good - prevents attribute confusion
736
+ config.allowed_attributes_per_tag = {
737
+ 'a' => ['href', 'title'], # no 'src' on links
738
+ 'img' => ['src', 'alt'], # no 'href' on images
739
+ 'form' => ['action', 'method'] # only form-specific attrs
740
+ }
741
+ ```
742
+
743
+ #### 7. Keep Dandruff Updated
744
+
745
+ ```bash
746
+ # Check your Gemfile.lock regularly
747
+ bundle outdated dandruff
748
+
749
+ # Update to latest version
750
+ bundle update dandruff
751
+ ```
752
+
753
+ ### Recommended Configurations
754
+
755
+ #### Maximum Security (User Comments)
756
+
757
+ ```ruby
758
+ dandruff = Dandruff.new do |config|
759
+ config.allowed_tags = ['p', 'br', 'strong', 'em', 'a']
760
+ config.allowed_attributes = ['href']
761
+ config.forbidden_attributes = ['onclick', 'onerror', 'onload', 'style']
762
+ config.allow_data_uri = false
763
+ config.keep_content = false
764
+ end
765
+ ```
766
+
767
+ #### Content Management System
768
+
769
+ ```ruby
770
+ dandruff = Dandruff.new do |config|
771
+ config.allowed_tags = [
772
+ 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
773
+ 'p', 'br', 'strong', 'em', 'ul', 'ol', 'li',
774
+ 'blockquote', 'code', 'pre', 'a', 'img',
775
+ 'div', 'span', 'table', 'tr', 'td', 'th'
776
+ ]
777
+ config.allowed_attributes = ['href', 'src', 'alt', 'title', 'class', 'id']
778
+ config.allow_data_uri = true # for embedded images
779
+ config.allowed_uri_regexp = /\A(?:https?:|\/)/ # only http(s) and relative URLs
780
+ end
781
+ ```
782
+
783
+ #### Rich Text Editor
784
+
785
+ ```ruby
786
+ dandruff = Dandruff.new do |config|
787
+ config.use_profiles = { html: true }
788
+ config.forbidden_tags = ['script', 'iframe', 'object', 'embed']
789
+ config.forbidden_attributes = ['on*'] # remove all event handlers
790
+ config.allowed_attributes_per_tag = {
791
+ 'img' => ['src', 'alt', 'width', 'height'],
792
+ 'a' => ['href', 'title']
793
+ }
794
+ end
795
+ ```
796
+
797
+ ## ⚡ Performance
798
+
799
+ Dandruff is optimized for performance while maintaining security.
800
+
801
+ ### Benchmarks
802
+
803
+ Executed on Apple M1 Max via `ruby spec/dandruff_performance_spec.rb`:
804
+
805
+ | Input Size | Default Config | Strict Config | Throughput (Default) | Throughput (Strict) |
806
+ |------------|----------------|---------------|----------------------|---------------------|
807
+ | 1KB | ~3.3ms | ~0.3ms | ~300 KB/s | ~3,300 KB/s |
808
+ | 10KB | ~31ms | ~3.3ms | ~320 KB/s | ~3,000 KB/s |
809
+ | 50KB | ~160ms | ~16ms | ~310 KB/s | ~3,100 KB/s |
810
+ | 100KB | ~340ms | ~31ms | ~290 KB/s | ~3,200 KB/s |
811
+ | 500KB | ~1.8s | ~170ms | ~280 KB/s | ~2,900 KB/s |
812
+
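The throughput columns are simply input size divided by elapsed time. A small Ruby helper (hypothetical name, written for this note) reproduces them from the latency columns:

```ruby
# Converts an input size in KB and a latency in milliseconds to KB per second.
def throughput_kb_per_s(size_kb, millis)
  size_kb / (millis / 1000.0)
end

puts throughput_kb_per_s(100, 340).round # 294  (default config, 100KB row)
puts throughput_kb_per_s(100, 31).round  # 3226 (strict config, 100KB row)
```

The table rounds these to about 290 KB/s and 3,200 KB/s, roughly a tenfold difference between the default and strict configurations at every input size.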
813
+ **Stress tests:**
814
+ - 1,000 small docs: ~0.40s total (~2,450 docs/sec)
815
+ - Deep nesting (100 levels): <5s
816
+ - Memory growth: <50k objects over 100 iterations
817
+
818
+ ### Performance Tips
819
+
820
+ #### 1. Reuse Configurations
821
+
822
+ ```ruby
823
+ # ✅ Good - reuse configuration
824
+ dandruff = Dandruff.new do |config|
825
+ config.allowed_tags = ['p', 'strong', 'em']
826
+ end
827
+
828
+ documents.each do |doc|
829
+ clean = dandruff.scrub(doc) # fast - config already set
830
+ end
831
+
832
+ # ❌ Slower - new config each time
833
+ documents.each do |doc|
834
+ clean = Dandruff.scrub(doc, allowed_tags: ['p', 'strong', 'em'])
835
+ end
836
+ ```
837
+
838
+ #### 2. Use Strict Configurations
839
+
840
+ More restrictive configurations are faster:
841
+
842
+ ```ruby
843
+ # Faster - small allowlist
844
+ config.allowed_tags = ['p', 'strong', 'em']
845
+
846
+ # Slower - large allowlist or nil (uses defaults)
847
+ config.allowed_tags = nil
848
+ ```
849
+
850
+ #### 3. Batch Processing
851
+
852
+ Process multiple documents with the same instance:
853
+
854
+ ```ruby
855
+ dandruff = Dandruff.new do |config|
856
+ # ... configuration
857
+ end
858
+
859
+ cleaned_docs = documents.map { |doc| dandruff.scrub(doc) }
860
+ ```
861
+
862
+ #### 4. Adjust Multi-Pass Limit
863
+
864
+ For trusted content, you can reduce passes:
865
+
866
+ ```ruby
867
+ # Faster but less secure - use only for pre-validated content
868
+ config.scrub_until_stable = false
869
+
870
+ # Or reduce max passes
871
+ config.mutation_max_passes = 1 # default is 2
872
+ ```
873
+
874
+ #### 5. Return DOM for Further Processing
875
+
876
+ If you need to process the output further:
877
+
878
+ ```ruby
879
+ config.return_dom = true
880
+ doc = dandruff.scrub(html) # Returns Nokogiri document
881
+ # ... further processing with Nokogiri
882
+ ```
883
+
884
+ ## 🔄 Migration Guides
885
+
886
+ ### From Rails Sanitizer
887
+
888
+ ```ruby
889
+ # Before (Rails)
890
+ ActionController::Base.helpers.sanitize(html, tags: ['p', 'strong'])
891
+
892
+ # After (Dandruff)
893
+ Dandruff.scrub(html, allowed_tags: ['p', 'strong'])
894
+
895
+ # Or create reusable instance
896
+ @dandruff = Dandruff.new do |config|
897
+ config.allowed_tags = ['p', 'strong', 'em', 'a']
898
+ config.allowed_attributes = ['href']
899
+ end
900
+
901
+ @dandruff.scrub(html)
902
+ ```
903
+
904
+ ### From Loofah
905
+
906
+ ```ruby
907
+ # Before (Loofah)
908
+ Loofah.fragment(html).scrub!(:prune).to_s
909
+
910
+ # After (Dandruff)
911
+ Dandruff.scrub(html, keep_content: false)
912
+
913
+ # With specific tags
914
+ Loofah.fragment(html).scrub!(:strip).to_s
915
+
916
+ # Dandruff equivalent
917
+ Dandruff.scrub(html, allowed_tags: ['p', 'strong'])
918
+ ```
919
+
920
+ ### From Sanitize Gem
921
+
922
+ ```ruby
923
+ # Before (Sanitize)
924
+ Sanitize.fragment(html, elements: ['p', 'strong'])
925
+
926
+ # After (Dandruff)
927
+ Dandruff.scrub(html, allowed_tags: ['p', 'strong'])
928
+
929
+ # Custom config
930
+ Sanitize.fragment(html, Sanitize::Config::RELAXED)
931
+
932
+ # Dandruff profiles
933
+ Dandruff.scrub(
934
+ html, use_profiles: { html: true }
935
+ )
936
+ ```
937
+
938
+ ## 🆚 Comparison
939
+
940
+ See [COMPARISON.md](COMPARISON.md) for a detailed comparison with other Ruby HTML sanitization libraries:
941
+
942
+ - Rails' built-in sanitizer
943
+ - Loofah
944
+ - Sanitize gem
945
+
946
+ **Key differentiators:**
947
+ - Based on DOMPurify's proven security model
948
+ - Protection against mXSS attacks
949
+ - DOM clobbering prevention
950
+ - Per-tag attribute control
951
+ - Hook system for extensibility
952
+ - HTML email support with per-tag restrictions
953
+
954
+ ## ❓ FAQ
955
+
956
+ ### How is Dandruff different from other sanitizers?
957
+
958
+ Dandruff brings DOMPurify's battle-tested security model to Ruby, with specific defenses against mXSS, DOM clobbering, and protocol injection that other Ruby sanitizers may not provide. It also offers per-tag attribute control and an extensible hook system.
959
+
960
+ ### Is Dandruff safe for user-generated content?
961
+
962
+ Yes! Dandruff is specifically designed for sanitizing untrusted user input. Use restrictive configurations for maximum security (see [Recommended Configurations](#recommended-configurations)).
963
+
964
+ ### Can I use Dandruff with Rails?
965
+
966
+ Absolutely! Dandruff works great with Rails:
967
+
968
+ ```ruby
969
+ # In your helper
970
+ def sanitize_user_content(html)
971
+ @dandruff ||= Dandruff.new do |config|
972
+ config.allowed_tags = ['p', 'strong', 'em', 'a']
973
+ config.allowed_attributes = ['href']
974
+ end
975
+ @dandruff.scrub(html)
976
+ end
977
+ ```
978
+
+ ### Does Dandruff work with HTML emails?
+
+ Yes! Use the `html_email` profile:
+
+ ```ruby
+ dandruff = Dandruff.new do |config|
+   config.use_profiles = { html_email: true }
+ end
+ ```
+
+ This includes legacy attributes and per-tag restrictions needed for email clients.
+
+ ### What about performance?
+
+ Dandruff processes ~300 KB/s with default config and ~3,000 KB/s with strict config on modern hardware. Reuse configuration instances for best performance. See the [Performance](#performance) section.
+
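To check throughput figures like these against your own inputs, a small harness along the following lines may help. This is only a sketch: `throughput_kb_per_s` and the `stand_in` lambda are hypothetical names that do not come from Dandruff, and the stand-in exists purely so the snippet runs without the gem installed. Swap in a block that calls your reused Dandruff instance to measure the real thing.

```ruby
require 'benchmark'

# Hypothetical helper: measures throughput in KB/s for any callable that
# takes an HTML string. It is not part of Dandruff.
def throughput_kb_per_s(html, iterations: 50, &sanitizer)
  seconds = Benchmark.realtime do
    iterations.times { sanitizer.call(html) }
  end
  kilobytes = (html.bytesize * iterations) / 1024.0
  kilobytes / seconds
end

# Stand-in sanitizer (an assumption for illustration only, NOT a safe
# sanitizer). Replace with e.g. ->(html) { dandruff.scrub(html) }.
stand_in = ->(html) { html.gsub(/<script.*?<\/script>/mi, '') }

sample = '<p>Hello <strong>world</strong></p><script>alert(1)</script>' * 200
rate = throughput_kb_per_s(sample, &stand_in)
puts format('%.1f KB/s', rate)
```

Numbers from a harness like this depend heavily on the input mix and the machine, so treat them as relative comparisons between configurations rather than absolute figures.
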
+ ### How do I allow custom elements?
+
+ ```ruby
+ dandruff = Dandruff.new do |config|
+   config.additional_tags = ['my-custom-element', 'web-component']
+ end
+ ```
+
+ Elements with hyphens are treated as custom elements by default.
+
+ ### Can I allow inline styles?
+
+ Yes, but they're sanitized for safety:
+
+ ```ruby
+ dandruff = Dandruff.new do |config|
+   config.allowed_attributes = ['style'] # style is allowed by default
+ end
+
+ # Safe styles pass through
+ dandruff.scrub('<div style="color: red;">Text</div>')
+ # => '<div style="color:red;">Text</div>'
+
+ # Dangerous styles are removed
+ dandruff.scrub('<div style="expression(alert(1))">Text</div>')
+ # => '<div>Text</div>'
+ ```
+
+ ### How do I debug what's being removed?
+
+ ```ruby
+ dandruff = Dandruff.new
+ dandruff.scrub(html)
+
+ # Check what was removed
+ removed = dandruff.removed
+ removed.each do |item|
+   if item[:element]
+     puts "Removed element: #{item[:element].name}"
+   elsif item[:attribute]
+     puts "Removed attribute: #{item[:attribute].name} from #{item[:from].name}"
+   end
+ end
+ ```
+
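For larger inputs, a tally is often easier to read than one line per removal. The sketch below assumes only the entry shape shown in the snippet above (hashes with `:element`, or `:attribute` plus `:from`, whose values respond to `name`). `Node` and `summarize_removed` are hypothetical stand-ins so the example runs on its own; with the gem loaded you would pass `dandruff.removed` instead of `sample`.

```ruby
# Stand-in for the node objects; the only assumption is that they respond
# to #name, as in the debugging snippet above.
Node = Struct.new(:name)

# Hypothetical helper: counts removed elements and attributes by name.
def summarize_removed(removed)
  summary = { elements: Hash.new(0), attributes: Hash.new(0) }
  removed.each do |item|
    if item[:element]
      summary[:elements][item[:element].name] += 1
    elsif item[:attribute]
      key = "#{item[:from].name}[#{item[:attribute].name}]"
      summary[:attributes][key] += 1
    end
  end
  summary
end

# Hand-built sample data in the assumed shape.
sample = [
  { element: Node.new('script') },
  { element: Node.new('script') },
  { attribute: Node.new('onclick'), from: Node.new('a') }
]
summary = summarize_removed(sample)
puts summary[:elements].inspect
puts summary[:attributes].inspect
```
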
+ ## 🛠️ Troubleshooting
+
+ ### Content is being removed unexpectedly
+
+ **Check your configuration:**
+
+ ```ruby
+ # Enable keep_content to preserve text
+ config.keep_content = true
+
+ # Check if tags are in your allowlist
+ puts dandruff.config.allowed_tags
+
+ # Use additional_tags instead of allowed_tags to extend defaults
+ config.additional_tags = ['custom-tag'] # instead of replacing all
+ ```
+
+ ### Attributes are being stripped
+
+ **Verify attribute configuration:**
+
+ ```ruby
+ # Check which attributes are allowed
+ puts dandruff.config.allowed_attributes
+
+ # Use additional_attributes to extend
+ config.additional_attributes = ['data-custom']
+
+ # Or use per-tag control
+ config.allowed_attributes_per_tag = {
+   'div' => ['class', 'id', 'data-custom']
+ }
+ ```
+
+ ### Style tags are removed
+
+ **Enable style tags:**
+
+ ```ruby
+ config.allow_style_tags = true
+
+ # For whole documents (like emails)
+ config.whole_document = true
+ ```
+
+ ### URI validation is too strict
+
+ **Customize URI validation:**
+
+ ```ruby
+ # Allow more protocols
+ # (\A anchors at the start of the string; in Ruby, ^ also matches after any newline)
+ config.allowed_uri_regexp = /\A(?:https?|ftp|mailto):/
+
+ # Or allow unknown protocols (⚠️ less secure)
+ config.allow_unknown_protocols = true
+ ```
+
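A note on anchoring when writing a custom URI pattern: in Ruby, `^` matches at the start of every line, while `\A` matches only at the start of the string. The plain-Ruby sketch below does not call Dandruff, and whether Dandruff normalizes attribute values before matching is not something this example assumes either way. It only shows how the two anchors differ on a value containing a newline.

```ruby
LINE_ANCHORED   = /^(?:https?|ftp|mailto):/
STRING_ANCHORED = /\A(?:https?|ftp|mailto):/

# A value with an embedded newline: only the second line starts with an
# allowed scheme.
sneaky = "javascript:alert(1)\nhttps://example.com"

line_match   = sneaky.match?(LINE_ANCHORED)    # ^ also matches after the newline
string_match = sneaky.match?(STRING_ANCHORED)  # \A only matches at position 0
plain_match  = 'https://example.com'.match?(STRING_ANCHORED)

puts "line-anchored accepts sneaky value:   #{line_match}"
puts "string-anchored accepts sneaky value: #{string_match}"
puts "string-anchored accepts plain https:  #{plain_match}"
```

For allowlist-style patterns, `\A` is generally the safer default in Ruby.
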
+ ### Performance is slow
+
+ **Optimize configuration:**
+
+ ```ruby
+ # Use specific allowlists
+ config.allowed_tags = ['p', 'strong', 'em'] # faster than nil/defaults
+
+ # Reduce multi-pass iterations for trusted content
+ config.mutation_max_passes = 1 # default is 2
+
+ # Disable multi-pass for pre-validated content
+ config.scrub_until_stable = false # use with caution
+ ```
+
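Before lowering the pass count, it may help to see why repeated passes exist at all. The sketch below is a toy illustration of the general "repeat until the output stops changing" idea, not Dandruff's implementation: the `scrub_until_stable` method here is a hypothetical helper that only shares a name with the option above, and `strip_script` is a deliberately naive string filter chosen so the effect is easy to see.

```ruby
# Hypothetical helper: applies the given block repeatedly until the output
# stops changing or max_passes is reached.
def scrub_until_stable(input, max_passes: 2)
  current = input
  max_passes.times do
    result = yield(current)
    return result if result == current
    current = result
  end
  current
end

# Naive single pass (illustration only): deletes literal script tags.
strip_script = ->(html) { html.gsub(/<\/?script>/i, '') }

# Removing the inner tags reassembles new script tags from the fragments.
nested = '<scr<script>ipt>alert(1)</scr</script>ipt>'

one_pass   = scrub_until_stable(nested, max_passes: 1, &strip_script)
two_passes = scrub_until_stable(nested, max_passes: 2, &strip_script)

puts one_pass   # => <script>alert(1)</script>
puts two_passes # => alert(1)
```

A parser-based sanitizer works very differently from this string filter, but the trade-off is similar in spirit: fewer passes are cheaper, and output that changes again on a second pass is exactly the case a single pass would miss.
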
+ ## 🤝 Contributing
+
+ We welcome contributions! Here's how to get involved:
+
+ ### Development Setup
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/kuyio/dandruff.git
+ cd dandruff
+
+ # Install dependencies
+ bundle install
+
+ # Run tests
+ make test
+
+ # Run linter
+ make lint
+
+ # Open console
+ bin/console
+ ```
+
+ ### Running Tests
+
+ ```bash
+ # All tests
+ rake spec
+
+ # With coverage
+ COVERAGE=true rake spec
+
+ # Specific test file
+ rspec spec/basic_sanitization_spec.rb
+
+ # Performance tests
+ ruby spec/dandruff_performance_spec.rb
+ ```
+
+ ### Contribution Guidelines
+
+ 1. **Fork** the repository
+ 2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
+ 3. **Write** tests for your changes
+ 4. **Ensure** all tests pass (`rake spec`)
+ 5. **Update** documentation as needed
+ 6. **Commit** your changes (`git commit -am 'Add amazing feature'`)
+ 7. **Push** to the branch (`git push origin feature/amazing-feature`)
+ 8. **Open** a Pull Request
+
+ ### Development Guidelines
+
+ - **Security First**: All changes must maintain or improve security
+ - **Backward Compatibility**: Avoid breaking changes when possible
+ - **Comprehensive Tests**: New features need full test coverage (aim for 100%)
+ - **Documentation**: Update README and inline YARD docs for API changes
+ - **Performance**: Consider performance impact of changes
+ - **Code Quality**: Follow Ruby best practices and existing code style
+
+ ### Reporting Issues
+
+ Found a bug or have a feature request?
+
+ 1. **Search** existing issues to avoid duplicates
+ 2. **Include** relevant details:
+    - Ruby version
+    - Dandruff version
+    - Minimal reproduction code
+    - Expected vs. actual behavior
+ 3. **Security issues**: Email security@kuyio.com instead of filing public issues
+
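The version details in the checklist above can be gathered with a few lines of Ruby. This is a minimal sketch: `environment_report` is a hypothetical helper, and it relies only on RubyGems' `Gem.loaded_specs`, which returns `nil` for a gem that has not been activated in the current process.

```ruby
# Hypothetical helper: collects the version details the checklist asks for.
def environment_report
  spec = Gem.loaded_specs['dandruff'] # nil unless the gem has been activated
  [
    "Ruby version: #{RUBY_VERSION} (#{RUBY_PLATFORM})",
    "Dandruff version: #{spec ? spec.version : 'not loaded'}"
  ].join("\n")
end

puts environment_report
```

Run it in the same process (for example a console session) where Dandruff is loaded so the gem version is picked up, then paste the output into the issue.
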
+ ## 📄 License
+
+ This gem is available as open source under the terms of the **MIT License**.
+
+ ## 🙏 Acknowledgments
+
+ Originally inspired by the excellent [DOMPurify](https://github.com/cure53/DOMPurify) JavaScript library by Cure53 and contributors. Dandruff brings DOMPurify's battle-tested security model to the Ruby ecosystem with an idiomatic Ruby API.
+
+ Special thanks to all [contributors](https://github.com/kuyio/dandruff/graphs/contributors) who have helped make Dandruff better!
+
+ ---
+
+ **Made with ❤️ in Ottawa, Canada 🇨🇦** • [GitHub](https://github.com/kuyio/dandruff) • [Documentation](https://rubydoc.info/gems/dandruff)