inkmark 0.1.0

data/README.md ADDED
@@ -0,0 +1,1166 @@
1
+ # Inkmark
2
+
3
+ A very fast, feature-packed, AI-first markdown gem for Ruby.
4
+
5
+ [![GitHub Release](https://img.shields.io/github/v/release/yaroslav/inkmark)](https://github.com/yaroslav/inkmark/releases)
6
+ [![Docs](https://img.shields.io/badge/yard-docs-blue.svg)](https://rubydoc.info/gems/inkmark)
7
+
8
+ <div align="center">
9
+ <img src="https://raw.githubusercontent.com/yaroslav/inkmark/refs/heads/main/assets/images/inky.png" width="400" height="400" alt="Inky">
10
+ </div>
11
+
12
+ - **Very fast**. Up to 1.3× faster than redcarpet _(not CommonMark-conformant)_, about 3×–9× faster than other Ruby Markdown gems with native extensions. Built with Rust, based on [pulldown-cmark](https://github.com/pulldown-cmark/pulldown-cmark), uses SIMD.
13
+ - **No surprises**. CommonMark + GitHub Flavored Markdown conformance.
14
+ - **"Batteries included" approach**. Build lots of useful features, make them easy to use and as fast as possible.
15
+ - **Easy to use**. As simple as a one-method API. Pass options inline as a hash, set them one by one, or set default options for the entire application.
16
+ - **Feature-packed**. Server-side syntax highlighting with themes, frontmatter support, table of contents in Markdown and HTML, plain text export, extraction of headings/links/images, statistics (character and word count, likely document language, block counts), lazy image loading attributes, emoji shortcodes, autolinks, heading IDs with Unicode-transliterated slugs, wikilinks, footnotes, tables, task lists, smart punctuation, hard wraps, "nofollow/noopener" on external links.
17
+ - **AI-first**. Two chunking primitives: heading-based with breadcrumbs and per-chunk character/word counts, and sliding-window with overlap for size-bounded chunks where headings are absent or uneven. Block-aware or word-aware truncation for context-window budgeting. Markdown-to-Markdown pipeline. Plain-text extraction for embedding models. Structured extraction of headings, images, links, code blocks—each carrying byte ranges back into the source.
18
+ - **Security conscious**. Raw HTML denied by default. Hostname and URL-scheme allowlists for both links and images. GFM tagfilter for dangerous tags. A Rust-backed gem.
19
+ - **Easy extension API**. Hook any element with a Ruby block—no subclassing, no intermediate AST, no HTML post-processing. Rewrite URLs, swap code blocks for your own renderer, drop subtrees, or just walk the document for analysis. Handlers fire inside the single-pass parser, so extension costs essentially nothing beyond the render itself—and far less than regexing over output HTML.
20
+
21
+ ## Contents
22
+
23
+ - [Installation](#installation)
24
+ - [Quick start](#quick-start)
25
+ - [Presets](#presets)
26
+ - [Options](#options)
27
+ - [Raw HTML](#raw-html)
28
+ - [Host allowlists](#host-allowlists)
29
+ - [URL scheme filtering](#url-scheme-filtering)
30
+ - [Statistics and extraction](#statistics-and-extraction)
31
+ - [Chunks extraction (for RAG)](#chunks-extraction-for-rag)
32
+ - [Truncation](#truncation)
33
+ - [Plain-text extraction](#plain-text-extraction)
34
+ - [Markdown-to-Markdown pipeline](#markdown-to-markdown-pipeline)
35
+ - [Event handlers](#event-handlers)
36
+ - [Benchmarks](#benchmarks)
37
+ - [Contributing](#contributing)
38
+ - [Acknowledgements](#acknowledgements)
39
+ - [License](#license)
40
+
41
+ ## Installation
42
+
43
+     bundle add inkmark
44
+
45
+ Or in your `Gemfile`:
46
+
47
+ ```ruby
48
+ gem "inkmark"
49
+ ```
50
+
51
+ Ruby 3.3+ is supported.
52
+
53
+ ## Quick start
54
+
55
+ ```ruby
56
+ require "inkmark"
57
+
58
+ # Class-method shortcut
59
+ Inkmark.to_html("**hello**")
60
+ # => "<p><strong>hello</strong></p>\n"
61
+
62
+ # Instance form
63
+ Inkmark.new("# Hello").to_html
64
+
65
+ # With options
66
+ Inkmark.to_html("hi <em>there</em>", options: { raw_html: true })
67
+
68
+ # Mutable options via accessor
69
+ g = Inkmark.new("# Table\n\n| a | b |\n|---|---|\n| 1 | 2 |")
70
+ g.options.tables = false
71
+ g.to_html # tables render as paragraphs now
72
+ ```
73
+
74
+ ## Presets
75
+
76
+ Inkmark ships presets as opinionated shortcuts for common
77
+ rendering profiles. Pass one via `preset:` in the options hash; every
78
+ other option in the hash overrides the preset's values (deep-merging
79
+ for nested element-policy hashes). You can, and are encouraged to, override preset options as you see fit.
80
+
81
+ - **`:recommended`**: a curated profile for modern web content. On
82
+ top of GFM, enables smart punctuation, auto heading IDs, lazy-loading
83
+ images with an `http`/`https` scheme allowlist, autolinks,
84
+ `rel="nofollow noopener"` on external links, a scheme allowlist for
85
+ link destinations, emoji shortcodes, syntax highlighting, hard wraps,
86
+ and frontmatter parsing.
87
+
88
+ **This is a good starting point for most apps**. Still, you are expected to
89
+ override individual options to match your specific needs (e.g. adding statistics and table of contents, tightening link/image allowlists to your own hostnames, turning off features you don't want).
90
+
91
+ - **`:trusted`**: `:recommended` plus raw HTML pass-through.
92
+ **Dangerous.** Intended only for content you fully trust: internal,
93
+ team-authored. With raw HTML on, Inkmark does no sanitization beyond
94
+ the narrow GFM tagfilter (turn it off at your own risk); the caller is
95
+ responsible for output safety. Do not apply this preset to anything a user can influence, directly or indirectly.
96
+
97
+ - **`:gfm`**: the bare default. CommonMark plus the core GFM extensions
98
+ (tables, strikethrough, tasklists, footnotes, tagfilter). Strict,
99
+ conservative, and matches the render profile of every other major
100
+ GFM engine. Everything else is off.
101
+
102
+ - **`:commonmark`**: the minimum. Strict CommonMark. No GFM extensions, no
103
+ typographics, nothing opinionated.
104
+
105
+ ```ruby
106
+ # Recommended profile
107
+ Inkmark.to_html(md, options: { preset: :recommended })
108
+
109
+ # Recommended profile with stats and table of contents
110
+ Inkmark.to_html(md, options: { preset: :recommended, statistics: true, toc: true })
111
+
112
+ # Recommended profile, but disable smart punctuation
113
+ Inkmark.to_html(md, options: { preset: :recommended, smart_punctuation: false })
114
+
115
+ # Just GFM (the default)
116
+ Inkmark.to_html(md)
117
+ Inkmark.to_html(md, options: { preset: :gfm }) # equivalent
118
+
119
+ # Recommended profile with a tightened link-host allowlist
120
+ Inkmark.to_html(md, options: {
121
+ preset: :recommended,
122
+ links: { allowed_hosts: ["*.example.com"] }
123
+ })
124
+
125
+ # Trusted content (raw HTML passes through—use with care)
126
+ Inkmark.to_html(internal_doc, options: { preset: :trusted })
127
+ ```
128
+
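To make the deep-merge rule for presets concrete, here is a minimal plain-Ruby sketch of how a nested override could combine with a preset. The `deep_merge` helper and the `preset` values are hypothetical stand-ins chosen for illustration; they are not Inkmark's internals, nor the real contents of `:recommended`.

```ruby
# Hypothetical helper: merge nested policy hashes key by key.
def deep_merge(base, overrides)
  base.merge(overrides) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value) # recurse into nested policy hashes
    else
      new_value                        # scalar override wins
    end
  end
end

# Made-up preset values, for illustration only.
preset = {
  smart_punctuation: true,
  links: { autolink: true, nofollow: true, allowed_hosts: nil }
}
user = {
  smart_punctuation: false,
  links: { allowed_hosts: ["*.example.com"] }
}

merged = deep_merge(preset, user)
```

Only the sub-key that was passed changes: in this sketch `merged[:links]` keeps `autolink` and `nofollow` from the preset and picks up the new `allowed_hosts`.
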
129
+ ## Options
130
+
131
+ GFM extensions are on by default; raw HTML rendering is off by default.
132
+ Pass a hash to `Inkmark.to_html` / `Inkmark.new`, or mutate an `Inkmark::Options`
133
+ instance via its accessors.
134
+
135
+ | Key | Default | Description |
136
+ |---|---|---|
137
+ | `gfm` | `true` | GFM conformance mode + tables, strikethrough, tasklists, and footnotes. |
138
+ | `gfm_tag_filter` | `true` | GFM "Disallowed Raw HTML" extension. When `gfm` and `raw_html` are both true, escapes nine predefined tags (`title`, `textarea`, `style`, `xmp`, `iframe`, `noembed`, `noframes`, `script`, `plaintext`). No effect when `raw_html: false`. |
139
+ | `tables` | `true` | GFM pipe tables with optional column alignment markers (`:---`, `:---:`, `---:`). |
140
+ | `strikethrough` | `true` | `~~text~~` renders as `<del>text</del>`. |
141
+ | `tasklists` | `true` | `- [ ]` and `- [x]` render as disabled checkboxes. |
142
+ | `footnotes` | `true` | `text[^1]` + `[^1]: body` renders as superscript links and footnote block. |
143
+ | `raw_html` | `false` | Pass raw HTML through unescaped. Off by default for untrusted-input safety. **When enabled, the caller is fully responsible for sanitizing the output—see the [Raw HTML](#raw-html) section.** |
144
+ | `smart_punctuation` | `false` | Convert `"..."` → `“...”`, `...` → `…`, `--` → `–`, `---` → `—`. |
145
+ | `headings` | `{ attributes: false, ids: false }` | Heading-related policy. `:attributes` enables `# Heading {#id .klass}` Markdown inline attribute syntax; `:ids` auto-generates `id="slug"` on every heading from its text, with automatic Unicode transliteration of non-English headings (duplicates get a counter suffix; user-supplied ids from `:attributes` win). Deep-merges over defaults—pass only the sub-keys you care about. |
146
+ | `images` | `{ lazy: false, allowed_hosts: nil, allowed_schemes: nil }` | Image-related policy. `:lazy` adds `loading="lazy" decoding="async"` to every `<img>`. `:allowed_hosts` is a glob allowlist for `<img src>` hostnames (see examples; non-matching images drop to alt text). `:allowed_schemes` is a URL-scheme allowlist, typically `["http", "https"]` to block `data:` image URIs. Both allowlists default to `nil` (no filtering); an empty array `[]` denies every external image. Deep-merges. |
147
+ | `links` | `{ autolink: false, nofollow: false, allowed_hosts: nil, allowed_schemes: nil }` | Link-related policy. `:autolink` auto-links bare URLs and emails with correct boundary detection. `:nofollow` adds `rel="nofollow noopener"` to external `<a>` tags. `:allowed_hosts` / `:allowed_schemes` are glob / scheme allowlists for `<a href>` (relative/anchor/mailto URLs are never filtered). Non-matching links unwrap to plain text. Deep-merges. |
148
+ | `emoji_shortcodes` | `false` | Replace gemoji-style `:shortcode:` sequences with their emoji character (`:rocket:` → 🚀). Unknown codes and codes inside code blocks are preserved. |
149
+ | `syntax_highlight` | `false` | Server-side syntax highlighting for fenced code blocks with a language tag. Uses the `syntect` Rust crate with CSS class output. Batteries included: pair with CSS from `Inkmark.highlight_css` for the theme stylesheet. |
150
+ | `hard_wrap` | `false` | Treat every single newline as a hard line break (`<br />`). By default a bare `\n` is a soft break rendered as a space. Enable for one-sentence-per-line content or when migrating from renderers that default to hard wraps. |
151
+ | `toc` | `false` | Collect a table of contents from headings. Accepts `true` / `false` for simple enable/disable, or a Hash like `toc: { depth: 3 }` to limit which heading levels appear in the rendered TOC (h1–h3 in that example; default is no limit). Enables `Inkmark#toc`, which returns an `Inkmark::Toc` value object (`#to_markdown` / `#to_html` / `#to_s`). Implicitly enables `headings: { ids: true }`. Also populates a lightweight `Inkmark#statistics` with `heading_count`. Depth affects only the rendered TOC; `heading_count`, `extracts[:headings]`, and `chunks_by_heading` still see every heading. |
152
+ | `statistics` | `false` | Collect scalar document statistics during parsing: language detection, character/word counts, and `*_count` fields for headings, code blocks, images, links, and footnote definitions. See examples. For structured arrays of records, use `extract`. Implies `toc` and `headings: { ids: true }`. |
153
+ | `extract` | `nil` | Hash opting into structured extraction of specific element kinds. Keys: `:images`, `:links`, `:code_blocks`, `:headings`, `:footnote_definitions`—each `true`/`false`. When set, `Inkmark#extracts` returns a Hash keyed by the requested kinds, each with an Array of record Hashes including a `:byte_range`. `extract: { headings: true }` and `toc: true` trigger each other—one heading walk powers both surfaces. |
154
+ | `math` | `false` | Recognize `$inline$` and `$$display$$` math blocks. |
155
+ | `definition_list` | `false` | `term\n: definition` renders as `<dl>`. |
156
+ | `superscript` | `false` | `^text^` renders as `<sup>`. |
157
+ | `subscript` | `false` | `~text~` renders as `<sub>`. Conflicts with strikethrough—enable only one. |
158
+ | `wikilinks` | `false` | `[[Page]]` and `[[Page\|label]]` render as links. |
159
+ | `frontmatter` | `false` | Frontmatter (YAML metadata at the start of the document). Parsed and exposed via `Inkmark#frontmatter`; the block is stripped from rendered output. |
160
+
161
+ Options can be supplied either way:
162
+
163
+ ```ruby
164
+ # As a hash at construction
165
+ Inkmark.to_html(md, options: { math: true, tables: false })
166
+
167
+ # Via mutable accessor
168
+ g = Inkmark.new(md)
169
+ g.options.math = true
170
+ g.options.tables = false
171
+ g.to_html
172
+
173
+ # Process-level defaults, to set in your application initializer
174
+ Inkmark.default_options.math = true
175
+ Inkmark.new(md).to_html # picks up the default
176
+ ```
177
+
178
+ Unknown option keys raise `ArgumentError` immediately, including via the
179
+ hash form—typos fail loudly:
180
+
181
+ ```ruby
182
+ Inkmark.new("x", options: { taples: true })
183
+ # => ArgumentError: unknown Inkmark option: :taples
184
+ ```
185
+
186
+ ## Raw HTML
187
+
188
+ Raw HTML is suppressed by default. This is safe-by-default for rendering untrusted markdown:
189
+
190
+ ```ruby
191
+ Inkmark.to_html("<script>alert(1)</script>")
192
+ # => "<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>\n"
193
+ ```
194
+
195
+ Enable pass-through with `raw_html: true`; _only do this for trusted
196
+ input_:
197
+
198
+ ```ruby
199
+ Inkmark.to_html("<em>keep me</em>", options: { raw_html: true })
200
+ # => "<p><em>keep me</em></p>\n"
201
+ ```
202
+
203
+ > **Your responsibility.** With `raw_html: true` you are fully
204
+ > responsible for every `<tag>` that reaches the HTML output. Inkmark does not
205
+ > sanitize raw HTML beyond the narrow GFM tagfilter described below—it will
206
+ > happily emit `<img onerror="…">`, `<a href="javascript:…">`, `<style>`
207
+ > contents, and any other attack surface the source contains. Always pipe the
208
+ > output through a dedicated sanitizer (like [Loofah][] or
209
+ > [rails-html-sanitizer][]) before rendering untrusted content in a page.
210
+
211
+ [Loofah]: https://github.com/flavorjones/loofah
212
+ [rails-html-sanitizer]: https://github.com/rails/rails-html-sanitizer
213
+
214
+ Even with `raw_html: true`, the **GFM tagfilter** stays on by
215
+ default and escapes nine unsafe tag names—`title`, `textarea`, `style`, `xmp`,
216
+ `iframe`, `noembed`, `noframes`, `script`, `plaintext`. This is required for GFM conformance. Opt out with `gfm_tag_filter: false` (or `gfm: false`) if you need raw pass-through of those tags—trusted input only. The tagfilter is a narrow spec-compliance pass, **not** a sanitizer—the responsibility note above still applies in full.
217
+
218
+ ```ruby
219
+ Inkmark.to_html("<script>alert(1)</script>", options: { raw_html: true })
220
+ # => "<p>&lt;script>alert(1)&lt;/script></p>\n"
221
+ ```
222
+
223
+ ## Host allowlists
224
+
225
+ Restrict which hostnames can appear in links and images by passing glob
226
+ patterns. Disallowed links have their `<a>` tags stripped (the link text
227
+ stays); disallowed images drop to their alt text (or disappear when alt
228
+ is empty). Relative URLs, anchors, `mailto:`, and other non-web schemes
229
+ pass through unchanged—only `http://` / `https://` URLs are matched.
230
+
231
+ ```ruby
232
+ Inkmark.to_html(md, options: {
233
+ links: { allowed_hosts: ["example.com", "*.example.com"] },
234
+ images: { allowed_hosts: ["{cdn,static,img}.example.com"] }
235
+ })
236
+ ```
237
+
238
+ Patterns use glob syntax (same engine as `.gitignore`), **not regex**:
239
+
240
+ - `example.com`: exact host only
241
+ - `*.example.com`: any subdomain (matches `cdn.example.com`, `a.b.example.com`; does **not** match bare `example.com`)
242
+ - `{cdn,static}.example.com`: brace alternation for multiple explicit hosts
243
+ - `*.{example,trusted}.com`: combine wildcards and alternation
244
+
245
+ Hostnames are matched case-insensitively and ports are ignored. An empty
246
+ array `[]` blocks every external link or image while still allowing
247
+ relative URLs.
248
+
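The matching rules above can be approximated in plain Ruby. This is only a sketch: `host_allowed?` is a hypothetical helper, and Ruby's `File.fnmatch` stands in for the glob engine Inkmark actually uses on the Rust side, so edge cases may differ.

```ruby
require "uri"

# Hypothetical helper, not part of Inkmark's API.
def host_allowed?(url, patterns)
  uri = URI.parse(url)
  # Only http/https URLs are matched; relative URLs, anchors, mailto: pass through.
  return true unless %w[http https].include?(uri.scheme.to_s.downcase)

  host = uri.host.to_s.downcase # URI#host excludes the port
  flags = File::FNM_EXTGLOB | File::FNM_CASEFOLD # {a,b} alternation, case-insensitive
  patterns.any? { |pattern| File.fnmatch(pattern, host, flags) }
rescue URI::InvalidURIError
  false # treat unparseable URLs as disallowed in this sketch
end

host_allowed?("https://cdn.example.com/a.png", ["*.example.com"])
host_allowed?("https://example.com/", ["*.example.com"])
```

Under these assumptions the first call matches and the second does not, mirroring the note that `*.example.com` excludes the bare domain. An empty pattern list rejects every `http`/`https` URL while relative paths still pass.
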
249
+ ## URL scheme filtering
250
+
251
+ For rendering untrusted markdown, opt in to scheme allowlists to block
252
+ `javascript:`, `data:`, and other dangerous URL schemes in links and
253
+ images:
254
+
255
+ ```ruby
256
+ Inkmark.to_html(md, options: {
257
+ links: { allowed_schemes: ["http", "https", "mailto"] },
258
+ images: { allowed_schemes: ["http", "https"] }
259
+ })
260
+ ```
261
+
262
+ Disallowed links are unwrapped (text stays, `<a>` tags drop); disallowed
263
+ images drop to alt text. Relative paths, anchors, and protocol-relative
264
+ URLs pass through—no scheme to check.
265
+
266
+ ```ruby
267
+ opts = { links: { allowed_schemes: ["http", "https"] } }
268
+
269
+ Inkmark.to_html("[click](javascript:alert(1))", options: opts)
270
+ # => "<p>click</p>\n"
271
+
272
+ Inkmark.to_html("![pic](data:image/svg+xml,<svg/onload=evil()>)",
273
+ options: { images: { allowed_schemes: ["http", "https"] } })
274
+ # => "<p>pic</p>\n" # dropped to alt text
275
+ ```
276
+
277
+ **Scope:** scheme filtering applies to markdown-emitted links and images
278
+ (`[text](url)` / `![alt](url)`). Raw HTML `<a href>` / `<img src>` inside
279
+ `raw_html: true` content is *not* filtered—for that case use a
280
+ downstream HTML sanitizer like Loofah.
281
+
282
+ **Default:** filtering is off. Full CommonMark autolink conformance is
283
+ preserved (including uncommon schemes like `irc:` and `ftp:`). Add the
284
+ filter explicitly when rendering untrusted input.
285
+
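The pass-through rule described in this section (relative paths, anchors, and protocol-relative URLs have no scheme to check) can be sketched with a small scheme extractor. `SCHEME_RE` and `scheme_allowed?` are hypothetical names used for illustration, and the regular expression follows the RFC 3986 scheme grammar rather than Inkmark's actual parser.

```ruby
# RFC 3986 scheme: a letter followed by letters, digits, "+", "." or "-", then ":".
SCHEME_RE = /\A([A-Za-z][A-Za-z0-9+.\-]*):/

# Hypothetical helper, not part of Inkmark's API.
def scheme_allowed?(url, allowed_schemes)
  return true if allowed_schemes.nil? # nil means filtering is off

  match = SCHEME_RE.match(url)
  return true if match.nil?           # relative, anchor, or protocol-relative URL

  allowed_schemes.include?(match[1].downcase)
end

scheme_allowed?("javascript:alert(1)", %w[http https])
scheme_allowed?("//cdn.example.com/x.png", %w[http https])
```

In this sketch the `javascript:` URL is rejected and the protocol-relative URL passes because it carries no scheme.
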
286
+ ## Statistics and extraction
287
+
288
+ Inkmark collects document metadata as a side effect of the single render pass.
289
+ Two independent options control what's exposed:
290
+
291
+ - **`statistics: true`** populates `Inkmark#statistics` with scalar counts and
292
+ language detection—nothing you have to iterate.
293
+ - **`extract: { kind: true, ... }`** populates `Inkmark#extracts` with structured
294
+ arrays of records. Opt into only the kinds you need; unasked-for arrays are
295
+ never allocated.
296
+
297
+ ```ruby
298
+ md = Inkmark.new(source, options: {
299
+ statistics: true,
300
+ extract: {
301
+ images: true,
302
+ links: true,
303
+ code_blocks: true,
304
+ headings: true,
305
+ footnote_definitions: true
306
+ }
307
+ })
308
+ md.to_html
309
+
310
+ md.statistics
311
+ # => {
312
+ # heading_count: 2,
313
+ # likely_language: "eng",
314
+ # language_confidence: 0.93,
315
+ # character_count: 142,
316
+ # word_count: 28,
317
+ # code_block_count: 1,
318
+ # image_count: 1,
319
+ # link_count: 2,
320
+ # footnote_definition_count: 1,
321
+ # }
322
+
323
+ md.extracts[:code_blocks]
324
+ # => [{ lang: "ruby", source: "puts \"hello\"\n", byte_range: 78...101 }]
325
+
326
+ md.extracts[:headings]
327
+ # => [
328
+ # { level: 1, text: "Hello World", id: "hello-world", byte_range: 0...14 },
329
+ # { level: 2, text: "Code Example", id: "code-example", byte_range: 68...83 }
330
+ # ]
331
+ ```
332
+
333
+ ### Extract record shapes
334
+
335
+ | Kind | Fields |
336
+ |---------------------------|------------------------------------------------|
337
+ | `:images` | `src`, `alt`, `title`, `byte_range` |
338
+ | `:links` | `href`, `text`, `title`, `byte_range` |
339
+ | `:code_blocks` | `lang`, `source`, `byte_range` |
340
+ | `:headings` | `level`, `text`, `id`, `byte_range` |
341
+ | `:footnote_definitions` | `label`, `text`, `byte_range` |
342
+
343
+ `byte_range` is an exclusive `Range` (`start...end`) pointing into the original
344
+ source string—slice with `source.byteslice(r.begin, r.size)` to recover the
345
+ raw Markdown. `source` on `:code_blocks` is pulldown-cmark's pre-filter code
346
+ content, so enabling `syntax_highlight: true` does not mutate it.
347
+
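Because `byte_range` counts bytes rather than characters, slicing with `String#[]` gives the wrong span as soon as the source contains multibyte text. The sketch below uses a hand-written range (an assumption standing in for a value you would read from `extracts`) to show the difference.

```ruby
# "Привет" is 6 characters but 12 bytes in UTF-8, so byte and character
# offsets diverge from the first heading onward.
source = "Привет\n\n# Hello World\n"

# Hand-written stand-in for a :byte_range taken from an extract record.
range = 14...28

heading_md = source.byteslice(range.begin, range.size) # byte-based, correct
char_slice = source[range]                             # character-based, wrong here
```

Here `heading_md` is the heading line `"# Hello World\n"`, while `char_slice` starts partway through it.
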
348
+ ### Mutual trigger: `toc` ↔ `extract[:headings]`
349
+
350
+ One heading walk powers both the TOC renderer and the heading extract, so the
351
+ two options trigger each other. Enabling either gives you access to both
352
+ `Inkmark#toc` (with `#to_markdown` / `#to_html`) and `Inkmark#extracts[:headings]`.
353
+
354
+ ```ruby
355
+ Inkmark.new(source, options: { toc: true }).extracts[:headings]
356
+ # => [{ level: 1, text: "Hello World", id: "hello-world", byte_range: 0...14 }, ...]
357
+ ```
358
+
359
+ ## Chunks extraction (for RAG)
360
+
361
+ `Inkmark.chunks_by_heading` splits a document by heading into an ordered
362
+ Array of section Hashes. Each section's `:content` is **filter-applied
363
+ Markdown**—emoji expanded, URLs autolinked, allowlists applied—serialized
364
+ back through pulldown-cmark. Designed as the first stage of a
365
+ chunk → embed → retrieve pipeline.
366
+
367
+ ```ruby
368
+ sections = Inkmark.chunks_by_heading(readme)
369
+ sections.each do |s|
370
+ puts "#{'#' * s[:level]} #{s[:heading]} (#{s[:id]})"
371
+ puts s[:content]
372
+ end
373
+ ```
374
+
375
+ Each entry:
376
+
377
+ ```ruby
378
+ {
379
+ heading: "From source", # String, or nil for the preamble
380
+ level: 3, # 1-6, or 0 for the preamble
381
+ id: "from-source", # slug, or nil for the preamble
382
+ breadcrumb: ["Docs", "Installation"], # ancestor heading texts, root to parent
383
+ content: "Run `bundle install`...\n" # filter-applied Markdown
384
+ }
385
+ ```
386
+
387
+ Sections are **hierarchical**: a `##` section's `:content` includes any
388
+ nested `###` subsections, which also appear as their own entries. Content
389
+ before the first heading (if any) becomes a preamble entry with
390
+ `heading: nil` and `level: 0`.
391
+
392
+ `:breadcrumb` carries the ancestor heading texts from root to immediate
393
+ parent. Root-level sections and the preamble have an empty array. Skipped
394
+ levels are omitted, so a `###` directly under a `#` has `breadcrumb:
395
+ ["Top"]`, not `["Top", nil]`. RAG pipelines typically prepend the
396
+ breadcrumb to each chunk before embedding—it gives the vector model a
397
+ cheap signal about the chunk's place in the document; the `embed_and_store` loop after the statistics example below shows the pattern.
398
+
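The breadcrumb rule (ancestors from root to immediate parent, skipped levels omitted) can be illustrated with a small stack walk over a flat heading list. This is a sketch of the described semantics, not Inkmark's implementation; `breadcrumbs` and the sample `headings` array are hypothetical.

```ruby
# Hypothetical helper: annotate a flat list of headings with breadcrumbs.
def breadcrumbs(headings)
  stack = [] # [level, text] pairs for the currently open ancestors
  headings.map do |heading|
    # Close every open heading at the same or a deeper level.
    stack.pop while stack.any? && stack.last[0] >= heading[:level]
    crumb = stack.map { |(_level, text)| text }
    stack.push([heading[:level], heading[:text]])
    heading.merge(breadcrumb: crumb)
  end
end

headings = [
  { level: 1, text: "Top" },
  { level: 3, text: "Deep" }, # skips level 2
  { level: 2, text: "Mid" },
  { level: 3, text: "Leaf" }
]

result = breadcrumbs(headings)
```

In this sketch `"Deep"` gets `["Top"]` rather than `["Top", nil]`, and `"Leaf"` gets `["Top", "Mid"]`.
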
399
+ Enable `statistics: true` to add `:character_count` and `:word_count` to
400
+ every section entry. Counts reflect the section's filter-applied text
401
+ content including any code-block bodies (code is content for embedding
402
+ purposes, not just prose). Numbers across sections won't sum to the
403
+ document total because sections overlap hierarchically—a parent section's
404
+ count includes its nested subsections.
405
+
406
+ ```ruby
407
+ Inkmark.chunks_by_heading(doc, options: {statistics: true})
408
+ # => [
409
+ # { heading: "Installation", level: 2, id: "installation",
410
+ # breadcrumb: ["Intro"],
411
+ # character_count: 180, word_count: 32,
412
+ # content: "..." },
413
+ # ...
414
+ # ]
415
+ ```
416
+
417
+ ```ruby
418
+ Inkmark.chunks_by_heading(readme).each do |s|
419
+ next if s[:heading].nil? # skip preamble
420
+ context = (s[:breadcrumb] + [s[:heading]]).join(" > ")
421
+ embed_and_store("#{context}\n\n#{s[:content]}", metadata: {id: s[:id]})
422
+ end
423
+ ```
424
+
425
+ ### Picking specific sections
426
+
427
+ `chunks_by_heading` always returns the full array. Use plain `Enumerable`
428
+ to slice it however you need:
429
+
430
+ ```ruby
431
+ sections = Inkmark.chunks_by_heading(readme)
432
+
433
+ # Find one by heading text
434
+ sections.find { |s| s[:heading] == "Installation" }
435
+
436
+ # Filter by regexp
437
+ sections.select { |s| s[:heading]&.match?(/install|usage/i) }
438
+
439
+ # All top-level headings only
440
+ sections.select { |s| s[:level] == 1 }
441
+
442
+ # Skip the preamble
443
+ sections.reject { |s| s[:heading].nil? }
444
+ ```
445
+
446
+ No filter kwarg on the method—`.select` / `.find` / `.reject` already
447
+ cover every filtering shape, and you can compose conditions freely
448
+ (heading AND level, or heading NOT in a blocklist, etc.). The preamble
449
+ is a regular entry with `heading: nil` and falls out of Regexp/String
450
+ filters naturally (`nil == "Foo"` is false; `nil&.match?(x)` is nil).
451
+
452
+ ### RAG pipeline caveat: HTML-emitting filters
453
+
454
+ **Disable `syntax_highlight`, `images: { lazy: true }`, and `links: { nofollow: true }`
455
+ when chunking for RAG.** These filters embed raw `<pre>…`, `<img loading=…>`,
456
+ and `<a rel=…>` HTML into the serialized Markdown; the HTML noise hurts
457
+ embedding quality for downstream semantic search.
458
+
459
+ ```ruby
460
+ sections = Inkmark.chunks_by_heading(doc, options: {
461
+ emoji_shortcodes: true, # keep—improves semantic signal
462
+ links: {
463
+ autolink: true, # keep—proper anchor markdown
464
+ allowed_schemes: %w[http https mailto], # keep—safe URLs
465
+ nofollow: false # off—would embed <a rel=...> HTML
466
+ },
467
+ images: { lazy: false }, # off—would embed <img loading=...> HTML
468
+ syntax_highlight: false # off—would embed <pre><span...> HTML
469
+ })
470
+ ```
471
+
472
+ ### Scope
473
+
474
+ `chunks_by_heading` is a **structural chunking primitive**, not a
475
+ complete RAG chunker. It splits a document along heading boundaries.
476
+ For documents without headings—or when you need a strict size
477
+ budget regardless of document structure—reach for
478
+ [`chunks_by_size`](#sliding-window-chunking) below.
479
+
480
+ Inkmark does not ship token-based budgeting (there is no embedded
481
+ tokenizer). Use `character_count` / `word_count` or your own tokenizer
482
+ to approximate. Prepending document titles or parent-heading
483
+ breadcrumbs to each chunk is a few lines of Ruby on top of the array
484
+ this method returns.
485
+
486
+ ### Sliding-window chunking
487
+
488
+ `Inkmark.chunks_by_size` splits a document into fixed-size chunks with
489
+ optional overlap, walking the filter-applied Markdown sequentially. Use
490
+ this when headings are absent or uneven, or when you need a strict
491
+ size budget for embedding input.
492
+
493
+ ```ruby
494
+ # Char-budgeted windows with overlap
495
+ Inkmark.chunks_by_size(doc, chars: 500, overlap: 50)
496
+
497
+ # Word budget, word-boundary cuts
498
+ Inkmark.chunks_by_size(doc, words: 120, overlap: 15, at: :word)
499
+
500
+ # Dual budget: cut at whichever is reached first
501
+ Inkmark.chunks_by_size(doc, chars: 1000, words: 200)
502
+ ```
503
+
504
+ Each window:
505
+
506
+ ```ruby
507
+ {
508
+ index: 0, # 0-based sequence position
509
+ content: "..." # filter-applied Markdown slice
510
+ # character_count, word_count added when options: { statistics: true }
511
+ }
512
+ ```
513
+
514
+ **Boundary modes.** `at: :block` (default) cuts only between top-level
515
+ Markdown blocks—output stays valid Markdown, and a single block that
516
+ exceeds the budget is emitted as its own window rather than silently
517
+ dropped. `at: :word` serializes the full filtered Markdown and cuts at
518
+ the last Unicode word boundary that fits—tighter fit but may split
519
+ open constructs.
520
+
521
+ **Overlap.** Measured in chars. Each new window begins with the
522
+ trailing `overlap` chars of the previous window, so adjacent chunks
523
+ share context, which helps retrieval when a relevant passage straddles a cut.
524
+ Must be less than `chars:` when both are set.
525
+
526
+ **Validation.** `chars` or `words` required (at least one). Both must
527
+ be positive. `overlap` defaults to 0, must be non-negative, and must be
528
+ less than `chars` when `chars` is set. Invalid combinations raise
529
+ `ArgumentError` at the Ruby boundary—silent clamping would mask
530
+ bugs like swapped args.
531
+
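The overlap and validation rules can be sketched over a plain string. This simplified version cuts at raw character offsets and ignores block and word boundaries, so it is not equivalent to either `at:` mode; `char_windows` is a hypothetical helper, not Inkmark's API.

```ruby
# Hypothetical helper: fixed-size character windows with overlap.
def char_windows(text, chars:, overlap: 0)
  raise ArgumentError, "chars must be positive" unless chars.positive?
  raise ArgumentError, "overlap must be non-negative" if overlap.negative?
  raise ArgumentError, "overlap must be less than chars" unless overlap < chars

  step = chars - overlap # how far each new window advances
  windows = []
  start = 0
  while start < text.length
    windows << { index: windows.size, content: text[start, chars] }
    break if start + chars >= text.length # this window reached the end
    start += step
  end
  windows
end

windows = char_windows("abcdefghij", chars: 4, overlap: 1)
```

With these inputs the sketch yields `"abcd"`, `"defg"`, `"ghij"`: each window begins with the last character of the previous one. Requiring `overlap < chars` keeps `step` positive, which is what guarantees the loop terminates.
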
532
+ #### Heading vs size: which to use
533
+
534
+ `chunks_by_heading` for docs where headings encode meaningful
535
+ structure (articles, specs, READMEs). Each chunk carries heading,
536
+ level, id, and breadcrumb metadata—retrieval benefits from that
537
+ context.
538
+
539
+ `chunks_by_size` for unstructured or uneven-heading docs, or when a
540
+ hard size ceiling matters more than document structure. No structural
541
+ metadata; windows are just positioned slices.
542
+
543
+ You can compose them for a hybrid "heading-based, but size-capped"
544
+ pattern:
545
+
546
+ ```ruby
547
+ Inkmark.chunks_by_heading(doc).flat_map do |c|
548
+ if c[:content].size > 2000
549
+ Inkmark.chunks_by_size(c[:content], chars: 500, overlap: 50)
550
+ else
551
+ [c]
552
+ end
553
+ end
554
+ ```
555
+
556
+ ## Truncation
557
+
558
+ `Inkmark.truncate_markdown` caps a document at a character and/or word
559
+ budget, cutting at either a Markdown block boundary (valid structure) or
560
+ a Unicode word boundary (tighter fit, may split an open construct).
561
+ Designed for LLM context-window budgeting and RAG chunk normalization.
562
+
563
+ ```ruby
564
+ # Block-boundary cut: last complete block that fits, output is valid Markdown
565
+ Inkmark.truncate_markdown(doc, chars: 4000, at: :block)
566
+
567
+ # Word-boundary cut: last word that fits, output may split open constructs
568
+ Inkmark.truncate_markdown(doc, chars: 4000, at: :word)
569
+
570
+ # Dual budget: cut at whichever limit is hit first
571
+ Inkmark.truncate_markdown(doc, chars: 4000, words: 500, at: :word)
572
+
573
+ # Suppress the marker
574
+ Inkmark.truncate_markdown(doc, chars: 4000, at: :block, marker: nil)
575
+
576
+ # Custom marker
577
+ Inkmark.truncate_markdown(doc, chars: 4000, at: :block, marker: "[…]")
578
+ ```
579
+
580
+ Default marker is `"…"`. When appended, it counts toward the
581
+ budget—`chars: 4000` always yields output ≤ 4000 codepoints.
582
+
583
+ **Behavior:**
584
+
585
+ - **Source fits the budget**: returned unchanged (no marker).
586
+ - **First block alone exceeds the budget** (block mode): empty string.
587
+ This reflects that no block fits; fall back to word-mode truncation if you
587
+ want a best-effort cut.
589
+ - **Marker too large for the budget**: raises `ArgumentError`.
590
+ - **Filter pipeline**: `emoji_shortcodes`, `links: { autolink: true }`, host/scheme
591
+ allowlists etc. run before truncation, so the measured output matches
592
+ what downstream tools consume.
593
+
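The marker accounting in the behavior list can be sketched as follows. This is a simplified illustration that splits on whitespace instead of Unicode word boundaries and skips the filter pipeline; `truncate_at_word` is a hypothetical helper, not the gem's implementation.

```ruby
# Hypothetical helper: word-boundary truncation where the marker counts
# toward the character budget.
def truncate_at_word(text, chars:, marker: "…")
  return text if text.length <= chars # source fits: unchanged, no marker

  marker = marker.to_s # nil suppresses the marker
  raise ArgumentError, "marker does not fit in the budget" if marker.length > chars

  budget = chars - marker.length
  cut = text[0, budget]
  # If the cut lands mid-word, back up to the previous whitespace.
  unless text[budget].match?(/\s/)
    last_space = cut.rindex(/\s/)
    cut = last_space ? cut[0, last_space] : ""
  end
  cut.rstrip + marker
end

short = truncate_at_word("the quick brown fox", chars: 12)
```

Here `short` is `"the quick…"`: 9 characters of text plus the 1-character marker, which stays within the 12-character budget.
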
594
+ ### Per-section truncation
595
+
596
+ `chunks_by_heading` accepts a `truncate:` kwarg that applies the same
597
+ contract to every section's `:content` independently:
598
+
599
+ ```ruby
600
+ Inkmark.chunks_by_heading(doc, truncate: {chars: 500, at: :block})
601
+ ```
602
+
603
+ Each section's content is cut to the 500-char budget; metadata
604
+ (`:heading`, `:level`, `:id`, `:breadcrumb`) stays intact. When
605
+ `statistics: true` is also set, `:character_count` / `:word_count` are
606
+ recomputed against the truncated content.
607
+
608
+ ```ruby
609
+ Inkmark.chunks_by_heading(doc,
610
+ options: {statistics: true},
611
+ truncate: {chars: 500, at: :block, marker: "…"}
612
+ )
613
+ # => each entry: { heading:, level:, id:, breadcrumb:,
614
+ # character_count:, word_count:, content: (≤ 500 chars) }
615
+ ```
616
+
617
+ Because sections are hierarchical (a parent section's `:content`
618
+ includes nested subsections), applying the same budget to every entry
619
+ means each chunk stands alone as a self-contained, budget-capped unit.
620
+
621
+ ## Plain-text extraction
622
+
623
+ `Inkmark#to_plain_text` strips all Markdown syntax and returns inline content as
624
+ plain text. Designed for embedding models, token counting, LLM input, and any
625
+ downstream consumer that treats Markdown formatting as noise.
626
+
627
+ ```ruby
628
+ Inkmark.to_plain_text("**bold** and [a link](https://example.com)")
629
+ # => "bold and a link (https://example.com)\n"
630
+
631
+ g = Inkmark.new(source, options: { emoji_shortcodes: true, links: { autolink: true } })
632
+ g.to_plain_text
633
+ ```
634
+
635
+ The same event-level filters (emoji replacement, autolink, host/scheme
636
+ allowlists, etc.) run before plain-text serialization, so preprocessing passes
637
+ apply consistently across `to_html`, `to_markdown`, and `to_plain_text`.
638
+
639
+ ### Output grammar
640
+
641
+ | Element | Plain-text form |
642
+ |---|---|
643
+ | `**bold**`, `*italic*`, `~~strike~~` | inner text only |
644
+ | `` `code` `` | inner text (no backticks) |
645
+ | `[text](url)` | `text (url)` |
646
+ | `<https://x.com>` (autolink) | `https://x.com` (collapses when text == url) |
647
+ | `![alt](src)` | `alt (src)` |
648
+ | `# Heading` | plain text with blank line above/below |
649
+ | `> quote` | every line prefixed with `> ` (email-style; nests) |
650
+ | `- item` / `1. item` | `- ` / `1. ` bullets; 2-space indent per nesting |
651
+ | `- [x] task` | `- task` (checkbox dropped) |
652
+ | tables | header row `\t`-joined, blank line, body rows `\t`-joined |
653
+ | ``` ```code``` ``` | raw content, blank line above/below |
654
+ | `---` | `---` surrounded by blank lines |
655
+ | `[^foo]` | `[foo]` |
656
+ | `[^foo]: body` | appended at document end as `[foo]: body` |
657
+ | soft break | space |
658
+ | hard break | `\n` |
659
+ | raw HTML | stripped by default; passes through when `raw_html: true` |
660
+
661
+ Blank lines inside a blockquote emit a bare `>` marker (matching email quoting
662
+ conventions; no trailing whitespace).
663
+
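Two rows of the grammar above, blockquotes and tables, can be mimicked in plain Ruby to make the shape of the output concrete. This is only a sketch: `quote_plain` and `table_plain` are hypothetical helpers, not Inkmark methods.

```ruby
# Hypothetical helpers mirroring two rows of the output grammar; not Inkmark code.

# Prefix every line with "> "; blank lines become a bare ">" (no trailing space).
def quote_plain(text)
  text.each_line(chomp: true).map { |line| line.empty? ? ">" : "> #{line}" }.join("\n")
end

# Header row tab-joined, then a blank line, then tab-joined body rows.
def table_plain(header, rows)
  ([header.join("\t"), ""] + rows.map { |row| row.join("\t") }).join("\n")
end

quote_plain("first\n\nsecond")
# => "> first\n>\n> second"
table_plain(%w[Name Size], [%w[a 1], %w[b 2]])
# => "Name\tSize\n\na\t1\nb\t2"
```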
664
+ ## Markdown-to-Markdown pipeline
665
+
666
+ `Inkmark#to_markdown` runs the same event-level filter pipeline as `to_html` and
667
+ serializes the result back to Markdown text. Use it as a preprocessing step in
668
+ pipelines that consume Markdown: LLM prompts, secondary renderers, content
669
+ storage, or any stage that needs clean Markdown rather than HTML.
670
+
671
+ ```ruby
672
+ # Class-method shortcut
673
+ Inkmark.to_markdown("**bold** :rocket:", options: { emoji_shortcodes: true })
674
+ # => "**bold** 🚀"
675
+
676
+ # Instance form—the same options object drives both outputs
677
+ g = Inkmark.new(source, options: {
678
+ emoji_shortcodes: true,
679
+ links: { allowed_hosts: ["trusted.com", "*.trusted.com"] }
680
+ })
681
+ g.to_markdown # filtered Markdown for pipeline
682
+ g.to_html # rendered HTML for display
683
+ ```
684
+
685
+ ### Choosing filters for a Markdown pipeline
686
+
687
+ Inkmark's filters fall into two groups depending on what they emit:
688
+
689
+ **Markdown-native filters** transform the event stream without producing HTML.
690
+ Their output is standard Markdown and is safe to pass to any downstream
691
+ consumer:
692
+
693
+ | Filter | Effect in `to_markdown` |
694
+ |---|---|
695
+ | `emoji_shortcodes` | `:rocket:` → `🚀` in the output text |
696
+ | `links: { autolink: true }` | bare `https://x.com` → `[https://x.com](https://x.com)` |
697
+ | `links: { allowed_hosts:, allowed_schemes: }` | disallowed links unwrapped to plain text |
698
+ | `images: { allowed_hosts:, allowed_schemes: }` | disallowed images dropped to alt text |
699
+ | `smart_punctuation` | `"..."` → `"…"` etc. (text-only transformation) |
700
+
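To make "disallowed links unwrapped to plain text" concrete, here is a rough pure-Ruby approximation. It is a sketch under stated assumptions, not how Inkmark works internally: it uses a regex instead of the parser's event stream, `host_allowed?` and `unwrap_disallowed_links` are hypothetical names, and glob-style matching via `File.fnmatch` is only a guess at how a pattern like `*.trusted.com` might be interpreted.

```ruby
require "uri"

# Assumed allowlist; the wildcard semantics here are an illustration only.
ALLOWED_HOSTS = ["trusted.com", "*.trusted.com"].freeze

def host_allowed?(url, patterns = ALLOWED_HOSTS)
  host = URI.parse(url).host
  return false if host.nil? # relative links have no host in this sketch

  patterns.any? { |pattern| File.fnmatch(pattern, host) }
rescue URI::InvalidURIError
  false
end

# Keep allowed inline links; replace disallowed ones with their link text.
# The lookbehind skips images (`![alt](src)`).
def unwrap_disallowed_links(markdown)
  markdown.gsub(/(?<!!)\[([^\]]*)\]\(([^)\s]+)\)/) do
    text = Regexp.last_match(1)
    url = Regexp.last_match(2)
    host_allowed?(url) ? "[#{text}](#{url})" : text
  end
end

unwrap_disallowed_links("[a](https://docs.trusted.com/x) and [b](https://evil.example/y)")
# => "[a](https://docs.trusted.com/x) and b"
```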
701
+ **HTML-emitting filters** synthesize raw `<...>` markup. When these are active
702
+ and you call `to_markdown`, that markup is embedded verbatim in the output. Raw
703
+ HTML blocks are valid CommonMark, but they may break or confuse downstream
704
+ consumers—especially LLMs and renderers that do not expect HTML inside
705
+ Markdown:
706
+
707
+ | Filter | What ends up in the Markdown |
708
+ |---|---|
709
+ | `syntax_highlight` | fenced code blocks become `<pre><code><span class=...>` HTML |
710
+ | `images: { lazy: true }` | images become `<img loading="lazy" decoding="async" ...>` HTML |
711
+ | `links: { nofollow: true }` | links become `<a rel="nofollow noopener" ...>` HTML |
712
+
713
+ **Recommendation:** disable HTML-emitting filters when calling `to_markdown`.
714
+ They are designed for final HTML output and produce hard-to-process markup in a
715
+ Markdown pipeline:
716
+
717
+ ```ruby
718
+ Inkmark.to_markdown(source, options: {
719
+ # Markdown-native—safe to enable
720
+ emoji_shortcodes: true,
721
+ links: { allowed_schemes: %w[http https mailto], nofollow: false },
722
+ images: { lazy: false },
723
+
724
+ # HTML-emitting—turn off for clean Markdown output
725
+ syntax_highlight: false, # would embed <pre><span...> blocks
726
+ })
727
+ ```
728
+
729
+ ## Event handlers
730
+
731
+ Register handlers with `#on` to inspect or transform document elements as they
732
+ are parsed. Handlers fire **post-order**—children before parents—so when a
733
+ `:table` handler runs, its rows and cells are already available. `#on` returns
734
+ `self`, so calls can be chained.
735
+
736
+ ```ruby
737
+ md = Inkmark.new(source)
738
+
739
+ md.on(:heading) { |h| ... }
740
+ .on(:image) { |img| ... }
741
+ .on(:link) { |l| ... }
742
+ ```
743
+
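Post-order firing can be pictured with a toy tree walker. The sketch below is plain Ruby and independent of Inkmark; `Node` and `walk_post_order` are hypothetical names used only to show why a `:table` handler sees its rows and cells first.

```ruby
# Toy tree node; stands in for a parsed element with children.
Node = Struct.new(:kind, :children)

# Visit all children first, then the node itself (post-order).
def walk_post_order(node, &handler)
  node.children.each { |child| walk_post_order(child, &handler) }
  handler.call(node)
end

table = Node.new(:table, [
  Node.new(:table_row, [Node.new(:table_cell, []), Node.new(:table_cell, [])])
])

order = []
walk_post_order(table) { |node| order << node.kind }
order
# => [:table_cell, :table_cell, :table_row, :table]
```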
744
+ Two entry points trigger handlers:
745
+
746
+ - **`#walk`**—fires handlers without producing HTML. Use it for analysis:
747
+ collecting specific elements, validating content, extracting structured data.
748
+ For built-in heading/link/image/word-count collection, see `statistics: true`.
749
+ - **`#to_html`**—fires handlers then renders. Mutations made inside a handler
750
+ change what ends up in the HTML.
751
+
752
+ ### Collecting data with `#walk`
753
+
754
+ ```ruby
755
+ # Check that every image has alt text
756
+ md = Inkmark.new(source)
757
+ missing_alt = []
758
+ md.on(:image) { |img| missing_alt << img.dest if img.text.empty? }
759
+ md.walk
760
+ raise "Images missing alt text: #{missing_alt.join(', ')}" if missing_alt.any?
761
+ ```
762
+
763
+ ```ruby
764
+ # Collect every fenced code block language used in the document
765
+ languages = Set.new # add `require "set"` on Ruby older than 3.2
766
+ md.on(:code_block) { |c| languages << c.lang if c.lang && !c.lang.empty? }
767
+ md.walk
768
+ ```
769
+
770
+ ```ruby
771
+ # Validate that no link points to a deprecated domain
772
+ deprecated = /old-docs\.example\.com/
773
+ md.on(:link) { |l| warn "Deprecated link: #{l.dest}" if l.dest =~ deprecated }
774
+ md.walk
775
+ ```
776
+
777
+ ### Rewriting output with `#to_html`
778
+
779
+ #### Image CDN rewriting
780
+
781
+ Set `dest=` to redirect images to a CDN. The change is reflected in the
782
+ rendered `<img src>`:
783
+
784
+ ```ruby
785
+ md = Inkmark.new(source)
786
+ md.on(:image) do |img|
787
+ img.dest = "https://cdn.example.net/#{File.basename(img.dest)}"
788
+ end
789
+ html = md.to_html
790
+ ```
791
+
792
+ #### Rewriting link destinations
793
+
794
+ ```ruby
795
+ md.on(:link) do |l| # needs `require "cgi/escape"` for CGI.escapeHTML
796
+ if l.dest.start_with?("http")
797
+ l.html = %(<a href="#{CGI.escapeHTML(l.dest)}" target="_blank" rel="noopener">#{CGI.escapeHTML(l.text)}</a>)
798
+ end
799
+ end
800
+ ```
801
+
802
+ #### Shifting heading levels
803
+
804
+ Bump every heading down one level so the document fits inside a layout that
805
+ reserves `<h1>` for the page title:
806
+
807
+ ```ruby
808
+ md = Inkmark.new(source)
809
+ md.on(:heading) { |h| h.level = [h.level + 1, 6].min }
810
+ html = md.to_html
811
+ ```
812
+
813
+ #### Custom code block rendering
814
+
815
+ Intercept fenced code blocks by language tag. Setting `html=` skips Inkmark's
816
+ default `<pre><code>` output—and the `syntax_highlight` filter, even if
817
+ enabled:
818
+
819
+ ```ruby
820
+ md = Inkmark.new(source)
821
+ md.on(:code_block) do |c|
822
+ case c.lang
823
+ when "mermaid"
824
+ c.html = %(<div class="mermaid">#{c.text}</div>\n)
825
+ when "math"
826
+ c.html = %(<div class="math">\\[#{c.text}\\]</div>\n)
827
+ end
828
+ end
829
+ html = md.to_html
830
+ ```
831
+
832
+ #### Custom directives in paragraphs
833
+
834
+ Match a special directive syntax and replace the paragraph with a component:
835
+
836
+ ```ruby
837
+ # Markdown:
838
+ # @available_since rails=7.1 ruby=3.2
839
+ #
840
+ md.on(:paragraph) do |p|
841
+ next unless p.text =~ /\A@available_since\s+(.+)\z/
842
+ attrs = $1.scan(/(\w+)=(\S+)/).map { |k, v| %( #{k}="#{v}") }.join
843
+ p.html = %(<AvailableSince#{attrs} />\n)
844
+ end
845
+ ```
846
+
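The string handling inside that handler can be tried on its own, without Inkmark. In this sketch `available_since_component` is a hypothetical wrapper around the same regex and `scan` logic, taking the paragraph text as a plain string:

```ruby
# Hypothetical standalone version of the directive parsing shown above.
# Returns the component markup, or nil when the text is not a directive.
def available_since_component(text)
  match = text.match(/\A@available_since\s+(.+)\z/)
  return nil unless match

  # Turn each key=value pair into an attribute: rails=7.1 -> rails="7.1"
  attrs = match[1].scan(/(\w+)=(\S+)/).map { |k, v| %( #{k}="#{v}") }.join
  %(<AvailableSince#{attrs} />\n)
end

available_since_component("@available_since rails=7.1 ruby=3.2")
# => "<AvailableSince rails=\"7.1\" ruby=\"3.2\" />\n"
available_since_component("An ordinary paragraph")
# => nil
```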
847
+ #### Replacing with Markdown
848
+
849
+ Use `markdown=` when the replacement is itself Markdown rather than raw HTML.
850
+ The replacement is parsed with the same options as the main document—emoji
851
+ expansion, heading IDs, raw HTML suppression—and is subject to the same
852
+ post-render filters (`syntax_highlight`, allowlists, `images: { lazy: true }`, `links: { nofollow: true }`).
853
+ Handlers do **not** fire on elements within the replacement.
854
+ `html=` takes priority when both are set on the same event.
855
+
856
+ ```ruby
857
+ md = Inkmark.new(source)
858
+ md.on(:paragraph) do |p|
859
+ if p.text.start_with?("@note ")
860
+ body = p.text.sub(/\A@note /, "")
861
+ p.markdown = "> **Note:** #{body}"
862
+ end
863
+ end
864
+ html = md.to_html
865
+ ```
866
+
867
+ #### Suppressing elements
868
+
869
+ Call `delete` on any event to omit it from the output. Children are suppressed
870
+ along with their parent:
871
+
872
+ ```ruby
873
+ md.on(:image) { |img| img.delete } # all images
874
+ md.on(:heading) { |h| h.delete if h.text.start_with?("INTERNAL:") } # by content
875
+ ```
876
+
877
+ #### Inline code annotation
878
+
879
+ `:code` fires for inline backtick spans. Use it to add links or decoration:
880
+
881
+ ```ruby
882
+ md.on(:code) do |c|
883
+ if c.text =~ /\A[A-Z][A-Za-z]+#\w+\z/ # e.g. String#split
884
+ c.html = %(<a href="/api/#{c.text.tr('#', '/')}"><code>#{c.text}</code></a>)
885
+ end
886
+ end
887
+ ```
888
+
889
+ ### Children and tree context
890
+
891
+ Container elements expose their child events (lazy, cached):
892
+
893
+ ```ruby
894
+ md.on(:table) do |t|
895
+ rows = t.children_of(:table_row)
896
+ rows.each_with_index do |row, i|
897
+ cells = row.children_of(:table_cell).map(&:text)
898
+ puts "Row #{i}: #{cells.join(' | ')}"
899
+ end
900
+ end
901
+ ```
902
+
903
+ Use `parent_kind` and `ancestor_kinds` for context-sensitive decisions:
904
+
905
+ ```ruby
906
+ # Skip decorative images that are already inside a link
907
+ md.on(:image) { |img| img.delete if img.ancestor_kinds.include?(:link) }
908
+
909
+ # Only process top-level paragraphs
910
+ md.on(:paragraph) { |p| next unless p.parent_kind.nil? }
911
+ ```
912
+
913
+ `depth` gives the nesting level (0 = top-level block):
914
+
915
+ ```ruby
916
+ md.on(:blockquote) { |b| puts "blockquote at depth #{b.depth}" }
917
+ md.on(:paragraph) { |p| puts "paragraph at depth #{p.depth}" }
918
+ # A paragraph inside a blockquote has depth 1.
919
+ ```
920
+
921
+ ### Source byte ranges
922
+
923
+ `byte_range` is an exclusive Ruby Range (`start...end`) that lets you slice the
924
+ original source to recover the raw Markdown for any element:
925
+
926
+ ```ruby
927
+ source = File.read("post.md")
928
+ md = Inkmark.new(source)
929
+ md.on(:heading) do |h|
930
+ puts "#{h.byte_range}: #{source.byteslice(h.byte_range).inspect}"
931
+ end
932
+ md.walk
933
+ ```
934
+
935
+ Populated for all container kinds and the leaf kinds `:code`, `:rule`,
936
+ `:inline_math`, `:display_math`. Returns `nil` for `:text`, `:soft_break`,
937
+ and `:hard_break`. Also `nil` for `:link` when `links: { autolink: true }` is enabled
938
+ (the autolink filter inserts new link events that would shift the offset queue).
939
+
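Because `byte_range` holds byte offsets, `String#[]` (which indexes by character) can return the wrong slice once the source contains multibyte characters, while `String#byteslice` slices by bytes. A minimal sketch, with a hand-computed range standing in for a real `byte_range` and a hypothetical `raw_markdown` helper:

```ruby
# Hypothetical helper: recover the raw Markdown for a byte range.
def raw_markdown(source, byte_range)
  source.byteslice(byte_range)
end

source = "# Café\n\nBody\n"
body_range = 9...13 # hand-computed byte offsets of "Body" ("é" is 2 bytes in UTF-8)

raw_markdown(source, body_range) # => "Body"
source[body_range]               # => "ody\n" (character indexing drifts by one)
```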
940
+ ### Event object reference
941
+
942
+ Every handler receives an `Inkmark::Event` with these fields and methods:
943
+
944
+ | Field / method | Type | Description |
945
+ |---|---|---|
946
+ | `kind` | `Symbol` | Element kind, e.g. `:heading`, `:image` |
947
+ | `text` | `String` | Plain text of all descendant text nodes |
948
+ | `depth` | `Integer` | Nesting depth; 0 = top-level block |
949
+ | `parent_kind` | `Symbol, nil` | Kind of the immediate parent, or `nil` at root |
950
+ | `ancestor_kinds` | `Array<Symbol>` | Ancestor kinds, nearest first |
951
+ | `byte_range` | `Range, nil` | Byte offsets in the original source string |
952
+ | `children` | `Array<Event>` | Direct child events (containers only) |
953
+ | `children_of(kind)` | `Array<Event>` | Children filtered by kind |
954
+ | `delete` |—| Suppress this element from output |
955
+ | `deleted?` | `Boolean` | True if `delete` was called |
956
+ | `html=` | `String, nil` | Replace output with a raw HTML string |
957
+ | `markdown=` | `String, nil` | Replace output by re-rendering a Markdown string |
958
+ | `dest=` | `String, nil` | Rewrite URL on `:link` / `:image` |
959
+ | `title=` | `String, nil` | Rewrite title attribute on `:link` / `:image` |
960
+ | `level=` | `Integer, nil` | Change heading level (1–6) on `:heading` |
961
+ | `id=` | `String, nil` | Change `id` attribute on `:heading` |
962
+
963
+ #### Per-kind field availability
964
+
965
+ **Container kinds**: handler fires after all children are processed:
966
+
967
+ | Kind | Readable | Mutable |
968
+ |---|---|---|
969
+ | `:heading` | `text`, `level`, `id` | `level=`, `id=`, `html=`, `markdown=` |
970
+ | `:paragraph` | `text` | `html=`, `markdown=` |
971
+ | `:blockquote` | `text` | `html=`, `markdown=` |
972
+ | `:list` |—| `html=`, `markdown=` |
973
+ | `:ordered_list` |—| `html=`, `markdown=` |
974
+ | `:list_item` | `text` | `html=`, `markdown=` |
975
+ | `:code_block` | `text`, `lang` | `html=`, `markdown=` |
976
+ | `:table` |—| `html=`, `markdown=` |
977
+ | `:table_head` |—| `html=`, `markdown=` |
978
+ | `:table_row` | `text` | `html=`, `markdown=` |
979
+ | `:table_cell` | `text` | `html=`, `markdown=` |
980
+ | `:emphasis` | `text` | `html=`, `markdown=` |
981
+ | `:strong` | `text` | `html=`, `markdown=` |
982
+ | `:strikethrough` | `text` | `html=`, `markdown=` |
983
+ | `:link` | `text`, `dest`, `title` | `dest=`, `title=`, `html=`, `markdown=` |
984
+ | `:image` | `text` (alt), `dest`, `title` | `dest=`, `title=`, `html=`, `markdown=` |
985
+ | `:footnote_definition` | `text` | `html=`, `markdown=` |
986
+
987
+ **Leaf kinds**: no children; handler fires on the event itself:
988
+
989
+ | Kind | Readable | Mutable |
990
+ |---|---|---|
991
+ | `:code` | `text` | `html=` |
992
+ | `:text` | `text` | `html=` |
993
+ | `:html` | `text` | `html=` |
994
+ | `:rule` |—| `html=` |
995
+ | `:soft_break` |—| `html=` |
996
+ | `:hard_break` |—| `html=` |
997
+ | `:footnote_reference` | `text` | `html=` |
998
+
999
+ All kinds expose `depth`, `parent_kind`, `ancestor_kinds`, `byte_range`,
1000
+ `children`, `children_of`, `delete`, `deleted?`.
1001
+
1002
+ `:code_block` `text` and `source` are identical—`source` is an alias for
1003
+ readability when treating the field as raw source code.
1004
+
1005
+ ### Filter interaction
1006
+
1007
+ Enrichment filters run **before** handlers. Handlers always see:
1008
+
1009
+ - Emoji already resolved (`emoji_shortcodes: true`)—`h.text` contains `"🚀"`,
1010
+ not `":rocket:"`
1011
+ - Bare URLs already autolinked (`links: { autolink: true }`)—they appear as `:link` events
1012
+ - Heading `id` already set (`headings: { ids: true }`)—`h.id` is populated
1013
+
1014
+ Post-render filters (`syntax_highlight`, allowlists, `images: { lazy: true }`,
1015
+ `links: { nofollow: true }`) run **after** handlers:
1016
+
1017
+ - `:code_block` events are still `:code_block`, not opaque HTML, even when
1018
+ `syntax_highlight: true`—setting `html=` on a code block overrides the
1019
+ highlighter
1020
+ - Handler-set `dest=` values pass through host and scheme allowlists
1021
+
1022
+ ## Benchmarks
1023
+
1024
+ Inkmark ships a benchmark harness comparing it against `kramdown`,
1025
+ `commonmarker`, `redcarpet`, `markly`, and `rdiscount` on a sweep of real
1026
+ markdown inputs.
1027
+
1028
+ Measuring apples to apples: every adapter is tuned for **feature parity** with
1029
+ Inkmark's defaults—CommonMark + core GFM (tables, strikethrough, tasklists,
1030
+ footnotes, tagfilter), no typographics, no autolink, no syntax highlighting,
1031
+ no heading-id slugging.
1032
+
1033
+ Run locally:
1034
+
1035
+ ```bash
1036
+ bundle config set with benchmark
1037
+ bundle install
1038
+ bundle exec rake benchmark
1039
+ ```
1040
+
1041
+ ### Assets
1042
+
1043
+ | Asset | Size | What it exercises |
1044
+ |---|---:|---|
1045
+ | `commonmark-spec` | 201.3 KB | CommonMark spec—code-block-heavy, edge-case-heavy |
1046
+ | `commonmarker-readme` | 17.0 KB | Real-world commonmarker README—options tables, fenced code |
1047
+ | `redcarpet-readme` | 14.0 KB | Real-world redcarpet README—prose + code samples |
1048
+ | `redcarpet-benchmark` | 8.0 KB | Classic redcarpet bench corpus—heavy emphasis / inline parsing |
1049
+ | `large-4k` | 3.7 KB | dotenv README—mixed prose, code blocks, tables |
1050
+ | `medium-1k` | 1.0 KB | Faraday README header—images, badges, inline links |
1051
+ | `small-512b` | 0.5 KB | Short README section with headings and bullet lists |
1052
+ | `tiny-256b` | 0.3 KB | 3-line CommonMark snippet—parser setup/overhead-bound |
1053
+
1054
+ See `benchmarks/NOTICE` for attribution on the vendored test inputs.
1055
+
1056
+ ### Results
1057
+
1058
+ Numbers below are from AWS EC2 `c7a.large` (AMD EPYC), Ruby 4.0.2 with YJIT on.
1059
+ Each engine uses its idiomatic "hot path"—Inkmark relies on its cached default
1060
+ options, Redcarpet reuses one pre-built `Markdown` object. Iterations per
1061
+ second, higher is better.
1062
+
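The "x slower" column is the fastest engine's iterations per second divided by each engine's own figure. A small sketch of that arithmetic, using three of the `commonmark-spec` numbers reported in this section (`slowdowns` is a hypothetical helper, not part of the benchmark harness):

```ruby
# Ratio of the fastest i/s to each engine's i/s, rounded to two decimals.
def slowdowns(results)
  fastest = results.values.max.to_f
  results.sort_by { |_, ips| -ips }.map do |name, ips|
    [name, (fastest / ips).round(2)]
  end.to_h
end

results = { "inkmark" => 1172, "redcarpet" => 908, "markly" => 453 }
slowdowns(results)
# inkmark: 1.0, redcarpet: 1.29, markly: 2.59
```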
1063
+ **`commonmark-spec` (201.3 KB)**
1064
+ ```
1065
+ inkmark: 1,172 i/s
1066
+ redcarpet: 908 i/s - 1.29x slower
1067
+ markly: 453 i/s - 2.59x slower
1068
+ commonmarker: 345 i/s - 3.40x slower
1069
+ rdiscount: 212 i/s - 5.53x slower
1070
+ kramdown: 26 i/s - 45.08x slower
1071
+ ```
1072
+
1073
+ **`commonmarker-readme` (16.9 KB)**
1074
+ ```
1075
+ inkmark: 16,658 i/s
1076
+ redcarpet: 12,988 i/s - 1.28x slower
1077
+ commonmarker: 4,268 i/s - 3.90x slower
1078
+ markly: 3,974 i/s - 4.19x slower
1079
+ rdiscount: 2,676 i/s - 6.22x slower
1080
+ kramdown: 113 i/s - 147.42x slower
1081
+ ```
1082
+
1083
+ **`redcarpet-readme` (14.0 KB)**
1084
+ ```
1085
+ inkmark: 17,343 i/s
1086
+ redcarpet: 13,587 i/s - 1.28x slower
1087
+ markly: 5,455 i/s - 3.18x slower
1088
+ commonmarker: 4,890 i/s - 3.55x slower
1089
+ rdiscount: 3,336 i/s - 5.20x slower
1090
+ kramdown: 208 i/s - 83.38x slower
1091
+ ```
1092
+
1093
+ **`redcarpet-benchmark` (8.0 KB)**
1094
+ ```
1095
+ inkmark: 27,634 i/s
1096
+ redcarpet: 23,777 i/s - 1.16x slower
1097
+ markly: 9,346 i/s - 2.96x slower
1098
+ commonmarker: 7,805 i/s - 3.54x slower
1099
+ rdiscount: 6,201 i/s - 4.46x slower
1100
+ kramdown: 367 i/s - 75.30x slower
1101
+ ```
1102
+
1103
+ **`large-4k` (3.7 KB)**
1104
+ ```
1105
+ inkmark: 64,051 i/s
1106
+ redcarpet: 58,420 i/s - 1.10x slower
1107
+ markly: 22,500 i/s - 2.85x slower
1108
+ commonmarker: 18,053 i/s - 3.55x slower
1109
+ rdiscount: 13,839 i/s - 4.63x slower
1110
+ kramdown: 624 i/s - 102.64x slower
1111
+ ```
1112
+
1113
+ **`medium-1k` (1.0 KB)**
1114
+ ```
1115
+ redcarpet: 216,968 i/s
1116
+ inkmark: 213,478 i/s - 1.02x slower
1117
+ markly: 70,251 i/s - 3.09x slower
1118
+ commonmarker: 46,357 i/s - 4.68x slower
1119
+ rdiscount: 45,880 i/s - 4.73x slower
1120
+ kramdown: 2,813 i/s - 77.13x slower
1121
+ ```
1122
+
1123
+ **`small-512b` (0.5 KB)**
1124
+ ```
1125
+ inkmark: 388,266 i/s
1126
+ redcarpet: 368,401 i/s - 1.05x slower
1127
+ rdiscount: 74,032 i/s - 5.24x slower
1128
+ markly: 61,175 i/s - 6.35x slower
1129
+ commonmarker: 46,658 i/s - 8.32x slower
1130
+ kramdown: 3,952 i/s - 98.25x slower
1131
+ ```
1132
+
1133
+ **`tiny-256b` (0.3 KB)**
1134
+ ```
1135
+ redcarpet: 535,972 i/s
1136
+ inkmark: 511,019 i/s - 1.05x slower
1137
+ rdiscount: 99,001 i/s - 5.41x slower
1138
+ markly: 96,159 i/s - 5.57x slower
1139
+ commonmarker: 57,704 i/s - 9.29x slower
1140
+ kramdown: 4,117 i/s - 130.18x slower
1141
+ ```
1142
+
1143
+ ## Contributing
1144
+
1145
+ Bug reports and pull requests are welcome on GitHub at
1146
+ https://github.com/yaroslav/inkmark.
1147
+
1148
+ ## Acknowledgements
1149
+
1150
+ Inkmark is built with:
1151
+
1152
+ [pulldown-cmark](https://github.com/pulldown-cmark/pulldown-cmark) by Raph Levien, Marcus Klaas de Vries, Martín Pozo, Michael Howell, Roope Salmi and Martin Geisler;
1153
+
1154
+ [Magnus](https://github.com/matsadler/magnus) by Matthew Sadler;
1155
+
1156
+ [syntect](https://github.com/trishume/syntect) by Tristan Hume, Keith Hall, Google Inc and other contributors;
1157
+
1158
+ And other Rust crates—thanks to their authors.
1159
+
1160
+ Thanks to Julik Tarkhanov for short but useful brainstorming sessions.
1161
+
1162
+ ## License
1163
+
1164
+ The gem is available as open source under the terms of the
1165
+ [MIT License](LICENSE.txt). Third-party content (benchmark assets, CommonMark
1166
+ spec) is attributed in `NOTICE` and `benchmarks/NOTICE`.