rpdfium 0.4.1

data/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ Apache License
2
+ Version 2.0, January 2004
3
+ http://www.apache.org/licenses/
4
+
5
+ Licensed under the Apache License, Version 2.0 (the "License");
6
+ you may not use this file except in compliance with the License.
7
+ You may obtain a copy of the License at
8
+
9
+ http://www.apache.org/licenses/LICENSE-2.0
10
+
11
+ Unless required by applicable law or agreed to in writing, software
12
+ distributed under the License is distributed on an "AS IS" BASIS,
13
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ See the License for the specific language governing permissions and
15
+ limitations under the License.
16
+
17
+ Copyright 2026 The rpdfium contributors
18
+
19
+ Full license text: https://www.apache.org/licenses/LICENSE-2.0.txt
data/README.md ADDED
@@ -0,0 +1,599 @@
1
+ # rpdfium
2
+
3
+ Ruby bindings for [PDFium](https://pdfium.googlesource.com/pdfium/), the
4
+ PDF engine that powers Chrome's viewer. Provides text extraction with
5
+ character-level metadata, vector path access, image extraction, form
6
+ fields, page rendering, and pdfplumber-style table detection.
7
+
8
+ Inspired by [`pypdfium2`](https://github.com/pypdfium2-team/pypdfium2)
9
+ (bindings layout) and [`pdfplumber`](https://github.com/jsvine/pdfplumber)
10
+ (table heuristics).
11
+
12
+ ```ruby
13
+ require "rpdfium"
14
+
15
+ Rpdfium.open("invoice.pdf") do |doc|
16
+ puts doc.metadata[:title]
17
+
18
+ doc.each do |page|
19
+ puts page.text
20
+ Rpdfium::Table::Extractor.new(page).extract.each do |table|
21
+ table.each { |row| puts row.inspect }
22
+ end
23
+ end
24
+ end
25
+ ```
26
+
27
+ ## Why
28
+
29
+ The Ruby ecosystem has `pdf-reader` (text only, slow on complex docs),
30
+ `origami` (security-research focused), and `hexapdf` (great for
31
+ manipulation but text extraction is approximate). None give you
32
+ character-level bounding boxes, real vector path geometry, or table
33
+ extraction. `rpdfium` fills that gap by binding the same battle-tested
34
+ C++ engine that powers Chrome's PDF viewer.
35
+
36
+ In practice it matches the speed of Python's `pypdfium2` on text
37
+ extraction and is **15-56× faster than `pdfplumber`** while using
38
+ **5-7× less memory** on large documents. See [Performance](#performance)
39
+ for details.
40
+
41
+ ## Installing PDFium
42
+
43
+ `rpdfium` itself ships only Ruby code. The native library is loaded
44
+ from one of, in order:
45
+
46
+ - `ENV["PDFIUM_LIBRARY_PATH"]` (highest priority — point to a
47
+ `libpdfium.{so,dylib,dll}` of your choice)
48
+ - the [`rpdfium-binary`](https://github.com/retsef/rpdfium-binary)
49
+ companion gem (recommended), which ships precompiled PDFium binaries
50
+ for major platforms via [bblanchon/pdfium-binaries](https://github.com/bblanchon/pdfium-binaries)
51
+ - the system `libpdfium` (if installed via your package manager)
52
+
53
+ ### Recommended: use `rpdfium-binary`
54
+
55
+ ```bash
56
+ gem install rpdfium-binary
57
+ ```
58
+
59
+ RubyGems picks the right platform-specific gem automatically. Supported
60
+ platforms include `x86_64-linux`, `aarch64-linux`, `x86_64-linux-musl`,
61
+ `aarch64-linux-musl`, `arm64-darwin`, `x86_64-darwin`, `x64-mingw-ucrt`,
62
+ `x86-mingw32`, `aarch64-mingw-ucrt`. For unsupported platforms the
63
+ generic Ruby-platform gem is installed and the binary is downloaded on
64
+ first use into the user data directory.
65
+
66
+ Add to your `Gemfile`:
67
+
68
+ ```ruby
69
+ gem "rpdfium"
70
+ gem "rpdfium-binary"
71
+ ```
72
+
73
+ ### Alternative: manual `PDFIUM_LIBRARY_PATH`
74
+
75
+ Useful in containers, CI, or when you need a specific PDFium build:
76
+
77
+ ```bash
78
+ # macOS arm64
79
+ curl -L https://github.com/bblanchon/pdfium-binaries/releases/latest/download/pdfium-mac-arm64.tgz | tar xz
80
+ export PDFIUM_LIBRARY_PATH=$PWD/lib/libpdfium.dylib
81
+ ```
82
+
83
+ ## Architecture
84
+
85
+ Three layers, mirroring `pypdfium2`:
86
+
87
+ 1. **`Rpdfium::Raw`** — pure FFI bindings, 1:1 with the C API
88
+ (`FPDF_*`, `FPDFText_*`, `FPDFBitmap_*`, `FPDFPath_*`,
89
+ `FPDFImageObj_*`, `FPDFAnnot_*`). Use directly if you need something
90
+ the wrappers don't expose.
91
+ 2. **`Rpdfium::Document, ::Page, ::TextPage, ::Image::Embedded,
92
+ ::Annotation, ::Form::Field, ::Search, ::Outline, ::Attachment`** —
93
+ RAII-style wrappers with `ObjectSpace.define_finalizer` so handles
94
+ are released even if you forget `close`.
95
+ 3. **`Rpdfium::Table::Extractor`** — table detection on top of layer 2,
96
+ with `Rpdfium::Table::Debugger` for visual debugging.
97
+
98
+ ## What you can do
99
+
100
+ ### Text
101
+
102
+ ```ruby
103
+ page.text # plain string
104
+ page.text_in_bbox(left: 50, top: 100, right: 300, bottom: 150)
105
+ ```
106
+
107
+ ### Character-level metadata
108
+
109
+ Per-char data essential for layout-aware processing — bounding box,
110
+ font, weight, origin, rotation angle, plus PDFium's "character
111
+ provenance" flags:
112
+
113
+ ```ruby
114
+ page.chars.first
115
+ # {
116
+ # char: "T",
117
+ # codepoint: 84,
118
+ # x0: 72.0, x1: 79.2, top: 100.5, bottom: 112.3,
119
+ # origin_x: 72.0, origin_y: 110.8,
120
+ # angle: 0.0, # radians (rotated text)
121
+ # fontsize: 12.0,
122
+ # font: "Helvetica-Bold",
123
+ # weight: 700,
124
+ # render_mode: 0, # 0=fill 1=stroke 2=both 3=invisible
125
+ # generated: false, # true → inserted by PDFium (e.g. spaces)
126
+ # hyphen: false, # true → soft-hyphen for line break
127
+ # unicode_error: false # true → couldn't map glyph to unicode
128
+ # }
129
+ ```
130
+
131
+ `generated`/`hyphen`/`unicode_error` are the **artefact recognition
132
+ flags** — distinguishing real characters from PDFium-synthesized ones is
133
+ crucial when you don't want fake whitespace to widen a column.
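As a rough illustration of why these flags matter, the sketch below filters a hand-written array of char hashes shaped like the output above. The sample data and the helper names `real_chars` and `x_extent` are hypothetical and not part of rpdfium; the point is only that dropping synthesized or undecodable chars changes the measured horizontal extent.

```ruby
# Hypothetical sample shaped like the hashes shown above; not real PDFium output.
chars = [
  { char: "A", x0: 72.0, x1: 79.0, generated: false, hyphen: false, unicode_error: false },
  { char: " ", x0: 79.0, x1: 95.0, generated: true, hyphen: false, unicode_error: false },
  { char: "B", x0: 95.0, x1: 102.0, generated: false, hyphen: false, unicode_error: false },
  { char: "\uFFFD", x0: 102.0, x1: 109.0, generated: false, hyphen: false, unicode_error: true }
]

# Keep only characters that are present in the PDF and decodable
# (hypothetical helper, not an rpdfium method).
def real_chars(chars)
  chars.reject { |c| c[:generated] || c[:unicode_error] }
end

# Horizontal extent of a set of chars, e.g. to size a column.
def x_extent(chars)
  [chars.map { |c| c[:x0] }.min, chars.map { |c| c[:x1] }.max]
end

kept = real_chars(chars)
puts kept.map { |c| c[:char] }.join # prints "AB"
puts x_extent(kept).inspect         # prints [72.0, 102.0]
```

Whether to also drop `hyphen` chars depends on the use case: for column sizing they are usually harmless, for text reconstruction they may need special handling.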
134
+
135
+ Loose char boxes (proportional to font size, more stable for layout
136
+ algorithms):
137
+
138
+ ```ruby
139
+ page.chars(loose: true)
140
+ ```
141
+
142
+ Cluster chars into words automatically:
143
+
144
+ ```ruby
145
+ page.words(x_tolerance: 3.0, y_tolerance: 3.0)
146
+ # [{ text: "Invoice", x0: 72.0, x1: 110.5, top: 100.5, bottom: 112.3,
147
+ # fontsize: 12.0, font: "Helvetica-Bold", chars: [...] }, ...]
148
+ ```
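To make the role of the two tolerances concrete, here is a minimal sketch of gap-based word clustering. It is not rpdfium's implementation: `cluster_words` is a hypothetical helper, and it assumes the chars arrive in reading order and use the top-down coordinates shown above.

```ruby
# Group chars into words. A new word starts at whitespace, when the horizontal
# gap to the previous char exceeds x_tolerance, or when the top edge moves by
# more than y_tolerance. Assumes chars are already in reading order.
def cluster_words(chars, x_tolerance: 3.0, y_tolerance: 3.0)
  groups = []
  current = []

  chars.each do |c|
    if c[:char].strip.empty?
      groups << current unless current.empty?
      current = []
      next
    end

    prev = current.last
    if prev && ((c[:x0] - prev[:x1]) > x_tolerance ||
                (c[:top] - prev[:top]).abs > y_tolerance)
      groups << current
      current = []
    end
    current << c
  end
  groups << current unless current.empty?

  # Collapse each group into a word hash with a merged bounding box.
  groups.map do |cs|
    {
      text: cs.map { |c| c[:char] }.join,
      x0: cs.map { |c| c[:x0] }.min,
      x1: cs.map { |c| c[:x1] }.max,
      top: cs.map { |c| c[:top] }.min,
      bottom: cs.map { |c| c[:bottom] }.max
    }
  end
end
```

A larger `x_tolerance` merges loosely tracked text into one word; a smaller one splits letter-spaced headings into single characters.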
149
+
150
+ ### Vector paths
151
+
152
+ Real path-segment iteration (not just bounding boxes), with a state
152
+ machine that handles `closepath`. Useful for table line detection, signatures,
154
+ form layout analysis:
155
+
156
+ ```ruby
157
+ page.line_segments
158
+ # [{ x0: 72.0, y0: 100.0, x1: 540.0, y1: 100.0, stroke_width: 0.5 }, ...]
159
+
160
+ page.horizontal_lines
161
+ page.vertical_lines
162
+ ```
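The README does not spell out how segments are classified as horizontal or vertical, so the following is only a sketch under an assumed rule: a segment counts as horizontal when its two `y` values agree within a small tolerance. The predicates `horizontal?` and `vertical?` and the `tolerance` parameter are hypothetical names.

```ruby
# Classify segment hashes shaped like the `line_segments` output above.
# The tolerance absorbs tiny floating-point skew in hand-drawn rulings.
def horizontal?(seg, tolerance: 0.5)
  (seg[:y0] - seg[:y1]).abs <= tolerance && (seg[:x0] - seg[:x1]).abs > tolerance
end

def vertical?(seg, tolerance: 0.5)
  (seg[:x0] - seg[:x1]).abs <= tolerance && (seg[:y0] - seg[:y1]).abs > tolerance
end

# Hypothetical sample data.
segments = [
  { x0: 72.0, y0: 100.0, x1: 540.0, y1: 100.0, stroke_width: 0.5 }, # ruling line
  { x0: 72.0, y0: 100.0, x1: 72.0, y1: 300.0, stroke_width: 0.5 },  # column border
  { x0: 10.0, y0: 10.0, x1: 50.0, y1: 60.0, stroke_width: 1.0 }     # diagonal
]

puts segments.count { |s| horizontal?(s) } # prints 1
puts segments.count { |s| vertical?(s) }   # prints 1
```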
163
+
164
+ ### Images
165
+
166
+ ```ruby
167
+ page.images.each do |img|
168
+ meta = img.metadata
169
+ puts "#{meta[:width]}×#{meta[:height]} @ #{meta[:horizontal_dpi]} DPI, " \
170
+ "#{meta[:colorspace]}"
171
+ puts "filters: #{img.filters}" # e.g. ["DCTDecode"] for JPEG
172
+
173
+ # JPEG passthrough when filters == ["DCTDecode"]; otherwise rendered to PNG
174
+ img.save("img_#{img.bbox[:x0].to_i}.jpg")
175
+
176
+ # Or get raw/decoded bytes for custom processing
177
+ img.raw_bytes # as stored
178
+ img.decoded_bytes # post-filters (raster)
179
+ end
180
+ ```
181
+
182
+ ### Annotations & links
183
+
184
+ ```ruby
185
+ page.annotations.each do |a|
186
+ puts "#{a.subtype}: #{a[:Contents]} at #{a.bbox.inspect}"
187
+ end
188
+
189
+ page.links.each do |link|
190
+ puts link.link_uri || "→ page #{link.link_dest_page}"
191
+ end
192
+ ```
193
+
194
+ ### Forms (read-only)
195
+
196
+ ```ruby
197
+ doc = Rpdfium.open("form.pdf")
198
+ puts doc.form_type # :acroform / :xfa_full / :xfa_foreground / :none
199
+
200
+ doc.each do |page|
201
+ page.form_fields.each do |f|
202
+ pp f.to_h
203
+ # { name: "name", type: :textfield, value: "Mario Rossi",
204
+ # readonly: false, required: true, bbox: {...} }
205
+ end
206
+ end
207
+ ```
208
+
209
+ ### Outline (bookmarks) & attachments
210
+
211
+ ```ruby
212
+ Rpdfium::Outline.flatten(doc.outline) do |item, depth|
213
+ puts "#{" " * depth}- #{item.title} (page #{item.page_index})"
214
+ end
215
+
216
+ doc.attachments.each { |a| a.save("attached_#{a.name}") }
217
+ ```
218
+
219
+ ### Search
220
+
221
+ ```ruby
222
+ page.search("totale", match_case: false).each_match do |m|
223
+ puts "found '#{m[:text]}' at #{m[:rects].first.inspect}"
224
+ end
225
+ ```
226
+
227
+ ### Rendering
228
+
229
+ ```ruby
230
+ # Pure-Ruby PNG writer, zero deps:
231
+ page.render_to_png("page.png", scale: 2.0, include_annotations: true,
232
+ include_forms: true)
233
+
234
+ # Or get raw RGBA/BGRA/Gray bytes:
235
+ w, h, bytes, stride = page.render(scale: 2.0, output: :rgba)
236
+ ```
237
+
238
+ ### Tables
239
+
240
+ `pdfplumber`-style settings — every parameter you'd recognize:
241
+
242
+ ```ruby
243
+ extractor = Rpdfium::Table::Extractor.new(page,
244
+ vertical_strategy: :lines, # :lines / :lines_strict / :text / :explicit
245
+ horizontal_strategy: :lines,
246
+ snap_tolerance: 3.0,
247
+ join_tolerance: 3.0,
248
+ edge_min_length: 3.0,
249
+ edge_min_length_prefilter: 1.0,
250
+ intersection_tolerance: 3.0,
251
+ min_words_vertical: 3,
252
+ min_words_horizontal: 1,
253
+ text_x_tolerance: 3.0,
254
+ text_y_tolerance: 3.0,
255
+ explicit_vertical_lines: [], # [Float] x-coords or [Hash{x:, top:, bottom:}]
256
+ explicit_horizontal_lines: [],
257
+ auto_fallback: true # try :text if :lines finds nothing
258
+ )
259
+
260
+ extractor.tables.each do |table|
261
+ table.bbox # => [x0, top, x1, bottom]
262
+ table.rows # => Array<Array<bbox|nil>>
263
+ table.columns # => Array<Array<bbox|nil>>
264
+ table.extract # => Array<Array<String>>
265
+ end
266
+
267
+ extractor.extract # shortcut: => [[[String, ...], ...], ...] (list of tables)
268
+ extractor.edges # post-snap/join edges
269
+ extractor.intersections # Hash{[x,y] => {v:[edges], h:[edges]}}
270
+ extractor.cells # Array<bbox>
271
+ ```
272
+
273
+ The pipeline mirrors `pdfplumber.TableFinder` 1:1 and uses the same
274
+ algorithms for words-to-edges, intersections-to-cells, cells-to-tables.
275
+
276
+ Visual debugger (saves PNG with overlay: red lines, green intersections,
277
+ blue table fills):
278
+
279
+ ```ruby
280
+ Rpdfium::Table::Debugger.visualize(page, "debug.png",
281
+ vertical_strategy: :lines)
282
+ ```
283
+
284
+ ### Form-aware extraction (font filtering)
285
+
286
+ Some PDFs are "filled-out forms" — F24, tax declarations, payment
287
+ slips, government forms — where the form template and the entered
288
+ data both exist as static graphics text on the page (no AcroForm
289
+ fields, no tagged structure). On these PDFs the table pipeline picks
290
+ up the template labels as noise alongside the data.
291
+
292
+ The robust strategy is to separate chars by **role** using their
293
+ font: the template typically uses proportional fonts (Futura, Times,
294
+ Helvetica) while the data layer uses a single font (often Courier
295
+ monospace, or Helvetica at a specific size).
296
+
297
+ ```ruby
298
+ Rpdfium.open("f24.pdf") do |doc|
299
+ page = doc.page(0)
300
+
301
+ # Discover what fonts are on the page
302
+ page.font_inventory.first(5).each do |g|
303
+ puts "#{g[:font].ljust(20)} h=#{g[:height]} | #{g[:count]} chars | #{g[:sample][0,40]}"
304
+ end
305
+ # Futura-Light h=8.3 | 946 chars | "cognome, denominazione o ragione sociale"
306
+ # Courier h=10.5 | 365 chars | "01234567890Azienda S.R.L.P"
307
+ # Futura-Bold h=10.4 | 249 chars | "CODICE FISCALEDATI ANAGRAFICI..."
308
+ # ...
309
+
310
+ # Extract just the entered data, line by line
311
+ page.lines(font: "Courier").each { |l| puts l }
312
+ # => "Soggetto: Azienda S.R.L. ( 01234567890 )"
313
+ # => "1001 11 2021 499,81 0,00"
314
+ # => "1712 12 2021 32,46 0,00"
315
+ # => "1701 11 2021 0,00 295,89"
316
+ # => "532,27 295,89 236,38"
317
+ # => ...
318
+ end
319
+ ```
320
+
321
+ Three primitives:
322
+
323
+ - `Page#font_inventory` — distribution by `(font, height, weight)`,
324
+ with counts and samples for inspection
325
+ - `Page#chars_where(font:, height:, weight:, bbox:, where:)` —
326
+ filter chars by any combination of criteria
327
+ - `Page#lines(font:, ...)` — high-level helper: filter + word
328
+ extraction + line clustering, returns `Array<String>`
329
+
330
+ Works on F24 payment forms, VAT periodic communications, withholding
331
+ tax declarations, and similar government forms — anywhere the data
332
+ sits on a printed template as text.
333
+
334
+ #### Label-value pairing
335
+
336
+ `Page#label_value_pairs` associates each extracted value with the
337
+ semantic label from the template that describes it. Useful when you
338
+ want machine-readable `field_name → field_value` pairs without
339
+ hard-coding the form layout.
340
+
341
+ ```ruby
342
+ Rpdfium.open("f24.pdf") do |doc|
343
+ pairs = doc.page(0).label_value_pairs(
344
+ data_font: "Courier",
345
+ template_font: /^Futura/,
346
+ data_filter: ->(t) { t.match?(/^[\d.,]+$/) }
347
+ )
348
+ pairs.each do |p|
349
+ col = p[:labels][:col]
350
+ row = p[:labels][:row]
351
+ puts "#{p[:value].ljust(12)} → col: #{col}, row: #{row}"
352
+ end
353
+ end
354
+ # 499,81 → col: "importi a debito versati"
355
+ # 1.615,90 → col: "SALDO (M-N) +/–", row: "EURO +" ← saldo finale
356
+ ```
357
+
358
+ The algorithm clusters template words into coherent labels, then for
359
+ each value finds the `:col` label (positioned above) and the `:row`
360
+ label (positioned to the left).
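The lookup step described above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `pair_labels` and `overlap?` are hypothetical helpers, labels are assumed to be already clustered into hashes with `text`, `x0`, `x1`, `top`, `bottom`, and "nearest" is taken to mean the closest label whose extent overlaps the value on the other axis.

```ruby
# Do the closed intervals [a0, a1] and [b0, b1] overlap?
def overlap?(a0, a1, b0, b1)
  a0 <= b1 && b0 <= a1
end

# For one value, pick the nearest label above it (:col) and the nearest
# label to its left (:row). Coordinates are top-down: top < bottom.
def pair_labels(value, labels)
  above = labels.select do |l|
    l[:bottom] <= value[:top] && overlap?(l[:x0], l[:x1], value[:x0], value[:x1])
  end
  left = labels.select do |l|
    l[:x1] <= value[:x0] && overlap?(l[:top], l[:bottom], value[:top], value[:bottom])
  end

  col = above.max_by { |l| l[:bottom] } # lowest label that is still above
  row = left.max_by { |l| l[:x1] }      # rightmost label that is still to the left

  { value: value[:text], labels: { col: col && col[:text], row: row && row[:text] } }
end
```

Either label may come back as `nil` when nothing qualifies, which matches the first sample line above where only `col` is printed.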
361
+
362
+ #### Composable primitives for complex forms
363
+
364
+ For complex forms with repeating tables, boxed-layout cells, or
365
+ multi-word values, compose three primitives:
366
+
367
+ **`Util::WordMerger`** — join adjacent words on the same line:
368
+
369
+ ```ruby
370
+ merger = Rpdfium::Util::WordMerger.new(x_gap: 20.0, y_tol: 3.0)
371
+ merged = merger.merge_by_proximity(words)
372
+ # or, with labels mapping to preserve checkbox grids:
373
+ merged = merger.merge_by_label(words, label_per_word)
374
+ # or, only merge orphans (no label assigned):
375
+ merged = merger.merge_unlabeled(words, label_per_word)
376
+ ```
377
+
378
+ **`Util::ColumnInference`** — identify data columns by alignment:
379
+
380
+ ```ruby
381
+ inference = Rpdfium::Util::ColumnInference.new(
382
+ x_tolerance: 3.0,
383
+ min_size: 3,
384
+ cv_threshold: 0.15
385
+ )
386
+ columns = inference.infer(words)
387
+ # => [[word1, word2, ..., word12], ...]
388
+ ```
389
+
390
+ Algorithm: cluster by `x0` (left-align) AND `x1` (right-align), split
391
+ columns at large vertical gaps, filter by gap-regularity (coefficient
392
+ of variation < 0.15) to exclude false positives.
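The gap-regularity filter is the least obvious step, so here is a minimal sketch of it. It assumes the coefficient of variation is computed as population standard deviation divided by mean over the vertical gaps between consecutive words; the README does not say which variant is used, and `coefficient_of_variation` and `regular_column?` are hypothetical helper names.

```ruby
# Coefficient of variation: standard deviation divided by mean.
# Returns infinity for empty input or a zero mean so such columns are rejected.
def coefficient_of_variation(values)
  return Float::INFINITY if values.empty?

  mean = values.sum / values.size.to_f
  return Float::INFINITY if mean.zero?

  variance = values.sum { |v| (v - mean)**2 } / values.size.to_f
  Math.sqrt(variance) / mean
end

# A candidate column is kept when it has enough words and its rows are
# evenly spaced, i.e. the gaps between consecutive tops vary little.
def regular_column?(words, cv_threshold: 0.15, min_size: 3)
  return false if words.size < min_size

  tops = words.map { |w| w[:top] }.sort
  gaps = tops.each_cons(2).map { |a, b| b - a }
  coefficient_of_variation(gaps) < cv_threshold
end
```

Evenly spaced table rows give a value near 0, while words that merely happen to share an `x0` across unrelated paragraphs tend to have irregular gaps and are filtered out.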
393
+
394
+ **`Util::LabelMatcher`** with column inference enables header
395
+ propagation for repeating tables (e.g. 770 Quadro ST with rows
396
+ ST2..ST13 sharing column headers printed once at the top):
397
+
398
+ ```ruby
399
+ matcher = Rpdfium::Util::LabelMatcher.new(
400
+ column_inference: Rpdfium::Util::ColumnInference.new
401
+ )
402
+ pairs = page.label_value_pairs(data_font: "Courier", matcher: matcher)
403
+ ```
404
+
405
+ For boxed-layout forms (cells separated by ~10pt with template
406
+ graphics for decimals), pass `inject_spaces: false, x_tolerance: 15.0`
407
+ to `label_value_pairs` and `row_max_dx: 400.0` to the matcher.
408
+
409
+ See `examples/adapters/` for complete working adapters that compose
410
+ these primitives for specific Italian tax forms (Modello 770,
411
+ Comunicazione IVA).
412
+
413
+ ### Struct tree (Tagged PDF)
414
+
415
+ For tagged PDFs (PDF/UA, accessibility-friendly exports from
416
+ Word/LibreOffice/InDesign), `Page#struct_tree` exposes the document's
417
+ logical structure (Document → P, H1, Table, TR, TH, TD, Figure, ...)
418
+ independently of the visual layout. This gives **zero-geometry**
419
+ extraction with semantic typing (TH vs TD, RowSpan, ColSpan, Lang).
420
+
421
+ ```ruby
422
+ page.struct_tree do |tree|
423
+ next if tree.nil? || tree.empty?
424
+
425
+ tree.tables.each do |table|
426
+ rows = table.children.select { |c| c.type == "TR" }
427
+ rows.each do |row|
428
+ cells = row.children.select { |c| %w[TH TD].include?(c.type) }
429
+ puts cells.map(&:text).map(&:strip).inspect
430
+ end
431
+ end
432
+ end
433
+ # => ["Region", "Revenue", "Growth"] (TH)
434
+ # => ["Italy", "1.250.000", "+12%"] (TD)
435
+ # => ...
436
+ ```
437
+
438
+ API summary:
439
+
440
+ ```ruby
441
+ tree = page.struct_tree # → Tree or nil (nil if not tagged)
442
+ tree.empty? # true for "tagged but placeholder" PDFs
443
+ tree.roots # → [Element, ...]
444
+ tree.walk { |el| ... } # depth-first
445
+ tree.find_all(type: "P")
446
+ tree.tables # → [Element, ...] where type == "Table"
447
+
448
+ element.type # "P", "Table", "TR", "TD", ...
449
+ element.children # → [Element, ...]
450
+ element.parent # → Element or nil
451
+ element.text # text via MCID + ActualText override
452
+ element.actual_text # /ActualText (for ligature/math resolution)
453
+ element.alt_text # /Alt (Figure / Formula)
454
+ element.lang # "it-IT", "en-US", ...
455
+ element.marked_content_ids # → [Integer]
456
+ element.attributes # → { name => value }
457
+ ```
458
+
459
+ Three possible states of `page.struct_tree`:
460
+
461
+ | PDF type | returns |
462
+ | --- | --- |
463
+ | Not tagged (most PDFs from line-of-business software, scanned PDFs) | `nil` |
464
+ | Tagged but empty (some bank statements have placeholder StructTreeRoot) | `Tree` with `empty? == true` |
465
+ | Properly tagged (Word/LibreOffice/InDesign export with accessibility tags) | Navigable `Tree` |
466
+
467
+ Lifecycle: prefer the block form for deterministic close. The implicit
468
+ form (no block) leaves cleanup to `FPDF_CloseDocument`: nothing leaks, but
469
+ the tree stays in memory until the document is closed.
470
+
471
+ ## Performance
472
+
473
+ Measured on 4 PDFs of increasing complexity, best-of-3 runs after a
474
+ warm-up, isolated in subprocesses to capture clean peak RSS. Versions
475
+ under test: `rpdfium 0.3.13`, `pdfplumber 0.11.9`, `pypdfium2 5.6.0`.
476
+
477
+ | Test corpus | Pages | Size | What it stresses |
478
+ | --- | ---: | ---: | --- |
479
+ | `sample.pdf` | 1 | 18 KB | Plain text baseline |
480
+ | `form.pdf` | 1 | 107 KB | Char-per-text-object kerning, Form XObject, tables |
481
+ | `complex.pdf` | 85 | 60 MB | Magazine-style document, dense text + heavy graphics |
482
+ | `report.pdf` | 226 | 322 KB | Rotated pages (90°), small fonts, ~15 tables per page |
483
+
484
+ ### Speed
485
+
486
+ | Corpus | Task | rpdfium | pypdfium2 | pdfplumber | speedup vs pdfplumber |
487
+ | --- | --- | ---: | ---: | ---: | ---: |
488
+ | sample.pdf (1 page) | text | 4 ms | 4 ms | 75 ms | **21×** |
489
+ | sample.pdf (1 page) | tables | 4 ms | n/a | 70 ms | **16×** |
490
+ | form.pdf (1 page) | text | 12 ms | 13 ms | 538 ms | **44×** |
491
+ | form.pdf (1 page) | tables | 25 ms | n/a | 575 ms | **23×** |
492
+ | complex.pdf (85 pages) | text | 190 ms | 183 ms | 7.76 s | **41×** |
493
+ | complex.pdf (85 pages) | tables | 231 ms | n/a | 7.07 s | **31×** |
494
+ | report.pdf (226 pages) | text | 412 ms | 397 ms | 23.26 s | **56×** |
495
+ | report.pdf (226 pages) | tables | 1.68 s | n/a | 25.25 s | **15×** |
496
+
497
+ `pypdfium2` does not implement table extraction (it's a raw FFI binding
498
+ to PDFium, not a full pipeline). It's listed as the "pure PDFium speed
499
+ floor" for text; rpdfium matches it within ±5%, which suggests that the Ruby
500
+ FFI overhead is negligible for this workload.
501
+
502
+ ### Memory (peak RSS)
503
+
504
+ | Corpus | rpdfium | pypdfium2 | pdfplumber | pdfplumber/rpdfium |
505
+ | --- | ---: | ---: | ---: | ---: |
506
+ | sample.pdf | 29 MB | 20 MB | 40 MB | 1.4× |
507
+ | form.pdf | 32 MB | 22 MB | 45 MB | 1.4× |
508
+ | complex.pdf | 106 MB | 69 MB | 535 MB | **5.0×** |
509
+ | report.pdf | 136 MB | 41 MB | 1003 MB | **7.4×** |
510
+
511
+ The memory gap widens with workload size. On a 226-page document
512
+ pdfplumber uses ~1 GB; rpdfium stays under 140 MB. For server-side
513
+ batch processing this is the difference between a 256 MB container and
514
+ a 2 GB one.
515
+
516
+ ### Headline numbers
517
+
518
+ On large PDFs (226 pages, dense layout):
519
+
520
+ - **rpdfium completes both text + tables in ~2.1 s using 136 MB**
521
+ - **pdfplumber needs ~48 s and 1 GB** for the same work
522
+
523
+ Across the four corpora the median speedup vs pdfplumber is about **42× on
524
+ text** and **20× on tables**. rpdfium scales roughly linearly with page count
525
+ (thanks to PDFium's C++ engine); pdfplumber's pure-Python pipeline
526
+ degrades super-linearly on large documents.
527
+
528
+ ### Methodology
529
+
530
+ Each measurement is the **minimum of 3 timed runs after a warm-up run**
531
+ (to neutralize OS page cache effects on the 60 MB `complex.pdf`).
532
+ Subprocess isolation per measurement ensures clean RSS reading via
533
+ `resource.getrusage` / `/proc/self/status`. The benchmark harness is
534
+ a small Ruby driver that shells out to three runners (one Ruby script
535
+ using `rpdfium`, two Python scripts using `pdfplumber` and
536
+ `pypdfium2`), parses the JSON each emits, and aggregates the results.
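The timing rule (minimum of 3 runs after one warm-up) can be sketched as follows. This is not the actual benchmark harness, which is not included here; `best_of` is a hypothetical helper, and it leaves out the subprocess isolation and RSS measurement described above.

```ruby
# Run the block `warmup` times untimed, then `runs` times timed,
# and return the fastest wall-clock duration in seconds.
def best_of(runs: 3, warmup: 1)
  warmup.times { yield }

  timings = Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
  timings.min
end

# Usage with a stand-in workload; a real runner would extract text here.
seconds = best_of { (1..10_000).sum }
puts format("best of 3: %.3f ms", seconds * 1000)
```

Taking the minimum rather than the mean discards runs disturbed by other processes, which is why it is a common choice for short benchmarks.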
537
+
538
+ Output quality has been spot-checked: rpdfium matches pypdfium2 char
539
+ count within ±1 char (rounding on the trailing newline). pdfplumber
540
+ returns ~2% fewer chars on locale-formatted numbers due to a different
541
+ word-tokenization for thousand-separator punctuation (e.g. `1.250.000`
542
+ split on periods).
543
+
544
+ ## Memory safety
545
+
546
+ - `FPDF_LoadMemDocument64` does **not** copy the input bytes. The
547
+ `Document` wrapper holds an FFI buffer reference for its lifetime so
548
+ the GC can't free it early.
549
+ - Every PDFium handle (`*_Close*`) is wired to
550
+ `ObjectSpace.define_finalizer` so abandoned objects don't leak native
551
+ memory.
552
+ - `FPDF_InitLibrary` is called once per process under `Mutex`;
553
+ `FPDF_DestroyLibrary` runs via `at_exit`.
554
+ - `Document#close` releases in cascade: form-fill env → cached pages →
555
+ document handle.
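For readers unfamiliar with the finalizer pattern mentioned above, here is a generic sketch of a wrapper that releases a native handle either on an explicit `close` or at garbage collection. `NativeHandle` and its release block are hypothetical stand-ins; rpdfium's real wrappers call PDFium close functions and may be structured differently.

```ruby
class NativeHandle
  # Build the finalizer in a class method so the proc does not capture `self`.
  # A finalizer that references its own object would keep it alive forever.
  def self.finalizer(state, release)
    proc do
      next if state[:closed]

      state[:closed] = true
      release.call(state[:ptr])
    end
  end

  # `ptr` stands in for an FFI pointer; the block is the release routine.
  def initialize(ptr, &release)
    @state = { ptr: ptr, closed: false }
    @finalizer = self.class.finalizer(@state, release)
    ObjectSpace.define_finalizer(self, @finalizer)
  end

  # Deterministic release. Safe to call more than once.
  def close
    ObjectSpace.undefine_finalizer(self)
    @finalizer.call
  end

  def closed?
    @state[:closed]
  end
end
```

The shared `state` hash is what makes `close` idempotent and prevents a double free if the finalizer ever runs after an explicit close.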
556
+
557
+ ## Roadmap
558
+
559
+ | Status | Feature |
560
+ |---|---|
561
+ | ✅ | Document open (path / IO / bytes / password) |
562
+ | ✅ | Document metadata, permissions, file version |
563
+ | ✅ | Page text + bbox-bounded text |
564
+ | ✅ | Per-character bounding boxes (tight & loose) |
565
+ | ✅ | Char metadata: font, weight, origin, angle, render mode |
566
+ | ✅ | PDFium-generated char detection (artefact filtering) |
567
+ | ✅ | Word clustering (layout-aware) |
568
+ | ✅ | Vector path segments (real geometry, not bbox) |
569
+ | ✅ | Image extraction (raw + decoded + rendered) |
570
+ | ✅ | Annotations + link URI/dest |
571
+ | ✅ | AcroForm field reading |
572
+ | ✅ | Bookmarks (outline) |
573
+ | ✅ | File attachments |
574
+ | ✅ | Internal text search |
575
+ | ✅ | Page rendering to RGBA/BGRA/Gray |
576
+ | ✅ | Pure-Ruby PNG writer (zero deps) |
577
+ | ✅ | Table extraction — `:lines` strategy |
578
+ | ✅ | Table extraction — `:text` strategy |
579
+ | ✅ | Table extraction — `:explicit` strategy |
580
+ | ✅ | Visual table debugger |
581
+ | ✅ | [`rpdfium-binary`](https://github.com/retsef/rpdfium-binary) companion gem with prebuilt PDFium |
582
+ | ✅ | Structure tree traversal (PDF tagged → semantic tables / `Page#struct_tree`) |
583
+ | ✅ | Form-aware extraction via font filtering (`Page#font_inventory`, `chars_where`, `lines`) |
584
+ | ✅ | Semantic label-value pairing on filled forms (`Page#label_value_pairs`, `Util::LabelMatcher`) |
585
+ | 🚧 | XFA form support |
586
+ | 🔮 | OCR fallback for scanned PDFs (via tesseract bindings) |
587
+ | 🔮 | Write APIs (we're read-only by design for now) |
588
+
589
+ ## Why not pure-Ruby?
590
+
591
+ A correct PDF text extractor needs to interpret the content stream
592
+ (operators, font encodings including CMap-based CIDs, ToUnicode maps,
593
+ ActualText overrides, marked content). PDFium has ~15 years of
594
+ edge-case fixes baked in. Reimplementing it in Ruby would take years
595
+ and still be slower. FFI is the right call.
596
+
597
+ ## License
598
+
599
+ Apache-2.0 (same as PDFium itself).