rpdfium 0.4.1
- checksums.yaml +7 -0
- data/CHANGELOG.md +1870 -0
- data/LICENSE +19 -0
- data/README.md +599 -0
- data/lib/rpdfium/annotation/annotation.rb +114 -0
- data/lib/rpdfium/document.rb +226 -0
- data/lib/rpdfium/errors.rb +55 -0
- data/lib/rpdfium/form/form.rb +121 -0
- data/lib/rpdfium/image/embedded.rb +145 -0
- data/lib/rpdfium/io/png.rb +65 -0
- data/lib/rpdfium/page.rb +1623 -0
- data/lib/rpdfium/raw.rb +982 -0
- data/lib/rpdfium/search/search.rb +101 -0
- data/lib/rpdfium/structure/attachment.rb +40 -0
- data/lib/rpdfium/structure/element.rb +330 -0
- data/lib/rpdfium/structure/outline.rb +48 -0
- data/lib/rpdfium/structure/tree.rb +202 -0
- data/lib/rpdfium/table/cells.rb +137 -0
- data/lib/rpdfium/table/debugger.rb +122 -0
- data/lib/rpdfium/table/edges.rb +225 -0
- data/lib/rpdfium/table/extractor.rb +246 -0
- data/lib/rpdfium/table/table.rb +184 -0
- data/lib/rpdfium/util/cluster.rb +143 -0
- data/lib/rpdfium/util/column_inference.rb +139 -0
- data/lib/rpdfium/util/label_matcher.rb +214 -0
- data/lib/rpdfium/util/text_extraction.rb +49 -0
- data/lib/rpdfium/util/word_extractor.rb +151 -0
- data/lib/rpdfium/util/word_merger.rb +102 -0
- data/lib/rpdfium/version.rb +5 -0
- data/lib/rpdfium.rb +92 -0
- metadata +134 -0
data/LICENSE
ADDED
|
@@ -0,0 +1,19 @@
|
|
|
1
|
+
Apache License
|
|
2
|
+
Version 2.0, January 2004
|
|
3
|
+
http://www.apache.org/licenses/
|
|
4
|
+
|
|
5
|
+
Licensed under the Apache License, Version 2.0 (the "License");
|
|
6
|
+
you may not use this file except in compliance with the License.
|
|
7
|
+
You may obtain a copy of the License at
|
|
8
|
+
|
|
9
|
+
http://www.apache.org/licenses/LICENSE-2.0
|
|
10
|
+
|
|
11
|
+
Unless required by applicable law or agreed to in writing, software
|
|
12
|
+
distributed under the License is distributed on an "AS IS" BASIS,
|
|
13
|
+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
14
|
+
See the License for the specific language governing permissions and
|
|
15
|
+
limitations under the License.
|
|
16
|
+
|
|
17
|
+
Copyright 2026 The rpdfium contributors
|
|
18
|
+
|
|
19
|
+
Full license text: https://www.apache.org/licenses/LICENSE-2.0.txt
|
data/README.md
ADDED
|
@@ -0,0 +1,599 @@
|
|
|
1
|
+
# rpdfium
|
|
2
|
+
|
|
3
|
+
Ruby bindings for [PDFium](https://pdfium.googlesource.com/pdfium/), the
|
|
4
|
+
PDF engine that powers Chrome's viewer. Provides text extraction with
|
|
5
|
+
character-level metadata, vector path access, image extraction, form
|
|
6
|
+
fields, page rendering, and pdfplumber-style table detection.
|
|
7
|
+
|
|
8
|
+
Inspired by [`pypdfium2`](https://github.com/pypdfium2-team/pypdfium2)
|
|
9
|
+
(bindings layout) and [`pdfplumber`](https://github.com/jsvine/pdfplumber)
|
|
10
|
+
(table heuristics).
|
|
11
|
+
|
|
12
|
+
```ruby
|
|
13
|
+
require "rpdfium"
|
|
14
|
+
|
|
15
|
+
Rpdfium.open("invoice.pdf") do |doc|
|
|
16
|
+
puts doc.metadata[:title]
|
|
17
|
+
|
|
18
|
+
doc.each do |page|
|
|
19
|
+
puts page.text
|
|
20
|
+
Rpdfium::Table::Extractor.new(page).extract.each do |table|
|
|
21
|
+
table.each { |row| puts row.inspect }
|
|
22
|
+
end
|
|
23
|
+
end
|
|
24
|
+
end
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
## Why
|
|
28
|
+
|
|
29
|
+
The Ruby ecosystem has `pdf-reader` (text only, slow on complex docs),
|
|
30
|
+
`origami` (security-research focused), and `hexapdf` (great for
|
|
31
|
+
manipulation but text extraction is approximate). None give you
|
|
32
|
+
character-level bounding boxes, real vector path geometry, or table
|
|
33
|
+
extraction. `rpdfium` fills that gap by binding the same battle-tested
|
|
34
|
+
C++ engine that powers Chrome's PDF viewer.
|
|
35
|
+
|
|
36
|
+
In practice it matches the speed of Python's `pypdfium2` on text
|
|
37
|
+
extraction and is **15-56× faster than `pdfplumber`** while using
|
|
38
|
+
**5-7× less memory** on large documents. See [Performance](#performance)
|
|
39
|
+
for details.
|
|
40
|
+
|
|
41
|
+
## Installing PDFium
|
|
42
|
+
|
|
43
|
+
`rpdfium` itself ships only Ruby code. The native library is loaded
|
|
44
|
+
from one of, in order:
|
|
45
|
+
|
|
46
|
+
- `ENV["PDFIUM_LIBRARY_PATH"]` (highest priority — point to a
|
|
47
|
+
`libpdfium.{so,dylib,dll}` of your choice)
|
|
48
|
+
- the [`rpdfium-binary`](https://github.com/retsef/rpdfium-binary)
|
|
49
|
+
companion gem (recommended), which ships precompiled PDFium binaries
|
|
50
|
+
for major platforms via [bblanchon/pdfium-binaries](https://github.com/bblanchon/pdfium-binaries)
|
|
51
|
+
- the system `libpdfium` (if installed via your package manager)
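The lookup order above can be sketched as a small pure function. This is an illustration only: `resolve_pdfium_path`, its keyword arguments, and the example paths are hypothetical and are not rpdfium's actual loader.

```ruby
# Hypothetical sketch of the lookup order described above; not rpdfium's real loader.
# Returns the first candidate path that is set and for which `exists` returns true.
def resolve_pdfium_path(env: ENV, binary_gem_path: nil, system_paths: [], exists: File.method(:exist?))
  candidates = [
    env["PDFIUM_LIBRARY_PATH"], # 1. explicit override
    binary_gem_path,            # 2. path supplied by a companion binary gem (assumed)
    *system_paths               # 3. system-wide locations (assumed)
  ]
  candidates.compact.find { |path| exists.call(path) }
end

# A fake filesystem keeps the example independent of the machine it runs on.
fake_fs = ["/opt/pdfium/libpdfium.so", "/usr/lib/libpdfium.so"]
exists  = ->(path) { fake_fs.include?(path) }

override = resolve_pdfium_path(
  env: { "PDFIUM_LIBRARY_PATH" => "/opt/pdfium/libpdfium.so" },
  system_paths: ["/usr/lib/libpdfium.so"],
  exists: exists
)
fallback = resolve_pdfium_path(env: {}, system_paths: ["/usr/lib/libpdfium.so"], exists: exists)

puts override
puts fallback
```

Passing `env` and `exists` as arguments is a design choice for the sketch: it makes the priority order testable without touching the real environment.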
|
|
52
|
+
|
|
53
|
+
### Recommended: use `rpdfium-binary`
|
|
54
|
+
|
|
55
|
+
```bash
|
|
56
|
+
gem install rpdfium-binary
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
RubyGems picks the right platform-specific gem automatically. Supported
|
|
60
|
+
platforms include `x86_64-linux`, `aarch64-linux`, `x86_64-linux-musl`,
|
|
61
|
+
`aarch64-linux-musl`, `arm64-darwin`, `x86_64-darwin`, `x64-mingw-ucrt`,
|
|
62
|
+
`x86-mingw32`, `aarch64-mingw-ucrt`. For unsupported platforms the
|
|
63
|
+
generic Ruby-platform gem is installed and the binary is downloaded on
|
|
64
|
+
first use into the user data directory.
|
|
65
|
+
|
|
66
|
+
Add to your `Gemfile`:
|
|
67
|
+
|
|
68
|
+
```ruby
|
|
69
|
+
gem "rpdfium"
|
|
70
|
+
gem "rpdfium-binary"
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
### Alternative: manual `PDFIUM_LIBRARY_PATH`
|
|
74
|
+
|
|
75
|
+
Useful in containers, CI, or when you need a specific PDFium build:
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
# macOS arm64
|
|
79
|
+
curl -L https://github.com/bblanchon/pdfium-binaries/releases/latest/download/pdfium-mac-arm64.tgz | tar xz
|
|
80
|
+
export PDFIUM_LIBRARY_PATH=$PWD/lib/libpdfium.dylib
|
|
81
|
+
```
|
|
82
|
+
|
|
83
|
+
## Architecture
|
|
84
|
+
|
|
85
|
+
Three layers, mirroring `pypdfium2`:
|
|
86
|
+
|
|
87
|
+
1. **`Rpdfium::Raw`** — pure FFI bindings, 1:1 with the C API
|
|
88
|
+
(`FPDF_*`, `FPDFText_*`, `FPDFBitmap_*`, `FPDFPath_*`,
|
|
89
|
+
`FPDFImageObj_*`, `FPDFAnnot_*`). Use directly if you need something
|
|
90
|
+
the wrappers don't expose.
|
|
91
|
+
2. **`Rpdfium::Document, ::Page, ::TextPage, ::Image::Embedded,
|
|
92
|
+
::Annotation, ::Form::Field, ::Search, ::Outline, ::Attachment`** —
|
|
93
|
+
RAII-style wrappers with `ObjectSpace.define_finalizer` so handles
|
|
94
|
+
are released even if you forget `close`.
|
|
95
|
+
3. **`Rpdfium::Table::Extractor`** — table detection on top of layer 2,
|
|
96
|
+
with `Rpdfium::Table::Debugger` for visual debugging.
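The layer-2 pattern (explicit `close` plus a finalizer as a safety net) can be sketched in plain Ruby. `FakeNative` and `Handle` below are stand-ins invented for this example; they only mimic a native resource and are not rpdfium classes.

```ruby
# `FakeNative` stands in for a native library that hands out handles (hypothetical).
module FakeNative
  @live = {}

  class << self
    attr_reader :live

    def acquire(id)
      @live[id] = true
      id
    end

    def release(id)
      @live.delete(id)
    end
  end
end

# Wrapper with deterministic `close` and a GC finalizer as a fallback.
class Handle
  def initialize(id)
    @id = FakeNative.acquire(id)
    @closed = false
    # The finalizer proc is built in a class method so it does not capture `self`;
    # capturing the instance would keep it alive forever.
    ObjectSpace.define_finalizer(self, self.class.releaser(@id))
  end

  def self.releaser(id)
    proc { FakeNative.release(id) }
  end

  def close
    return if @closed

    FakeNative.release(@id)
    ObjectSpace.undefine_finalizer(self) # nothing left for the GC to release
    @closed = true
  end

  def closed?
    @closed
  end
end

handle = Handle.new(1)
handle.close
handle.close # second call is a no-op
```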
|
|
97
|
+
|
|
98
|
+
## What you can do
|
|
99
|
+
|
|
100
|
+
### Text
|
|
101
|
+
|
|
102
|
+
```ruby
|
|
103
|
+
page.text # plain string
|
|
104
|
+
page.text_in_bbox(left: 50, top: 100, right: 300, bottom: 150)
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
### Character-level metadata
|
|
108
|
+
|
|
109
|
+
Per-char data essential for layout-aware processing — bounding box,
|
|
110
|
+
font, weight, origin, rotation angle, plus PDFium's "character
|
|
111
|
+
provenance" flags:
|
|
112
|
+
|
|
113
|
+
```ruby
|
|
114
|
+
page.chars.first
|
|
115
|
+
# {
|
|
116
|
+
# char: "T",
|
|
117
|
+
# codepoint: 84,
|
|
118
|
+
# x0: 72.0, x1: 79.2, top: 100.5, bottom: 112.3,
|
|
119
|
+
# origin_x: 72.0, origin_y: 110.8,
|
|
120
|
+
# angle: 0.0, # radians (rotated text)
|
|
121
|
+
# fontsize: 12.0,
|
|
122
|
+
# font: "Helvetica-Bold",
|
|
123
|
+
# weight: 700,
|
|
124
|
+
# render_mode: 0, # 0=fill 1=stroke 2=both 3=invisible
|
|
125
|
+
# generated: false, # true → inserted by PDFium (e.g. spaces)
|
|
126
|
+
# hyphen: false, # true → soft-hyphen for line break
|
|
127
|
+
# unicode_error: false # true → couldn't map glyph to unicode
|
|
128
|
+
# }
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
`generated`/`hyphen`/`unicode_error` are the **artefact recognition
|
|
132
|
+
flags** — distinguishing real characters from PDFium-synthesized ones is
|
|
133
|
+
crucial when you don't want fake whitespace to widen a column.
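As a small illustration of why the flags matter, the sketch below filters records shaped like the `page.chars` output above. The sample values are made up, and only the keys needed for the example are included.

```ruby
# Made-up records shaped like the `page.chars` output shown above.
chars = [
  { char: "A", x0: 10.0, x1: 16.0, generated: false, hyphen: false, unicode_error: false },
  { char: " ", x0: 16.0, x1: 40.0, generated: true,  hyphen: false, unicode_error: false },
  { char: "B", x0: 40.0, x1: 46.0, generated: false, hyphen: false, unicode_error: false }
]

# Keep only characters that are actually present in the content stream.
real = chars.reject { |c| c[:generated] || c[:unicode_error] }

# Summed character widths, with and without the synthesized space.
ink_width = ->(list) { list.sum { |c| c[:x1] - c[:x0] } }

puts ink_width.call(chars)
puts ink_width.call(real)
```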
|
|
134
|
+
|
|
135
|
+
Loose char boxes (proportional to font size, more stable for layout
|
|
136
|
+
algorithms):
|
|
137
|
+
|
|
138
|
+
```ruby
|
|
139
|
+
page.chars(loose: true)
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
Cluster chars into words automatically:
|
|
143
|
+
|
|
144
|
+
```ruby
|
|
145
|
+
page.words(x_tolerance: 3.0, y_tolerance: 3.0)
|
|
146
|
+
# [{ text: "Invoice", x0: 72.0, x1: 110.5, top: 100.5, bottom: 112.3,
|
|
147
|
+
# fontsize: 12.0, font: "Helvetica-Bold", chars: [...] }, ...]
|
|
148
|
+
```
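To make the two tolerances concrete, here is a simplified clustering sketch over char hashes. It is not rpdfium's implementation: `cluster_words` is a hypothetical helper that assumes unrotated, left-to-right text, top-origin coordinates, and chars on one line sharing the same `top`.

```ruby
# Simplified word clustering (hypothetical helper, not rpdfium's algorithm).
# A new word starts at whitespace, at a horizontal gap larger than x_tolerance,
# or when `top` moves by more than y_tolerance.
def cluster_words(chars, x_tolerance: 3.0, y_tolerance: 3.0)
  words = []
  current = []

  flush = lambda do
    next if current.empty?

    words << {
      text: current.map { |c| c[:char] }.join,
      x0: current.first[:x0], x1: current.last[:x1],
      top: current.map { |c| c[:top] }.min,
      bottom: current.map { |c| c[:bottom] }.max
    }
    current = []
  end

  chars.sort_by { |c| [c[:top], c[:x0]] }.each do |c|
    if c[:char].strip.empty?
      flush.call
      next
    end

    prev = current.last
    far_x = prev && (c[:x0] - prev[:x1]) > x_tolerance
    far_y = prev && (c[:top] - prev[:top]).abs > y_tolerance
    flush.call if far_x || far_y
    current << c
  end

  flush.call
  words
end

line = ->(char, x0, x1, top) { { char: char, x0: x0, x1: x1, top: top, bottom: top + 10.0 } }
chars = [
  line.call("H", 0.0, 5.0, 100.0), line.call("i", 5.0, 7.0, 100.0),
  line.call(" ", 7.0, 10.0, 100.0),
  line.call("y", 10.0, 15.0, 100.0), line.call("o", 15.0, 20.0, 100.0),
  line.call("Z", 0.0, 5.0, 120.0)
]
words = cluster_words(chars)
puts words.map { |w| w[:text] }.inspect
```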
|
|
149
|
+
|
|
150
|
+
### Vector paths
|
|
151
|
+
|
|
152
|
+
Real path-segment iteration (not just bounding boxes), with a state
|
|
153
|
+
machine for `closepath`. Useful for table line detection, signatures,
|
|
154
|
+
form layout analysis:
|
|
155
|
+
|
|
156
|
+
```ruby
|
|
157
|
+
page.line_segments
|
|
158
|
+
# [{ x0: 72.0, y0: 100.0, x1: 540.0, y1: 100.0, stroke_width: 0.5 }, ...]
|
|
159
|
+
|
|
160
|
+
page.horizontal_lines
|
|
161
|
+
page.vertical_lines
|
|
162
|
+
```
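How segments might be split by orientation can be sketched over hashes shaped like the `page.line_segments` output. The `classify` helper and its `tolerance` parameter are assumptions made for this example, not part of the documented API.

```ruby
# Split segments into horizontal and vertical ones (hypothetical helper).
# A segment counts as horizontal when its two y values differ by at most
# `tolerance`, and as vertical when its two x values do. Diagonals match neither.
def classify(segments, tolerance: 0.5)
  {
    horizontal: segments.select { |s| (s[:y0] - s[:y1]).abs <= tolerance },
    vertical:   segments.select { |s| (s[:x0] - s[:x1]).abs <= tolerance }
  }
end

segments = [
  { x0: 72.0, y0: 100.0, x1: 540.0, y1: 100.0, stroke_width: 0.5 }, # rule
  { x0: 72.0, y0: 100.0, x1: 72.0,  y1: 300.0, stroke_width: 0.5 }, # border
  { x0: 10.0, y0: 10.0,  x1: 50.0,  y1: 60.0,  stroke_width: 1.0 }  # diagonal
]
result = classify(segments)
puts "#{result[:horizontal].size} horizontal, #{result[:vertical].size} vertical"
```

A zero-length segment would land in both groups with this rule; a real pipeline would typically drop such segments first with a minimum-length filter.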
|
|
163
|
+
|
|
164
|
+
### Images
|
|
165
|
+
|
|
166
|
+
```ruby
|
|
167
|
+
page.images.each do |img|
|
|
168
|
+
meta = img.metadata
|
|
169
|
+
puts "#{meta[:width]}×#{meta[:height]} @ #{meta[:horizontal_dpi]} DPI, " \
|
|
170
|
+
"#{meta[:colorspace]}"
|
|
171
|
+
puts "filters: #{img.filters}" # e.g. ["DCTDecode"] for JPEG
|
|
172
|
+
|
|
173
|
+
# JPEG passthrough when filters == ["DCTDecode"]; otherwise rendered to PNG
|
|
174
|
+
img.save("img_#{img.bbox[:x0].to_i}.jpg")
|
|
175
|
+
|
|
176
|
+
# Or get raw/decoded bytes for custom processing
|
|
177
|
+
img.raw_bytes # as stored
|
|
178
|
+
img.decoded_bytes # post-filters (raster)
|
|
179
|
+
end
|
|
180
|
+
```
|
|
181
|
+
|
|
182
|
+
### Annotations & links
|
|
183
|
+
|
|
184
|
+
```ruby
|
|
185
|
+
page.annotations.each do |a|
|
|
186
|
+
puts "#{a.subtype}: #{a[:Contents]} at #{a.bbox.inspect}"
|
|
187
|
+
end
|
|
188
|
+
|
|
189
|
+
page.links.each do |link|
|
|
190
|
+
puts link.link_uri || "→ page #{link.link_dest_page}"
|
|
191
|
+
end
|
|
192
|
+
```
|
|
193
|
+
|
|
194
|
+
### Forms (read-only)
|
|
195
|
+
|
|
196
|
+
```ruby
|
|
197
|
+
doc = Rpdfium.open("form.pdf")
|
|
198
|
+
puts doc.form_type # :acroform / :xfa_full / :xfa_foreground / :none
|
|
199
|
+
|
|
200
|
+
doc.each do |page|
|
|
201
|
+
page.form_fields.each do |f|
|
|
202
|
+
pp f.to_h
|
|
203
|
+
# { name: "name", type: :textfield, value: "Mario Rossi",
|
|
204
|
+
# readonly: false, required: true, bbox: {...} }
|
|
205
|
+
end
|
|
206
|
+
end
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
### Outline (bookmarks) & attachments
|
|
210
|
+
|
|
211
|
+
```ruby
|
|
212
|
+
Rpdfium::Outline.flatten(doc.outline) do |item, depth|
|
|
213
|
+
puts "#{" " * depth}- #{item.title} (page #{item.page_index})"
|
|
214
|
+
end
|
|
215
|
+
|
|
216
|
+
doc.attachments.each { |a| a.save("attached_#{a.name}") }
|
|
217
|
+
```
|
|
218
|
+
|
|
219
|
+
### Search
|
|
220
|
+
|
|
221
|
+
```ruby
|
|
222
|
+
page.search("totale", match_case: false).each_match do |m|
|
|
223
|
+
puts "found '#{m[:text]}' at #{m[:rects].first.inspect}"
|
|
224
|
+
end
|
|
225
|
+
```
|
|
226
|
+
|
|
227
|
+
### Rendering
|
|
228
|
+
|
|
229
|
+
```ruby
|
|
230
|
+
# Pure-Ruby PNG writer, zero deps:
|
|
231
|
+
page.render_to_png("page.png", scale: 2.0, include_annotations: true,
|
|
232
|
+
include_forms: true)
|
|
233
|
+
|
|
234
|
+
# Or get raw RGBA/BGRA/Gray bytes:
|
|
235
|
+
w, h, bytes, stride = page.render(scale: 2.0, output: :rgba)
|
|
236
|
+
```
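For readers curious what a dependency-free PNG writer involves, the sketch below encodes tightly packed 8-bit RGBA bytes using only Ruby's bundled `zlib`. It is a generic minimal encoder, not the code in `lib/rpdfium/io/png.rb`; `encode_png` and `png_chunk` are names chosen for this example, and it assumes no row padding (stride equals width × 4).

```ruby
require "zlib"

# A PNG chunk is: 4-byte length, 4-byte type, data, CRC-32 over type + data.
def png_chunk(type, data)
  [data.bytesize].pack("N") + type + data + [Zlib.crc32(type + data)].pack("N")
end

# Minimal 8-bit RGBA encoder: no interlacing, filter type 0 on every scanline.
def encode_png(width, height, rgba)
  expected = width * height * 4
  raise ArgumentError, "expected #{expected} bytes, got #{rgba.bytesize}" unless rgba.bytesize == expected

  stride = width * 4
  # Each scanline is prefixed with a filter-type byte (0 = none).
  raw = (0...height).map { |y| "\x00".b + rgba.byteslice(y * stride, stride) }.join
  ihdr = [width, height, 8, 6, 0, 0, 0].pack("NNCCCCC") # bit depth 8, colour type 6 (RGBA)

  "\x89PNG\r\n\x1A\n".b +
    png_chunk("IHDR", ihdr) +
    png_chunk("IDAT", Zlib::Deflate.deflate(raw)) +
    png_chunk("IEND", "")
end

pixels = ([255, 0, 0, 255] * 4).pack("C*") # 2x2 opaque red
png = encode_png(2, 2, pixels)
puts "#{png.bytesize} bytes"
# File.binwrite("red.png", png) would write the image to disk.
```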
|
|
237
|
+
|
|
238
|
+
### Tables
|
|
239
|
+
|
|
240
|
+
`pdfplumber`-style settings — every parameter you'd recognize:
|
|
241
|
+
|
|
242
|
+
```ruby
|
|
243
|
+
extractor = Rpdfium::Table::Extractor.new(page,
|
|
244
|
+
vertical_strategy: :lines, # :lines / :lines_strict / :text / :explicit
|
|
245
|
+
horizontal_strategy: :lines,
|
|
246
|
+
snap_tolerance: 3.0,
|
|
247
|
+
join_tolerance: 3.0,
|
|
248
|
+
edge_min_length: 3.0,
|
|
249
|
+
edge_min_length_prefilter: 1.0,
|
|
250
|
+
intersection_tolerance: 3.0,
|
|
251
|
+
min_words_vertical: 3,
|
|
252
|
+
min_words_horizontal: 1,
|
|
253
|
+
text_x_tolerance: 3.0,
|
|
254
|
+
text_y_tolerance: 3.0,
|
|
255
|
+
explicit_vertical_lines: [], # [Float] x-coords or [Hash{x:, top:, bottom:}]
|
|
256
|
+
explicit_horizontal_lines: [],
|
|
257
|
+
auto_fallback: true # try :text if :lines finds nothing
|
|
258
|
+
)
|
|
259
|
+
|
|
260
|
+
extractor.tables.each do |table|
|
|
261
|
+
table.bbox # => [x0, top, x1, bottom]
|
|
262
|
+
table.rows # => Array<Array<bbox|nil>>
|
|
263
|
+
table.columns # => Array<Array<bbox|nil>>
|
|
264
|
+
table.extract # => Array<Array<String>>
|
|
265
|
+
end
|
|
266
|
+
|
|
267
|
+
extractor.extract # shortcut: => [[[String, ...], ...], ...] (list of tables)
|
|
268
|
+
extractor.edges # post-snap/join edges
|
|
269
|
+
extractor.intersections # Hash{[x,y] => {v:[edges], h:[edges]}}
|
|
270
|
+
extractor.cells # Array<bbox>
|
|
271
|
+
```
|
|
272
|
+
|
|
273
|
+
The pipeline mirrors `pdfplumber.TableFinder` 1:1 and uses the same
|
|
274
|
+
algorithms for words-to-edges, intersections-to-cells, cells-to-tables.
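One way to read `snap_tolerance` from the settings above is as a clustering step on edge coordinates: values that sit close together are pulled onto a shared position. The sketch below shows that idea in isolation. `snap` is a hypothetical helper, and the exact clustering rule used by rpdfium or pdfplumber may differ.

```ruby
# Cluster sorted coordinates: a value joins the current cluster when it is within
# `tolerance` of the previous value; every member is then replaced by the
# cluster mean. Returns the snapped values in ascending order.
def snap(values, tolerance: 3.0)
  clusters = values.sort.slice_when { |a, b| (b - a) > tolerance }
  clusters.flat_map do |cluster|
    mean = cluster.sum / cluster.size.to_f
    Array.new(cluster.size, mean)
  end
end

xs = [72.0, 72.4, 73.1, 200.0, 201.0]
snapped = snap(xs)
puts snapped.map { |v| v.round(2) }.inspect
```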
|
|
275
|
+
|
|
276
|
+
Visual debugger (saves PNG with overlay: red lines, green intersections,
|
|
277
|
+
blue table fills):
|
|
278
|
+
|
|
279
|
+
```ruby
|
|
280
|
+
Rpdfium::Table::Debugger.visualize(page, "debug.png",
|
|
281
|
+
vertical_strategy: :lines)
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
### Form-aware extraction (font filtering)
|
|
285
|
+
|
|
286
|
+
Some PDFs are "filled-out forms" — F24, tax declarations, payment
|
|
287
|
+
slips, government forms — where the form template and the entered
|
|
288
|
+
data both exist as static graphics text on the page (no AcroForm
|
|
289
|
+
fields, no tagged structure). On these PDFs the table pipeline picks
|
|
290
|
+
up the template labels as noise alongside the data.
|
|
291
|
+
|
|
292
|
+
The robust strategy is to separate chars by **role** using their
|
|
293
|
+
font: the template typically uses proportional fonts (Futura, Times,
|
|
294
|
+
Helvetica) while the data layer uses a single font (often Courier
|
|
295
|
+
monospace, or Helvetica at a specific size).
|
|
296
|
+
|
|
297
|
+
```ruby
|
|
298
|
+
Rpdfium.open("f24.pdf") do |doc|
|
|
299
|
+
page = doc.page(0)
|
|
300
|
+
|
|
301
|
+
# Discover what fonts are on the page
|
|
302
|
+
page.font_inventory.first(5).each do |g|
|
|
303
|
+
puts "#{g[:font].ljust(20)} h=#{g[:height]} | #{g[:count]} chars | #{g[:sample][0,40]}"
|
|
304
|
+
end
|
|
305
|
+
# Futura-Light h=8.3 | 946 chars | "cognome, denominazione o ragione sociale"
|
|
306
|
+
# Courier h=10.5 | 365 chars | "01234567890Azienda S.R.L.P"
|
|
307
|
+
# Futura-Bold h=10.4 | 249 chars | "CODICE FISCALEDATI ANAGRAFICI..."
|
|
308
|
+
# ...
|
|
309
|
+
|
|
310
|
+
# Extract just the entered data, line by line
|
|
311
|
+
page.lines(font: "Courier").each { |l| puts l }
|
|
312
|
+
# => "Soggetto: Azienda S.R.L. ( 01234567890 )"
|
|
313
|
+
# => "1001 11 2021 499,81 0,00"
|
|
314
|
+
# => "1712 12 2021 32,46 0,00"
|
|
315
|
+
# => "1701 11 2021 0,00 295,89"
|
|
316
|
+
# => "532,27 295,89 236,38"
|
|
317
|
+
# => ...
|
|
318
|
+
end
|
|
319
|
+
```
|
|
320
|
+
|
|
321
|
+
Three primitives:
|
|
322
|
+
|
|
323
|
+
- `Page#font_inventory` — distribution by `(font, height, weight)`,
|
|
324
|
+
with counts and samples for inspection
|
|
325
|
+
- `Page#chars_where(font:, height:, weight:, bbox:, where:)` —
|
|
326
|
+
filter chars by any combination of criteria
|
|
327
|
+
- `Page#lines(font:, ...)` — high-level helper: filter + word
|
|
328
|
+
extraction + line clustering, returns `Array<String>`
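The idea behind the first two primitives can be shown on plain hashes. The helpers below (`inventory`, `with_font`) are hypothetical re-creations for illustration; they assume the inventory's `height` corresponds to the char's `fontsize`, which the README does not state.

```ruby
# Made-up records shaped like the `page.chars` output (only the keys used here).
chars = [
  { char: "N", font: "Futura-Light", fontsize: 8.3,  weight: 300 },
  { char: "o", font: "Futura-Light", fontsize: 8.3,  weight: 300 },
  { char: "4", font: "Courier",      fontsize: 10.5, weight: 400 },
  { char: "2", font: "Courier",      fontsize: 10.5, weight: 400 },
  { char: "7", font: "Courier",      fontsize: 10.5, weight: 400 }
]

# Group chars by (font, size, weight) and report the largest groups first.
def inventory(chars)
  groups = chars.group_by { |c| [c[:font], c[:fontsize], c[:weight]] }
  rows = groups.map do |(font, size, weight), members|
    { font: font, height: size, weight: weight, count: members.size,
      sample: members.map { |c| c[:char] }.join[0, 40] }
  end
  rows.sort_by { |row| -row[:count] }
end

# `===` lets the caller pass either an exact String or a Regexp.
def with_font(chars, font)
  chars.select { |c| font === c[:font] }
end

top = inventory(chars).first
data = with_font(chars, "Courier").map { |c| c[:char] }.join
puts "#{top[:font]} x#{top[:count]}: #{data}"
```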
|
|
329
|
+
|
|
330
|
+
Works on F24 payment forms, VAT periodic communications, withholding
|
|
331
|
+
tax declarations, and similar government forms — anywhere the data
|
|
332
|
+
sits on a printed template as text.
|
|
333
|
+
|
|
334
|
+
#### Label-value pairing
|
|
335
|
+
|
|
336
|
+
`Page#label_value_pairs` associates each extracted value with the
|
|
337
|
+
semantic label from the template that describes it. Useful when you
|
|
338
|
+
want machine-readable `field_name → field_value` pairs without
|
|
339
|
+
hard-coding the form layout.
|
|
340
|
+
|
|
341
|
+
```ruby
|
|
342
|
+
Rpdfium.open("f24.pdf") do |doc|
|
|
343
|
+
pairs = doc.page(0).label_value_pairs(
|
|
344
|
+
data_font: "Courier",
|
|
345
|
+
template_font: /^Futura/,
|
|
346
|
+
data_filter: ->(t) { t.match?(/^[\d.,]+$/) }
|
|
347
|
+
)
|
|
348
|
+
pairs.each do |p|
|
|
349
|
+
col = p[:labels][:col]
|
|
350
|
+
row = p[:labels][:row]
|
|
351
|
+
puts "#{p[:value].ljust(12)} → col: #{col}, row: #{row}"
|
|
352
|
+
end
|
|
353
|
+
end
|
|
354
|
+
# 499,81 → col: "importi a debito versati"
|
|
355
|
+
# 1.615,90 → col: "SALDO (M-N) +/–", row: "EURO +" ← final balance
|
|
356
|
+
```
|
|
357
|
+
|
|
358
|
+
The algorithm clusters template words into coherent labels, then for
|
|
359
|
+
each value finds the `:col` label (positioned above) and the `:row`
|
|
360
|
+
label (positioned to the left).
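The geometric part of that search can be sketched as follows. This is a simplified illustration under stated assumptions, not the `Util::LabelMatcher` implementation: coordinates are top-origin (`top` smaller than `bottom`), labels are already clustered, and `pair_labels` and `overlap?` are hypothetical helpers.

```ruby
# True when the intervals [a0, a1] and [b0, b1] overlap.
def overlap?(a0, a1, b0, b1)
  a0 < b1 && b0 < a1
end

# Pick the nearest label above the value (sharing its x range) as :col and the
# nearest label to its left (sharing its y range) as :row.
def pair_labels(value, labels)
  above = labels.select do |l|
    l[:bottom] <= value[:top] && overlap?(l[:x0], l[:x1], value[:x0], value[:x1])
  end
  left = labels.select do |l|
    l[:x1] <= value[:x0] && overlap?(l[:top], l[:bottom], value[:top], value[:bottom])
  end

  {
    value: value[:text],
    labels: {
      col: above.max_by { |l| l[:bottom] }&.fetch(:text),
      row: left.max_by { |l| l[:x1] }&.fetch(:text)
    }
  }
end

# Made-up layout: one column header, one row header, one value.
labels = [
  { text: "Amount due", x0: 300.0, x1: 380.0, top: 50.0,  bottom: 60.0 },
  { text: "Total",      x0: 100.0, x1: 140.0, top: 100.0, bottom: 110.0 }
]
value = { text: "499,81", x0: 310.0, x1: 350.0, top: 100.0, bottom: 110.0 }

pair = pair_labels(value, labels)
puts pair.inspect
```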
|
|
361
|
+
|
|
362
|
+
#### Composable primitives for complex forms
|
|
363
|
+
|
|
364
|
+
For complex forms with repeating tables, boxed-layout cells, or
|
|
365
|
+
multi-word values, compose three primitives:
|
|
366
|
+
|
|
367
|
+
**`Util::WordMerger`** — join adjacent words on the same line:
|
|
368
|
+
|
|
369
|
+
```ruby
|
|
370
|
+
merger = Rpdfium::Util::WordMerger.new(x_gap: 20.0, y_tol: 3.0)
|
|
371
|
+
merged = merger.merge_by_proximity(words)
|
|
372
|
+
# or, with labels mapping to preserve checkbox grids:
|
|
373
|
+
merged = merger.merge_by_label(words, label_per_word)
|
|
374
|
+
# or, only merge orphans (no label assigned):
|
|
375
|
+
merged = merger.merge_unlabeled(words, label_per_word)
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
**`Util::ColumnInference`** — identify data columns by alignment:
|
|
379
|
+
|
|
380
|
+
```ruby
|
|
381
|
+
inference = Rpdfium::Util::ColumnInference.new(
|
|
382
|
+
x_tolerance: 3.0,
|
|
383
|
+
min_size: 3,
|
|
384
|
+
cv_threshold: 0.15
|
|
385
|
+
)
|
|
386
|
+
columns = inference.infer(words)
|
|
387
|
+
# => [[word1, word2, ..., word12], ...]
|
|
388
|
+
```
|
|
389
|
+
|
|
390
|
+
Algorithm: cluster by `x0` (left-align) AND `x1` (right-align), split
|
|
391
|
+
columns at large vertical gaps, filter by gap-regularity (coefficient
|
|
392
|
+
of variation < 0.15) to exclude false positives.
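The gap-regularity filter is the least obvious step, so here is a worked sketch of it. The helper names are hypothetical, and the sketch assumes the coefficient of variation is the population standard deviation of the vertical gaps divided by their mean; the library may compute it differently.

```ruby
# Coefficient of variation (stddev / mean) of the gaps between consecutive rows.
def gap_cv(tops)
  gaps = tops.sort.each_cons(2).map { |a, b| b - a }
  return Float::INFINITY if gaps.empty?

  mean = gaps.sum / gaps.size.to_f
  return Float::INFINITY if mean.zero?

  variance = gaps.sum { |g| (g - mean)**2 } / gaps.size
  Math.sqrt(variance) / mean
end

# A candidate column is kept when it has enough members and evenly spaced rows.
def regular_column?(tops, min_size: 3, cv_threshold: 0.15)
  tops.size >= min_size && gap_cv(tops) < cv_threshold
end

regular   = [100.0, 112.0, 124.0, 136.5] # gaps 12, 12, 12.5
irregular = [100.0, 105.0, 160.0, 162.0] # gaps 5, 55, 2

puts regular_column?(regular)
puts regular_column?(irregular)
```

Evenly spaced rows give a coefficient close to zero, while words that merely happen to share an `x0` tend to have erratic spacing and are rejected.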
|
|
393
|
+
|
|
394
|
+
**`Util::LabelMatcher`** with column inference enables header
|
|
395
|
+
propagation for repeating tables (e.g. 770 Quadro ST with rows
|
|
396
|
+
ST2..ST13 sharing column headers printed once at the top):
|
|
397
|
+
|
|
398
|
+
```ruby
|
|
399
|
+
matcher = Rpdfium::Util::LabelMatcher.new(
|
|
400
|
+
column_inference: Rpdfium::Util::ColumnInference.new
|
|
401
|
+
)
|
|
402
|
+
pairs = page.label_value_pairs(data_font: "Courier", matcher: matcher)
|
|
403
|
+
```
|
|
404
|
+
|
|
405
|
+
For boxed-layout forms (cells separated by ~10pt with template
|
|
406
|
+
graphics for decimals), pass `inject_spaces: false, x_tolerance: 15.0`
|
|
407
|
+
to `label_value_pairs` and `row_max_dx: 400.0` to the matcher.
|
|
408
|
+
|
|
409
|
+
See `examples/adapters/` for complete working adapters that compose
|
|
410
|
+
these primitives for specific Italian tax forms (Modello 770,
|
|
411
|
+
Comunicazione IVA).
|
|
412
|
+
|
|
413
|
+
### Struct tree (Tagged PDF)
|
|
414
|
+
|
|
415
|
+
For tagged PDFs (PDF/UA, accessibility-friendly exports from
|
|
416
|
+
Word/LibreOffice/InDesign), `Page#struct_tree` exposes the document's
|
|
417
|
+
logical structure (Document → P, H1, Table, TR, TH, TD, Figure, ...)
|
|
418
|
+
independently of the visual layout. This gives **zero-geometry**
|
|
419
|
+
extraction with semantic typing (TH vs TD, RowSpan, ColSpan, Lang).
|
|
420
|
+
|
|
421
|
+
```ruby
|
|
422
|
+
page.struct_tree do |tree|
|
|
423
|
+
next if tree.nil? || tree.empty?
|
|
424
|
+
|
|
425
|
+
tree.tables.each do |table|
|
|
426
|
+
rows = table.children.select { |c| c.type == "TR" }
|
|
427
|
+
rows.each do |row|
|
|
428
|
+
cells = row.children.select { |c| %w[TH TD].include?(c.type) }
|
|
429
|
+
puts cells.map(&:text).map(&:strip).inspect
|
|
430
|
+
end
|
|
431
|
+
end
|
|
432
|
+
end
|
|
433
|
+
# => ["Region", "Revenue", "Growth"] (TH)
|
|
434
|
+
# => ["Italy", "1.250.000", "+12%"] (TD)
|
|
435
|
+
# => ...
|
|
436
|
+
```
|
|
437
|
+
|
|
438
|
+
API summary:
|
|
439
|
+
|
|
440
|
+
```ruby
|
|
441
|
+
tree = page.struct_tree # → Tree or nil (nil if not tagged)
|
|
442
|
+
tree.empty? # true for "tagged but placeholder" PDFs
|
|
443
|
+
tree.roots # → [Element, ...]
|
|
444
|
+
tree.walk { |el| ... } # depth-first
|
|
445
|
+
tree.find_all(type: "P")
|
|
446
|
+
tree.tables # → [Element, ...] where type == "Table"
|
|
447
|
+
|
|
448
|
+
element.type # "P", "Table", "TR", "TD", ...
|
|
449
|
+
element.children # → [Element, ...]
|
|
450
|
+
element.parent # → Element or nil
|
|
451
|
+
element.text # text via MCID + ActualText override
|
|
452
|
+
element.actual_text # /ActualText (for ligature/math resolution)
|
|
453
|
+
element.alt_text # /Alt (Figure / Formula)
|
|
454
|
+
element.lang # "it-IT", "en-US", ...
|
|
455
|
+
element.marked_content_ids # → [Integer]
|
|
456
|
+
element.attributes # → { name => value }
|
|
457
|
+
```
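The traversal behind `walk` and `find_all` can be pictured with a tiny stand-in tree. `Element` below is a plain `Struct` defined for this example, not `Rpdfium::Structure::Element`, and it models only `type`, `text`, and `children`.

```ruby
# Stand-in element with a depth-first, pre-order walk (illustration only).
Element = Struct.new(:type, :text, :children) do
  def walk(&block)
    return enum_for(:walk) unless block

    block.call(self)
    children.each { |child| child.walk(&block) }
  end
end

cell = ->(type, text) { Element.new(type, text, []) }
table = Element.new("Table", nil, [
  Element.new("TR", nil, [cell.call("TH", "Region"), cell.call("TH", "Revenue")]),
  Element.new("TR", nil, [cell.call("TD", "Italy"),  cell.call("TD", "1.250.000")])
])
root = Element.new("Document", nil, [table])

# Collect every row, then the text of its header or data cells.
rows = root.walk.select { |el| el.type == "TR" }.map do |row|
  row.children.select { |c| %w[TH TD].include?(c.type) }.map(&:text)
end
puts rows.inspect
```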
|
|
458
|
+
|
|
459
|
+
Three possible states of `page.struct_tree`:
|
|
460
|
+
|
|
461
|
+
| PDF type | returns |
|
|
462
|
+
| --- | --- |
|
|
463
|
+
| Not tagged (most PDFs from line-of-business software, scanned PDFs) | `nil` |
|
|
464
|
+
| Tagged but empty (some bank statements have placeholder StructTreeRoot) | `Tree` with `empty? == true` |
|
|
465
|
+
| Properly tagged (Word/LibreOffice/InDesign export with accessibility tags) | Navigable `Tree` |
|
|
466
|
+
|
|
467
|
+
Lifecycle: prefer the block form for deterministic close. The implicit
|
|
468
|
+
form (no block) leaves cleanup to `FPDF_CloseDocument` — no leak, just
|
|
469
|
+
the tree stays in memory until the document is closed.
|
|
470
|
+
|
|
471
|
+
## Performance
|
|
472
|
+
|
|
473
|
+
Measured on 4 PDFs of increasing complexity, best-of-3 runs after a
|
|
474
|
+
warm-up, isolated in subprocesses to capture clean peak RSS. Versions
|
|
475
|
+
under test: `rpdfium 0.3.13`, `pdfplumber 0.11.9`, `pypdfium2 5.6.0`.
|
|
476
|
+
|
|
477
|
+
| Test corpus | Pages | Size | What it stresses |
|
|
478
|
+
| --- | ---: | ---: | --- |
|
|
479
|
+
| `sample.pdf` | 1 | 18 KB | Plain text baseline |
|
|
480
|
+
| `form.pdf` | 1 | 107 KB | Char-per-text-object kerning, Form XObject, tables |
|
|
481
|
+
| `complex.pdf` | 85 | 60 MB | Magazine-style document, dense text + heavy graphics |
|
|
482
|
+
| `report.pdf` | 226 | 322 KB | Rotated pages (90°), small fonts, ~15 tables per page |
|
|
483
|
+
|
|
484
|
+
### Speed
|
|
485
|
+
|
|
486
|
+
| Corpus | Task | rpdfium | pypdfium2 | pdfplumber | speedup vs pdfplumber |
|
|
487
|
+
| --- | --- | ---: | ---: | ---: | ---: |
|
|
488
|
+
| sample.pdf (1 page) | text | 4 ms | 4 ms | 75 ms | **21×** |
|
|
489
|
+
| sample.pdf (1 page) | tables | 4 ms | n/a | 70 ms | **16×** |
|
|
490
|
+
| form.pdf (1 page) | text | 12 ms | 13 ms | 538 ms | **44×** |
|
|
491
|
+
| form.pdf (1 page) | tables | 25 ms | n/a | 575 ms | **23×** |
|
|
492
|
+
| complex.pdf (85 pages) | text | 190 ms | 183 ms | 7.76 s | **41×** |
|
|
493
|
+
| complex.pdf (85 pages) | tables | 231 ms | n/a | 7.07 s | **31×** |
|
|
494
|
+
| report.pdf (226 pages) | text | 412 ms | 397 ms | 23.26 s | **56×** |
|
|
495
|
+
| report.pdf (226 pages) | tables | 1.68 s | n/a | 25.25 s | **15×** |
|
|
496
|
+
|
|
497
|
+
`pypdfium2` does not implement table extraction (it's a raw FFI binding
|
|
498
|
+
to PDFium, not a full pipeline). It's listed as the "pure PDFium speed
|
|
499
|
+
floor" for text — rpdfium matches it within ±5%, showing that the Ruby
|
|
500
|
+
FFI overhead is negligible in these measurements.
|
|
501
|
+
|
|
502
|
+
### Memory (peak RSS)
|
|
503
|
+
|
|
504
|
+
| Corpus | rpdfium | pypdfium2 | pdfplumber | pdfplumber/rpdfium |
|
|
505
|
+
| --- | ---: | ---: | ---: | ---: |
|
|
506
|
+
| sample.pdf | 29 MB | 20 MB | 40 MB | 1.4× |
|
|
507
|
+
| form.pdf | 32 MB | 22 MB | 45 MB | 1.4× |
|
|
508
|
+
| complex.pdf | 106 MB | 69 MB | 535 MB | **5.0×** |
|
|
509
|
+
| report.pdf | 136 MB | 41 MB | 1003 MB | **7.4×** |
|
|
510
|
+
|
|
511
|
+
The memory gap widens with workload size. On a 226-page document
|
|
512
|
+
pdfplumber uses ~1 GB; rpdfium stays under 140 MB. For server-side
|
|
513
|
+
batch processing this is the difference between a 256 MB container and
|
|
514
|
+
a 2 GB one.
|
|
515
|
+
|
|
516
|
+
### Headline numbers
|
|
517
|
+
|
|
518
|
+
On large PDFs (226 pages, dense layout):
|
|
519
|
+
|
|
520
|
+
- **rpdfium completes both text + tables in ~2.1 s using 136 MB**
|
|
521
|
+
- **pdfplumber needs ~48 s and 1 GB** for the same work
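These headline figures are the sums of the `report.pdf` rows in the Speed table. The arithmetic, with the timings copied from that table:

```ruby
# Timings in seconds, copied from the report.pdf rows of the Speed table.
rpdfium    = { text: 0.412, tables: 1.68 }
pdfplumber = { text: 23.26, tables: 25.25 }

rpdfium_total    = rpdfium.values.sum
pdfplumber_total = pdfplumber.values.sum
ratio            = pdfplumber_total / rpdfium_total

puts format("rpdfium: %.2f s, pdfplumber: %.2f s, ratio: %.1fx",
            rpdfium_total, pdfplumber_total, ratio)
```

The combined text + tables ratio comes out at roughly 23×, between the per-task speedups of 56× (text) and 15× (tables) for this corpus.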
|
|
522
|
+
|
|
523
|
+
Across the four corpora the median speedup vs pdfplumber is roughly **42× on
|
|
524
|
+
text** and **20× on tables**, taking the median of the speedup column in the Speed table. rpdfium scales linearly with page count
|
|
525
|
+
(thanks to PDFium's C++ engine); pdfplumber's pure-Python pipeline
|
|
526
|
+
degrades super-linearly on large documents.
|
|
527
|
+
|
|
528
|
+
### Methodology
|
|
529
|
+
|
|
530
|
+
Each measurement is the **minimum of 3 timed runs after a warm-up run**
|
|
531
|
+
(to neutralize OS page cache effects on the 60 MB `complex.pdf`).
|
|
532
|
+
Subprocess isolation per measurement ensures clean RSS reading via
|
|
533
|
+
`resource.getrusage` / `/proc/self/status`. The benchmark harness is
|
|
534
|
+
a small Ruby driver that shells out to three runners (one Ruby script
|
|
535
|
+
using `rpdfium`, two Python scripts using `pdfplumber` and
|
|
536
|
+
`pypdfium2`), parses the JSON each emits, and aggregates the results.
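The timing rule (minimum of three runs after one warm-up) can be sketched as below. `best_of` is a hypothetical helper, not the benchmark harness itself, and it leaves out the subprocess isolation and RSS sampling described above.

```ruby
# Best-of-N wall-clock timing after warm-up runs, using a monotonic clock.
def best_of(runs: 3, warmup: 1)
  warmup.times { yield }

  Array.new(runs) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end.min
end

calls = 0
seconds = best_of do
  calls += 1
  (1..10_000).sum # placeholder workload
end
puts format("best of 3: %.6f s over %d calls", seconds, calls)
```

Taking the minimum rather than the mean discards runs disturbed by scheduling or cache noise, which suits a comparison of best-case throughput.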
|
|
537
|
+
|
|
538
|
+
Output quality has been spot-checked: rpdfium matches pypdfium2 char
|
|
539
|
+
count within ±1 char (rounding on the trailing newline). pdfplumber
|
|
540
|
+
returns ~2% fewer chars on locale-formatted numbers due to a different
|
|
541
|
+
word-tokenization for thousand-separator punctuation (e.g. `1.250.000`
|
|
542
|
+
split on periods).
|
|
543
|
+
|
|
544
|
+
## Memory safety
|
|
545
|
+
|
|
546
|
+
- `FPDF_LoadMemDocument64` does **not** copy the input bytes. The
|
|
547
|
+
`Document` wrapper holds an FFI buffer reference for its lifetime so
|
|
548
|
+
the GC can't free it early.
|
|
549
|
+
- Every PDFium handle (`*_Close*`) is wired to
|
|
550
|
+
`ObjectSpace.define_finalizer` so abandoned objects don't leak native
|
|
551
|
+
memory.
|
|
552
|
+
- `FPDF_InitLibrary` is called once per process under `Mutex`;
|
|
553
|
+
`FPDF_DestroyLibrary` runs via `at_exit`.
|
|
554
|
+
- `Document#close` releases in cascade: form-fill env → cached pages →
|
|
555
|
+
document handle.
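The once-per-process initialisation in the third bullet follows a common pattern, sketched here with stand-ins. `NativeLib`, `init_native`, and `destroy_native` are hypothetical; in the real gem the corresponding calls would be `FPDF_InitLibrary` and `FPDF_DestroyLibrary`.

```ruby
# Once-per-process initialisation guarded by a Mutex (illustrative stand-in).
module NativeLib
  LOCK = Mutex.new
  @initialized = false
  @init_calls = 0

  class << self
    attr_reader :init_calls

    def ensure_initialized
      LOCK.synchronize do
        return if @initialized

        init_native
        at_exit { destroy_native } # registered only once, on first init
        @initialized = true
      end
    end

    private

    # Stand-in for the native init call; counts how often it runs.
    def init_native
      @init_calls += 1
    end

    # Stand-in for the native teardown call.
    def destroy_native
      @initialized = false
    end
  end
end

# Many threads racing to initialise still trigger exactly one init.
Array.new(8) { Thread.new { NativeLib.ensure_initialized } }.each(&:join)
puts NativeLib.init_calls
```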
|
|
556
|
+
|
|
557
|
+
## Roadmap
|
|
558
|
+
|
|
559
|
+
| Status | Feature |
|
|
560
|
+
|---|---|
|
|
561
|
+
| ✅ | Document open (path / IO / bytes / password) |
|
|
562
|
+
| ✅ | Document metadata, permissions, file version |
|
|
563
|
+
| ✅ | Page text + bbox-bounded text |
|
|
564
|
+
| ✅ | Per-character bounding boxes (tight & loose) |
|
|
565
|
+
| ✅ | Char metadata: font, weight, origin, angle, render mode |
|
|
566
|
+
| ✅ | PDFium-generated char detection (artefact filtering) |
|
|
567
|
+
| ✅ | Word clustering (layout-aware) |
|
|
568
|
+
| ✅ | Vector path segments (real geometry, not bbox) |
|
|
569
|
+
| ✅ | Image extraction (raw + decoded + rendered) |
|
|
570
|
+
| ✅ | Annotations + link URI/dest |
|
|
571
|
+
| ✅ | AcroForm field reading |
|
|
572
|
+
| ✅ | Bookmarks (outline) |
|
|
573
|
+
| ✅ | File attachments |
|
|
574
|
+
| ✅ | Internal text search |
|
|
575
|
+
| ✅ | Page rendering to RGBA/BGRA/Gray |
|
|
576
|
+
| ✅ | Pure-Ruby PNG writer (zero deps) |
|
|
577
|
+
| ✅ | Table extraction — `:lines` strategy |
|
|
578
|
+
| ✅ | Table extraction — `:text` strategy |
|
|
579
|
+
| ✅ | Table extraction — `:explicit` strategy |
|
|
580
|
+
| ✅ | Visual table debugger |
|
|
581
|
+
| ✅ | [`rpdfium-binary`](https://github.com/retsef/rpdfium-binary) companion gem with prebuilt PDFium |
|
|
582
|
+
| ✅ | Structure tree traversal (tagged PDF → semantic tables / `Page#struct_tree`) |
|
|
583
|
+
| ✅ | Form-aware extraction via font filtering (`Page#font_inventory`, `chars_where`, `lines`) |
|
|
584
|
+
| ✅ | Semantic label-value pairing on filled forms (`Page#label_value_pairs`, `Util::LabelMatcher`) |
|
|
585
|
+
| 🚧 | XFA form support |
|
|
586
|
+
| 🔮 | OCR fallback for scanned PDFs (via tesseract bindings) |
|
|
587
|
+
| 🔮 | Write APIs (we're read-only by design for now) |
|
|
588
|
+
|
|
589
|
+
## Why not pure-Ruby?
|
|
590
|
+
|
|
591
|
+
A correct PDF text extractor needs to interpret the content stream
|
|
592
|
+
(operators, font encodings including CMap-based CIDs, ToUnicode maps,
|
|
593
|
+
ActualText overrides, marked content). PDFium has ~15 years of
|
|
594
|
+
edge-case fixes baked in. Reimplementing it in Ruby would take years
|
|
595
|
+
and still be slower. FFI is the right call.
|
|
596
|
+
|
|
597
|
+
## License
|
|
598
|
+
|
|
599
|
+
Apache-2.0 (same as PDFium itself).
|