rpdfium 0.4.1 → 0.4.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +615 -1317
- data/README.md +73 -78
- data/lib/rpdfium/annotation/annotation.rb +10 -8
- data/lib/rpdfium/document.rb +49 -22
- data/lib/rpdfium/errors.rb +2 -2
- data/lib/rpdfium/form/form.rb +9 -9
- data/lib/rpdfium/image/embedded.rb +17 -16
- data/lib/rpdfium/io/png.rb +9 -9
- data/lib/rpdfium/page.rb +561 -526
- data/lib/rpdfium/raw.rb +216 -203
- data/lib/rpdfium/search/search.rb +5 -5
- data/lib/rpdfium/structure/attachment.rb +6 -6
- data/lib/rpdfium/structure/element.rb +74 -74
- data/lib/rpdfium/structure/outline.rb +2 -2
- data/lib/rpdfium/structure/tree.rb +56 -55
- data/lib/rpdfium/table/cells.rb +36 -33
- data/lib/rpdfium/table/debugger.rb +12 -12
- data/lib/rpdfium/table/edges.rb +51 -49
- data/lib/rpdfium/table/extractor.rb +35 -34
- data/lib/rpdfium/table/table.rb +65 -62
- data/lib/rpdfium/util/cluster.rb +35 -33
- data/lib/rpdfium/util/column_inference.rb +34 -32
- data/lib/rpdfium/util/label_matcher.rb +30 -30
- data/lib/rpdfium/util/text_extraction.rb +15 -15
- data/lib/rpdfium/util/word_extractor.rb +49 -48
- data/lib/rpdfium/util/word_merger.rb +25 -24
- data/lib/rpdfium/version.rb +1 -1
- data/lib/rpdfium.rb +17 -15
- metadata +1 -1
data/README.md
CHANGED
@@ -27,16 +27,22 @@ end
 ## Why
 
 The Ruby ecosystem has `pdf-reader` (text only, slow on complex docs),
-`origami` (security-research focused), and `hexapdf`
-
-
-
-
+`origami` (security-research focused), and `hexapdf` — a capable library that
+extracts text with character-level positioning and exposes the vector-path
+primitives you need to build table extraction yourself (see the ~120-line
+[reference](benchmark/examples/hexapdf_table_extraction.rb) in the benchmark
+suite). `hexapdf` is AGPL / commercially licensed, though.
+
+I wanted a permissively licensed library that ships those higher-level
+pipelines **out of the box** — pdfplumber-style table detection, page
+rendering to raster — on top of character metadata, without hand-rolling them.
+`rpdfium` does that under Apache-2.0, binding the same battle-tested C++ engine
+that powers Chrome's PDF viewer.
 
 In practice it matches the speed of Python's `pypdfium2` on text
-extraction and is **
-**
-for details.
+extraction and is **up to ~80× faster than `pdfplumber`** while using
+**up to ~80× less memory** on dense documents (the 520-page academic tier).
+See [Performance](#performance) for details.
 
 ## Installing PDFium
 
@@ -321,7 +327,7 @@ end
 Three primitives:
 
 - `Page#font_inventory` — distribution by `(font, height, weight)`,
-  with counts and samples for
+  with counts and samples for inspection
 - `Page#chars_where(font:, height:, weight:, bbox:, where:)` —
   filter chars by any combination of criteria
 - `Page#lines(font:, ...)` — high-level helper: filter + word
@@ -470,76 +476,65 @@ the tree stays in memory until the document is closed.
 
 ## Performance
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-|
-| report.pdf (226 pag) | tables | 1.68 s | n/a | 25.25 s | **15×** |
-
-`pypdfium2` does not implement table extraction (it's a raw FFI binding
-to PDFium, not a full pipeline). It's listed as the "pure PDFium speed
-floor" for text — rpdfium matches it within ±5%, showing that the Ruby
-FFI overhead is not measurable.
-
-### Memory (peak RSS)
-
-| Corpus | rpdfium | pypdfium2 | pdfplumber | pdfplumber/rpdfium |
+The full, reproducible benchmark suite — sample PDFs, runners, ground-truth
+correctness scoring, and methodology — lives in
+[`benchmark/`](benchmark/README.md). It compares **rpdfium** against
+**pypdfium2** (the "pure PDFium speed floor"), **pdfplumber** (the reference
+pure-Python pipeline) and **hexapdf** (pure Ruby) across five synthetic PDFs
+of increasing complexity (1 → 520 pages), measuring **execution time**, **peak
+memory (RSS)** and **correctness** (fraction of known ground-truth data
+recovered). Run it yourself:
+
+```bash
+export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.{so,dylib,dll}
+pip install pdfplumber pypdfium2 # optional baselines
+gem install hexapdf # optional baseline + table-extraction reference
+ruby benchmark/run.rb
+```
+
+### Synthetic suite (Apple M-series, best of 3)
+
+Text extraction — rpdfium tracks pypdfium2 within noise (FFI overhead not
+measurable); pdfplumber degrades super-linearly:
+
+| PDF | rpdfium | pypdfium2 | pdfplumber | hexapdf |
 | --- | ---: | ---: | ---: | ---: |
-|
-|
-|
-|
-
-
-
-
-a
-
-
-
-
-
-
-
-
-
-
-(
-
-
-
-
-
-(
-
-
-
-
-
-
-
-
-
-
-split on periods).
+| 01_simple (1 pg) | 12 ms / 33 MB | 12 ms / 36 MB | 17 ms / 42 MB | 14 ms / 24 MB |
+| 02_medium (6 pg) | 14 ms / 33 MB | 14 ms / 37 MB | 101 ms / 57 MB | 19 ms / 24 MB |
+| 03_complex (16 pg) | 15 ms / 34 MB | 16 ms / 38 MB | 182 ms / 72 MB | 28 ms / 25 MB |
+| 04_heavy (60 pg) | **47 ms / 35 MB** | 50 ms / 40 MB | 2.41 s / 456 MB | 145 ms / 26 MB |
+| 05_academic (520 pg) | **706 ms / 69 MB** | 755 ms / 104 MB | **57.15 s / 5537 MB** | 2.28 s / 43 MB |
+
+On the 520-page tier rpdfium is **~81× faster than pdfplumber and uses ~80×
+less memory**, and it now beats raw pypdfium2 on both axes — `extract_text`
+streams pages (one alive at a time), while the pypdfium2 runner holds them.
+
+Table extraction (pypdfium2 has no table layer; the hexapdf column uses the
+minimal lines-based reference in
+[`benchmark/examples/hexapdf_table_extraction.rb`](benchmark/examples/hexapdf_table_extraction.rb)):
+
+| PDF | rpdfium | pdfplumber | hexapdf |
+| --- | ---: | ---: | ---: |
+| 01_simple (1 pg) | 15 ms / 34 MB | 17 ms / 42 MB | 24 ms / 25 MB |
+| 02_medium (6 pg) | 38 ms / 35 MB | 111 ms / 57 MB | 54 ms / 25 MB |
+| 03_complex (16 pg) | 124 ms / 38 MB | 187 ms / 71 MB | 88 ms / 26 MB |
+| 04_heavy (60 pg) | **496 ms / 39 MB** | 3.05 s / 442 MB | 779 ms / 29 MB |
+| 05_academic (520 pg) | 15.46 s / 104 MB | 68.04 s / 5179 MB | **13.22 s / 37 MB** |
+
+rpdfium stays **~4.4× faster than pdfplumber while using ~50× less memory** on
+the academic tier. The minimal hexapdf reference edges it out on time there
+(13.22 s vs 15.46 s): at 520 pages the full pipeline's per-page cost
+(borderless `:text` attempts, rectangle / multi-table search, annotation
+parsing) dominates, while the `:lines`-only reference skips all of it — a fair
+comparison only on clean ruled grids.
+
+Correctness is **100% for every library on every tier** — these are clean
+generated grids, the easy case. Real-world tables (dashed rules, partial
+borders, misaligned cells) are where rpdfium's snap/join tolerances and
+`:text` fallback earn their cost; the 120-line hexapdf reference matches here
+but would drop cells there. See
+[`benchmark/README.md`](benchmark/README.md) for the full tables, task-support
+matrix, correctness scoring and methodology.
 
 ## Memory safety
 
|
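The README's note that `extract_text` keeps one page alive at a time, while a cached iteration keeps them all, can be illustrated with a small sketch. `FakePage` and `FakeDocument` are hypothetical stand-ins written for this example, not rpdfium classes; they only count how many pages are open at once.

```ruby
# Hypothetical stand-ins (not part of rpdfium): a page registers itself in a
# shared "live" list when opened and removes itself when closed.
class FakePage
  attr_reader :index

  def initialize(index, live)
    @index = index
    @live = live
    @live << index
  end

  def close
    @live.delete(@index)
  end
end

class FakeDocument
  attr_reader :live, :page_count

  def initialize(page_count)
    @page_count = page_count
    @live = []
    @cache = {}
  end

  # Cached iteration: every visited page stays open until the document goes away.
  def each_page_cached
    return enum_for(:each_page_cached) unless block_given?

    page_count.times { |i| yield(@cache[i] ||= FakePage.new(i, @live)) }
  end

  # Streaming iteration: each page is closed as soon as the block returns.
  def each_page_streaming
    return enum_for(:each_page_streaming) unless block_given?

    page_count.times do |i|
      pg = FakePage.new(i, @live)
      begin
        yield pg
      ensure
        pg.close
      end
    end
  end
end

# Largest number of simultaneously open pages seen during one full pass.
def peak_live(doc, method_name)
  peak = 0
  doc.public_send(method_name) { |_pg| peak = [peak, doc.live.size].max }
  peak
end

CACHED_PEAK = peak_live(FakeDocument.new(5), :each_page_cached)
STREAMING_PEAK = peak_live(FakeDocument.new(5), :each_page_streaming)
puts "cached peak: #{CACHED_PEAK}, streaming peak: #{STREAMING_PEAK}"
```

With the cached variant the peak grows with the page count; with the streaming variant it stays at one, which is the shape of the memory behaviour the benchmark text describes.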
data/lib/rpdfium/annotation/annotation.rb
CHANGED
@@ -1,9 +1,9 @@
 # frozen_string_literal: true
 
 module Rpdfium
-  # Wrapper
-  #
-  # FPDFPage_CloseAnnot,
+  # Wrapper for FPDF_ANNOTATION. Annotations include links, highlights,
+  # comments, form widgets. PDFium requires closing each handle with
+  # FPDFPage_CloseAnnot, handled here via a finalizer.
   class Annotation
     SUBTYPES = {
       Raw::FPDF_ANNOT_UNKNOWN => :unknown,
@@ -64,8 +64,8 @@ module Rpdfium
         top: h - rect[:top], bottom: h - rect[:bottom] }
     end
 
-    #
-    #
+    # Value of a key in the annotation dict (UTF-16LE).
+    # Common keys: "Contents" (annotation text), "T" (author),
     # "M" (mod date), "NM" (uniq name).
     def [](key)
       Raw.read_utf16_string(:FPDFAnnot_GetStringValue, @state[:handle], key.to_s)
@@ -75,7 +75,7 @@ module Rpdfium
       Raw.FPDFAnnot_HasKey(@state[:handle], key.to_s) == 1
     end
 
-    #
+    # For :link annotations → destination URL (if external) or nil.
     def link_uri
       return nil unless subtype == :link
 
@@ -85,10 +85,12 @@ module Rpdfium
       action = Raw.FPDFLink_GetAction(link_handle)
       return nil if action.null?
 
-
+      # Unlike most PDFium getters, FPDFAction_GetURIPath returns 7-bit
+      # ASCII bytes, not UTF-16LE.
+      Raw.read_ascii_string(:FPDFAction_GetURIPath, @page.document.handle, action)
     end
 
-    #
+    # For internal links → destination page index, or nil.
     def link_dest_page
       return nil unless subtype == :link
 
|
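The comments above distinguish two string conventions: annotation dictionary values come back as UTF-16LE, while the URI path comes back as 7-bit ASCII. A minimal sketch of what decoding each buffer could look like follows. The helper names `decode_utf16le` and `decode_ascii` are hypothetical, and the sketch assumes the raw buffer still carries its trailing NUL terminator; rpdfium's own `Raw.read_utf16_string` and `Raw.read_ascii_string` may be implemented differently.

```ruby
# Hypothetical helpers for illustration only.

# Interpret the bytes as UTF-16LE, transcode to UTF-8, drop the NUL terminator.
def decode_utf16le(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
  text.delete_suffix("\u0000")
end

# Interpret the bytes as 7-bit ASCII and drop the single NUL terminator.
def decode_ascii(bytes)
  bytes.dup.force_encoding(Encoding::US_ASCII).delete_suffix("\x00")
end

# Sample buffers shaped like what a C getter would fill in (assumed layout).
UTF16_BUF = "Caf\u00E9".encode(Encoding::UTF_16LE).b + "\x00\x00".b
ASCII_BUF = "https://example.com/\x00".b

puts decode_utf16le(UTF16_BUF)
puts decode_ascii(ASCII_BUF)
```

Decoding an ASCII buffer with the UTF-16LE helper would pair up unrelated bytes into wrong characters, which is presumably why the link path gets its own reader.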
data/lib/rpdfium/document.rb
CHANGED
@@ -1,9 +1,9 @@
 # frozen_string_literal: true
 
 module Rpdfium
-  #
-  # -
-  # - metadata (Title, Author,
+  # Document-level wrapper. Exposes:
+  # - opening from path / IO / bytes / page by index
+  # - metadata (Title, Author, etc.)
   # - permissions
   # - outline (bookmarks)
   # - attachments
@@ -39,10 +39,11 @@ module Rpdfium
 
         raise LoadError, "Failed to load PDF: #{msg}"
       end
-      #
-      #
-      #
-      #
+      # State shared between the instance and the finalizer. Wrapped in a
+      # mutable Hash because the finalizer closure and the explicit
+      # close() must see the same :closed flag — otherwise whichever
+      # arrives second calls FPDF_CloseDocument on an already-freed
+      # handle and PDFium segfaults.
       @state = {
         handle: handle,
         retain_buffer: retain_buffer,
@@ -50,11 +51,11 @@ module Rpdfium
       }
       @form_env = nil
       @page_cache = {}
-      #
-      # self
-      # finalizer
-      #
-      #
+      # IMPORTANT: the finalizer captures @state (Hash), NOT self.
+      # Capturing self would prevent the GC from collecting the Document.
+      # Moreover the finalizer does NOT touch @page_cache: Pages have
+      # their own individual finalizer, and the execution order among
+      # finalizers is non-deterministic in Ruby.
       ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
     end
 
@@ -86,8 +87,9 @@ module Rpdfium
       ensure_open!
       raise PageError, "Page index #{index} out of range" unless (0...page_count).cover?(index)
 
-      #
-      #
+      # Pages are cacheable: reloading them is expensive and the objects
+      # are immutable from the application's point of view (in read-only
+      # mode).
       @page_cache[index] ||= Page.new(self, index)
     end
     alias [] page
@@ -98,6 +100,31 @@ module Rpdfium
       page_count.times { |i| yield page(i) }
     end
 
+    # Iterates the pages WITHOUT retaining them in the page cache: each page
+    # is closed (native FPDF_PAGE / text page handles and the per-page char
+    # and line-segment caches) as soon as the block returns.
+    #
+    # `#each` caches every visited page for the document's whole lifetime —
+    # ideal for interactive, random-access use, but for a single linear pass
+    # over a large document it makes peak memory grow with the page count
+    # (each page keeps thousands of char hashes alive). The batch helpers
+    # (`Rpdfium.extract_text`, `.extract_tables`, `.render_to_pngs`) visit
+    # each page exactly once, so they stream instead: only one page is alive
+    # at a time and peak RSS stays flat in the number of pages.
+    def each_page_streaming
+      return enum_for(:each_page_streaming) unless block_given?
+
+      ensure_open!
+      page_count.times do |i|
+        pg = Page.new(self, i)
+        begin
+          yield pg
+        ensure
+          pg.close
+        end
+      end
+    end
+
     def page_label(index)
       Raw.read_utf16_string(:FPDF_GetPageLabel, @state[:handle], index)
     end
@@ -116,11 +143,11 @@ module Rpdfium
       return nil if Raw.FPDF_GetFileVersion(@state[:handle], buf) == 0
 
       v = buf.read_int
-      # PDFium
+      # PDFium returns 14 → 1.4, 17 → 1.7
       "#{v / 10}.#{v % 10}"
     end
 
-    # Permission bits
+    # Permission bits according to the PDF spec (Table 22 §7.6.3.2)
     PERMISSIONS = {
       print: 1 << 2,
       modify: 1 << 3,
@@ -154,9 +181,9 @@ module Rpdfium
       form_type != :none
     end
 
-    # Lazy form environment.
-    # -
-    # -
+    # Lazy form environment. Required to:
+    # - read FormFieldType/Value/Name on widget annotations
+    # - render the form fields over the page (FFLDraw)
     def form_env
       @form_env ||= Form::Environment.new(self) if has_forms?
     end
@@ -179,7 +206,7 @@ module Rpdfium
     def close
       return if @state[:closed]
 
-      #
+      # Order: close form env and cached pages first, then the document.
       @form_env&.close
       @page_cache.each_value(&:close)
       @page_cache.clear
@@ -216,8 +243,8 @@ module Rpdfium
     end
 
     def load_from_bytes(bytes, password)
-      #
-      #
+      # CRITICAL: PDFium does NOT copy the bytes — it references them. We
+      # must keep the buffer alive for the entire life of the document.
       buf = FFI::MemoryPointer.new(:uchar, bytes.bytesize)
       buf.put_bytes(0, bytes)
       [Raw.FPDF_LoadMemDocument64(buf, bytes.bytesize, password), buf]
|
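The comments in `document.rb` describe a pattern where the finalizer and the explicit `close` share one mutable Hash, so that whichever runs second sees the `:closed` flag and does nothing. The following is a minimal sketch of that idea in plain Ruby. `Resource` and the `FREED` list are hypothetical: `FREED` stands in for the native release call, and nothing here touches PDFium.

```ruby
# Hypothetical log of "native" releases; a handle must appear here at most once.
FREED = []

class Resource
  # Built in a class method so the proc closes over `state` only, never over
  # the instance; a proc that captured `self` would keep the object reachable.
  def self.finalizer(state)
    proc do
      next if state[:closed]

      state[:closed] = true
      FREED << state[:handle]
    end
  end

  def initialize(handle)
    @state = { handle: handle, closed: false }
    ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
  end

  # Explicit close: flips the same flag the finalizer checks.
  def close
    return if @state[:closed]

    @state[:closed] = true
    FREED << @state[:handle]
  end

  def closed?
    @state[:closed]
  end
end

res = Resource.new(:doc_handle)
shared_state = res.instance_variable_get(:@state)

res.close                                # explicit close releases the handle
Resource.finalizer(shared_state).call    # simulated late finalizer run: no-op
res.close                                # repeated close: no-op

puts "released: #{FREED.inspect}"
```

Because both paths read and write the same Hash, the order in which they run does not matter; a plain boolean instance variable would not be visible to a proc that deliberately avoids capturing the instance.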
data/lib/rpdfium/errors.rb
CHANGED
@@ -34,8 +34,8 @@ module Rpdfium
 
       Raw.FPDF_InitLibrary
       @initialized = true
-      #
-      #
+      # Automatic cleanup at process exit. Order is guaranteed: all Ruby
+      # finalizers run before the at_exit of our own blocks.
       at_exit { Raw.FPDF_DestroyLibrary if @initialized }
     end
   end
data/lib/rpdfium/form/form.rb
CHANGED
@@ -2,10 +2,10 @@
 
 module Rpdfium
   module Form
-    # FPDF_FORMHANDLE
-    # In
-    # (version=2, callbacks NULL). PDFium
-    #
+    # FPDF_FORMHANDLE is required to read widget annotations.
+    # In read-only mode it is enough to initialize it with a minimal FORMFILLINFO
+    # (version=2, callbacks NULL). PDFium invokes the callbacks only during
+    # user interaction or JavaScript, which we do not use.
     class Environment
       attr_reader :document
 
@@ -13,7 +13,7 @@ module Rpdfium
         @document = document
         @info = Raw::FPDF_FORMFILLINFO.new
         @info[:version] = 2
-        #
+        # All pointers remain NULL (the FFI::Struct default).
         handle = Raw.FPDFDOC_InitFormFillEnvironment(document.handle, @info)
         if handle.null?
           raise FormError,
@@ -48,8 +48,8 @@ module Rpdfium
       end
     end
 
-    # Wrapper
-    #
+    # Wrapper for a form widget. It is built from
+    # an annotation of type :widget and the document env.
     class Field
       TYPES = {
         Raw::FPDF_FORMFIELD_UNKNOWN => :unknown,
@@ -89,14 +89,14 @@ module Rpdfium
       def readonly?; (flags & (1 << 0)).positive?; end
       def required?; (flags & (1 << 1)).positive?; end
 
-      #
+      # For checkbox and radio
       def checked?
         return false unless %i[checkbox radiobutton].include?(type)
 
         Raw.FPDFAnnot_IsChecked(@env.handle, @annotation.handle) == 1
       end
 
-      #
+      # For combobox/listbox
       def options
         n = Raw.FPDFAnnot_GetOptionCount(@env.handle, @annotation.handle)
         return [] if n <= 0
data/lib/rpdfium/image/embedded.rb
CHANGED
@@ -2,11 +2,11 @@
 
 module Rpdfium
   module Image
-    # Wrapper
-    # -
-    # -
-    # -
-    # -
+    # Wrapper for an image object placed in a page. Allows you to:
+    # - read metadata (pixel size, DPI, colorspace, BPP)
+    # - obtain raw bytes (as stored: typically JPEG)
+    # - obtain decoded bytes (raster after filters)
+    # - obtain a rendered bitmap (with masks and matrix applied)
     class Embedded
       COLORSPACES = {
         0 => :unknown, 1 => :devicegray, 2 => :devicergb, 3 => :devicecmyk,
@@ -55,8 +55,8 @@ module Rpdfium
           top: h - t.read_float, bottom: h - b.read_float }
       end
 
-      #
-      # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] →
+      # Filters applied in PDF order: e.g. ["DCTDecode"] → JPEG,
+      # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] → re-encodings.
       def filters
         n = Raw.FPDFImageObj_GetImageFilterCount(@handle)
         Array.new(n) do |i|
@@ -72,8 +72,9 @@ module Rpdfium
         end
       end
 
-      #
-      #
+      # "Raw" bytes: as they are stored in the PDF. If filters ==
+      # ["DCTDecode"] these bytes are a complete JPEG that you can save
+      # with a .jpg extension.
       def raw_bytes
         len = Raw.FPDFImageObj_GetImageDataRaw(@handle, FFI::Pointer::NULL, 0)
         return "" if len.zero?
@@ -83,8 +84,8 @@ module Rpdfium
         buf.read_bytes(len)
       end
 
-      #
-      # Layout
+      # Decoded bytes: raster pixels after the filters are applied.
+      # Layout depends on the colorspace.
       def decoded_bytes
         len = Raw.FPDFImageObj_GetImageDataDecoded(@handle, FFI::Pointer::NULL, 0)
         return "" if len.zero?
@@ -94,7 +95,7 @@ module Rpdfium
         buf.read_bytes(len)
       end
 
-      # Bitmap
+      # Bitmap rendered applying matrix and masks. Returns [w, h, bytes(BGRA)].
       def render_bitmap
         bitmap = Raw.FPDFImageObj_GetRenderedBitmap(
           @page.document.handle, @page.handle, @handle
@@ -112,14 +113,14 @@ module Rpdfium
         end
       end
 
-      #
-      #
+      # Saves the file. If the filters are DCTDecode → writes a direct
+      # .jpg. Otherwise renders the bitmap to PNG.
       def save(path)
         if filters == ["DCTDecode"]
           File.binwrite(path, raw_bytes)
         else
           w, h, bytes, stride = render_bitmap
-          #
+          # The rendered bitmaps are BGRA: we convert to RGBA for the PNG writer
           rgba = swap_bgra_to_rgba(bytes, w, h, stride)
           Rpdfium::IO::PNG.write(path, w, h, rgba, stride: w * 4)
         end
@@ -132,7 +133,7 @@ module Rpdfium
         out = String.new(capacity: w * h * 4, encoding: Encoding::ASCII_8BIT)
         h.times do |y|
           row = bgra.byteslice(y * stride, w * 4)
-          # Swap B<->R for each pixel
+          # Swap B<->R for each pixel
           (0...row.bytesize).step(4) do |i|
             out << row.getbyte(i + 2) << row.getbyte(i + 1) <<
                    row.getbyte(i) << row.getbyte(i + 3)
|
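The last hunk swaps the blue and red channels row by row while honouring the bitmap stride. A standalone sketch of that conversion is shown below so the role of `stride` (bytes per row in the source buffer, possibly larger than `width * 4`) is easy to check by hand. The function name `bgra_to_rgba` and the sample buffer are assumptions made for this example; the body only mirrors the lines visible in the diff.

```ruby
# Hypothetical standalone version of a per-pixel B<->R swap.
# `stride` is the byte length of one source row; any bytes past width * 4 in a
# row are padding and are skipped, so the output is tightly packed.
def bgra_to_rgba(bgra, width, height, stride)
  out = String.new(capacity: width * height * 4, encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    row = bgra.byteslice(y * stride, width * 4)
    (0...row.bytesize).step(4) do |i|
      out << row.getbyte(i + 2) << row.getbyte(i + 1) <<
        row.getbyte(i) << row.getbyte(i + 3)
    end
  end
  out
end

# A 2x1 bitmap with 4 bytes of row padding (stride 12 instead of 8).
SAMPLE_BGRA = [1, 2, 3, 255, 10, 20, 30, 128, 0, 0, 0, 0].pack("C*")
SAMPLE_RGBA = bgra_to_rgba(SAMPLE_BGRA, 2, 1, 12)
p SAMPLE_RGBA.bytes
```

Because the output drops the padding, a writer that consumes it can use a stride of exactly `width * 4`, which matches the `stride: w * 4` argument in the `save` hunk.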
data/lib/rpdfium/io/png.rb
CHANGED
@@ -4,12 +4,12 @@ require "zlib"
 
 module Rpdfium
   module IO
-    # PNG writer
-    #
-    #
+    # Minimal PNG writer, pure Ruby, zero external dependencies.
+    # Supports only RGBA 8bpc (color type 6) — the format PDFium produces
+    # when rendering with FPDF_REVERSE_BYTE_ORDER.
     #
-    #
-    #
+    # Reference: PNG spec (RFC 2083). No compromise on validity:
+    # generates correct CRC32 values and uses deflate via the zlib stdlib.
     module PNG
       SIGNATURE = "\x89PNG\r\n\x1a\n".b
       COLOR_RGBA = 6
@@ -34,10 +34,10 @@ module Rpdfium
       end
 
       def write_idat(io, width, height, rgba, stride)
-        # PNG
-        # 0 = None (
-        #
-        #
+        # PNG requires a "filter type" byte at the start of each row.
+        # 0 = None (no filter). It works but compresses worse.
+        # For simplicity we use None — output 1.5-2x larger than the optimal
+        # minimum, but it is an explicit complexity/zero-dep tradeoff choice.
         row_bytes = width * 4
         scanlines = String.new(capacity: (row_bytes + 1) * height,
                                encoding: Encoding::ASCII_8BIT)
|
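The comments above describe a writer that emits RGBA, 8 bits per channel (color type 6), prefixes every scanline with filter type 0, and relies on zlib for deflate and CRC32. A compact sketch of such an encoder is given below. It is a hypothetical, self-contained illustration that returns the PNG as a String; the names `png_chunk` and `encode_rgba_png` are assumptions, and rpdfium's actual `PNG.write` may be structured differently.

```ruby
require "zlib"

# One PNG chunk: 4-byte big-endian length, 4-byte type, data, CRC32 of type + data.
def png_chunk(type, data)
  [data.bytesize].pack("N") + type + data + [Zlib.crc32(type + data)].pack("N")
end

# Encodes tightly packed RGBA pixels (width * 4 bytes per row) as a PNG string.
def encode_rgba_png(width, height, rgba)
  row_bytes = width * 4
  scanlines = String.new(encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    # Filter type 0 (None) in front of every row, then the raw pixel bytes.
    scanlines << "\x00".b << rgba.byteslice(y * row_bytes, row_bytes)
  end

  # IHDR: width, height, bit depth 8, color type 6 (RGBA),
  # compression 0, filter method 0, no interlace.
  ihdr = [width, height, 8, 6, 0, 0, 0].pack("NNCCCCC")

  "\x89PNG\r\n\x1a\n".b +
    png_chunk("IHDR".b, ihdr) +
    png_chunk("IDAT".b, Zlib::Deflate.deflate(scanlines)) +
    png_chunk("IEND".b, "".b)
end

# Two pixels: opaque red, opaque blue.
PIXELS = [255, 0, 0, 255, 0, 0, 255, 255].pack("C*")
PNG_BYTES = encode_rgba_png(2, 1, PIXELS)
puts "encoded #{PNG_BYTES.bytesize} bytes"
```

Choosing filter type 0 keeps the encoder to a few lines at the cost of larger files, which is the tradeoff the source comment names.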