rpdfium 0.4.1 → 0.4.2

data/README.md CHANGED
@@ -27,16 +27,22 @@ end
27
27
  ## Why
28
28
 
29
29
  The Ruby ecosystem has `pdf-reader` (text only, slow on complex docs),
30
- `origami` (security-research focused), and `hexapdf` (great for
31
- manipulation but text extraction is approximate). None give you
32
- character-level bounding boxes, real vector path geometry, or table
33
- extraction. `rpdfium` fills that gap by binding the same battle-tested
34
- C++ engine that powers Chrome's PDF viewer.
30
+ `origami` (security-research focused), and `hexapdf`, a capable library that
31
+ extracts text with character-level positioning and exposes the vector-path
32
+ primitives you need to build table extraction yourself (see the ~120-line
33
+ [reference](benchmark/examples/hexapdf_table_extraction.rb) in the benchmark
34
+ suite). `hexapdf` is AGPL / commercially licensed, though.
35
+
36
+ I wanted a permissively licensed library that ships those higher-level
37
+ pipelines **out of the box** — pdfplumber-style table detection, page
38
+ rendering to raster — on top of character metadata, without hand-rolling them.
39
+ `rpdfium` does that under Apache-2.0, binding the same battle-tested C++ engine
40
+ that powers Chrome's PDF viewer.
35
41
 
36
42
  In practice it matches the speed of Python's `pypdfium2` on text
37
- extraction and is **15-56× faster than `pdfplumber`** while using
38
- **5-7× less memory** on large documents. See [Performance](#performance)
39
- for details.
43
+ extraction and is **up to ~80× faster than `pdfplumber`** while using
44
+ **up to ~80× less memory** on dense documents (the 520-page academic tier).
45
+ See [Performance](#performance) for details.
40
46
 
41
47
  ## Installing PDFium
42
48
 
@@ -321,7 +327,7 @@ end
321
327
  Three primitives:
322
328
 
323
329
  - `Page#font_inventory` — distribution by `(font, height, weight)`,
324
- with counts and samples for ispection
330
+ with counts and samples for inspection
325
331
  - `Page#chars_where(font:, height:, weight:, bbox:, where:)` —
326
332
  filter chars by any combination of criteria
327
333
  - `Page#lines(font:, ...)` — high-level helper: filter + word
@@ -470,76 +476,65 @@ the tree stays in memory until the document is closed.
470
476
 
471
477
  ## Performance
472
478
 
473
- Measured on 4 PDFs of increasing complexity, best-of-3 runs after a
474
- warm-up, isolated in subprocesses to capture clean peak RSS. Versions
475
- under test: `rpdfium 0.3.13`, `pdfplumber 0.11.9`, `pypdfium2 5.6.0`.
476
-
477
- | Test corpus | Pages | Size | What it stresses |
478
- | --- | ---: | ---: | --- |
479
- | `sample.pdf` | 1 | 18 KB | Plain text baseline |
480
- | `form.pdf` | 1 | 107 KB | Char-per-text-object kerning, Form XObject, tables |
481
- | `complex.pdf` | 85 | 60 MB | Magazine-style document, dense text + heavy graphics |
482
- | `report.pdf` | 226 | 322 KB | Rotated pages (90°), small fonts, ~15 tables per page |
483
-
484
- ### Speed
485
-
486
- | Corpus | Task | rpdfium | pypdfium2 | pdfplumber | speedup vs pdfplumber |
487
- | --- | --- | ---: | ---: | ---: | ---: |
488
- | sample.pdf (1 pag) | text | 4 ms | 4 ms | 75 ms | **21×** |
489
- | sample.pdf (1 pag) | tables | 4 ms | n/a | 70 ms | **16×** |
490
- | form.pdf (1 pag) | text | 12 ms | 13 ms | 538 ms | **44×** |
491
- | form.pdf (1 pag) | tables | 25 ms | n/a | 575 ms | **23×** |
492
- | complex.pdf (85 pag) | text | 190 ms | 183 ms | 7.76 s | **41×** |
493
- | complex.pdf (85 pag) | tables | 231 ms | n/a | 7.07 s | **31×** |
494
- | report.pdf (226 pag) | text | 412 ms | 397 ms | 23.26 s | **56×** |
495
- | report.pdf (226 pag) | tables | 1.68 s | n/a | 25.25 s | **15×** |
496
-
497
- `pypdfium2` does not implement table extraction (it's a raw FFI binding
498
- to PDFium, not a full pipeline). It's listed as the "pure PDFium speed
499
- floor" for text — rpdfium matches it within ±5%, showing that the Ruby
500
- FFI overhead is not measurable.
501
-
502
- ### Memory (peak RSS)
503
-
504
- | Corpus | rpdfium | pypdfium2 | pdfplumber | pdfplumber/rpdfium |
479
+ The full, reproducible benchmark suite (sample PDFs, runners, ground-truth
480
+ correctness scoring, and methodology) lives in
481
+ [`benchmark/`](benchmark/README.md). It compares **rpdfium** against
482
+ **pypdfium2** (the "pure PDFium speed floor"), **pdfplumber** (the reference
483
+ pure-Python pipeline) and **hexapdf** (pure Ruby) across five synthetic PDFs
484
+ of increasing complexity (1 to 520 pages), measuring **execution time**, **peak
485
+ memory (RSS)** and **correctness** (fraction of known ground-truth data
486
+ recovered). Run it yourself:
487
+
488
+ ```bash
489
+ export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.{so,dylib,dll}
490
+ pip install pdfplumber pypdfium2 # optional baselines
491
+ gem install hexapdf # optional baseline + table-extraction reference
492
+ ruby benchmark/run.rb
493
+ ```
494
+
495
+ ### Synthetic suite (Apple M-series, best of 3)
496
+
497
+ Text extraction: rpdfium tracks pypdfium2 within noise (FFI overhead not
498
+ measurable); pdfplumber degrades super-linearly:
499
+
500
+ | PDF | rpdfium | pypdfium2 | pdfplumber | hexapdf |
505
501
  | --- | ---: | ---: | ---: | ---: |
506
- | sample.pdf | 29 MB | 20 MB | 40 MB | 1.4× |
507
- | form.pdf | 32 MB | 22 MB | 45 MB | 1.4× |
508
- | complex.pdf | 106 MB | 69 MB | 535 MB | **5.0×** |
509
- | report.pdf | 136 MB | 41 MB | 1003 MB | **7.4×** |
510
-
511
- The memory gap widens with workload size. On a 226-page document
512
- pdfplumber uses ~1 GB; rpdfium stays under 140 MB. For server-side
513
- batch processing this is the difference between a 256 MB container and
514
- a 2 GB one.
515
-
516
- ### Headline numbers
517
-
518
- On large PDFs (226 pages, dense layout):
519
-
520
- - **rpdfium completes both text + tables in ~2.1 s using 136 MB**
521
- - **pdfplumber needs ~48 s and 1 GB** for the same work
522
-
523
- Across the four corpora the median speedup vs pdfplumber is **27× on
524
- text**, **22× on tables**. rpdfium scales linearly with page count
525
- (thanks to PDFium's C++ engine); pdfplumber's pure-Python pipeline
526
- degrades super-linearly on large documents.
527
-
528
- ### Methodology
529
-
530
- Each measurement is the **minimum of 3 timed runs after a warm-up run**
531
- (to neutralize OS page cache effects on the 60 MB `complex.pdf`).
532
- Subprocess isolation per measurement ensures clean RSS reading via
533
- `resource.getrusage` / `/proc/self/status`. The benchmark harness is
534
- a small Ruby driver that shells out to three runners (one Ruby script
535
- using `rpdfium`, two Python scripts using `pdfplumber` and
536
- `pypdfium2`), parses the JSON each emits, and aggregates the results.
537
-
538
- Output quality has been spot-checked: rpdfium matches pypdfium2 char
539
- count within ±1 char (rounding on the trailing newline). pdfplumber
540
- returns ~2% fewer chars on locale-formatted numbers due to a different
541
- word-tokenization for thousand-separator punctuation (e.g. `1.250.000`
542
- split on periods).
502
+ | 01_simple (1 pg) | 12 ms / 33 MB | 12 ms / 36 MB | 17 ms / 42 MB | 14 ms / 24 MB |
503
+ | 02_medium (6 pg) | 14 ms / 33 MB | 14 ms / 37 MB | 101 ms / 57 MB | 19 ms / 24 MB |
504
+ | 03_complex (16 pg) | 15 ms / 34 MB | 16 ms / 38 MB | 182 ms / 72 MB | 28 ms / 25 MB |
505
+ | 04_heavy (60 pg) | **47 ms / 35 MB** | 50 ms / 40 MB | 2.41 s / 456 MB | 145 ms / 26 MB |
506
+ | 05_academic (520 pg) | **706 ms / 69 MB** | 755 ms / 104 MB | **57.15 s / 5537 MB** | 2.28 s / 43 MB |
507
+
508
+ On the 520-page tier rpdfium is **~81× faster than pdfplumber and uses ~80×
509
+ less memory**, and it now beats raw pypdfium2 on both axes — `extract_text`
510
+ streams pages (one alive at a time), while the pypdfium2 runner holds them.
511
+
512
+ Table extraction (pypdfium2 has no table layer; the hexapdf column uses the
513
+ minimal lines-based reference in
514
+ [`benchmark/examples/hexapdf_table_extraction.rb`](benchmark/examples/hexapdf_table_extraction.rb)):
515
+
516
+ | PDF | rpdfium | pdfplumber | hexapdf |
517
+ | --- | ---: | ---: | ---: |
518
+ | 01_simple (1 pg) | 15 ms / 34 MB | 17 ms / 42 MB | 24 ms / 25 MB |
519
+ | 02_medium (6 pg) | 38 ms / 35 MB | 111 ms / 57 MB | 54 ms / 25 MB |
520
+ | 03_complex (16 pg) | 124 ms / 38 MB | 187 ms / 71 MB | 88 ms / 26 MB |
521
+ | 04_heavy (60 pg) | **496 ms / 39 MB** | 3.05 s / 442 MB | 779 ms / 29 MB |
522
+ | 05_academic (520 pg) | 15.46 s / 104 MB | 68.04 s / 5179 MB | **13.22 s / 37 MB** |
523
+
524
+ rpdfium stays **~4.4× faster than pdfplumber while using ~50× less memory** on
525
+ the academic tier. The minimal hexapdf reference edges it out on time there
526
+ (13.22 s vs 15.46 s): at 520 pages the full pipeline's per-page cost
527
+ (borderless `:text` attempts, rectangle / multi-table search, annotation
528
+ parsing) dominates, while the `:lines`-only reference skips all of it — a fair
529
+ comparison only on clean ruled grids.
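
The headline ratios quoted in this section can be re-derived from the 05_academic rows of the two tables. The snippet below is only a sketch of that arithmetic: the figures are copied from the tables, while the helper names `ratio` and `speed_and_memory_ratios` are illustrative and not part of rpdfium or its benchmark suite.

```ruby
# Values copied from the 05_academic (520 pg) rows of the two tables.
# Each pair is [time in seconds, peak RSS in MB].
TEXT   = { rpdfium: [0.706, 69],  pdfplumber: [57.15, 5537] }.freeze
TABLES = { rpdfium: [15.46, 104], pdfplumber: [68.04, 5179] }.freeze

# Illustrative helper: how many times larger the baseline figure is.
def ratio(baseline, candidate)
  (baseline.to_f / candidate).round(1)
end

# Illustrative helper: pdfplumber-to-rpdfium ratios for one task.
def speed_and_memory_ratios(results)
  base_time, base_mem = results.fetch(:pdfplumber)
  time, mem = results.fetch(:rpdfium)
  { time: ratio(base_time, time), memory: ratio(base_mem, mem) }
end

TEXT_RATIOS  = speed_and_memory_ratios(TEXT)
TABLE_RATIOS = speed_and_memory_ratios(TABLES)

puts "text:   #{TEXT_RATIOS}"
puts "tables: #{TABLE_RATIOS}"
```

Rounded to whole numbers, the text ratios give the "~81× faster, ~80× less memory" claim and the table ratios give "~4.4× faster, ~50× less memory".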
530
+
531
+ Correctness is **100% for every library on every tier** — these are clean
532
+ generated grids, the easy case. Real-world tables (dashed rules, partial
533
+ borders, misaligned cells) are where rpdfium's snap/join tolerances and
534
+ `:text` fallback earn their cost; the 120-line hexapdf reference matches here
535
+ but would drop cells there. See
536
+ [`benchmark/README.md`](benchmark/README.md) for the full tables, task-support
537
+ matrix, correctness scoring and methodology.
543
538
 
544
539
  ## Memory safety
545
540
 
@@ -1,9 +1,9 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Rpdfium
4
- # Wrapper per FPDF_ANNOTATION. Le annotazioni includono link, highlight,
5
- # commenti, widget di form. PDFium richiede di chiudere ogni handle con
6
- # FPDFPage_CloseAnnot, gestito qui via finalizer.
4
+ # Wrapper for FPDF_ANNOTATION. Annotations include links, highlights,
5
+ # comments, form widgets. PDFium requires closing each handle with
6
+ # FPDFPage_CloseAnnot, handled here via a finalizer.
7
7
  class Annotation
8
8
  SUBTYPES = {
9
9
  Raw::FPDF_ANNOT_UNKNOWN => :unknown,
@@ -64,8 +64,8 @@ module Rpdfium
64
64
  top: h - rect[:top], bottom: h - rect[:bottom] }
65
65
  end
66
66
 
67
- # Valore di una chiave del dict di annotazione (UTF-16LE).
68
- # Chiavi comuni: "Contents" (testo annotazione), "T" (autore),
67
+ # Value of a key in the annotation dict (UTF-16LE).
68
+ # Common keys: "Contents" (annotation text), "T" (author),
69
69
  # "M" (mod date), "NM" (uniq name).
70
70
  def [](key)
71
71
  Raw.read_utf16_string(:FPDFAnnot_GetStringValue, @state[:handle], key.to_s)
@@ -75,7 +75,7 @@ module Rpdfium
75
75
  Raw.FPDFAnnot_HasKey(@state[:handle], key.to_s) == 1
76
76
  end
77
77
 
78
- # Per annotazioni :link → URL di destinazione (se esterno) o nil.
78
+ # For :link annotations → destination URL (if external) or nil.
79
79
  def link_uri
80
80
  return nil unless subtype == :link
81
81
 
@@ -85,10 +85,12 @@ module Rpdfium
85
85
  action = Raw.FPDFLink_GetAction(link_handle)
86
86
  return nil if action.null?
87
87
 
88
- Raw.read_utf16_string(:FPDFAction_GetURIPath, @page.document.handle, action)
88
+ # Unlike most PDFium getters, FPDFAction_GetURIPath returns 7-bit
89
+ # ASCII bytes, not UTF-16LE.
90
+ Raw.read_ascii_string(:FPDFAction_GetURIPath, @page.document.handle, action)
89
91
  end
90
92
 
91
- # Per link interni → indice pagina di destinazione, o nil.
93
+ # For internal links → destination page index, or nil.
92
94
  def link_dest_page
93
95
  return nil unless subtype == :link
94
96
 
@@ -1,9 +1,9 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Rpdfium
4
- # Wrapper di livello documento. Espone:
5
- # - apertura da path / IO / bytes / pagina by index
6
- # - metadata (Title, Author, ecc.)
4
+ # Document-level wrapper. Exposes:
5
+ # - opening from path / IO / bytes / page by index
6
+ # - metadata (Title, Author, etc.)
7
7
  # - permissions
8
8
  # - outline (bookmarks)
9
9
  # - attachments
@@ -39,10 +39,11 @@ module Rpdfium
39
39
 
40
40
  raise LoadError, "Failed to load PDF: #{msg}"
41
41
  end
42
- # Stato condiviso tra istanza e finalizer. Wrappato in Hash mutabile
43
- # perché la closure del finalizer e il close() esplicito devono vedere
44
- # lo stesso :closed flag — altrimenti chi arriva secondo richiama
45
- # FPDF_CloseDocument su un handle già liberato e PDFium segfaulta.
42
+ # State shared between the instance and the finalizer. Wrapped in a
43
+ # mutable Hash because the finalizer closure and the explicit
44
+ # close() must see the same :closed flag — otherwise whichever
45
+ # arrives second calls FPDF_CloseDocument on an already-freed
46
+ # handle and PDFium segfaults.
46
47
  @state = {
47
48
  handle: handle,
48
49
  retain_buffer: retain_buffer,
@@ -50,11 +51,11 @@ module Rpdfium
50
51
  }
51
52
  @form_env = nil
52
53
  @page_cache = {}
53
- # IMPORTANTE: il finalizer cattura @state (Hash), NON self. Catturare
54
- # self impedirebbe al GC di raccogliere il Document. Inoltre il
55
- # finalizer NON tocca @page_cache: le Page hanno il loro finalizer
56
- # individuale, e l'ordine di esecuzione tra finalizer è non
57
- # deterministico in Ruby.
54
+ # IMPORTANT: the finalizer captures @state (Hash), NOT self.
55
+ # Capturing self would prevent the GC from collecting the Document.
56
+ # Moreover the finalizer does NOT touch @page_cache: Pages have
57
+ # their own individual finalizer, and the execution order among
58
+ # finalizers is non-deterministic in Ruby.
58
59
  ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
59
60
  end
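
The shared-state pattern described in the comments above can be tried without PDFium. The sketch below is not the gem's code: `FakeNative` and `Resource` are hypothetical stand-ins, and the native call is replaced by a Ruby method that raises on a double free. It shows why the finalizer and the explicit `close` have to read the same `:closed` flag, and why the finalizer proc is built from a class method so that it closes over the Hash and never over the instance.

```ruby
# Hypothetical stand-in for a native library: records closed handles and
# raises on a double free (a real binding would crash instead).
class FakeNative
  @closed_handles = []

  class << self
    attr_reader :closed_handles

    def close(handle)
      raise "double free of #{handle}" if @closed_handles.include?(handle)

      @closed_handles << handle
    end
  end
end

# Hypothetical wrapper that owns one native handle.
class Resource
  # Built on the class so the proc captures `state` only, never the instance.
  def self.finalizer(state)
    proc do
      next if state[:closed]

      state[:closed] = true
      FakeNative.close(state[:handle])
    end
  end

  def initialize(handle)
    @state = { handle: handle, closed: false }
    ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
  end

  def close
    return if @state[:closed]

    @state[:closed] = true
    FakeNative.close(@state[:handle])
  end
end

res = Resource.new(:doc_1)
res.close
# Simulate the finalizer running after an explicit close: it sees the shared
# :closed flag and does nothing instead of freeing the handle a second time.
Resource.finalizer(res.instance_variable_get(:@state)).call
```

If each side kept its own copy of the flag, the second caller would reach `FakeNative.close` again and raise.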
60
61
 
@@ -86,8 +87,9 @@ module Rpdfium
86
87
  ensure_open!
87
88
  raise PageError, "Page index #{index} out of range" unless (0...page_count).cover?(index)
88
89
 
89
- # Le pagine sono cacheable: ricaricarle è costoso e gli oggetti sono
90
- # immutabili dal punto di vista applicativo (in modalità read-only).
90
+ # Pages are cacheable: reloading them is expensive and the objects
91
+ # are immutable from the application's point of view (in read-only
92
+ # mode).
91
93
  @page_cache[index] ||= Page.new(self, index)
92
94
  end
93
95
  alias [] page
@@ -98,6 +100,31 @@ module Rpdfium
98
100
  page_count.times { |i| yield page(i) }
99
101
  end
100
102
 
103
+ # Iterates the pages WITHOUT retaining them in the page cache: each page
104
+ # is closed (native FPDF_PAGE / text page handles and the per-page char
105
+ # and line-segment caches) as soon as the block returns.
106
+ #
107
+ # `#each` caches every visited page for the document's whole lifetime —
108
+ # ideal for interactive, random-access use, but for a single linear pass
109
+ # over a large document it makes peak memory grow with the page count
110
+ # (each page keeps thousands of char hashes alive). The batch helpers
111
+ # (`Rpdfium.extract_text`, `.extract_tables`, `.render_to_pngs`) visit
112
+ # each page exactly once, so they stream instead: only one page is alive
113
+ # at a time and peak RSS stays flat in the number of pages.
114
+ def each_page_streaming
115
+ return enum_for(:each_page_streaming) unless block_given?
116
+
117
+ ensure_open!
118
+ page_count.times do |i|
119
+ pg = Page.new(self, i)
120
+ begin
121
+ yield pg
122
+ ensure
123
+ pg.close
124
+ end
125
+ end
126
+ end
127
+
101
128
  def page_label(index)
102
129
  Raw.read_utf16_string(:FPDF_GetPageLabel, @state[:handle], index)
103
130
  end
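
To make the difference between `#each` and `#each_page_streaming` concrete, here is a small simulation. It is a sketch under stated assumptions: `FakePage` and `FakeDocument` are hypothetical stand-ins that only count how many pages are open at once, so the numbers illustrate the iteration pattern and are not rpdfium measurements.

```ruby
# Hypothetical page: tracks how many instances are open at the same time.
class FakePage
  @live = 0
  @peak = 0

  class << self
    attr_accessor :live, :peak
  end

  attr_reader :index

  def initialize(index)
    @index = index
    self.class.live += 1
    self.class.peak = [self.class.peak, self.class.live].max
  end

  def close
    self.class.live -= 1
  end
end

# Hypothetical document exposing both iteration styles.
class FakeDocument
  def initialize(page_count)
    @page_count = page_count
    @page_cache = {}
  end

  # Caching iteration: every visited page stays alive in @page_cache.
  def each
    return enum_for(:each) unless block_given?

    @page_count.times { |i| yield(@page_cache[i] ||= FakePage.new(i)) }
  end

  # Streaming iteration: each page is closed as soon as the block returns.
  def each_page_streaming
    return enum_for(:each_page_streaming) unless block_given?

    @page_count.times do |i|
      pg = FakePage.new(i)
      begin
        yield pg
      ensure
        pg.close
      end
    end
  end
end

# Runs one full pass and reports the highest number of simultaneously open pages.
def peak_open_pages(method_name, pages: 50)
  FakePage.live = 0
  FakePage.peak = 0
  FakeDocument.new(pages).public_send(method_name) { |pg| pg.index }
  FakePage.peak
end

CACHED_PEAK    = peak_open_pages(:each)
STREAMING_PEAK = peak_open_pages(:each_page_streaming)
puts "cached peak: #{CACHED_PEAK}, streaming peak: #{STREAMING_PEAK}"
```

The cached pass keeps all 50 pages open, while the streaming pass never has more than one. The trade-off is that a page yielded by the streaming iterator must not be used after the block returns, since it has already been closed.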
@@ -116,11 +143,11 @@ module Rpdfium
116
143
  return nil if Raw.FPDF_GetFileVersion(@state[:handle], buf) == 0
117
144
 
118
145
  v = buf.read_int
119
- # PDFium ritorna 14 → 1.4, 17 → 1.7
146
+ # PDFium returns 14 → 1.4, 17 → 1.7
120
147
  "#{v / 10}.#{v % 10}"
121
148
  end
122
149
 
123
- # Permission bits secondo PDF spec (Table 22 §7.6.3.2)
150
+ # Permission bits according to the PDF spec (Table 22 §7.6.3.2)
124
151
  PERMISSIONS = {
125
152
  print: 1 << 2,
126
153
  modify: 1 << 3,
@@ -154,9 +181,9 @@ module Rpdfium
154
181
  form_type != :none
155
182
  end
156
183
 
157
- # Lazy form environment. Necessario per:
158
- # - leggere FormFieldType/Value/Name su widget annotations
159
- # - renderizzare i form fields sopra la pagina (FFLDraw)
184
+ # Lazy form environment. Required to:
185
+ # - read FormFieldType/Value/Name on widget annotations
186
+ # - render the form fields over the page (FFLDraw)
160
187
  def form_env
161
188
  @form_env ||= Form::Environment.new(self) if has_forms?
162
189
  end
@@ -179,7 +206,7 @@ module Rpdfium
179
206
  def close
180
207
  return if @state[:closed]
181
208
 
182
- # Ordine: chiudi prima form env e pagine cached, poi documento.
209
+ # Order: close form env and cached pages first, then the document.
183
210
  @form_env&.close
184
211
  @page_cache.each_value(&:close)
185
212
  @page_cache.clear
@@ -216,8 +243,8 @@ module Rpdfium
216
243
  end
217
244
 
218
245
  def load_from_bytes(bytes, password)
219
- # CRITICO: PDFium NON copia i bytes — li referenzia. Dobbiamo tenere
220
- # vivo il buffer per tutta la vita del documento.
246
+ # CRITICAL: PDFium does NOT copy the bytes — it references them. We
247
+ # must keep the buffer alive for the entire life of the document.
221
248
  buf = FFI::MemoryPointer.new(:uchar, bytes.bytesize)
222
249
  buf.put_bytes(0, bytes)
223
250
  [Raw.FPDF_LoadMemDocument64(buf, bytes.bytesize, password), buf]
@@ -34,8 +34,8 @@ module Rpdfium
34
34
 
35
35
  Raw.FPDF_InitLibrary
36
36
  @initialized = true
37
- # Cleanup automatico a process exit. Ordine garantito: tutti i
38
- # finalizer Ruby vengono eseguiti prima di at_exit dei nostri blocchi.
37
+ # Automatic cleanup at process exit. Order is guaranteed: all Ruby
38
+ # finalizers run before the at_exit of our own blocks.
39
39
  at_exit { Raw.FPDF_DestroyLibrary if @initialized }
40
40
  end
41
41
  end
@@ -2,10 +2,10 @@
2
2
 
3
3
  module Rpdfium
4
4
  module Form
5
- # FPDF_FORMHANDLE è necessario per leggere widget annotations.
6
- # In modalità read-only basta inizializzarlo con una FORMFILLINFO minimale
7
- # (version=2, callbacks NULL). PDFium chiama i callback solo durante
8
- # interazione utente o JavaScript, che noi non usiamo.
5
+ # FPDF_FORMHANDLE is required to read widget annotations.
6
+ # In read-only mode it is enough to initialize it with a minimal FORMFILLINFO
7
+ # (version=2, callbacks NULL). PDFium invokes the callbacks only during
8
+ # user interaction or JavaScript, which we do not use.
9
9
  class Environment
10
10
  attr_reader :document
11
11
 
@@ -13,7 +13,7 @@ module Rpdfium
13
13
  @document = document
14
14
  @info = Raw::FPDF_FORMFILLINFO.new
15
15
  @info[:version] = 2
16
- # Tutti i puntatori restano NULL (default di FFI::Struct).
16
+ # All pointers remain NULL (the FFI::Struct default).
17
17
  handle = Raw.FPDFDOC_InitFormFillEnvironment(document.handle, @info)
18
18
  if handle.null?
19
19
  raise FormError,
@@ -48,8 +48,8 @@ module Rpdfium
48
48
  end
49
49
  end
50
50
 
51
- # Wrapper per un widget di form. Si costruisce a partire da
52
- # un'annotazione di tipo :widget e l'env del documento.
51
+ # Wrapper for a form widget. It is built from
52
+ # an annotation of type :widget and the document env.
53
53
  class Field
54
54
  TYPES = {
55
55
  Raw::FPDF_FORMFIELD_UNKNOWN => :unknown,
@@ -89,14 +89,14 @@ module Rpdfium
89
89
  def readonly?; (flags & (1 << 0)).positive?; end
90
90
  def required?; (flags & (1 << 1)).positive?; end
91
91
 
92
- # Per checkbox e radio
92
+ # For checkbox and radio
93
93
  def checked?
94
94
  return false unless %i[checkbox radiobutton].include?(type)
95
95
 
96
96
  Raw.FPDFAnnot_IsChecked(@env.handle, @annotation.handle) == 1
97
97
  end
98
98
 
99
- # Per combobox/listbox
99
+ # For combobox/listbox
100
100
  def options
101
101
  n = Raw.FPDFAnnot_GetOptionCount(@env.handle, @annotation.handle)
102
102
  return [] if n <= 0
@@ -2,11 +2,11 @@
2
2
 
3
3
  module Rpdfium
4
4
  module Image
5
- # Wrapper per un image object inserito in una pagina. Permette di:
6
- # - leggere metadata (dimensione pixel, DPI, colorspace, BPP)
7
- # - ottenere bytes raw (così come stoccati: tipicamente JPEG)
8
- # - ottenere bytes decoded (raster post-filtri)
9
- # - ottenere bitmap renderizzato (con maschere e matrice applicate)
5
+ # Wrapper for an image object placed in a page. Allows you to:
6
+ # - read metadata (pixel size, DPI, colorspace, BPP)
7
+ # - obtain raw bytes (as stored: typically JPEG)
8
+ # - obtain decoded bytes (raster after filters)
9
+ # - obtain a rendered bitmap (with masks and matrix applied)
10
10
  class Embedded
11
11
  COLORSPACES = {
12
12
  0 => :unknown, 1 => :devicegray, 2 => :devicergb, 3 => :devicecmyk,
@@ -55,8 +55,8 @@ module Rpdfium
55
55
  top: h - t.read_float, bottom: h - b.read_float }
56
56
  end
57
57
 
58
- # Filtri applicati nell'ordine PDF: es. ["DCTDecode"] → JPEG,
59
- # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] → ricodifiche.
58
+ # Filters applied in PDF order: e.g. ["DCTDecode"] → JPEG,
59
+ # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] → re-encodings.
60
60
  def filters
61
61
  n = Raw.FPDFImageObj_GetImageFilterCount(@handle)
62
62
  Array.new(n) do |i|
@@ -72,8 +72,9 @@ module Rpdfium
72
72
  end
73
73
  end
74
74
 
75
- # Bytes "raw": come sono stoccati nel PDF. Se filters == ["DCTDecode"]
76
- # questi bytes sono un JPEG completo che puoi salvare con estensione .jpg.
75
+ # "Raw" bytes: as they are stored in the PDF. If filters ==
76
+ # ["DCTDecode"] these bytes are a complete JPEG that you can save
77
+ # with a .jpg extension.
77
78
  def raw_bytes
78
79
  len = Raw.FPDFImageObj_GetImageDataRaw(@handle, FFI::Pointer::NULL, 0)
79
80
  return "" if len.zero?
@@ -83,8 +84,8 @@ module Rpdfium
83
84
  buf.read_bytes(len)
84
85
  end
85
86
 
86
- # Bytes decoded: pixel raster dopo l'applicazione dei filtri.
87
- # Layout dipende dal colorspace.
87
+ # Decoded bytes: raster pixels after the filters are applied.
88
+ # Layout depends on the colorspace.
88
89
  def decoded_bytes
89
90
  len = Raw.FPDFImageObj_GetImageDataDecoded(@handle, FFI::Pointer::NULL, 0)
90
91
  return "" if len.zero?
@@ -94,7 +95,7 @@ module Rpdfium
94
95
  buf.read_bytes(len)
95
96
  end
96
97
 
97
- # Bitmap renderizzato applicando matrice e maschere. Ritorna [w, h, bytes(BGRA)].
98
+ # Bitmap rendered applying matrix and masks. Returns [w, h, bytes(BGRA)].
98
99
  def render_bitmap
99
100
  bitmap = Raw.FPDFImageObj_GetRenderedBitmap(
100
101
  @page.document.handle, @page.handle, @handle
@@ -112,14 +113,14 @@ module Rpdfium
112
113
  end
113
114
  end
114
115
 
115
- # Salva il file. Se i filtri sono DCTDecode → scrive .jpg diretto.
116
- # Altrimenti renderizza il bitmap a PNG.
116
+ # Saves the file. If the filters are DCTDecode → writes a direct
117
+ # .jpg. Otherwise renders the bitmap to PNG.
117
118
  def save(path)
118
119
  if filters == ["DCTDecode"]
119
120
  File.binwrite(path, raw_bytes)
120
121
  else
121
122
  w, h, bytes, stride = render_bitmap
122
- # I bitmap resi sono BGRA: convertiamo a RGBA per il PNG writer
123
+ # The rendered bitmaps are BGRA: we convert to RGBA for the PNG writer
123
124
  rgba = swap_bgra_to_rgba(bytes, w, h, stride)
124
125
  Rpdfium::IO::PNG.write(path, w, h, rgba, stride: w * 4)
125
126
  end
@@ -132,7 +133,7 @@ module Rpdfium
132
133
  out = String.new(capacity: w * h * 4, encoding: Encoding::ASCII_8BIT)
133
134
  h.times do |y|
134
135
  row = bgra.byteslice(y * stride, w * 4)
135
- # Scambia B<->R per ogni pixel
136
+ # Swap B<->R for each pixel
136
137
  (0...row.bytesize).step(4) do |i|
137
138
  out << row.getbyte(i + 2) << row.getbyte(i + 1) <<
138
139
  row.getbyte(i) << row.getbyte(i + 3)
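
The hunk above stops mid-loop, so here is a complete, standalone version of the same B<->R swap for experimentation. It is a sketch and not the gem's method: the name `bgra_to_rgba` and the sample buffer are illustrative. It assumes `stride` is the number of bytes per source row, which can exceed `width * 4` when rows are padded.

```ruby
# Converts a BGRA buffer to tightly packed RGBA.
# `stride` is the byte length of one source row, including any padding.
def bgra_to_rgba(bgra, width, height, stride)
  out = String.new(capacity: width * height * 4, encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    # Take only the pixel bytes of this row and skip the padding.
    row = bgra.byteslice(y * stride, width * 4)
    (0...row.bytesize).step(4) do |i|
      # Source order is B, G, R, A; emit R, G, B, A.
      out << row.getbyte(i + 2) << row.getbyte(i + 1) <<
        row.getbyte(i) << row.getbyte(i + 3)
    end
  end
  out
end

# A 2x1 image whose row is padded to 12 bytes (8 pixel bytes + 4 padding).
SAMPLE_BGRA = [1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0].pack("C*")
SAMPLE_RGBA = bgra_to_rgba(SAMPLE_BGRA, 2, 1, 12)
puts SAMPLE_RGBA.bytes.inspect
```

Because the output has no padding, its stride is always `width * 4`, which matches the `stride: w * 4` passed to the PNG writer in `save`.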
@@ -4,12 +4,12 @@ require "zlib"
4
4
 
5
5
  module Rpdfium
6
6
  module IO
7
- # PNG writer minimale, puro Ruby, zero dipendenze esterne.
8
- # Supporta solo RGBA 8bpc (color type 6) — il formato che PDFium produce
9
- # quando rendi con FPDF_REVERSE_BYTE_ORDER.
7
+ # Minimal PNG writer, pure Ruby, zero external dependencies.
8
+ # Supports only RGBA 8bpc (color type 6) — the format PDFium produces
9
+ # when rendering with FPDF_REVERSE_BYTE_ORDER.
10
10
  #
11
- # Riferimento: PNG spec (RFC 2083). Nessun compromesso sulla validità:
12
- # genera CRC32 corretti e usa deflate via zlib stdlib.
11
+ # Reference: PNG spec (RFC 2083). No compromise on validity:
12
+ # generates correct CRC32 values and uses deflate via the zlib stdlib.
13
13
  module PNG
14
14
  SIGNATURE = "\x89PNG\r\n\x1a\n".b
15
15
  COLOR_RGBA = 6
@@ -34,10 +34,10 @@ module Rpdfium
34
34
  end
35
35
 
36
36
  def write_idat(io, width, height, rgba, stride)
37
- # PNG richiede un byte di "filter type" all'inizio di ogni riga.
38
- # 0 = None (nessun filtro). Funziona ma comprime peggio.
39
- # Per semplicità usiamo None — output 1.5-2x più grande del minimo
40
- # ottimo, ma è una scelta esplicita di tradeoff complessità/zero-dep.
37
+ # PNG requires a "filter type" byte at the start of each row.
38
+ # 0 = None (no filter). It works but compresses worse.
39
+ # For simplicity we use None — output 1.5-2x larger than the optimal
40
+ # minimum, but it is an explicit complexity/zero-dep tradeoff choice.
41
41
  row_bytes = width * 4
42
42
  scanlines = String.new(capacity: (row_bytes + 1) * height,
43
43
  encoding: Encoding::ASCII_8BIT)
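
For readers unfamiliar with the layout the comments above describe, the following is a minimal, self-contained sketch of an RGBA PNG encoder that uses filter type 0 on every row. It is not the gem's `Rpdfium::IO::PNG` module: the names `png_chunk` and `encode_rgba_png` are illustrative, and the input is assumed to be a binary string of tightly packed RGBA pixels (`width * 4` bytes per row).

```ruby
require "zlib"

PNG_SIGNATURE = "\x89PNG\r\n\x1a\n".b

# A PNG chunk is: 4-byte big-endian length, 4-byte type, data, CRC32(type + data).
def png_chunk(type, data)
  body = type.b + data.b
  [data.bytesize].pack("N") + body + [Zlib.crc32(body)].pack("N")
end

# Encodes tightly packed RGBA 8bpc pixels into PNG bytes.
def encode_rgba_png(width, height, rgba)
  row_bytes = width * 4
  scanlines = String.new(capacity: (row_bytes + 1) * height,
                         encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    scanlines << "\x00".b # filter type 0 = None, one byte per row
    scanlines << rgba.byteslice(y * row_bytes, row_bytes)
  end

  # IHDR: width, height, bit depth 8, color type 6 (RGBA), then
  # compression method, filter method and interlace method, all 0.
  ihdr = [width, height, 8, 6, 0, 0, 0].pack("NNCCCCC")

  PNG_SIGNATURE +
    png_chunk("IHDR", ihdr) +
    png_chunk("IDAT", Zlib::Deflate.deflate(scanlines)) +
    png_chunk("IEND", "")
end

# One red pixel and one green pixel, both fully opaque.
SAMPLE_PIXELS = [255, 0, 0, 255, 0, 255, 0, 255].pack("C*")
SAMPLE_PNG = encode_rgba_png(2, 1, SAMPLE_PIXELS)
puts "#{SAMPLE_PNG.bytesize} bytes"
```

Writing the result with `File.binwrite("out.png", SAMPLE_PNG)` should produce a file that standard image viewers open. Using filter 0 everywhere keeps the encoder short, at the cost of the larger output the comment above mentions.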