RubyGems - rpdfium - Versions diffs - 0.4.1 → 0.4.2 - Mend

rpdfium 0.4.1 → 0.4.2


Files changed (30)

checksums.yaml +4 -4
data/CHANGELOG.md +601 -1317
data/README.md +73 -78
data/lib/rpdfium/annotation/annotation.rb +10 -8
data/lib/rpdfium/document.rb +49 -22
data/lib/rpdfium/errors.rb +2 -2
data/lib/rpdfium/form/form.rb +9 -9
data/lib/rpdfium/image/embedded.rb +17 -16
data/lib/rpdfium/io/png.rb +9 -9
data/lib/rpdfium/page.rb +562 -527
data/lib/rpdfium/raw.rb +216 -203
data/lib/rpdfium/search/search.rb +5 -5
data/lib/rpdfium/structure/attachment.rb +6 -6
data/lib/rpdfium/structure/element.rb +74 -74
data/lib/rpdfium/structure/outline.rb +2 -2
data/lib/rpdfium/structure/tree.rb +56 -55
data/lib/rpdfium/table/cells.rb +36 -33
data/lib/rpdfium/table/debugger.rb +12 -12
data/lib/rpdfium/table/edges.rb +51 -49
data/lib/rpdfium/table/extractor.rb +35 -34
data/lib/rpdfium/table/table.rb +65 -62
data/lib/rpdfium/util/cluster.rb +35 -33
data/lib/rpdfium/util/column_inference.rb +34 -32
data/lib/rpdfium/util/label_matcher.rb +30 -30
data/lib/rpdfium/util/text_extraction.rb +15 -15
data/lib/rpdfium/util/word_extractor.rb +49 -48
data/lib/rpdfium/util/word_merger.rb +25 -24
data/lib/rpdfium/version.rb +1 -1
data/lib/rpdfium.rb +17 -15
metadata +1 -1

data/README.md CHANGED

@@ -27,16 +27,22 @@ end
 ## Why
 The Ruby ecosystem has `pdf-reader` (text only, slow on complex docs),
-`origami` (security-research focused), and `hexapdf` (great for
-manipulation but text extraction is approximate). None give you
-character-level bounding boxes, real vector path geometry, or table
-extraction. `rpdfium` fills that gap by binding the same battle-tested
-C++ engine that powers Chrome's PDF viewer.
+`origami` (security-research focused), and `hexapdf` — a capable library that
+extracts text with character-level positioning and exposes the vector-path
+primitives you need to build table extraction yourself (see the ~120-line
+[reference](benchmark/examples/hexapdf_table_extraction.rb) in the benchmark
+suite). `hexapdf` is AGPL / commercially licensed, though.
+I wanted a permissively licensed library that ships those higher-level
+pipelines **out of the box** — pdfplumber-style table detection, page
+rendering to raster — on top of character metadata, without hand-rolling them.
+`rpdfium` does that under Apache-2.0, binding the same battle-tested C++ engine
+that powers Chrome's PDF viewer.
 In practice it matches the speed of Python's `pypdfium2` on text
-extraction and is **15-56× faster than `pdfplumber`** while using
-**5-7× less memory** on large documents. See [Performance](#performance)
-for details.
+extraction and is **up to ~80× faster than `pdfplumber`** while using
+**up to ~80× less memory** on dense documents (the 520-page academic tier).
+See [Performance](#performance) for details.
 ## Installing PDFium
@@ -321,7 +327,7 @@ end
 Three primitives:
 - `Page#font_inventory` — distribution by `(font, height, weight)`,
-  with counts and samples for ispection
+  with counts and samples for inspection
 - `Page#chars_where(font:, height:, weight:, bbox:, where:)` —
   filter chars by any combination of criteria
 - `Page#lines(font:, ...)` — high-level helper: filter + word
@@ -470,76 +476,65 @@ the tree stays in memory until the document is closed.
 ## Performance
-Measured on 4 PDFs of increasing complexity, best-of-3 runs after a
-warm-up, isolated in subprocesses to capture clean peak RSS. Versions
-under test: `rpdfium 0.3.13`, `pdfplumber 0.11.9`, `pypdfium2 5.6.0`.
-| Test corpus | Pages | Size | What it stresses |
-| --- | ---: | ---: | --- |
-| `sample.pdf` | 1 | 18 KB | Plain text baseline |
-| `form.pdf` | 1 | 107 KB | Char-per-text-object kerning, Form XObject, tables |
-| `complex.pdf` | 85 | 60 MB | Magazine-style document, dense text + heavy graphics |
-| `report.pdf` | 226 | 322 KB | Rotated pages (90°), small fonts, ~15 tables per page |
-### Speed
-| Corpus | Task | rpdfium | pypdfium2 | pdfplumber | speedup vs pdfplumber |
-| --- | --- | ---: | ---: | ---: | ---: |
-| sample.pdf (1 pag) | text | 4 ms | 4 ms | 75 ms | **21×** |
-| sample.pdf (1 pag) | tables | 4 ms | n/a | 70 ms | **16×** |
-| form.pdf (1 pag) | text | 12 ms | 13 ms | 538 ms | **44×** |
-| form.pdf (1 pag) | tables | 25 ms | n/a | 575 ms | **23×** |
-| complex.pdf (85 pag) | text | 190 ms | 183 ms | 7.76 s | **41×** |
-| complex.pdf (85 pag) | tables | 231 ms | n/a | 7.07 s | **31×** |
-| report.pdf (226 pag) | text | 412 ms | 397 ms | 23.26 s | **56×** |
-| report.pdf (226 pag) | tables | 1.68 s | n/a | 25.25 s | **15×** |
-`pypdfium2` does not implement table extraction (it's a raw FFI binding
-to PDFium, not a full pipeline). It's listed as the "pure PDFium speed
-floor" for text — rpdfium matches it within ±5%, showing that the Ruby
-FFI overhead is not measurable.
-### Memory (peak RSS)
-| Corpus | rpdfium | pypdfium2 | pdfplumber | pdfplumber/rpdfium |
+The full, reproducible benchmark suite — sample PDFs, runners, ground-truth
+correctness scoring, and methodology — lives in
+[`benchmark/`](benchmark/README.md). It compares **rpdfium** against
+**pypdfium2** (the "pure PDFium speed floor"), **pdfplumber** (the reference
+pure-Python pipeline) and **hexapdf** (pure Ruby) across five synthetic PDFs
+of increasing complexity (1 → 520 pages), measuring **execution time**, **peak
+memory (RSS)** and **correctness** (fraction of known ground-truth data
+recovered). Run it yourself:
+```bash
+export PDFIUM_LIBRARY_PATH=/path/to/libpdfium.{so,dylib,dll}
+pip install pdfplumber pypdfium2    # optional baselines
+gem install hexapdf                 # optional baseline + table-extraction reference
+ruby benchmark/run.rb
+```
+### Synthetic suite (Apple M-series, best of 3)
+Text extraction — rpdfium tracks pypdfium2 within noise (FFI overhead not
+measurable); pdfplumber degrades super-linearly:
+| PDF | rpdfium | pypdfium2 | pdfplumber | hexapdf |
 | --- | ---: | ---: | ---: | ---: |
-| sample.pdf | 29 MB | 20 MB | 40 MB | 1.4× |
-| form.pdf | 32 MB | 22 MB | 45 MB | 1.4× |
-| complex.pdf | 106 MB | 69 MB | 535 MB | **5.0×** |
-| report.pdf | 136 MB | 41 MB | 1003 MB | **7.4×** |
-The memory gap widens with workload size. On a 226-page document
-pdfplumber uses ~1 GB; rpdfium stays under 140 MB. For server-side
-batch processing this is the difference between a 256 MB container and
-a 2 GB one.
-### Headline numbers
-On large PDFs (226 pages, dense layout):
-- **rpdfium completes both text + tables in ~2.1 s using 136 MB**
-- **pdfplumber needs ~48 s and 1 GB** for the same work
-Across the four corpora the median speedup vs pdfplumber is **27× on
-text**, **22× on tables**. rpdfium scales linearly with page count
-(thanks to PDFium's C++ engine); pdfplumber's pure-Python pipeline
-degrades super-linearly on large documents.
-### Methodology
-Each measurement is the **minimum of 3 timed runs after a warm-up run**
-(to neutralize OS page cache effects on the 60 MB `complex.pdf`).
-Subprocess isolation per measurement ensures clean RSS reading via
-`resource.getrusage` / `/proc/self/status`. The benchmark harness is
-a small Ruby driver that shells out to three runners (one Ruby script
-using `rpdfium`, two Python scripts using `pdfplumber` and
-`pypdfium2`), parses the JSON each emits, and aggregates the results.
-Output quality has been spot-checked: rpdfium matches pypdfium2 char
-count within ±1 char (rounding on the trailing newline). pdfplumber
-returns ~2% fewer chars on locale-formatted numbers due to a different
-word-tokenization for thousand-separator punctuation (e.g. `1.250.000`
-split on periods).
+| 01_simple (1 pg) | 12 ms / 33 MB | 12 ms / 36 MB | 17 ms / 42 MB | 14 ms / 24 MB |
+| 02_medium (6 pg) | 14 ms / 33 MB | 14 ms / 37 MB | 101 ms / 57 MB | 19 ms / 24 MB |
+| 03_complex (16 pg) | 15 ms / 34 MB | 16 ms / 38 MB | 182 ms / 72 MB | 28 ms / 25 MB |
+| 04_heavy (60 pg) | **47 ms / 35 MB** | 50 ms / 40 MB | 2.41 s / 456 MB | 145 ms / 26 MB |
+| 05_academic (520 pg) | **706 ms / 69 MB** | 755 ms / 104 MB | **57.15 s / 5537 MB** | 2.28 s / 43 MB |
+On the 520-page tier rpdfium is **~81× faster than pdfplumber and uses ~80×
+less memory**, and it now beats raw pypdfium2 on both axes — `extract_text`
+streams pages (one alive at a time), while the pypdfium2 runner holds them.
+Table extraction (pypdfium2 has no table layer; the hexapdf column uses the
+minimal lines-based reference in
+[`benchmark/examples/hexapdf_table_extraction.rb`](benchmark/examples/hexapdf_table_extraction.rb)):
+| PDF | rpdfium | pdfplumber | hexapdf |
+| --- | ---: | ---: | ---: |
+| 01_simple (1 pg) | 15 ms / 34 MB | 17 ms / 42 MB | 24 ms / 25 MB |
+| 02_medium (6 pg) | 38 ms / 35 MB | 111 ms / 57 MB | 54 ms / 25 MB |
+| 03_complex (16 pg) | 124 ms / 38 MB | 187 ms / 71 MB | 88 ms / 26 MB |
+| 04_heavy (60 pg) | **496 ms / 39 MB** | 3.05 s / 442 MB | 779 ms / 29 MB |
+| 05_academic (520 pg) | 15.46 s / 104 MB | 68.04 s / 5179 MB | **13.22 s / 37 MB** |
+rpdfium stays **~4.4× faster than pdfplumber while using ~50× less memory** on
+the academic tier. The minimal hexapdf reference edges it out on time there
+(13.22 s vs 15.46 s): at 520 pages the full pipeline's per-page cost
+(borderless `:text` attempts, rectangle / multi-table search, annotation
+parsing) dominates, while the `:lines`-only reference skips all of it — a fair
+comparison only on clean ruled grids.
+Correctness is **100% for every library on every tier** — these are clean
+generated grids, the easy case. Real-world tables (dashed rules, partial
+borders, misaligned cells) are where rpdfium's snap/join tolerances and
+`:text` fallback earn their cost; the 120-line hexapdf reference matches here
+but would drop cells there. See
+[`benchmark/README.md`](benchmark/README.md) for the full tables, task-support
+matrix, correctness scoring and methodology.
 ## Memory safety
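The headline ratios quoted in the new Performance text (~81× / ~80× for text, ~4.4× / ~50× for tables) can be re-derived from the `05_academic` rows of the two tables. A minimal Ruby sketch of that arithmetic; the constant and method names are illustrative, and the figures are copied from the tables in the hunk above:

```ruby
# Figures copied from the 05_academic (520 pg) rows of the README tables.
# Times are in seconds, memory is peak RSS in MB.
TEXT = {
  rpdfium:    { time: 0.706, rss: 69 },
  pdfplumber: { time: 57.15, rss: 5537 }
}.freeze

TABLES = {
  rpdfium:    { time: 15.46, rss: 104 },
  pdfplumber: { time: 68.04, rss: 5179 }
}.freeze

# Ratio of the baseline to rpdfium for one metric, rounded to one decimal.
def ratio(rows, metric, baseline: :pdfplumber)
  (rows[baseline][metric].to_f / rows[:rpdfium][metric]).round(1)
end

puts "text:   #{ratio(TEXT, :time)}x faster, #{ratio(TEXT, :rss)}x less memory"
puts "tables: #{ratio(TABLES, :time)}x faster, #{ratio(TABLES, :rss)}x less memory"
```

The results land on the rounded figures the README quotes, so the prose and the tables agree with each other.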

data/lib/rpdfium/annotation/annotation.rb CHANGED

@@ -1,9 +1,9 @@
 # frozen_string_literal: true
 module Rpdfium
-  # Wrapper per FPDF_ANNOTATION. Le annotazioni includono link, highlight,
-  # commenti, widget di form. PDFium richiede di chiudere ogni handle con
-  # FPDFPage_CloseAnnot, gestito qui via finalizer.
+  # Wrapper for FPDF_ANNOTATION. Annotations include links, highlights,
+  # comments, form widgets. PDFium requires closing each handle with
+  # FPDFPage_CloseAnnot, handled here via a finalizer.
   class Annotation
     SUBTYPES = {
       Raw::FPDF_ANNOT_UNKNOWN => :unknown,
@@ -64,8 +64,8 @@ module Rpdfium
         top: h - rect[:top], bottom: h - rect[:bottom] }
     end
-    # Valore di una chiave del dict di annotazione (UTF-16LE).
-    # Chiavi comuni: "Contents" (testo annotazione), "T" (autore),
+    # Value of a key in the annotation dict (UTF-16LE).
+    # Common keys: "Contents" (annotation text), "T" (author),
     # "M" (mod date), "NM" (uniq name).
     def [](key)
       Raw.read_utf16_string(:FPDFAnnot_GetStringValue, @state[:handle], key.to_s)
@@ -75,7 +75,7 @@ module Rpdfium
       Raw.FPDFAnnot_HasKey(@state[:handle], key.to_s) == 1
     end
-    # Per annotazioni :link → URL di destinazione (se esterno) o nil.
+    # For :link annotations → destination URL (if external) or nil.
     def link_uri
       return nil unless subtype == :link
@@ -85,10 +85,12 @@ module Rpdfium
       action = Raw.FPDFLink_GetAction(link_handle)
       return nil if action.null?
-      Raw.read_utf16_string(:FPDFAction_GetURIPath, @page.document.handle, action)
+      # Unlike most PDFium getters, FPDFAction_GetURIPath returns 7-bit
+      # ASCII bytes, not UTF-16LE.
+      Raw.read_ascii_string(:FPDFAction_GetURIPath, @page.document.handle, action)
     end
-    # Per link interni → indice pagina di destinazione, o nil.
+    # For internal links → destination page index, or nil.
     def link_dest_page
       return nil unless subtype == :link

data/lib/rpdfium/document.rb CHANGED

@@ -1,9 +1,9 @@
 # frozen_string_literal: true
 module Rpdfium
-  # Wrapper di livello documento. Espone:
-  # - apertura da path / IO / bytes / pagina by index
-  # - metadata (Title, Author, ecc.)
+  # Document-level wrapper. Exposes:
+  # - opening from path / IO / bytes / page by index
+  # - metadata (Title, Author, etc.)
   # - permissions
   # - outline (bookmarks)
   # - attachments
@@ -39,10 +39,11 @@ module Rpdfium
         raise LoadError, "Failed to load PDF: #{msg}"
       end
-      # Stato condiviso tra istanza e finalizer. Wrappato in Hash mutabile
-      # perché la closure del finalizer e il close() esplicito devono vedere
-      # lo stesso :closed flag — altrimenti chi arriva secondo richiama
-      # FPDF_CloseDocument su un handle già liberato e PDFium segfaulta.
+      # State shared between the instance and the finalizer. Wrapped in a
+      # mutable Hash because the finalizer closure and the explicit
+      # close() must see the same :closed flag — otherwise whichever
+      # arrives second calls FPDF_CloseDocument on an already-freed
+      # handle and PDFium segfaults.
       @state = {
         handle: handle,
         retain_buffer: retain_buffer,
@@ -50,11 +51,11 @@ module Rpdfium
       }
       @form_env = nil
       @page_cache = {}
-      # IMPORTANTE: il finalizer cattura @state (Hash), NON self. Catturare
-      # self impedirebbe al GC di raccogliere il Document. Inoltre il
-      # finalizer NON tocca @page_cache: le Page hanno il loro finalizer
-      # individuale, e l'ordine di esecuzione tra finalizer è non
-      # deterministico in Ruby.
+      # IMPORTANT: the finalizer captures @state (Hash), NOT self.
+      # Capturing self would prevent the GC from collecting the Document.
+      # Moreover the finalizer does NOT touch @page_cache: Pages have
+      # their own individual finalizer, and the execution order among
+      # finalizers is non-deterministic in Ruby.
       ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
     end
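The two comments in this hunk describe a general Ruby pattern: keep the native handle and a `:closed` flag in a shared Hash, and build the finalizer from that Hash rather than from `self`. A minimal sketch of the pattern with a stand-in class; `Handle` and `RELEASED` are hypothetical names and nothing here calls PDFium:

```ruby
# Stand-in for a native handle wrapper. RELEASED records each simulated
# "free" so a double release is easy to spot.
class Handle
  RELEASED = []

  # Built on the class so the proc closes over `state` only, never `self`.
  def self.finalizer(state)
    proc do
      next if state[:closed]
      state[:closed] = true
      RELEASED << state[:id]
    end
  end

  def initialize(id)
    @state = { id: id, closed: false }
    ObjectSpace.define_finalizer(self, self.class.finalizer(@state))
  end

  # Explicit close flips the same flag, so a later finalizer run is a no-op.
  def close
    return if @state[:closed]
    @state[:closed] = true
    RELEASED << @state[:id]
  end
end

h = Handle.new(1)
h.close
h.close # second call returns early
# Simulate the GC invoking a finalizer that shares the same state Hash.
Handle.finalizer(h.instance_variable_get(:@state)).call
puts Handle::RELEASED.size
```

Whichever of `close` and the finalizer runs first performs the release; the other one sees `:closed` and does nothing.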
@@ -86,8 +87,9 @@ module Rpdfium
       ensure_open!
       raise PageError, "Page index #{index} out of range" unless (0...page_count).cover?(index)
-      # Le pagine sono cacheable: ricaricarle è costoso e gli oggetti sono
-      # immutabili dal punto di vista applicativo (in modalità read-only).
+      # Pages are cacheable: reloading them is expensive and the objects
+      # are immutable from the application's point of view (in read-only
+      # mode).
       @page_cache[index] ||= Page.new(self, index)
     end
     alias [] page
@@ -98,6 +100,31 @@ module Rpdfium
       page_count.times { |i| yield page(i) }
     end
+    # Iterates the pages WITHOUT retaining them in the page cache: each page
+    # is closed (native FPDF_PAGE / text page handles and the per-page char
+    # and line-segment caches) as soon as the block returns.
+    #
+    # `#each` caches every visited page for the document's whole lifetime —
+    # ideal for interactive, random-access use, but for a single linear pass
+    # over a large document it makes peak memory grow with the page count
+    # (each page keeps thousands of char hashes alive). The batch helpers
+    # (`Rpdfium.extract_text`, `.extract_tables`, `.render_to_pngs`) visit
+    # each page exactly once, so they stream instead: only one page is alive
+    # at a time and peak RSS stays flat in the number of pages.
+    def each_page_streaming
+      return enum_for(:each_page_streaming) unless block_given?
+      ensure_open!
+      page_count.times do |i|
+        pg = Page.new(self, i)
+        begin
+          yield pg
+        ensure
+          pg.close
+        end
+      end
+    end
     def page_label(index)
       Raw.read_utf16_string(:FPDF_GetPageLabel, @state[:handle], index)
     end
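The difference between `#each` and the new `#each_page_streaming` can be shown without PDFium by counting how many pages are open at once. A toy sketch under that assumption; `ToyPage` and `ToyDocument` are hypothetical stand-ins that only mimic the caching and closing behaviour described in the comment above:

```ruby
# Toy stand-ins (not the gem's classes) that count how many pages are open.
class ToyPage
  @open = 0
  @peak = 0
  class << self
    attr_accessor :open, :peak
  end

  def initialize
    self.class.open += 1
    self.class.peak = [self.class.peak, self.class.open].max
  end

  def close
    self.class.open -= 1
  end
end

class ToyDocument
  def initialize(page_count)
    @page_count = page_count
    @page_cache = {}
  end

  # Caching iteration: every visited page stays in @page_cache.
  def each
    @page_count.times { |i| yield(@page_cache[i] ||= ToyPage.new) }
  end

  # Streaming iteration: each page is closed as soon as the block returns.
  def each_page_streaming
    return enum_for(:each_page_streaming) unless block_given?
    @page_count.times do
      pg = ToyPage.new
      begin
        yield pg
      ensure
        pg.close
      end
    end
  end
end

ToyDocument.new(100).each_page_streaming { |_pg| }
streaming_peak = ToyPage.peak

ToyPage.open = 0
ToyPage.peak = 0
ToyDocument.new(100).each { |_pg| }
caching_peak = ToyPage.peak

puts "streaming peak: #{streaming_peak}, caching peak: #{caching_peak}"
```

One consequence of the streaming form is that a page must not be kept after the block returns, since it has already been closed by then.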
@@ -116,11 +143,11 @@ module Rpdfium
       return nil if Raw.FPDF_GetFileVersion(@state[:handle], buf) == 0
       v = buf.read_int
-      # PDFium ritorna 14 → 1.4, 17 → 1.7
+      # PDFium returns 14 → 1.4, 17 → 1.7
       "#{v / 10}.#{v % 10}"
     end
-    # Permission bits secondo PDF spec (Table 22 §7.6.3.2)
+    # Permission bits according to the PDF spec (Table 22 §7.6.3.2)
     PERMISSIONS = {
       print:       1 << 2,
       modify:      1 << 3,
@@ -154,9 +181,9 @@ module Rpdfium
       form_type != :none
     end
-    # Lazy form environment. Necessario per:
-    # - leggere FormFieldType/Value/Name su widget annotations
-    # - renderizzare i form fields sopra la pagina (FFLDraw)
+    # Lazy form environment. Required to:
+    # - read FormFieldType/Value/Name on widget annotations
+    # - render the form fields over the page (FFLDraw)
     def form_env
       @form_env ||= Form::Environment.new(self) if has_forms?
     end
@@ -179,7 +206,7 @@ module Rpdfium
     def close
       return if @state[:closed]
-      # Ordine: chiudi prima form env e pagine cached, poi documento.
+      # Order: close form env and cached pages first, then the document.
       @form_env&.close
       @page_cache.each_value(&:close)
       @page_cache.clear
@@ -216,8 +243,8 @@ module Rpdfium
     end
     def load_from_bytes(bytes, password)
-      # CRITICO: PDFium NON copia i bytes — li referenzia. Dobbiamo tenere
-      # vivo il buffer per tutta la vita del documento.
+      # CRITICAL: PDFium does NOT copy the bytes — it references them. We
+      # must keep the buffer alive for the entire life of the document.
       buf = FFI::MemoryPointer.new(:uchar, bytes.bytesize)
       buf.put_bytes(0, bytes)
       [Raw.FPDF_LoadMemDocument64(buf, bytes.bytesize, password), buf]

data/lib/rpdfium/errors.rb CHANGED

@@ -34,8 +34,8 @@ module Rpdfium
         Raw.FPDF_InitLibrary
         @initialized = true
-        # Cleanup automatico a process exit. Ordine garantito: tutti i
-        # finalizer Ruby vengono eseguiti prima di at_exit dei nostri blocchi.
+        # Automatic cleanup at process exit. Order is guaranteed: all Ruby
+        # finalizers run before the at_exit of our own blocks.
         at_exit { Raw.FPDF_DestroyLibrary if @initialized }
       end
     end

data/lib/rpdfium/form/form.rb CHANGED

@@ -2,10 +2,10 @@
 module Rpdfium
   module Form
-    # FPDF_FORMHANDLE è necessario per leggere widget annotations.
-    # In modalità read-only basta inizializzarlo con una FORMFILLINFO minimale
-    # (version=2, callbacks NULL). PDFium chiama i callback solo durante
-    # interazione utente o JavaScript, che noi non usiamo.
+    # FPDF_FORMHANDLE is required to read widget annotations.
+    # In read-only mode it is enough to initialize it with a minimal FORMFILLINFO
+    # (version=2, callbacks NULL). PDFium invokes the callbacks only during
+    # user interaction or JavaScript, which we do not use.
     class Environment
       attr_reader :document
@@ -13,7 +13,7 @@ module Rpdfium
         @document = document
         @info = Raw::FPDF_FORMFILLINFO.new
         @info[:version] = 2
-        # Tutti i puntatori restano NULL (default di FFI::Struct).
+        # All pointers remain NULL (the FFI::Struct default).
         handle = Raw.FPDFDOC_InitFormFillEnvironment(document.handle, @info)
         if handle.null?
           raise FormError,
@@ -48,8 +48,8 @@ module Rpdfium
       end
     end
-    # Wrapper per un widget di form. Si costruisce a partire da
-    # un'annotazione di tipo :widget e l'env del documento.
+    # Wrapper for a form widget. It is built from
+    # an annotation of type :widget and the document env.
     class Field
       TYPES = {
         Raw::FPDF_FORMFIELD_UNKNOWN     => :unknown,
@@ -89,14 +89,14 @@ module Rpdfium
       def readonly?; (flags & (1 << 0)).positive?; end
       def required?; (flags & (1 << 1)).positive?; end
-      # Per checkbox e radio
+      # For checkbox and radio
       def checked?
         return false unless %i[checkbox radiobutton].include?(type)
         Raw.FPDFAnnot_IsChecked(@env.handle, @annotation.handle) == 1
       end
-      # Per combobox/listbox
+      # For combobox/listbox
       def options
         n = Raw.FPDFAnnot_GetOptionCount(@env.handle, @annotation.handle)
         return [] if n <= 0

data/lib/rpdfium/image/embedded.rb CHANGED

@@ -2,11 +2,11 @@
 module Rpdfium
   module Image
-    # Wrapper per un image object inserito in una pagina. Permette di:
-    # - leggere metadata (dimensione pixel, DPI, colorspace, BPP)
-    # - ottenere bytes raw (così come stoccati: tipicamente JPEG)
-    # - ottenere bytes decoded (raster post-filtri)
-    # - ottenere bitmap renderizzato (con maschere e matrice applicate)
+    # Wrapper for an image object placed in a page. Allows you to:
+    # - read metadata (pixel size, DPI, colorspace, BPP)
+    # - obtain raw bytes (as stored: typically JPEG)
+    # - obtain decoded bytes (raster after filters)
+    # - obtain a rendered bitmap (with masks and matrix applied)
     class Embedded
       COLORSPACES = {
         0 => :unknown, 1 => :devicegray, 2 => :devicergb, 3 => :devicecmyk,
@@ -55,8 +55,8 @@ module Rpdfium
           top: h - t.read_float, bottom: h - b.read_float }
       end
-      # Filtri applicati nell'ordine PDF: es. ["DCTDecode"] → JPEG,
-      # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] → ricodifiche.
+      # Filters applied in PDF order: e.g. ["DCTDecode"] → JPEG,
+      # ["FlateDecode"] → zlib, ["DCTDecode","DCTDecode"] → re-encodings.
       def filters
         n = Raw.FPDFImageObj_GetImageFilterCount(@handle)
         Array.new(n) do |i|
@@ -72,8 +72,9 @@ module Rpdfium
         end
       end
-      # Bytes "raw": come sono stoccati nel PDF. Se filters == ["DCTDecode"]
-      # questi bytes sono un JPEG completo che puoi salvare con estensione .jpg.
+      # "Raw" bytes: as they are stored in the PDF. If filters ==
+      # ["DCTDecode"] these bytes are a complete JPEG that you can save
+      # with a .jpg extension.
       def raw_bytes
         len = Raw.FPDFImageObj_GetImageDataRaw(@handle, FFI::Pointer::NULL, 0)
         return "" if len.zero?
@@ -83,8 +84,8 @@ module Rpdfium
         buf.read_bytes(len)
       end
-      # Bytes decoded: pixel raster dopo l'applicazione dei filtri.
-      # Layout dipende dal colorspace.
+      # Decoded bytes: raster pixels after the filters are applied.
+      # Layout depends on the colorspace.
       def decoded_bytes
         len = Raw.FPDFImageObj_GetImageDataDecoded(@handle, FFI::Pointer::NULL, 0)
         return "" if len.zero?
@@ -94,7 +95,7 @@ module Rpdfium
         buf.read_bytes(len)
       end
-      # Bitmap renderizzato applicando matrice e maschere. Ritorna [w, h, bytes(BGRA)].
+      # Bitmap rendered applying matrix and masks. Returns [w, h, bytes(BGRA)].
       def render_bitmap
         bitmap = Raw.FPDFImageObj_GetRenderedBitmap(
           @page.document.handle, @page.handle, @handle
@@ -112,14 +113,14 @@ module Rpdfium
         end
       end
-      # Salva il file. Se i filtri sono DCTDecode → scrive .jpg diretto.
-      # Altrimenti renderizza il bitmap a PNG.
+      # Saves the file. If the filters are DCTDecode → writes a direct
+      # .jpg. Otherwise renders the bitmap to PNG.
       def save(path)
         if filters == ["DCTDecode"]
           File.binwrite(path, raw_bytes)
         else
           w, h, bytes, stride = render_bitmap
-          # I bitmap resi sono BGRA: convertiamo a RGBA per il PNG writer
+          # The rendered bitmaps are BGRA: we convert to RGBA for the PNG writer
           rgba = swap_bgra_to_rgba(bytes, w, h, stride)
           Rpdfium::IO::PNG.write(path, w, h, rgba, stride: w * 4)
         end
@@ -132,7 +133,7 @@ module Rpdfium
         out = String.new(capacity: w * h * 4, encoding: Encoding::ASCII_8BIT)
         h.times do |y|
           row = bgra.byteslice(y * stride, w * 4)
-          # Scambia B<->R per ogni pixel
+          # Swap B<->R for each pixel
           (0...row.bytesize).step(4) do |i|
             out << row.getbyte(i + 2) << row.getbyte(i + 1) <<
                    row.getbyte(i)     << row.getbyte(i + 3)
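The swap above has to honour the bitmap stride, because a rendered row can carry padding bytes after the last pixel. A self-contained sketch of the same B<->R swap on a padded buffer; `bgra_to_rgba` is a hypothetical name and the implementation is an illustrative variant, not the gem's `swap_bgra_to_rgba`:

```ruby
# Converts a BGRA buffer with row padding (stride >= width * 4) into a
# tightly packed RGBA buffer by swapping the B and R bytes of each pixel.
def bgra_to_rgba(bgra, width, height, stride)
  out = String.new(capacity: width * height * 4, encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    # Take only the pixel bytes of the row and skip the padding.
    row = bgra.byteslice(y * stride, width * 4)
    row.unpack("C*").each_slice(4) do |b, g, r, a|
      out << [r, g, b, a].pack("C4")
    end
  end
  out
end

# 2x1 image with a 12-byte stride: two pixels followed by 4 padding bytes.
pixels = [0x10, 0x20, 0x30, 0xFF, 0x40, 0x50, 0x60, 0x80].pack("C*")
padded = pixels + ("\x00".b * 4)

rgba = bgra_to_rgba(padded, 2, 1, 12)
p rgba.bytes
```

The output is tightly packed, which matches the `stride: w * 4` that `save` passes to the PNG writer after the conversion.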

data/lib/rpdfium/io/png.rb CHANGED

@@ -4,12 +4,12 @@ require "zlib"
 module Rpdfium
   module IO
-    # PNG writer minimale, puro Ruby, zero dipendenze esterne.
-    # Supporta solo RGBA 8bpc (color type 6) — il formato che PDFium produce
-    # quando rendi con FPDF_REVERSE_BYTE_ORDER.
+    # Minimal PNG writer, pure Ruby, zero external dependencies.
+    # Supports only RGBA 8bpc (color type 6) — the format PDFium produces
+    # when rendering with FPDF_REVERSE_BYTE_ORDER.
     #
-    # Riferimento: PNG spec (RFC 2083). Nessun compromesso sulla validità:
-    # genera CRC32 corretti e usa deflate via zlib stdlib.
+    # Reference: PNG spec (RFC 2083). No compromise on validity:
+    # generates correct CRC32 values and uses deflate via the zlib stdlib.
     module PNG
       SIGNATURE = "\x89PNG\r\n\x1a\n".b
       COLOR_RGBA = 6
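For readers unfamiliar with the format, the approach the comment describes (RGBA 8bpc, color type 6, filter type 0 on every row, deflate via zlib, CRC32 per chunk) fits in a few lines. A minimal sketch under those assumptions; `png_chunk` and `encode_png_rgba` are hypothetical names, not the module's actual methods:

```ruby
require "zlib"

PNG_SIGNATURE = "\x89PNG\r\n\x1a\n".b

# One PNG chunk: 4-byte length, 4-byte type, data, CRC32 over type + data.
def png_chunk(type, data)
  body = type.b + data
  [data.bytesize].pack("N") + body + [Zlib.crc32(body)].pack("N")
end

# Encodes tightly packed RGBA pixels (8 bits per channel) as a PNG string.
def encode_png_rgba(width, height, rgba)
  row_bytes = width * 4
  scanlines = String.new(capacity: (row_bytes + 1) * height,
                         encoding: Encoding::ASCII_8BIT)
  height.times do |y|
    scanlines << "\x00".b # filter type 0 = None
    scanlines << rgba.byteslice(y * row_bytes, row_bytes)
  end
  # IHDR: width, height, bit depth 8, color type 6 (RGBA), then
  # compression method, filter method and interlace method, all 0.
  ihdr = [width, height, 8, 6, 0, 0, 0].pack("NNC5")
  PNG_SIGNATURE +
    png_chunk("IHDR", ihdr) +
    png_chunk("IDAT", Zlib::Deflate.deflate(scanlines)) +
    png_chunk("IEND", "".b)
end

red_pixel = [255, 0, 0, 255].pack("C4")
png = encode_png_rgba(1, 1, red_pixel)
puts png.bytesize
```

The resulting string can be written with `File.binwrite`. Using filter type 0 for every row keeps the encoder trivial at the cost of weaker compression, which is the tradeoff the source comment states.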
@@ -34,10 +34,10 @@ module Rpdfium
       end
       def write_idat(io, width, height, rgba, stride)
-        # PNG richiede un byte di "filter type" all'inizio di ogni riga.
-        # 0 = None (nessun filtro). Funziona ma comprime peggio.
-        # Per semplicità usiamo None — output 1.5-2x più grande del minimo
-        # ottimo, ma è una scelta esplicita di tradeoff complessità/zero-dep.
+        # PNG requires a "filter type" byte at the start of each row.
+        # 0 = None (no filter). It works but compresses worse.
+        # For simplicity we use None — output 1.5-2x larger than the optimal
+        # minimum, but it is an explicit complexity/zero-dep tradeoff choice.
         row_bytes = width * 4
         scanlines = String.new(capacity: (row_bytes + 1) * height,
                                 encoding: Encoding::ASCII_8BIT)