RubyGems - rpdfium - Versions diffs - 0.4.1 → 0.4.2 - Mend

rpdfium 0.4.1 → 0.4.2

Files changed (30)

checksums.yaml +4 -4
data/CHANGELOG.md +601 -1317
data/README.md +73 -78
data/lib/rpdfium/annotation/annotation.rb +10 -8
data/lib/rpdfium/document.rb +49 -22
data/lib/rpdfium/errors.rb +2 -2
data/lib/rpdfium/form/form.rb +9 -9
data/lib/rpdfium/image/embedded.rb +17 -16
data/lib/rpdfium/io/png.rb +9 -9
data/lib/rpdfium/page.rb +562 -527
data/lib/rpdfium/raw.rb +216 -203
data/lib/rpdfium/search/search.rb +5 -5
data/lib/rpdfium/structure/attachment.rb +6 -6
data/lib/rpdfium/structure/element.rb +74 -74
data/lib/rpdfium/structure/outline.rb +2 -2
data/lib/rpdfium/structure/tree.rb +56 -55
data/lib/rpdfium/table/cells.rb +36 -33
data/lib/rpdfium/table/debugger.rb +12 -12
data/lib/rpdfium/table/edges.rb +51 -49
data/lib/rpdfium/table/extractor.rb +35 -34
data/lib/rpdfium/table/table.rb +65 -62
data/lib/rpdfium/util/cluster.rb +35 -33
data/lib/rpdfium/util/column_inference.rb +34 -32
data/lib/rpdfium/util/label_matcher.rb +30 -30
data/lib/rpdfium/util/text_extraction.rb +15 -15
data/lib/rpdfium/util/word_extractor.rb +49 -48
data/lib/rpdfium/util/word_merger.rb +25 -24
data/lib/rpdfium/version.rb +1 -1
data/lib/rpdfium.rb +17 -15
metadata +1 -1

data/lib/rpdfium/table/table.rb CHANGED

@@ -2,12 +2,12 @@
 module Rpdfium
   module Table
-    # Rappresenta una tabella trovata su una pagina. Espone celle, righe,
-    # colonne, bbox, e il metodo `extract` che ritorna i dati testuali.
+    # Represents a table found on a page. Exposes cells, rows,
+    # columns, bbox, and the `extract` method that returns the textual data.
     #
-    # Ogni cella è una bbox `[x0, top, x1, bottom]` (top-down).
-    # Una "row" è il gruppo di celle che condividono la stessa `top`.
-    # Una "column" è il gruppo che condivide la stessa `x0`.
+    # Each cell is a bbox `[x0, top, x1, bottom]` (top-down).
+    # A "row" is the group of cells sharing the same `top`.
+    # A "column" is the group sharing the same `x0`.
     class Table
       attr_reader :page, :cells
@@ -27,9 +27,9 @@ module Rpdfium
         end
       end
-      # Restituisce le righe come Array<Array<bbox|nil>>. Le celle "mancanti"
-      # in una riga (es. perché la tabella ha una topologia irregolare) sono
-      # rappresentate come nil — coerente con pdfplumber.
+      # Returns the rows as Array<Array<bbox|nil>>. The "missing" cells
+      # in a row (e.g. because the table has an irregular topology) are
+      # represented as nil — consistent with pdfplumber.
       def rows
         rows_or_columns(:row)
       end
@@ -38,57 +38,60 @@ module Rpdfium
         rows_or_columns(:col)
       end
-      # Estrai dati: Array<Array<String>>. Per ogni riga, per ogni cella,
-      # filtra i char della pagina il cui MIDPOINT è nella bbox della cella,
-      # poi ricostruisce il testo via Util::TextExtraction (che a sua volta
-      # passa da WordExtractor).
+      # Extract data: Array<Array<String>>. For each row, for each cell,
+      # filter the page chars whose MIDPOINT lies within the cell's bbox,
+      # then reconstruct the text via Util::TextExtraction (which in turn
+      # goes through WordExtractor).
       #
-      # Questo è il path di pdfplumber.Table.extract — per ogni riga prima
-      # filtra i char della riga (ottimizzazione: quasi tutti i char delle
-      # altre righe vengono scartati subito), poi per ogni cella filtra
-      # ancora dentro la sub-bbox.
+      # This is the pdfplumber.Table.extract path — for each row it first
+      # filters the row's chars (optimization: nearly all chars from the
+      # other rows are discarded immediately), then for each cell filters
+      # again within the sub-bbox.
       #
-      # Ottimizzazione rispetto al path naïve: i char vengono ordinati per
-      # midpoint verticale una sola volta; per ogni riga si usa bsearch per
-      # trovare in O(log n) i char candidati invece di scansionare tutto
-      # l'array O(n) per ogni riga.
+      # Optimization over the naïve path: the chars are sorted by their
+      # vertical midpoint only once; for each row bsearch is used to find
+      # the candidate chars in O(log n) instead of scanning the whole
+      # array O(n) for every row.
       #
-      # NOTA su strategia :text: `words_to_edges_h` emette per design DUE
-      # edges per riga (top e bottom della bbox del cluster). Significa che
-      # una tabella detectata da text-strategy avrà righe "vere" intervallate
-      # da righe "vuote" tra il bottom-edge della riga N e il top-edge della
-      # riga N+1. Questo è identico al comportamento di pdfplumber. Il
-      # caller può filtrare via `result.reject { |row| row.all?(&:empty?) }`
-      # se vuole eliminarle.
-      # `cell_padding`: estende il bbox di ogni cella verso sinistra e verso
-      # l'alto di N punti. Default 0 (= comportamento pdfplumber identico).
-      # Utile per PDF dove i char sporgono leggermente dal bordo della cella
-      # (es. la "I" maiuscola della cella "Intermediario" in CR Banca d'Italia
-      # ha x0=24.0 ma il bordo della cella è a x=25.6 — viene scartata dal
-      # filtro midpoint, output "ntermediario:"). Con `cell_padding: 2.0` la
-      # cella diventa [23.6, ..., 100, ...] e la "I" viene catturata.
+      # NOTE on the :text strategy: `words_to_edges_h` emits by design TWO
+      # edges per row (top and bottom of the cluster bbox). This means that
+      # a table detected by the text-strategy will have "real" rows
+      # interleaved with "empty" rows between the bottom-edge of row N and
+      # the top-edge of row N+1. This is identical to pdfplumber's behavior.
+      # The caller may filter via `result.reject { |row| row.all?(&:empty?) }`
+      # if it wants to drop them.
+      # `cell_padding`: extends each cell's bbox toward the left and toward
+      # the top by N points. Default 0 (= identical pdfplumber behavior).
+      # Useful for PDFs where chars protrude slightly past the cell border
+      # (e.g. the uppercase "I" of the "Intermediario" cell in a CR Banca
+      # d'Italia form has x0=24.0 but the cell border is at x=25.6 — it gets
+      # discarded by the midpoint filter, output "ntermediario:"). With
+      # `cell_padding: 2.0` the cell becomes [23.6, ..., 100, ...] and the
+      # "I" is captured.
       #
-      # Padding solo sui bordi "interno-sinistro" e "interno-alto" per
-      # evitare di duplicare char condivisi tra celle adiacenti (un char tra
-      # cella A e cella B finirebbe in entrambe se entrambe paddassero su
-      # tutti i lati).
+      # Padding only on the "inner-left" and "inner-top" borders to avoid
+      # duplicating chars shared between adjacent cells (a char between
+      # cell A and cell B would end up in both if both padded on all
+      # sides).
       def extract(x_tolerance: Util::WordExtractor::DEFAULT_X_TOLERANCE,
                   y_tolerance: Util::WordExtractor::DEFAULT_Y_TOLERANCE,
                   keep_blank_chars: false,
                   cell_padding: 0.0)
-        # `lean: true`: salta 5 chiamate FFI per char (font name, weight,
-        # angle, hyphen flag, unicode error) che non servono al pipeline
-        # di estrazione tabelle. Su tabelle con migliaia di char riduce
-        # il tempo di compute_chars del ~30%.
-        chars = @page.chars(lean: true)
-        # Ordina per midpoint verticale una volta sola; costruisce un array
-        # parallelo di vmid per bsearch. Costo: O(n log n) una tantum.
+        # `geometry: true`: the strongest lean mode — on top of skipping
+        # font/weight/angle/hyphen/unicode-error it also drops the per-char
+        # origin read and emits a minimal hash. It keeps only the fields the
+        # table/word pipeline reads, cutting both FFI roundtrips and hash
+        # allocation. On tables with thousands of chars this is the dominant
+        # cost of extract_tables. See Page#chars.
+        chars = @page.chars(lean: true, geometry: true)
+        # Sort by vertical midpoint once; build a parallel array of vmid
+        # for bsearch. Cost: O(n log n) one-time.
         sorted_chars = chars.sort_by { |c| (c[:top] + c[:bottom]) / 2.0 }
         vmids = sorted_chars.map { |c| (c[:top] + c[:bottom]) / 2.0 }
-        # Istanzia WordExtractor UNA volta sola e riusalo per tutte le celle
-        # (può esserci una tabella con decine di celle, evitiamo allocazioni).
+        # Instantiate WordExtractor ONCE and reuse it for all cells
+        # (a table may have dozens of cells; avoid allocations).
         word_extractor = Util::WordExtractor.new(
           x_tolerance: x_tolerance,
           y_tolerance: y_tolerance,
@@ -118,8 +121,8 @@ module Rpdfium
       private
-      # Versione "inlined" di Util::TextExtraction.extract_text che riusa
-      # un WordExtractor preesistente invece di crearlo ogni volta.
+      # "Inlined" version of Util::TextExtraction.extract_text that reuses
+      # a pre-existing WordExtractor instead of creating one every time.
       def extract_text_with(chars, word_extractor, y_tolerance)
         words = word_extractor.extract_words(chars)
         return "" if words.empty?
@@ -132,15 +135,15 @@ module Rpdfium
       def pad_cell_bbox(bbox, padding)
         x0, top, x1, bottom = bbox
-        # Estendi solo i bordi "interno-sinistro" e "interno-alto" per evitare
-        # di catturare char della cella adiacente destra/sotto.
+        # Extend only the "inner-left" and "inner-top" borders to avoid
+        # capturing chars from the adjacent cell to the right/below.
         [x0 - padding, top - padding, x1, bottom]
       end
-      # Test "char midpoint dentro bbox" — esattamente come pdfplumber.
-      # Il midpoint del char (non gli estremi della bbox) è il criterio:
-      # un char a cavallo del bordo viene assegnato alla cella in cui ha
-      # più "peso visivo".
+      # Test "char midpoint inside bbox" — exactly like pdfplumber.
+      # The char's midpoint (not the bbox extremes) is the criterion:
+      # a char straddling the border is assigned to the cell in which it
+      # has more "visual weight".
       def char_in_bbox?(char, bbox)
         x0, top, x1, bottom = bbox
         h_mid = (char[:x0] + char[:x1]) / 2.0
@@ -159,15 +162,15 @@ module Rpdfium
         end
       end
-      # Ricostruisce righe o colonne. axis 0 = x (per row clustering antiaxis=top),
-      # axis 1 = top (per column clustering antiaxis=x0). Usa il key invariante
-      # come "anchor" e il key variabile come ordering interno.
+      # Reconstructs rows or columns. axis 0 = x (for row clustering antiaxis=top),
+      # axis 1 = top (for column clustering antiaxis=x0). Uses the invariant key
+      # as "anchor" and the variable key as the internal ordering.
       def rows_or_columns(kind)
-        # Per row: sortBy = top, antiaxis = x0
-        # Per col: sortBy = x0, antiaxis = top
+        # For row: sortBy = top, antiaxis = x0
+        # For col: sortBy = x0, antiaxis = top
         sort_idx, group_idx = kind == :row ? [1, 0] : [0, 1]
-        # Tutti gli x0 (per row) o top (per col) distinti, sortati
+        # All distinct x0 (for row) or top (for col), sorted
         all_keys = @cells.map { |c| c[group_idx] }.uniq.sort
         # Group by sort_idx
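
The `extract` comments in this file describe three ideas: a midpoint-in-bbox test, a one-time sort by vertical midpoint with `bsearch` per row, and `cell_padding` applied only to the left and top borders. The sketch below puts them together in plain Ruby. All helper names (`vmid`, `pad_left_top`, `midpoint_in_bbox?`, `cell_text`) are hypothetical, the sample `x1` values are made up, and the boundary comparisons are an assumption, since the hunk does not show them. It is not rpdfium's implementation.

```ruby
# Illustrative sketch; helper names are hypothetical, not rpdfium's API.

# Vertical midpoint of a char hash carrying :top and :bottom.
def vmid(char)
  (char[:top] + char[:bottom]) / 2.0
end

# Pad only the left and top borders of a [x0, top, x1, bottom] bbox.
def pad_left_top(bbox, padding)
  x0, top, x1, bottom = bbox
  [x0 - padding, top - padding, x1, bottom]
end

# Midpoint-in-bbox test. Inclusive left/top and exclusive right/bottom
# are an assumption; the diff does not show the exact comparisons.
def midpoint_in_bbox?(char, bbox)
  x0, top, x1, bottom = bbox
  h_mid = (char[:x0] + char[:x1]) / 2.0
  v_mid = vmid(char)
  h_mid >= x0 && h_mid < x1 && v_mid >= top && v_mid < bottom
end

# Text of one cell. sorted_chars is ordered by vertical midpoint and vmids
# is the parallel array of those midpoints, so two binary searches find
# the band of candidate chars before the exact bbox test runs.
def cell_text(sorted_chars, vmids, bbox)
  _, top, _, bottom = bbox
  lo = vmids.bsearch_index { |m| m >= top } || vmids.size
  hi = vmids.bsearch_index { |m| m >= bottom } || vmids.size
  sorted_chars[lo...hi]
    .select { |c| midpoint_in_bbox?(c, bbox) }
    .sort_by { |c| c[:x0] }
    .map { |c| c[:text] }
    .join
end

# Sample data modelled on the "Intermediario" case: the "I" starts at
# x0=24.0 while the cell border sits at x=25.6 (x1 values are invented).
chars = [
  { text: "I", x0: 24.0, x1: 26.0, top: 10.0, bottom: 20.0 },
  { text: "n", x0: 26.0, x1: 31.0, top: 10.0, bottom: 20.0 },
  { text: "X", x0: 26.0, x1: 31.0, top: 40.0, bottom: 50.0 }
]
sorted_chars = chars.sort_by { |c| vmid(c) }
vmids = sorted_chars.map { |c| vmid(c) }
cell = [25.6, 8.0, 100.0, 22.0]

puts cell_text(sorted_chars, vmids, cell)                    # prints "n"
puts cell_text(sorted_chars, vmids, pad_left_top(cell, 2.0)) # prints "In"
```

Padding only two borders keeps a char that sits between two adjacent cells from being claimed by both, which is the reason the comments give for not padding all four sides.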

data/lib/rpdfium/util/cluster.rb CHANGED

@@ -2,30 +2,31 @@
 module Rpdfium
   module Util
-    # Primitive di clustering 1D usate da tutto il pipeline tabellare.
-    # Mappa diretta su `pdfplumber.utils.clustering` (cluster_list,
+    # 1D clustering primitives used throughout the table pipeline.
+    # Direct mapping onto `pdfplumber.utils.clustering` (cluster_list,
     # cluster_objects, make_cluster_dict).
     #
-    # PROPRIETÀ CHIAVE: questi cluster sono "1D agglomerative single-linkage":
-    # due valori finiscono nello stesso cluster se sono entro `tolerance` da
-    # un valore qualsiasi del cluster. NON solo dal centro/media. Ne consegue
-    # che catene di valori ravvicinati possono estendere il cluster ben oltre
-    # `tolerance` (questo è esattamente il comportamento di pdfplumber, e su
-    # cui si appoggiano le sue euristiche edge/intersection).
+    # KEY PROPERTY: these clusters are "1D agglomerative single-linkage":
+    # two values end up in the same cluster if they are within
+    # `tolerance` of any value in the cluster. NOT only of the
+    # center/mean. As a result, chains of close values can extend the
+    # cluster well beyond `tolerance` (this is exactly pdfplumber's
+    # behavior, on which its edge/intersection heuristics rely).
     module Cluster
       module_function
-      # Raggruppa valori scalari in cluster. I valori dentro lo stesso cluster
-      # sono entro `tolerance` da almeno un altro valore del cluster.
+      # Groups scalar values into clusters. The values within the same
+      # cluster are within `tolerance` of at least one other value of
+      # the cluster.
       #
-      # Esempio:
+      # Example:
       #   cluster_list([1.0, 1.5, 2.0, 5.0], tolerance: 1.0)
       #   #=> [[1.0, 1.5, 2.0], [5.0]]
       #
-      # NOTA: Catene "stepping stone": [1, 2, 3, 4] con tol=1 fanno UN cluster
-      # solo, anche se 1 e 4 distano 3. Questo è il comportamento di
-      # pdfplumber, è documentato nei suoi issue come potenzialmente
-      # sorprendente ma intenzionale. Lo manteniamo identico.
+      # NOTE: "Stepping stone" chains: [1, 2, 3, 4] with tol=1 form a
+      # SINGLE cluster, even though 1 and 4 are 3 apart. This is
+      # pdfplumber's behavior, documented in its issues as potentially
+      # surprising but intentional. We keep it identical.
       def cluster_list(values, tolerance: 0)
         return [] if values.empty?
@@ -41,22 +42,23 @@ module Rpdfium
         clusters
       end
-      # Raggruppa oggetti (Hash) in cluster basandosi su una funzione di
-      # estrazione `key_fn` (oppure simbolo Hash key) e tolleranza.
+      # Groups objects (Hash) into clusters based on an extraction
+      # function `key_fn` (or a Hash key symbol) and a tolerance.
       #
-      # Esempio:
+      # Example:
       #   cluster_objects(words, ->(w) { w[:top] }, tolerance: 1)
       #   cluster_objects(words, :top, tolerance: 1)   # syntactic sugar
       def cluster_objects(objects, key_fn, tolerance: 0, presorted: false)
         return [] if objects.empty?
-        # Fast path per il caso Symbol più comune (:top, :x0, :bottom):
-        # accesso diretto Hash[symbol] è ~2× più veloce della lambda call.
+        # Fast path for the most common Symbol case (:top, :x0, :bottom):
+        # direct Hash[symbol] access is ~2x faster than the lambda call.
         if key_fn.is_a?(Symbol)
-          # Se il chiamante garantisce che l'input è già sortato per key_fn
-          # (es. perché viene da un sort lessicografico [key_fn, ...]) si
-          # può saltare il sort interno. Risparmio significativo quando
-          # cluster_objects è chiamato in loop su molte righe piccole.
+          # If the caller guarantees that the input is already sorted by
+          # key_fn (e.g. because it comes from a lexicographic sort
+          # [key_fn, ...]) the internal sort can be skipped. A significant
+          # saving when cluster_objects is called in a loop over many
+          # small rows.
           sorted = presorted ? objects : objects.sort_by { |o| o[key_fn] }
           first = sorted.first
           last_key = first[key_fn]
@@ -78,7 +80,7 @@ module Rpdfium
           return clusters
         end
-        # Path generico con accessor callable
+        # Generic path with a callable accessor
         accessor = key_fn
         sorted = presorted ? objects : objects.sort_by { |o| accessor.call(o) }
         last_key = accessor.call(sorted.first)
@@ -96,8 +98,8 @@ module Rpdfium
         clusters
       end
-      # bbox = [x0, top, x1, bottom] (top-down). Ritorna la bbox che racchiude
-      # tutti gli oggetti passati. Usa min/max di x0/top/x1/bottom.
+      # bbox = [x0, top, x1, bottom] (top-down). Returns the bbox that
+      # encloses all the passed objects. Uses min/max of x0/top/x1/bottom.
       def objects_to_bbox(objects)
         objects.each_with_object(
           [Float::INFINITY, Float::INFINITY, -Float::INFINITY, -Float::INFINITY]
@@ -109,16 +111,16 @@ module Rpdfium
         end
       end
-      # Variante che ritorna un Hash invece di tuple — comoda nel contesto
-      # edge dove ci serve mescolare bbox+orientation.
+      # Variant that returns a Hash instead of a tuple — handy in the
+      # edge context where we need to mix bbox+orientation.
       def objects_to_rect(objects)
         x0, top, x1, bottom = objects_to_bbox(objects)
         { x0: x0, top: top, x1: x1, bottom: bottom,
           width: x1 - x0, height: bottom - top }
       end
-      # bbox sovrapposti. None overlap => nil. Match pdfplumber's
-      # get_bbox_overlap: ritorna la bbox di intersezione, oppure nil.
+      # Overlapping bbox. No overlap => nil. Matches pdfplumber's
+      # get_bbox_overlap: returns the intersection bbox, or nil.
       def bbox_overlap(a, b)
         ax0, atop, ax1, abot = a
         bx0, btop, bx1, bbot = b
@@ -133,8 +135,8 @@ module Rpdfium
         [x0, top, x1, bot]
       end
-      # True se due bbox si sovrappongono (anche solo a un punto è no, deve
-      # esserci area positiva).
+      # True if two bbox overlap (even just at a point is no; there must
+      # be positive area).
       def bbox_overlaps?(a, b)
         !bbox_overlap(a, b).nil?
       end
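
The single-linkage property and the positive-area overlap rule described in these comments can be reproduced in a few lines. The sketch below is written from the comments alone, since the method bodies are elided in the hunks. The `_sketch` names are hypothetical and this is not the gem's code.

```ruby
# Sketch of 1D single-linkage clustering, inferred from the comments above.
# A value joins the current cluster when it is within `tolerance` of the
# previous (largest so far) value, so chains can grow past `tolerance`.
def cluster_list_sketch(values, tolerance: 0)
  return [] if values.empty?

  sorted = values.sort
  clusters = [[sorted.first]]
  sorted.drop(1).each do |v|
    if v - clusters.last.last <= tolerance
      clusters.last << v
    else
      clusters << [v]
    end
  end
  clusters
end

# Intersection of two [x0, top, x1, bottom] bboxes, or nil when the
# intersection has no positive area (touching edges do not count).
def bbox_overlap_sketch(a, b)
  x0 = [a[0], b[0]].max
  top = [a[1], b[1]].max
  x1 = [a[2], b[2]].min
  bottom = [a[3], b[3]].min
  return nil unless x1 > x0 && bottom > top

  [x0, top, x1, bottom]
end

p cluster_list_sketch([1.0, 1.5, 2.0, 5.0], tolerance: 1.0)
# => [[1.0, 1.5, 2.0], [5.0]]
p cluster_list_sketch([1, 2, 3, 4], tolerance: 1)
# => [[1, 2, 3, 4]]  (the "stepping stone" chain)
p bbox_overlap_sketch([0, 0, 10, 10], [10, 0, 20, 10])
# => nil  (shared edge only)
```

Comparing each value only with its predecessor is enough here because the input is sorted: the predecessor is always the nearest member of the current cluster.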

data/lib/rpdfium/util/column_inference.rb CHANGED

@@ -2,26 +2,28 @@
 module Rpdfium
   module Util
-    # Inferenza di colonne dati su PDF non-tabellari.
+    # Inference of data columns on non-tabular PDFs.
     #
-    # Identifica gruppi di word che appartengono alla stessa "colonna"
-    # verticale di un layout (es. una colonna di importi in un modulo
-    # prestampato) anche quando non ci sono linee disegnate.
+    # Identifies groups of words that belong to the same vertical
+    # "column" of a layout (e.g. a column of amounts in a prestamped
+    # form) even when no lines are drawn.
     #
-    # L'algoritmo opera in tre passaggi:
+    # The algorithm operates in three passes:
     #
-    # 1. **Cluster per coordinata X** — raggruppa le word con la stessa x0
-    #    (left-aligned) o x1 (right-aligned, tipico dei numeri) entro la
-    #    tolleranza configurabile.
+    # 1. **Cluster by X coordinate** — groups words with the same x0
+    #    (left-aligned) or x1 (right-aligned, typical of numbers) within
+    #    the configurable tolerance.
     #
-    # 2. **Spezza per gap verticali** — se due word consecutive in un
-    #    gruppo hanno un gap verticale "anomalo" (> 3× la mediana, o
-    #    > 40pt), le separa in colonne distinte. Risolve casi tipo "codice
-    #    fiscale in alto + tabella sotto" che condividono la stessa X.
+    # 2. **Split by vertical gaps** — if two consecutive words in a
+    #    group have an "anomalous" vertical gap (> 3x the median, or
+    #    > 40pt), they are separated into distinct columns. Resolves
+    #    cases such as "fiscal code at the top + table below" that share
+    #    the same X.
     #
-    # 3. **Filtra per densità** — una colonna "vera" ha valori regolarmente
-    #    equispaziati (coefficiente di variazione dei gap < soglia). Esclude
-    #    falsi positivi come valori isolati che si trovano per caso allineati.
+    # 3. **Filter by density** — a "true" column has regularly
+    #    equispaced values (coefficient of variation of the gaps <
+    #    threshold). Excludes false positives such as isolated values
+    #    that happen to be aligned by chance.
     #
     # @example
     #   inference = Rpdfium::Util::ColumnInference.new(
@@ -31,8 +33,8 @@ module Rpdfium
     #   )
     #   columns = inference.infer(words)
     #   # => [
-    #   #   [word1, word2, ..., word12],   # 12 importi nella colonna 1
-    #   #   [word1, word2, ..., word12]    # 12 codici nella colonna 2
+    #   #   [word1, word2, ..., word12],   # 12 amounts in column 1
+    #   #   [word1, word2, ..., word12]    # 12 codes in column 2
     #   # ]
     class ColumnInference
       DEFAULT_X_TOLERANCE = 3.0
@@ -53,27 +55,27 @@ module Rpdfium
         @gap_absolute = gap_absolute
       end
-      # Inferisce le colonne dai word forniti. Usa sia x0 (left-align) che
-      # x1 (right-align) come criteri di allineamento, ritorna l'unione
-      # delle colonne identificate.
+      # Infers the columns from the supplied words. Uses both x0
+      # (left-align) and x1 (right-align) as alignment criteria, returns
+      # the union of the identified columns.
       #
-      # @param words [Array<Hash>] word con :x0, :x1, :top
-      # @return [Array<Array<Hash>>] array di colonne, ognuna è un array
-      #   di word ordinati per :top crescente
+      # @param words [Array<Hash>] words with :x0, :x1, :top
+      # @return [Array<Array<Hash>>] array of columns, each one an array
+      #   of words ordered by ascending :top
       def infer(words)
         return [] if words.empty?
         by_x0 = cluster_by(words, :x0)
         by_x1 = cluster_by(words, :x1)
-        # Unione: una word può apparire in più colonne. È compito del
-        # chiamante decidere come gestire (es. preferire la prima
-        # colonna, o quella più grande). Qui ritorniamo tutte.
+        # Union: a word may appear in more than one column. It is the
+        # caller's responsibility to decide how to handle this (e.g.
+        # prefer the first column, or the largest one). Here we return all.
         (by_x0 + by_x1)
       end
-      # Cluster di word per una specifica coordinata.
-      # @param coord [Symbol] :x0 o :x1
+      # Clusters words by a specific coordinate.
+      # @param coord [Symbol] :x0 or :x1
       def cluster_by(words, coord)
         sorted = words.sort_by { |v| v[coord] }
         x_groups = []
@@ -116,10 +118,10 @@ module Rpdfium
         columns
       end
-      # Una colonna è "abbastanza densa" se ha almeno min_size valori e
-      # il coefficiente di variazione (std_dev/mean) dei gap verticali è
-      # sotto la soglia. CV bassa = spacing regolare = colonna ripetitiva
-      # vera (vs. valori sparsi accidentalmente allineati).
+      # A column is "dense enough" if it has at least min_size values
+      # and the coefficient of variation (std_dev/mean) of the vertical
+      # gaps is below the threshold. Low CV = regular spacing = a true
+      # repetitive column (vs. scattered values accidentally aligned).
       def dense_enough?(col_values)
         return false if col_values.size < @min_size
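
Passes 2 and 3 of the algorithm described above (split on anomalous vertical gaps, then keep only regularly spaced columns) can be sketched as follows. The 3x-median and 40pt figures come from the comments; the `max_cv` value, the `min_size` default, and the helper names are assumptions made for illustration, and the actual method bodies are not visible in this diff.

```ruby
# Sketch of the gap-splitting and density passes; names and the max_cv
# default are hypothetical, not taken from rpdfium.

def median(xs)
  s = xs.sort
  mid = s.size / 2
  s.size.odd? ? s[mid] : (s[mid - 1] + s[mid]) / 2.0
end

# Split a group of words wherever the vertical gap between consecutive
# words exceeds gap_ratio times the median gap, or gap_absolute points.
def split_by_gaps(words, gap_ratio: 3.0, gap_absolute: 40.0)
  sorted = words.sort_by { |w| w[:top] }
  return [sorted] if sorted.size < 2

  gaps = sorted.each_cons(2).map { |a, b| b[:top] - a[:top] }
  med = median(gaps)
  groups = [[sorted.first]]
  sorted.drop(1).each_with_index do |word, i|
    if gaps[i] > gap_ratio * med || gaps[i] > gap_absolute
      groups << [word]
    else
      groups.last << word
    end
  end
  groups
end

# A column counts as dense when it has at least min_size words and the
# coefficient of variation (std_dev / mean) of its gaps is below max_cv.
def dense_enough_sketch?(words, min_size: 3, max_cv: 0.5)
  return false if words.size < min_size

  tops = words.map { |w| w[:top] }.sort
  gaps = tops.each_cons(2).map { |a, b| b - a }
  return false if gaps.empty?

  mean = gaps.sum / gaps.size.to_f
  return false if mean.zero?

  variance = gaps.sum { |g| (g - mean)**2 } / gaps.size
  Math.sqrt(variance) / mean < max_cv
end

words = [10.0, 22.0, 34.0, 46.0, 200.0, 212.0, 224.0].map { |t| { top: t } }
groups = split_by_gaps(words)
p groups.map(&:size)                          # => [4, 3]
p groups.map { |g| dense_enough_sketch?(g) }  # => [true, true]
```

In the sample, the 154pt jump between 46.0 and 200.0 is far above three times the 12pt median gap, so the run is split into two columns, each evenly spaced.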

data/lib/rpdfium/util/label_matcher.rb CHANGED

@@ -2,29 +2,29 @@
 module Rpdfium
   module Util
-    # Associa label semantiche a valori inseriti su PDF di moduli compilati
-    # (F24, comunicazioni IVA, modelli 770) dove template e dati coesistono
-    # come testo grafico in font diversi.
+    # Associates semantic labels with values placed on PDFs of filled-in
+    # forms (F24, VAT communications, Modello 770) where template and data
+    # coexist as graphic text in different fonts.
     #
-    # Strategia base:
+    # Base strategy:
     #
-    # 1. **Cluster** le parole del template in "label coerenti": word
-    #    geometricamente vicine formano un'unica label.
+    # 1. **Cluster** the template words into "coherent labels": words that
+    #    are geometrically close form a single label.
     #
-    # 2. **Per ogni valore** cerca:
-    #    - `:col` — label SOPRA in stessa colonna
-    #    - `:row` — label A SINISTRA in stessa riga
+    # 2. **For each value** search for:
+    #    - `:col` — the label ABOVE in the same column
+    #    - `:row` — the label TO THE LEFT in the same row
     #
-    # 3. (Opzionale) **Riassegnazione per colonne**: usa `ColumnInference`
-    #    per identificare colonne ripetitive (es. ST2..ST13 del 770 Quadro
-    #    ST) e propaga l'header canonico a tutti i valori della colonna,
-    #    superando il limite `col_max_dy`.
+    # 3. (Optional) **Column reassignment**: uses `ColumnInference` to
+    #    identify repetitive columns (e.g. ST2..ST13 of the 770 Quadro
+    #    ST) and propagates the canonical header to all the values in the
+    #    column, overriding the `col_max_dy` limit.
     #
-    # @example uso base
+    # @example basic usage
     #   matcher = Rpdfium::Util::LabelMatcher.new
     #   matcher.match(value_words, anchor_words)
     #
-    # @example con tabelle ripetitive (header in cima alla colonna)
+    # @example with repetitive tables (header at the top of the column)
     #   matcher = Rpdfium::Util::LabelMatcher.new(
     #     column_inference: Rpdfium::Util::ColumnInference.new
     #   )
@@ -60,11 +60,11 @@ module Rpdfium
         @column_inference = column_inference
       end
-      # Calcola le associazioni label → valore.
+      # Computes the label → value associations.
       #
-      # @param values [Array<Hash>] word del layer "dati"
-      # @param anchors [Array<Hash>] word del layer "template"
-      # @return [Array<Hash>] uno per valore: { value:, labels: { col:, row: }, geometry: }
+      # @param values [Array<Hash>] words of the "data" layer
+      # @param anchors [Array<Hash>] words of the "template" layer
+      # @return [Array<Hash>] one per value: { value:, labels: { col:, row: }, geometry: }
       def match(values, anchors)
         labels = cluster_anchors(anchors)
@@ -74,7 +74,7 @@ module Rpdfium
           { value: v, col: col, row: row }
         end
-        # Riassegnazione opzionale per colonne ripetitive
+        # Optional reassignment for repetitive columns
         prelim = reassign_by_columns(prelim, labels, values) if @column_inference
         prelim.map do |entry|
@@ -92,8 +92,8 @@ module Rpdfium
         end
       end
-      # Ricostruisce le label dal cluster delle word del template.
-      # Esposto pubblicamente per ispezione/debug.
+      # Reconstructs the labels from the cluster of template words.
+      # Exposed publicly for inspection/debug.
       def cluster_anchors(anchor_words)
         remaining = anchor_words.dup
         groups = []
@@ -145,10 +145,10 @@ module Rpdfium
       end
       def find_col_label(value, labels)
-        # Per word "wide" (più larghe della maggior parte delle label,
-        # tipicamente perché frutto di merge di una stringa che attraversa
-        # più colonne template) usa il left edge: la label corretta è
-        # quella sotto cui INIZIA il valore.
+        # For "wide" words (wider than most labels, typically because
+        # they result from the merge of a string that spans several
+        # template columns) use the left edge: the correct label is the
+        # one below which the value STARTS.
         value_width = value[:x1] - value[:x0]
         anchor_point =
           if value_width > WIDE_VALUE_THRESHOLD
@@ -175,14 +175,14 @@ module Rpdfium
         end.max_by { |l| l[:x1] }
       end
-      # Identifica colonne dati e propaga l'header canonico stampato in
-      # cima alla colonna a TUTTI i valori della colonna.
-      # Usa @column_inference fornito al constructor.
+      # Identifies data columns and propagates the canonical header
+      # printed at the top of the column to ALL the values of the column.
+      # Uses the @column_inference provided to the constructor.
       def reassign_by_columns(prelim, labels, values)
         columns = @column_inference.infer(values)
         return prelim if columns.empty?
-        # Ordina colonne più grandi prima (più evidenza statistica)
+        # Sort larger columns first (more statistical evidence)
         sorted_columns = columns.sort_by { |c| -c.size }
         column_headers = {}
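
To make the `:col` / `:row` lookup from step 2 concrete, here is a simplified sketch. Only the `max_by { |l| l[:x1] }` tie-break for row labels and the existence of a `col_max_dy` limit are visible in the diff. The geometric tests, the `max_dy` default, and the method names below are assumptions, so treat this as one plausible reading rather than the gem's logic.

```ruby
# Hypothetical sketch of label lookup. Labels and values are hashes with
# :x0, :x1, :top, :bottom in top-down coordinates.

# Label ABOVE the value in the same column: it ends above the value,
# lies within max_dy points, and horizontally contains the value's
# midpoint. The closest one (largest :bottom) wins.
def col_label_for(value, labels, max_dy: 60.0)
  x_mid = (value[:x0] + value[:x1]) / 2.0
  labels.select do |l|
    l[:bottom] <= value[:top] &&
      value[:top] - l[:bottom] <= max_dy &&
      x_mid >= l[:x0] && x_mid <= l[:x1]
  end.max_by { |l| l[:bottom] }
end

# Label TO THE LEFT of the value in the same row: it ends left of the
# value and vertically contains the value's midpoint. The nearest one
# (largest :x1) wins.
def row_label_for(value, labels)
  v_mid = (value[:top] + value[:bottom]) / 2.0
  labels.select do |l|
    l[:x1] <= value[:x0] && v_mid >= l[:top] && v_mid <= l[:bottom]
  end.max_by { |l| l[:x1] }
end

labels = [
  { text: "Importo", x0: 100.0, x1: 160.0, top: 50.0, bottom: 60.0 },
  { text: "Codice",  x0: 20.0,  x1: 60.0,  top: 80.0, bottom: 90.0 }
]
value = { text: "123,45", x0: 110.0, x1: 150.0, top: 80.0, bottom: 90.0 }

p col_label_for(value, labels)[:text]  # => "Importo"
p row_label_for(value, labels)[:text]  # => "Codice"
```

The sketch uses the value's horizontal midpoint as the anchor. According to the `find_col_label` comment, the real code switches to the left edge for values wider than `WIDE_VALUE_THRESHOLD`, which this simplified version omits.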

data/lib/rpdfium/util/text_extraction.rb CHANGED

@@ -2,19 +2,19 @@
 module Rpdfium
   module Util
-    # Estrazione testo "lineare" da una collezione di char, layout=False.
-    # Equivalente di pdfplumber.utils.text.chars_to_textmap nella variante
-    # senza preservazione del layout grafico.
+    # "Linear" text extraction from a collection of chars, layout=False.
+    # Equivalent of pdfplumber.utils.text.chars_to_textmap in the variant
+    # without preservation of the graphic layout.
     #
-    # Algoritmo:
-    #   1. Estrai words con WordExtractor (gli stessi tolerance).
-    #   2. Cluster di words per `top` con y_tolerance → righe logiche.
-    #   3. Per ogni riga, ordina per x0 e joina con singolo spazio.
-    #   4. Joina le righe con "\n".
+    # Algorithm:
+    #   1. Extract words with WordExtractor (same tolerances).
+    #   2. Cluster words by `top` with y_tolerance → logical lines.
+    #   3. For each line, sort by x0 and join with a single space.
+    #   4. Join the lines with "\n".
     #
-    # NOTA su una sottigliezza: pdfplumber permette di usare x_tolerance
-    # diverso da y_tolerance sia per word-extraction che per line-clustering.
-    # Replichiamo questa flessibilità.
+    # NOTE on a subtlety: pdfplumber allows using an x_tolerance different
+    # from y_tolerance both for word-extraction and for line-clustering.
+    # We replicate this flexibility.
     module TextExtraction
       module_function
@@ -34,12 +34,12 @@ module Rpdfium
         ).extract_words(chars)
         return "" if words.empty?
-        # Cluster delle WORDS per top: righe di output finali.
-        # Usa y_tolerance "di linea" — pdfplumber qui usa la stessa y_tolerance
-        # passata, ed è coerente con come si comporta extract_text.
+        # Cluster the WORDS by top: final output lines.
+        # Uses the "line" y_tolerance — pdfplumber here uses the same
+        # y_tolerance passed in, consistent with how extract_text behaves.
         line_clusters = Cluster.cluster_objects(words, :top, tolerance: y_tolerance)
-        # Per ogni riga di output joina con spazio singolo.
+        # For each output line, join with a single space.
         line_clusters.map do |line_words|
           line_words.sort_by { |w| w[:x0] }.map { |w| w[:text] }.join(" ")
         end.join("\n")
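
Steps 2 to 4 of the algorithm in the header comment (cluster words by `top`, sort each line by `x0`, join with spaces and newlines) can be tried without PDFium at all. The sketch below starts from already-extracted word hashes and inlines a minimal single-linkage clustering in place of `Cluster.cluster_objects`. The name `words_to_text` and the `y_tolerance` default of 3.0 are assumptions; the gem's actual default is not shown in this diff.

```ruby
# Sketch of "linear" text reconstruction from word hashes that carry
# :text, :x0 and :top. Not rpdfium's code; names are hypothetical.
def words_to_text(words, y_tolerance: 3.0)
  return "" if words.empty?

  # Group words into logical lines: a word joins the current line when
  # its :top is within y_tolerance of the previous word's :top.
  sorted = words.sort_by { |w| w[:top] }
  lines = [[sorted.first]]
  sorted.drop(1).each do |w|
    if w[:top] - lines.last.last[:top] <= y_tolerance
      lines.last << w
    else
      lines << [w]
    end
  end

  # Within each line, order left to right and join with a single space.
  lines.map do |line|
    line.sort_by { |w| w[:x0] }.map { |w| w[:text] }.join(" ")
  end.join("\n")
end

words = [
  { text: "world",  x0: 50.0, top: 10.5 },
  { text: "Hello",  x0: 10.0, top: 10.0 },
  { text: "second", x0: 10.0, top: 30.0 }
]
puts words_to_text(words)
# prints:
# Hello world
# second
```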