acroforge 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 700f9e41b86a58ef21a6f5d045907c1df97ae55c9d2728a465cd7b0f3ecf828e
4
- data.tar.gz: 9859e634dbf6236e17ee1392ccf1239c8edf494be6072272218f6e2c58092c33
3
+ metadata.gz: c97cab951d5efe0d85539bcc55b368277b95bd1d6595141459be383a1a818c2e
4
+ data.tar.gz: 7156afa0720c474b3ee0b287cb994ccb3438c1f58cd10ff297673d19d6b34a30
5
5
  SHA512:
6
- metadata.gz: 29c17e2e6958e5772cab51aa25a11187cc8d1f08cc9ccf7b2add174abacee05682b44f3f38f54dd50e6755084d5ccd5949407373da0d5d9a0e81de4257ed2348
7
- data.tar.gz: eebeb0e39a636cba72f879cb6b4d31af3a670a01ae3692a86e47d5d248eb8a20ba656ded433f7c17f1fdf4b8d29041c4751bb2b78f2fff141c660798bbd1f805
6
+ metadata.gz: 35a79c84d8dd14affaa2e99a05c1b40fa7df8942e3e217adcaf05457de1e3303b49f7c9029e4768fa547ea83f41f96be7ece12cd889bc3efac37424a2d97ebbb
7
+ data.tar.gz: e0c282b00ddc2f38321a1e2979028886c614fbea59651081f2a9f2d581b2c07b5abb41374ba1a078df2fd7f382b81a7ccf774bd4b5e16aff089487410852020b
data/CHANGELOG.md CHANGED
@@ -1,5 +1,34 @@
1
1
  # CHANGELOG
2
2
 
3
+ ## [0.2.0] - 2026-05-28
4
+
5
+ ### Changed (breaking)
6
+
7
+ - **`_btn` suffix dropped from canonical keys.** Button/choice fields now use bare canonical keys (`gender`, `title`, `marital_status`) instead of `gender_btn` etc. Type detection for validation reads the PDF field type and the resolved options map, not a string token in the key. Existing payloads/schemas/mappings using `_btn` need their keys updated.
8
+ - **`fill!` is now standalone.** It no longer requires `compile!` to have run, and it no longer falls back to the normalized PDF — it always fills the original `@template_path`. Stale normalized files can no longer silently take over a fill. To fill a renamed template, instantiate a new `Engine` over the normalized output.
9
+ - **Bad radio values now raise instead of warn-and-skip.** When `fill!` cannot match a value to a radio button's allowed options, it raises `AcroForge::Error` instead of logging a warning and continuing.
10
+ - **Engine introspection renamed:** `raw_fields` → `fields`, `raw_field_names` → `field_names`, `any_raw_fields?` → `any_fields?`.
11
+ - **Schemas emitted by `Schema.infer` / `Schema.merge` no longer carry empty `variations: []`.** When a field has no spatial label worth recording (typically anything preserved verbatim), the key is omitted entirely. Existing schemas with `variations:` still load fine.
12
+ - **Parenthetical content stripped** from both canonical keys and humanized variations. `Date of Birth (YYYY-MM-DD)` now produces `:date_of_birth` (not `:date_of_birth_yyyy_mm_dd`) and the variation `"Date of Birth"`.
13
+
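For callers migrating stored payloads or mappings, the `_btn` rename above can be sketched as a key rewrite. This is a minimal sketch and not part of acroforge: `drop_btn_suffix` and `migrate_payload` are hypothetical helper names. It assumes old keys looked like `gender_btn` or `gender_btn_1` (with the `_N` collision suffix), which is what the 0.1.0 lookup regex `/_btn(?:_\d+)?\z/` suggests.

```ruby
# Hypothetical migration helper; not part of the acroforge API.
# Removes a trailing `_btn` token while keeping any `_N` collision suffix.
def drop_btn_suffix(key)
  key.to_s.sub(/_btn(?=(?:_\d+)?\z)/, "").to_sym
end

# Rewrites every key of a payload Hash; values are passed through untouched.
def migrate_payload(payload)
  payload.each_with_object({}) do |(key, value), migrated|
    migrated[drop_btn_suffix(key)] = value
  end
end

old_payload = { gender_btn: "male", title_btn_1: "Mr", full_name: "Alice" }
new_payload = migrate_payload(old_payload)
# new_payload => { gender: "male", title_1: "Mr", full_name: "Alice" }
```

If a payload already contains both `gender` and `gender_btn`, the later key wins in this sketch, so such collisions should be checked before migrating.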
14
+ ### Added
15
+
16
+ - **`preserve:` constructor kwarg on `Engine`** — explicit allowlist of AcroForm field names `compile!` must never rename. Plumbed through `Schema.infer` and `Relabeler.propose`.
17
+ - **Auto-preserve cascade in `compile!`** — fields are kept verbatim when they are (A) in `preserve:`, (B) a schema canonical key, or (C) look like a clean snake_case identifier (`looks_like_clean_identifier?`). Tier C is what makes `schema infer` on a clean PDF stop generating `mr` / `single` / section-header bleeds.
18
+ - **Case-insensitive radio matching** in `fill!` — `"Mr"` resolves to `:mr` against `allowed_values` without forcing the caller to know the export-value casing.
19
+ - **Automatic image stamping in `fill!`** — a payload value that targets a push-button image-upload field and points to a JPG/PNG file path is now drawn into the widget rectangle (scaled to fit) instead of being treated as text. `compile!` infers canonical keys for these slots from widget geometry: square-ish → `:passport_photo`, wide-and-thin → `:signature`, otherwise `:photo`.
20
+ - **Image validation with new error classes** — stamped images must be JPG or PNG within `MAX_IMAGE_BYTES` (5 MB) and `MAX_IMAGE_DIMENSION` (4000 px per side); violations raise the new `AcroForge::ImageTooLargeError` or `AcroForge::UnsupportedImageFormatError` (both subclass `AcroForge::Error`). When ImageMagick (`convert`) is on `PATH`, oversized images are downsampled toward `TARGET_PPI` (200) and transparent PNG borders (e.g. around a signature) are trimmed before stamping.
21
+ - **`simplecov`** in the dev/test group; coverage report under `coverage/`.
22
+
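The image-slot geometry described in the stamping entry above can be illustrated outside the engine. The sketch below restates two private `Engine` helpers from this release (`infer_image_field_key` and `fit_inside`) as free functions so they can be run in isolation. Rectangles are assumed to be `[x_min, y_min, x_max, y_max]` in PDF points, and the sample rectangles are made up for illustration.

```ruby
# Standalone restatement of the release's aspect-ratio heuristic.
# Returns nil for degenerate rectangles.
def infer_image_field_key(rect)
  width = rect[2] - rect[0]
  height = rect[3] - rect[1]
  return nil if width <= 0 || height <= 0

  aspect_ratio = width.to_f / height
  if aspect_ratio > 3.0
    :signature                       # wide and thin
  elsif aspect_ratio.between?(0.5, 2.0) && [width, height].min >= 30
    :passport_photo                  # square-ish and not tiny
  else
    :photo
  end
end

# Scales an image to fit inside a slot while keeping its aspect ratio.
def fit_inside(image_width, image_height, slot_width, slot_height)
  scale = [slot_width.to_f / image_width, slot_height.to_f / image_height].min
  [image_width * scale, image_height * scale]
end

infer_image_field_key([0, 0, 100, 100]) # => :passport_photo
infer_image_field_key([0, 0, 300, 50])  # => :signature
infer_image_field_key([0, 0, 20, 20])   # => :photo (too small for a headshot)
fit_inside(4000, 2000, 100, 100)        # => [100.0, 50.0]
```

Taking the smaller of the two scale factors means the image never overflows the slot on either axis, and the engine then centres the leftover space.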
23
+ ### Fixed
24
+
25
+ - Radio button widget appearance state (`/AS`) wasn't syncing when `fill!` operated standalone (no compile-time `:TU` options map). Radios now coerce values to a Symbol matched against `allowed_values`, so hexapdf syncs widget state correctly.
26
+ - `Schema.infer_type` was missing matches on snake_case names (`account_no`, `date_signed`) because `_` is a word character — regex `\b` never fired. Underscores are now normalized to spaces before type detection.
27
+ - `Schema.infer_type` now recognises `\bdob\b` as a date.
28
+ - `Schema.merge` no longer reinstates the empty `variations: []` keys that `aggregate_proposals` strips.
29
+ - `compile!` was reading `field[:Opt]` for choice fields via `is_a?(Array)`, but hexapdf 1.x returns `HexaPDF::PDFArray` (not `Array`). The check is now duck-typed via `respond_to?(:each)`, so choice-field options maps are populated correctly.
30
+ - Spatial `:standard` scorer no longer claims a label sitting two rows above a field — tightened `dy_top` window and gave inline (label-left) matches a stronger preference over grid-locked (label-above) candidates. This recovers `first_name` on multi-column form layouts like Abokobi's.
31
+
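The `\b` fix above rests on a regex detail that is easy to check: `_` counts as a word character, so there is no word boundary between `date` and `_signed`. The sketch below uses an illustrative pattern, not the gem's actual type-detection regex, and `date_like?` is a hypothetical helper.

```ruby
# Illustrative pattern only; acroforge's real detection regex is not shown in this diff.
DATE_PATTERN = /\b(?:date|dob)\b/

# Normalizes underscores to spaces before matching, as the fix describes.
def date_like?(name)
  name.to_s.tr("_", " ").match?(DATE_PATTERN)
end

"date_signed".match?(DATE_PATTERN)  # => false, no boundary before "_"
date_like?("date_signed")           # => true
date_like?("customer_dob")          # => true
date_like?("update")                # => false, "date" is not a whole word
```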
3
32
  ## [0.1.0] - 2026-05-26
4
33
 
5
34
  ## Added
data/README.md CHANGED
@@ -80,19 +80,29 @@ Exit codes: `0` success, `1` user error (bad args, missing file), `2` validation
80
80
  ```ruby
81
81
  require "acroforge"
82
82
 
83
- # Compile a PDF and inspect what the heuristic found.
83
+ # Inspect the fields a PDF declares (no compilation needed).
84
+ engine = AcroForge::Engine.new("form.pdf")
85
+ engine.fields # => [{ name: "full_name", type: :text, alternate_name: nil }, ...]
86
+ engine.field_names # => ["full_name", ...]
87
+ engine.any_fields? # => true
88
+
89
+ # Fill an already-clean PDF — fill! is standalone, compile! is NOT required.
90
+ engine.fill!({ full_name: "Alice", email: "alice@example.com" }, "filled.pdf")
91
+
92
+ # Compile a messy PDF to clean its field names via the spatial heuristic.
84
93
  engine = AcroForge::Engine.new(
85
94
  "form.pdf",
86
- schema: AcroForge::Schema.load("schema.yml"), # or pass a Hash directly
87
- overrides: {}, # optional per-PDF overrides
88
- sections: ["Personal Details", "Loan Details"] # optional section headers for scoping
95
+ schema: AcroForge::Schema.load("schema.yml"), # or pass a Hash directly
96
+ overrides: {}, # per-PDF target-key overrides
97
+ preserve: %w[some_field_to_keep_verbatim], # allowlist of names to never rename
98
+ sections: ["Personal Details", "Loan Details"] # optional section headers for scoping
89
99
  )
90
100
  result = engine.compile!
91
101
  # => { mapped: {...}, unmapped: [...], select_options: {...}, new_fields_detected: [...] }
102
+ # Writes a `<form>_normalized.pdf` next to the source.
92
103
 
93
- # Fill a form with a payload.
94
- engine.validate_payload!(full_name: "Alice", email: "alice@example.com")
95
- engine.fill!({ full_name: "Alice", email: "alice@example.com" }, "filled.pdf")
104
+ # To fill a compiled (normalized) template, point a new Engine at it:
105
+ AcroForge::Engine.new("form_normalized.pdf").fill!(payload, "filled.pdf")
96
106
 
97
107
  # Generate a starter schema from a PDF.
98
108
  schema = AcroForge::Schema.infer("form.pdf")
@@ -107,9 +117,20 @@ AcroForge::Validator.valid?("alice@example.com", :email) # => true
107
117
  AcroForge::Validator.valid?("not a date", :date) # => false
108
118
  ```
109
119
 
120
+ ### Preserve cascade
121
+
122
+ `compile!` keeps an existing field name verbatim — skipping the spatial heuristic for that field — when any of these match:
123
+
124
+ 1. **Explicit:** the name is in the `preserve:` kwarg.
125
+ 2. **Schema:** the name is already a canonical key in the supplied `schema:`.
126
+ 3. **Heuristic:** the name looks like a clean snake_case identifier (no `pageN_fieldM` / `TextN` / `ImageN` markers).
127
+
128
+ The cascade is what stops `Schema.infer` on a clean PDF from synthesising garbage keys derived from adjacent radio-option labels or section headers.
129
+
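As a rough illustration of the tier order described in this section, the cascade can be sketched as a single decision function. `preserve_source` is a hypothetical name; the identifier check copies the regexes from this release's `looks_like_clean_identifier?`. The sketch leaves out one detail of the real engine, which also tries a sanitized form of the name against the schema.

```ruby
# Copied from the release's private helper so the sketch runs standalone.
def looks_like_clean_identifier?(name)
  return false unless name.is_a?(String) && !name.empty?
  return false unless name.match?(/\A[a-z][a-z0-9_]*\z/)
  return false if name.match?(/\A(?:page|text|field|image)\d/)
  return false if name.match?(/_(?:field|text|image)\d/)
  true
end

# Hypothetical helper: returns which tier preserved the name, or nil when
# the field would fall through to the spatial heuristic.
def preserve_source(name, preserve: [], schema: {})
  return :explicit if Array(preserve).map(&:to_s).include?(name)
  return :schema if schema.key?(name.to_sym)
  return :heuristic if looks_like_clean_identifier?(name)
  nil
end

preserve_source("Text1", preserve: %w[Text1])           # => :explicit
preserve_source("email", schema: { email: :email })     # => :schema
preserve_source("full_name")                            # => :heuristic
preserve_source("page0_field6")                         # => nil
```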
110
130
  ### Errors
111
131
 
112
132
  - `AcroForge::ValidationError`: raised by `Engine#validate_payload!` on type mismatch.
133
+ - `AcroForge::Error`: raised by `Engine#fill!` when the PDF rejects a value (e.g. a radio value that doesn't match any allowed option).
113
134
  - `AcroForge::RelabelError`: raised by `Relabeler.apply!` on malformed mapping YAML, invalid key names, or missing AcroForm.
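The radio-value failure mode listed above can be sketched without a PDF. This is a simplified stand-in, not the engine's code path: `RadioValueError` and `resolve_radio_value` are hypothetical names, and the real `fill!` lets hexapdf reject the value and then re-raises it as `AcroForge::Error`. The case-insensitive lookup mirrors the matching this release adds.

```ruby
# Hypothetical stand-in for AcroForge::Error in this standalone sketch.
RadioValueError = Class.new(StandardError)

# Finds the allowed option that equals `value` ignoring case,
# or raises when nothing matches.
def resolve_radio_value(value, allowed_values)
  match = allowed_values.find { |option| option.to_s.casecmp?(value.to_s) }
  raise RadioValueError, "#{value.inspect} is not one of #{allowed_values.inspect}" unless match

  match
end

resolve_radio_value("Mr", [:mr, :mrs]) # => :mr

begin
  resolve_radio_value("Dr", [:mr, :mrs])
rescue RadioValueError => e
  warn "fill aborted: #{e.message}"
end
```

Callers that previously relied on the warn-and-skip behaviour would need a similar `rescue` around `fill!` if they want to keep going after a bad value.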
114
135
 
115
136
  ## Schema format
@@ -157,7 +178,7 @@ AcroForge also accepts a legacy "shorthand" form where the value is just an arra
157
178
  _meta:
158
179
  source_pdf: broken_form.pdf
159
180
  generated_at: 2026-05-26T14:32:11Z
160
- acroforge_version: 0.1.0
181
+ acroforge_version: 0.2.0
161
182
  total_fields: 98
162
183
 
163
184
  page0_field6:
@@ -11,16 +11,27 @@ require_relative "constants"
11
11
  require_relative "labels"
12
12
 
13
13
  module AcroForge
14
+ ImageTooLargeError = Class.new(Error)
15
+ UnsupportedImageFormatError = Class.new(Error)
16
+
14
17
  class Engine
18
+ # Caps a phone-camera passport photo from bloating the output PDF.
19
+ MAX_IMAGE_BYTES = 5 * 1024 * 1024
20
+ MAX_IMAGE_DIMENSION = 4000
21
+ # Auto-downsample images whose pixel resolution far exceeds this PPI
22
+ # at the widget's rendered size. Requires ImageMagick on PATH.
23
+ TARGET_PPI = 200
24
+
15
25
  attr_reader :template_path, :schema, :overrides, :sections, :normalized_path,
16
26
  :mapped_fields, :unmapped_fields, :filled_fields, :missing_fields,
17
27
  :select_field_options, :new_fields_detected
18
28
 
19
- def initialize(template_path, schema: {}, overrides: {}, sections: [], normalized_dir: nil)
29
+ def initialize(template_path, schema: {}, overrides: {}, sections: [], preserve: [], normalized_dir: nil)
20
30
  @template_path = template_path
21
31
  @schema = schema
22
32
  @overrides = overrides
23
33
  @sections = sections
34
+ @preserve = Array(preserve).map(&:to_s)
24
35
 
25
36
  dir = normalized_dir || File.dirname(template_path)
26
37
  base = File.basename(template_path, ".*")
@@ -47,7 +58,7 @@ module AcroForge
47
58
  @source_form ||= source_doc.acro_form(create: false)
48
59
  end
49
60
 
50
- def raw_fields
61
+ def fields
51
62
  return [] unless source_form
52
63
  extracted = []
53
64
  source_form.each_field do |field|
@@ -62,12 +73,12 @@ module AcroForge
62
73
  extracted
63
74
  end
64
75
 
65
- def raw_field_names
66
- raw_fields.map { |f| f[:name] }
76
+ def field_names
77
+ fields.map { |f| f[:name] }
67
78
  end
68
79
 
69
- def any_raw_fields?
70
- raw_fields.any?
80
+ def any_fields?
81
+ fields.any?
71
82
  end
72
83
 
73
84
  def fully_mapped?
@@ -108,9 +119,10 @@ module AcroForge
108
119
  index
109
120
  end
110
121
 
111
- # ------------------------------------------
112
- # PHASE 1: THE HIERARCHICAL COMPILER
113
- # ------------------------------------------
122
+ # Phase one of the discover→fill pipeline: run the spatial heuristic over
123
+ # every field, rename each to its proposed semantic key in place, and
124
+ # persist the result to @normalized_path. The returned hash is what the
125
+ # Relabeler and Schema.infer consume.
114
126
  def compile!
115
127
  puts ">> Compiling template: #{@template_path}"
116
128
  form = source_doc.acro_form(create: true)
@@ -161,10 +173,70 @@ module AcroForge
161
173
  is_btn = field.is_a?(HexaPDF::Type::AcroForm::ButtonField) || field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
162
174
  is_radio_group = is_btn && field.each_widget.count > 1
163
175
 
176
+ preserved_key = nil
177
+ preserved_source = nil
178
+ if @preserve.include?(original_field_name) || @preserve.include?(base_field_name)
179
+ preserved_key = base_field_name
180
+ preserved_source = :explicit
181
+ elsif @schema && !@schema.empty?
182
+ # Raw name first — sanitize_key over-corrects "social" → "sofficial".
183
+ if @schema.key?(base_field_name.to_sym)
184
+ preserved_key = base_field_name
185
+ preserved_source = :schema
186
+ elsif (canonical = sanitize_key(base_field_name)) && @schema.key?(canonical.to_sym)
187
+ preserved_key = canonical.to_s
188
+ preserved_source = :schema
189
+ end
190
+ end
191
+
192
+ # Heuristic fallback for Schema.infer, which compiles with no schema yet.
193
+ if preserved_key.nil? && looks_like_clean_identifier?(base_field_name)
194
+ preserved_key = base_field_name
195
+ preserved_source = :heuristic
196
+ end
197
+
198
+ if preserved_key
199
+ target_key = preserved_key.to_sym
200
+ @mapped_fields[original_field_name] = target_key
201
+
202
+ y_center = (widget[:Rect][1] + widget[:Rect][3]) / 2.0
203
+ active_section = get_active_section(section_map, page_index, y_center)
204
+
205
+ preserved_opts = preserved_options_map(field)
206
+ # Mirror what the renamed-field path does so validate_payload! and fill!'s :TU
207
+ # branch see the same option map for preserved buttons/choices.
208
+ if preserved_opts && !preserved_opts.empty?
209
+ @select_field_options[target_key.to_s] = preserved_opts
210
+ field[:TU] = preserved_opts.to_json
211
+ end
212
+
213
+ @field_proposals << {
214
+ pdf_field_name: original_field_name,
215
+ pdf_field_type: case field
216
+ when HexaPDF::Type::AcroForm::TextField then :text
217
+ when HexaPDF::Type::AcroForm::ButtonField then :button
218
+ when HexaPDF::Type::AcroForm::ChoiceField then :choice
219
+ else :other
220
+ end,
221
+ canonical_key: target_key,
222
+ raw_label: preserved_key,
223
+ confidence: :preserved,
224
+ section: active_section,
225
+ page: page_index,
226
+ y: y_center,
227
+ x: (widget[:Rect][0] + widget[:Rect][2]) / 2.0,
228
+ options: preserved_opts
229
+ }
230
+
231
+ puts " [Preserved] '#{original_field_name}' (#{preserved_source}) -> :#{target_key}"
232
+ next
233
+ end
234
+
164
235
  options_map = nil
165
236
 
166
237
  if is_radio_group
167
- # THE FIX: Sort by Highest Y, then Leftmost X to guarantee finding the top-left box of multi-line groups
238
+ # A group's label sits by its top-left widget, so order by highest Y
239
+ # then leftmost X to find it — widget enumeration order is arbitrary.
168
240
  first_widget = field.each_widget.min_by { |w| [-w[:Rect][1], w[:Rect][0]] }
169
241
 
170
242
  raw_label = find_nearest_text(page_text_map[page_index], first_widget[:Rect], mode: :group_label)
@@ -220,10 +292,19 @@ module AcroForge
220
292
 
221
293
  ["no", "false", "off", "0", "unchecked"].each { |k| options_map[k] = "Off" }
222
294
 
295
+ # Image-upload (push-button) fields don't sit beside an inline
296
+ # label — their slot IS the label. Infer the canonical key from
297
+ # the widget geometry: square → passport_photo, wide-thin → signature.
298
+ if field.respond_to?(:push_button?) && field.push_button?
299
+ inferred = infer_image_field_key(widget[:Rect])
300
+ raw_label = inferred.to_s.tr("_", " ") if inferred
301
+ end
302
+
223
303
  elsif field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
224
304
  # Choice fields can expose values via /Opt entries.
225
305
  options_map = {}
226
- if field[:Opt].is_a?(Array)
306
+ # `field[:Opt]` is a HexaPDF::PDFArray on hexapdf 1.x — not a plain Array.
307
+ if field[:Opt].respond_to?(:each)
227
308
  field[:Opt].each do |opt|
228
309
  if opt.is_a?(Array)
229
310
  export_val = opt[0].to_s
@@ -264,8 +345,7 @@ module AcroForge
264
345
  override_entry = @overrides[original_field_name.to_s] || @overrides[original_field_name.to_sym]
265
346
  if override_entry
266
347
  semantic_name = override_entry[:key] || override_key_used
267
- mapped_semantic = semantic_name.to_sym
268
- target_key = (is_btn && !mapped_semantic.to_s.end_with?("_btn")) ? :"#{mapped_semantic}_btn" : mapped_semantic
348
+ target_key = semantic_name.to_sym
269
349
 
270
350
  # Ensure uniqueness when multiple fields map to the same semantic key
271
351
  original_target = target_key
@@ -319,7 +399,6 @@ module AcroForge
319
399
 
320
400
  target_key = active_section ? :"#{active_section}_#{base_key}" : base_key
321
401
  target_key = @overrides[raw_label].to_sym if @overrides[raw_label]
322
- target_key = :"#{target_key}_btn" if is_btn && !target_key.to_s.end_with?("_btn")
323
402
 
324
403
  original_target = target_key
325
404
  counter = 1
@@ -397,19 +476,16 @@ module AcroForge
397
476
  }
398
477
  end
399
478
 
400
- # ------------------------------------------
401
- # PHASE 2: THE CRASH-PROOF INJECTOR
402
- # ------------------------------------------
479
+ # Phase two: inject a payload into a name-addressable form and write
480
+ # output_path. Standalone it does not depend on compile! having run.
403
481
  def fill!(payload, output_path, image_overlays = {})
404
- puts ">> Injecting data into: #{@normalized_path}"
405
-
406
- unless File.exist?(@normalized_path)
407
- raise "Normalized template missing. Please run compile! first."
408
- end
482
+ # Always the original template — a stale normalized PDF would silently fill the wrong document.
483
+ # To fill a normalized PDF, instantiate a new Engine pointing at it.
484
+ puts ">> Injecting data into: #{@template_path}"
409
485
 
410
486
  validate_payload!(payload)
411
487
 
412
- normalized_doc = HexaPDF::Document.open(@normalized_path)
488
+ normalized_doc = HexaPDF::Document.open(@template_path)
413
489
  form = normalized_doc.acro_form
414
490
 
415
491
  @filled_fields = {}
@@ -429,6 +505,13 @@ module AcroForge
429
505
 
430
506
  if doc_field
431
507
  begin
508
+ if image_upload?(doc_field) && image_path?(value)
509
+ stamp_image_on_widget(normalized_doc, doc_field, value)
510
+ @filled_fields[key] = value
511
+ puts " [Stamped] :#{key} <- #{value}"
512
+ next
513
+ end
514
+
432
515
  if doc_field.is_a?(HexaPDF::Type::AcroForm::ButtonField) ||
433
516
  doc_field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
434
517
  resolved_from_map = false
@@ -464,7 +547,15 @@ module AcroForge
464
547
  normalized_val = value.to_s.downcase.strip
465
548
  on_state_sym = button_on_states(doc_field).first || :Yes
466
549
 
467
- if ["true", "yes", "on", "1"].include?(normalized_val)
550
+ # Radios first — their "0"/"1" export values would otherwise
551
+ # be swallowed by the checkbox truthy/falsy branches below.
552
+ if doc_field.respond_to?(:radio_button?) && doc_field.radio_button?
553
+ # Case-insensitive match against allowed_values so "Mr" finds :mr.
554
+ # Symbol assignment triggers hexapdf to sync each widget's :AS.
555
+ allowed = doc_field.allowed_values || []
556
+ target = allowed.find { |v| v.to_s.casecmp(value.to_s).zero? } || value.to_sym
557
+ doc_field.field_value = target
558
+ elsif ["true", "yes", "on", "1"].include?(normalized_val)
468
559
  doc_field.field_value = on_state_sym.to_s
469
560
  doc_field.each_widget { |w| w[:AS] = on_state_sym }
470
561
  elsif ["false", "no", "off", "0"].include?(normalized_val)
@@ -487,7 +578,7 @@ module AcroForge
487
578
  @filled_fields[key] = value
488
579
  puts " [Filled] :#{key} = #{value}"
489
580
  rescue HexaPDF::Error => e
490
- puts " [Warning] Rejected :#{key} - PDF formatting conflict (#{e.message.split(" (HexaPDF").first})"
581
+ raise AcroForge::Error, "Field :#{key} rejected by PDF: #{e.message.split(" (HexaPDF").first}"
491
582
  end
492
583
  else
493
584
  @missing_fields << key
@@ -526,8 +617,218 @@ module AcroForge
526
617
  :low
527
618
  end
528
619
 
620
+ # Lets Schema.infer classify radios/choices as :select instead of :boolean.
621
+ # Returns an empty hash (never nil) so callers can `.any?` / `.empty?` uniformly.
622
+ def preserved_options_map(field)
623
+ if field.is_a?(HexaPDF::Type::AcroForm::ButtonField) && field.radio_button?
624
+ vals = field.allowed_values
625
+ return {} unless vals && !vals.empty?
626
+ vals.each_with_object({}) { |v, h| h[v.to_s] = v.to_s }
627
+ elsif field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
628
+ items = field.option_items
629
+ return {} unless items && !items.empty?
630
+ items.each_with_object({}) { |v, h| h[v.to_s] = v.to_s }
631
+ else
632
+ {}
633
+ end
634
+ end
635
+
636
+ def image_upload?(field)
637
+ field.is_a?(HexaPDF::Type::AcroForm::ButtonField) &&
638
+ field.respond_to?(:push_button?) && field.push_button?
639
+ end
640
+
641
+ def image_path?(value)
642
+ value.is_a?(String) && File.file?(value)
643
+ end
644
+
645
+ def stamp_image_on_widget(doc, field, path)
646
+ format, image_width, image_height = validate_image!(path)
647
+ widget = field.each_widget.first
648
+ return unless widget && widget[:Rect]
649
+
650
+ page = doc.pages.find { |candidate_page| candidate_page[:Annots]&.include?(widget) }
651
+ return unless page
652
+
653
+ # Widget Rect is absolute page coords; canvas API is MediaBox-relative.
654
+ media_box_x, media_box_y = page.box.value[0], page.box.value[1]
655
+ rect_x_min, rect_y_min, rect_x_max, rect_y_max = widget[:Rect]
656
+ slot_width = rect_x_max - rect_x_min
657
+ slot_height = rect_y_max - rect_y_min
658
+ slot_canvas_x = rect_x_min - media_box_x
659
+ slot_canvas_y = rect_y_min - media_box_y
660
+
661
+ stamp_path = prepare_image_for_slot(path, format, image_width, image_height,
662
+ slot_width, slot_height) || path
663
+ if stamp_path != path
664
+ _, image_width, image_height = image_dimensions(stamp_path)
665
+ end
666
+
667
+ draw_width, draw_height = fit_inside(image_width, image_height, slot_width, slot_height)
668
+ draw_x = slot_canvas_x + (slot_width - draw_width) / 2.0
669
+ draw_y = slot_canvas_y + (slot_height - draw_height) / 2.0
670
+
671
+ canvas = page.canvas(type: :overlay)
672
+ canvas.fill_color(255, 255, 255)
673
+ canvas.rectangle(slot_canvas_x, slot_canvas_y, slot_width, slot_height).fill
674
+ canvas.image(stamp_path, at: [draw_x, draw_y], width: draw_width, height: draw_height)
675
+
676
+ # Bake into the page so the widget's empty appearance doesn't repaint over the image.
677
+ page[:Annots].delete(widget)
678
+ end
679
+
680
+ def fit_inside(image_width, image_height, slot_width, slot_height)
681
+ scale = [slot_width.to_f / image_width, slot_height.to_f / image_height].min
682
+ [image_width * scale, image_height * scale]
683
+ end
684
+
685
+ # Trim removes the transparent border around a signature; downsample
686
+ # caps source resolution at TARGET_PPI for the widget's longer side.
687
+ def prepare_image_for_slot(path, format, image_width, image_height,
688
+ slot_width_pt, slot_height_pt)
689
+ return nil unless imagemagick_available?
690
+ slot_max_pt = [slot_width_pt, slot_height_pt].max
691
+ target_max_px = (slot_max_pt * TARGET_PPI / 72.0).ceil
692
+ needs_resize = image_width > target_max_px * 2 || image_height > target_max_px * 2
693
+ needs_trim = format == :png && png_with_alpha?(path)
694
+ return nil unless needs_resize || needs_trim
695
+
696
+ ext = (format == :png) ? ".png" : ".jpg"
697
+ require "securerandom"
698
+ require "tmpdir"
699
+ output_path = File.join(Dir.tmpdir,
700
+ "acroforge_stamp_#{Process.pid}_#{SecureRandom.hex(4)}#{ext}")
701
+ # `format:path` locks the coder, closing the CVE-2016-3714 (ImageTragick) class of attack.
702
+ args = ["convert", "#{format}:#{path}"]
703
+ args.push("-trim", "+repage") if needs_trim
704
+ args.push("-resize", "#{target_max_px}x#{target_max_px}>") if needs_resize
705
+ args.push(output_path)
706
+ success = system(*args, out: File::NULL, err: File::NULL)
707
+ (success && File.exist?(output_path)) ? output_path : nil
708
+ end
709
+
710
+ # PNG color type 4 = greyscale+alpha, 6 = RGBA — only these are trim-worthy.
711
+ def png_with_alpha?(path)
712
+ File.open(path, "rb") do |io|
713
+ return false unless io.read(8) == "\x89PNG\r\n\x1A\n".b
714
+ return false if io.read(8).nil?
715
+ return false if io.read(8).nil?
716
+ return false if io.read(1).nil?
717
+ color_type_byte = io.read(1)
718
+ return false if color_type_byte.nil?
719
+ color_type = color_type_byte.unpack1("C")
720
+ color_type == 4 || color_type == 6
721
+ end
722
+ end
723
+
724
+ def imagemagick_available?
725
+ return @imagemagick_available if defined?(@imagemagick_available)
726
+ @imagemagick_available = system("which", "convert", out: File::NULL, err: File::NULL)
727
+ end
728
+
729
+ # Trust boundary in front of ImageMagick: any malformed-input path
730
+ # raises a single error class so worker retry policies can key on it.
731
+ def validate_image!(path)
732
+ size = File.size(path)
733
+ if size > MAX_IMAGE_BYTES
734
+ raise ImageTooLargeError, "#{path}: #{size} bytes exceeds #{MAX_IMAGE_BYTES} byte cap"
735
+ end
736
+ format, width, height = image_dimensions(path)
737
+ if width > MAX_IMAGE_DIMENSION || height > MAX_IMAGE_DIMENSION
738
+ raise ImageTooLargeError,
739
+ "#{path}: #{width}x#{height}px exceeds #{MAX_IMAGE_DIMENSION}px per side"
740
+ end
741
+ [format, width, height]
742
+ end
743
+
744
+ def image_dimensions(path)
745
+ File.open(path, "rb") do |io|
746
+ head = read_exact(io, 8, path)
747
+ io.rewind
748
+ if head.start_with?("\x89PNG\r\n\x1A\n".b)
749
+ width, height = read_png_dimensions(io, path)
750
+ [:png, width, height]
751
+ elsif head[0, 2] == "\xFF\xD8".b
752
+ width, height = read_jpeg_dimensions(io, path)
753
+ [:jpg, width, height]
754
+ else
755
+ raise_unsupported(path)
756
+ end
757
+ end
758
+ end
759
+
760
+ def read_png_dimensions(io, path)
761
+ read_exact(io, 16, path) # 8-byte signature + 4 length + "IHDR"
762
+ width = read_exact(io, 4, path).unpack1("N")
763
+ height = read_exact(io, 4, path).unpack1("N")
764
+ [width, height]
765
+ end
766
+
767
+ def read_jpeg_dimensions(io, path)
768
+ read_exact(io, 2, path) # SOI
769
+ loop do
770
+ marker_byte = read_exact(io, 1, path).getbyte(0)
771
+ raise_unsupported(path) unless marker_byte == 0xFF
772
+ # Runs of 0xFF are valid JPEG fill bytes between markers.
773
+ marker_code = read_exact(io, 1, path).getbyte(0)
774
+ marker_code = read_exact(io, 1, path).getbyte(0) while marker_code == 0xFF
775
+ raise_unsupported(path, "no SOF marker found") if marker_code == 0xD9 || marker_code == 0x00
776
+ # 0xD0..0xD7 and 0x01 are standalone markers — no length follows.
777
+ next if (0xD0..0xD7).cover?(marker_code) || marker_code == 0x01
778
+ segment_length = read_exact(io, 2, path).unpack1("n")
779
+ raise_unsupported(path, "negative segment length") if segment_length < 2
780
+ is_sof_marker = (0xC0..0xCF).cover?(marker_code) && ![0xC4, 0xC8, 0xCC].include?(marker_code)
781
+ if is_sof_marker
782
+ read_exact(io, 1, path) # precision
783
+ height = read_exact(io, 2, path).unpack1("n")
784
+ width = read_exact(io, 2, path).unpack1("n")
785
+ return [width, height]
786
+ else
787
+ read_exact(io, segment_length - 2, path)
788
+ end
789
+ end
790
+ end
791
+
792
+ def read_exact(io, byte_count, path)
793
+ buf = io.read(byte_count)
794
+ raise_unsupported(path, "truncated header") if buf.nil? || buf.bytesize < byte_count
795
+ buf
796
+ end
797
+
798
+ def raise_unsupported(path, reason = "only JPG and PNG are supported")
799
+ raise UnsupportedImageFormatError, "#{path}: #{reason}"
800
+ end
801
+
802
+ # Heuristic key for push-button image fields based on widget aspect ratio.
803
+ # Vendor forms reserve square-ish boxes for headshots and wide-thin strips
804
+ # for signatures; the slot's geometry is a more reliable signal than any
805
+ # nearby text because the labels are usually far from the box.
806
+ def infer_image_field_key(rect)
807
+ width = rect[2] - rect[0]
808
+ height = rect[3] - rect[1]
809
+ return nil if width <= 0 || height <= 0
810
+ aspect_ratio = width.to_f / height
811
+ if aspect_ratio > 3.0
812
+ :signature
813
+ elsif aspect_ratio.between?(0.5, 2.0) && [width, height].min >= 30
814
+ :passport_photo
815
+ else
816
+ :photo
817
+ end
818
+ end
819
+
820
+ def looks_like_clean_identifier?(name)
821
+ return false unless name.is_a?(String) && !name.empty?
822
+ return false unless name.match?(/\A[a-z][a-z0-9_]*\z/)
823
+ return false if name.match?(/\A(?:page|text|field|image)\d/)
824
+ return false if name.match?(/_(?:field|text|image)\d/)
825
+ true
826
+ end
827
+
529
828
  def sanitize_key(string)
530
- key = string.to_s.downcase
829
+ cleaned = AcroForge::Labels.strip_parenthetical(string)
830
+
831
+ key = cleaned.downcase
531
832
  .gsub(/['’*]/, "")
532
833
  .gsub(/[^a-z0-9]+/, "_").squeeze("_")
533
834
  .sub(/_$/, "")
@@ -735,24 +1036,25 @@ module AcroForge
735
1036
  payload.each do |key, value|
736
1037
  next if value.nil? || value.to_s.empty?
737
1038
 
738
- # Strip suffixes like _1 or _btn to find the base canonical key for schema lookup
1039
+ # Strip the _N collision suffix that compile! appends when multiple
1040
+ # fields share a name. The bare key drives schema and override lookup.
739
1041
  key_str = key.to_s
740
- base_key = key_str.sub(/_btn(?:_\d+)?\z/, "").sub(/_\d+\z/, "").to_sym
1042
+ base_key = key_str.sub(/_\d+\z/, "").to_sym
741
1043
 
742
- # Try to resolve override info. @overrides may be keyed by
743
- # original PDF field names (strings like "page0_field6") so allow lookup
744
- # by semantic base_key (matching value[:key]) or by string key.
745
1044
  override_info = @overrides[base_key] || @overrides[base_key.to_s] || @overrides.values.find { |v| v.is_a?(Hash) && v[:key].to_sym == base_key }
746
1045
 
747
1046
  type_info = @schema[base_key]
748
1047
 
749
- # If it's a button field, it's a select type by nature
750
- type = if key_str.include?("_btn")
751
- :select
752
- elsif override_info
1048
+ # Treat fields with a known options set as :select regardless of how
1049
+ # the schema labels them — that's how button/choice fields are
1050
+ # validated against their allowed values.
1051
+ pdf_options_for_type = @select_field_options[key.to_s]&.keys || []
1052
+ type = if override_info
753
1053
  override_info[:type]
754
1054
  elsif type_info
755
1055
  type_info.is_a?(Hash) ? type_info[:type] : :string
1056
+ elsif !pdf_options_for_type.empty?
1057
+ :select
756
1058
  else
757
1059
  infer_type(key)
758
1060
  end
@@ -789,9 +1091,11 @@ module AcroForge
789
1091
  end
790
1092
  end
791
1093
 
792
- # ------------------------------------------
793
- # THE UNIVERSAL DYNAMIC HEURISTIC (WEIGHTS FIXED)
794
- # ------------------------------------------
1094
+ # Scores every text chunk against the field's rectangle and returns the
1095
+ # best-matching label. `mode` selects which spatial relationship to reward
1096
+ # (inline label, grid header, radio option). The magic offsets below are
1097
+ # hand-tuned weights, not physical distances — they bias one relationship
1098
+ # over another when several candidates sit close to the field.
795
1099
  def find_nearest_text(text_chunks, field_rect, mode: :standard)
796
1100
  f_x_min, f_y_min, f_x_max, f_y_max = field_rect
797
1101
  f_y_center = (f_y_min + f_y_max) / 2.0
@@ -836,14 +1140,17 @@ module AcroForge
    end

  when :standard
-   is_grid_locked = dy_top > -5 && dy_top < 30 && t_x_center >= (f_x_min - 20) && t_x_center <= (f_x_max + 20)
+   # Tighter dy_top so labels two rows up don't claim an unrelated field.
+   is_grid_locked = dy_top > -5 && dy_top < 12 && t_x_center >= (f_x_min - 20) && t_x_center <= (f_x_max + 20)
    is_inline = dy_center < 10 && dx_left > -10 && dx_left < 200

-   if is_grid_locked
-     score = dy_top.abs - 2000
+   # Inline beats grid-locked when both apply: a label sitting on the same
+   # row as its field is a stronger signal than one sitting above it.
+   if is_inline
+     score = dx_left.abs - 3000
      score -= 200 if has_colon_or_q
-   elsif is_inline
-     score = dx_left.abs - 1000
+   elsif is_grid_locked
+     score = dy_top.abs - 2000
      score -= 200 if has_colon_or_q
    elsif dy_center < 15 && dx_left > -10 && dx_left < 150
      score = dx_left.abs
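To see how the reordered branches and the tighter `dy_top` window interact, here is a simplified sketch of the `:standard` scoring shown in the hunk above. It assumes the distances (`dy_top`, `dy_center`, `dx_left`, `t_x_center`) are already computed, since the diff does not show how the engine derives them, and it assumes that the lowest score wins, which the negative offsets suggest but the hunk does not state. `standard_score` is a hypothetical name.

```ruby
# Hypothetical, simplified scorer for the :standard branch. The inputs are
# taken as given; the thresholds and offsets are copied from the 0.2.0 side
# of the hunk.
def standard_score(dy_top:, dy_center:, dx_left:, t_x_center:, f_x_min:, f_x_max:, has_colon_or_q: false)
  is_grid_locked = dy_top > -5 && dy_top < 12 &&
    t_x_center >= (f_x_min - 20) && t_x_center <= (f_x_max + 20)
  is_inline = dy_center < 10 && dx_left > -10 && dx_left < 200

  if is_inline
    score = dx_left.abs - 3000
    score -= 200 if has_colon_or_q
  elsif is_grid_locked
    score = dy_top.abs - 2000
    score -= 200 if has_colon_or_q
  elsif dy_center < 15 && dx_left > -10 && dx_left < 150
    score = dx_left.abs
  end
  score # nil when none of the branches shown in the hunk applies
end

field = {f_x_min: 100, f_x_max: 250}
puts standard_score(dy_top: 5, dy_center: 3, dx_left: 40, t_x_center: 120, **field)
puts standard_score(dy_top: 8, dy_center: 20, dx_left: 300, t_x_center: 120, **field)
```

The first candidate satisfies both conditions and takes the inline score; the second is only grid-locked. A candidate with `dy_top` of 25 would have been grid-locked under the old `< 30` bound but is not under `< 12`.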
@@ -22,10 +22,20 @@ module AcroForge
    of at by in on to up from with as vs
  ].to_set

+ # Parenthetical content in form labels is UI hints, not field identity.
+ def strip_parenthetical(text)
+   text.to_s.gsub(/\s*\(.*?\)\s*/, " ").gsub(/\s+/, " ").strip
+ end
+
  def humanize(label)
    return label unless label.is_a?(String) && !label.empty?

-   result = fix_typos(label)
+   result = strip_parenthetical(label)
+   # Treat `_` as a word separator so snake_case names from preserved
+   # fields title-case correctly. Real PDF labels almost never contain
+   # underscores; snake_case keys passed through here always do.
+   result = result.tr("_", " ")
+   result = fix_typos(result)
    result = title_case(result)
    result.gsub(/\s+/, " ").strip
  end
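The new pipeline order (strip parentheticals, split on underscores, then title-case) can be tried in isolation. In the sketch below `strip_parenthetical` is copied from the hunk above, while `simple_title_case` and `humanize_sketch` are hypothetical stand-ins: the gem's own `fix_typos` and `title_case` are not shown in this diff, and the stand-in capitalizes every word, including small words the real implementation may leave lowercase.

```ruby
# strip_parenthetical is copied from the hunk above.
def strip_parenthetical(text)
  text.to_s.gsub(/\s*\(.*?\)\s*/, " ").gsub(/\s+/, " ").strip
end

# Hypothetical stand-in for the gem's fix_typos/title_case steps.
def simple_title_case(text)
  text.split.map(&:capitalize).join(" ")
end

# Hypothetical approximation of humanize using the stand-in above.
def humanize_sketch(label)
  return label unless label.is_a?(String) && !label.empty?

  result = strip_parenthetical(label)
  result = result.tr("_", " ") # snake_case names become separate words
  simple_title_case(result)
end

puts humanize_sketch("date of birth (DD/MM/YYYY)")
puts humanize_sketch("date_signed")
```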
@@ -94,14 +94,14 @@ module AcroForge
  # we use its proposals directly (no second compile). This lets callers
  # like the CLI's `bootstrap` subcommand share one compile pass with
  # Schema.infer instead of running the engine twice.
- def propose(pdf_path, out:, schema: {}, mode: :merge, engine: nil)
+ def propose(pdf_path, out:, schema: {}, preserve: [], mode: :merge, engine: nil)
    existing = (mode == :merge && File.exist?(out)) ? YAML.load_file(out) : nil

    proposals = if engine
      engine.field_proposals
    else
      Dir.mktmpdir do |tmp|
-       e = AcroForge::Engine.new(pdf_path, schema: schema, normalized_dir: tmp)
+       e = AcroForge::Engine.new(pdf_path, schema: schema, preserve: preserve, normalized_dir: tmp)
        e.compile!
        e.field_proposals
      end
@@ -147,16 +147,21 @@ module AcroForge
  end

  def infer_type(proposal)
+   # Mirrors AcroForge::Schema#infer_type. Kept separate so Relabeler
+   # doesn't have to load the whole Schema module path during mapping
+   # propose. Keep these two in sync when you tweak either.
    case proposal[:pdf_field_type]
    when :button
      ((proposal[:options]&.size || 0) > 1) ? :select : :boolean
    when :choice
      :select
    else
-     label = proposal[:raw_label].to_s.downcase
+     # Normalize underscores to spaces so `\b`-anchored regexes match
+     # against snake_case names like `account_no`, `date_signed`.
+     label = proposal[:raw_label].to_s.downcase.tr("_", " ")
      case label
      when /amount|salary|income|balance|fee|tier3/ then :money
-     when /\bdate\b|birth|expiry|employed/ then :date
+     when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
      when /email/ then :email
      when /years|tenor|number of|\bno\.?\b/ then :number
      else :string
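The underscore normalization matters because `_` counts as a word character in Ruby regexes, so `\b` finds no boundary between `date` and `_signed`. The sketch below reproduces only the label-matching branch from the hunk above; `label_type` and its `normalize:` switch are hypothetical, added so both behaviours can be compared side by side.

```ruby
# Hypothetical helper reproducing the label-matching branch shown above.
# `normalize: false` mimics the 0.1.0 behaviour (no underscore handling).
def label_type(raw_label, normalize: true)
  label = raw_label.to_s.downcase
  label = label.tr("_", " ") if normalize
  case label
  when /amount|salary|income|balance|fee|tier3/ then :money
  when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
  when /email/ then :email
  when /years|tenor|number of|\bno\.?\b/ then :number
  else :string
  end
end

%w[date_signed account_no].each do |name|
  puts "#{name}: #{label_type(name, normalize: false)} -> #{label_type(name)}"
end
```

Without normalization both names fall through to `:string`; with it, `date_signed` resolves to `:date` and `account_no` to `:number`.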
@@ -12,7 +12,9 @@ module AcroForge
  def load(path)
    raw = case File.extname(path).downcase
    when ".yml", ".yaml"
-     YAML.safe_load_file(path, permitted_classes: [Symbol], aliases: true)
+     # safe_load_file was added in Psych 4 (Ruby 3.1+); safe_load(File.read)
+     # keeps the gem usable on Ruby 2.7 as the gemspec advertises.
+     YAML.safe_load(File.read(path), permitted_classes: [Symbol], aliases: true) # standard:disable Style/YAMLFileRead
    when ".json"
      JSON.parse(File.read(path), symbolize_names: false)
    else
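The replacement call can be exercised on its own. This is a small sketch, assuming a Psych version that accepts the `permitted_classes:` and `aliases:` keywords on `safe_load`; `load_yaml_compat` is a hypothetical wrapper name, and the YAML content is made up for the example.

```ruby
require "yaml"
require "tempfile"

# Hypothetical wrapper; the call inside mirrors the added line in the hunk.
def load_yaml_compat(path)
  YAML.safe_load(File.read(path), permitted_classes: [Symbol], aliases: true)
end

Tempfile.create(["schema", ".yml"]) do |file|
  # Uses both a Symbol value and an alias, the two features the keywords allow.
  file.write("base: &base\n  type: :string\nfull_name: *base\n")
  file.flush
  p load_yaml_compat(file.path)
end
```

Dropping `permitted_classes: [Symbol]` or `aliases: true` makes the same file raise instead of loading, which is why both keywords carry over from the old `safe_load_file` call.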
@@ -98,24 +100,26 @@ module AcroForge
  # we use its proposals directly. This lets callers (notably the CLI's
  # `bootstrap` subcommand) avoid a redundant second compile when they
  # also want to call Relabeler.propose on the same PDF.
- def infer(pdf_path, sections: [], engine: nil)
+ def infer(pdf_path, sections: [], preserve: [], engine: nil)
    return aggregate_proposals(engine.field_proposals) if engine

    require "tmpdir"
    Dir.mktmpdir do |tmp|
-     e = AcroForge::Engine.new(pdf_path, sections: sections, normalized_dir: tmp)
+     e = AcroForge::Engine.new(pdf_path, sections: sections, preserve: preserve, normalized_dir: tmp)
      e.compile!
      aggregate_proposals(e.field_proposals)
    end
  end

  def aggregate_proposals(proposals)
-   proposals.each_with_object({}) do |p, schema|
+   result = proposals.each_with_object({}) do |p, schema|
      next if p[:canonical_key].nil?

      key = p[:canonical_key].to_sym
      schema[key] ||= {type: infer_type(p), variations: []}
-     if p[:raw_label]
+
+     # Preserved fields' raw_label echoes the key — adding it as a variation is noise.
+     if p[:raw_label] && p[:confidence] != :preserved
        cleaned = humanize_label(p[:raw_label])
        schema[key][:variations] << cleaned unless schema[key][:variations].include?(cleaned)
      end
@@ -124,6 +128,9 @@ module AcroForge
        schema[key][:options] = p[:options].keys.map(&:to_sym).uniq
      end
    end
+
+   result.each_value { |entry| entry.delete(:variations) if entry[:variations].empty? }
+   result
  end

  # Thin delegator. The real implementation lives in AcroForge::Labels so
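The pruning step added to `aggregate_proposals` in the hunk above removes `variations` keys that ended up empty, for example when every proposal for a key was a preserved field. A minimal sketch of the same idea on a plain hash, using the more defensive type-checked form; `prune_empty_variations` is a hypothetical name, since the gem inlines this logic, and the sample schema is invented for illustration:

```ruby
# Hypothetical helper; deletes :variations only when it is an empty Array.
def prune_empty_variations(schema)
  schema.each_value do |entry|
    next unless entry.is_a?(Hash) && entry[:variations].is_a?(Array) && entry[:variations].empty?
    entry.delete(:variations)
  end
  schema
end

schema = {
  gender: {type: :select, variations: [], options: %i[male female]},
  full_name: {type: :string, variations: ["Full Name"]}
}
p prune_empty_variations(schema)
```

After the call, `gender` keeps its `type` and `options` but no longer carries an empty `variations` array, while `full_name` is unchanged.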
@@ -141,10 +148,11 @@ module AcroForge
  when :choice
    :select
  else
-   label = proposal[:raw_label].to_s.downcase
+   # `_` is a word char, so `\b` regexes need underscores converted to spaces.
+   label = proposal[:raw_label].to_s.downcase.tr("_", " ")
    case label
    when /amount|salary|income|balance|fee|tier3/ then :money
-   when /\bdate\b|birth|expiry|employed/ then :date
+   when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
    when /email/ then :email
    when /years|tenor|number of|\bno\.?\b/ then :number
    else :string
@@ -187,6 +195,12 @@ module AcroForge
      end
    end

+   # Mirrors aggregate_proposals — keeps merged schemas free of hollow `variations: []`.
+   result.each_value do |entry|
+     next unless entry.is_a?(Hash) && entry[:variations].is_a?(Array) && entry[:variations].empty?
+     entry.delete(:variations)
+   end
+
    result
  end
 
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module AcroForge
-   VERSION = "0.1.0"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: acroforge
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.0
  platform: ruby
  authors:
  - Maxwell Nana Forson
@@ -75,7 +75,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 4.0.6
+ rubygems_version: 4.0.12
  specification_version: 4
  summary: PDF AcroForm engine with heuristic-assisted field relabeling.
  test_files: []