acroforge 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +29 -0
- data/README.md +29 -8
- data/lib/acroforge/engine.rb +350 -43
- data/lib/acroforge/labels.rb +11 -1
- data/lib/acroforge/relabeler.rb +9 -4
- data/lib/acroforge/schema.rb +21 -7
- data/lib/acroforge/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: c97cab951d5efe0d85539bcc55b368277b95bd1d6595141459be383a1a818c2e
+ data.tar.gz: 7156afa0720c474b3ee0b287cb994ccb3438c1f58cd10ff297673d19d6b34a30
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: 35a79c84d8dd14affaa2e99a05c1b40fa7df8942e3e217adcaf05457de1e3303b49f7c9029e4768fa547ea83f41f96be7ece12cd889bc3efac37424a2d97ebbb
+ data.tar.gz: e0c282b00ddc2f38321a1e2979028886c614fbea59651081f2a9f2d581b2c07b5abb41374ba1a078df2fd7f382b81a7ccf774bd4b5e16aff089487410852020b
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,34 @@
  # CHANGELOG

+ ## [0.2.0] - 2026-05-28
+
+ ### Changed (breaking)
+
+ - **`_btn` suffix dropped from canonical keys.** Button/choice fields now use bare canonical keys (`gender`, `title`, `marital_status`) instead of `gender_btn` etc. Type detection for validation reads the PDF field type and the resolved options map, not a string token in the key. Existing payloads/schemas/mappings using `_btn` need their keys updated.
+ - **`fill!` is now standalone.** It no longer requires `compile!` to have run, and it no longer falls back to the normalized PDF — it always fills the original `@template_path`. Stale normalized files can no longer silently take over a fill. To fill a renamed template, instantiate a new `Engine` over the normalized output.
+ - **Bad radio values now raise instead of warn-and-skip.** When `fill!` cannot match a value to a radio button's allowed options, it raises `AcroForge::Error` instead of logging a warning and continuing.
+ - **Engine introspection renamed:** `raw_fields` → `fields`, `raw_field_names` → `field_names`, `any_raw_fields?` → `any_fields?`.
+ - **Schemas emitted by `Schema.infer` / `Schema.merge` no longer carry empty `variations: []`.** When a field has no spatial label worth recording (typically anything preserved verbatim), the key is omitted entirely. Existing schemas with `variations:` still load fine.
+ - **Parenthetical content stripped** from both canonical keys and humanized variations. `Date of Birth (YYYY-MM-DD)` now produces `:date_of_birth` (not `:date_of_birth_yyyy_mm_dd`) and the variation `"Date of Birth"`.
+
+ ### Added
+
+ - **`preserve:` constructor kwarg on `Engine`** — explicit allowlist of AcroForm field names `compile!` must never rename. Plumbed through `Schema.infer` and `Relabeler.propose`.
+ - **Auto-preserve cascade in `compile!`** — fields are kept verbatim when they are (A) in `preserve:`, (B) a schema canonical key, or (C) look like a clean snake_case identifier (`looks_like_clean_identifier?`). Tier C is what makes `schema infer` on a clean PDF stop generating `mr` / `single` / section-header bleeds.
+ - **Case-insensitive radio matching** in `fill!` — `"Mr"` resolves to `:mr` against `allowed_values` without forcing the caller to know the export-value casing.
+ - **Automatic image stamping in `fill!`** — a payload value that targets a push-button image-upload field and points to a JPG/PNG file path is now drawn into the widget rectangle (scaled to fit) instead of being treated as text. `compile!` infers canonical keys for these slots from widget geometry: square-ish → `:passport_photo`, wide-and-thin → `:signature`, otherwise `:photo`.
+ - **Image validation with new error classes** — stamped images must be JPG or PNG within `MAX_IMAGE_BYTES` (5 MB) and `MAX_IMAGE_DIMENSION` (4000 px per side); violations raise the new `AcroForge::ImageTooLargeError` or `AcroForge::UnsupportedImageFormatError` (both subclass `AcroForge::Error`). When ImageMagick (`convert`) is on `PATH`, oversized images are downsampled toward `TARGET_PPI` (200) and transparent PNG borders (e.g. around a signature) are trimmed before stamping.
+ - **`simplecov`** in the dev/test group; coverage report under `coverage/`.
+
+ ### Fixed
+
+ - Radio button widget appearance state (`/AS`) wasn't syncing when `fill!` operated standalone (no compile-time `:TU` options map). Radios now coerce values to a Symbol matched against `allowed_values`, so hexapdf syncs widget state correctly.
+ - `Schema.infer_type` was missing matches on snake_case names (`account_no`, `date_signed`) because `_` is a word character — regex `\b` never fired. Underscores are now normalized to spaces before type detection.
+ - `Schema.infer_type` now recognises `\bdob\b` as a date.
+ - `Schema.merge` no longer reinstates the empty `variations: []` keys that `aggregate_proposals` strips.
+ - `compile!` was reading `field[:Opt]` for choice fields via `is_a?(Array)`, but hexapdf 1.x returns `HexaPDF::PDFArray` (not `Array`). The check is now duck-typed via `respond_to?(:each)`, so choice-field options maps are populated correctly.
+ - Spatial `:standard` scorer no longer claims a label sitting two rows above a field — tightened `dy_top` window and gave inline (label-left) matches a stronger preference over grid-locked (label-above) candidates. This recovers `first_name` on multi-column form layouts like Abokobi's.
+
  ## [0.1.0] - 2026-05-26

  ## Added
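The `Schema.infer_type` fix listed under Fixed comes down to how Ruby regexes treat `_`. The sketch below illustrates it with a hypothetical pattern, `/\bno\b/`; the gem's real type-detection patterns are not part of this diff, so treat the constant name and pattern as assumptions.

```ruby
# Hypothetical pattern; the gem's actual type-detection regexes are not shown in this diff.
ACCOUNT_NUMBER = /\bno\b/

# `_` counts as a word character, so there is no word boundary between "_" and "n".
raw_match = "account_no".match?(ACCOUNT_NUMBER)

# Normalizing underscores to spaces first, as the changelog describes, restores the boundary.
normalized_match = "account_no".tr("_", " ").match?(ACCOUNT_NUMBER)

puts "raw name matches:        #{raw_match}"
puts "normalized name matches: #{normalized_match}"
```

The raw name does not match and the normalized one does, which is the behavior the changelog entry describes for names such as `account_no` and `date_signed`.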
data/README.md
CHANGED
@@ -80,19 +80,29 @@ Exit codes: `0` success, `1` user error (bad args, missing file), `2` validation
  ```ruby
  require "acroforge"

- #
+ # Inspect the fields a PDF declares (no compilation needed).
+ engine = AcroForge::Engine.new("form.pdf")
+ engine.fields # => [{ name: "full_name", type: :text, alternate_name: nil }, ...]
+ engine.field_names # => ["full_name", ...]
+ engine.any_fields? # => true
+
+ # Fill an already-clean PDF — fill! is standalone, compile! is NOT required.
+ engine.fill!({ full_name: "Alice", email: "alice@example.com" }, "filled.pdf")
+
+ # Compile a messy PDF to clean its field names via the spatial heuristic.
  engine = AcroForge::Engine.new(
  "form.pdf",
- schema:
- overrides: {},
-
+ schema: AcroForge::Schema.load("schema.yml"), # or pass a Hash directly
+ overrides: {}, # per-PDF target-key overrides
+ preserve: %w[some_field_to_keep_verbatim], # allowlist of names to never rename
+ sections: ["Personal Details", "Loan Details"] # optional section headers for scoping
  )
  result = engine.compile!
  # => { mapped: {...}, unmapped: [...], select_options: {...}, new_fields_detected: [...] }
+ # Writes a `<form>_normalized.pdf` next to the source.

- #
-
- engine.fill!({ full_name: "Alice", email: "alice@example.com" }, "filled.pdf")
+ # To fill a compiled (normalized) template, point a new Engine at it:
+ AcroForge::Engine.new("form_normalized.pdf").fill!(payload, "filled.pdf")

  # Generate a starter schema from a PDF.
  schema = AcroForge::Schema.infer("form.pdf")
@@ -107,9 +117,20 @@ AcroForge::Validator.valid?("alice@example.com", :email) # => true
  AcroForge::Validator.valid?("not a date", :date) # => false
  ```

+ ### Preserve cascade
+
+ `compile!` keeps an existing field name verbatim — skipping the spatial heuristic for that field — when any of these match:
+
+ 1. **Explicit:** the name is in the `preserve:` kwarg.
+ 2. **Schema:** the name is already a canonical key in the supplied `schema:`.
+ 3. **Heuristic:** the name looks like a clean snake_case identifier (no `pageN_fieldM` / `TextN` / `ImageN` markers).
+
+ The cascade is what stops `Schema.infer` on a clean PDF from synthesising garbage keys derived from adjacent radio-option labels or section headers.
+
  ### Errors

  - `AcroForge::ValidationError`: raised by `Engine#validate_payload!` on type mismatch.
+ - `AcroForge::Error`: raised by `Engine#fill!` when the PDF rejects a value (e.g. a radio value that doesn't match any allowed option).
  - `AcroForge::RelabelError`: raised by `Relabeler.apply!` on malformed mapping YAML, invalid key names, or missing AcroForm.

  ## Schema format
@@ -157,7 +178,7 @@ AcroForge also accepts a legacy "shorthand" form where the value is just an arra
  _meta:
  source_pdf: broken_form.pdf
  generated_at: 2026-05-26T14:32:11Z
- acroforge_version: 0.
+ acroforge_version: 0.2.0
  total_fields: 98

  page0_field6:
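The heuristic tier of the README's preserve cascade corresponds to `looks_like_clean_identifier?` in this release's `engine.rb`. The sketch below lifts that method out of the class so its effect can be tried on a few names; the sample field names are illustrative and not taken from a real form.

```ruby
# Standalone copy of Engine#looks_like_clean_identifier? as it appears in this diff.
def looks_like_clean_identifier?(name)
  return false unless name.is_a?(String) && !name.empty?
  return false unless name.match?(/\A[a-z][a-z0-9_]*\z/)
  return false if name.match?(/\A(?:page|text|field|image)\d/)
  return false if name.match?(/_(?:field|text|image)\d/)
  true
end

# Illustrative field names (assumptions, not from a real PDF).
samples = %w[full_name date_of_birth page0_field6 Text12 image3 loan_field2]

samples.each do |name|
  verdict = looks_like_clean_identifier?(name) ? "kept verbatim" : "falls through to the other tiers"
  puts "#{name}: #{verdict}"
end
```

Only `full_name` and `date_of_birth` pass. The other four are rejected either by the lowercase snake_case check or by the auto-generated-name markers, so they would be preserved only if the explicit or schema tier matched.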
data/lib/acroforge/engine.rb
CHANGED
@@ -11,16 +11,27 @@ require_relative "constants"
  require_relative "labels"

  module AcroForge
+ ImageTooLargeError = Class.new(Error)
+ UnsupportedImageFormatError = Class.new(Error)
+
  class Engine
+ # Caps a phone-camera passport photo from bloating the output PDF.
+ MAX_IMAGE_BYTES = 5 * 1024 * 1024
+ MAX_IMAGE_DIMENSION = 4000
+ # Auto-downsample images whose pixel resolution far exceeds this PPI
+ # at the widget's rendered size. Requires ImageMagick on PATH.
+ TARGET_PPI = 200
+
  attr_reader :template_path, :schema, :overrides, :sections, :normalized_path,
  :mapped_fields, :unmapped_fields, :filled_fields, :missing_fields,
  :select_field_options, :new_fields_detected

- def initialize(template_path, schema: {}, overrides: {}, sections: [], normalized_dir: nil)
+ def initialize(template_path, schema: {}, overrides: {}, sections: [], preserve: [], normalized_dir: nil)
  @template_path = template_path
  @schema = schema
  @overrides = overrides
  @sections = sections
+ @preserve = Array(preserve).map(&:to_s)

  dir = normalized_dir || File.dirname(template_path)
  base = File.basename(template_path, ".*")
@@ -47,7 +58,7 @@ module AcroForge
  @source_form ||= source_doc.acro_form(create: false)
  end

- def
+ def fields
  return [] unless source_form
  extracted = []
  source_form.each_field do |field|
@@ -62,12 +73,12 @@ module AcroForge
  extracted
  end

- def
-
+ def field_names
+ fields.map { |f| f[:name] }
  end

- def
-
+ def any_fields?
+ fields.any?
  end

  def fully_mapped?
@@ -108,9 +119,10 @@ module AcroForge
  index
  end

- #
- #
- #
+ # Phase one of the discover→fill pipeline: run the spatial heuristic over
+ # every field, rename each to its proposed semantic key in place, and
+ # persist the result to @normalized_path. The returned hash is what the
+ # Relabeler and Schema.infer consume.
  def compile!
  puts ">> Compiling template: #{@template_path}"
  form = source_doc.acro_form(create: true)
@@ -161,10 +173,70 @@ module AcroForge
  is_btn = field.is_a?(HexaPDF::Type::AcroForm::ButtonField) || field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
  is_radio_group = is_btn && field.each_widget.count > 1

+ preserved_key = nil
+ preserved_source = nil
+ if @preserve.include?(original_field_name) || @preserve.include?(base_field_name)
+ preserved_key = base_field_name
+ preserved_source = :explicit
+ elsif @schema && !@schema.empty?
+ # Raw name first — sanitize_key over-corrects "social" → "sofficial".
+ if @schema.key?(base_field_name.to_sym)
+ preserved_key = base_field_name
+ preserved_source = :schema
+ elsif (canonical = sanitize_key(base_field_name)) && @schema.key?(canonical.to_sym)
+ preserved_key = canonical.to_s
+ preserved_source = :schema
+ end
+ end
+
+ # Heuristic fallback for Schema.infer, which compiles with no schema yet.
+ if preserved_key.nil? && looks_like_clean_identifier?(base_field_name)
+ preserved_key = base_field_name
+ preserved_source = :heuristic
+ end
+
+ if preserved_key
+ target_key = preserved_key.to_sym
+ @mapped_fields[original_field_name] = target_key
+
+ y_center = (widget[:Rect][1] + widget[:Rect][3]) / 2.0
+ active_section = get_active_section(section_map, page_index, y_center)
+
+ preserved_opts = preserved_options_map(field)
+ # Mirror what the renamed-field path does so validate_payload! and fill!'s :TU
+ # branch see the same option map for preserved buttons/choices.
+ if preserved_opts && !preserved_opts.empty?
+ @select_field_options[target_key.to_s] = preserved_opts
+ field[:TU] = preserved_opts.to_json
+ end
+
+ @field_proposals << {
+ pdf_field_name: original_field_name,
+ pdf_field_type: case field
+ when HexaPDF::Type::AcroForm::TextField then :text
+ when HexaPDF::Type::AcroForm::ButtonField then :button
+ when HexaPDF::Type::AcroForm::ChoiceField then :choice
+ else :other
+ end,
+ canonical_key: target_key,
+ raw_label: preserved_key,
+ confidence: :preserved,
+ section: active_section,
+ page: page_index,
+ y: y_center,
+ x: (widget[:Rect][0] + widget[:Rect][2]) / 2.0,
+ options: preserved_opts
+ }
+
+ puts " [Preserved] '#{original_field_name}' (#{preserved_source}) -> :#{target_key}"
+ next
+ end
+
  options_map = nil

  if is_radio_group
- #
+ # A group's label sits by its top-left widget, so order by highest Y
+ # then leftmost X to find it — widget enumeration order is arbitrary.
  first_widget = field.each_widget.min_by { |w| [-w[:Rect][1], w[:Rect][0]] }

  raw_label = find_nearest_text(page_text_map[page_index], first_widget[:Rect], mode: :group_label)
@@ -220,10 +292,19 @@ module AcroForge

  ["no", "false", "off", "0", "unchecked"].each { |k| options_map[k] = "Off" }

+ # Image-upload (push-button) fields don't sit beside an inline
+ # label — their slot IS the label. Infer the canonical key from
+ # the widget geometry: square → passport_photo, wide-thin → signature.
+ if field.respond_to?(:push_button?) && field.push_button?
+ inferred = infer_image_field_key(widget[:Rect])
+ raw_label = inferred.to_s.tr("_", " ") if inferred
+ end
+
  elsif field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
  # Choice fields can expose values via /Opt entries.
  options_map = {}
-
+ # `field[:Opt]` is a HexaPDF::PDFArray on hexapdf 1.x — not a plain Array.
+ if field[:Opt].respond_to?(:each)
  field[:Opt].each do |opt|
  if opt.is_a?(Array)
  export_val = opt[0].to_s
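The switch from `is_a?(Array)` to `respond_to?(:each)` in this hunk can be reproduced without hexapdf. The sketch below uses a hypothetical `FakePdfArray` class as a stand-in for an enumerable wrapper that is not an `Array` subclass; it is not `HexaPDF::PDFArray` itself, and the way the options map is built here is an assumption for illustration.

```ruby
# Hypothetical stand-in for a PDF array wrapper: enumerable, but not an Array subclass.
class FakePdfArray
  include Enumerable

  def initialize(items)
    @items = items
  end

  def each(&block)
    @items.each(&block)
  end
end

opts = FakePdfArray.new(%w[Single Married])

old_check = opts.is_a?(Array)         # the 0.1.0-style check rejects the wrapper
new_check = opts.respond_to?(:each)   # the duck-typed check accepts it

# Illustrative options map, built only when iteration is possible.
options_map = {}
opts.each { |opt| options_map[opt.to_s.downcase] = opt.to_s } if new_check

puts "is_a?(Array):        #{old_check}"
puts "respond_to?(:each):  #{new_check}"
puts "options map:         #{options_map}"
```

With the type check, the wrapper is skipped and the map stays empty; with the duck-typed check, both options are recorded.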
@@ -264,8 +345,7 @@ module AcroForge
  override_entry = @overrides[original_field_name.to_s] || @overrides[original_field_name.to_sym]
  if override_entry
  semantic_name = override_entry[:key] || override_key_used
-
- target_key = (is_btn && !mapped_semantic.to_s.end_with?("_btn")) ? :"#{mapped_semantic}_btn" : mapped_semantic
+ target_key = semantic_name.to_sym

  # Ensure uniqueness when multiple fields map to the same semantic key
  original_target = target_key
@@ -319,7 +399,6 @@ module AcroForge

  target_key = active_section ? :"#{active_section}_#{base_key}" : base_key
  target_key = @overrides[raw_label].to_sym if @overrides[raw_label]
- target_key = :"#{target_key}_btn" if is_btn && !target_key.to_s.end_with?("_btn")

  original_target = target_key
  counter = 1
@@ -397,19 +476,16 @@ module AcroForge
  }
  end

- #
- #
- # ------------------------------------------
+ # Phase two: inject a payload into a name-addressable form and write
+ # output_path. Standalone — it does not depend on compile! having run.
  def fill!(payload, output_path, image_overlays = {})
-
-
-
- raise "Normalized template missing. Please run compile! first."
- end
+ # Always the original template — a stale normalized PDF would silently fill the wrong document.
+ # To fill a normalized PDF, instantiate a new Engine pointing at it.
+ puts ">> Injecting data into: #{@template_path}"

  validate_payload!(payload)

- normalized_doc = HexaPDF::Document.open(@
+ normalized_doc = HexaPDF::Document.open(@template_path)
  form = normalized_doc.acro_form

  @filled_fields = {}
@@ -429,6 +505,13 @@ module AcroForge

  if doc_field
  begin
+ if image_upload?(doc_field) && image_path?(value)
+ stamp_image_on_widget(normalized_doc, doc_field, value)
+ @filled_fields[key] = value
+ puts " [Stamped] :#{key} <- #{value}"
+ next
+ end
+
  if doc_field.is_a?(HexaPDF::Type::AcroForm::ButtonField) ||
  doc_field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
  resolved_from_map = false
@@ -464,7 +547,15 @@ module AcroForge
  normalized_val = value.to_s.downcase.strip
  on_state_sym = button_on_states(doc_field).first || :Yes

-
+ # Radios first — their "0"/"1" export values would otherwise
+ # be swallowed by the checkbox truthy/falsy branches below.
+ if doc_field.respond_to?(:radio_button?) && doc_field.radio_button?
+ # Case-insensitive match against allowed_values so "Mr" finds :mr.
+ # Symbol assignment triggers hexapdf to sync each widget's :AS.
+ allowed = doc_field.allowed_values || []
+ target = allowed.find { |v| v.to_s.casecmp(value.to_s).zero? } || value.to_sym
+ doc_field.field_value = target
+ elsif ["true", "yes", "on", "1"].include?(normalized_val)
  doc_field.field_value = on_state_sym.to_s
  doc_field.each_widget { |w| w[:AS] = on_state_sym }
  elsif ["false", "no", "off", "0"].include?(normalized_val)
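The radio lookup added in this hunk can be exercised in isolation. The sketch below wraps the same `find`/`casecmp` expression in a hypothetical helper, `resolve_radio_value`, and uses made-up export values; neither the helper name nor the values come from the gem.

```ruby
# Hypothetical helper wrapping the lookup from fill!: prefer an allowed value that
# matches ignoring case, otherwise fall back to the caller's value as a Symbol.
def resolve_radio_value(allowed_values, value)
  allowed = allowed_values || []
  allowed.find { |v| v.to_s.casecmp(value.to_s).zero? } || value.to_sym
end

allowed = %i[mr mrs miss] # illustrative export values

matched   = resolve_radio_value(allowed, "Mr") # resolves to the stored :mr
unmatched = resolve_radio_value(allowed, "Dr") # no match, falls back to :Dr
no_list   = resolve_radio_value(nil, "mr")     # nil allowed_values behaves like an empty list

p matched, unmatched, no_list
```

An unmatched value is passed through unchanged as a Symbol. According to the changelog, that is the case where `fill!` now raises `AcroForge::Error` instead of warning and skipping.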
@@ -487,7 +578,7 @@ module AcroForge
  @filled_fields[key] = value
  puts " [Filled] :#{key} = #{value}"
  rescue HexaPDF::Error => e
-
+ raise AcroForge::Error, "Field :#{key} rejected by PDF: #{e.message.split(" (HexaPDF").first}"
  end
  else
  @missing_fields << key
@@ -526,8 +617,218 @@ module AcroForge
  :low
  end

+ # Lets Schema.infer classify radios/choices as :select instead of :boolean.
+ # Returns an empty hash (never nil) so callers can `.any?` / `.empty?` uniformly.
+ def preserved_options_map(field)
+ if field.is_a?(HexaPDF::Type::AcroForm::ButtonField) && field.radio_button?
+ vals = field.allowed_values
+ return {} unless vals && !vals.empty?
+ vals.each_with_object({}) { |v, h| h[v.to_s] = v.to_s }
+ elsif field.is_a?(HexaPDF::Type::AcroForm::ChoiceField)
+ items = field.option_items
+ return {} unless items && !items.empty?
+ items.each_with_object({}) { |v, h| h[v.to_s] = v.to_s }
+ else
+ {}
+ end
+ end
+
+ def image_upload?(field)
+ field.is_a?(HexaPDF::Type::AcroForm::ButtonField) &&
+ field.respond_to?(:push_button?) && field.push_button?
+ end
+
+ def image_path?(value)
+ value.is_a?(String) && File.file?(value)
+ end
+
+ def stamp_image_on_widget(doc, field, path)
+ format, image_width, image_height = validate_image!(path)
+ widget = field.each_widget.first
+ return unless widget && widget[:Rect]
+
+ page = doc.pages.find { |candidate_page| candidate_page[:Annots]&.include?(widget) }
+ return unless page
+
+ # Widget Rect is absolute page coords; canvas API is MediaBox-relative.
+ media_box_x, media_box_y = page.box.value[0], page.box.value[1]
+ rect_x_min, rect_y_min, rect_x_max, rect_y_max = widget[:Rect]
+ slot_width = rect_x_max - rect_x_min
+ slot_height = rect_y_max - rect_y_min
+ slot_canvas_x = rect_x_min - media_box_x
+ slot_canvas_y = rect_y_min - media_box_y
+
+ stamp_path = prepare_image_for_slot(path, format, image_width, image_height,
+ slot_width, slot_height) || path
+ if stamp_path != path
+ _, image_width, image_height = image_dimensions(stamp_path)
+ end
+
+ draw_width, draw_height = fit_inside(image_width, image_height, slot_width, slot_height)
+ draw_x = slot_canvas_x + (slot_width - draw_width) / 2.0
+ draw_y = slot_canvas_y + (slot_height - draw_height) / 2.0
+
+ canvas = page.canvas(type: :overlay)
+ canvas.fill_color(255, 255, 255)
+ canvas.rectangle(slot_canvas_x, slot_canvas_y, slot_width, slot_height).fill
+ canvas.image(stamp_path, at: [draw_x, draw_y], width: draw_width, height: draw_height)
+
+ # Bake into the page so the widget's empty appearance doesn't repaint over the image.
+ page[:Annots].delete(widget)
+ end
+
+ def fit_inside(image_width, image_height, slot_width, slot_height)
+ scale = [slot_width.to_f / image_width, slot_height.to_f / image_height].min
+ [image_width * scale, image_height * scale]
+ end
+
+ # Trim removes the transparent border around a signature; downsample
+ # caps source resolution at TARGET_PPI for the widget's longer side.
+ def prepare_image_for_slot(path, format, image_width, image_height,
+ slot_width_pt, slot_height_pt)
+ return nil unless imagemagick_available?
+ slot_max_pt = [slot_width_pt, slot_height_pt].max
+ target_max_px = (slot_max_pt * TARGET_PPI / 72.0).ceil
+ needs_resize = image_width > target_max_px * 2 || image_height > target_max_px * 2
+ needs_trim = format == :png && png_with_alpha?(path)
+ return nil unless needs_resize || needs_trim
+
+ ext = (format == :png) ? ".png" : ".jpg"
+ require "securerandom"
+ require "tmpdir"
+ output_path = File.join(Dir.tmpdir,
+ "acroforge_stamp_#{Process.pid}_#{SecureRandom.hex(4)}#{ext}")
+ # `format:path` locks the coder, closing the CVE-2016-3714 (ImageTragick) class of attack.
+ args = ["convert", "#{format}:#{path}"]
+ args.push("-trim", "+repage") if needs_trim
+ args.push("-resize", "#{target_max_px}x#{target_max_px}>") if needs_resize
+ args.push(output_path)
+ success = system(*args, out: File::NULL, err: File::NULL)
+ (success && File.exist?(output_path)) ? output_path : nil
+ end
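The sizing arithmetic in `prepare_image_for_slot` and `fit_inside` is easier to follow with numbers. The sketch below copies the two formulas into standalone methods and runs them on a hypothetical 144 x 36 pt signature slot and a 2400 x 600 px scan; the slot and image sizes are assumptions chosen for illustration, and `target_max_px` is a helper name introduced here.

```ruby
TARGET_PPI = 200 # same value as Engine::TARGET_PPI in this diff

# Copy of Engine#fit_inside: scale the image uniformly so it fits inside the slot.
def fit_inside(image_width, image_height, slot_width, slot_height)
  scale = [slot_width.to_f / image_width, slot_height.to_f / image_height].min
  [image_width * scale, image_height * scale]
end

# Hypothetical helper holding the threshold formula from prepare_image_for_slot.
# PDF points are 1/72 inch, so points * PPI / 72 gives pixels.
def target_max_px(slot_width_pt, slot_height_pt)
  ([slot_width_pt, slot_height_pt].max * TARGET_PPI / 72.0).ceil
end

# Illustrative slot and image sizes.
limit = target_max_px(144, 36)
needs_resize = 2400 > limit * 2 || 600 > limit * 2
draw_width, draw_height = fit_inside(2400, 600, 144, 36)

puts "pixel cap for the longer side: #{limit}"
puts "resize triggered: #{needs_resize}"
puts "drawn size in points: #{draw_width.round(2)} x #{draw_height.round(2)}"
```

For this slot the cap works out to 400 px, and resizing only triggers once a side exceeds twice that, so moderately oversized images are left alone. The drawn size fills the slot exactly here because the image and the slot share a 4:1 aspect ratio.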
+
+ # PNG color type 4 = greyscale+alpha, 6 = RGBA — only these are trim-worthy.
+ def png_with_alpha?(path)
+ File.open(path, "rb") do |io|
+ return false unless io.read(8) == "\x89PNG\r\n\x1A\n".b
+ return false if io.read(8).nil?
+ return false if io.read(8).nil?
+ return false if io.read(1).nil?
+ color_type_byte = io.read(1)
+ return false if color_type_byte.nil?
+ color_type = color_type_byte.unpack1("C")
+ color_type == 4 || color_type == 6
+ end
+ end
+
+ def imagemagick_available?
+ return @imagemagick_available if defined?(@imagemagick_available)
+ @imagemagick_available = system("which", "convert", out: File::NULL, err: File::NULL)
+ end
+
+ # Trust boundary in front of ImageMagick: any malformed-input path
+ # raises a single error class so worker retry policies can key on it.
+ def validate_image!(path)
+ size = File.size(path)
+ if size > MAX_IMAGE_BYTES
+ raise ImageTooLargeError, "#{path}: #{size} bytes exceeds #{MAX_IMAGE_BYTES} byte cap"
+ end
+ format, width, height = image_dimensions(path)
+ if width > MAX_IMAGE_DIMENSION || height > MAX_IMAGE_DIMENSION
+ raise ImageTooLargeError,
+ "#{path}: #{width}x#{height}px exceeds #{MAX_IMAGE_DIMENSION}px per side"
+ end
+ [format, width, height]
+ end
+
+ def image_dimensions(path)
+ File.open(path, "rb") do |io|
+ head = read_exact(io, 8, path)
+ io.rewind
+ if head.start_with?("\x89PNG\r\n\x1A\n".b)
+ width, height = read_png_dimensions(io, path)
+ [:png, width, height]
+ elsif head[0, 2] == "\xFF\xD8".b
+ width, height = read_jpeg_dimensions(io, path)
+ [:jpg, width, height]
+ else
+ raise_unsupported(path)
+ end
+ end
+ end
+
+ def read_png_dimensions(io, path)
+ read_exact(io, 16, path) # 8-byte signature + 4 length + "IHDR"
+ width = read_exact(io, 4, path).unpack1("N")
+ height = read_exact(io, 4, path).unpack1("N")
+ [width, height]
+ end
+
+ def read_jpeg_dimensions(io, path)
+ read_exact(io, 2, path) # SOI
+ loop do
+ marker_byte = read_exact(io, 1, path).getbyte(0)
+ raise_unsupported(path) unless marker_byte == 0xFF
+ # Runs of 0xFF are valid JPEG fill bytes between markers.
+ marker_code = read_exact(io, 1, path).getbyte(0)
+ marker_code = read_exact(io, 1, path).getbyte(0) while marker_code == 0xFF
+ raise_unsupported(path, "no SOF marker found") if marker_code == 0xD9 || marker_code == 0x00
+ # 0xD0..0xD7 and 0x01 are standalone markers — no length follows.
+ next if (0xD0..0xD7).cover?(marker_code) || marker_code == 0x01
+ segment_length = read_exact(io, 2, path).unpack1("n")
+ raise_unsupported(path, "negative segment length") if segment_length < 2
+ is_sof_marker = (0xC0..0xCF).cover?(marker_code) && ![0xC4, 0xC8, 0xCC].include?(marker_code)
+ if is_sof_marker
+ read_exact(io, 1, path) # precision
+ height = read_exact(io, 2, path).unpack1("n")
+ width = read_exact(io, 2, path).unpack1("n")
+ return [width, height]
+ else
+ read_exact(io, segment_length - 2, path)
+ end
+ end
+ end
+
+ def read_exact(io, byte_count, path)
+ buf = io.read(byte_count)
+ raise_unsupported(path, "truncated header") if buf.nil? || buf.bytesize < byte_count
+ buf
+ end
+
+ def raise_unsupported(path, reason = "only JPG and PNG are supported")
+ raise UnsupportedImageFormatError, "#{path}: #{reason}"
+ end
+
+ # Heuristic key for push-button image fields based on widget aspect ratio.
+ # Vendor forms reserve square-ish boxes for headshots and wide-thin strips
+ # for signatures; the slot's geometry is a more reliable signal than any
+ # nearby text because the labels are usually far from the box.
+ def infer_image_field_key(rect)
+ width = rect[2] - rect[0]
+ height = rect[3] - rect[1]
+ return nil if width <= 0 || height <= 0
+ aspect_ratio = width.to_f / height
+ if aspect_ratio > 3.0
+ :signature
+ elsif aspect_ratio.between?(0.5, 2.0) && [width, height].min >= 30
+ :passport_photo
+ else
+ :photo
+ end
+ end
+
+ def looks_like_clean_identifier?(name)
+ return false unless name.is_a?(String) && !name.empty?
+ return false unless name.match?(/\A[a-z][a-z0-9_]*\z/)
+ return false if name.match?(/\A(?:page|text|field|image)\d/)
+ return false if name.match?(/_(?:field|text|image)\d/)
+ true
+ end
+
  def sanitize_key(string)
-
+ cleaned = AcroForge::Labels.strip_parenthetical(string)
+
+ key = cleaned.downcase
  .gsub(/['’*]/, "")
  .gsub(/[^a-z0-9]+/, "_").squeeze("_")
  .sub(/_$/, "")
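The changelog's `Date of Birth (YYYY-MM-DD)` example can be traced through the `sanitize_key` steps that are visible in this hunk. The sketch below is only an approximation: the body of `AcroForge::Labels.strip_parenthetical` is not shown in this part of the diff, so the `strip_parenthetical` here is a hypothetical stand-in, and any `sanitize_key` steps after the visible `.sub(/_$/, "")` are left out.

```ruby
# Hypothetical stand-in for AcroForge::Labels.strip_parenthetical (real body not shown here):
# drops each "(...)" group together with the whitespace in front of it.
def strip_parenthetical(string)
  string.gsub(/\s*\([^)]*\)/, "").strip
end

# Only the sanitize_key steps visible in the diff; later steps are elided there.
def sanitize_key_sketch(string)
  strip_parenthetical(string)
    .downcase
    .gsub(/['’*]/, "")
    .gsub(/[^a-z0-9]+/, "_").squeeze("_")
    .sub(/_$/, "")
end

puts sanitize_key_sketch("Date of Birth (YYYY-MM-DD)")
puts sanitize_key_sketch("Applicant's Name*")
```

Under these assumptions the first label becomes `date_of_birth`, matching the changelog, and the second becomes `applicants_name` because apostrophes and asterisks are removed before the remaining punctuation is collapsed to underscores.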
@@ -735,24 +1036,25 @@ module AcroForge
|
|
|
735
1036
|
payload.each do |key, value|
|
|
736
1037
|
next if value.nil? || value.to_s.empty?
|
|
737
1038
|
|
|
738
|
-
# Strip
|
|
1039
|
+
# Strip the _N collision suffix that compile! appends when multiple
|
|
1040
|
+
# fields share a name. The bare key drives schema and override lookup.
|
|
739
1041
|
key_str = key.to_s
|
|
740
|
-
base_key = key_str.sub(/
|
|
1042
|
+
base_key = key_str.sub(/_\d+\z/, "").to_sym
|
|
741
1043
|
|
|
742
|
-
# Try to resolve override info. @overrides may be keyed by
|
|
743
|
-
# original PDF field names (strings like "page0_field6") so allow lookup
|
|
744
|
-
# by semantic base_key (matching value[:key]) or by string key.
|
|
745
1044
|
override_info = @overrides[base_key] || @overrides[base_key.to_s] || @overrides.values.find { |v| v.is_a?(Hash) && v[:key].to_sym == base_key }
|
|
746
1045
|
|
|
747
1046
|
type_info = @schema[base_key]
|
|
748
1047
|
|
|
749
|
-
#
|
|
750
|
-
|
|
751
|
-
|
|
752
|
-
|
|
1048
|
+
# Treat fields with a known options set as :select regardless of how
|
|
1049
|
+
# the schema labels them — that's how button/choice fields are
|
|
1050
|
+
# validated against their allowed values.
|
|
1051
|
+
pdf_options_for_type = @select_field_options[key.to_s]&.keys || []
|
|
1052
|
+
type = if override_info
|
|
753
1053
|
override_info[:type]
|
|
754
1054
|
elsif type_info
|
|
755
1055
|
type_info.is_a?(Hash) ? type_info[:type] : :string
|
|
1056
|
+
elsif !pdf_options_for_type.empty?
|
|
1057
|
+
:select
|
|
756
1058
|
else
|
|
757
1059
|
infer_type(key)
|
|
758
1060
|
end
|
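Reviewer note: a minimal sketch of the key and type resolution in this hunk, assuming plain hashes in place of the engine's `@overrides`, `@schema` and `@select_field_options`. `base_key_for` and `resolve_type` are hypothetical helper names, the `.values.find` override lookup is omitted for brevity, and the final `:string` is only a stand-in for the engine's `infer_type(key)`, which is not shown here.

```ruby
# Strips a trailing _N suffix, mirroring the regex in the diff.
def base_key_for(key)
  key.to_s.sub(/_\d+\z/, "").to_sym
end

# Hypothetical helper: same precedence as the visible if/elsif chain.
def resolve_type(key, overrides: {}, schema: {}, select_options: {})
  base_key = base_key_for(key)
  override_info = overrides[base_key] || overrides[base_key.to_s]
  type_info = schema[base_key]
  # Options are looked up by the full key, suffix included, as in the diff.
  options = select_options[key.to_s]&.keys || []

  if override_info
    override_info[:type]
  elsif type_info
    type_info.is_a?(Hash) ? type_info[:type] : :string
  elsif !options.empty?
    :select
  else
    :string # stand-in for infer_type(key)
  end
end

OPTIONS = {"title" => {"Mr" => "0", "Mrs" => "1"}}.freeze

FROM_OPTIONS  = resolve_type("title", select_options: OPTIONS)
FROM_SCHEMA   = resolve_type("title", schema: {title: {type: :string}}, select_options: OPTIONS)
FROM_OVERRIDE = resolve_type("title_2", overrides: {title: {type: :boolean}})
```

Two things stand out when running this. In the lines shown, the options fallback is only reached when neither an override nor a schema entry supplies a type, so a schema entry typed `:string` still wins over a known options set. Also, the suffix regex cannot tell a collision suffix from a name that legitimately ends in a number: `base_key_for("address_line_2")` yields `:address_line`.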
|
@@ -789,9 +1091,11 @@ module AcroForge
|
|
|
789
1091
|
end
|
|
790
1092
|
end
|
|
791
1093
|
|
|
792
|
-
#
|
|
793
|
-
#
|
|
794
|
-
#
|
|
1094
|
+
# Scores every text chunk against the field's rectangle and returns the
|
|
1095
|
+
# best-matching label. `mode` selects which spatial relationship to reward
|
|
1096
|
+
# (inline label, grid header, radio option). The magic offsets below are
|
|
1097
|
+
# hand-tuned weights, not physical distances — they bias one relationship
|
|
1098
|
+
# over another when several candidates sit close to the field.
|
|
795
1099
|
def find_nearest_text(text_chunks, field_rect, mode: :standard)
|
|
796
1100
|
f_x_min, f_y_min, f_x_max, f_y_max = field_rect
|
|
797
1101
|
f_y_center = (f_y_min + f_y_max) / 2.0
|
|
@@ -836,14 +1140,17 @@ module AcroForge
|
|
|
836
1140
|
end
|
|
837
1141
|
|
|
838
1142
|
when :standard
|
|
839
|
-
|
|
1143
|
+
# Tighter dy_top so labels two rows up don't claim an unrelated field.
|
|
1144
|
+
is_grid_locked = dy_top > -5 && dy_top < 12 && t_x_center >= (f_x_min - 20) && t_x_center <= (f_x_max + 20)
|
|
840
1145
|
is_inline = dy_center < 10 && dx_left > -10 && dx_left < 200
|
|
841
1146
|
|
|
842
|
-
|
|
843
|
-
|
|
1147
|
+
# Inline beats grid-locked when both apply: a label sitting on the same
|
|
1148
|
+
# row as its field is a stronger signal than one sitting above it.
|
|
1149
|
+
if is_inline
|
|
1150
|
+
score = dx_left.abs - 3000
|
|
844
1151
|
score -= 200 if has_colon_or_q
|
|
845
|
-
elsif
|
|
846
|
-
score =
|
|
1152
|
+
elsif is_grid_locked
|
|
1153
|
+
score = dy_top.abs - 2000
|
|
847
1154
|
score -= 200 if has_colon_or_q
|
|
848
1155
|
elsif dy_center < 15 && dx_left > -10 && dx_left < 150
|
|
849
1156
|
score = dx_left.abs
|
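Reviewer note: the reordered `:standard` branch can be exercised on its own. The sketch below takes the geometric deltas (`dy_top`, `dy_center`, `dx_left`, `t_x_center`) as ready-made inputs, because the code that computes them is outside this hunk. `standard_score` is a hypothetical name, and the sketch assumes that lower scores win, which the large negative offsets suggest but the diff does not state.

```ruby
# Hypothetical standalone version of the :standard scoring branch.
# Returns nil when none of the visible branches apply.
def standard_score(dy_top:, dy_center:, dx_left:, t_x_center:, f_x_min:, f_x_max:, has_colon_or_q: false)
  is_grid_locked = dy_top > -5 && dy_top < 12 &&
    t_x_center >= (f_x_min - 20) && t_x_center <= (f_x_max + 20)
  is_inline = dy_center < 10 && dx_left > -10 && dx_left < 200

  if is_inline
    score = dx_left.abs - 3000
    score -= 200 if has_colon_or_q
    score
  elsif is_grid_locked
    score = dy_top.abs - 2000
    score -= 200 if has_colon_or_q
    score
  elsif dy_center < 15 && dx_left > -10 && dx_left < 150
    dx_left.abs
  end
end

FIELD = {f_x_min: 100, f_x_max: 200}.freeze

# A label on the same row, 50 units to the left of the field.
INLINE = standard_score(dy_top: 40, dy_center: 2, dx_left: 50, t_x_center: 0, **FIELD)
# A header sitting just above the field, horizontally aligned with it.
GRID = standard_score(dy_top: 8, dy_center: 20, dx_left: -50, t_x_center: 150, **FIELD)
# A candidate that satisfies both tests is scored as inline.
BOTH = standard_score(dy_top: 8, dy_center: 5, dx_left: 50, t_x_center: 150, **FIELD)
```

With these inputs the inline candidate scores -2950 and the grid header -1992, so under the lower-is-better assumption an inline label outranks a header whenever both are present.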
data/lib/acroforge/labels.rb
CHANGED
|
@@ -22,10 +22,20 @@ module AcroForge
|
|
|
22
22
|
of at by in on to up from with as vs
|
|
23
23
|
].to_set
|
|
24
24
|
|
|
25
|
+
# Parenthetical content in form labels is UI hints, not field identity.
|
|
26
|
+
def strip_parenthetical(text)
|
|
27
|
+
text.to_s.gsub(/\s*\(.*?\)\s*/, " ").gsub(/\s+/, " ").strip
|
|
28
|
+
end
|
|
29
|
+
|
|
25
30
|
def humanize(label)
|
|
26
31
|
return label unless label.is_a?(String) && !label.empty?
|
|
27
32
|
|
|
28
|
-
result =
|
|
33
|
+
result = strip_parenthetical(label)
|
|
34
|
+
# Treat `_` as a word separator so snake_case names from preserved
|
|
35
|
+
# fields title-case correctly. Real PDF labels almost never contain
|
|
36
|
+
# underscores; snake_case keys passed through here always do.
|
|
37
|
+
result = result.tr("_", " ")
|
|
38
|
+
result = fix_typos(result)
|
|
29
39
|
result = title_case(result)
|
|
30
40
|
result.gsub(/\s+/, " ").strip
|
|
31
41
|
end
|
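Reviewer note: `strip_parenthetical` is small enough to check by hand. The sketch below copies it verbatim into a top-level method and runs it on a few made-up labels; `fix_typos` and `title_case` are not shown in this hunk, so they are left out.

```ruby
# Verbatim copy of the method added in labels.rb.
def strip_parenthetical(text)
  text.to_s.gsub(/\s*\(.*?\)\s*/, " ").gsub(/\s+/, " ").strip
end

TRAILING = strip_parenthetical("Date of Birth (DD/MM/YYYY)")
MIDDLE   = strip_parenthetical("Phone (mobile) Number")
NESTED   = strip_parenthetical("Amount (GHS (net))")
FROM_NIL = strip_parenthetical(nil)

# The extra step humanize now applies before title-casing.
SNAKE = strip_parenthetical("date_signed").tr("_", " ")
```

The non-greedy `.*?` stops at the first closing parenthesis, so nested parentheses leave a stray `)` behind (`NESTED` is `"Amount )"`). That is probably acceptable for form labels, but worth knowing if labels with nested hints turn up.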
data/lib/acroforge/relabeler.rb
CHANGED
|
@@ -94,14 +94,14 @@ module AcroForge
|
|
|
94
94
|
# we use its proposals directly (no second compile). This lets callers
|
|
95
95
|
# like the CLI's `bootstrap` subcommand share one compile pass with
|
|
96
96
|
# Schema.infer instead of running the engine twice.
|
|
97
|
-
def propose(pdf_path, out:, schema: {}, mode: :merge, engine: nil)
|
|
97
|
+
def propose(pdf_path, out:, schema: {}, preserve: [], mode: :merge, engine: nil)
|
|
98
98
|
existing = (mode == :merge && File.exist?(out)) ? YAML.load_file(out) : nil
|
|
99
99
|
|
|
100
100
|
proposals = if engine
|
|
101
101
|
engine.field_proposals
|
|
102
102
|
else
|
|
103
103
|
Dir.mktmpdir do |tmp|
|
|
104
|
-
e = AcroForge::Engine.new(pdf_path, schema: schema, normalized_dir: tmp)
|
|
104
|
+
e = AcroForge::Engine.new(pdf_path, schema: schema, preserve: preserve, normalized_dir: tmp)
|
|
105
105
|
e.compile!
|
|
106
106
|
e.field_proposals
|
|
107
107
|
end
|
|
@@ -147,16 +147,21 @@ module AcroForge
|
|
|
147
147
|
end
|
|
148
148
|
|
|
149
149
|
def infer_type(proposal)
|
|
150
|
+
# Mirrors AcroForge::Schema#infer_type. Kept separate so Relabeler
|
|
151
|
+
# doesn't have to load the whole Schema module path during mapping
|
|
152
|
+
# propose. Keep these two in sync when you tweak either.
|
|
150
153
|
case proposal[:pdf_field_type]
|
|
151
154
|
when :button
|
|
152
155
|
((proposal[:options]&.size || 0) > 1) ? :select : :boolean
|
|
153
156
|
when :choice
|
|
154
157
|
:select
|
|
155
158
|
else
|
|
156
|
-
|
|
159
|
+
# Normalize underscores to spaces so `\b`-anchored regexes match
|
|
160
|
+
# against snake_case names like `account_no`, `date_signed`.
|
|
161
|
+
label = proposal[:raw_label].to_s.downcase.tr("_", " ")
|
|
157
162
|
case label
|
|
158
163
|
when /amount|salary|income|balance|fee|tier3/ then :money
|
|
159
|
-
when /\bdate\b|birth|expiry|employed/ then :date
|
|
164
|
+
when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
|
|
160
165
|
when /email/ then :email
|
|
161
166
|
when /years|tenor|number of|\bno\.?\b/ then :number
|
|
162
167
|
else :string
|
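Reviewer note: the reason for `tr("_", " ")` is that `_` counts as a word character in Ruby regexes, so `\b` never fires between `date` and `_signed`. The sketch below reproduces only the label branch of `infer_type` as a hypothetical `label_type` helper; the `:button` and `:choice` branches are left out.

```ruby
DATE_PATTERN = /\bdate\b|\bdob\b|birth|expiry|employed/

# Without normalization the word boundary is missing.
RAW_MATCH        = "date_signed".match?(DATE_PATTERN)
NORMALIZED_MATCH = "date_signed".tr("_", " ").match?(DATE_PATTERN)

# Hypothetical helper mirroring the `else` branch of infer_type.
def label_type(raw_label)
  label = raw_label.to_s.downcase.tr("_", " ")
  case label
  when /amount|salary|income|balance|fee|tier3/ then :money
  when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
  when /email/ then :email
  when /years|tenor|number of|\bno\.?\b/ then :number
  else :string
  end
end

# Made-up labels for illustration.
TYPES = %w[date_signed account_no DOB monthly_income employee_name]
  .to_h { |l| [l, label_type(l)] }
```

`date_signed` and `account_no` only resolve to `:date` and `:number` because of the normalization step. Since the same `case` lives in both `Relabeler` and `Schema`, a shared test table like `TYPES` is one way to keep the two copies in sync.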
data/lib/acroforge/schema.rb
CHANGED
|
@@ -12,7 +12,9 @@ module AcroForge
|
|
|
12
12
|
def load(path)
|
|
13
13
|
raw = case File.extname(path).downcase
|
|
14
14
|
when ".yml", ".yaml"
|
|
15
|
-
|
|
15
|
+
# safe_load_file was added in Psych 4 (Ruby 3.1+); safe_load(File.read)
|
|
16
|
+
# keeps the gem usable on Ruby 2.7 as the gemspec advertises.
|
|
17
|
+
YAML.safe_load(File.read(path), permitted_classes: [Symbol], aliases: true) # standard:disable Style/YAMLFileRead
|
|
16
18
|
when ".json"
|
|
17
19
|
JSON.parse(File.read(path), symbolize_names: false)
|
|
18
20
|
else
|
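Reviewer note: a minimal sketch of what the new YAML branch accepts, assuming a schema file that uses symbol values and a YAML alias. `load_yaml_schema` is a hypothetical wrapper around the same `YAML.safe_load` call, and the schema content is invented for the example.

```ruby
require "yaml"
require "tempfile"

# Hypothetical wrapper around the call used in Schema.load.
def load_yaml_schema(path)
  YAML.safe_load(File.read(path), permitted_classes: [Symbol], aliases: true)
end

# Invented schema: symbol keys/values plus an anchor reused via an alias.
SAMPLE_YAML = <<~YAML
  gender:
    :type: :select
  base: &base
    :type: :string
  first_name: *base
YAML

LOADED = Tempfile.create(["schema", ".yml"]) do |file|
  file.write(SAMPLE_YAML)
  file.flush
  load_yaml_schema(file.path)
end
```

Top-level keys come back as strings while `:type` and its values come back as symbols, which is why `Symbol` has to be in `permitted_classes`. Dropping `aliases: true` would make the `*base` reference raise instead of resolving.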
|
@@ -98,24 +100,26 @@ module AcroForge
|
|
|
98
100
|
# we use its proposals directly. This lets callers (notably the CLI's
|
|
99
101
|
# `bootstrap` subcommand) avoid a redundant second compile when they
|
|
100
102
|
# also want to call Relabeler.propose on the same PDF.
|
|
101
|
-
def infer(pdf_path, sections: [], engine: nil)
|
|
103
|
+
def infer(pdf_path, sections: [], preserve: [], engine: nil)
|
|
102
104
|
return aggregate_proposals(engine.field_proposals) if engine
|
|
103
105
|
|
|
104
106
|
require "tmpdir"
|
|
105
107
|
Dir.mktmpdir do |tmp|
|
|
106
|
-
e = AcroForge::Engine.new(pdf_path, sections: sections, normalized_dir: tmp)
|
|
108
|
+
e = AcroForge::Engine.new(pdf_path, sections: sections, preserve: preserve, normalized_dir: tmp)
|
|
107
109
|
e.compile!
|
|
108
110
|
aggregate_proposals(e.field_proposals)
|
|
109
111
|
end
|
|
110
112
|
end
|
|
111
113
|
|
|
112
114
|
def aggregate_proposals(proposals)
|
|
113
|
-
proposals.each_with_object({}) do |p, schema|
|
|
115
|
+
result = proposals.each_with_object({}) do |p, schema|
|
|
114
116
|
next if p[:canonical_key].nil?
|
|
115
117
|
|
|
116
118
|
key = p[:canonical_key].to_sym
|
|
117
119
|
schema[key] ||= {type: infer_type(p), variations: []}
|
|
118
|
-
|
|
120
|
+
|
|
121
|
+
# Preserved fields' raw_label echoes the key — adding it as a variation is noise.
|
|
122
|
+
if p[:raw_label] && p[:confidence] != :preserved
|
|
119
123
|
cleaned = humanize_label(p[:raw_label])
|
|
120
124
|
schema[key][:variations] << cleaned unless schema[key][:variations].include?(cleaned)
|
|
121
125
|
end
|
|
@@ -124,6 +128,9 @@ module AcroForge
|
|
|
124
128
|
schema[key][:options] = p[:options].keys.map(&:to_sym).uniq
|
|
125
129
|
end
|
|
126
130
|
end
|
|
131
|
+
|
|
132
|
+
result.each_value { |entry| entry.delete(:variations) if entry[:variations].empty? }
|
|
133
|
+
result
|
|
127
134
|
end
|
|
128
135
|
|
|
129
136
|
# Thin delegator. The real implementation lives in AcroForge::Labels so
|
|
@@ -141,10 +148,11 @@ module AcroForge
|
|
|
141
148
|
when :choice
|
|
142
149
|
:select
|
|
143
150
|
else
|
|
144
|
-
|
|
151
|
+
# `_` is a word char, so `\b` regexes need underscores converted to spaces.
|
|
152
|
+
label = proposal[:raw_label].to_s.downcase.tr("_", " ")
|
|
145
153
|
case label
|
|
146
154
|
when /amount|salary|income|balance|fee|tier3/ then :money
|
|
147
|
-
when /\bdate\b|birth|expiry|employed/ then :date
|
|
155
|
+
when /\bdate\b|\bdob\b|birth|expiry|employed/ then :date
|
|
148
156
|
when /email/ then :email
|
|
149
157
|
when /years|tenor|number of|\bno\.?\b/ then :number
|
|
150
158
|
else :string
|
|
@@ -187,6 +195,12 @@ module AcroForge
|
|
|
187
195
|
end
|
|
188
196
|
end
|
|
189
197
|
|
|
198
|
+
# Mirrors aggregate_proposals — keeps merged schemas free of hollow `variations: []`.
|
|
199
|
+
result.each_value do |entry|
|
|
200
|
+
next unless entry.is_a?(Hash) && entry[:variations].is_a?(Array) && entry[:variations].empty?
|
|
201
|
+
entry.delete(:variations)
|
|
202
|
+
end
|
|
203
|
+
|
|
190
204
|
result
|
|
191
205
|
end
|
|
192
206
|
|
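Reviewer note: the effect of the two `aggregate_proposals` changes (skipping preserved labels, pruning empty `variations`) can be seen with a cut-down version. In the sketch below `aggregate` is a hypothetical name, the type is fixed to `:string` instead of calling `infer_type`, `strip` stands in for `humanize_label`, and `:high` is an invented confidence value.

```ruby
# Hypothetical, simplified version of Schema.aggregate_proposals.
def aggregate(proposals)
  result = proposals.each_with_object({}) do |p, schema|
    next if p[:canonical_key].nil?

    key = p[:canonical_key].to_sym
    schema[key] ||= {type: :string, variations: []}

    # Preserved fields echo their own key as raw_label, so skip them.
    if p[:raw_label] && p[:confidence] != :preserved
      cleaned = p[:raw_label].to_s.strip # stand-in for humanize_label
      schema[key][:variations] << cleaned unless schema[key][:variations].include?(cleaned)
    end
  end

  # Drop hollow `variations: []` entries, as the diff does.
  result.each_value { |entry| entry.delete(:variations) if entry[:variations].empty? }
  result
end

PROPOSALS = [
  {canonical_key: "first_name", raw_label: "First Name", confidence: :high},
  {canonical_key: "first_name", raw_label: "First Name", confidence: :high},
  {canonical_key: "dob", raw_label: "dob", confidence: :preserved},
  {canonical_key: nil, raw_label: "Orphan"}
].freeze

SCHEMA = aggregate(PROPOSALS)
```

The preserved `dob` field ends up as `{type: :string}` with no `variations` key at all, while `first_name` keeps a single deduplicated variation.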
data/lib/acroforge/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: acroforge
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.2.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Maxwell Nana Forson
|
|
@@ -75,7 +75,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
|
75
75
|
- !ruby/object:Gem::Version
|
|
76
76
|
version: '0'
|
|
77
77
|
requirements: []
|
|
78
|
-
rubygems_version: 4.0.
|
|
78
|
+
rubygems_version: 4.0.12
|
|
79
79
|
specification_version: 4
|
|
80
80
|
summary: PDF AcroForm engine with heuristic-assisted field relabeling.
|
|
81
81
|
test_files: []
|