acro_that 0.1.0

@@ -0,0 +1,251 @@
1
+ # PDF File Structure
2
+
3
+ ## Overview
4
+
5
+ PDF (Portable Document Format) files have a reputation for being complex binary formats, but at their core, they are **text-based files with a structured syntax**. Understanding this fundamental fact is key to understanding how PDF works.
6
+
7
+ While PDFs can contain binary data (like compressed streams, images, and fonts), the **structure** of a PDF—its objects, dictionaries, arrays, and references—is defined using plain text syntax.
8
+
9
+ ## PDF File Anatomy
10
+
11
+ A PDF file consists of several main parts:
12
+
13
+ 1. **Header**: `%PDF-1.4` (or similar version)
14
+ 2. **Body**: A collection of PDF objects (the actual content)
15
+ 3. **Cross-Reference Table (xref)**: Points to byte offsets of objects
16
+ 4. **Trailer**: Contains the root object reference and metadata
17
+ 5. **EOF Marker**: `%%EOF`
18
+
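To make the five parts concrete, here is a minimal Ruby sketch that locates each one in a tiny hand-written sample using plain string searches. The `locate_pdf_parts` helper and the sample text are hypothetical (the sample is abbreviated and its offsets are not real), and the sketch assumes a classic `xref` table rather than a cross-reference stream.

```ruby
# Hypothetical helper: report where each structural part of a PDF-like string sits.
def locate_pdf_parts(bytes)
  {
    header: bytes[/\A%PDF-\d\.\d/],            # version marker, e.g. "%PDF-1.4"
    first_object: bytes.index(/\d+ \d+ obj/),  # the body starts at the first object
    xref: bytes.index(/^xref\b/),              # classic cross-reference table
    trailer: bytes.index(/^trailer\b/),
    eof: bytes.rindex("%%EOF")
  }
end

# Abbreviated sample: the xref entries are omitted and the startxref offset is made up.
SAMPLE = <<~PDF
  %PDF-1.4
  1 0 obj
  << /Type /Catalog /Pages 2 0 R >>
  endobj
  xref
  0 2
  trailer
  << /Root 1 0 R /Size 2 >>
  startxref
  60
  %%EOF
PDF

parts = locate_pdf_parts(SAMPLE)
puts parts[:header] # prints "%PDF-1.4"
```

For a real file, reading the bytes with `File.binread` keeps embedded binary data from tripping up encoding-aware string methods.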
19
+ ### PDF Objects
20
+
21
+ The body contains PDF objects. Each object has:
22
+ - An object number and generation number (e.g., `5 0 obj`)
23
+ - Content (dictionary, array, stream, etc.)
24
+ - An `endobj` marker
25
+
26
+ Example:
27
+ ```
28
+ 5 0 obj
29
+ << /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>
30
+ endobj
31
+ ```
32
+
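As an illustration of the object wrapper, the following sketch pulls the object number, generation number, and body out of the example above with one regular expression. `OBJECT_RE` and `split_objects` are hypothetical names, and the pattern assumes the literal text `endobj` does not occur inside a stream body.

```ruby
# Hypothetical pattern: "<num> <gen> obj", then the body, then "endobj".
OBJECT_RE = /(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj/m

# Return one hash per object found in the text.
def split_objects(text)
  text.scan(OBJECT_RE).map do |num, gen, body|
    { ref: [num.to_i, gen.to_i], body: body.strip }
  end
end

text = "5 0 obj\n<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>\nendobj\n"
objects = split_objects(text)
p objects.first[:ref] # prints [5, 0]
```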
33
+ ## PDF Dictionaries
34
+
35
+ **PDF dictionaries are the heart of PDF structure.** They're defined using angle brackets:
36
+
37
+ ```
38
+ << /Key1 value1 /Key2 value2 /Key3 value3 >>
39
+ ```
40
+
41
+ Think of them like JSON objects or Ruby hashes, but with PDF-specific syntax:
42
+ - Keys are PDF names (always start with `/`)
43
+ - Values can be: names, strings, numbers, booleans, `null`, arrays, dictionaries, or object references
44
+ - Whitespace between tokens is not significant, but it is required wherever two tokens would otherwise run together (for example, between the numbers in `5 0 R`)
45
+
46
+ ### Dictionary Examples
47
+
48
+ **Simple dictionary:**
49
+ ```
50
+ << /Type /Page /Width 612 /Height 792 >>
51
+ ```
52
+
53
+ **Nested dictionary:**
54
+ ```
55
+ <<
56
+   /Type /Annot
57
+   /Subtype /Widget
58
+   /Rect [100 500 200 520]
59
+   /AP <<
60
+     /N <<
61
+       /Yes 10 0 R
62
+       /Off 11 0 R
63
+     >>
64
+   >>
65
+ >>
66
+ ```
67
+
68
+ **Dictionary with array:**
69
+ ```
70
+ << /Kids [5 0 R 6 0 R 7 0 R] >>
71
+ ```
72
+
73
+ **Dictionary with string values:**
74
+ ```
75
+ << /Title (My Document) /Author (John Doe) >>
76
+ ```
77
+
78
+ The parentheses `()` denote literal strings in PDF syntax. Hex strings use angle brackets: `<>`.
79
+
80
+ ## PDF Text-Based Syntax
81
+
82
+ Despite being "binary" files, PDFs use text-based syntax for their structure. This means:
83
+
84
+ 1. **Dictionaries are text**: `<< ... >>` are just character sequences
85
+ 2. **Arrays are text**: `[ ... ]` are just character sequences
86
+ 3. **References are text**: `5 0 R` means "object 5, generation 0"
87
+ 4. **Strings can be text or hex**: `(Hello)` or `<48656C6C6F>`
88
+
89
+ ### Why This Matters
90
+
91
+ Because PDF dictionaries are just text with delimiters (`<<`, `>>`), we can parse them using **simple text traversal algorithms**—no complex parser generator, no AST construction, just:
92
+
93
+ 1. Find opening `<<`
94
+ 2. Track nesting depth by counting `<<` and `>>`
95
+ 3. When depth reaches zero, we've found a complete dictionary
96
+ 4. Repeat
97
+
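The four steps above can be sketched in a few lines of Ruby. `find_dictionaries` is a hypothetical helper written for this illustration, not the library's own method, and it deliberately ignores one complication: a `<<` or `>>` that appears inside a literal string would be counted as a delimiter.

```ruby
# Hypothetical helper: collect every top-level << ... >> span in a string.
def find_dictionaries(text)
  found = []
  pos = 0
  while (start = text.index("<<", pos))   # step 1: find an opening <<
    depth = 0
    i = start
    finish = nil
    while i < text.length - 1
      case text[i, 2]
      when "<<"                            # step 2: track nesting depth
        depth += 1
        i += 2
      when ">>"
        depth -= 1
        i += 2
        if depth.zero?                     # step 3: depth is back to zero
          finish = i
          break
        end
      else
        i += 1
      end
    end
    break unless finish # unbalanced input: stop rather than guess

    found << text[start...finish]
    pos = finish                           # step 4: repeat after this dictionary
  end
  found
end

p find_dictionaries("<< /A << /B 1 >> >> junk << /C 2 >>")
# prints ["<< /A << /B 1 >> >>", "<< /C 2 >>"]
```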
98
+ ## PDF Object References
99
+
100
+ PDFs use references to link objects together:
101
+ ```
102
+ 5 0 R
103
+ ```
104
+
105
+ This means:
106
+ - Object number: `5`
107
+ - Generation number: `0` (nonzero only when an object number has been freed and then reused by a later update)
108
+ - `R` means "reference"
109
+
110
+ When you see `/Parent 5 0 R`, it means the `Parent` key references object 5.
111
+
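Since a reference is just two integers followed by `R`, a single regular expression is enough to pick references out of a value. The sketch below is illustrative; `REF_RE` and `parse_refs` are hypothetical names.

```ruby
# Hypothetical pattern: object number, generation number, then the keyword R.
REF_RE = /(\d+)\s+(\d+)\s+R\b/

# Return every reference in the text as an [object_number, generation] pair.
def parse_refs(text)
  text.scan(REF_RE).map { |num, gen| [num.to_i, gen.to_i] }
end

p parse_refs("/Parent 5 0 R")             # prints [[5, 0]]
p parse_refs("/Kids [5 0 R 6 0 R 7 0 R]") # prints [[5, 0], [6, 0], [7, 0]]
```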
112
+ ## PDF Arrays
113
+
114
+ Arrays are space-separated lists in square brackets:
115
+ ```
116
+ [0 0 612 792]
117
+ ```
118
+
119
+ Can contain any PDF value type:
120
+ ```
121
+ [5 0 R 6 0 R]
122
+ [/Yes /Off]
123
+ [(Hello) (World)]
124
+ ```
125
+
126
+ ## PDF Strings
127
+
128
+ PDF strings come in two flavors:
129
+
130
+ ### Literal Strings (parentheses)
131
+ ```
132
+ (Hello World)
133
+ (Line 1\nLine 2)
134
+ ```
135
+
136
+ Can contain escape sequences: `\n`, `\r`, `\t`, `\b`, `\f`, `\(`, `\)`, `\\`, and octal codes such as `\123`.
137
+
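A minimal sketch of how these escapes could be decoded follows. `unescape_literal` is a hypothetical helper, not the library's decoder; it assumes the token is a complete literal string including its outer parentheses.

```ruby
# Single-character escapes; a backslash before a newline is a line continuation.
SIMPLE_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f", "\n" => ""
}.freeze

# Hypothetical helper: decode the escapes inside a literal string token.
def unescape_literal(token)
  inner = token[1...-1] # drop the surrounding parentheses
  inner.gsub(/\\([0-7]{1,3}|.)/m) do
    code = Regexp.last_match(1)
    if code.match?(/\A[0-7]+\z/)
      (code.to_i(8) & 0xFF).chr        # octal escape such as \123
    else
      SIMPLE_ESCAPES.fetch(code, code) # \( \) and \\ become the bare character
    end
  end
end

p unescape_literal('(Line 1\nLine 2)') # prints "Line 1\nLine 2"
p unescape_literal('(\123cale)')       # prints "Scale"
```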
138
+ ### Hex Strings (angle brackets)
139
+ ```
140
+ <48656C6C6F>
141
+ <FEFF00480065006C006C006F>
142
+ ```
143
+
144
+ The hex string `<FEFF...>` with BOM indicates UTF-16BE encoding.
145
+
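The two hex examples above can be decoded with a short sketch like the one below. `decode_hex_string` is a hypothetical helper; it treats a leading `FE FF` byte order mark as the signal for UTF-16BE and otherwise returns the raw bytes.

```ruby
# Hypothetical helper: turn a <...> hex string token into text.
def decode_hex_string(token)
  hex = token[1...-1].gsub(/\s+/, "") # drop < > and any embedded whitespace
  hex += "0" if hex.length.odd?       # a missing final digit is treated as 0
  bytes = [hex].pack("H*")            # binary string built from the hex digits
  if bytes.start_with?("\xFE\xFF".b)
    # Skip the two BOM bytes and transcode the rest from UTF-16BE.
    bytes[2..].force_encoding("UTF-16BE").encode("UTF-8")
  else
    bytes
  end
end

puts decode_hex_string("<48656C6C6F>")               # prints Hello
puts decode_hex_string("<FEFF00480065006C006C006F>") # prints Hello
```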
146
+ ## PDF Names
147
+
148
+ PDF names start with `/`:
149
+ ```
150
+ /Type
151
+ /Subtype
152
+ /Widget
153
+ ```
154
+
155
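+ Names can contain any regular character. Whitespace and delimiter characters (`(`, `)`, `<`, `>`, `[`, `]`, `{`, `}`, `/`, `%`) must be written as a two-digit hex escape such as `#20`.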
156
+
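Names written since PDF 1.2 can encode a character as `#` followed by two hex digits, so a name token sometimes needs a small decoding step before comparison. The sketch below shows one way to do it; `decode_name` is a hypothetical helper.

```ruby
# Hypothetical helper: strip the leading slash and expand #xx hex escapes.
def decode_name(token)
  token.delete_prefix("/").gsub(/#(\h\h)/) { Regexp.last_match(1).hex.chr }
end

p decode_name("/Widget")       # prints "Widget"
p decode_name("/First#20Name") # prints "First Name"
```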
157
+ ## Stream Objects
158
+
159
+ Some PDF objects contain **streams** (binary or text data):
160
+ ```
161
+ 10 0 obj
162
+ << /Length 100 /Filter /FlateDecode >>
163
+ stream
164
+ [compressed binary data here]
165
+ endstream
166
+ endobj
167
+ ```
168
+
169
+ For parsing structure (dictionaries), we typically strip or ignore stream bodies because they can contain arbitrary binary data that would confuse text-based parsing.
170
+
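One way to strip stream bodies before scanning for dictionaries is sketched below. `strip_streams` is a hypothetical helper that works on a binary copy of the data and keeps the `stream`/`endstream` keywords in place. It assumes the text `endstream` does not occur inside the stream data; a stricter version would honor `/Length` instead.

```ruby
# Hypothetical helper: blank out everything between "stream" and "endstream".
def strip_streams(bytes)
  # .b returns a binary copy, so arbitrary bytes cannot raise encoding errors.
  bytes.b.gsub(/\bstream\r?\n.*?endstream\b/m, "stream\nendstream")
end

# Sample object whose stream data contains a stray "<<" that would confuse a scanner.
raw = "10 0 obj\n<< /Length 5 /Filter /FlateDecode >>\nstream\n\x00\xFF<<\x01\nendstream\nendobj\n".b
cleaned = strip_streams(raw)
puts cleaned
```

After this step the only `<<` left in the sample is the one that opens the stream dictionary.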
171
+ ## Why AcroThat Works
172
+
173
+ `AcroThat` works because **PDF dictionaries are just text patterns**. Despite looking complicated, the algorithms are straightforward:
174
+
175
+ ### Finding Dictionaries
176
+
177
+ The `each_dictionary` method:
178
+ 1. Searches for `<<` (start of dictionary)
179
+ 2. Tracks nesting depth: `<<` increments, `>>` decrements
180
+ 3. When depth returns to 0, we've found a complete dictionary
181
+ 4. Yield it and continue searching
182
+
183
+ This is **pure text traversal**—no PDF-specific knowledge beyond "dictionaries use `<<` and `>>`".
184
+
185
+ ### Extracting Values
186
+
187
+ The `value_token_after` method:
188
+ 1. Finds a key (like `/V`)
189
+ 2. Skips whitespace
190
+ 3. Based on the next character, extracts the value:
191
+ - `(` → Extract literal string (handle escaping)
192
+ - `<` → Extract hex string or dictionary
193
+ - `[` → Extract array (match brackets)
194
+ - `/` → Extract name
195
+ - Otherwise → Extract atom (number, reference, etc.)
196
+
197
+ Again, this is just **text pattern matching** with some bracket/depth tracking.
198
+
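The dispatch on the first character can be sketched as follows. This is an illustration of the idea rather than the library's implementation: `token_after`, `balanced_prefix`, and `dictionary_prefix` are hypothetical names, and the sketch does not guard against a key that also appears earlier as a value or inside a string.

```ruby
# Prefix of text up to the delimiter that closes its first character.
# Single-character delimiters only; a backslash escapes the next character.
def balanced_prefix(text, open, close)
  depth = 0
  i = 0
  while i < text.length
    ch = text[i]
    if ch == "\\"
      i += 1 # skip the escaped character
    elsif ch == open
      depth += 1
    elsif ch == close
      depth -= 1
      return text[0..i] if depth.zero?
    end
    i += 1
  end
  nil
end

# Same idea for the two-character << >> delimiters.
def dictionary_prefix(text)
  depth = 0
  i = 0
  while i < text.length - 1
    case text[i, 2]
    when "<<"
      depth += 1
      i += 2
    when ">>"
      depth -= 1
      i += 2
      return text[0...i] if depth.zero?
    else
      i += 1
    end
  end
  nil
end

# Find the key, skip whitespace, then choose an extractor from the next character.
def token_after(key, dict)
  match = dict.match(/#{Regexp.escape(key)}(?![A-Za-z0-9])\s*/)
  return nil unless match

  rest = match.post_match
  case rest[0]
  when "(" then balanced_prefix(rest, "(", ")")
  when "[" then balanced_prefix(rest, "[", "]")
  when "<" then rest.start_with?("<<") ? dictionary_prefix(rest) : rest[/\A<[^>]*>/]
  when "/" then rest[%r{\A/[^\s/\[\]()<>]*}]
  else rest[/\A\d+\s+\d+\s+R\b/] || rest[/\A[^\s\/\[\]()<>]+/]
  end
end

dict = "<< /Type /Annot /V (Hello World) /Rect [100 500 200 520] /Parent 5 0 R >>"
p token_after("/V", dict)      # prints "(Hello World)"
p token_after("/Parent", dict) # prints "5 0 R"
```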
199
+ ### Why It Seems Complicated
200
+
201
+ The complexity comes from:
202
+ 1. **Handling edge cases**: Escaped characters, nested structures, various value types
203
+ 2. **Preserving exact formatting**: When replacing values, we must maintain valid PDF syntax
204
+ 3. **Encoding/decoding**: PDF strings have special encoding rules (UTF-16BE BOM, escapes)
205
+ 4. **Safety checks**: Verifying dictionaries are still valid after modification
206
+
207
+ But the **core concept** is simple: PDF dictionaries are text, so we can parse them with text traversal.
208
+
209
+ ## Example: Walking Through a PDF Dictionary
210
+
211
+ Given this PDF dictionary text:
212
+ ```
213
+ << /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>
214
+ ```
215
+
216
+ How `AcroThat` would parse it:
217
+
218
+ 1. **`each_dictionary` finds it:**
219
+ - Finds `<<` at position 0
220
+ - Depth: 0 → 1 (after `<<`)
221
+ - Scans forward...
222
+ - Finds `>>` at position 74
223
+ - Depth: 1 → 0
224
+ - Yields: `"<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>"`
225
+
226
+ 2. **`value_token_after("/V", dict)` extracts value:**
227
+ - Finds `/V` (followed by space)
228
+ - Skips whitespace
229
+ - Next char is `(`, so extract literal string
230
+ - Scan forward, handle escaping, match closing `)`
231
+ - Returns: `"(Hello World)"`
232
+
233
+ 3. **`decode_pdf_string("(Hello World)")` decodes:**
234
+ - Starts with `(`, ends with `)`
235
+ - Extract inner: `"Hello World"`
236
+ - Unescape (no escapes here)
237
+ - Check for UTF-16BE BOM (none)
238
+ - Return: `"Hello World"`
239
+
240
+ ## Conclusion
241
+
242
+ PDF files are **structured text files** with binary data embedded in streams. The structure itself—dictionaries, arrays, strings, references—is all text-based syntax. This is why `AcroThat` can use simple text traversal to parse and modify PDF dictionaries without needing a full PDF parser.
243
+
244
+ The apparent complexity in `AcroThat` comes from:
245
+ - Handling PDF's various value types
246
+ - Proper encoding/decoding of strings
247
+ - Careful preservation of structure during edits
248
+ - Edge case handling (escaping, nesting, etc.)
249
+
250
+ But the **fundamental approach** is elegantly simple: treat PDF dictionaries as text patterns and parse them with character-by-character traversal.
251
+
@@ -0,0 +1,278 @@
1
+ # frozen_string_literal: true
2
+
3
+ module AcroThat
4
+ module Actions
5
+ # Action to add a new field to a PDF document
6
+ class AddField
7
+ include Base
8
+
9
+ attr_reader :field_obj_num, :field_type, :field_value
10
+
11
+ def initialize(document, name, options = {})
12
+ @document = document
13
+ @name = name
14
+ @options = options
15
+ end
16
+
17
+ def call
18
+ x = @options[:x] || 100
19
+ y = @options[:y] || 500
20
+ width = @options[:width] || 100
21
+ height = @options[:height] || 20
22
+ page_num = @options[:page] || 1
23
+
24
+ # Normalize field type: accept symbols or strings, convert to PDF format
25
+ type_input = @options[:type] || "/Tx"
26
+ @field_type = case type_input
27
+ when :text, "text", "/Tx", "/tx"
28
+ "/Tx"
29
+ when :button, "button", "/Btn", "/btn"
30
+ "/Btn"
31
+ when :choice, "choice", "/Ch", "/ch"
32
+ "/Ch"
33
+ when :signature, "signature", "/Sig", "/sig"
34
+ "/Sig"
35
+ else
36
+ type_input.to_s # Use as-is if it's already in PDF format
37
+ end
38
+ @field_value = @options[:value] || ""
39
+
40
+ # Create a proper field dictionary + a widget annotation that references it via /Parent
41
+ @field_obj_num = next_fresh_object_number
42
+ widget_obj_num = @field_obj_num + 1
43
+
44
+ field_body = create_field_dictionary(@field_value, @field_type)
45
+
46
+ # Find the page ref for /P on widget (must happen before we create patches)
47
+ page_ref = find_page_ref(page_num)
48
+
49
+ # Create widget with page reference
50
+ widget_body = create_widget_annotation_with_parent(widget_obj_num, [@field_obj_num, 0], page_ref, x, y, width,
51
+ height, @field_type, @field_value)
52
+
53
+ # Queue objects
54
+ patches << { ref: [@field_obj_num, 0], body: field_body }
55
+ patches << { ref: [widget_obj_num, 0], body: widget_body }
56
+
57
+ # Add field reference (not widget) to AcroForm /Fields AND ensure defaults in ONE patch
58
+ add_field_to_acroform_with_defaults(@field_obj_num)
59
+
60
+ # Add widget to the target page's /Annots
61
+ add_widget_to_page(widget_obj_num, page_num)
62
+
63
+ true
64
+ end
65
+
66
+ private
67
+
68
+ def create_field_dictionary(value, type)
69
+ dict = "<<\n"
70
+ dict += " /FT #{type}\n"
71
+ dict += " /T #{DictScan.encode_pdf_string(@name)}\n"
72
+ dict += " /Ff 0\n"
73
+ dict += " /DA (/Helv 0 Tf 0 g)\n"
74
+ dict += " /V #{DictScan.encode_pdf_string(value)}\n" if value && !value.empty?
75
+ dict += ">>"
76
+ dict
77
+ end
78
+
79
+ def create_widget_annotation_with_parent(_widget_obj_num, parent_ref, page_ref, x, y, width, height, type, value)
80
+ rect_array = "[#{x} #{y} #{x + width} #{y + height}]"
81
+ widget = "<<\n"
82
+ widget += " /Type /Annot\n"
83
+ widget += " /Subtype /Widget\n"
84
+ widget += " /Parent #{parent_ref[0]} #{parent_ref[1]} R\n"
85
+ widget += " /P #{page_ref[0]} #{page_ref[1]} R\n" if page_ref
86
+ widget += " /FT #{type}\n"
87
+ widget += " /Rect #{rect_array}\n"
88
+ widget += " /F 4\n"
89
+ widget += " /DA (/Helv 0 Tf 0 g)\n"
90
+ widget += " /V #{DictScan.encode_pdf_string(value)}\n" if value && !value.empty?
91
+ widget += ">>"
92
+ widget
93
+ end
94
+
95
+ def add_field_to_acroform_with_defaults(field_obj_num)
96
+ af_ref = acroform_ref
97
+ return false unless af_ref
98
+
99
+ af_body = get_object_body_with_patch(af_ref)
100
+
101
+ patched = af_body.dup
102
+
103
+ # Step 1: Add field to /Fields array
104
+ fields_array_ref = DictScan.value_token_after("/Fields", patched)
105
+
106
+ if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
107
+ # Reference case: /Fields points to a separate array object
108
+ arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
109
+ arr_body = get_object_body_with_patch(arr_ref)
110
+ new_body = DictScan.add_ref_to_array(arr_body, [field_obj_num, 0])
111
+ apply_patch(arr_ref, new_body, arr_body)
112
+ elsif patched.include?("/Fields")
113
+ # Inline array case: use DictScan utility
114
+ patched = DictScan.add_ref_to_inline_array(patched, "/Fields", [field_obj_num, 0])
115
+ else
116
+ # No /Fields exists - add it with the field reference
117
+ patched = DictScan.upsert_key_value(patched, "/Fields", "[#{field_obj_num} 0 R]")
118
+ end
119
+
120
+ # Step 2: Ensure /NeedAppearances true
121
+ unless patched.include?("/NeedAppearances")
122
+ patched = DictScan.upsert_key_value(patched, "/NeedAppearances", "true")
123
+ end
124
+
125
+ # Step 3: Ensure /DR /Font has /Helv mapping
126
+ unless patched.include?("/DR") && patched.include?("/Helv")
127
+ font_obj_num = next_fresh_object_number
128
+ font_body = "<<\n /Type /Font\n /Subtype /Type1\n /BaseFont /Helvetica\n>>"
129
+ patches << { ref: [font_obj_num, 0], body: font_body }
130
+
131
+ if patched.include?("/DR")
132
+ # /DR exists - try to add /Font if it doesn't exist
133
+ dr_tok = DictScan.value_token_after("/DR", patched)
134
+ if dr_tok && dr_tok.start_with?("<<")
135
+ # Check if /Font already exists in /DR
136
+ unless dr_tok.include?("/Font")
137
+ # Add /Font to existing /DR dictionary
138
+ new_dr_tok = dr_tok.chomp(">>") + " /Font << /Helv #{font_obj_num} 0 R >>\n>>"
139
+ patched = patched.sub(dr_tok) { |_| new_dr_tok }
140
+ end
141
+ else
142
+ # /DR exists but isn't a dictionary - replace it
143
+ patched = DictScan.replace_key_value(patched, "/DR", "<< /Font << /Helv #{font_obj_num} 0 R >> >>")
144
+ end
145
+ else
146
+ # No /DR exists - add it
147
+ patched = DictScan.upsert_key_value(patched, "/DR", "<< /Font << /Helv #{font_obj_num} 0 R >> >>")
148
+ end
149
+ end
150
+
151
+ apply_patch(af_ref, patched, af_body)
152
+ true
153
+ end
154
+
155
+ def find_page_ref(page_num)
156
+ page_objects = []
157
+ resolver.each_object do |ref, body|
158
+ next unless body
159
+
160
+ # Match /Type /Page with any (or no) whitespace between the tokens.
161
+ # The trailing \b rejects /Type /Pages tree nodes, which a plain
162
+ # include?("/Type /Page") check would also accept.
163
+ is_page = body.match?(%r{/Type\s*/Page\b})
164
+ next unless is_page
165
+
166
+ page_objects << ref
167
+ end
168
+
169
+ # If still no pages found, try to find them via the page tree
170
+ if page_objects.empty?
171
+ # Find the document catalog's /Pages entry
172
+ root_ref = resolver.root_ref
173
+ if root_ref
174
+ catalog_body = resolver.object_body(root_ref)
175
+ if catalog_body && catalog_body =~ %r{/Pages\s+(\d+)\s+(\d+)\s+R}
176
+ pages_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
177
+ pages_body = resolver.object_body(pages_ref)
178
+
179
+ # Extract /Kids array from Pages object
180
+ if pages_body && pages_body =~ %r{/Kids\s*\[(.*?)\]}m
181
+ kids_array = ::Regexp.last_match(1)
182
+ # Extract all object references from Kids array
183
+ kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
184
+ kid_ref = [num_str.to_i, gen_str.to_i]
185
+ kid_body = resolver.object_body(kid_ref)
186
+ # Check if this kid is a page or another Pages node
187
+ if kid_body&.match?(%r{/Type\s*/Page\b})
188
+ page_objects << kid_ref
189
+ elsif kid_body&.match?(%r{/Type\s*/Pages\b})
190
+ # Descend one more level into this nested Pages node
191
+ if kid_body =~ %r{/Kids\s*\[(.*?)\]}m
192
+ ::Regexp.last_match(1).scan(/(\d+)\s+(\d+)\s+R/) do |n, g|
193
+ grandkid_ref = [n.to_i, g.to_i]
194
+ grandkid_body = resolver.object_body(grandkid_ref)
195
+ if grandkid_body&.match?(%r{/Type\s*/Page\b})
196
+ page_objects << grandkid_ref
197
+ end
198
+ end
199
+ end
200
+ end
201
+ end
202
+ end
203
+ end
204
+ end
205
+ end
206
+
207
+ return nil if page_objects.empty?
208
+ return page_objects[page_num - 1] if page_num.positive? && page_num <= page_objects.length
209
+
210
+ page_objects[0]
211
+ end
212
+
213
+ def add_widget_to_page(widget_obj_num, page_num)
214
+ # Find the specific page using the same logic as find_page_ref
215
+ target_page_ref = find_page_ref(page_num)
216
+ return false unless target_page_ref
217
+
218
+ page_body = get_object_body_with_patch(target_page_ref)
219
+
220
+ # Use DictScan utility to safely add reference to /Annots array
221
+ new_body = if page_body =~ %r{/Annots\s*\[(.*?)\]}m
222
+ # Inline array - add to it
223
+ result = DictScan.add_ref_to_inline_array(page_body, "/Annots", [widget_obj_num, 0])
224
+ if result && result != page_body
225
+ result
226
+ else
227
+ # Fallback: use string manipulation
228
+ annots_array = ::Regexp.last_match(1)
229
+ ref_token = "#{widget_obj_num} 0 R"
230
+ new_annots = if annots_array.strip.empty?
231
+ "[#{ref_token}]"
232
+ else
233
+ "[#{annots_array} #{ref_token}]"
234
+ end
235
+ page_body.sub(%r{/Annots\s*\[.*?\]}m) { "/Annots #{new_annots}" }
236
+ end
237
+ elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
238
+ # Indirect array reference - need to read and modify the array object
239
+ annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
240
+ annots_array_body = get_object_body_with_patch(annots_array_ref)
241
+
242
+ ref_token = "#{widget_obj_num} 0 R"
243
+ if annots_array_body
244
+ new_annots_body = if annots_array_body.strip == "[]"
245
+ "[#{ref_token}]"
246
+ elsif annots_array_body.strip.start_with?("[") && annots_array_body.strip.end_with?("]")
247
+ without_brackets = annots_array_body.strip[1..-2].strip
248
+ "[#{without_brackets} #{ref_token}]"
249
+ else
250
+ "[#{annots_array_body} #{ref_token}]"
251
+ end
252
+
253
+ apply_patch(annots_array_ref, new_annots_body, annots_array_body)
254
+
255
+ # Page body doesn't need to change (still references the same array object)
256
+ page_body
257
+ else
258
+ # Array object not found - fallback to creating inline array
259
+ page_body.sub(%r{/Annots\s+\d+\s+\d+\s+R}, "/Annots [#{ref_token}]")
260
+ end
261
+ else
262
+ # No /Annots exists - add it with the widget reference
263
+ # Insert /Annots before the closing >> of the dictionary
264
+ ref_token = "#{widget_obj_num} 0 R"
265
+ if page_body.include?(">>")
266
+ # Find the last >> (closing the outermost dictionary) and insert /Annots before it
267
+ page_body.reverse.sub(">>".reverse, "/Annots [#{ref_token}]>>".reverse).reverse
268
+ else
269
+ page_body + " /Annots [#{ref_token}]"
270
+ end
271
+ end
272
+
273
+ apply_patch(target_page_ref, new_body, page_body) if new_body && new_body != page_body
274
+ true
275
+ end
276
+ end
277
+ end
278
+ end