corp_pdf 1.0.5

@@ -0,0 +1,259 @@
1
+ # Refactoring Opportunities
2
+
3
+ This document identifies code duplication and unused methods that could be refactored to improve maintainability.
4
+
5
+ ## 1. Duplicated Page-Finding Logic ✅ **COMPLETED**
6
+
7
+ ### Status
8
+ **RESOLVED** - This refactoring has been completed:
9
+ - ✅ `DictScan.is_page?(body)` exists (line 320 in dict_scan.rb)
10
+ - ✅ `Document#find_all_pages` exists (line 693 in document.rb)
11
+ - ✅ `Document#find_page_by_number(page_num)` exists (line 725 in document.rb)
12
+ - ✅ `Base#find_page_by_number` delegates to Document
13
+ - ✅ `AddField#find_page_ref` now uses `find_page_by_number` (line 288)
14
+
15
+ ### Original Issue
16
+ Multiple methods had similar logic for finding page objects in a PDF document.
17
+
18
+ ### Resolution
19
+ All page-finding logic has been unified into `DictScan.is_page?` and `Document#find_all_pages` / `find_page_by_number`.
20
+
21
+ ---
22
+
23
+ ## 2. Duplicated Widget-Matching Logic
24
+
25
+ ### Issue
26
+ Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
27
+ 1. `/Parent` reference pointing to the field
28
+ 2. `/T` (field name) matching the field name
29
+
30
+ ### Locations
31
+ - `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
32
+ - `Document#clear` (lines 472-495) - Finds widgets for removed fields
33
+ - `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
34
+ - `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
35
+ - `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
36
+ - `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
37
+
38
+ ### Pattern
39
+ The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
40
+
41
+ ### Suggested Refactor
42
+ Create utility methods in `Base` or a new `WidgetMatcher` module:
43
+ - `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
44
+ - `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
45
+ - `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
46
+
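The helpers suggested above could be sketched roughly as follows. This is a minimal illustration, not existing gem code: `WidgetMatcher` is the hypothetical module proposed in this section, `objects` is a plain Hash of `[obj_num, gen_num] => body` standing in for what the resolver yields, and the `/T` match is a simplified literal-string regex rather than `DictScan.decode_pdf_string`.

```ruby
# Hypothetical sketch: WidgetMatcher is a proposed module, not existing API.
# `objects` is a Hash of [obj_num, gen_num] => body string (a stand-in for the resolver).
module WidgetMatcher
  WIDGET_RE = %r{/Subtype\s*/Widget}
  PARENT_RE = %r{/Parent\s+(\d+)\s+(\d+)\s+R}
  # Simplified assumption: literal strings without nested parentheses or escapes.
  NAME_RE = %r{/T\s*\(([^)]*)\)}

  module_function

  # Refs of widgets whose /Parent points at field_ref.
  def find_widgets_by_parent(objects, field_ref)
    objects.select do |_ref, body|
      next false unless body =~ WIDGET_RE

      m = PARENT_RE.match(body)
      m && [Integer(m[1], 10), Integer(m[2], 10)] == field_ref
    end.keys
  end

  # Refs of widgets whose /T equals field_name.
  def find_widgets_by_name(objects, field_name)
    objects.select do |_ref, body|
      next false unless body =~ WIDGET_RE

      m = NAME_RE.match(body)
      m && m[1] == field_name
    end.keys
  end

  # Union of both criteria, without duplicates.
  def find_widgets_for_field(objects, field_ref, field_name)
    find_widgets_by_parent(objects, field_ref) | find_widgets_by_name(objects, field_name)
  end
end
```

With helpers of this shape, each call site listed under Locations would reduce to a single call plus its own follow-up action (update, rename, or remove).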
47
+ ### Benefits
48
+ - Centralized widget matching logic
49
+ - Consistent widget finding behavior
50
+ - Easier to extend matching criteria
51
+
52
+ ---
53
+
54
+ ## 3. Duplicated /Annots Array Manipulation
55
+
56
+ ### Issue
57
+ Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
58
+ 1. Inline `/Annots` arrays: `/Annots [...]`
59
+ 2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
60
+
61
+ ### Locations
62
+ - `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
63
+ - `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
64
+ - `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
65
+
66
+ ### Pattern
67
+ All three methods have similar conditional logic:
68
+ ```ruby
69
+ if page_body =~ %r{/Annots\s*\[(.*?)\]}m
70
+ # Handle inline array
71
+ elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
72
+ # Handle indirect array
73
+ else
74
+ # Create new /Annots array
75
+ end
76
+ ```
77
+
78
+ ### Suggested Refactor
79
+ Extend `DictScan` with methods:
80
+ - `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
81
+ - `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
82
+ - `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
83
+
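As a rough sketch of the proposed helpers, under stated assumptions: the module name `AnnotsSketch` is made up for illustration, the methods work on plain strings, and only the inline case is rewritten in place. For an indirect `/Annots`, the page body is returned unchanged and the caller would patch the referenced array object instead.

```ruby
# Hypothetical sketch: these helpers are proposed, not existing DictScan API.
module AnnotsSketch
  INLINE_RE = %r{/Annots\s*\[(.*?)\]}m
  INDIRECT_RE = %r{/Annots\s+(\d+)\s+(\d+)\s+R}

  module_function

  # Classify the page's /Annots entry.
  # Returns [:inline, contents], [:indirect, [obj_num, gen_num]], or nil.
  def get_annots_array(page_body)
    if (m = INLINE_RE.match(page_body))
      [:inline, m[1].strip]
    elsif (m = INDIRECT_RE.match(page_body))
      [:indirect, [Integer(m[1], 10), Integer(m[2], 10)]]
    end
  end

  # Remove widget_ref from an inline /Annots array.
  def remove_from_annots_array(page_body, widget_ref)
    kind, contents = get_annots_array(page_body)
    return page_body unless kind == :inline

    ref_re = /\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/
    kept = contents.gsub(ref_re, "").split.join(" ")
    page_body.sub(INLINE_RE) { "/Annots [#{kept}]" }
  end

  # Append widget_ref to an inline /Annots array, creating the array when absent.
  def add_to_annots_array(page_body, widget_ref)
    ref = "#{widget_ref[0]} #{widget_ref[1]} R"
    kind, contents = get_annots_array(page_body)
    case kind
    when :inline
      merged = [contents, ref].reject(&:empty?).join(" ")
      page_body.sub(INLINE_RE) { "/Annots [#{merged}]" }
    when :indirect
      page_body
    else
      # Assumes the body ends with the dictionary's closing ">>".
      page_body.sub(/>>\s*\z/) { "/Annots [#{ref}] >>" }
    end
  end
end
```

Note that the same multiline regex is used for both matching and substitution, so an array that spans several lines is detected and rewritten consistently.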
84
+ ### Benefits
85
+ - Single implementation of /Annots manipulation logic
86
+ - Consistent handling of edge cases
87
+ - Easier to test /Annots operations
88
+
89
+ ---
90
+
91
+ ## 4. Duplicated Box Parsing Logic ✅ **COMPLETED**
92
+
93
+ ### Status
94
+ **RESOLVED** - This refactoring has been completed:
95
+ - ✅ `DictScan.parse_box(body, box_type)` exists (line 340 in dict_scan.rb)
96
+ - ✅ `Document#list_pages` now uses `parse_box` for all box types (lines 89-99 in document.rb)
97
+
98
+ ### Original Issue
99
+ `Document#list_pages` had repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
100
+
101
+ ### Resolution
102
+ Extracted the common box parsing logic into `DictScan.parse_box` helper method. All box type parsing in `Document#list_pages` now uses this shared method, reducing code duplication from ~45 lines to ~10 lines while maintaining existing functionality.
103
+
104
+ ---
105
+
106
+ ## 5. Duplicated next_fresh_object_number Implementation
107
+
108
+ ### Issue
109
+ The `next_fresh_object_number` method is implemented identically in two places.
110
+
111
+ ### Locations
112
+ - `Document#next_fresh_object_number` (lines 745-754)
113
+ - `Base#next_fresh_object_number` (lines 28-37)
114
+
115
+ ### Pattern
116
+ Both methods have identical implementations. Because `Document` does not include `Base`, each currently carries its own copy.
117
+
118
+ ### Suggested Refactor
119
+ - Consider having `Base#next_fresh_object_number` delegate to `Document`, as `Base#find_page_by_number` and `Base#acroform_ref` already do
120
+ - Or: Keep both implementations if Document needs independent access
121
+
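One way to collapse the two copies is to let the mixin forward to the document, in the same style `Base` already uses for `find_page_by_number`. The sketch below uses stand-in classes (`SketchDocument`, `SketchBase`, `SketchAction` are invented names) and assumes the document's method may be private, hence `send`.

```ruby
# Hypothetical sketch with stand-in classes; only the delegation pattern matters.
class SketchDocument
  def initialize(existing_refs, patches)
    @existing_refs = existing_refs # Array of [obj_num, gen_num]
    @patches = patches             # Array of { ref: [obj_num, gen_num], body: String }
  end

  private

  # Single implementation: highest object number seen so far, plus one.
  def next_fresh_object_number
    refs = @existing_refs + @patches.map { |p| p[:ref] }
    (refs.map(&:first).max || 0) + 1
  end
end

module SketchBase
  # Forward to the document instead of re-implementing the scan.
  def next_fresh_object_number
    @document.send(:next_fresh_object_number)
  end
end

class SketchAction
  include SketchBase

  def initialize(document)
    @document = document
  end
end
```

Action classes keep calling `next_fresh_object_number` exactly as before, so the change would be invisible to them.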
122
+ ### Benefits
123
+ - Single implementation
124
+ - Consistent object numbering logic
125
+
126
+ ### Note
127
+ This may be intentional since `Document` doesn't include `Base` - both classes need this functionality independently.
128
+
129
+ ---
130
+
131
+ ## 6. Unused Methods
132
+
133
+ ### Issue
134
+ Some methods are defined but never called.
135
+
136
+ ### Locations
137
+ - `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
138
+ - Defined but never used
139
+ - `extract_rect` is used instead, which provides the same information
140
+
141
+ ### Suggested Refactor
142
+ - Remove `get_widget_rect_dimensions` if it's truly unused
143
+ - Or: Verify if it was intended for future use and document it
144
+
145
+ ### Benefits
146
+ - Cleaner codebase
147
+ - Less confusion about which method to use
148
+
149
+ ---
150
+
151
+ ## 7. Duplicated Base64 Decoding Logic
152
+
153
+ ### Issue
154
+ `AddSignatureAppearance` has two similar methods for decoding base64 data.
155
+
156
+ ### Locations
157
+ - `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
158
+ - `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
159
+
160
+ ### Pattern
161
+ Both methods decode base64 data, with slightly different logic, and could potentially be unified.
162
+
163
+ ### Suggested Refactor
164
+ - Consider merging into a single method that handles both cases
165
+ - Or: Document the distinction if both are needed
166
+
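If the two methods were merged, a single entry point might look like the sketch below. The bodies of the existing methods are not shown in this document, so this is an assumption about their intent: accept either a `data:` URI or a bare base64 string, and fall back to the input when it is not valid base64. The name `decode_base64_input` is hypothetical.

```ruby
# Hypothetical sketch: one entry point covering data URIs and bare base64.
DATA_URI_RE = /\Adata:[^;,]*;base64,(.*)\z/m

def decode_base64_input(input)
  # Use the payload after "base64," when the input is a data URI.
  payload = input[DATA_URI_RE, 1] || input
  # "m0" is strict base64: it raises ArgumentError on malformed input.
  payload.delete("\r\n").unpack1("m0")
rescue ArgumentError
  # Not valid base64, so treat the input as raw data already.
  input
end
```

One caveat with this design: a short raw string that happens to be valid base64 would be decoded, so callers that can receive raw bytes may still want an explicit flag.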
167
+ ### Benefits
168
+ - Simpler API
169
+ - Less code duplication
170
+
171
+ ---
172
+
173
+ ## 8. Duplicated Regex Pattern for Object Reference
174
+
175
+ ### Issue
176
+ The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
177
+
178
+ ### Locations
179
+ Throughout the codebase, used in:
180
+ - Extracting `/Parent` references
181
+ - Extracting `/P` (page) references
182
+ - Extracting `/Pages` references
183
+ - Extracting `/Fields` array references
184
+ - And many more...
185
+
186
+ ### Suggested Refactor
187
+ Create a utility method:
188
+ ```ruby
189
+ def DictScan.extract_object_ref(str)
190
+ # Extract object reference from string
191
+ # Returns [obj_num, gen_num] or nil
192
+ end
193
+ ```
194
+
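One possible body for the stub above, shown in a stand-in module (`DictScanSketch` and `extract_ref_after` are illustrative names, not existing API). It passes an explicit base to `Integer` so that a number written with leading zeros is not rejected as malformed octal.

```ruby
# Hypothetical sketch of the proposed utility; not part of the gem.
module DictScanSketch
  OBJECT_REF_RE = /(\d+)\s+(\d+)\s+R\b/

  module_function

  # Returns the first [obj_num, gen_num] found in str, or nil.
  def extract_object_ref(str)
    m = OBJECT_REF_RE.match(str.to_s)
    m && [Integer(m[1], 10), Integer(m[2], 10)]
  end

  # Returns the reference that follows a given key such as "/Parent", or nil.
  def extract_ref_after(key, body)
    m = /#{Regexp.escape(key)}\s+#{OBJECT_REF_RE.source}/.match(body.to_s)
    m && [Integer(m[1], 10), Integer(m[2], 10)]
  end
end
```

A call such as `DictScanSketch.extract_ref_after("/P", widget_body)` could then replace the inline regex plus the pair of `Integer(::Regexp.last_match(n))` calls at each site.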
195
+ ### Benefits
196
+ - Consistent reference extraction
197
+ - Easier to update if PDF reference format changes
198
+ - More readable code
199
+
200
+ ---
201
+
202
+ ## Priority Recommendations
203
+
204
+ ### High Priority
205
+ 1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
206
+ 2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
207
+
208
+
209
+ ### Low Priority
210
+ 3. **next_fresh_object_number (#5)** - Simple duplication (may be intentional)
211
+ 4. **Object Reference Extraction (#8)** - Could improve consistency
212
+ 5. **Unused Methods (#6)** - Cleanup task (`get_widget_rect_dimensions`)
213
+ 6. **Base64 Decoding (#7)** - Minor duplication
214
+
215
+ ### Completed ✅
216
+ - **Page-Finding Logic (#1)** - Successfully refactored into `DictScan.is_page?` and unified page-finding methods
217
+ - **Checkbox Appearance Creation (#9)** - Extracted common Form XObject building logic into `build_form_xobject` helper method
218
+ - **Box Parsing Logic (#4)** - Extracted common box parsing logic into `DictScan.parse_box` helper method
219
+ - **PDF Metadata Formatting (#10)** - Moved `format_pdf_key` and `format_pdf_value` to `DictScan` module as shared utilities
220
+
221
+ ---
222
+
223
+ ## 9. Duplicated Checkbox Appearance Creation Logic ✅ **COMPLETED**
224
+
225
+ ### Status
226
+ **RESOLVED** - This refactoring has been completed:
227
+ - ✅ `AddField#build_form_xobject` exists (line 472 in add_field.rb)
228
+ - ✅ `AddField#create_checkbox_yes_appearance` now uses `build_form_xobject` (line 458)
229
+ - ✅ `AddField#create_checkbox_off_appearance` now uses `build_form_xobject` (line 469)
230
+
231
+ ### Original Issue
232
+ The `create_checkbox_yes_appearance` and `create_checkbox_off_appearance` methods had duplicated Form XObject dictionary building logic.
233
+
234
+ ### Resolution
235
+ Extracted the common Form XObject dictionary building logic into `build_form_xobject` helper method. Both checkbox appearance methods now use this shared method, reducing duplication while maintaining existing functionality.
236
+
237
+ ---
238
+
239
+ ## 10. PDF Metadata Formatting Methods Could Be Shared ✅ **COMPLETED**
240
+
241
+ ### Status
242
+ **RESOLVED** - This refactoring has been completed:
243
+ - ✅ `DictScan.format_pdf_key(key)` exists (line 134 in dict_scan.rb)
244
+ - ✅ `DictScan.format_pdf_value(value)` exists (line 140 in dict_scan.rb)
245
+ - ✅ `AddField` now uses `DictScan.format_pdf_key` and `DictScan.format_pdf_value` (lines 145-146, 195-196)
246
+
247
+ ### Original Issue
248
+ The `format_pdf_key` and `format_pdf_value` methods in `AddField` were useful utility functions that could be shared across the codebase.
249
+
250
+ ### Resolution
251
+ Moved `format_pdf_key` and `format_pdf_value` from `AddField` to the `DictScan` module as module functions. This makes them reusable throughout the codebase and provides a single source of truth for PDF formatting rules. `AddField` now uses these shared utilities, maintaining existing functionality while improving code reusability.
252
+
253
+ ---
254
+
255
+ ## Notes
256
+ - All refactoring should be accompanied by tests to ensure behavior doesn't change
257
+ - Consider backward compatibility if any methods are moved between modules
258
+ - Some duplication may be intentional for performance reasons (for example, to avoid method-call overhead); evaluate before refactoring
259
+
@@ -0,0 +1,73 @@
1
+ # frozen_string_literal: true
2
+
3
+ module CorpPdf
4
+ module Actions
5
+ # Action to add a new field to a PDF document
6
+ # Delegates to field-specific classes for actual field creation
7
+ class AddField
8
+ include Base
9
+
10
+ attr_reader :field_obj_num, :field_type, :field_value
11
+
12
+ def initialize(document, name, options = {})
13
+ @document = document
14
+ @name = name
15
+ @options = normalize_hash_keys(options)
16
+ @metadata = normalize_hash_keys(@options[:metadata] || {})
17
+ end
18
+
19
+ def call
20
+ type_input = @options[:type] || "/Tx"
21
+ @options[:group_id]
22
+
23
+ # Auto-set radio button flags if type is :radio and flags not explicitly set
24
+ # MUST set this BEFORE creating the field handler so it gets passed correctly
25
+ if [:radio, "radio"].include?(type_input) && !@metadata[:Ff]
26
+ @metadata[:Ff] = 49_152
27
+ end
28
+
29
+ # Determine field type and create appropriate field handler
30
+ field_handler = create_field_handler(type_input)
31
+
32
+ # Call the field handler
33
+ field_handler.call
34
+
35
+ # Store field_obj_num from handler for compatibility
36
+ @field_obj_num = field_handler.field_obj_num
37
+ @field_type = field_handler.field_type
38
+ @field_value = field_handler.field_value
39
+
40
+ true
41
+ end
42
+
43
+ private
44
+
45
+ def normalize_hash_keys(hash)
46
+ return hash unless hash.is_a?(Hash)
47
+
48
+ hash.each_with_object({}) do |(key, value), normalized|
49
+ sym_key = key.is_a?(Symbol) ? key : key.to_sym
50
+ normalized[sym_key] = value.is_a?(Hash) ? normalize_hash_keys(value) : value
51
+ end
52
+ end
53
+
54
+ def create_field_handler(type_input)
55
+ is_radio = [:radio, "radio"].include?(type_input)
56
+ group_id = @options[:group_id]
57
+ is_button = [:button, "button", "/Btn", "/btn"].include?(type_input)
58
+
59
+ if is_radio && group_id
60
+ CorpPdf::Fields::Radio.new(@document, @name, @options.merge(metadata: @metadata))
61
+ elsif [:signature, "signature", "/Sig"].include?(type_input)
62
+ CorpPdf::Fields::Signature.new(@document, @name, @options.merge(metadata: @metadata))
63
+ elsif [:checkbox, "checkbox"].include?(type_input) || is_button
64
+ # :button type maps to /Btn which are checkboxes by default (unless radio flag is set)
65
+ CorpPdf::Fields::Checkbox.new(@document, @name, @options.merge(metadata: @metadata))
66
+ else
67
+ # Default to text field
68
+ CorpPdf::Fields::Text.new(@document, @name, @options.merge(metadata: @metadata))
69
+ end
70
+ end
71
+ end
72
+ end
73
+ end
@@ -0,0 +1,48 @@
1
+ # frozen_string_literal: true
2
+
3
+ module CorpPdf
4
+ module Actions
5
+ module Base
6
+ def resolver
7
+ @document.instance_variable_get(:@resolver)
8
+ end
9
+
10
+ def patches
11
+ @document.instance_variable_get(:@patches)
12
+ end
13
+
14
+ def get_object_body_with_patch(ref)
15
+ body = resolver.object_body(ref)
16
+ existing_patch = patches.find { |p| p[:ref] == ref }
17
+ existing_patch ? existing_patch[:body] : body
18
+ end
19
+
20
+ def apply_patch(ref, body, original_body = nil)
21
+ original_body ||= resolver.object_body(ref)
22
+ return if body == original_body
23
+
24
+ patches.reject! { |p| p[:ref] == ref }
25
+ patches << { ref: ref, body: body }
26
+ end
27
+
28
+ def next_fresh_object_number
29
+ max_obj_num = 0
30
+ resolver.each_object do |ref, _|
31
+ max_obj_num = [max_obj_num, ref[0]].max
32
+ end
33
+ patches.each do |p|
34
+ max_obj_num = [max_obj_num, p[:ref][0]].max
35
+ end
36
+ max_obj_num + 1
37
+ end
38
+
39
+ def acroform_ref
40
+ @document.send(:acroform_ref)
41
+ end
42
+
43
+ def find_page_by_number(page_num)
44
+ @document.send(:find_page_by_number, page_num)
45
+ end
46
+ end
47
+ end
48
+ end
@@ -0,0 +1,154 @@
1
+ # frozen_string_literal: true
2
+
3
+ module CorpPdf
4
+ module Actions
5
+ # Action to remove a field from a PDF document
6
+ class RemoveField
7
+ include Base
8
+
9
+ def initialize(document, field)
10
+ @document = document
11
+ @field = field
12
+ end
13
+
14
+ def call
15
+ af_ref = acroform_ref
16
+ return false unless af_ref
17
+
18
+ # Step 1: Remove widget annotations from pages' /Annots arrays
19
+ remove_widget_annotations_from_pages
20
+
21
+ # Step 2: Remove from /Fields array
22
+ remove_from_fields_array(af_ref)
23
+
24
+ # Step 3: Mark the field object as deleted by setting /T to empty
25
+ mark_field_deleted
26
+
27
+ true
28
+ end
29
+
30
+ private
31
+
32
+ def remove_from_fields_array(af_ref)
33
+ af_body = get_object_body_with_patch(af_ref)
34
+ fields_array_ref = DictScan.value_token_after("/Fields", af_body)
35
+
36
+ if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
37
+ arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
38
+ arr_body = get_object_body_with_patch(arr_ref)
39
+ filtered = DictScan.remove_ref_from_array(arr_body, @field.ref)
40
+ apply_patch(arr_ref, filtered, arr_body)
41
+ else
42
+ filtered_af = DictScan.remove_ref_from_inline_array(af_body, "/Fields", @field.ref)
43
+ apply_patch(af_ref, filtered_af, af_body) if filtered_af
44
+ end
45
+ end
46
+
47
+ def mark_field_deleted
48
+ fld_body = get_object_body_with_patch(@field.ref)
49
+ return unless fld_body
50
+
51
+ deleted_body = DictScan.replace_key_value(fld_body, "/T", "()")
52
+ apply_patch(@field.ref, deleted_body, fld_body)
53
+ end
54
+
55
+ def remove_widget_annotations_from_pages
56
+ widget_refs_to_remove = []
57
+
58
+ field_body = get_object_body_with_patch(@field.ref)
59
+ if field_body && DictScan.is_widget?(field_body)
60
+ widget_refs_to_remove << @field.ref
61
+ end
62
+
63
+ resolver.each_object do |widget_ref, body|
64
+ next unless body
65
+ next if widget_ref == @field.ref
66
+ next unless DictScan.is_widget?(body)
67
+
68
+ body = get_object_body_with_patch(widget_ref)
69
+
70
+ # Match by /Parent reference
71
+ if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
72
+ widget_parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
73
+ if widget_parent_ref == @field.ref
74
+ widget_refs_to_remove << widget_ref
75
+ next
76
+ end
77
+ end
78
+
79
+ # Also match by field name (/T) - some widgets might not have /Parent
80
+ next unless body.include?("/T") && @field.name
81
+
82
+ t_tok = DictScan.value_token_after("/T", body)
83
+ next unless t_tok
84
+
85
+ widget_name = DictScan.decode_pdf_string(t_tok)
86
+ if widget_name && widget_name == @field.name
87
+ widget_refs_to_remove << widget_ref
88
+ end
89
+ end
90
+
91
+ return if widget_refs_to_remove.empty?
92
+
93
+ widget_refs_to_remove.each do |widget_ref|
94
+ widget_body = get_object_body_with_patch(widget_ref)
95
+
96
+ if widget_body && widget_body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
97
+ page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
98
+ remove_widget_from_page_annots(page_ref, widget_ref)
99
+ else
100
+ find_and_remove_widget_from_all_pages(widget_ref)
101
+ end
102
+ end
103
+ end
104
+
105
+ def find_and_remove_widget_from_all_pages(widget_ref)
106
+ # Find all page objects and check their /Annots arrays
107
+ page_objects = []
108
+ resolver.each_object do |ref, body|
109
+ next unless body
110
+ next unless DictScan.is_page?(body)
111
+
112
+ page_objects << ref
113
+ end
114
+
115
+ # Check each page's /Annots array
116
+ page_objects.each do |page_ref|
117
+ remove_widget_from_page_annots(page_ref, widget_ref)
118
+ end
119
+ end
120
+
121
+ def remove_widget_from_page_annots(page_ref, widget_ref)
122
+ page_body = get_object_body_with_patch(page_ref)
123
+ return unless page_body
124
+
125
+ # Handle inline /Annots array
126
+ if page_body =~ %r{/Annots\s*\[(.*?)\]}m
127
+ annots_array_str = ::Regexp.last_match(1)
128
+ # Remove the widget reference from the array
129
+ filtered_array = annots_array_str.gsub(/\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/, "").strip
130
+ # Clean up extra spaces
131
+ filtered_array.gsub!(/\s+/, " ")
132
+
133
+ new_annots = if filtered_array.empty?
134
+ "[]"
135
+ else
136
+ "[#{filtered_array}]"
137
+ end
138
+
139
+ new_page_body = page_body.sub(%r{/Annots\s*\[.*?\]}m, "/Annots #{new_annots}")
140
+ apply_patch(page_ref, new_page_body, page_body)
141
+ # Handle indirect /Annots array reference
142
+ elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
143
+ annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
144
+ annots_array_body = get_object_body_with_patch(annots_array_ref)
145
+
146
+ if annots_array_body
147
+ filtered_body = DictScan.remove_ref_from_array(annots_array_body, widget_ref)
148
+ apply_patch(annots_array_ref, filtered_body, annots_array_body)
149
+ end
150
+ end
151
+ end
152
+ end
153
+ end
154
+ end