acro_that 0.1.1 → 0.1.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 7cccc792d205578fc981258a0b171e4906b3f9c945ac2abd664447054251f940
4
- data.tar.gz: 4544010bb0642b5c88cd9b5cc95a0c97d018109ecbe19648807419590d3f5e8b
3
+ metadata.gz: 56f6be44d023bdedaf254cc097d18791f284793d0c79c07847fd462c0643cc91
4
+ data.tar.gz: b649d290433fac6a6113a74ca47bfde60e31687f6e93b85e30c32053ee416b3a
5
5
  SHA512:
6
- metadata.gz: 5a43d6c18e6babead9223d93f661032ff324ee45fae76646d2d8c37528ea4ddf294b5c5240ddc67d894c53aaff05d2245baff6c9c6f7813d9dad83c0584e5cea
7
- data.tar.gz: 17052dece8b5a7700c4d9f822b2ba7e2f2b8d0ded537a173d38b2e7190ffc0b6d2d9f86d51848b8d7dd99cd35015cda99299dfab0b7f87a5df1d0404e087cbe8
6
+ metadata.gz: b27c5ec88a2cfdec49750d9d3316979c90c0815461eb6281b4cf602dc6533c8761c8fba7fb49407256b9ade87f2f3731e8f9d121c4ef089d08be2901107cc295
7
+ data.tar.gz: f1b41cef40049564af8f081547460d72e787a721f676db4c80cd74eb21844b48364ac3b871e6ad2839505908d9e5be3b8d643c16812bd91ae6d0e858bcfb51eb
data/CHANGELOG.md CHANGED
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
5
5
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
  and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7
7
 
8
+ ## [0.1.3] - 2025-01-XX
9
+
10
+ ### Fixed
11
+ - Fixed bug where fields added to multi-page PDFs were all placed on the same page. Fields now correctly appear on their specified pages when using the `page` option in `add_field`.
12
+
13
+ ### Changed
14
+ - Refactored page-finding logic to eliminate code duplication across `Document` and `AddField` classes
15
+ - Unified page discovery through `Document#find_all_pages` and `Document#find_page_by_number` methods
16
+ - Updated all page detection patterns to use centralized `DictScan.is_page?` utility method
17
+
18
+ ### Added
19
+ - `DictScan.is_page?` utility method for consistent page object detection across the codebase
20
+ - `Document#find_all_pages` private method for unified page discovery in document order
21
+ - `Document#find_page_by_number` private method for finding pages by page number
22
+ - Exposed `find_page_by_number` through `Base` module for use in action classes
23
+
8
24
  ## [0.1.1] - 2025-10-31
9
25
 
10
26
  ### Added
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- acro_that (0.1.0)
4
+ acro_that (0.1.2)
5
5
  chunky_png (~> 1.4)
6
6
 
7
7
  GEM
data/README.md CHANGED
@@ -189,6 +189,31 @@ flattened_doc = AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
189
189
  flattened_bytes = AcroThat::Document.flatten_pdf("input.pdf")
190
190
  ```
191
191
 
192
+ #### Clearing Fields
193
+
194
+ The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
195
+
196
+ ```ruby
197
+ doc = AcroThat::Document.new("form.pdf")
198
+
199
+ # Remove all fields matching a pattern
200
+ doc.clear!(remove_pattern: /^text-/)
201
+
202
+ # Keep only specific fields
203
+ doc.clear!(keep_fields: ["Name", "Email"])
204
+
205
+ # Remove specific fields
206
+ doc.clear!(remove_fields: ["OldField1", "OldField2"])
207
+
208
+ # Use a block to determine which fields to keep
209
+ doc.clear! { |name| !name.start_with?("temp_") }
210
+
211
+ # Write the cleared PDF
212
+ doc.write("cleared.pdf", flatten: true)
213
+ ```
214
+
215
+ **Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
216
+
192
217
  ### API Reference
193
218
 
194
219
  #### `AcroThat::Document.new(path_or_io)`
@@ -282,6 +307,30 @@ AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
282
307
  flattened_doc = AcroThat::Document.flatten_pdf("input.pdf")
283
308
  ```
284
309
 
310
+ #### `#clear(options = {})` and `#clear!(options = {})`
311
+ Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
312
+
313
+ - `keep_fields`: Array of field names to keep (all others removed)
314
+ - `remove_fields`: Array of field names to remove
315
+ - `remove_pattern`: Regex pattern - fields matching this are removed
316
+ - Block: Given field name, return `true` to keep, `false` to remove
317
+
318
+ ```ruby
319
+ # Remove all fields
320
+ cleared = doc.clear(remove_pattern: /.*/)
321
+
322
+ # Remove fields matching pattern (in-place)
323
+ doc.clear!(remove_pattern: /^text-/)
324
+
325
+ # Keep only specific fields
326
+ doc.clear!(keep_fields: ["Name", "Email"])
327
+
328
+ # Use block to filter fields
329
+ doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
330
+ ```
331
+
332
+ **Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
333
+
285
334
  ### Field Object
286
335
 
287
336
  Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
data/docs/README.md CHANGED
@@ -38,6 +38,17 @@ Explains how PDF object streams work and how `AcroThat` parses them:
38
38
 
39
39
  **Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
40
40
 
41
+ ### [Clearing Fields](./cleaning_fields.md)
42
+
43
+ Documentation for the `clear` and `clear!` methods:
44
+ - How to remove unwanted fields completely
45
+ - Difference between `clear` and `remove_field`
46
+ - Pattern matching and field selection
47
+ - Removing orphaned widget references
48
+ - Best practices for clearing PDFs
49
+
50
+ **Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
51
+
41
52
  ## Common Themes
42
53
 
43
54
  Throughout all documentation, you'll see these recurring themes:
@@ -54,6 +65,7 @@ Throughout all documentation, you'll see these recurring themes:
54
65
  1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
55
66
  2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
56
67
  3. Read [Object Streams](./object_streams.md) to understand compression features
68
+ 4. Read [Clearing Fields](./cleaning_fields.md) to learn how to remove unwanted fields
57
69
 
58
70
  **If you're debugging:**
59
71
  - [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
data/docs/cleaning_fields.md ADDED
@@ -0,0 +1,202 @@
1
+ # Clearing Fields with `clear` and `clear!`
2
+
3
+ The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
4
+
5
+ - Remove multiple layers of added fields
6
+ - Clear a PDF that has accumulated many unwanted fields
7
+ - Get back to a base file without certain fields
8
+ - Remove orphaned or invalid field references
9
+
10
+ Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
11
+
12
+ - Field objects are completely removed (not just marked as deleted)
13
+ - Widget annotations are removed from page `/Annots` arrays
14
+ - Orphaned widget references are cleaned up
15
+ - The AcroForm `/Fields` array is updated
16
+ - All references to removed fields are eliminated
17
+
18
+ ## Methods
19
+
20
+ ### `clear(options = {})`
21
+
22
+ Returns the cleared PDF as bytes, with unwanted fields removed. Does not modify the current document.
23
+
24
+ **Options:**
25
+ - `keep_fields`: Array of field names to keep (all others removed)
26
+ - `remove_fields`: Array of field names to remove
27
+ - `remove_pattern`: Regex pattern - fields matching this are removed
28
+ - Block: Given field name, return `true` to keep, `false` to remove
29
+
30
+ ### `clear!(options = {})`
31
+
32
+ Same as `clear`, but mutates the current document instance in place.
33
+
34
+ ## Usage Examples
35
+
36
+ ### Remove All Fields
37
+
38
+ ```ruby
39
+ doc = AcroThat::Document.new("form.pdf")
40
+
41
+ # Remove all fields
42
+ cleared_pdf = doc.clear(remove_pattern: /.*/)
43
+
44
+ # Or in-place
45
+ doc.clear!(remove_pattern: /.*/)
46
+ ```
47
+
48
+ ### Remove Fields Matching a Pattern
49
+
50
+ ```ruby
51
+ # Remove all fields starting with "text-"
52
+ doc.clear!(remove_pattern: /^text-/)
53
+
54
+ # Keep only fields that neither contain "text-" nor start with 20+ hex characters (generated names)
55
+ doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
56
+ ```
57
+
58
+ ### Keep Only Specific Fields
59
+
60
+ ```ruby
61
+ # Keep only these fields, remove all others
62
+ doc.clear!(keep_fields: ["Name", "Email", "Phone"])
63
+
64
+ # Write the cleared PDF
65
+ doc.write("cleared.pdf", flatten: true)
66
+ ```
67
+
68
+ ### Remove Specific Fields
69
+
70
+ ```ruby
71
+ # Remove specific unwanted fields
72
+ doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
73
+ ```
74
+
75
+ ### Complex Selection with Block
76
+
77
+ ```ruby
78
+ # Remove all fields except those matching certain criteria
79
+ doc.clear! do |field_name|
80
+ # Keep fields that don't look generated
81
+ !field_name.start_with?("text-") &&
82
+ !field_name.match?(/^[a-f0-9]{20,}/)
83
+ end
84
+ ```
85
+
86
+ ## How It Works
87
+
88
+ The `clear` method:
89
+
90
+ 1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
91
+
92
+ 2. **Finds related widgets** for each field to be removed:
93
+ - Widgets that reference the field via `/Parent`
94
+ - Widgets that have the same name via `/T`
95
+
96
+ 3. **Collects objects to write**, excluding:
97
+ - Field objects that should be removed
98
+ - Widget annotation objects that should be removed
99
+
100
+ 4. **Updates AcroForm structure**:
101
+ - Removes field references from the `/Fields` array
102
+ - Handles both inline and indirect array references
103
+
104
+ 5. **Clears page annotations**:
105
+ - Removes widget references from page `/Annots` arrays
106
+ - Removes orphaned widget references (widgets pointing to non-existent fields)
107
+ - Removes references to widgets that don't exist in the cleared PDF
108
+
109
+ 6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
110
+
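Step 1 above (deciding which fields survive) can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: `keep_field?` is a hypothetical helper, and the precedence among the options (block first, then `keep_fields`, then `remove_fields` and `remove_pattern`) is an assumption, since the documentation does not state how combined options interact.

```ruby
# Hypothetical helper: returns true when a field name should be kept.
# Option names mirror the documented ones; the precedence order is assumed.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  # A block decides on its own: truthy means keep, falsy means remove.
  return !!block.call(name) if block
  # With keep_fields, everything not listed is removed.
  return keep_fields.include?(name) if keep_fields
  # Otherwise a field is removed if it is listed or matches the pattern.
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)

  true
end

names = ["Name", "Email", "text-1", "temp_notes"]

names.select { |n| keep_field?(n, keep_fields: ["Name", "Email"]) }
# => ["Name", "Email"]
names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
# => ["Name", "Email", "temp_notes"]
names.select { |n| keep_field?(n) { |x| !x.start_with?("temp_") } }
# => ["Name", "Email", "text-1"]
```

Keeping the decision in one predicate means the later steps (widget collection, `/Fields` and `/Annots` cleanup) only need the resulting list of removed names.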
111
+ ## Key Differences from `remove_field`
112
+
113
+ | Feature | `remove_field` | `clear` |
114
+ |---------|---------------|---------|
115
+ | Update Type | Incremental update | Complete rewrite |
116
+ | Object Removal | Marks as deleted | Completely excluded |
117
+ | PDF Structure | Preserves all objects | Only includes selected objects |
118
+ | Use Case | Remove one/a few fields | Remove many fields or clean up |
119
+ | Performance | Fast per call (append only) | Slower per call (full rewrite), but one pass removes any number of fields |
120
+
121
+ ## Best Practices
122
+
123
+ 1. **Use `clear` when removing many fields**: A single `clear` rewrite is more efficient than calling `remove_field` repeatedly, and it produces cleaner output because removed objects are excluded rather than marked as deleted.
124
+
125
+ 2. **Flatten after clearing**: Since `clear` rewrites the PDF, consider writing with `write(..., flatten: true)` to improve compatibility across PDF viewers:
126
+
127
+ ```ruby
128
+ doc.clear!(remove_pattern: /^text-/)
129
+ doc.write("output.pdf", flatten: true)
130
+ ```
131
+
132
+ 3. **Combine with field addition**: After clearing, you can add new fields:
133
+
134
+ ```ruby
135
+ doc.clear!(remove_pattern: /.*/)
136
+ doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
137
+ doc.write("output.pdf", flatten: true)
138
+ ```
139
+
140
+ 4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
141
+
142
+ ```ruby
143
+ # Remove all UUID-like fields
144
+ doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
145
+
146
+ # Remove all fields containing "temp" or "test"
147
+ doc.clear!(remove_pattern: /temp|test/i)
148
+ ```
149
+
150
+ ## Technical Details
151
+
152
+ ### Orphaned Widget Removal
153
+
154
+ The `clear` method automatically identifies and removes orphaned widget references:
155
+
156
+ - **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
157
+ - **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
158
+
159
+ This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
160
+
161
+ ### Page Detection
162
+
163
+ The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
164
+
165
+ ### AcroForm Structure
166
+
167
+ The method properly handles both:
168
+ - **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
169
+ - **Indirect `/Fields` arrays**: Arrays referenced as separate objects
170
+
171
+ Both are updated to remove references to deleted fields.
172
+
173
+ ## Example: Complete Clearing Workflow
174
+
175
+ ```ruby
176
+ require 'acro_that'
177
+
178
+ # Load PDF with many unwanted fields
179
+ doc = AcroThat::Document.new("messy_form.pdf")
180
+
181
+ # Remove all generated/UUID-like fields
182
+ doc.clear! { |name|
183
+ # Keep only fields that look intentional
184
+ !name.match?(/^[a-f0-9-]{30,}/) && # Not UUID-like
185
+ !name.start_with?("temp_") && # Not temporary
186
+ !name.empty? # Not empty
187
+ }
188
+
189
+ # Add new fields
190
+ doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
191
+ doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
192
+
193
+ # Write cleared and updated PDF
194
+ doc.write("cleared_form.pdf", flatten: true)
195
+ ```
196
+
197
+ ## See Also
198
+
199
+ - [`flatten` and `flatten!`](../README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
200
+ - [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
201
+ - [Main README](../README.md) - General usage and API reference
202
+
data/issues/README.md ADDED
@@ -0,0 +1,38 @@
1
+ # Code Review Issues
2
+
3
+ This folder contains documentation of code cleanup and refactoring opportunities found in the codebase.
4
+
5
+ ## Files
6
+
7
+ - **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
8
+
9
+ ## Summary
10
+
11
+ ### High Priority Issues
12
+ 1. **Widget Matching Logic** - Duplicated across 6+ locations
13
+ 2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations
14
+
15
+ ### Medium Priority Issues
16
+ 3. **Page-Finding Logic** - Similar logic in 4+ methods
17
+ 4. **Box Parsing Logic** - Repeated code blocks for 5 box types
18
+
19
+ ### Low Priority Issues
20
+ 5. Duplicated `next_fresh_object_number` implementation
21
+ 6. Object reference extraction pattern duplication
22
+ 7. Unused method: `get_widget_rect_dimensions`
23
+ 8. Base64 decoding logic duplication
24
+
25
+ ## Quick Stats
26
+
27
+ - **8 refactoring opportunities** identified
28
+ - **6+ locations** with widget matching duplication
29
+ - **3 locations** with /Annots array manipulation duplication
30
+ - **1 unused method** found
31
+
32
+ ## Next Steps
33
+
34
+ 1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
35
+ 2. Prioritize refactoring based on maintenance needs
36
+ 3. Create test coverage before refactoring
37
+ 4. Refactor incrementally, starting with high-priority items
38
+
data/issues/refactoring-opportunities.md ADDED
@@ -0,0 +1,269 @@
1
+ # Refactoring Opportunities
2
+
3
+ This document identifies code duplication and unused methods that could be refactored to improve maintainability.
4
+
5
+ ## 1. Duplicated Page-Finding Logic
6
+
7
+ ### Issue
8
+ Multiple methods have similar logic for finding page objects in a PDF document.
9
+
10
+ ### Locations
11
+ - `Document#list_pages` (lines 75-104)
12
+ - `Document#collect_pages_from_tree` (lines 691-712)
13
+ - `Document#find_page_number_for_ref` (lines 714-728)
14
+ - `AddField#find_page_ref` (lines 155-211)
15
+
16
+ ### Pattern
17
+ The pattern `body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}` appears in multiple places with slight variations.
18
+
19
+ ### Suggested Refactor
20
+ Create a shared module or utility methods in `DictScan`:
21
+ - `DictScan.is_page?(body)` - Check if a body represents a page object
22
+ - `Document#find_all_pages` - Unified method to find all page objects
23
+ - `Document#find_page_by_number(page_num)` - Find a specific page by number
24
+
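A minimal sketch of what a centralized page check could look like. The module name `PageDetectSketch` and method name `page?` are placeholders, and the gem's actual `DictScan.is_page?` may be written differently. The sketch relies only on the lookahead regex quoted in the pattern above, because a plain `include?("/Type /Page")` test is also true for `/Type /Pages` container objects.

```ruby
# Sketch only: not the gem's DictScan.is_page? implementation.
module PageDetectSketch
  # /Type /Page, but not /Type /Pages (or longer names such as /PageLabel).
  PAGE_TYPE = %r{/Type\s*/Page(?!s)\b}

  def self.page?(body)
    body.match?(PAGE_TYPE)
  end
end

PageDetectSketch.page?("<< /Type /Page /Parent 2 0 R >>")   # => true
PageDetectSketch.page?("<< /Type/Page >>")                   # => true
PageDetectSketch.page?("<< /Type /Pages /Kids [3 0 R] >>")  # => false

# The substring check alone cannot tell the two apart:
"<< /Type /Pages >>".include?("/Type /Page")                 # => true
```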
25
+ ### Benefits
26
+ - Single source of truth for page detection logic
27
+ - Easier to maintain and update page-finding behavior
28
+ - Consistent page ordering across methods
29
+
30
+ ---
31
+
32
+ ## 2. Duplicated Widget-Matching Logic
33
+
34
+ ### Issue
35
+ Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
36
+ 1. `/Parent` reference pointing to the field
37
+ 2. `/T` (field name) matching the field name
38
+
39
+ ### Locations
40
+ - `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
41
+ - `Document#clear` (lines 472-495) - Finds widgets for removed fields
42
+ - `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
43
+ - `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
44
+ - `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
45
+ - `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
46
+
47
+ ### Pattern
48
+ The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
49
+
50
+ ### Suggested Refactor
51
+ Create utility methods in `Base` or a new `WidgetMatcher` module:
52
+ - `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
53
+ - `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
54
+ - `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
55
+
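The three suggested helpers could look roughly like the sketch below. It is illustrative only: objects are modeled as a hash of `[obj_num, gen_num] => dictionary text`, which is an assumption about the data layout, and `WidgetMatcherSketch` is a placeholder name. Only literal `/T (name)` strings are handled; hex-encoded or escaped names would need extra decoding.

```ruby
# Sketch: objects are modeled as { [obj_num, gen_num] => dictionary text }.
# The layout and the module name are assumptions for illustration.
module WidgetMatcherSketch
  module_function

  def widget?(body)
    body.match?(%r{/Subtype\s*/Widget\b})
  end

  # Widgets whose /Parent entry points at field_ref ([obj_num, gen_num]).
  def find_widgets_by_parent(objects, field_ref)
    num, gen = field_ref
    parent = %r{/Parent\s+#{num}\s+#{gen}\s+R\b}
    objects.select { |_ref, body| widget?(body) && body.match?(parent) }.keys
  end

  # Widgets that carry the field name directly in /T.
  def find_widgets_by_name(objects, field_name)
    name = %r{/T\s*\(#{Regexp.escape(field_name)}\)}
    objects.select { |_ref, body| widget?(body) && body.match?(name) }.keys
  end

  # Union of both strategies, without duplicates.
  def find_widgets_for_field(objects, field_ref, field_name)
    find_widgets_by_parent(objects, field_ref) | find_widgets_by_name(objects, field_name)
  end
end

objects = {
  [5, 0] => "<< /FT /Tx /T (Name) /Kids [6 0 R] >>",
  [6, 0] => "<< /Type /Annot /Subtype /Widget /Parent 5 0 R /Rect [0 0 10 10] >>",
  [7, 0] => "<< /Type /Annot /Subtype /Widget /T (Name) /Rect [0 0 10 10] >>",
  [8, 0] => "<< /Type /Annot /Subtype /Widget /Parent 15 0 R >>"
}

WidgetMatcherSketch.find_widgets_for_field(objects, [5, 0], "Name")
# => [[6, 0], [7, 0]]
```

Note that a field merged with its own widget (one object carrying both `/T` and `/Subtype /Widget`) would be returned by the name lookup, so callers may want to exclude the field's own reference.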
56
+ ### Benefits
57
+ - Centralized widget matching logic
58
+ - Consistent widget finding behavior
59
+ - Easier to extend matching criteria
60
+
61
+ ---
62
+
63
+ ## 3. Duplicated /Annots Array Manipulation
64
+
65
+ ### Issue
66
+ Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
67
+ 1. Inline `/Annots` arrays: `/Annots [...]`
68
+ 2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
69
+
70
+ ### Locations
71
+ - `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
72
+ - `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
73
+ - `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
74
+
75
+ ### Pattern
76
+ All three methods have similar conditional logic:
77
+ ```ruby
78
+ if page_body =~ %r{/Annots\s*\[(.*?)\]}m
79
+ # Handle inline array
80
+ elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
81
+ # Handle indirect array
82
+ else
83
+ # Create new /Annots array
84
+ end
85
+ ```
86
+
87
+ ### Suggested Refactor
88
+ Extend `DictScan` with methods:
89
+ - `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
90
+ - `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
91
+ - `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
92
+
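As a rough illustration of the removal helper, the sketch below handles only the inline form (`/Annots [...]`). The method name `remove_from_inline_annots` is hypothetical. The indirect form (`/Annots X Y R`) would require loading and patching the referenced array object, which is outside this sketch. It also assumes the array contains only indirect references, and it normalizes the spacing between them.

```ruby
# Sketch: removes one widget reference from an inline /Annots array.
# Assumes the array holds only "N G R" references; other entries would be dropped.
def remove_from_inline_annots(page_body, widget_ref)
  num, gen = widget_ref
  page_body.sub(%r{(/Annots\s*\[)(.*?)(\])}m) do
    open, inner, close = Regexp.last_match.captures
    refs = inner.scan(/(\d+)\s+(\d+)\s+R/)
                .reject { |n, g| n.to_i == num && g.to_i == gen }
    open + refs.map { |n, g| "#{n} #{g} R" }.join(" ") + close
  end
end

page = "<< /Type /Page /Annots [6 0 R 7 0 R 16 0 R] >>"
remove_from_inline_annots(page, [6, 0])
# => "<< /Type /Page /Annots [7 0 R 16 0 R] >>"
```

A page body without an inline `/Annots` array is returned unchanged, so the caller can fall through to the indirect-array case.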
93
+ ### Benefits
94
+ - Single implementation of /Annots manipulation logic
95
+ - Consistent handling of edge cases
96
+ - Easier to test /Annots operations
97
+
98
+ ---
99
+
100
+ ## 4. Duplicated Box Parsing Logic
101
+
102
+ ### Issue
103
+ `Document#list_pages` has repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
104
+
105
+ ### Locations
106
+ - `Document#list_pages` (lines 120-165)
107
+
108
+ ### Pattern
109
+ Each box type uses identical logic:
110
+ ```ruby
111
+ if body =~ %r{/MediaBox\s*\[(.*?)\]}
112
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
113
+ if box_values.length == 4
114
+ llx, lly, urx, ury = box_values
115
+ media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
116
+ end
117
+ end
118
+ ```
119
+
120
+ ### Suggested Refactor
121
+ Create a helper method:
122
+ ```ruby
123
+ def parse_box(body, box_type)
124
+ pattern = %r{/#{box_type}\s*\[(.*?)\]}
125
+ return nil unless body =~ pattern
126
+
127
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
128
+ return nil unless box_values.length == 4
129
+
130
+ llx, lly, urx, ury = box_values
131
+ { llx: llx, lly: lly, urx: urx, ury: ury }
132
+ end
133
+ ```
134
+
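With the helper in place, the five repeated blocks could collapse into one loop. The sketch below repeats `parse_box` from above so it runs on its own; the `BOX_TYPES` constant and the shape of the resulting hash are assumptions, not the structure `Document#list_pages` actually returns.

```ruby
# Box names listed in the issue description above.
BOX_TYPES = %w[MediaBox CropBox ArtBox BleedBox TrimBox].freeze

# Same helper as suggested above, repeated so the example is self-contained.
def parse_box(body, box_type)
  pattern = %r{/#{box_type}\s*\[(.*?)\]}
  return nil unless body =~ pattern

  box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless box_values.length == 4

  llx, lly, urx, ury = box_values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

body = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] >>"

# One pass over all box types; boxes missing from the page are dropped.
boxes = BOX_TYPES.to_h { |type| [type, parse_box(body, type)] }.compact

boxes.keys               # => ["MediaBox", "CropBox"]
boxes["MediaBox"][:urx]  # => 612.0
```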
135
+ ### Benefits
136
+ - Reduces ~45 lines of duplicated parsing code to a single ~10-line helper plus one call per box type
137
+ - Easier to add new box types
138
+ - Consistent parsing logic
139
+
140
+ ---
141
+
142
+ ## 5. Duplicated next_fresh_object_number Implementation
143
+
144
+ ### Issue
145
+ The `next_fresh_object_number` method is implemented identically in two places.
146
+
147
+ ### Locations
148
+ - `Document#next_fresh_object_number` (lines 730-739)
149
+ - `Base#next_fresh_object_number` (lines 28-37)
150
+
151
+ ### Pattern
152
+ Both methods have identical implementation:
153
+ ```ruby
154
+ def next_fresh_object_number
155
+ max_obj_num = 0
156
+ resolver.each_object do |ref, _|
157
+ max_obj_num = [max_obj_num, ref[0]].max
158
+ end
159
+ patches.each do |p|
160
+ max_obj_num = [max_obj_num, p[:ref][0]].max
161
+ end
162
+ max_obj_num + 1
163
+ end
164
+ ```
165
+
166
+ ### Suggested Refactor
167
+ - Remove `Document#next_fresh_object_number` and have `Document` use `Base`'s implementation; the method is only called within `Document`, which already has access to `resolver` and `patches`
169
+
170
+ ### Benefits
171
+ - Single implementation
172
+ - Consistent object numbering logic
173
+
174
+ ---
175
+
176
+ ## 6. Unused Methods
177
+
178
+ ### Issue
179
+ Some methods are defined but never called.
180
+
181
+ ### Locations
182
+ - `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
183
+ - Defined but never used
184
+ - `extract_rect` is used instead, which provides the same information
185
+
186
+ ### Suggested Refactor
187
+ - Remove `get_widget_rect_dimensions` if it's truly unused
188
+ - Or: Verify if it was intended for future use and document it
189
+
190
+ ### Benefits
191
+ - Cleaner codebase
192
+ - Less confusion about which method to use
193
+
194
+ ---
195
+
196
+ ## 7. Duplicated Base64 Decoding Logic
197
+
198
+ ### Issue
199
+ `AddSignatureAppearance` has two similar methods for decoding base64 data.
200
+
201
+ ### Locations
202
+ - `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
203
+ - `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
204
+
205
+ ### Pattern
206
+ Both methods handle base64 decoding with slightly different logic, so they could potentially be unified.
207
+
208
+ ### Suggested Refactor
209
+ - Consider merging into a single method that handles both cases
210
+ - Or: Document the distinction if both are needed
211
+
212
+ ### Benefits
213
+ - Simpler API
214
+ - Less code duplication
215
+
216
+ ---
217
+
218
+ ## 8. Duplicated Regex Pattern for Object Reference
219
+
220
+ ### Issue
221
+ The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
222
+
223
+ ### Locations
224
+ Throughout the codebase, used in:
225
+ - Extracting `/Parent` references
226
+ - Extracting `/P` (page) references
227
+ - Extracting `/Pages` references
228
+ - Extracting `/Fields` array references
229
+ - And many more...
230
+
231
+ ### Suggested Refactor
232
+ Create a utility method:
233
+ ```ruby
234
+ def DictScan.extract_object_ref(str)
235
+ # Extract object reference from string
236
+ # Returns [obj_num, gen_num] or nil
237
+ end
238
+ ```
239
+
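One possible body for the stub above, offered as a sketch rather than the gem's code. It lives in a placeholder module `RefSketch` to avoid implying anything about `DictScan` itself. The trailing `\b` is an addition to the pattern quoted in the issue, so that something like `1 0 RG` is not mistaken for a reference. `extract_all_object_refs` is an extra hypothetical helper for arrays such as `/Fields` or `/Kids`.

```ruby
# Sketch only: placeholder module, not part of the gem.
module RefSketch
  # "N G R" indirect reference; \b is an addition to the quoted pattern.
  OBJ_REF = /(\d+)\s+(\d+)\s+R\b/

  # Returns the first reference as [obj_num, gen_num], or nil if none is found.
  def self.extract_object_ref(str)
    m = OBJ_REF.match(str)
    m && [m[1].to_i, m[2].to_i]
  end

  # Returns every reference in str, in order of appearance.
  def self.extract_all_object_refs(str)
    str.scan(OBJ_REF).map { |num, gen| [num.to_i, gen.to_i] }
  end
end

RefSketch.extract_object_ref("/Parent 5 0 R")               # => [5, 0]
RefSketch.extract_object_ref("/T (Name)")                   # => nil
RefSketch.extract_all_object_refs("/Fields [5 0 R 12 0 R]") # => [[5, 0], [12, 0]]
```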
240
+ ### Benefits
241
+ - Consistent reference extraction
242
+ - Easier to update if PDF reference format changes
243
+ - More readable code
244
+
245
+ ---
246
+
247
+ ## Priority Recommendations
248
+
249
+ ### High Priority
250
+ 1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
251
+ 2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
252
+
253
+ ### Medium Priority
254
+ 3. **Page-Finding Logic (#1)** - Used in multiple places, but less frequently
255
+ 4. **Box Parsing Logic (#4)** - Simple duplication, easy to refactor
256
+
257
+ ### Low Priority
258
+ 5. **next_fresh_object_number (#5)** - Simple duplication
259
+ 6. **Object Reference Extraction (#8)** - Could improve consistency
260
+ 7. **Unused Methods (#6)** - Cleanup task
261
+ 8. **Base64 Decoding (#7)** - Minor duplication
262
+
263
+ ---
264
+
265
+ ## Notes
266
+ - All refactoring should be accompanied by tests to ensure behavior doesn't change
267
+ - Consider backward compatibility if any methods are moved between modules
268
+ - Some duplication may be intentional for performance reasons (avoid method call overhead) - evaluate before refactoring
269
+