acro_that 0.1.1 → 0.1.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 7cccc792d205578fc981258a0b171e4906b3f9c945ac2abd664447054251f940
4
- data.tar.gz: 4544010bb0642b5c88cd9b5cc95a0c97d018109ecbe19648807419590d3f5e8b
3
+ metadata.gz: aa3b37611a5ca19d2cb0fa1c86d3821fb139b6f65ce9d3cd314d913b6e9e5db5
4
+ data.tar.gz: a4d07aaa7f268eb151b324b03e9575f563880b2cf97fd0172edc0e96c35e6eef
5
5
  SHA512:
6
- metadata.gz: 5a43d6c18e6babead9223d93f661032ff324ee45fae76646d2d8c37528ea4ddf294b5c5240ddc67d894c53aaff05d2245baff6c9c6f7813d9dad83c0584e5cea
7
- data.tar.gz: 17052dece8b5a7700c4d9f822b2ba7e2f2b8d0ded537a173d38b2e7190ffc0b6d2d9f86d51848b8d7dd99cd35015cda99299dfab0b7f87a5df1d0404e087cbe8
6
+ metadata.gz: a545eea311d77e7b46459e710925f8ec7b61baebd846ec7f602f6a77ab9532bab87d37be4bba53b00b43a09281e79faba79fcb71e5371c36953e877d556dcdd8
7
+ data.tar.gz: b9074d19a0cc330af44f4fd5609f49f8892e20deb95d522fa7ca5fb3099ca8c71f01cbc98ef7d3e6ed95620af090afc96035fbeaaf4a551ca6f9e4003c49f4a8
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- acro_that (0.1.0)
4
+ acro_that (0.1.2)
5
5
  chunky_png (~> 1.4)
6
6
 
7
7
  GEM
data/README.md CHANGED
@@ -189,6 +189,31 @@ flattened_doc = AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
189
189
  flattened_bytes = AcroThat::Document.flatten_pdf("input.pdf")
190
190
  ```
191
191
 
192
+ #### Clearing Fields
193
+
194
+ The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
195
+
196
+ ```ruby
197
+ doc = AcroThat::Document.new("form.pdf")
198
+
199
+ # Remove all fields matching a pattern
200
+ doc.clear!(remove_pattern: /^text-/)
201
+
202
+ # Keep only specific fields
203
+ doc.clear!(keep_fields: ["Name", "Email"])
204
+
205
+ # Remove specific fields
206
+ doc.clear!(remove_fields: ["OldField1", "OldField2"])
207
+
208
+ # Use a block to determine which fields to keep
209
+ doc.clear! { |name| !name.start_with?("temp_") }
210
+
211
+ # Write the cleared PDF
212
+ doc.write("cleared.pdf", flatten: true)
213
+ ```
214
+
215
+ **Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
216
+
192
217
  ### API Reference
193
218
 
194
219
  #### `AcroThat::Document.new(path_or_io)`
@@ -282,6 +307,30 @@ AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
282
307
  flattened_doc = AcroThat::Document.flatten_pdf("input.pdf")
283
308
  ```
284
309
 
310
+ #### `#clear(options = {})` and `#clear!(options = {})`
311
+ Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
312
+
313
+ - `keep_fields`: Array of field names to keep (all others removed)
314
+ - `remove_fields`: Array of field names to remove
315
+ - `remove_pattern`: Regex pattern - fields matching this are removed
316
+ - Block: Given field name, return `true` to keep, `false` to remove
317
+
318
+ ```ruby
319
+ # Remove all fields
320
+ cleared = doc.clear(remove_pattern: /.*/)
321
+
322
+ # Remove fields matching pattern (in-place)
323
+ doc.clear!(remove_pattern: /^text-/)
324
+
325
+ # Keep only specific fields
326
+ doc.clear!(keep_fields: ["Name", "Email"])
327
+
328
+ # Use block to filter fields
329
+ doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
330
+ ```
331
+
332
+ **Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
333
+
285
334
  ### Field Object
286
335
 
287
336
  Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
data/docs/README.md CHANGED
@@ -38,6 +38,17 @@ Explains how PDF object streams work and how `AcroThat` parses them:
38
38
 
39
39
  **Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
40
40
 
41
+ ### [Clearing Fields](./cleaning_fields.md)
42
+
43
+ Documentation for the `clear` and `clear!` methods:
44
+ - How to remove unwanted fields completely
45
+ - Difference between `clear` and `remove_field`
46
+ - Pattern matching and field selection
47
+ - Removing orphaned widget references
48
+ - Best practices for clearing PDFs
49
+
50
+ **Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
51
+
41
52
  ## Common Themes
42
53
 
43
54
  Throughout all documentation, you'll see these recurring themes:
@@ -54,6 +65,7 @@ Throughout all documentation, you'll see these recurring themes:
54
65
  1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
55
66
  2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
56
67
  3. Read [Object Streams](./object_streams.md) to understand compression features
68
+ 4. Read [Clearing Fields](./cleaning_fields.md) to learn how to remove unwanted fields
57
69
 
58
70
  **If you're debugging:**
59
71
  - [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
data/docs/cleaning_fields.md ADDED
@@ -0,0 +1,202 @@
1
+ # Clearing Fields with `clear` and `clear!`
2
+
3
+ The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
4
+
5
+ - Remove multiple layers of added fields
6
+ - Clear a PDF that has accumulated many unwanted fields
7
+ - Get back to a base file without certain fields
8
+ - Remove orphaned or invalid field references
9
+
10
+ Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
11
+
12
+ - Field objects are completely removed (not just marked as deleted)
13
+ - Widget annotations are removed from page `/Annots` arrays
14
+ - Orphaned widget references are cleaned up
15
+ - The AcroForm `/Fields` array is updated
16
+ - All references to removed fields are eliminated
17
+
18
+ ## Methods
19
+
20
+ ### `clear(options = {})`
21
+
22
+ Returns a new PDF with unwanted fields removed. Does not modify the current document.
23
+
24
+ **Options:**
25
+ - `keep_fields`: Array of field names to keep (all others removed)
26
+ - `remove_fields`: Array of field names to remove
27
+ - `remove_pattern`: Regex pattern - fields matching this are removed
28
+ - Block: Given field name, return `true` to keep, `false` to remove
29
+
30
+ ### `clear!(options = {})`
31
+
32
+ Same as `clear`, but modifies the current document in-place. Mutates the document instance.
33
+
34
+ ## Usage Examples
35
+
36
+ ### Remove All Fields
37
+
38
+ ```ruby
39
+ doc = AcroThat::Document.new("form.pdf")
40
+
41
+ # Remove all fields
42
+ cleared_pdf = doc.clear(remove_pattern: /.*/)
43
+
44
+ # Or in-place
45
+ doc.clear!(remove_pattern: /.*/)
46
+ ```
47
+
48
+ ### Remove Fields Matching a Pattern
49
+
50
+ ```ruby
51
+ # Remove all fields starting with "text-"
52
+ doc.clear!(remove_pattern: /^text-/)
53
+
54
+ # Remove UUID-like generated fields
55
+ doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
56
+ ```
57
+
58
+ ### Keep Only Specific Fields
59
+
60
+ ```ruby
61
+ # Keep only these fields, remove all others
62
+ doc.clear!(keep_fields: ["Name", "Email", "Phone"])
63
+
64
+ # Write the cleared PDF
65
+ doc.write("cleared.pdf", flatten: true)
66
+ ```
67
+
68
+ ### Remove Specific Fields
69
+
70
+ ```ruby
71
+ # Remove specific unwanted fields
72
+ doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
73
+ ```
74
+
75
+ ### Complex Selection with Block
76
+
77
+ ```ruby
78
+ # Remove all fields except those matching certain criteria
79
+ doc.clear! do |field_name|
80
+ # Keep fields that don't look generated
81
+ !field_name.start_with?("text-") &&
82
+ !field_name.match?(/^[a-f0-9]{20,}/)
83
+ end
84
+ ```
85
+
86
+ ## How It Works
87
+
88
+ The `clear` method:
89
+
90
+ 1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
91
+
92
+ 2. **Finds related widgets** for each field to be removed:
93
+ - Widgets that reference the field via `/Parent`
94
+ - Widgets that have the same name via `/T`
95
+
96
+ 3. **Collects objects to write**, excluding:
97
+ - Field objects that should be removed
98
+ - Widget annotation objects that should be removed
99
+
100
+ 4. **Updates AcroForm structure**:
101
+ - Removes field references from the `/Fields` array
102
+ - Handles both inline and indirect array references
103
+
104
+ 5. **Clears page annotations**:
105
+ - Removes widget references from page `/Annots` arrays
106
+ - Removes orphaned widget references (widgets pointing to non-existent fields)
107
+ - Removes references to widgets that don't exist in the cleared PDF
108
+
109
+ 6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
110
+
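Step 1 above (deciding which fields survive) can be sketched in plain Ruby. This is only an illustration: `keep_field?` is a hypothetical helper, and the precedence between the block, `keep_fields`, `remove_fields`, and `remove_pattern` is an assumption, not something the gem's documentation states.

```ruby
# Hypothetical helper (not part of acro_that): returns true when a field
# name should be kept. Option precedence here is an assumption.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  # A block decides on its own: truthy means keep
  return (block.call(name) ? true : false) if block
  # An explicit keep list removes everything not listed
  return keep_fields.include?(name) if keep_fields
  # Otherwise a field is removed if it is listed or matches the pattern
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)

  true
end

names = ["Name", "Email", "text-1", "temp_notes"]
kept = names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
# kept == ["Name", "Email", "temp_notes"]
```

With a predicate like this, the remaining steps only need the set of removed names to decide which field and widget objects to leave out of the rewrite.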
111
+ ## Key Differences from `remove_field`
112
+
113
+ | Feature | `remove_field` | `clear` |
114
+ |---------|---------------|---------|
115
+ | Update Type | Incremental update | Complete rewrite |
116
+ | Object Removal | Marks as deleted | Completely excluded |
117
+ | PDF Structure | Preserves all objects | Only includes selected objects |
118
+ | Use Case | Remove one/a few fields | Remove many fields or clean up |
119
+ | Performance | Fast (append only) | Slower (full rewrite) |
120
+
121
+ ## Best Practices
122
+
123
+ 1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, a single `clear` rewrite is more efficient than many incremental `remove_field` calls and produces cleaner output.
124
+
125
+ 2. **Consider flattening after clearing**: Since `clear` rewrites the PDF, writing with `write(..., flatten: true)` helps ensure compatibility across PDF viewers:
126
+
127
+ ```ruby
128
+ doc.clear!(remove_pattern: /^text-/)
129
+ doc.write("output.pdf", flatten: true)
130
+ ```
131
+
132
+ 3. **Combine with field addition**: After clearing, you can add new fields:
133
+
134
+ ```ruby
135
+ doc.clear!(remove_pattern: /.*/)
136
+ doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
137
+ doc.write("output.pdf", flatten: true)
138
+ ```
139
+
140
+ 4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
141
+
142
+ ```ruby
143
+ # Remove all UUID-like fields
144
+ doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
145
+
146
+ # Remove all fields containing "temp" or "test"
147
+ doc.clear!(remove_pattern: /temp|test/i)
148
+ ```
149
+
150
+ ## Technical Details
151
+
152
+ ### Orphaned Widget Removal
153
+
154
+ The `clear` method automatically identifies and removes orphaned widget references:
155
+
156
+ - **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
157
+ - **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
158
+
159
+ This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
160
+
161
+ ### Page Detection
162
+
163
+ The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
164
+
165
+ ### AcroForm Structure
166
+
167
+ The method properly handles both:
168
+ - **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
169
+ - **Indirect `/Fields` arrays**: Arrays referenced as separate objects
170
+
171
+ Both are updated to remove references to deleted fields.
172
+
173
+ ## Example: Complete Clearing Workflow
174
+
175
+ ```ruby
176
+ require 'acro_that'
177
+
178
+ # Load PDF with many unwanted fields
179
+ doc = AcroThat::Document.new("messy_form.pdf")
180
+
181
+ # Remove all generated/UUID-like fields
182
+ doc.clear! { |name|
183
+ # Keep only fields that look intentional
184
+ !name.match?(/^[a-f0-9-]{30,}/) && # Not UUID-like
185
+ !name.start_with?("temp_") && # Not temporary
186
+ !name.empty? # Not empty
187
+ }
188
+
189
+ # Add new fields
190
+ doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
191
+ doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
192
+
193
+ # Write cleared and updated PDF
194
+ doc.write("cleared_form.pdf", flatten: true)
195
+ ```
196
+
197
+ ## See Also
198
+
199
+ - [`flatten` and `flatten!`](../README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
200
+ - [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
201
+ - [Main README](../README.md) - General usage and API reference
202
+
data/issues/README.md ADDED
@@ -0,0 +1,38 @@
1
+ # Code Review Issues
2
+
3
+ This folder contains documentation of code cleanup and refactoring opportunities found in the codebase.
4
+
5
+ ## Files
6
+
7
+ - **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
8
+
9
+ ## Summary
10
+
11
+ ### High Priority Issues
12
+ 1. **Widget Matching Logic** - Duplicated across 6+ locations
13
+ 2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations
14
+
15
+ ### Medium Priority Issues
16
+ 3. **Page-Finding Logic** - Similar logic in 4+ methods
17
+ 4. **Box Parsing Logic** - Repeated code blocks for 5 box types
18
+
19
+ ### Low Priority Issues
20
+ 5. Duplicated `next_fresh_object_number` implementation
21
+ 6. Object reference extraction pattern duplication
22
+ 7. Unused method: `get_widget_rect_dimensions`
23
+ 8. Base64 decoding logic duplication
24
+
25
+ ## Quick Stats
26
+
27
+ - **8 refactoring opportunities** identified
28
+ - **6+ locations** with widget matching duplication
29
+ - **3 locations** with /Annots array manipulation duplication
30
+ - **1 unused method** found
31
+
32
+ ## Next Steps
33
+
34
+ 1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
35
+ 2. Prioritize refactoring based on maintenance needs
36
+ 3. Create test coverage before refactoring
37
+ 4. Refactor incrementally, starting with high-priority items
38
+
data/issues/refactoring-opportunities.md ADDED
@@ -0,0 +1,269 @@
1
+ # Refactoring Opportunities
2
+
3
+ This document identifies code duplication and unused methods that could be refactored to improve maintainability.
4
+
5
+ ## 1. Duplicated Page-Finding Logic
6
+
7
+ ### Issue
8
+ Multiple methods have similar logic for finding page objects in a PDF document.
9
+
10
+ ### Locations
11
+ - `Document#list_pages` (lines 75-104)
12
+ - `Document#collect_pages_from_tree` (lines 691-712)
13
+ - `Document#find_page_number_for_ref` (lines 714-728)
14
+ - `AddField#find_page_ref` (lines 155-211)
15
+
16
+ ### Pattern
17
+ The pattern `body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}` appears in multiple places with slight variations.
18
+
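A small sketch of how the two halves of that pattern behave. `page_object?` is a hypothetical name used only for illustration. The regex with `(?!s)` rejects `/Type /Pages`, but the plain `include?("/Type /Page")` check is a substring test and therefore also matches a `/Pages` node, so a shared helper would probably rely on the regex alone.

```ruby
# Regex taken from the pattern quoted above
PAGE_TYPE = %r{/Type\s*/Page(?!s)\b}

# Hypothetical helper name, for illustration only
def page_object?(body)
  body.match?(PAGE_TYPE)
end

page_object?("<< /Type /Page /Parent 2 0 R >>")           # => true
page_object?("<< /Type/Page /MediaBox [0 0 612 792] >>")  # => true
page_object?("<< /Type /Pages /Kids [3 0 R] /Count 1 >>") # => false

# The substring check alone cannot tell the two apart:
"<< /Type /Pages /Kids [3 0 R] >>".include?("/Type /Page") # => true
```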
19
+ ### Suggested Refactor
20
+ Create a shared module or utility methods in `DictScan`:
21
+ - `DictScan.is_page?(body)` - Check if a body represents a page object
22
+ - `Document#find_all_pages` - Unified method to find all page objects
23
+ - `Document#find_page_by_number(page_num)` - Find a specific page by number
24
+
25
+ ### Benefits
26
+ - Single source of truth for page detection logic
27
+ - Easier to maintain and update page-finding behavior
28
+ - Consistent page ordering across methods
29
+
30
+ ---
31
+
32
+ ## 2. Duplicated Widget-Matching Logic
33
+
34
+ ### Issue
35
+ Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
36
+ 1. `/Parent` reference pointing to the field
37
+ 2. `/T` (field name) matching the field name
38
+
39
+ ### Locations
40
+ - `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
41
+ - `Document#clear` (lines 472-495) - Finds widgets for removed fields
42
+ - `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
43
+ - `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
44
+ - `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
45
+ - `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
46
+
47
+ ### Pattern
48
+ The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
49
+
50
+ ### Suggested Refactor
51
+ Create utility methods in `Base` or a new `WidgetMatcher` module:
52
+ - `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
53
+ - `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
54
+ - `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
55
+
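One possible shape for `find_widgets_for_field`, sketched over a plain hash of `[obj_num, gen_num] => body` strings instead of the gem's resolver. Everything here is an assumption for illustration: the hash stands in for `resolver.each_object`, the `/Subtype /Widget` test stands in for `DictScan.is_widget?`, and the `/T` match only handles simple literal strings such as `/T (Name)`, not hex strings or escaped parentheses.

```ruby
# Hypothetical sketch: return refs of widgets that belong to a field,
# matched either by /Parent reference or by /T name.
def find_widgets_for_field(objects, field_ref, field_name)
  parent = %r{/Parent\s+#{field_ref[0]}\s+#{field_ref[1]}\s+R\b}

  objects.select do |ref, body|
    next false if ref == field_ref
    next false unless body.match?(%r{/Subtype\s*/Widget\b})

    # Match by parent reference, or by a literal-string /T value
    body.match?(parent) || body[%r{/T\s*\((.*?)\)}, 1] == field_name
  end.keys
end

objects = {
  [4, 0] => "<< /FT /Tx /T (Name) /Kids [5 0 R] >>",
  [5, 0] => "<< /Type /Annot /Subtype /Widget /Parent 4 0 R /Rect [0 0 1 1] >>",
  [6, 0] => "<< /Type /Annot /Subtype /Widget /T (Name) >>",
  [7, 0] => "<< /Type /Annot /Subtype /Widget /T (Other) /Parent 40 0 R >>"
}

widgets = find_widgets_for_field(objects, [4, 0], "Name")
# widgets == [[5, 0], [6, 0]]
```

Anchoring the `/Parent` regex on whitespace keeps object 4 from matching a reference to object 40.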
56
+ ### Benefits
57
+ - Centralized widget matching logic
58
+ - Consistent widget finding behavior
59
+ - Easier to extend matching criteria
60
+
61
+ ---
62
+
63
+ ## 3. Duplicated /Annots Array Manipulation
64
+
65
+ ### Issue
66
+ Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
67
+ 1. Inline `/Annots` arrays: `/Annots [...]`
68
+ 2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
69
+
70
+ ### Locations
71
+ - `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
72
+ - `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
73
+ - `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
74
+
75
+ ### Pattern
76
+ All three methods have similar conditional logic:
77
+ ```ruby
78
+ if page_body =~ %r{/Annots\s*\[(.*?)\]}m
79
+ # Handle inline array
80
+ elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
81
+ # Handle indirect array
82
+ else
83
+ # Create new /Annots array
84
+ end
85
+ ```
86
+
87
+ ### Suggested Refactor
88
+ Extend `DictScan` with methods:
89
+ - `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
90
+ - `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
91
+ - `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
92
+
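As a rough illustration of the removal case, the sketch below handles only an inline `/Annots [...]` array on a page body string. `remove_from_inline_annots` is a hypothetical name. An indirect `/Annots X Y R` reference is left untouched here, because the referenced array object would have to be patched separately.

```ruby
# Hypothetical sketch: drop one widget reference from an inline /Annots array.
# widget_ref is [obj_num, gen_num].
def remove_from_inline_annots(page_body, widget_ref)
  num, gen = widget_ref

  page_body.sub(%r{/Annots\s*\[(.*?)\]}m) do
    refs = Regexp.last_match(1).scan(/(\d+)\s+(\d+)\s+R/)
    kept = refs.reject { |n, g| n.to_i == num && g.to_i == gen }
    "/Annots [#{kept.map { |n, g| "#{n} #{g} R" }.join(' ')}]"
  end
end

page = "<< /Type /Page /Annots [5 0 R 7 0 R 9 0 R] >>"
remove_from_inline_annots(page, [7, 0])
# => "<< /Type /Page /Annots [5 0 R 9 0 R] >>"
```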
93
+ ### Benefits
94
+ - Single implementation of /Annots manipulation logic
95
+ - Consistent handling of edge cases
96
+ - Easier to test /Annots operations
97
+
98
+ ---
99
+
100
+ ## 4. Duplicated Box Parsing Logic
101
+
102
+ ### Issue
103
+ `Document#list_pages` has repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
104
+
105
+ ### Locations
106
+ - `Document#list_pages` (lines 120-165)
107
+
108
+ ### Pattern
109
+ Each box type uses identical logic:
110
+ ```ruby
111
+ if body =~ %r{/MediaBox\s*\[(.*?)\]}
112
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
113
+ if box_values.length == 4
114
+ llx, lly, urx, ury = box_values
115
+ media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
116
+ end
117
+ end
118
+ ```
119
+
120
+ ### Suggested Refactor
121
+ Create a helper method:
122
+ ```ruby
123
+ def parse_box(body, box_type)
124
+ pattern = %r{/#{box_type}\s*\[(.*?)\]}
125
+ return nil unless body =~ pattern
126
+
127
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
128
+ return nil unless box_values.length == 4
129
+
130
+ llx, lly, urx, ury = box_values
131
+ { llx: llx, lly: lly, urx: urx, ury: ury }
132
+ end
133
+ ```
134
+
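A usage sketch for the helper above, showing how the five repeated blocks might collapse into one loop. The helper is restated so the snippet runs on its own; `BOX_TYPES` and the sample page body are assumptions made for the example.

```ruby
BOX_TYPES = %w[MediaBox CropBox ArtBox BleedBox TrimBox].freeze

# Same logic as the suggested helper, restated so this snippet is runnable
def parse_box(body, box_type)
  m = body.match(%r{/#{box_type}\s*\[(.*?)\]})
  return nil unless m

  values = m[1].scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless values.length == 4

  llx, lly, urx, ury = values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

body = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] >>"

# One call per box type; boxes that are absent come back as nil
boxes = BOX_TYPES.to_h { |type| [type, parse_box(body, type)] }
width = boxes["MediaBox"][:urx] - boxes["MediaBox"][:llx]
# width == 612.0
```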
135
+ ### Benefits
136
+ - Reduces ~45 lines of repeated parsing code to one ~10-line helper plus a single call per box type
137
+ - Easier to add new box types
138
+ - Consistent parsing logic
139
+
140
+ ---
141
+
142
+ ## 5. Duplicated next_fresh_object_number Implementation
143
+
144
+ ### Issue
145
+ The `next_fresh_object_number` method is implemented identically in two places.
146
+
147
+ ### Locations
148
+ - `Document#next_fresh_object_number` (lines 730-739)
149
+ - `Base#next_fresh_object_number` (lines 28-37)
150
+
151
+ ### Pattern
152
+ Both methods have identical implementation:
153
+ ```ruby
154
+ def next_fresh_object_number
155
+ max_obj_num = 0
156
+ resolver.each_object do |ref, _|
157
+ max_obj_num = [max_obj_num, ref[0]].max
158
+ end
159
+ patches.each do |p|
160
+ max_obj_num = [max_obj_num, p[:ref][0]].max
161
+ end
162
+ max_obj_num + 1
163
+ end
164
+ ```
165
+
166
+ ### Suggested Refactor
167
+ - Remove `Document#next_fresh_object_number` - it's only called within `Document` but could use `Base`'s implementation
168
+ - Or: Document already has access to resolver and patches, so remove duplication by making Document use Base's method
169
+
170
+ ### Benefits
171
+ - Single implementation
172
+ - Consistent object numbering logic
173
+
174
+ ---
175
+
176
+ ## 6. Unused Methods
177
+
178
+ ### Issue
179
+ Some methods are defined but never called.
180
+
181
+ ### Locations
182
+ - `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
183
+ - Defined but never used
184
+ - `extract_rect` is used instead, which provides the same information
185
+
186
+ ### Suggested Refactor
187
+ - Remove `get_widget_rect_dimensions` if it's truly unused
188
+ - Or: Verify if it was intended for future use and document it
189
+
190
+ ### Benefits
191
+ - Cleaner codebase
192
+ - Less confusion about which method to use
193
+
194
+ ---
195
+
196
+ ## 7. Duplicated Base64 Decoding Logic
197
+
198
+ ### Issue
199
+ `AddSignatureAppearance` has two similar methods for decoding base64 data.
200
+
201
+ ### Locations
202
+ - `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
203
+ - `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
204
+
205
+ ### Pattern
206
+ Both methods handle base64 decoding, with slightly different logic. Could potentially be unified.
207
+
208
+ ### Suggested Refactor
209
+ - Consider merging into a single method that handles both cases
210
+ - Or: Document the distinction if both are needed
211
+
212
+ ### Benefits
213
+ - Simpler API
214
+ - Less code duplication
215
+
216
+ ---
217
+
218
+ ## 8. Duplicated Regex Pattern for Object Reference
219
+
220
+ ### Issue
221
+ The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
222
+
223
+ ### Locations
224
+ Throughout the codebase, used in:
225
+ - Extracting `/Parent` references
226
+ - Extracting `/P` (page) references
227
+ - Extracting `/Pages` references
228
+ - Extracting `/Fields` array references
229
+ - And many more...
230
+
231
+ ### Suggested Refactor
232
+ Create a utility method:
233
+ ```ruby
234
+ def DictScan.extract_object_ref(str)
235
+ # Extract object reference from string
236
+ # Returns [obj_num, gen_num] or nil
237
+ end
238
+ ```
239
+
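The stub above leaves the body open. One way it might be filled in is sketched below, under stated assumptions: it lives in a stand-in module named `RefSketch` rather than the real `DictScan`, it returns only the first reference found, and it adds a trailing `\b` that the pattern quoted in this document does not have.

```ruby
# Stand-in module for illustration; not the gem's DictScan
module RefSketch
  # Object reference pattern from the document, plus a trailing \b (assumption)
  REF = /(\d+)\s+(\d+)\s+R\b/

  # Returns [obj_num, gen_num] for the first reference in str, or nil
  def self.extract_object_ref(str)
    m = REF.match(str)
    m && [m[1].to_i, m[2].to_i]
  end
end

RefSketch.extract_object_ref("/Parent 12 0 R")    # => [12, 0]
RefSketch.extract_object_ref("/Rect [0 0 10 10]") # => nil
```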
240
+ ### Benefits
241
+ - Consistent reference extraction
242
+ - Easier to update if PDF reference format changes
243
+ - More readable code
244
+
245
+ ---
246
+
247
+ ## Priority Recommendations
248
+
249
+ ### High Priority
250
+ 1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
251
+ 2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
252
+
253
+ ### Medium Priority
254
+ 3. **Page-Finding Logic (#1)** - Used in multiple places, but less frequently
255
+ 4. **Box Parsing Logic (#4)** - Simple duplication, easy to refactor
256
+
257
+ ### Low Priority
258
+ 5. **next_fresh_object_number (#5)** - Simple duplication
259
+ 6. **Object Reference Extraction (#8)** - Could improve consistency
260
+ 7. **Unused Methods (#6)** - Cleanup task
261
+ 8. **Base64 Decoding (#7)** - Minor duplication
262
+
263
+ ---
264
+
265
+ ## Notes
266
+ - All refactoring should be accompanied by tests to ensure behavior doesn't change
267
+ - Consider backward compatibility if any methods are moved between modules
268
+ - Some duplication may be intentional for performance reasons (avoid method call overhead) - evaluate before refactoring
269
+
@@ -157,10 +157,10 @@ module AcroThat
157
157
  resolver.each_object do |ref, body|
158
158
  next unless body
159
159
 
160
- # Check for /Type /Page with or without space, or /Type/Page
160
+ # Check for /Type /Page (actual page, not /Type/Pages)
161
+ # Must match /Type /Page or /Type/Page but NOT /Type/Pages
161
162
  is_page = body.include?("/Type /Page") ||
162
- body.include?("/Type/Page") ||
163
- (body.include?("/Type") && body.include?("/Page") && body =~ %r{/Type\s*/Page})
163
+ (body =~ %r{/Type\s*/Page(?!s)\b})
164
164
  next unless is_page
165
165
 
166
166
  page_objects << ref
@@ -183,8 +183,8 @@ module AcroThat
183
183
  kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
184
184
  kid_ref = [num_str.to_i, gen_str.to_i]
185
185
  kid_body = resolver.object_body(kid_ref)
186
- # Check if this kid is a page or another Pages node
187
- if kid_body && (kid_body.include?("/Type /Page") || kid_body.include?("/Type/Page") || (kid_body.include?("/Type") && kid_body.include?("/Page")))
186
+ # Check if this kid is a page (not /Type/Pages)
187
+ if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
188
188
  page_objects << kid_ref
189
189
  elsif kid_body && kid_body.include?("/Type /Pages")
190
190
  # Recursively find pages in this Pages node
@@ -192,7 +192,7 @@ module AcroThat
192
192
  kid_body[::Regexp.last_match(0)..].scan(/(\d+)\s+(\d+)\s+R/) do |n, g|
193
193
  grandkid_ref = [n.to_i, g.to_i]
194
194
  grandkid_body = resolver.object_body(grandkid_ref)
195
- if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body.include?("/Type/Page"))
195
+ if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body =~ %r{/Type\s*/Page(?!s)\b})
196
196
  page_objects << grandkid_ref
197
197
  end
198
198
  end
@@ -336,10 +336,10 @@ module AcroThat
336
336
  end
337
337
 
338
338
  def create_form_xobject(_obj_num, image_obj_num, field_width, field_height, _scale_factor, scaled_width,
339
- scaled_height)
339
+ scaled_height)
340
340
  # Calculate offset to left-align the image horizontally and center vertically
341
- offset_x = 0.0 # Left-aligned (no horizontal offset)
342
- offset_y = (field_height - scaled_height) / 2.0 # Center vertically
341
+ offset_x = 0.0 # Left-aligned (no horizontal offset)
342
+ offset_y = (field_height - scaled_height) / 2.0 # Center vertically
343
343
 
344
344
  # PDF content stream that draws the image
345
345
  # q = save graphics state
@@ -71,62 +71,213 @@ module AcroThat
71
71
  self
72
72
  end
73
73
 
74
- # Return an array of Field(name, value, type, ref)
75
- def list_fields
76
- fields = []
77
- field_widgets = {}
78
- widgets_by_name = {}
74
+ # Return an array of page information (page number, width, height, ref, metadata)
75
+ def list_pages
76
+ pages = []
77
+ page_objects = []
79
78
 
80
- # First pass: collect widget information
81
- @resolver.each_object do |ref, body|
82
- next unless DictScan.is_widget?(body)
79
+ # Try to get pages in document order via page tree first
80
+ root_ref = @resolver.root_ref
81
+ if root_ref
82
+ catalog_body = @resolver.object_body(root_ref)
83
+ if catalog_body && catalog_body =~ %r{/Pages\s+(\d+)\s+(\d+)\s+R}
84
+ pages_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
83
85
 
84
- # Extract position from widget
85
- rect_tok = DictScan.value_token_after("/Rect", body)
86
- next unless rect_tok && rect_tok.start_with?("[")
86
+ # Recursively collect pages from page tree
87
+ collect_pages_from_tree(pages_ref, page_objects)
88
+ end
89
+ end
87
90
 
88
- # Parse [x y x+width y+height] format
89
- rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
90
- next unless rect_values.length == 4
91
+ # Fallback: collect all page objects if page tree didn't work
92
+ if page_objects.empty?
93
+ @resolver.each_object do |ref, body|
94
+ next unless body
91
95
 
92
- x, y, x2, y2 = rect_values
93
- width = x2 - x
94
- height = y2 - y
96
+ # Match /Type /Page or /Type/Page but NOT /Type/Pages
97
+ is_page = body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
98
+ next unless is_page
95
99
 
96
- page_num = nil
97
- if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
98
- page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
99
- page_num = find_page_number_for_ref(page_ref)
100
+ page_objects << ref unless page_objects.include?(ref)
100
101
  end
101
102
 
102
- widget_info = {
103
- x: x, y: y, width: width, height: height, page: page_num
104
- }
103
+ # Sort by object number as fallback
104
+ page_objects.sort_by! { |ref| ref[0] }
105
+ end
105
106
 
106
- if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
107
- parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
107
+ # Second pass: extract information from each page
108
+ page_objects.each_with_index do |ref, index|
109
+ body = @resolver.object_body(ref)
110
+ next unless body
111
+
112
+ # Extract MediaBox, CropBox, or ArtBox for dimensions
113
+ width = nil
114
+ height = nil
115
+ media_box = nil
116
+ crop_box = nil
117
+ art_box = nil
118
+ bleed_box = nil
119
+ trim_box = nil
120
+
121
+ # Try MediaBox first (most common)
122
+ if body =~ %r{/MediaBox\s*\[(.*?)\]}
123
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
124
+ if box_values.length == 4
125
+ llx, lly, urx, ury = box_values
126
+ width = urx - llx
127
+ height = ury - lly
128
+ media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
129
+ end
130
+ end
108
131
 
109
- field_widgets[parent_ref] ||= []
110
- field_widgets[parent_ref] << widget_info
132
+ # Try CropBox
133
+ if body =~ %r{/CropBox\s*\[(.*?)\]}
134
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
135
+ if box_values.length == 4
136
+ llx, lly, urx, ury = box_values
137
+ crop_box = { llx: llx, lly: lly, urx: urx, ury: ury }
138
+ end
111
139
  end
112
140
 
113
- next unless body.include?("/T")
141
+ # Try ArtBox
142
+ if body =~ %r{/ArtBox\s*\[(.*?)\]}
143
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
144
+ if box_values.length == 4
145
+ llx, lly, urx, ury = box_values
146
+ art_box = { llx: llx, lly: lly, urx: urx, ury: ury }
147
+ end
148
+ end
114
149
 
115
- t_tok = DictScan.value_token_after("/T", body)
116
- if t_tok
117
- widget_name = DictScan.decode_pdf_string(t_tok)
118
- if widget_name && !widget_name.empty?
119
- widgets_by_name[widget_name] ||= []
120
- widgets_by_name[widget_name] << widget_info
150
+ # Try BleedBox
151
+ if body =~ %r{/BleedBox\s*\[(.*?)\]}
152
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
153
+ if box_values.length == 4
154
+ llx, lly, urx, ury = box_values
155
+ bleed_box = { llx: llx, lly: lly, urx: urx, ury: ury }
156
+ end
157
+ end
158
+
159
+ # Try TrimBox
160
+ if body =~ %r{/TrimBox\s*\[(.*?)\]}
161
+ box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
162
+ if box_values.length == 4
163
+ llx, lly, urx, ury = box_values
164
+ trim_box = { llx: llx, lly: lly, urx: urx, ury: ury }
165
+ end
166
+ end
167
+
168
+ # Extract rotation
169
+ rotate = nil
170
+ if body =~ %r{/Rotate\s+(\d+)}
171
+ rotate = Integer(::Regexp.last_match(1))
172
+ end
173
+
174
+ # Extract Resources reference
175
+ resources_ref = nil
176
+ if body =~ %r{/Resources\s+(\d+)\s+(\d+)\s+R}
177
+ resources_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
178
+ end
179
+
180
+ # Extract Parent reference
181
+ parent_ref = nil
182
+ if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
183
+ parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
184
+ end
185
+
186
+ # Extract Contents reference(s)
187
+ contents_refs = []
188
+ if body =~ %r{/Contents\s+(\d+)\s+(\d+)\s+R}
189
+ contents_refs << [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
190
+ elsif body =~ %r{/Contents\s*\[(.*?)\]}
191
+ contents_array = ::Regexp.last_match(1)
192
+ contents_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
193
+ contents_refs << [num_str.to_i, gen_str.to_i]
121
194
  end
122
195
  end
196
+
197
+ # Build metadata hash
198
+ metadata = {
199
+ rotate: rotate,
200
+ media_box: media_box,
201
+ crop_box: crop_box,
202
+ art_box: art_box,
203
+ bleed_box: bleed_box,
204
+ trim_box: trim_box,
205
+ resources_ref: resources_ref,
206
+ parent_ref: parent_ref,
207
+ contents_refs: contents_refs
208
+ }
209
+
210
+ pages << {
211
+ page: index + 1, # Page number starting at 1
212
+ width: width,
213
+ height: height,
214
+ ref: ref,
215
+ metadata: metadata
216
+ }
123
217
  end
124
218
 
125
- # Second pass: collect all fields (both field objects and widget annotations with /T)
219
+ pages
220
+ end
221
+
222
+ # Return an array of Field(name, value, type, ref)
223
+ def list_fields
224
+ fields = []
225
+ field_widgets = {}
226
+ widgets_by_name = {}
227
+
228
+ # First pass: collect widget information
126
229
  @resolver.each_object do |ref, body|
127
- next unless body&.include?("/T")
230
+ next unless body
231
+
232
+ is_widget = DictScan.is_widget?(body)
233
+
234
+ # Collect widget information if this is a widget
235
+ if is_widget
236
+ # Extract position from widget
237
+ rect_tok = DictScan.value_token_after("/Rect", body)
238
+ if rect_tok && rect_tok.start_with?("[")
239
+ # Parse [x y x+width y+height] format
240
+ rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
241
+ if rect_values.length == 4
242
+ x, y, x2, y2 = rect_values
243
+ width = x2 - x
244
+ height = y2 - y
245
+
246
+ page_num = nil
247
+ if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
248
+ page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
249
+ page_num = find_page_number_for_ref(page_ref)
250
+ end
251
+
252
+ widget_info = {
253
+ x: x, y: y, width: width, height: height, page: page_num
254
+ }
255
+
256
+ if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
257
+ parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
128
258
 
129
- is_widget_field = DictScan.is_widget?(body)
259
+ field_widgets[parent_ref] ||= []
260
+ field_widgets[parent_ref] << widget_info
261
+ end
262
+
263
+ if body.include?("/T")
264
+ t_tok = DictScan.value_token_after("/T", body)
265
+ if t_tok
266
+ widget_name = DictScan.decode_pdf_string(t_tok)
267
+ if widget_name && !widget_name.empty?
268
+ widgets_by_name[widget_name] ||= []
269
+ widgets_by_name[widget_name] << widget_info
270
+ end
271
+ end
272
+ end
273
+ end
274
+ end
275
+ end
276
+
277
+ # Second pass: collect all fields (both field objects and widget annotations with /T)
278
+ next unless body.include?("/T")
279
+
280
+ is_widget_field = is_widget
130
281
  hint = body.include?("/FT") || is_widget_field || body.include?("/Kids") || body.include?("/Parent")
131
282
  next unless hint
132
283
 
@@ -143,8 +294,7 @@ module AcroThat
143
294
  type = ft_tok
144
295
 
145
296
  position = {}
146
- is_widget_annot = DictScan.is_widget?(body)
147
- if is_widget_annot
297
+ if is_widget
148
298
  rect_tok = DictScan.value_token_after("/Rect", body)
149
299
  if rect_tok && rect_tok.start_with?("[")
150
300
  rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
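The hunks above use one number-scanning idiom for every page box (`/MediaBox`, `/CropBox`, `/ArtBox`, `/BleedBox`, `/TrimBox`) and for widget `/Rect` arrays. A minimal standalone sketch of that idiom follows. The helper names `parse_corners` and `parse_box` and the sample strings are hypothetical and not part of acro_that.

```ruby
# Hypothetical helper (not part of acro_that): reads a four-number array such
# as "[0 0 612 792]" and returns the corner coordinates as floats, or nil.
def parse_corners(token)
  values = token.to_s.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless values.length == 4

  llx, lly, urx, ury = values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

# Hypothetical helper: finds "/<key> [ ... ]" in a raw dictionary string.
def parse_box(body, key)
  inner = body[%r{/#{Regexp.escape(key)}\s*\[(.*?)\]}, 1]
  inner && parse_corners(inner)
end

page_body = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] >>"
media_box = parse_box(page_body, "MediaBox")
trim_box  = parse_box(page_body, "TrimBox") # nil: not present in this sample

page_width  = media_box[:urx] - media_box[:llx]
page_height = media_box[:ury] - media_box[:lly]

# A widget /Rect uses the same layout, so width and height fall out the same way.
rect = parse_corners("[72 650.5 272 670.5]")
position = {
  x: rect[:llx], y: rect[:lly],
  width: rect[:urx] - rect[:llx], height: rect[:ury] - rect[:lly]
}
```

Folding the five near-identical box blocks into one helper is a design choice of this sketch only; the released code keeps them as separate inline blocks.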
@@ -270,8 +420,261 @@ module AcroThat
270
420
  field.remove
271
421
  end
272
422
 
423
+ # Clean up the PDF by removing unwanted fields.
424
+ # Options:
425
+ # - keep_fields: Array of field names to keep (all others removed)
426
+ # - remove_fields: Array of field names to remove
427
+ # - remove_pattern: Regex pattern - fields matching this are removed
428
+ # - block: Given field name, return true to keep, false to remove
429
+ # This rewrites the entire PDF (like flatten) but excludes the unwanted fields.
430
+ def clear(keep_fields: nil, remove_fields: nil, remove_pattern: nil)
431
+ root_ref = @resolver.root_ref
432
+ raise "Cannot clear: no /Root found" unless root_ref
433
+
434
+ # Build a set of fields to remove
435
+ fields_to_remove = Set.new
436
+
437
+ # Get all current fields
438
+ all_fields = list_fields
439
+
440
+ if block_given?
441
+ # Use block to determine which fields to keep
442
+ all_fields.each do |field|
443
+ fields_to_remove.add(field.name) unless yield(field.name)
444
+ end
445
+ elsif keep_fields
446
+ # Keep only specified fields
447
+ keep_set = Set.new(keep_fields.map(&:to_s))
448
+ all_fields.each do |field|
449
+ fields_to_remove.add(field.name) unless keep_set.include?(field.name)
450
+ end
451
+ elsif remove_fields
452
+ # Remove specified fields
453
+ remove_set = Set.new(remove_fields.map(&:to_s))
454
+ all_fields.each do |field|
455
+ fields_to_remove.add(field.name) if remove_set.include?(field.name)
456
+ end
457
+ elsif remove_pattern
458
+ # Remove fields matching pattern
459
+ all_fields.each do |field|
460
+ fields_to_remove.add(field.name) if field.name =~ remove_pattern
461
+ end
462
+ else
463
+ # No criteria specified, return original
464
+ return @raw
465
+ end
466
+
467
+ # Build sets of refs to exclude
468
+ field_refs_to_remove = Set.new
469
+ widget_refs_to_remove = Set.new
470
+
471
+ all_fields.each do |field|
472
+ next unless fields_to_remove.include?(field.name)
473
+
474
+ field_refs_to_remove.add(field.ref) if field.valid_ref?
475
+
476
+ # Find all widget annotations for this field
477
+ @resolver.each_object do |widget_ref, body|
478
+ next unless body && DictScan.is_widget?(body)
479
+ next if widget_ref == field.ref
480
+
481
+ # Match by /Parent reference
482
+ if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
483
+ widget_parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
484
+ if widget_parent_ref == field.ref
485
+ widget_refs_to_remove.add(widget_ref)
486
+ next
487
+ end
488
+ end
489
+
490
+ # Also match by field name (/T)
491
+ next unless body.include?("/T")
492
+
493
+ t_tok = DictScan.value_token_after("/T", body)
494
+ next unless t_tok
495
+
496
+ widget_name = DictScan.decode_pdf_string(t_tok)
497
+ if widget_name && widget_name == field.name
498
+ widget_refs_to_remove.add(widget_ref)
499
+ end
500
+ end
501
+ end
502
+
503
+ # Collect objects to write (excluding removed fields and widgets)
504
+ objects = []
505
+ @resolver.each_object do |ref, body|
506
+ next if field_refs_to_remove.include?(ref)
507
+ next if widget_refs_to_remove.include?(ref)
508
+ next unless body
509
+
510
+ objects << { ref: ref, body: body }
511
+ end
512
+
513
+ # Process AcroForm to remove field references from /Fields array
514
+ af_ref = acroform_ref
515
+ if af_ref
516
+ # Find the AcroForm object in our objects list
517
+ af_obj = objects.find { |o| o[:ref] == af_ref }
518
+ if af_obj
519
+ af_body = af_obj[:body]
520
+ fields_array_ref = DictScan.value_token_after("/Fields", af_body)
521
+
522
+ if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
523
+ # /Fields points to separate array object
524
+ arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
525
+ arr_obj = objects.find { |o| o[:ref] == arr_ref }
526
+ if arr_obj
527
+ arr_body = arr_obj[:body]
528
+ field_refs_to_remove.each do |field_ref|
529
+ arr_body = DictScan.remove_ref_from_array(arr_body, field_ref)
530
+ end
531
+ # Clean up empty array
532
+ arr_body = arr_body.strip.gsub(/\[\s+\]/, "[]")
533
+ arr_obj[:body] = arr_body
534
+ end
535
+ elsif af_body.include?("/Fields")
536
+ # Inline /Fields array
537
+ field_refs_to_remove.each do |field_ref|
538
+ af_body = DictScan.remove_ref_from_inline_array(af_body, "/Fields", field_ref)
539
+ end
540
+ af_obj[:body] = af_body
541
+ end
542
+ end
543
+ end
544
+
545
+ # Process page objects to remove widget references from /Annots arrays
546
+ # Also remove any orphaned widget references (widgets that reference non-existent fields)
547
+ objects_in_file = Set.new(objects.map { |o| o[:ref] })
548
+ field_refs_in_file = Set.new
549
+ objects.each do |obj|
550
+ body = obj[:body]
551
+ # Check if this is a field object
552
+ if body&.include?("/FT") && body.include?("/T")
553
+ field_refs_in_file.add(obj[:ref])
554
+ end
555
+
556
+ body = obj[:body]
557
+ # Match /Type /Page or /Type/Page but NOT /Type/Pages
558
+ next unless body&.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
559
+
560
+ # Handle inline /Annots array
561
+ if body =~ %r{/Annots\s*\[(.*?)\]}
562
+ annots_array_str = ::Regexp.last_match(1)
563
+
564
+ # Remove widgets that match removed fields
565
+ widget_refs_to_remove.each do |widget_ref|
566
+ annots_array_str = annots_array_str.gsub(/\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/, "").strip
567
+ annots_array_str = annots_array_str.gsub(/\s+/, " ")
568
+ end
569
+
570
+ # Also remove orphaned widget references (widgets not in objects_in_file or pointing to non-existent fields)
571
+ annots_refs = annots_array_str.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
572
+ annots_refs.each do |annot_ref|
573
+ # Check if this annotation is a widget that should be removed
574
+ if objects_in_file.include?(annot_ref)
575
+ # Widget exists - check if it's an orphaned widget (references non-existent field)
576
+ widget_obj = objects.find { |o| o[:ref] == annot_ref }
577
+ if widget_obj && DictScan.is_widget?(widget_obj[:body])
578
+ widget_body = widget_obj[:body]
579
+ # Check if widget references a parent field that doesn't exist
580
+ if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
581
+ parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
582
+ unless field_refs_in_file.include?(parent_ref)
583
+ # Parent field doesn't exist - orphaned widget, remove it
584
+ annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
585
+ annots_array_str = annots_array_str.gsub(/\s+/, " ")
586
+ end
587
+ end
588
+ end
589
+ else
590
+ # Widget object doesn't exist - remove it
591
+ annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
592
+ annots_array_str = annots_array_str.gsub(/\s+/, " ")
593
+ end
594
+ end
595
+
596
+ new_annots = if annots_array_str.empty? || annots_array_str.strip.empty?
597
+ "[]"
598
+ else
599
+ "[#{annots_array_str}]"
600
+ end
601
+
602
+ new_body = body.sub(%r{/Annots\s*\[.*?\]}, "/Annots #{new_annots}")
603
+ obj[:body] = new_body
604
+ # Handle indirect /Annots array reference
605
+ elsif body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
606
+ annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
607
+ annots_obj = objects.find { |o| o[:ref] == annots_array_ref }
608
+ if annots_obj
609
+ annots_body = annots_obj[:body]
610
+
611
+ # Remove widgets that match removed fields
612
+ widget_refs_to_remove.each do |widget_ref|
613
+ annots_body = DictScan.remove_ref_from_array(annots_body, widget_ref)
614
+ end
615
+
616
+ # Also remove orphaned widget references
617
+ annots_refs = annots_body.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
618
+ annots_refs.each do |annot_ref|
619
+ if objects_in_file.include?(annot_ref)
620
+ widget_obj = objects.find { |o| o[:ref] == annot_ref }
621
+ if widget_obj && DictScan.is_widget?(widget_obj[:body])
622
+ widget_body = widget_obj[:body]
623
+ if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
624
+ parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
625
+ unless field_refs_in_file.include?(parent_ref)
626
+ annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
627
+ end
628
+ end
629
+ end
630
+ else
631
+ annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
632
+ end
633
+ end
634
+
635
+ annots_obj[:body] = annots_body
636
+ end
637
+ end
638
+ end
639
+
640
+ # Sort objects by object number
641
+ objects.sort_by! { |obj| obj[:ref][0] }
642
+
643
+ # Write the cleaned PDF
644
+ writer = PDFWriter.new
645
+ writer.write_header
646
+
647
+ objects.each do |obj|
648
+ writer.write_object(obj[:ref], obj[:body])
649
+ end
650
+
651
+ writer.write_xref
652
+
653
+ trailer_dict = @resolver.trailer_dict
654
+ info_ref = nil
655
+ if trailer_dict =~ %r{/Info\s+(\d+)\s+(\d+)\s+R}
656
+ info_ref = [::Regexp.last_match(1).to_i, ::Regexp.last_match(2).to_i]
657
+ end
658
+
659
+ # Write trailer
660
+ max_obj_num = objects.map { |obj| obj[:ref][0] }.max || 0
661
+ writer.write_trailer(max_obj_num + 1, root_ref, info_ref)
662
+
663
+ writer.output
664
+ end
665
+
666
+ # Clean up in-place (mutates current instance)
667
+ def clear!(...)
668
+ cleaned_content = clear(...)
669
+ @raw = cleaned_content
670
+ @resolver = AcroThat::ObjectResolver.new(cleaned_content)
671
+ @patches = []
672
+
673
+ self
674
+ end
675
+
273
676
  # Write out with an incremental update
274
- def write(path_out = nil, flatten: false)
677
+ def write(path_out = nil, flatten: true)
275
678
  deduped_patches = @patches.reverse.uniq { |p| p[:ref] }.reverse
276
679
  writer = AcroThat::IncrementalWriter.new(@raw, deduped_patches)
277
680
  @raw = writer.render
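The inline `/Annots` handling in `clear` above deletes each removed widget reference from the array text and then collapses the leftover whitespace. A minimal sketch of that step on a plain string, assuming a hypothetical helper `strip_annots` (not part of acro_that) and a made-up page dictionary:

```ruby
# Hypothetical helper (not part of acro_that): drops the given [num, gen] refs
# from an inline /Annots array inside a raw page dictionary string.
def strip_annots(page_body, refs_to_remove)
  page_body.sub(%r{/Annots\s*\[(.*?)\]}m) do
    kept = Regexp.last_match(1)
    refs_to_remove.each do |num, gen|
      # \b keeps "12 0 R" from matching inside "112 0 R"
      kept = kept.gsub(/\b#{num}\s+#{gen}\s+R\b/, "")
    end
    "/Annots [#{kept.split.join(' ')}]"
  end
end

page = "<< /Type /Page /Annots [12 0 R 13 0 R 112 0 R] /MediaBox [0 0 612 792] >>"

cleaned = strip_annots(page, [[12, 0], [13, 0]])
emptied = strip_annots(page, [[12, 0], [13, 0], [112, 0]])
```

The sketch covers only the inline-array case. The indirect case, where `/Annots` points at a separate array object, and the orphaned-widget check are left out.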
@@ -290,6 +693,29 @@ module AcroThat
290
693
 
291
694
  private
292
695
 
696
+ def collect_pages_from_tree(pages_ref, page_objects)
697
+ pages_body = @resolver.object_body(pages_ref)
698
+ return unless pages_body
699
+
700
+ # Extract /Kids array from Pages object
701
+ if pages_body =~ %r{/Kids\s*\[(.*?)\]}m
702
+ kids_array = ::Regexp.last_match(1)
703
+ # Extract all object references from Kids array in order
704
+ kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
705
+ kid_ref = [num_str.to_i, gen_str.to_i]
706
+ kid_body = @resolver.object_body(kid_ref)
707
+
708
+ # Check if this kid is a page (not /Type/Pages)
709
+ if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
710
+ page_objects << kid_ref unless page_objects.include?(kid_ref)
711
+ elsif kid_body && kid_body.include?("/Type /Pages")
712
+ # Recursively find pages in this Pages node
713
+ collect_pages_from_tree(kid_ref, page_objects)
714
+ end
715
+ end
716
+ end
717
+ end
718
+
293
719
  def find_page_number_for_ref(page_ref)
294
720
  page_objects = []
295
721
  @resolver.each_object do |ref, body|
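`collect_pages_from_tree` above walks the `/Kids` arrays depth-first so that leaf pages come out in document order. A minimal sketch of that traversal follows. As assumptions, a hypothetical in-memory hash stands in for the gem's resolver, and the sketch uses a regex for the `/Type /Pages` test where the hunk uses `include?`.

```ruby
# Hypothetical object table standing in for the gem's resolver:
# keys are [object_number, generation] refs, values are raw object bodies.
OBJECTS = {
  [1, 0] => "<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 3 >>",
  [2, 0] => "<< /Type /Page /Parent 1 0 R >>",
  [3, 0] => "<< /Type /Pages /Kids [4 0 R 5 0 R] /Parent 1 0 R /Count 2 >>",
  [4, 0] => "<< /Type /Page /Parent 3 0 R >>",
  [5, 0] => "<< /Type/Page /Parent 3 0 R >>"
}.freeze

# Depth-first walk of the page tree; appends leaf page refs in document order.
def collect_pages(objects, pages_ref, found = [])
  body = objects[pages_ref]
  return found unless body

  kids = body[%r{/Kids\s*\[(.*?)\]}m, 1]
  return found unless kids

  kids.scan(/(\d+)\s+(\d+)\s+R/) do |num, gen|
    kid_ref = [num.to_i, gen.to_i]
    kid_body = objects[kid_ref]
    next unless kid_body

    if kid_body =~ %r{/Type\s*/Pages\b}
      collect_pages(objects, kid_ref, found) # intermediate node: recurse
    elsif kid_body =~ %r{/Type\s*/Page\b}
      found << kid_ref unless found.include?(kid_ref)
    end
  end
  found
end

page_refs = collect_pages(OBJECTS, [1, 0])
page_number_of_4 = page_refs.index([4, 0]) + 1 # 1-based, as in list_pages
```

The sketch does not guard against a malformed tree whose `/Kids` entries form a cycle.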
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module AcroThat
4
- VERSION = "0.1.1"
4
+ VERSION = "0.1.2"
5
5
  end
data/lib/acro_that.rb CHANGED
@@ -4,6 +4,7 @@ require "strscan"
4
4
  require "stringio"
5
5
  require "zlib"
6
6
  require "base64"
7
+ require "set"
7
8
 
8
9
  require_relative "acro_that/dict_scan"
9
10
  require_relative "acro_that/object_resolver"
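The added `require "set"` backs the `Set` bookkeeping in `clear`, which checks its criteria in a fixed order: a block first, then `keep_fields`, then `remove_fields`, then `remove_pattern`. A minimal sketch of that precedence on plain name strings, assuming a hypothetical standalone method `names_to_remove` (not part of acro_that, which works on field objects):

```ruby
require "set"

# Hypothetical standalone version of the selection step; returns the set of
# names to remove, or nil when no criteria were given.
def names_to_remove(names, keep_fields: nil, remove_fields: nil, remove_pattern: nil)
  if block_given?
    # Block returns true to keep a field, false to remove it
    Set.new(names.reject { |name| yield(name) })
  elsif keep_fields
    keep = Set.new(keep_fields.map(&:to_s))
    Set.new(names.reject { |name| keep.include?(name) })
  elsif remove_fields
    remove = Set.new(remove_fields.map(&:to_s))
    Set.new(names.select { |name| remove.include?(name) })
  elsif remove_pattern
    Set.new(names.select { |name| name =~ remove_pattern })
  end
end

names = ["text-1", "text-2", "signature", "date"]

by_pattern = names_to_remove(names, remove_pattern: /^text-/)
by_keep    = names_to_remove(names, keep_fields: [:signature])
# The block takes priority, so keep_fields is ignored here
by_block   = names_to_remove(names, keep_fields: ["date"]) { |n| n.start_with?("text-") }
no_filter  = names_to_remove(names)
```

Because only the first matching criterion is applied, passing several options together does not combine them. In the diff, the no-criteria case returns the original bytes unchanged, for which `nil` stands in here.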
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: acro_that
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.1.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Michael Wynkoop
@@ -98,9 +98,12 @@ files:
98
98
  - Rakefile
99
99
  - acro_that.gemspec
100
100
  - docs/README.md
101
+ - docs/clear_fields.md
101
102
  - docs/dict_scan_explained.md
102
103
  - docs/object_streams.md
103
104
  - docs/pdf_structure.md
105
+ - issues/README.md
106
+ - issues/refactoring-opportunities.md
104
107
  - lib/acro_that.rb
105
108
  - lib/acro_that/actions/add_field.rb
106
109
  - lib/acro_that/actions/add_signature_appearance.rb