acro_that 0.1.1 → 0.1.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -0
- data/Gemfile.lock +1 -1
- data/README.md +49 -0
- data/docs/README.md +12 -0
- data/docs/clear_fields.md +202 -0
- data/issues/README.md +38 -0
- data/issues/refactoring-opportunities.md +269 -0
- data/lib/acro_that/actions/add_field.rb +2 -55
- data/lib/acro_that/actions/add_signature_appearance.rb +3 -3
- data/lib/acro_that/actions/base.rb +4 -0
- data/lib/acro_that/actions/remove_field.rb +1 -5
- data/lib/acro_that/dict_scan.rb +7 -0
- data/lib/acro_that/document.rb +480 -45
- data/lib/acro_that/version.rb +1 -1
- data/lib/acro_that.rb +1 -0
- data/publish +183 -0
- metadata +5 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 56f6be44d023bdedaf254cc097d18791f284793d0c79c07847fd462c0643cc91
|
|
4
|
+
data.tar.gz: b649d290433fac6a6113a74ca47bfde60e31687f6e93b85e30c32053ee416b3a
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: b27c5ec88a2cfdec49750d9d3316979c90c0815461eb6281b4cf602dc6533c8761c8fba7fb49407256b9ade87f2f3731e8f9d121c4ef089d08be2901107cc295
|
|
7
|
+
data.tar.gz: f1b41cef40049564af8f081547460d72e787a721f676db4c80cd74eb21844b48364ac3b871e6ad2839505908d9e5be3b8d643c16812bd91ae6d0e858bcfb51eb
|
data/CHANGELOG.md
CHANGED
|
@@ -5,6 +5,22 @@ All notable changes to this project will be documented in this file.
|
|
|
5
5
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
6
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
7
7
|
|
|
8
|
+
## [0.1.3] - 2025-01-XX
|
|
9
|
+
|
|
10
|
+
### Fixed
|
|
11
|
+
- Fixed bug where fields added to multi-page PDFs were all placed on the same page. Fields now correctly appear on their specified pages when using the `page` option in `add_field`.
|
|
12
|
+
|
|
13
|
+
### Changed
|
|
14
|
+
- Refactored page-finding logic to eliminate code duplication across `Document` and `AddField` classes
|
|
15
|
+
- Unified page discovery through `Document#find_all_pages` and `Document#find_page_by_number` methods
|
|
16
|
+
- Updated all page detection patterns to use centralized `DictScan.is_page?` utility method
|
|
17
|
+
|
|
18
|
+
### Added
|
|
19
|
+
- `DictScan.is_page?` utility method for consistent page object detection across the codebase
|
|
20
|
+
- `Document#find_all_pages` private method for unified page discovery in document order
|
|
21
|
+
- `Document#find_page_by_number` private method for finding pages by page number
|
|
22
|
+
- Exposed `find_page_by_number` through `Base` module for use in action classes
|
|
23
|
+
|
|
8
24
|
## [0.1.1] - 2025-10-31
|
|
9
25
|
|
|
10
26
|
### Added
|
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
|
@@ -189,6 +189,31 @@ flattened_doc = AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
|
|
|
189
189
|
flattened_bytes = AcroThat::Document.flatten_pdf("input.pdf")
|
|
190
190
|
```
|
|
191
191
|
|
|
192
|
+
#### Clearing Fields
|
|
193
|
+
|
|
194
|
+
The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
|
|
195
|
+
|
|
196
|
+
```ruby
|
|
197
|
+
doc = AcroThat::Document.new("form.pdf")
|
|
198
|
+
|
|
199
|
+
# Remove all fields matching a pattern
|
|
200
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
201
|
+
|
|
202
|
+
# Keep only specific fields
|
|
203
|
+
doc.clear!(keep_fields: ["Name", "Email"])
|
|
204
|
+
|
|
205
|
+
# Remove specific fields
|
|
206
|
+
doc.clear!(remove_fields: ["OldField1", "OldField2"])
|
|
207
|
+
|
|
208
|
+
# Use a block to determine which fields to keep
|
|
209
|
+
doc.clear! { |name| !name.start_with?("temp_") }
|
|
210
|
+
|
|
211
|
+
# Write the cleared PDF
|
|
212
|
+
doc.write("cleared.pdf", flatten: true)
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
**Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/clear_fields.md) for detailed information.
|
|
216
|
+
|
|
192
217
|
### API Reference
|
|
193
218
|
|
|
194
219
|
#### `AcroThat::Document.new(path_or_io)`
|
|
@@ -282,6 +307,30 @@ AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
|
|
|
282
307
|
flattened_doc = AcroThat::Document.flatten_pdf("input.pdf")
|
|
283
308
|
```
|
|
284
309
|
|
|
310
|
+
#### `#clear(options = {})` and `#clear!(options = {})`
|
|
311
|
+
Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
|
|
312
|
+
|
|
313
|
+
- `keep_fields`: Array of field names to keep (all others removed)
|
|
314
|
+
- `remove_fields`: Array of field names to remove
|
|
315
|
+
- `remove_pattern`: Regex pattern - fields matching this are removed
|
|
316
|
+
- Block: Given field name, return `true` to keep, `false` to remove
|
|
317
|
+
|
|
318
|
+
```ruby
|
|
319
|
+
# Remove all fields
|
|
320
|
+
cleared = doc.clear(remove_pattern: /.*/)
|
|
321
|
+
|
|
322
|
+
# Remove fields matching pattern (in-place)
|
|
323
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
324
|
+
|
|
325
|
+
# Keep only specific fields
|
|
326
|
+
doc.clear!(keep_fields: ["Name", "Email"])
|
|
327
|
+
|
|
328
|
+
# Use block to filter fields
|
|
329
|
+
doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
|
|
330
|
+
```
|
|
331
|
+
|
|
332
|
+
**Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/clear_fields.md) for detailed information.
|
|
333
|
+
|
|
285
334
|
### Field Object
|
|
286
335
|
|
|
287
336
|
Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
|
data/docs/README.md
CHANGED
|
@@ -38,6 +38,17 @@ Explains how PDF object streams work and how `AcroThat` parses them:
|
|
|
38
38
|
|
|
39
39
|
**Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
|
|
40
40
|
|
|
41
|
+
### [Clearing Fields](./clear_fields.md)
|
|
42
|
+
|
|
43
|
+
Documentation for the `clear` and `clear!` methods:
|
|
44
|
+
- How to remove unwanted fields completely
|
|
45
|
+
- Difference between `clear` and `remove_field`
|
|
46
|
+
- Pattern matching and field selection
|
|
47
|
+
- Removing orphaned widget references
|
|
48
|
+
- Best practices for clearing PDFs
|
|
49
|
+
|
|
50
|
+
**Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
|
|
51
|
+
|
|
41
52
|
## Common Themes
|
|
42
53
|
|
|
43
54
|
Throughout all documentation, you'll see these recurring themes:
|
|
@@ -54,6 +65,7 @@ Throughout all documentation, you'll see these recurring themes:
|
|
|
54
65
|
1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
|
|
55
66
|
2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
|
|
56
67
|
3. Read [Object Streams](./object_streams.md) to understand compression features
|
|
68
|
+
4. Read [Clearing Fields](./clear_fields.md) to learn how to remove unwanted fields
|
|
57
69
|
|
|
58
70
|
**If you're debugging:**
|
|
59
71
|
- [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
|
data/docs/clear_fields.md
ADDED
|
@@ -0,0 +1,202 @@
|
|
|
1
|
+
# Clearing Fields with `clear` and `clear!`
|
|
2
|
+
|
|
3
|
+
The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
|
|
4
|
+
|
|
5
|
+
- Remove multiple layers of added fields
|
|
6
|
+
- Clear a PDF that has accumulated many unwanted fields
|
|
7
|
+
- Get back to a base file without certain fields
|
|
8
|
+
- Remove orphaned or invalid field references
|
|
9
|
+
|
|
10
|
+
Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
|
|
11
|
+
|
|
12
|
+
- Field objects are completely removed (not just marked as deleted)
|
|
13
|
+
- Widget annotations are removed from page `/Annots` arrays
|
|
14
|
+
- Orphaned widget references are cleaned up
|
|
15
|
+
- The AcroForm `/Fields` array is updated
|
|
16
|
+
- All references to removed fields are eliminated
|
|
17
|
+
|
|
18
|
+
## Methods
|
|
19
|
+
|
|
20
|
+
### `clear(options = {})`
|
|
21
|
+
|
|
22
|
+
Returns a new PDF with unwanted fields removed. Does not modify the current document.
|
|
23
|
+
|
|
24
|
+
**Options:**
|
|
25
|
+
- `keep_fields`: Array of field names to keep (all others removed)
|
|
26
|
+
- `remove_fields`: Array of field names to remove
|
|
27
|
+
- `remove_pattern`: Regex pattern - fields matching this are removed
|
|
28
|
+
- Block: Given field name, return `true` to keep, `false` to remove
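As a rough illustration of how these options could translate into a per-field keep/remove decision, here is a minimal sketch. The helper name `keep_field?` and the precedence order (keep list, then remove list, then pattern, then block) are assumptions made for this example; the gem's actual implementation is not shown in this diff and may combine the options differently.

```ruby
# Hypothetical helper, not part of AcroThat: decides whether a field survives.
# Assumed precedence: keep_fields > remove_fields > remove_pattern > block.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  return keep_fields.include?(name) if keep_fields
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)
  return !!block.call(name) if block

  true # no criteria given: keep everything
end

names = ["Name", "text-1", "temp_notes"]
kept = names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
```

With the sample names above, only `"text-1"` is filtered out, because it is the only name matching the pattern.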
|
|
29
|
+
|
|
30
|
+
### `clear!(options = {})`
|
|
31
|
+
|
|
32
|
+
Same as `clear`, but modifies the current document in-place. Mutates the document instance.
|
|
33
|
+
|
|
34
|
+
## Usage Examples
|
|
35
|
+
|
|
36
|
+
### Remove All Fields
|
|
37
|
+
|
|
38
|
+
```ruby
|
|
39
|
+
doc = AcroThat::Document.new("form.pdf")
|
|
40
|
+
|
|
41
|
+
# Remove all fields
|
|
42
|
+
cleared_pdf = doc.clear(remove_pattern: /.*/)
|
|
43
|
+
|
|
44
|
+
# Or in-place
|
|
45
|
+
doc.clear!(remove_pattern: /.*/)
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
### Remove Fields Matching a Pattern
|
|
49
|
+
|
|
50
|
+
```ruby
|
|
51
|
+
# Remove all fields starting with "text-"
|
|
52
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
53
|
+
|
|
54
|
+
# Remove UUID-like generated fields
|
|
55
|
+
doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
### Keep Only Specific Fields
|
|
59
|
+
|
|
60
|
+
```ruby
|
|
61
|
+
# Keep only these fields, remove all others
|
|
62
|
+
doc.clear!(keep_fields: ["Name", "Email", "Phone"])
|
|
63
|
+
|
|
64
|
+
# Write the cleared PDF
|
|
65
|
+
doc.write("cleared.pdf", flatten: true)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
### Remove Specific Fields
|
|
69
|
+
|
|
70
|
+
```ruby
|
|
71
|
+
# Remove specific unwanted fields
|
|
72
|
+
doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
### Complex Selection with Block
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
# Remove all fields except those matching certain criteria
|
|
79
|
+
doc.clear! do |field_name|
|
|
80
|
+
# Keep fields that don't look generated
|
|
81
|
+
!field_name.start_with?("text-") &&
|
|
82
|
+
!field_name.match?(/^[a-f0-9]{20,}/)
|
|
83
|
+
end
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
## How It Works
|
|
87
|
+
|
|
88
|
+
The `clear` method:
|
|
89
|
+
|
|
90
|
+
1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
|
|
91
|
+
|
|
92
|
+
2. **Finds related widgets** for each field to be removed:
|
|
93
|
+
- Widgets that reference the field via `/Parent`
|
|
94
|
+
- Widgets that have the same name via `/T`
|
|
95
|
+
|
|
96
|
+
3. **Collects objects to write**, excluding:
|
|
97
|
+
- Field objects that should be removed
|
|
98
|
+
- Widget annotation objects that should be removed
|
|
99
|
+
|
|
100
|
+
4. **Updates AcroForm structure**:
|
|
101
|
+
- Removes field references from the `/Fields` array
|
|
102
|
+
- Handles both inline and indirect array references
|
|
103
|
+
|
|
104
|
+
5. **Clears page annotations**:
|
|
105
|
+
- Removes widget references from page `/Annots` arrays
|
|
106
|
+
- Removes orphaned widget references (widgets pointing to non-existent fields)
|
|
107
|
+
- Removes references to widgets that don't exist in the cleared PDF
|
|
108
|
+
|
|
109
|
+
6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
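The first four steps can be sketched on a toy in-memory model. Everything here is hypothetical: `toy_clear` is not a gem method, `objects` is a plain Hash of `[obj_num, gen_num] => body` standing in for the real resolver, field names are assumed to be plain literal strings, and only an inline `/Fields` array is handled. Page `/Annots` cleanup (step 5) and the actual file rewrite (step 6) are left out.

```ruby
# Toy sketch of steps 1-4. The block returns true to keep a field name.
def toy_clear(objects, acroform_ref)
  # Step 1: objects whose /T name the block rejects
  removed = objects.select do |_ref, body|
    name = body[%r{/T\s*\((.*?)\)}, 1]
    name && !yield(name)
  end.keys

  # Step 2: widgets that point at a removed field via /Parent
  removed += objects.select do |_ref, body|
    m = body.match(%r{/Parent\s+(\d+)\s+(\d+)\s+R})
    m && removed.include?([m[1].to_i, m[2].to_i])
  end.keys

  # Step 3: collect everything that is not being removed
  kept = objects.reject { |ref, _body| removed.include?(ref) }

  # Step 4: drop removed references from the inline /Fields array
  kept[acroform_ref] = kept[acroform_ref].sub(%r{/Fields\s*\[(.*?)\]}m) do
    refs = Regexp.last_match(1).scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [n.to_i, g.to_i] }
    remaining = (refs - removed).map { |n, g| "#{n} #{g} R" }.join(" ")
    "/Fields [#{remaining}]"
  end
  kept
end

objects = {
  [1, 0] => "<< /Type /Catalog /AcroForm 2 0 R >>",
  [2, 0] => "<< /Fields [5 0 R 6 0 R] >>",
  [5, 0] => "<< /FT /Tx /T (Name) >>",
  [6, 0] => "<< /FT /Tx /T (temp_x) >>",
  [7, 0] => "<< /Subtype /Widget /Parent 6 0 R >>"
}
result = toy_clear(objects, [2, 0]) { |name| !name.start_with?("temp_") }
```

In this example the `temp_x` field and its widget are dropped, and the AcroForm body ends up referencing only `5 0 R`.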
|
|
110
|
+
|
|
111
|
+
## Key Differences from `remove_field`
|
|
112
|
+
|
|
113
|
+
| Feature | `remove_field` | `clear` |
|
|
114
|
+
|---------|---------------|---------|
|
|
115
|
+
| Update Type | Incremental update | Complete rewrite |
|
|
116
|
+
| Object Removal | Marks as deleted | Completely excluded |
|
|
117
|
+
| PDF Structure | Preserves all objects | Only includes selected objects |
|
|
118
|
+
| Use Case | Remove one/a few fields | Remove many fields or clean up |
|
|
119
|
+
| Performance | Fast (append only) | Slower (full rewrite) |
|
|
120
|
+
|
|
121
|
+
## Best Practices
|
|
122
|
+
|
|
123
|
+
1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, `clear` is more efficient and produces cleaner output.
|
|
124
|
+
|
|
125
|
+
2. **Consider flattening after clearing**: Since `clear` rewrites the PDF, writing with `write(..., flatten: true)` can improve compatibility across PDF viewers:
|
|
126
|
+
|
|
127
|
+
```ruby
|
|
128
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
129
|
+
doc.write("output.pdf", flatten: true)
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
3. **Combine with field addition**: After clearing, you can add new fields:
|
|
133
|
+
|
|
134
|
+
```ruby
|
|
135
|
+
doc.clear!(remove_pattern: /.*/)
|
|
136
|
+
doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
|
|
137
|
+
doc.write("output.pdf", flatten: true)
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
|
|
141
|
+
|
|
142
|
+
```ruby
|
|
143
|
+
# Remove all UUID-like fields
|
|
144
|
+
doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
|
|
145
|
+
|
|
146
|
+
# Remove all fields containing "temp" or "test"
|
|
147
|
+
doc.clear!(remove_pattern: /temp|test/i)
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
## Technical Details
|
|
151
|
+
|
|
152
|
+
### Orphaned Widget Removal
|
|
153
|
+
|
|
154
|
+
The `clear` method automatically identifies and removes orphaned widget references:
|
|
155
|
+
|
|
156
|
+
- **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
|
|
157
|
+
- **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
|
|
158
|
+
|
|
159
|
+
This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
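A minimal sketch of the two checks described above, using a toy model rather than the gem's internals. `prune_annots` is a hypothetical name, and `objects` is assumed to be a Hash of `[obj_num, gen_num] => body` containing only the objects that survive clearing.

```ruby
# Hypothetical helper: returns the /Annots references worth keeping.
def prune_annots(annot_refs, objects)
  annot_refs.select do |ref|
    body = objects[ref]
    next false unless body # reference to an object that no longer exists

    # Keep annotations without a /Parent; drop widgets whose parent is gone.
    m = body.match(%r{/Parent\s+(\d+)\s+(\d+)\s+R})
    m.nil? || objects.key?([m[1].to_i, m[2].to_i])
  end
end

objects = {
  [5, 0] => "<< /FT /Tx /T (Name) >>",
  [6, 0] => "<< /Subtype /Widget /Parent 5 0 R >>",
  [8, 0] => "<< /Subtype /Widget /Parent 7 0 R >>" # parent 7 0 was removed
}
surviving = prune_annots([[6, 0], [8, 0], [9, 0]], objects)
```

Here `8 0 R` is dropped because its parent is missing, and `9 0 R` is dropped because the object itself is missing.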
|
|
160
|
+
|
|
161
|
+
### Page Detection
|
|
162
|
+
|
|
163
|
+
The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
|
|
164
|
+
|
|
165
|
+
### AcroForm Structure
|
|
166
|
+
|
|
167
|
+
The method properly handles both:
|
|
168
|
+
- **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
|
|
169
|
+
- **Indirect `/Fields` arrays**: Arrays referenced as separate objects
|
|
170
|
+
|
|
171
|
+
Both are updated to remove references to deleted fields.
|
|
172
|
+
|
|
173
|
+
## Example: Complete Clearing Workflow
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
require 'acro_that'
|
|
177
|
+
|
|
178
|
+
# Load PDF with many unwanted fields
|
|
179
|
+
doc = AcroThat::Document.new("messy_form.pdf")
|
|
180
|
+
|
|
181
|
+
# Remove all generated/UUID-like fields
|
|
182
|
+
doc.clear! { |name|
|
|
183
|
+
# Keep only fields that look intentional
|
|
184
|
+
!name.match?(/^[a-f0-9-]{30,}/) && # Not UUID-like
|
|
185
|
+
!name.start_with?("temp_") && # Not temporary
|
|
186
|
+
!name.empty? # Not empty
|
|
187
|
+
}
|
|
188
|
+
|
|
189
|
+
# Add new fields
|
|
190
|
+
doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
|
|
191
|
+
doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
|
|
192
|
+
|
|
193
|
+
# Write cleared and updated PDF
|
|
194
|
+
doc.write("cleared_form.pdf", flatten: true)
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
## See Also
|
|
198
|
+
|
|
199
|
+
- [`flatten` and `flatten!`](../README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
|
|
200
|
+
- [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
|
|
201
|
+
- [Main README](../README.md) - General usage and API reference
|
|
202
|
+
|
data/issues/README.md
ADDED
|
@@ -0,0 +1,38 @@
|
|
|
1
|
+
# Code Review Issues
|
|
2
|
+
|
|
3
|
+
This folder contains documentation of code cleanup and refactoring opportunities found in the codebase.
|
|
4
|
+
|
|
5
|
+
## Files
|
|
6
|
+
|
|
7
|
+
- **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
### High Priority Issues
|
|
12
|
+
1. **Widget Matching Logic** - Duplicated across 6+ locations
|
|
13
|
+
2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations
|
|
14
|
+
|
|
15
|
+
### Medium Priority Issues
|
|
16
|
+
3. **Page-Finding Logic** - Similar logic in 4+ methods
|
|
17
|
+
4. **Box Parsing Logic** - Repeated code blocks for 5 box types
|
|
18
|
+
|
|
19
|
+
### Low Priority Issues
|
|
20
|
+
5. Duplicated `next_fresh_object_number` implementation
|
|
21
|
+
6. Object reference extraction pattern duplication
|
|
22
|
+
7. Unused method: `get_widget_rect_dimensions`
|
|
23
|
+
8. Base64 decoding logic duplication
|
|
24
|
+
|
|
25
|
+
## Quick Stats
|
|
26
|
+
|
|
27
|
+
- **8 refactoring opportunities** identified
|
|
28
|
+
- **6+ locations** with widget matching duplication
|
|
29
|
+
- **3 locations** with /Annots array manipulation duplication
|
|
30
|
+
- **1 unused method** found
|
|
31
|
+
|
|
32
|
+
## Next Steps
|
|
33
|
+
|
|
34
|
+
1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
|
|
35
|
+
2. Prioritize refactoring based on maintenance needs
|
|
36
|
+
3. Create test coverage before refactoring
|
|
37
|
+
4. Refactor incrementally, starting with high-priority items
|
|
38
|
+
|
data/issues/refactoring-opportunities.md
ADDED
|
@@ -0,0 +1,269 @@
|
|
|
1
|
+
# Refactoring Opportunities
|
|
2
|
+
|
|
3
|
+
This document identifies code duplication and unused methods that could be refactored to improve maintainability.
|
|
4
|
+
|
|
5
|
+
## 1. Duplicated Page-Finding Logic
|
|
6
|
+
|
|
7
|
+
### Issue
|
|
8
|
+
Multiple methods have similar logic for finding page objects in a PDF document.
|
|
9
|
+
|
|
10
|
+
### Locations
|
|
11
|
+
- `Document#list_pages` (lines 75-104)
|
|
12
|
+
- `Document#collect_pages_from_tree` (lines 691-712)
|
|
13
|
+
- `Document#find_page_number_for_ref` (lines 714-728)
|
|
14
|
+
- `AddField#find_page_ref` (lines 155-211)
|
|
15
|
+
|
|
16
|
+
### Pattern
|
|
17
|
+
The pattern `body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}` appears in multiple places with slight variations.
|
|
18
|
+
|
|
19
|
+
### Suggested Refactor
|
|
20
|
+
Create a shared module or utility methods in `DictScan`:
|
|
21
|
+
- `DictScan.is_page?(body)` - Check if a body represents a page object
|
|
22
|
+
- `Document#find_all_pages` - Unified method to find all page objects
|
|
23
|
+
- `Document#find_page_by_number(page_num)` - Find a specific page by number
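One way such a page check could look is sketched below. `PageCheck.page?` is a hypothetical stand-in for the proposed `DictScan.is_page?`, whose real body is not shown here. It relies only on the lookahead form of the pattern quoted above, since a plain substring check for `/Type /Page` would also match `/Type /Pages`.

```ruby
# Hypothetical stand-in for the proposed DictScan.is_page? helper.
module PageCheck
  # Matches "/Type /Page" but not "/Type /Pages" (the page tree container).
  PAGE_TYPE = %r{/Type\s*/Page(?!s)\b}

  def self.page?(body)
    body.match?(PAGE_TYPE)
  end
end

PageCheck.page?("<< /Type /Page /Parent 2 0 R >>")  # a leaf page
PageCheck.page?("<< /Type /Pages /Kids [3 0 R] >>") # the container
```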
|
|
24
|
+
|
|
25
|
+
### Benefits
|
|
26
|
+
- Single source of truth for page detection logic
|
|
27
|
+
- Easier to maintain and update page-finding behavior
|
|
28
|
+
- Consistent page ordering across methods
|
|
29
|
+
|
|
30
|
+
---
|
|
31
|
+
|
|
32
|
+
## 2. Duplicated Widget-Matching Logic
|
|
33
|
+
|
|
34
|
+
### Issue
|
|
35
|
+
Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
|
|
36
|
+
1. `/Parent` reference pointing to the field
|
|
37
|
+
2. `/T` (field name) matching the field name
|
|
38
|
+
|
|
39
|
+
### Locations
|
|
40
|
+
- `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
|
|
41
|
+
- `Document#clear` (lines 472-495) - Finds widgets for removed fields
|
|
42
|
+
- `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
|
|
43
|
+
- `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
|
|
44
|
+
- `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
|
|
45
|
+
- `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
|
|
46
|
+
|
|
47
|
+
### Pattern
|
|
48
|
+
The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
|
|
49
|
+
|
|
50
|
+
### Suggested Refactor
|
|
51
|
+
Create utility methods in `Base` or a new `WidgetMatcher` module:
|
|
52
|
+
- `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
|
|
53
|
+
- `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
|
|
54
|
+
- `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
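A possible shape for the combined matcher, sketched on a toy model. The signature takes an extra `objects` argument (a Hash of `[obj_num, gen_num] => body`) that is an assumption of this example, and `/T` values are assumed to be plain literal strings; hex-encoded or escaped names would need extra handling.

```ruby
# Hypothetical sketch: widgets matching a field by /Parent reference or by /T name.
def find_widgets_for_field(objects, field_ref, field_name)
  parent_pattern = %r{/Parent\s+#{field_ref[0]}\s+#{field_ref[1]}\s+R\b}
  name_pattern = %r{/T\s*\(#{Regexp.escape(field_name)}\)}

  objects.select do |ref, body|
    next false if ref == field_ref                      # the field itself is not its own widget
    next false unless body.match?(%r{/Subtype\s*/Widget})

    body.match?(parent_pattern) || body.match?(name_pattern)
  end.keys
end

objects = {
  [5, 0] => "<< /FT /Tx /T (Name) >>",
  [6, 0] => "<< /Type /Annot /Subtype /Widget /Parent 5 0 R >>",
  [7, 0] => "<< /Type /Annot /Subtype /Widget /T (Name) >>",
  [8, 0] => "<< /Type /Annot /Subtype /Widget /Parent 15 0 R /T (Other) >>"
}
widgets = find_widgets_for_field(objects, [5, 0], "Name")
```

Object `8 0` is not returned: its `/Parent 15 0 R` does not match `5 0 R`, and its name differs.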
|
|
55
|
+
|
|
56
|
+
### Benefits
|
|
57
|
+
- Centralized widget matching logic
|
|
58
|
+
- Consistent widget finding behavior
|
|
59
|
+
- Easier to extend matching criteria
|
|
60
|
+
|
|
61
|
+
---
|
|
62
|
+
|
|
63
|
+
## 3. Duplicated /Annots Array Manipulation
|
|
64
|
+
|
|
65
|
+
### Issue
|
|
66
|
+
Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
|
|
67
|
+
1. Inline `/Annots` arrays: `/Annots [...]`
|
|
68
|
+
2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
|
|
69
|
+
|
|
70
|
+
### Locations
|
|
71
|
+
- `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
|
|
72
|
+
- `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
|
|
73
|
+
- `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
|
|
74
|
+
|
|
75
|
+
### Pattern
|
|
76
|
+
All three methods have similar conditional logic:
|
|
77
|
+
```ruby
|
|
78
|
+
if page_body =~ %r{/Annots\s*\[(.*?)\]}m
|
|
79
|
+
# Handle inline array
|
|
80
|
+
elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
|
|
81
|
+
# Handle indirect array
|
|
82
|
+
else
|
|
83
|
+
# Create new /Annots array
|
|
84
|
+
end
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
### Suggested Refactor
|
|
88
|
+
Extend `DictScan` with methods:
|
|
89
|
+
- `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
|
|
90
|
+
- `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
|
|
91
|
+
- `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
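A minimal sketch of the removal case, written as a free function rather than a real `DictScan` method. It handles only the inline form and assumes the array contains nothing but indirect references; an indirect `/Annots X Y R` would require looking up the referenced array object, which is outside this example.

```ruby
# Hypothetical sketch: remove one widget reference from an inline /Annots array.
# Page bodies without an inline array are returned unchanged.
def remove_from_annots_array(page_body, widget_ref)
  num, gen = widget_ref
  page_body.sub(%r{/Annots\s*\[(.*?)\]}m) do
    refs = Regexp.last_match(1).scan(/(\d+)\s+(\d+)\s+R/)
    kept = refs.reject { |n, g| n.to_i == num && g.to_i == gen }
    "/Annots [#{kept.map { |n, g| "#{n} #{g} R" }.join(' ')}]"
  end
end

page = "<< /Type /Page /Annots [6 0 R 7 0 R 16 0 R] >>"
updated = remove_from_annots_array(page, [6, 0])
```

Comparing parsed numbers instead of substrings avoids accidentally touching `16 0 R` when removing `6 0 R`.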
|
|
92
|
+
|
|
93
|
+
### Benefits
|
|
94
|
+
- Single implementation of /Annots manipulation logic
|
|
95
|
+
- Consistent handling of edge cases
|
|
96
|
+
- Easier to test /Annots operations
|
|
97
|
+
|
|
98
|
+
---
|
|
99
|
+
|
|
100
|
+
## 4. Duplicated Box Parsing Logic
|
|
101
|
+
|
|
102
|
+
### Issue
|
|
103
|
+
`Document#list_pages` has repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
|
|
104
|
+
|
|
105
|
+
### Locations
|
|
106
|
+
- `Document#list_pages` (lines 120-165)
|
|
107
|
+
|
|
108
|
+
### Pattern
|
|
109
|
+
Each box type uses identical logic:
|
|
110
|
+
```ruby
|
|
111
|
+
if body =~ %r{/MediaBox\s*\[(.*?)\]}
|
|
112
|
+
box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
113
|
+
if box_values.length == 4
|
|
114
|
+
llx, lly, urx, ury = box_values
|
|
115
|
+
media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
|
|
116
|
+
end
|
|
117
|
+
end
|
|
118
|
+
```
|
|
119
|
+
|
|
120
|
+
### Suggested Refactor
|
|
121
|
+
Create a helper method:
|
|
122
|
+
```ruby
|
|
123
|
+
def parse_box(body, box_type)
|
|
124
|
+
pattern = %r{/#{box_type}\s*\[(.*?)\]}
|
|
125
|
+
return nil unless body =~ pattern
|
|
126
|
+
|
|
127
|
+
box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
128
|
+
return nil unless box_values.length == 4
|
|
129
|
+
|
|
130
|
+
llx, lly, urx, ury = box_values
|
|
131
|
+
{ llx: llx, lly: lly, urx: urx, ury: ury }
|
|
132
|
+
end
|
|
133
|
+
```
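To show how the helper could replace the repeated blocks, here is a short usage sketch. `parse_box` follows the suggestion above; `BOX_TYPES` and `parse_boxes` are hypothetical names introduced only for this example.

```ruby
BOX_TYPES = %w[MediaBox CropBox ArtBox BleedBox TrimBox]

# Same logic as the suggested helper: nil unless the box has exactly four numbers.
def parse_box(body, box_type)
  match = body.match(%r{/#{box_type}\s*\[(.*?)\]})
  return nil unless match

  box_values = match[1].scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless box_values.length == 4

  llx, lly, urx, ury = box_values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

# Hypothetical wrapper: collect every box that is present on the page.
def parse_boxes(body)
  BOX_TYPES.each_with_object({}) do |type, boxes|
    box = parse_box(body, type)
    boxes[type] = box if box
  end
end

page = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] >>"
boxes = parse_boxes(page)
```

Box types missing from the page body are simply absent from the resulting Hash.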
|
|
134
|
+
|
|
135
|
+
### Benefits
|
|
136
|
+
- Reduces ~45 lines of duplicated parsing code to a single ~10-line helper shared by all box types
|
|
137
|
+
- Easier to add new box types
|
|
138
|
+
- Consistent parsing logic
|
|
139
|
+
|
|
140
|
+
---
|
|
141
|
+
|
|
142
|
+
## 5. Duplicated next_fresh_object_number Implementation
|
|
143
|
+
|
|
144
|
+
### Issue
|
|
145
|
+
The `next_fresh_object_number` method is implemented identically in two places.
|
|
146
|
+
|
|
147
|
+
### Locations
|
|
148
|
+
- `Document#next_fresh_object_number` (lines 730-739)
|
|
149
|
+
- `Base#next_fresh_object_number` (lines 28-37)
|
|
150
|
+
|
|
151
|
+
### Pattern
|
|
152
|
+
Both methods have identical implementation:
|
|
153
|
+
```ruby
|
|
154
|
+
def next_fresh_object_number
|
|
155
|
+
max_obj_num = 0
|
|
156
|
+
resolver.each_object do |ref, _|
|
|
157
|
+
max_obj_num = [max_obj_num, ref[0]].max
|
|
158
|
+
end
|
|
159
|
+
patches.each do |p|
|
|
160
|
+
max_obj_num = [max_obj_num, p[:ref][0]].max
|
|
161
|
+
end
|
|
162
|
+
max_obj_num + 1
|
|
163
|
+
end
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
### Suggested Refactor
|
|
167
|
+
- Remove `Document#next_fresh_object_number` - it's only called within `Document` but could use `Base`'s implementation
|
|
168
|
+
- Or: Document already has access to resolver and patches, so remove duplication by making Document use Base's method
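One possible way to share the method is a small mixin, sketched below. This is a different technique from simply calling `Base`'s method, and every name other than `next_fresh_object_number`, `resolver`, and `patches` is hypothetical (`ObjectNumbering`, `ToyResolver`, `ToyDocument`). The toy classes exist only so the example runs on its own.

```ruby
# Hypothetical mixin: any class exposing `resolver` and `patches` can include it.
module ObjectNumbering
  def next_fresh_object_number
    nums = []
    resolver.each_object { |ref, _body| nums << ref[0] }
    patches.each { |patch| nums << patch[:ref][0] }
    (nums.max || 0) + 1
  end
end

# Toy stand-in for the gem's resolver: yields [obj_num, gen_num] and a body.
ToyResolver = Struct.new(:objects) do
  def each_object(&block)
    objects.each(&block)
  end
end

class ToyDocument
  include ObjectNumbering
  attr_reader :resolver, :patches

  def initialize(resolver, patches)
    @resolver = resolver
    @patches = patches
  end
end

resolver = ToyResolver.new({ [1, 0] => "<< >>", [4, 0] => "<< >>" })
doc = ToyDocument.new(resolver, [{ ref: [7, 0], body: "<< >>" }])
fresh = doc.next_fresh_object_number
```

The highest object number across existing objects and pending patches is 7 in this example, so the next fresh number is 8.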
|
|
169
|
+
|
|
170
|
+
### Benefits
|
|
171
|
+
- Single implementation
|
|
172
|
+
- Consistent object numbering logic
|
|
173
|
+
|
|
174
|
+
---
|
|
175
|
+
|
|
176
|
+
## 6. Unused Methods
|
|
177
|
+
|
|
178
|
+
### Issue
|
|
179
|
+
Some methods are defined but never called.
|
|
180
|
+
|
|
181
|
+
### Locations
|
|
182
|
+
- `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
|
|
183
|
+
- Defined but never used
|
|
184
|
+
- `extract_rect` is used instead, which provides the same information
|
|
185
|
+
|
|
186
|
+
### Suggested Refactor
|
|
187
|
+
- Remove `get_widget_rect_dimensions` if it's truly unused
|
|
188
|
+
- Or: Verify if it was intended for future use and document it
|
|
189
|
+
|
|
190
|
+
### Benefits
|
|
191
|
+
- Cleaner codebase
|
|
192
|
+
- Less confusion about which method to use
|
|
193
|
+
|
|
194
|
+
---
|
|
195
|
+
|
|
196
|
+
## 7. Duplicated Base64 Decoding Logic
|
|
197
|
+
|
|
198
|
+
### Issue
|
|
199
|
+
`AddSignatureAppearance` has two similar methods for decoding base64 data.
|
|
200
|
+
|
|
201
|
+
### Locations
|
|
202
|
+
- `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
|
|
203
|
+
- `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
|
|
204
|
+
|
|
205
|
+
### Pattern
|
|
206
|
+
Both methods handle base64 decoding, with slightly different logic. Could potentially be unified.
|
|
207
|
+
|
|
208
|
+
### Suggested Refactor
|
|
209
|
+
- Consider merging into a single method that handles both cases
|
|
210
|
+
- Or: Document the distinction if both are needed
|
|
211
|
+
|
|
212
|
+
### Benefits
|
|
213
|
+
- Simpler API
|
|
214
|
+
- Less code duplication
|
|
215
|
+
|
|
216
|
+
---
|
|
217
|
+
|
|
218
|
+
## 8. Duplicated Regex Pattern for Object Reference
|
|
219
|
+
|
|
220
|
+
### Issue
|
|
221
|
+
The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
|
|
222
|
+
|
|
223
|
+
### Locations
|
|
224
|
+
Throughout the codebase, used in:
|
|
225
|
+
- Extracting `/Parent` references
|
|
226
|
+
- Extracting `/P` (page) references
|
|
227
|
+
- Extracting `/Pages` references
|
|
228
|
+
- Extracting `/Fields` array references
|
|
229
|
+
- And many more...
|
|
230
|
+
|
|
231
|
+
### Suggested Refactor
|
|
232
|
+
Create a utility method:
|
|
233
|
+
```ruby
|
|
234
|
+
def DictScan.extract_object_ref(str)
|
|
235
|
+
# Extract object reference from string
|
|
236
|
+
# Returns [obj_num, gen_num] or nil
|
|
237
|
+
end
|
|
238
|
+
```
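The stub above could be filled in roughly as follows. This is a sketch under assumptions: it lives in a hypothetical `RefScan` module rather than the real `DictScan`, and it returns only the first reference found in the string.

```ruby
# Hypothetical stand-in for the proposed DictScan.extract_object_ref.
module RefScan
  OBJECT_REF = /(\d+)\s+(\d+)\s+R\b/

  # Returns [obj_num, gen_num] for the first indirect reference, or nil.
  def self.extract_object_ref(str)
    m = OBJECT_REF.match(str)
    m && [m[1].to_i, m[2].to_i]
  end
end

parent = RefScan.extract_object_ref("/Parent 12 0 R")
nothing = RefScan.extract_object_ref("/T (Name)")
```

Callers would typically pass the substring following a key such as `/Parent` or `/P`, so that the first reference in the string is the one they want.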
|
|
239
|
+
|
|
240
|
+
### Benefits
|
|
241
|
+
- Consistent reference extraction
|
|
242
|
+
- Easier to update if PDF reference format changes
|
|
243
|
+
- More readable code
|
|
244
|
+
|
|
245
|
+
---
|
|
246
|
+
|
|
247
|
+
## Priority Recommendations
|
|
248
|
+
|
|
249
|
+
### High Priority
|
|
250
|
+
1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
|
|
251
|
+
2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
|
|
252
|
+
|
|
253
|
+
### Medium Priority
|
|
254
|
+
3. **Page-Finding Logic (#1)** - Used in multiple places, but less frequently
|
|
255
|
+
4. **Box Parsing Logic (#4)** - Simple duplication, easy to refactor
|
|
256
|
+
|
|
257
|
+
### Low Priority
|
|
258
|
+
5. **next_fresh_object_number (#5)** - Simple duplication
|
|
259
|
+
6. **Object Reference Extraction (#8)** - Could improve consistency
|
|
260
|
+
7. **Unused Methods (#6)** - Cleanup task
|
|
261
|
+
8. **Base64 Decoding (#7)** - Minor duplication
|
|
262
|
+
|
|
263
|
+
---
|
|
264
|
+
|
|
265
|
+
## Notes
|
|
266
|
+
- All refactoring should be accompanied by tests to ensure behavior doesn't change
|
|
267
|
+
- Consider backward compatibility if any methods are moved between modules
|
|
268
|
+
- Some duplication may be intentional for performance reasons (avoid method call overhead) - evaluate before refactoring
|
|
269
|
+
|