acro_that 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/Gemfile.lock +1 -1
- data/README.md +49 -0
- data/docs/README.md +12 -0
- data/docs/clear_fields.md +202 -0
- data/issues/README.md +38 -0
- data/issues/refactoring-opportunities.md +269 -0
- data/lib/acro_that/actions/add_field.rb +6 -6
- data/lib/acro_that/actions/add_signature_appearance.rb +3 -3
- data/lib/acro_that/document.rb +467 -41
- data/lib/acro_that/version.rb +1 -1
- data/lib/acro_that.rb +1 -0
- metadata +4 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: aa3b37611a5ca19d2cb0fa1c86d3821fb139b6f65ce9d3cd314d913b6e9e5db5
+  data.tar.gz: a4d07aaa7f268eb151b324b03e9575f563880b2cf97fd0172edc0e96c35e6eef
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a545eea311d77e7b46459e710925f8ec7b61baebd846ec7f602f6a77ab9532bab87d37be4bba53b00b43a09281e79faba79fcb71e5371c36953e877d556dcdd8
+  data.tar.gz: b9074d19a0cc330af44f4fd5609f49f8892e20deb95d522fa7ca5fb3099ca8c71f01cbc98ef7d3e6ed95620af090afc96035fbeaaf4a551ca6f9e4003c49f4a8
|
data/Gemfile.lock
CHANGED
data/README.md
CHANGED
@@ -189,6 +189,31 @@ flattened_doc = AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
 flattened_bytes = AcroThat::Document.flatten_pdf("input.pdf")
 ```
 
+#### Clearing Fields
+
+The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
+
+```ruby
+doc = AcroThat::Document.new("form.pdf")
+
+# Remove all fields matching a pattern
+doc.clear!(remove_pattern: /^text-/)
+
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+
+# Remove specific fields
+doc.clear!(remove_fields: ["OldField1", "OldField2"])
+
+# Use a block to determine which fields to keep
+doc.clear! { |name| !name.start_with?("temp_") }
+
+# Write the cleared PDF
+doc.write("cleared.pdf", flatten: true)
+```
+
+**Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
+
 ### API Reference
 
 #### `AcroThat::Document.new(path_or_io)`
@@ -282,6 +307,30 @@ AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
 flattened_doc = AcroThat::Document.flatten_pdf("input.pdf")
 ```
 
+#### `#clear(options = {})` and `#clear!(options = {})`
+Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
+
+- `keep_fields`: Array of field names to keep (all others removed)
+- `remove_fields`: Array of field names to remove
+- `remove_pattern`: Regex pattern - fields matching this are removed
+- Block: Given field name, return `true` to keep, `false` to remove
+
+```ruby
+# Remove all fields
+cleared = doc.clear(remove_pattern: /.*/)
+
+# Remove fields matching pattern (in-place)
+doc.clear!(remove_pattern: /^text-/)
+
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+
+# Use block to filter fields
+doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
+```
+
+**Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
+
 ### Field Object
 
 Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
|
data/docs/README.md
CHANGED
@@ -38,6 +38,17 @@ Explains how PDF object streams work and how `AcroThat` parses them:
 
 **Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
 
+### [Clearing Fields](./cleaning_fields.md)
+
+Documentation for the `clear` and `clear!` methods:
+- How to remove unwanted fields completely
+- Difference between `clear` and `remove_field`
+- Pattern matching and field selection
+- Removing orphaned widget references
+- Best practices for clearing PDFs
+
+**Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
+
 ## Common Themes
 
 Throughout all documentation, you'll see these recurring themes:
@@ -54,6 +65,7 @@ Throughout all documentation, you'll see these recurring themes:
 1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
 2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
 3. Read [Object Streams](./object_streams.md) to understand compression features
+4. Read [Clearing Fields](./cleaning_fields.md) to learn how to remove unwanted fields
 
 **If you're debugging:**
 - [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
|
data/docs/clear_fields.md
ADDED
@@ -0,0 +1,202 @@
+# Clearing Fields with `clear` and `clear!`
+
+The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
+
+- Remove multiple layers of added fields
+- Clear a PDF that has accumulated many unwanted fields
+- Get back to a base file without certain fields
+- Remove orphaned or invalid field references
+
+Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
+
+- Field objects are completely removed (not just marked as deleted)
+- Widget annotations are removed from page `/Annots` arrays
+- Orphaned widget references are cleaned up
+- The AcroForm `/Fields` array is updated
+- All references to removed fields are eliminated
+
+## Methods
+
+### `clear(options = {})`
+
+Returns a new PDF with unwanted fields removed. Does not modify the current document.
+
+**Options:**
+- `keep_fields`: Array of field names to keep (all others removed)
+- `remove_fields`: Array of field names to remove
+- `remove_pattern`: Regex pattern - fields matching this are removed
+- Block: Given field name, return `true` to keep, `false` to remove
+
+### `clear!(options = {})`
+
+Same as `clear`, but modifies the current document in-place. Mutates the document instance.
+
+## Usage Examples
+
+### Remove All Fields
+
+```ruby
+doc = AcroThat::Document.new("form.pdf")
+
+# Remove all fields
+cleared_pdf = doc.clear(remove_pattern: /.*/)
+
+# Or in-place
+doc.clear!(remove_pattern: /.*/)
+```
+
+### Remove Fields Matching a Pattern
+
+```ruby
+# Remove all fields starting with "text-"
+doc.clear!(remove_pattern: /^text-/)
+
+# Remove UUID-like generated fields
+doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
+```
+
+### Keep Only Specific Fields
+
+```ruby
+# Keep only these fields, remove all others
+doc.clear!(keep_fields: ["Name", "Email", "Phone"])
+
+# Write the cleared PDF
+doc.write("cleared.pdf", flatten: true)
+```
+
+### Remove Specific Fields
+
+```ruby
+# Remove specific unwanted fields
+doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
+```
+
+### Complex Selection with Block
+
+```ruby
+# Remove all fields except those matching certain criteria
+doc.clear! do |field_name|
+  # Keep fields that don't look generated
+  !field_name.start_with?("text-") &&
+    !field_name.match?(/^[a-f0-9]{20,}/)
+end
+```
+
+## How It Works
+
+The `clear` method:
+
+1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
+
+2. **Finds related widgets** for each field to be removed:
+   - Widgets that reference the field via `/Parent`
+   - Widgets that have the same name via `/T`
+
+3. **Collects objects to write**, excluding:
+   - Field objects that should be removed
+   - Widget annotation objects that should be removed
+
+4. **Updates AcroForm structure**:
+   - Removes field references from the `/Fields` array
+   - Handles both inline and indirect array references
+
+5. **Clears page annotations**:
+   - Removes widget references from page `/Annots` arrays
+   - Removes orphaned widget references (widgets pointing to non-existent fields)
+   - Removes references to widgets that don't exist in the cleared PDF
+
+6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
+
+## Key Differences from `remove_field`
+
+| Feature | `remove_field` | `clear` |
+|---------|---------------|---------|
+| Update Type | Incremental update | Complete rewrite |
+| Object Removal | Marks as deleted | Completely excluded |
+| PDF Structure | Preserves all objects | Only includes selected objects |
+| Use Case | Remove one/a few fields | Remove many fields or clean up |
+| Performance | Fast (append only) | Slower (full rewrite) |
+
+## Best Practices
+
+1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, `clear` is more efficient and produces cleaner output.
+
+2. **Always flatten after clearing**: Since `clear` rewrites the PDF, consider using `write(..., flatten: true)` to ensure compatibility with all PDF viewers:
+
+   ```ruby
+   doc.clear!(remove_pattern: /^text-/)
+   doc.write("output.pdf", flatten: true)
+   ```
+
+3. **Combine with field addition**: After clearing, you can add new fields:
+
+   ```ruby
+   doc.clear!(remove_pattern: /.*/)
+   doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
+   doc.write("output.pdf", flatten: true)
+   ```
+
+4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
+
+   ```ruby
+   # Remove all UUID-like fields
+   doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
+
+   # Remove all fields containing "temp" or "test"
+   doc.clear!(remove_pattern: /temp|test/i)
+   ```
+
+## Technical Details
+
+### Orphaned Widget Removal
+
+The `clear` method automatically identifies and removes orphaned widget references:
+
+- **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
+- **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
+
+This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
+
+### Page Detection
+
+The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
+
+### AcroForm Structure
+
+The method properly handles both:
+- **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
+- **Indirect `/Fields` arrays**: Arrays referenced as separate objects
+
+Both are updated to remove references to deleted fields.
+
+## Example: Complete Clearing Workflow
+
+```ruby
+require 'acro_that'
+
+# Load PDF with many unwanted fields
+doc = AcroThat::Document.new("messy_form.pdf")
+
+# Remove all generated/UUID-like fields
+doc.clear! { |name|
+  # Keep only fields that look intentional
+  !name.match?(/^[a-f0-9-]{30,}/) && # Not UUID-like
+  !name.start_with?("temp_") && # Not temporary
+  !name.empty? # Not empty
+}
+
+# Add new fields
+doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
+doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
+
+# Write cleared and updated PDF
+doc.write("cleared_form.pdf", flatten: true)
+```
+
+## See Also
+
+- [`flatten` and `flatten!`](./README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
+- [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
+- [Main README](../README.md) - General usage and API reference
+
|
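Editorial sketch for `docs/clear_fields.md`: the four selection options it documents (`keep_fields`, `remove_fields`, `remove_pattern`, and a block) can be pictured as one keep-or-remove predicate per field name. This is a minimal illustration, not the gem's code. `keep_field?` is a hypothetical helper, and the precedence used here (block first, then `keep_fields`, then the two removal options) is an assumption, because the documentation in this diff does not say how combined options interact.

```ruby
# Hypothetical predicate: returns true when a field name should be kept.
# The option precedence below is an assumption made for illustration only.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  # A block decides on its own: truthy keeps the field, falsy removes it.
  return !!block.call(name) if block
  # With keep_fields, anything not listed is removed.
  return keep_fields.include?(name) if keep_fields
  # Otherwise remove explicit names and pattern matches, keep the rest.
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)

  true
end

names = ["Name", "text-1", "temp_x"]

names.select { |n| keep_field?(n, remove_pattern: /^text-/) }       # => ["Name", "temp_x"]
names.select { |n| keep_field?(n, keep_fields: ["Name"]) }          # => ["Name"]
names.select { |n| keep_field?(n) { |f| !f.start_with?("temp_") } } # => ["Name", "text-1"]
```

Expressing the options as a single predicate keeps the later steps (widget lookup, `/Fields` and `/Annots` cleanup) independent of how the caller chose the fields.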
data/issues/README.md
ADDED
@@ -0,0 +1,38 @@
+# Code Review Issues
+
+This folder contains documentation of code cleanup and refactoring opportunities found in the codebase.
+
+## Files
+
+- **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
+
+## Summary
+
+### High Priority Issues
+1. **Widget Matching Logic** - Duplicated across 6+ locations
+2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations
+
+### Medium Priority Issues
+3. **Page-Finding Logic** - Similar logic in 4+ methods
+4. **Box Parsing Logic** - Repeated code blocks for 5 box types
+
+### Low Priority Issues
+5. Duplicated `next_fresh_object_number` implementation
+6. Object reference extraction pattern duplication
+7. Unused method: `get_widget_rect_dimensions`
+8. Base64 decoding logic duplication
+
+## Quick Stats
+
+- **8 refactoring opportunities** identified
+- **6+ locations** with widget matching duplication
+- **3 locations** with /Annots array manipulation duplication
+- **1 unused method** found
+
+## Next Steps
+
+1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
+2. Prioritize refactoring based on maintenance needs
+3. Create test coverage before refactoring
+4. Refactor incrementally, starting with high-priority items
+
|
data/issues/refactoring-opportunities.md
ADDED
@@ -0,0 +1,269 @@
+# Refactoring Opportunities
+
+This document identifies code duplication and unused methods that could be refactored to improve maintainability.
+
+## 1. Duplicated Page-Finding Logic
+
+### Issue
+Multiple methods have similar logic for finding page objects in a PDF document.
+
+### Locations
+- `Document#list_pages` (lines 75-104)
+- `Document#collect_pages_from_tree` (lines 691-712)
+- `Document#find_page_number_for_ref` (lines 714-728)
+- `AddField#find_page_ref` (lines 155-211)
+
+### Pattern
+The pattern `body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}` appears in multiple places with slight variations.
+
+### Suggested Refactor
+Create a shared module or utility methods in `DictScan`:
+- `DictScan.is_page?(body)` - Check if a body represents a page object
+- `Document#find_all_pages` - Unified method to find all page objects
+- `Document#find_page_by_number(page_num)` - Find a specific page by number
+
+### Benefits
+- Single source of truth for page detection logic
+- Easier to maintain and update page-finding behavior
+- Consistent page ordering across methods
+
+---
+
+## 2. Duplicated Widget-Matching Logic
+
+### Issue
+Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
+1. `/Parent` reference pointing to the field
+2. `/T` (field name) matching the field name
+
+### Locations
+- `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
+- `Document#clear` (lines 472-495) - Finds widgets for removed fields
+- `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
+- `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
+- `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
+- `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
+
+### Pattern
+The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
+
+### Suggested Refactor
+Create utility methods in `Base` or a new `WidgetMatcher` module:
+- `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
+- `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
+- `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
+
+### Benefits
+- Centralized widget matching logic
+- Consistent widget finding behavior
+- Easier to extend matching criteria
+
+---
+
+## 3. Duplicated /Annots Array Manipulation
+
+### Issue
+Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
+1. Inline `/Annots` arrays: `/Annots [...]`
+2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
+
+### Locations
+- `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
+- `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
+- `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
+
+### Pattern
+All three methods have similar conditional logic:
+```ruby
+if page_body =~ %r{/Annots\s*\[(.*?)\]}m
+  # Handle inline array
+elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+  # Handle indirect array
+else
+  # Create new /Annots array
+end
+```
+
+### Suggested Refactor
+Extend `DictScan` with methods:
+- `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
+- `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
+- `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
+
+### Benefits
+- Single implementation of /Annots manipulation logic
+- Consistent handling of edge cases
+- Easier to test /Annots operations
+
+---
+
+## 4. Duplicated Box Parsing Logic
+
+### Issue
+`Document#list_pages` has repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
+
+### Locations
+- `Document#list_pages` (lines 120-165)
+
+### Pattern
+Each box type uses identical logic:
+```ruby
+if body =~ %r{/MediaBox\s*\[(.*?)\]}
+  box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+  if box_values.length == 4
+    llx, lly, urx, ury = box_values
+    media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+  end
+end
+```
+
+### Suggested Refactor
+Create a helper method:
+```ruby
+def parse_box(body, box_type)
+  pattern = %r{/#{box_type}\s*\[(.*?)\]}
+  return nil unless body =~ pattern
+
+  box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+  return nil unless box_values.length == 4
+
+  llx, lly, urx, ury = box_values
+  { llx: llx, lly: lly, urx: urx, ury: ury }
+end
+```
+
+### Benefits
+- Reduces code duplication from ~45 lines to ~10 lines per box type
+- Easier to add new box types
+- Consistent parsing logic
+
+---
+
+## 5. Duplicated next_fresh_object_number Implementation
+
+### Issue
+The `next_fresh_object_number` method is implemented identically in two places.
+
+### Locations
+- `Document#next_fresh_object_number` (lines 730-739)
+- `Base#next_fresh_object_number` (lines 28-37)
+
+### Pattern
+Both methods have identical implementation:
+```ruby
+def next_fresh_object_number
+  max_obj_num = 0
+  resolver.each_object do |ref, _|
+    max_obj_num = [max_obj_num, ref[0]].max
+  end
+  patches.each do |p|
+    max_obj_num = [max_obj_num, p[:ref][0]].max
+  end
+  max_obj_num + 1
+end
+```
+
+### Suggested Refactor
+- Remove `Document#next_fresh_object_number` - it's only called within `Document` but could use `Base`'s implementation
+- Or: Document already has access to resolver and patches, so remove duplication by making Document use Base's method
+
+### Benefits
+- Single implementation
+- Consistent object numbering logic
+
+---
+
+## 6. Unused Methods
+
+### Issue
+Some methods are defined but never called.
+
+### Locations
+- `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
+  - Defined but never used
+  - `extract_rect` is used instead, which provides the same information
+
+### Suggested Refactor
+- Remove `get_widget_rect_dimensions` if it's truly unused
+- Or: Verify if it was intended for future use and document it
+
+### Benefits
+- Cleaner codebase
+- Less confusion about which method to use
+
+---
+
+## 7. Duplicated Base64 Decoding Logic
+
+### Issue
+`AddSignatureAppearance` has two similar methods for decoding base64 data.
+
+### Locations
+- `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
+- `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
+
+### Pattern
+Both methods handle base64 decoding, with slightly different logic. Could potentially be unified.
+
+### Suggested Refactor
+- Consider merging into a single method that handles both cases
+- Or: Document the distinction if both are needed
+
+### Benefits
+- Simpler API
+- Less code duplication
+
+---
+
+## 8. Duplicated Regex Pattern for Object Reference
+
+### Issue
+The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
+
+### Locations
+Throughout the codebase, used in:
+- Extracting `/Parent` references
+- Extracting `/P` (page) references
+- Extracting `/Pages` references
+- Extracting `/Fields` array references
+- And many more...
+
+### Suggested Refactor
+Create a utility method:
+```ruby
+def DictScan.extract_object_ref(str)
+  # Extract object reference from string
+  # Returns [obj_num, gen_num] or nil
+end
+```
+
+### Benefits
+- Consistent reference extraction
+- Easier to update if PDF reference format changes
+- More readable code
+
+---
+
+## Priority Recommendations
+
+### High Priority
+1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
+2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
+
+### Medium Priority
+3. **Page-Finding Logic (#1)** - Used in multiple places, but less frequently
+4. **Box Parsing Logic (#4)** - Simple duplication, easy to refactor
+
+### Low Priority
+5. **next_fresh_object_number (#5)** - Simple duplication
+6. **Object Reference Extraction (#8)** - Could improve consistency
+7. **Unused Methods (#6)** - Cleanup task
+8. **Base64 Decoding (#7)** - Minor duplication
+
+---
+
+## Notes
+- All refactoring should be accompanied by tests to ensure behavior doesn't change
+- Consider backward compatibility if any methods are moved between modules
+- Some duplication may be intentional for performance reasons (avoid method call overhead) - evaluate before refactoring
+
|
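Editorial sketch for `issues/refactoring-opportunities.md`: items 3 and 8 propose shared helpers for `/Annots` manipulation and object-reference extraction but leave the bodies open. The following shows one way those two helpers might look. It is an assumption-laden illustration, not code from the gem: the module name `RefSketch` is hypothetical, the method names are borrowed from the suggestions in the document, and only the inline `/Annots [...]` form is handled (a page whose `/Annots` is an indirect reference is returned unchanged).

```ruby
# Hypothetical helper module; not part of acro_that.
module RefSketch
  # An indirect object reference such as "12 0 R".
  OBJECT_REF = /(\d+)\s+(\d+)\s+R\b/

  # Returns [obj_num, gen_num] for the first reference found in str, or nil.
  def self.extract_object_ref(str)
    m = OBJECT_REF.match(str)
    m && [m[1].to_i, m[2].to_i]
  end

  # Returns a copy of page_body with widget_ref removed from an inline
  # /Annots array. Bodies without an inline array come back unchanged.
  def self.remove_from_annots_array(page_body, widget_ref)
    page_body.sub(%r{(/Annots\s*\[)(.*?)(\])}m) do
      head, items, tail = Regexp.last_match.captures
      kept = items.scan(OBJECT_REF).reject { |num, gen| [num.to_i, gen.to_i] == widget_ref }
      "#{head}#{kept.map { |num, gen| "#{num} #{gen} R" }.join(' ')}#{tail}"
    end
  end
end

page = "<< /Type /Page /Annots [10 0 R 12 0 R 14 0 R] >>"

RefSketch.extract_object_ref("/Parent 7 0 R")       # => [7, 0]
RefSketch.remove_from_annots_array(page, [12, 0])   # => "<< /Type /Page /Annots [10 0 R 14 0 R] >>"
```

Comparing references as `[obj_num, gen_num]` integer pairs rather than as strings avoids false matches between, for example, `2 0 R` and `12 0 R`.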
|
data/lib/acro_that/actions/add_field.rb
CHANGED
@@ -157,10 +157,10 @@ module AcroThat
         resolver.each_object do |ref, body|
           next unless body
 
-          # Check for /Type /Page
+          # Check for /Type /Page (actual page, not /Type/Pages)
+          # Must match /Type /Page or /Type/Page but NOT /Type/Pages
           is_page = body.include?("/Type /Page") ||
-                    body
-                    (body.include?("/Type") && body.include?("/Page") && body =~ %r{/Type\s*/Page})
+                    (body =~ %r{/Type\s*/Page(?!s)\b})
           next unless is_page
 
           page_objects << ref
@@ -183,8 +183,8 @@ module AcroThat
           kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
             kid_ref = [num_str.to_i, gen_str.to_i]
             kid_body = resolver.object_body(kid_ref)
-            # Check if this kid is a page
-            if kid_body && (kid_body.include?("/Type /Page") || kid_body
+            # Check if this kid is a page (not /Type/Pages)
+            if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
               page_objects << kid_ref
             elsif kid_body && kid_body.include?("/Type /Pages")
               # Recursively find pages in this Pages node
@@ -192,7 +192,7 @@ module AcroThat
                 kid_body[::Regexp.last_match(0)..].scan(/(\d+)\s+(\d+)\s+R/) do |n, g|
                   grandkid_ref = [n.to_i, g.to_i]
                   grandkid_body = resolver.object_body(grandkid_ref)
-                  if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body
+                  if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body =~ %r{/Type\s*/Page(?!s)\b})
                     page_objects << grandkid_ref
                   end
                 end
|
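Editorial sketch for the `add_field.rb` hunks: the change swaps a loose `/Type\s*/Page` match for `%r{/Type\s*/Page(?!s)\b}`. The snippet below exercises that regex, copied from the diff, on two made-up object bodies to show what the negative lookahead buys. `page_object?` is a hypothetical name used only for this illustration, and the sample dictionaries are invented.

```ruby
# Regex taken from the diff: /Type followed by /Page, but not /Pages.
PAGE_TYPE = %r{/Type\s*/Page(?!s)\b}

# Hypothetical helper: true when body looks like a page object.
def page_object?(body)
  body.match?(PAGE_TYPE)
end

page  = "<< /Type /Page /Parent 2 0 R >>"            # a leaf page (invented sample)
pages = "<< /Type /Pages /Kids [3 0 R] /Count 1 >>"  # a page tree node (invented sample)

page_object?(page)              # => true
page_object?("<</Type/Page>>")  # => true, no whitespace needed
page_object?(pages)             # => false, the lookahead rejects "Pages"

# A plain substring test cannot make that distinction:
pages.include?("/Type /Page")   # => true
```

Since `"/Type /Pages"` contains `"/Type /Page"`, an `include?` check evaluated before the regex in an `||` expression still accepts a `/Pages` node written with a space. Whether that matters depends on the surrounding code, which this diff shows only in part.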
|
data/lib/acro_that/actions/add_signature_appearance.rb
CHANGED
@@ -336,10 +336,10 @@ module AcroThat
       end
 
       def create_form_xobject(_obj_num, image_obj_num, field_width, field_height, _scale_factor, scaled_width,
-
+                              scaled_height)
         # Calculate offset to left-align the image horizontally and center vertically
-        offset_x = 0.0
-        offset_y = (field_height - scaled_height) / 2.0
+        offset_x = 0.0 # Left-aligned (no horizontal offset)
+        offset_y = (field_height - scaled_height) / 2.0 # Center vertically
 
         # PDF content stream that draws the image
         # q = save graphics state
|
data/lib/acro_that/document.rb
CHANGED
@@ -71,62 +71,213 @@ module AcroThat
       self
     end
 
-    # Return an array of
-    def
-
-
-      widgets_by_name = {}
+    # Return an array of page information (page number, width, height, ref, metadata)
+    def list_pages
+      pages = []
+      page_objects = []
 
-      #
-      @resolver.
-
+      # Try to get pages in document order via page tree first
+      root_ref = @resolver.root_ref
+      if root_ref
+        catalog_body = @resolver.object_body(root_ref)
+        if catalog_body && catalog_body =~ %r{/Pages\s+(\d+)\s+(\d+)\s+R}
+          pages_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
 
-
-
-
+          # Recursively collect pages from page tree
+          collect_pages_from_tree(pages_ref, page_objects)
+        end
+      end
 
-
-
-
+      # Fallback: collect all page objects if page tree didn't work
+      if page_objects.empty?
+        @resolver.each_object do |ref, body|
+          next unless body
 
-
-
-
+          # Match /Type /Page or /Type/Page but NOT /Type/Pages
+          is_page = body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
+          next unless is_page
 
-
-          if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
-            page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
-            page_num = find_page_number_for_ref(page_ref)
+          page_objects << ref unless page_objects.include?(ref)
         end
 
-
-
-
+        # Sort by object number as fallback
+        page_objects.sort_by! { |ref| ref[0] }
+      end
 
-
-
+      # Second pass: extract information from each page
+      page_objects.each_with_index do |ref, index|
+        body = @resolver.object_body(ref)
+        next unless body
+
+        # Extract MediaBox, CropBox, or ArtBox for dimensions
+        width = nil
+        height = nil
+        media_box = nil
+        crop_box = nil
+        art_box = nil
+        bleed_box = nil
+        trim_box = nil
+
+        # Try MediaBox first (most common)
+        if body =~ %r{/MediaBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            width = urx - llx
+            height = ury - lly
+            media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
+        end
 
-
-
+        # Try CropBox
+        if body =~ %r{/CropBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
|
|
137
|
+
crop_box = { llx: llx, lly: lly, urx: urx, ury: ury }
|
|
138
|
+
end
|
|
111
139
|
end
|
|
112
140
|
|
|
113
|
-
|
|
141
|
+
# Try ArtBox
|
|
142
|
+
if body =~ %r{/ArtBox\s*\[(.*?)\]}
|
|
143
|
+
box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
144
|
+
if box_values.length == 4
|
|
145
|
+
llx, lly, urx, ury = box_values
|
|
146
|
+
art_box = { llx: llx, lly: lly, urx: urx, ury: ury }
|
|
147
|
+
end
|
|
148
|
+
end
|
|
114
149
|
|
|
115
|
-
|
|
116
|
-
if
|
|
117
|
-
|
|
118
|
-
if
|
|
119
|
-
|
|
120
|
-
|
|
150
|
+
# Try BleedBox
|
|
151
|
+
if body =~ %r{/BleedBox\s*\[(.*?)\]}
|
|
152
|
+
box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
153
|
+
if box_values.length == 4
|
|
154
|
+
llx, lly, urx, ury = box_values
|
|
155
|
+
bleed_box = { llx: llx, lly: lly, urx: urx, ury: ury }
|
|
156
|
+
end
|
|
157
|
+
end
|
|
158
|
+
|
|
159
|
+
# Try TrimBox
|
|
160
|
+
if body =~ %r{/TrimBox\s*\[(.*?)\]}
|
|
161
|
+
box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
162
|
+
if box_values.length == 4
|
|
163
|
+
llx, lly, urx, ury = box_values
|
|
164
|
+
trim_box = { llx: llx, lly: lly, urx: urx, ury: ury }
|
|
165
|
+
end
|
|
166
|
+
end
|
|
167
|
+
|
|
168
|
+
# Extract rotation
|
|
169
|
+
rotate = nil
|
|
170
|
+
if body =~ %r{/Rotate\s+(\d+)}
|
|
171
|
+
rotate = Integer(::Regexp.last_match(1))
|
|
172
|
+
end
|
|
173
|
+
|
|
174
|
+
# Extract Resources reference
|
|
175
|
+
resources_ref = nil
|
|
176
|
+
if body =~ %r{/Resources\s+(\d+)\s+(\d+)\s+R}
|
|
177
|
+
resources_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
178
|
+
end
|
|
179
|
+
|
|
180
|
+
# Extract Parent reference
|
|
181
|
+
parent_ref = nil
|
|
182
|
+
if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
|
|
183
|
+
parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
184
|
+
end
|
|
185
|
+
|
|
186
|
+
# Extract Contents reference(s)
|
|
187
|
+
contents_refs = []
|
|
188
|
+
if body =~ %r{/Contents\s+(\d+)\s+(\d+)\s+R}
|
|
189
|
+
contents_refs << [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
190
|
+
elsif body =~ %r{/Contents\s*\[(.*?)\]}
|
|
191
|
+
contents_array = ::Regexp.last_match(1)
|
|
192
|
+
contents_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
|
|
193
|
+
contents_refs << [num_str.to_i, gen_str.to_i]
|
|
121
194
|
end
|
|
122
195
|
end
|
|
196
|
+
|
|
197
|
+
# Build metadata hash
|
|
198
|
+
metadata = {
|
|
199
|
+
rotate: rotate,
|
|
200
|
+
media_box: media_box,
|
|
201
|
+
crop_box: crop_box,
|
|
202
|
+
art_box: art_box,
|
|
203
|
+
bleed_box: bleed_box,
|
|
204
|
+
trim_box: trim_box,
|
|
205
|
+
resources_ref: resources_ref,
|
|
206
|
+
parent_ref: parent_ref,
|
|
207
|
+
contents_refs: contents_refs
|
|
208
|
+
}
|
|
209
|
+
|
|
210
|
+
pages << {
|
|
211
|
+
page: index + 1, # Page number starting at 1
|
|
212
|
+
width: width,
|
|
213
|
+
height: height,
|
|
214
|
+
ref: ref,
|
|
215
|
+
metadata: metadata
|
|
216
|
+
}
|
|
123
217
|
end
|
|
124
218
|
|
|
125
|
-
|
|
219
|
+
pages
|
|
220
|
+
end
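The five box lookups in `list_pages` above all follow one pattern: match `/<Name> [...]`, scan four numbers, and build a hash. As a reading aid, here is a minimal sketch of that pattern as a single helper. `parse_box` is a hypothetical name and is not part of acro_that; the page body string is made-up sample data.

```ruby
# Hypothetical helper (not part of acro_that): parses one page box from a
# raw dictionary body, using the same two regexes as list_pages in the diff.
def parse_box(body, name)
  match = body.match(%r{/#{Regexp.escape(name)}\s*\[(.*?)\]})
  return nil unless match

  values = match[1].scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless values.length == 4

  llx, lly, urx, ury = values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

# Made-up sample page dictionary.
page_body = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] /Rotate 90 >>"

boxes = %w[MediaBox CropBox ArtBox BleedBox TrimBox].to_h do |name|
  [name, parse_box(page_body, name)]
end

# As in the diff, only MediaBox feeds width and height.
media = boxes["MediaBox"]
width = media[:urx] - media[:llx]
height = media[:ury] - media[:lly]
```

Note that in the diff only `/MediaBox` sets `width` and `height`; the other boxes are returned under `metadata` and stay `nil` when absent.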
|
|
221
|
+
|
|
222
|
+
# Return an array of Field(name, value, type, ref)
|
|
223
|
+
def list_fields
|
|
224
|
+
fields = []
|
|
225
|
+
field_widgets = {}
|
|
226
|
+
widgets_by_name = {}
|
|
227
|
+
|
|
228
|
+
# First pass: collect widget information
|
|
126
229
|
@resolver.each_object do |ref, body|
|
|
127
|
-
next unless body
|
|
230
|
+
next unless body
|
|
231
|
+
|
|
232
|
+
is_widget = DictScan.is_widget?(body)
|
|
233
|
+
|
|
234
|
+
# Collect widget information if this is a widget
|
|
235
|
+
if is_widget
|
|
236
|
+
# Extract position from widget
|
|
237
|
+
rect_tok = DictScan.value_token_after("/Rect", body)
|
|
238
|
+
if rect_tok && rect_tok.start_with?("[")
|
|
239
|
+
# Parse [x y x+width y+height] format
|
|
240
|
+
rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
241
|
+
if rect_values.length == 4
|
|
242
|
+
x, y, x2, y2 = rect_values
|
|
243
|
+
width = x2 - x
|
|
244
|
+
height = y2 - y
|
|
245
|
+
|
|
246
|
+
page_num = nil
|
|
247
|
+
if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
|
|
248
|
+
page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
249
|
+
page_num = find_page_number_for_ref(page_ref)
|
|
250
|
+
end
|
|
251
|
+
|
|
252
|
+
widget_info = {
|
|
253
|
+
x: x, y: y, width: width, height: height, page: page_num
|
|
254
|
+
}
|
|
255
|
+
|
|
256
|
+
if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
|
|
257
|
+
parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
128
258
|
|
|
129
|
-
|
|
259
|
+
field_widgets[parent_ref] ||= []
|
|
260
|
+
field_widgets[parent_ref] << widget_info
|
|
261
|
+
end
|
|
262
|
+
|
|
263
|
+
if body.include?("/T")
|
|
264
|
+
t_tok = DictScan.value_token_after("/T", body)
|
|
265
|
+
if t_tok
|
|
266
|
+
widget_name = DictScan.decode_pdf_string(t_tok)
|
|
267
|
+
if widget_name && !widget_name.empty?
|
|
268
|
+
widgets_by_name[widget_name] ||= []
|
|
269
|
+
widgets_by_name[widget_name] << widget_info
|
|
270
|
+
end
|
|
271
|
+
end
|
|
272
|
+
end
|
|
273
|
+
end
|
|
274
|
+
end
|
|
275
|
+
end
|
|
276
|
+
|
|
277
|
+
# Second pass: collect all fields (both field objects and widget annotations with /T)
|
|
278
|
+
next unless body.include?("/T")
|
|
279
|
+
|
|
280
|
+
is_widget_field = is_widget
|
|
130
281
|
hint = body.include?("/FT") || is_widget_field || body.include?("/Kids") || body.include?("/Parent")
|
|
131
282
|
next unless hint
|
|
132
283
|
|
|
@@ -143,8 +294,7 @@ module AcroThat
|
|
|
143
294
|
type = ft_tok
|
|
144
295
|
|
|
145
296
|
position = {}
|
|
146
|
-
|
|
147
|
-
if is_widget_annot
|
|
297
|
+
if is_widget
|
|
148
298
|
rect_tok = DictScan.value_token_after("/Rect", body)
|
|
149
299
|
if rect_tok && rect_tok.start_with?("[")
|
|
150
300
|
rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
|
|
@@ -270,8 +420,261 @@ module AcroThat
|
|
|
270
420
|
field.remove
|
|
271
421
|
end
|
|
272
422
|
|
|
423
|
+
# Clean up the PDF by removing unwanted fields.
|
|
424
|
+
# Options:
|
|
425
|
+
# - keep_fields: Array of field names to keep (all others removed)
|
|
426
|
+
# - remove_fields: Array of field names to remove
|
|
427
|
+
# - remove_pattern: Regex pattern - fields matching this are removed
|
|
428
|
+
# - block: Given field name, return true to keep, false to remove
|
|
429
|
+
# This rewrites the entire PDF (like flatten) but excludes the unwanted fields.
|
|
430
|
+
def clear(keep_fields: nil, remove_fields: nil, remove_pattern: nil)
|
|
431
|
+
root_ref = @resolver.root_ref
|
|
432
|
+
raise "Cannot clear: no /Root found" unless root_ref
|
|
433
|
+
|
|
434
|
+
# Build a set of fields to remove
|
|
435
|
+
fields_to_remove = Set.new
|
|
436
|
+
|
|
437
|
+
# Get all current fields
|
|
438
|
+
all_fields = list_fields
|
|
439
|
+
|
|
440
|
+
if block_given?
|
|
441
|
+
# Use block to determine which fields to keep
|
|
442
|
+
all_fields.each do |field|
|
|
443
|
+
fields_to_remove.add(field.name) unless yield(field.name)
|
|
444
|
+
end
|
|
445
|
+
elsif keep_fields
|
|
446
|
+
# Keep only specified fields
|
|
447
|
+
keep_set = Set.new(keep_fields.map(&:to_s))
|
|
448
|
+
all_fields.each do |field|
|
|
449
|
+
fields_to_remove.add(field.name) unless keep_set.include?(field.name)
|
|
450
|
+
end
|
|
451
|
+
elsif remove_fields
|
|
452
|
+
# Remove specified fields
|
|
453
|
+
remove_set = Set.new(remove_fields.map(&:to_s))
|
|
454
|
+
all_fields.each do |field|
|
|
455
|
+
fields_to_remove.add(field.name) if remove_set.include?(field.name)
|
|
456
|
+
end
|
|
457
|
+
elsif remove_pattern
|
|
458
|
+
# Remove fields matching pattern
|
|
459
|
+
all_fields.each do |field|
|
|
460
|
+
fields_to_remove.add(field.name) if field.name =~ remove_pattern
|
|
461
|
+
end
|
|
462
|
+
else
|
|
463
|
+
# No criteria specified, return original
|
|
464
|
+
return @raw
|
|
465
|
+
end
|
|
466
|
+
|
|
467
|
+
# Build sets of refs to exclude
|
|
468
|
+
field_refs_to_remove = Set.new
|
|
469
|
+
widget_refs_to_remove = Set.new
|
|
470
|
+
|
|
471
|
+
all_fields.each do |field|
|
|
472
|
+
next unless fields_to_remove.include?(field.name)
|
|
473
|
+
|
|
474
|
+
field_refs_to_remove.add(field.ref) if field.valid_ref?
|
|
475
|
+
|
|
476
|
+
# Find all widget annotations for this field
|
|
477
|
+
@resolver.each_object do |widget_ref, body|
|
|
478
|
+
next unless body && DictScan.is_widget?(body)
|
|
479
|
+
next if widget_ref == field.ref
|
|
480
|
+
|
|
481
|
+
# Match by /Parent reference
|
|
482
|
+
if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
|
|
483
|
+
widget_parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
484
|
+
if widget_parent_ref == field.ref
|
|
485
|
+
widget_refs_to_remove.add(widget_ref)
|
|
486
|
+
next
|
|
487
|
+
end
|
|
488
|
+
end
|
|
489
|
+
|
|
490
|
+
# Also match by field name (/T)
|
|
491
|
+
next unless body.include?("/T")
|
|
492
|
+
|
|
493
|
+
t_tok = DictScan.value_token_after("/T", body)
|
|
494
|
+
next unless t_tok
|
|
495
|
+
|
|
496
|
+
widget_name = DictScan.decode_pdf_string(t_tok)
|
|
497
|
+
if widget_name && widget_name == field.name
|
|
498
|
+
widget_refs_to_remove.add(widget_ref)
|
|
499
|
+
end
|
|
500
|
+
end
|
|
501
|
+
end
|
|
502
|
+
|
|
503
|
+
# Collect objects to write (excluding removed fields and widgets)
|
|
504
|
+
objects = []
|
|
505
|
+
@resolver.each_object do |ref, body|
|
|
506
|
+
next if field_refs_to_remove.include?(ref)
|
|
507
|
+
next if widget_refs_to_remove.include?(ref)
|
|
508
|
+
next unless body
|
|
509
|
+
|
|
510
|
+
objects << { ref: ref, body: body }
|
|
511
|
+
end
|
|
512
|
+
|
|
513
|
+
# Process AcroForm to remove field references from /Fields array
|
|
514
|
+
af_ref = acroform_ref
|
|
515
|
+
if af_ref
|
|
516
|
+
# Find the AcroForm object in our objects list
|
|
517
|
+
af_obj = objects.find { |o| o[:ref] == af_ref }
|
|
518
|
+
if af_obj
|
|
519
|
+
af_body = af_obj[:body]
|
|
520
|
+
fields_array_ref = DictScan.value_token_after("/Fields", af_body)
|
|
521
|
+
|
|
522
|
+
if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
|
|
523
|
+
# /Fields points to separate array object
|
|
524
|
+
arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
525
|
+
arr_obj = objects.find { |o| o[:ref] == arr_ref }
|
|
526
|
+
if arr_obj
|
|
527
|
+
arr_body = arr_obj[:body]
|
|
528
|
+
field_refs_to_remove.each do |field_ref|
|
|
529
|
+
arr_body = DictScan.remove_ref_from_array(arr_body, field_ref)
|
|
530
|
+
end
|
|
531
|
+
# Clean up empty array
|
|
532
|
+
arr_body = arr_body.strip.gsub(/\[\s+\]/, "[]")
|
|
533
|
+
arr_obj[:body] = arr_body
|
|
534
|
+
end
|
|
535
|
+
elsif af_body.include?("/Fields")
|
|
536
|
+
# Inline /Fields array
|
|
537
|
+
field_refs_to_remove.each do |field_ref|
|
|
538
|
+
af_body = DictScan.remove_ref_from_inline_array(af_body, "/Fields", field_ref)
|
|
539
|
+
end
|
|
540
|
+
af_obj[:body] = af_body
|
|
541
|
+
end
|
|
542
|
+
end
|
|
543
|
+
end
|
|
544
|
+
|
|
545
|
+
# Process page objects to remove widget references from /Annots arrays
|
|
546
|
+
# Also remove any orphaned widget references (widgets that reference non-existent fields)
|
|
547
|
+
objects_in_file = Set.new(objects.map { |o| o[:ref] })
|
|
548
|
+
field_refs_in_file = Set.new
|
|
549
|
+
objects.each do |obj|
|
|
550
|
+
body = obj[:body]
|
|
551
|
+
# Check if this is a field object
|
|
552
|
+
if body&.include?("/FT") && body.include?("/T")
|
|
553
|
+
field_refs_in_file.add(obj[:ref])
|
|
554
|
+
end
|
|
555
|
+
|
|
556
|
+
body = obj[:body]
|
|
557
|
+
# Match /Type /Page or /Type/Page but NOT /Type/Pages
|
|
558
|
+
next unless body&.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
|
|
559
|
+
|
|
560
|
+
# Handle inline /Annots array
|
|
561
|
+
if body =~ %r{/Annots\s*\[(.*?)\]}
|
|
562
|
+
annots_array_str = ::Regexp.last_match(1)
|
|
563
|
+
|
|
564
|
+
# Remove widgets that match removed fields
|
|
565
|
+
widget_refs_to_remove.each do |widget_ref|
|
|
566
|
+
annots_array_str = annots_array_str.gsub(/\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/, "").strip
|
|
567
|
+
annots_array_str = annots_array_str.gsub(/\s+/, " ")
|
|
568
|
+
end
|
|
569
|
+
|
|
570
|
+
# Also remove orphaned widget references (widgets not in objects_in_file or pointing to non-existent fields)
|
|
571
|
+
annots_refs = annots_array_str.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
|
|
572
|
+
annots_refs.each do |annot_ref|
|
|
573
|
+
# Check if this annotation is a widget that should be removed
|
|
574
|
+
if objects_in_file.include?(annot_ref)
|
|
575
|
+
# Widget exists - check if it's an orphaned widget (references non-existent field)
|
|
576
|
+
widget_obj = objects.find { |o| o[:ref] == annot_ref }
|
|
577
|
+
if widget_obj && DictScan.is_widget?(widget_obj[:body])
|
|
578
|
+
widget_body = widget_obj[:body]
|
|
579
|
+
# Check if widget references a parent field that doesn't exist
|
|
580
|
+
if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
|
|
581
|
+
parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
582
|
+
unless field_refs_in_file.include?(parent_ref)
|
|
583
|
+
# Parent field doesn't exist - orphaned widget, remove it
|
|
584
|
+
annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
|
|
585
|
+
annots_array_str = annots_array_str.gsub(/\s+/, " ")
|
|
586
|
+
end
|
|
587
|
+
end
|
|
588
|
+
end
|
|
589
|
+
else
|
|
590
|
+
# Widget object doesn't exist - remove it
|
|
591
|
+
annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
|
|
592
|
+
annots_array_str = annots_array_str.gsub(/\s+/, " ")
|
|
593
|
+
end
|
|
594
|
+
end
|
|
595
|
+
|
|
596
|
+
new_annots = if annots_array_str.empty? || annots_array_str.strip.empty?
|
|
597
|
+
"[]"
|
|
598
|
+
else
|
|
599
|
+
"[#{annots_array_str}]"
|
|
600
|
+
end
|
|
601
|
+
|
|
602
|
+
new_body = body.sub(%r{/Annots\s*\[.*?\]}, "/Annots #{new_annots}")
|
|
603
|
+
obj[:body] = new_body
|
|
604
|
+
# Handle indirect /Annots array reference
|
|
605
|
+
elsif body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
|
|
606
|
+
annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
607
|
+
annots_obj = objects.find { |o| o[:ref] == annots_array_ref }
|
|
608
|
+
if annots_obj
|
|
609
|
+
annots_body = annots_obj[:body]
|
|
610
|
+
|
|
611
|
+
# Remove widgets that match removed fields
|
|
612
|
+
widget_refs_to_remove.each do |widget_ref|
|
|
613
|
+
annots_body = DictScan.remove_ref_from_array(annots_body, widget_ref)
|
|
614
|
+
end
|
|
615
|
+
|
|
616
|
+
# Also remove orphaned widget references
|
|
617
|
+
annots_refs = annots_body.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
|
|
618
|
+
annots_refs.each do |annot_ref|
|
|
619
|
+
if objects_in_file.include?(annot_ref)
|
|
620
|
+
widget_obj = objects.find { |o| o[:ref] == annot_ref }
|
|
621
|
+
if widget_obj && DictScan.is_widget?(widget_obj[:body])
|
|
622
|
+
widget_body = widget_obj[:body]
|
|
623
|
+
if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
|
|
624
|
+
parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
|
|
625
|
+
unless field_refs_in_file.include?(parent_ref)
|
|
626
|
+
annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
|
|
627
|
+
end
|
|
628
|
+
end
|
|
629
|
+
end
|
|
630
|
+
else
|
|
631
|
+
annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
|
|
632
|
+
end
|
|
633
|
+
end
|
|
634
|
+
|
|
635
|
+
annots_obj[:body] = annots_body
|
|
636
|
+
end
|
|
637
|
+
end
|
|
638
|
+
end
|
|
639
|
+
|
|
640
|
+
# Sort objects by object number
|
|
641
|
+
objects.sort_by! { |obj| obj[:ref][0] }
|
|
642
|
+
|
|
643
|
+
# Write the cleaned PDF
|
|
644
|
+
writer = PDFWriter.new
|
|
645
|
+
writer.write_header
|
|
646
|
+
|
|
647
|
+
objects.each do |obj|
|
|
648
|
+
writer.write_object(obj[:ref], obj[:body])
|
|
649
|
+
end
|
|
650
|
+
|
|
651
|
+
writer.write_xref
|
|
652
|
+
|
|
653
|
+
trailer_dict = @resolver.trailer_dict
|
|
654
|
+
info_ref = nil
|
|
655
|
+
if trailer_dict =~ %r{/Info\s+(\d+)\s+(\d+)\s+R}
|
|
656
|
+
info_ref = [::Regexp.last_match(1).to_i, ::Regexp.last_match(2).to_i]
|
|
657
|
+
end
|
|
658
|
+
|
|
659
|
+
# Write trailer
|
|
660
|
+
max_obj_num = objects.map { |obj| obj[:ref][0] }.max || 0
|
|
661
|
+
writer.write_trailer(max_obj_num + 1, root_ref, info_ref)
|
|
662
|
+
|
|
663
|
+
writer.output
|
|
664
|
+
end
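The selection step at the top of `clear` picks exactly one criterion, in the order block, `keep_fields`, `remove_fields`, `remove_pattern`; the criteria are not combined. The sketch below isolates that precedence on plain field-name strings. `names_to_remove` and the sample names are hypothetical and exist only for illustration.

```ruby
require "set"

# Hypothetical stand-alone version of the selection step in `clear`.
# Takes field names (strings) instead of Field objects.
def names_to_remove(names, keep_fields: nil, remove_fields: nil, remove_pattern: nil)
  if block_given?
    # Block returns true to keep a field, so remove the rest.
    names.reject { |name| yield(name) }.to_set
  elsif keep_fields
    keep = keep_fields.map(&:to_s).to_set
    names.reject { |name| keep.include?(name) }.to_set
  elsif remove_fields
    remove = remove_fields.map(&:to_s).to_set
    names.select { |name| remove.include?(name) }.to_set
  elsif remove_pattern
    names.select { |name| name =~ remove_pattern }.to_set
  else
    # No criteria: nothing is removed (the real method returns @raw here).
    Set.new
  end
end

names = ["Name", "Email", "text-1234", "text-5678"]

by_keep    = names_to_remove(names, keep_fields: [:Name, "Email"])
by_pattern = names_to_remove(names, remove_pattern: /\Atext-/)
# A block wins over keep_fields, so keep_fields is ignored in this call.
by_block   = names_to_remove(names, keep_fields: ["Name"]) { |n| n.start_with?("text-") }
```

One consequence of this ordering is that passing both a block and a keyword option silently ignores the keyword.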
|
|
665
|
+
|
|
666
|
+
# Clean up in-place (mutates current instance)
|
|
667
|
+
def clear!(...)
|
|
668
|
+
cleaned_content = clear(...)
|
|
669
|
+
@raw = cleaned_content
|
|
670
|
+
@resolver = AcroThat::ObjectResolver.new(cleaned_content)
|
|
671
|
+
@patches = []
|
|
672
|
+
|
|
673
|
+
self
|
|
674
|
+
end
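For the inline `/Annots [...]` case, `clear` deletes each removed widget reference with a word-boundary regex and then normalises whitespace. The following is a simplified sketch of that one step only. `prune_annots` is a hypothetical helper, and it omits the orphaned-widget check and the indirect `/Annots N G R` branch that the diff also handles.

```ruby
# Hypothetical helper: drops the given [obj_num, gen] refs from an inline
# /Annots array in a page dictionary body.
def prune_annots(page_body, refs_to_remove)
  page_body.sub(%r{/Annots\s*\[(.*?)\]}m) do
    remaining = Regexp.last_match(1)
    refs_to_remove.each do |num, gen|
      # \b keeps "2 0 R" from matching inside "12 0 R".
      remaining = remaining.gsub(/\b#{num}\s+#{gen}\s+R\b/, "")
    end
    "/Annots [#{remaining.split.join(' ')}]"
  end
end

# Made-up sample page dictionary.
page = "<< /Type /Page /Annots [12 0 R 2 0 R 7 0 R] >>"

pruned  = prune_annots(page, [[2, 0], [7, 0]])
emptied = prune_annots(page, [[12, 0], [2, 0], [7, 0]])
```

The leading `\b` matters because object numbers can share digits: without it, removing `2 0 R` would also corrupt `12 0 R`.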
|
|
675
|
+
|
|
273
676
|
# Write out with an incremental update
|
|
274
|
-
def write(path_out = nil, flatten:
|
|
677
|
+
def write(path_out = nil, flatten: true)
|
|
275
678
|
deduped_patches = @patches.reverse.uniq { |p| p[:ref] }.reverse
|
|
276
679
|
writer = AcroThat::IncrementalWriter.new(@raw, deduped_patches)
|
|
277
680
|
@raw = writer.render
|
|
@@ -290,6 +693,29 @@ module AcroThat
|
|
|
290
693
|
|
|
291
694
|
private
|
|
292
695
|
|
|
696
|
+
def collect_pages_from_tree(pages_ref, page_objects)
|
|
697
|
+
pages_body = @resolver.object_body(pages_ref)
|
|
698
|
+
return unless pages_body
|
|
699
|
+
|
|
700
|
+
# Extract /Kids array from Pages object
|
|
701
|
+
if pages_body =~ %r{/Kids\s*\[(.*?)\]}m
|
|
702
|
+
kids_array = ::Regexp.last_match(1)
|
|
703
|
+
# Extract all object references from Kids array in order
|
|
704
|
+
kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
|
|
705
|
+
kid_ref = [num_str.to_i, gen_str.to_i]
|
|
706
|
+
kid_body = @resolver.object_body(kid_ref)
|
|
707
|
+
|
|
708
|
+
# Check if this kid is a page (not /Type/Pages)
|
|
709
|
+
if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
|
|
710
|
+
page_objects << kid_ref unless page_objects.include?(kid_ref)
|
|
711
|
+
elsif kid_body && kid_body.include?("/Type /Pages")
|
|
712
|
+
# Recursively find pages in this Pages node
|
|
713
|
+
collect_pages_from_tree(kid_ref, page_objects)
|
|
714
|
+
end
|
|
715
|
+
end
|
|
716
|
+
end
|
|
717
|
+
end
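To see why `list_pages` prefers the page tree over sorting by object number, here is a minimal sketch of the same depth-first walk over an in-memory object table. The `objects` hash and `collect_pages` are hypothetical stand-ins for `@resolver.object_body` and `collect_pages_from_tree`. Unlike the diff, which tests `include?("/Type /Pages")`, this sketch also accepts `/Type/Pages` without a space; that is an assumption of the example, not the gem's behaviour.

```ruby
# Hypothetical object table: [obj_num, gen] => dictionary body.
objects = {
  [1, 0] => "<< /Type /Catalog /Pages 2 0 R >>",
  [2, 0] => "<< /Type /Pages /Kids [5 0 R 3 0 R] /Count 3 >>",
  [3, 0] => "<< /Type /Pages /Kids [6 0 R 4 0 R] /Count 2 >>",
  [4, 0] => "<< /Type /Page /Parent 3 0 R >>",
  [5, 0] => "<< /Type/Page /Parent 2 0 R >>",
  [6, 0] => "<< /Type /Page /Parent 3 0 R >>"
}

# Depth-first walk of /Kids, appending page refs in document order.
def collect_pages(objects, node_ref, found = [])
  body = objects[node_ref]
  return found unless body

  kids = body[%r{/Kids\s*\[(.*?)\]}m, 1]
  return found unless kids

  kids.scan(/(\d+)\s+(\d+)\s+R/) do |num, gen|
    kid_ref = [num.to_i, gen.to_i]
    kid_body = objects[kid_ref]
    next unless kid_body

    if kid_body =~ %r{/Type\s*/Page(?!s)\b}
      # Leaf page: (?!s) rejects /Pages nodes.
      found << kid_ref unless found.include?(kid_ref)
    elsif kid_body =~ %r{/Type\s*/Pages\b}
      # Intermediate node: recurse into its own /Kids.
      collect_pages(objects, kid_ref, found)
    end
  end
  found
end

page_order = collect_pages(objects, [2, 0])
by_number  = objects.keys.select { |ref| objects[ref] =~ %r{/Type\s*/Page(?!s)\b} }.sort
```

Here the tree walk yields objects 5, 6, 4 while the object-number fallback yields 4, 5, 6, so page numbers reported by the fallback path may not match the visible page order.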
|
|
718
|
+
|
|
293
719
|
def find_page_number_for_ref(page_ref)
|
|
294
720
|
page_objects = []
|
|
295
721
|
@resolver.each_object do |ref, body|
|
data/lib/acro_that/version.rb
CHANGED
data/lib/acro_that.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: acro_that
 version: !ruby/object:Gem::Version
-version: 0.1.
+version: 0.1.2
 platform: ruby
 authors:
 - Michael Wynkoop
@@ -98,9 +98,12 @@ files:
 - Rakefile
 - acro_that.gemspec
 - docs/README.md
+- docs/clear_fields.md
 - docs/dict_scan_explained.md
 - docs/object_streams.md
 - docs/pdf_structure.md
+- issues/README.md
+- issues/refactoring-opportunities.md
 - lib/acro_that.rb
 - lib/acro_that/actions/add_field.rb
 - lib/acro_that/actions/add_signature_appearance.rb