RubyGems - acro_that - Versions diffs - 0.1.1 → 0.1.2 - Mend

acro_that 0.1.1 → 0.1.2


Files changed (13)

checksums.yaml +4 -4
data/Gemfile.lock +1 -1
data/README.md +49 -0
data/docs/README.md +12 -0
data/docs/clear_fields.md +202 -0
data/issues/README.md +38 -0
data/issues/refactoring-opportunities.md +269 -0
data/lib/acro_that/actions/add_field.rb +6 -6
data/lib/acro_that/actions/add_signature_appearance.rb +3 -3
data/lib/acro_that/document.rb +467 -41
data/lib/acro_that/version.rb +1 -1
data/lib/acro_that.rb +1 -0
metadata +4 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 7cccc792d205578fc981258a0b171e4906b3f9c945ac2abd664447054251f940
-  data.tar.gz: 4544010bb0642b5c88cd9b5cc95a0c97d018109ecbe19648807419590d3f5e8b
+  metadata.gz: aa3b37611a5ca19d2cb0fa1c86d3821fb139b6f65ce9d3cd314d913b6e9e5db5
+  data.tar.gz: a4d07aaa7f268eb151b324b03e9575f563880b2cf97fd0172edc0e96c35e6eef
 SHA512:
-  metadata.gz: 5a43d6c18e6babead9223d93f661032ff324ee45fae76646d2d8c37528ea4ddf294b5c5240ddc67d894c53aaff05d2245baff6c9c6f7813d9dad83c0584e5cea
-  data.tar.gz: 17052dece8b5a7700c4d9f822b2ba7e2f2b8d0ded537a173d38b2e7190ffc0b6d2d9f86d51848b8d7dd99cd35015cda99299dfab0b7f87a5df1d0404e087cbe8
+  metadata.gz: a545eea311d77e7b46459e710925f8ec7b61baebd846ec7f602f6a77ab9532bab87d37be4bba53b00b43a09281e79faba79fcb71e5371c36953e877d556dcdd8
+  data.tar.gz: b9074d19a0cc330af44f4fd5609f49f8892e20deb95d522fa7ca5fb3099ca8c71f01cbc98ef7d3e6ed95620af090afc96035fbeaaf4a551ca6f9e4003c49f4a8

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    acro_that (0.1.0)
+    acro_that (0.1.2)
       chunky_png (~> 1.4)
 GEM

data/README.md CHANGED

@@ -189,6 +189,31 @@ flattened_doc = AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
 flattened_bytes = AcroThat::Document.flatten_pdf("input.pdf")
 ```
+#### Clearing Fields
+The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
+```ruby
+doc = AcroThat::Document.new("form.pdf")
+# Remove all fields matching a pattern
+doc.clear!(remove_pattern: /^text-/)
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+# Remove specific fields
+doc.clear!(remove_fields: ["OldField1", "OldField2"])
+# Use a block to determine which fields to keep
+doc.clear! { |name| !name.start_with?("temp_") }
+# Write the cleared PDF
+doc.write("cleared.pdf", flatten: true)
+```
+**Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
 ### API Reference
 #### `AcroThat::Document.new(path_or_io)`
@@ -282,6 +307,30 @@ AcroThat::Document.flatten_pdf("input.pdf", "output.pdf")
 flattened_doc = AcroThat::Document.flatten_pdf("input.pdf")
 ```
+#### `#clear(options = {})` and `#clear!(options = {})`
+Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
+- `keep_fields`: Array of field names to keep (all others removed)
+- `remove_fields`: Array of field names to remove
+- `remove_pattern`: Regex pattern - fields matching this are removed
+- Block: Given field name, return `true` to keep, `false` to remove
+```ruby
+# Remove all fields
+cleared = doc.clear(remove_pattern: /.*/)
+# Remove fields matching pattern (in-place)
+doc.clear!(remove_pattern: /^text-/)
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+# Use block to filter fields
+doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
+```
+**Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
 ### Field Object
 Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:

data/docs/README.md CHANGED

@@ -38,6 +38,17 @@ Explains how PDF object streams work and how `AcroThat` parses them:
 **Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
+### [Clearing Fields](./cleaning_fields.md)
+Documentation for the `clear` and `clear!` methods:
+- How to remove unwanted fields completely
+- Difference between `clear` and `remove_field`
+- Pattern matching and field selection
+- Removing orphaned widget references
+- Best practices for clearing PDFs
+**Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
 ## Common Themes
 Throughout all documentation, you'll see these recurring themes:
@@ -54,6 +65,7 @@ Throughout all documentation, you'll see these recurring themes:
 1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
 2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
 3. Read [Object Streams](./object_streams.md) to understand compression features
+4. Read [Clearing Fields](./cleaning_fields.md) to learn how to remove unwanted fields
 **If you're debugging:**
 - [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs

data/docs/clear_fields.md ADDED

@@ -0,0 +1,202 @@
+# Clearing Fields with `clear` and `clear!`
+The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
+- Remove multiple layers of added fields
+- Clear a PDF that has accumulated many unwanted fields
+- Get back to a base file without certain fields
+- Remove orphaned or invalid field references
+Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
+- Field objects are completely removed (not just marked as deleted)
+- Widget annotations are removed from page `/Annots` arrays
+- Orphaned widget references are cleaned up
+- The AcroForm `/Fields` array is updated
+- All references to removed fields are eliminated
+## Methods
+### `clear(options = {})`
+Returns a new PDF with unwanted fields removed. Does not modify the current document.
+**Options:**
+- `keep_fields`: Array of field names to keep (all others removed)
+- `remove_fields`: Array of field names to remove
+- `remove_pattern`: Regex pattern - fields matching this are removed
+- Block: Given field name, return `true` to keep, `false` to remove
+### `clear!(options = {})`
+Same as `clear`, but modifies the current document in-place. Mutates the document instance.
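The four selection options can be read as a single keep-or-remove decision per field name. The sketch below is a hypothetical stand-alone helper, `keep_field?`, and is not the gem's code. The precedence it uses (block first, then `keep_fields`, then `remove_fields`, then `remove_pattern`) is an assumption, since the documentation does not say how combined options interact.

```ruby
# Hypothetical helper (not part of acro_that): decide whether a field name survives.
# Assumed precedence: block, then keep_fields, then remove_fields, then remove_pattern.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  return !!block.call(name) if block
  return keep_fields.include?(name) if keep_fields
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)

  true # nothing asked for this field to be removed
end

names = ["Name", "Email", "text-1", "temp_note"]

p names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
p names.select { |n| keep_field?(n, keep_fields: ["Name", "Email"]) }
p names.select { |n| keep_field?(n) { |f| !f.start_with?("temp_") } }
```

Note the block convention from the option list: the block returns `true` to keep a field, which is the opposite polarity of `remove_pattern`.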
+## Usage Examples
+### Remove All Fields
+```ruby
+doc = AcroThat::Document.new("form.pdf")
+# Remove all fields
+cleared_pdf = doc.clear(remove_pattern: /.*/)
+# Or in-place
+doc.clear!(remove_pattern: /.*/)
+```
+### Remove Fields Matching a Pattern
+```ruby
+# Remove all fields starting with "text-"
+doc.clear!(remove_pattern: /^text-/)
+# Remove UUID-like generated fields
+doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
+```
+### Keep Only Specific Fields
+```ruby
+# Keep only these fields, remove all others
+doc.clear!(keep_fields: ["Name", "Email", "Phone"])
+# Write the cleared PDF
+doc.write("cleared.pdf", flatten: true)
+```
+### Remove Specific Fields
+```ruby
+# Remove specific unwanted fields
+doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
+```
+### Complex Selection with Block
+```ruby
+# Remove all fields except those matching certain criteria
+doc.clear! do |field_name|
+  # Keep fields that don't look generated
+  !field_name.start_with?("text-") &&
+  !field_name.match?(/^[a-f0-9]{20,}/)
+end
+```
+## How It Works
+The `clear` method:
+1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
+2. **Finds related widgets** for each field to be removed:
+   - Widgets that reference the field via `/Parent`
+   - Widgets that have the same name via `/T`
+3. **Collects objects to write**, excluding:
+   - Field objects that should be removed
+   - Widget annotation objects that should be removed
+4. **Updates AcroForm structure**:
+   - Removes field references from the `/Fields` array
+   - Handles both inline and indirect array references
+5. **Clears page annotations**:
+   - Removes widget references from page `/Annots` arrays
+   - Removes orphaned widget references (widgets pointing to non-existent fields)
+   - Removes references to widgets that don't exist in the cleared PDF
+6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
+## Key Differences from `remove_field`
+| Feature | `remove_field` | `clear` |
+|---------|---------------|---------|
+| Update Type | Incremental update | Complete rewrite |
+| Object Removal | Marks as deleted | Completely excluded |
+| PDF Structure | Preserves all objects | Only includes selected objects |
+| Use Case | Remove one/a few fields | Remove many fields or clean up |
+| Performance | Fast (append only) | Slower (full rewrite) |
+## Best Practices
+1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, `clear` is more efficient and produces cleaner output.
+2. **Always flatten after clearing**: Since `clear` rewrites the PDF, consider using `write(..., flatten: true)` to ensure compatibility with all PDF viewers:
+```ruby
+doc.clear!(remove_pattern: /^text-/)
+doc.write("output.pdf", flatten: true)
+```
+3. **Combine with field addition**: After clearing, you can add new fields:
+```ruby
+doc.clear!(remove_pattern: /.*/)
+doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
+doc.write("output.pdf", flatten: true)
+```
+4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
+```ruby
+# Remove all UUID-like fields
+doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
+# Remove all fields containing "temp" or "test"
+doc.clear!(remove_pattern: /temp|test/i)
+```
+## Technical Details
+### Orphaned Widget Removal
+The `clear` method automatically identifies and removes orphaned widget references:
+- **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
+- **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
+This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
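The two orphan cases above can be illustrated on a toy object table. This is a minimal sketch, assuming a plain Hash from `[obj_num, gen_num]` to body text in place of the gem's resolver. The helper name `live_annot_refs` is hypothetical.

```ruby
# Toy object table: [obj_num, gen_num] => object body text.
objects = {
  [5, 0] => "<< /T (Name) /FT /Tx >>",                            # field
  [6, 0] => "<< /Type /Annot /Subtype /Widget /Parent 5 0 R >>",  # widget, parent exists
  [7, 0] => "<< /Type /Annot /Subtype /Widget /Parent 99 0 R >>"  # widget, parent missing
}

# Hypothetical helper (not the gem's API): keep only references that resolve
# to an object and whose /Parent, if present, also resolves.
def live_annot_refs(annot_refs, objects)
  annot_refs.select do |ref|
    body = objects[ref]
    next false unless body # reference to a non-existent object

    if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
      objects.key?([Regexp.last_match(1).to_i, Regexp.last_match(2).to_i])
    else
      true # no /Parent entry, so nothing to orphan
    end
  end
end

annots = [[6, 0], [7, 0], [8, 0]]
p live_annot_refs(annots, objects) # [7, 0] is orphaned, [8, 0] does not exist
```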
+### Page Detection
+The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
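The distinction between `/Type /Page` and `/Type /Pages` can be checked with the regex that appears in the code changes of this diff. The sample bodies below are made up for illustration. One point worth noting: a plain substring test such as `include?("/Type /Page")` also returns true for `/Type /Pages`, because the longer name contains the shorter one, so only the regex with the negative lookahead separates the two.

```ruby
# Regex taken from the diff: match /Type /Page or /Type/Page, but not /Type /Pages.
PAGE_TYPE = %r{/Type\s*/Page(?!s)\b}

samples = [
  "<< /Type /Page /Parent 2 0 R >>",
  "<< /Type/Page /Parent 2 0 R >>",
  "<< /Type /Pages /Kids [3 0 R] >>",
  "<< /Type/Pages /Count 1 >>"
]

samples.each do |body|
  regex_hit = body.match?(PAGE_TYPE)
  substring_hit = body.include?("/Type /Page")
  puts "#{body.ljust(36)} regex: #{regex_hit}  substring: #{substring_hit}"
end
```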
+### AcroForm Structure
+The method properly handles both:
+- **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
+- **Indirect `/Fields` arrays**: Arrays referenced as separate objects
+Both are updated to remove references to deleted fields.
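As a rough illustration of the inline and indirect cases, the sketch below reads the field references from either form. The helper name `fields_refs` and the Hash used as an object table are assumptions for this example and do not come from the gem.

```ruby
# Hypothetical helper: return the field references listed under /Fields,
# following one level of indirection when /Fields points at a separate array object.
def fields_refs(acroform_body, objects)
  array_text =
    if acroform_body =~ %r{/Fields\s*\[(.*?)\]}m
      Regexp.last_match(1) # inline array
    elsif acroform_body =~ %r{/Fields\s+(\d+)\s+(\d+)\s+R}
      # indirect array: look up the referenced object body (empty if missing)
      objects[[Regexp.last_match(1).to_i, Regexp.last_match(2).to_i]].to_s
    else
      ""
    end

  array_text.scan(/(\d+)\s+(\d+)\s+R/).map { |num, gen| [num.to_i, gen.to_i] }
end

objects = { [10, 0] => "[5 0 R 6 0 R]" }
p fields_refs("<< /Fields [5 0 R 6 0 R] >>", objects)
p fields_refs("<< /Fields 10 0 R >>", objects)
```

Both calls yield the same list of references, which is the property a caller relies on when filtering removed fields out of either form.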
+## Example: Complete Clearing Workflow
+```ruby
+require 'acro_that'
+# Load PDF with many unwanted fields
+doc = AcroThat::Document.new("messy_form.pdf")
+# Remove all generated/UUID-like fields
+doc.clear! { |name|
+  # Keep only fields that look intentional
+  !name.match?(/^[a-f0-9-]{30,}/) &&  # Not UUID-like
+  !name.start_with?("temp_") &&       # Not temporary
+  !name.empty?                         # Not empty
+}
+# Add new fields
+doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
+doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
+# Write cleared and updated PDF
+doc.write("cleared_form.pdf", flatten: true)
+```
+## See Also
+- [`flatten` and `flatten!`](./README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
+- [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
+- [Main README](../README.md) - General usage and API reference

data/issues/README.md ADDED

@@ -0,0 +1,38 @@
+# Code Review Issues
+This folder contains documentation of code cleanup and refactoring opportunities found in the codebase.
+## Files
+- **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
+## Summary
+### High Priority Issues
+1. **Widget Matching Logic** - Duplicated across 6+ locations
+2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations
+### Medium Priority Issues
+3. **Page-Finding Logic** - Similar logic in 4+ methods
+4. **Box Parsing Logic** - Repeated code blocks for 5 box types
+### Low Priority Issues
+5. Duplicated `next_fresh_object_number` implementation
+6. Object reference extraction pattern duplication
+7. Unused method: `get_widget_rect_dimensions`
+8. Base64 decoding logic duplication
+## Quick Stats
+- **8 refactoring opportunities** identified
+- **6+ locations** with widget matching duplication
+- **3 locations** with /Annots array manipulation duplication
+- **1 unused method** found
+## Next Steps
+1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
+2. Prioritize refactoring based on maintenance needs
+3. Create test coverage before refactoring
+4. Refactor incrementally, starting with high-priority items

data/issues/refactoring-opportunities.md ADDED

@@ -0,0 +1,269 @@
+# Refactoring Opportunities
+This document identifies code duplication and unused methods that could be refactored to improve maintainability.
+## 1. Duplicated Page-Finding Logic
+### Issue
+Multiple methods have similar logic for finding page objects in a PDF document.
+### Locations
+- `Document#list_pages` (lines 75-104)
+- `Document#collect_pages_from_tree` (lines 691-712)
+- `Document#find_page_number_for_ref` (lines 714-728)
+- `AddField#find_page_ref` (lines 155-211)
+### Pattern
+The pattern `body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}` appears in multiple places with slight variations.
+### Suggested Refactor
+Create a shared module or utility methods in `DictScan`:
+- `DictScan.is_page?(body)` - Check if a body represents a page object
+- `Document#find_all_pages` - Unified method to find all page objects
+- `Document#find_page_by_number(page_num)` - Find a specific page by number
+### Benefits
+- Single source of truth for page detection logic
+- Easier to maintain and update page-finding behavior
+- Consistent page ordering across methods
+---
+## 2. Duplicated Widget-Matching Logic
+### Issue
+Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
+1. `/Parent` reference pointing to the field
+2. `/T` (field name) matching the field name
+### Locations
+- `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
+- `Document#clear` (lines 472-495) - Finds widgets for removed fields
+- `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
+- `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
+- `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
+- `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
+### Pattern
+The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
+### Suggested Refactor
+Create utility methods in `Base` or a new `WidgetMatcher` module:
+- `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
+- `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
+- `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
+### Benefits
+- Centralized widget matching logic
+- Consistent widget finding behavior
+- Easier to extend matching criteria
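One possible shape for the suggested `find_widgets_for_field` is sketched below. It is not the gem's implementation: the object table is a plain Hash rather than the resolver, widgets are detected by the literal text `/Subtype /Widget`, and `/T` is assumed to be a literal string such as `/T (Name)`. Hex-encoded names are not handled.

```ruby
# Hypothetical sketch of the suggested helper; `objects` stands in for the resolver.
def find_widgets_for_field(objects, field_ref, field_name)
  parent_pattern = %r{/Parent\s+#{field_ref[0]}\s+#{field_ref[1]}\s+R}
  name_pattern = %r{/T\s*\(#{Regexp.escape(field_name)}\)}

  matches = objects.select do |_ref, body|
    next false unless body.include?("/Subtype /Widget")

    # A widget belongs to the field if it points at it or shares its name.
    body.match?(parent_pattern) || body.match?(name_pattern)
  end
  matches.keys
end

objects = {
  [5, 0] => "<< /FT /Tx /T (Name) >>",
  [6, 0] => "<< /Type /Annot /Subtype /Widget /Parent 5 0 R >>",
  [7, 0] => "<< /Type /Annot /Subtype /Widget /T (Name) >>",
  [8, 0] => "<< /Type /Annot /Subtype /Widget /Parent 15 0 R >>"
}

p find_widgets_for_field(objects, [5, 0], "Name")
```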
+---
+## 3. Duplicated /Annots Array Manipulation
+### Issue
+Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
+1. Inline `/Annots` arrays: `/Annots [...]`
+2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
+### Locations
+- `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
+- `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
+- `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
+### Pattern
+All three methods have similar conditional logic:
+```ruby
+if page_body =~ %r{/Annots\s*\[(.*?)\]}m
+  # Handle inline array
+elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+  # Handle indirect array
+else
+  # Create new /Annots array
+end
+```
+### Suggested Refactor
+Extend `DictScan` with methods:
+- `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
+- `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
+- `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
+### Benefits
+- Single implementation of /Annots manipulation logic
+- Consistent handling of edge cases
+- Easier to test /Annots operations
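A minimal sketch of what `remove_from_annots_array` might look like for the inline case follows. It is an assumption-laden illustration and not the gem's code: it works on the page body as a string, handles only inline arrays, and returns the body unchanged when `/Annots` is an indirect reference.

```ruby
# Hypothetical sketch: drop one widget reference from an inline /Annots array.
# Bodies with an indirect array (/Annots X Y R) or no /Annots are returned unchanged.
def remove_from_annots_array(page_body, widget_ref)
  # The lookbehind keeps "6 0 R" from matching inside "16 0 R".
  ref_pattern = /(?<!\d)#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\s*/

  page_body.sub(%r{(/Annots\s*\[)(.*?)(\])}m) do
    head, refs, tail = Regexp.last_match.captures
    "#{head}#{refs.sub(ref_pattern, '').strip}#{tail}"
  end
end

page = "<< /Type /Page /Annots [6 0 R 16 0 R 7 0 R] >>"
puts remove_from_annots_array(page, [6, 0])
puts remove_from_annots_array(page, [7, 0])
```

The lookbehind is the main design point: object numbers are plain digit runs, so a naive pattern for `6 0 R` would also hit the tail of `16 0 R`.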
+---
+## 4. Duplicated Box Parsing Logic
+### Issue
+`Document#list_pages` has repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
+### Locations
+- `Document#list_pages` (lines 120-165)
+### Pattern
+Each box type uses identical logic:
+```ruby
+if body =~ %r{/MediaBox\s*\[(.*?)\]}
+  box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+  if box_values.length == 4
+    llx, lly, urx, ury = box_values
+    media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+  end
+end
+```
+### Suggested Refactor
+Create a helper method:
+```ruby
+def parse_box(body, box_type)
+  pattern = %r{/#{box_type}\s*\[(.*?)\]}
+  return nil unless body =~ pattern
+  box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+  return nil unless box_values.length == 4
+  llx, lly, urx, ury = box_values
+  { llx: llx, lly: lly, urx: urx, ury: ury }
+end
+```
+### Benefits
+- Reduces code duplication from ~45 lines to ~10 lines per box type
+- Easier to add new box types
+- Consistent parsing logic
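To show how the proposed helper could replace the five repeated blocks, the sketch below repeats `parse_box` as given above and calls it once per box type. The sample page body and the `boxes` Hash are made up for illustration. How `list_pages` would actually consume the result is not specified in the source.

```ruby
# Helper as proposed in the refactoring note, repeated so the example is self-contained.
def parse_box(body, box_type)
  pattern = %r{/#{box_type}\s*\[(.*?)\]}
  return nil unless body =~ pattern

  box_values = Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
  return nil unless box_values.length == 4

  llx, lly, urx, ury = box_values
  { llx: llx, lly: lly, urx: urx, ury: ury }
end

# Illustrative page body with two of the five boxes present.
body = "<< /Type /Page /MediaBox [0 0 612 792] /CropBox [10 10 602 782] >>"

boxes = %w[MediaBox CropBox ArtBox BleedBox TrimBox].to_h do |type|
  [type, parse_box(body, type)] # nil when the box is absent or malformed
end

media = boxes["MediaBox"]
puts "width=#{media[:urx] - media[:llx]} height=#{media[:ury] - media[:lly]}"
puts "boxes present: #{boxes.compact.keys.join(', ')}"
```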
+---
+## 5. Duplicated next_fresh_object_number Implementation
+### Issue
+The `next_fresh_object_number` method is implemented identically in two places.
+### Locations
+- `Document#next_fresh_object_number` (lines 730-739)
+- `Base#next_fresh_object_number` (lines 28-37)
+### Pattern
+Both methods have identical implementation:
+```ruby
+def next_fresh_object_number
+  max_obj_num = 0
+  resolver.each_object do |ref, _|
+    max_obj_num = [max_obj_num, ref[0]].max
+  end
+  patches.each do |p|
+    max_obj_num = [max_obj_num, p[:ref][0]].max
+  end
+  max_obj_num + 1
+end
+```
+### Suggested Refactor
+- Remove `Document#next_fresh_object_number` - it's only called within `Document` but could use `Base`'s implementation
+- Or: Document already has access to resolver and patches, so remove duplication by making Document use Base's method
+### Benefits
+- Single implementation
+- Consistent object numbering logic
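One way to end up with a single implementation is a small module included wherever the method is needed. The sketch below is only an illustration: the module name `FreshObjectNumber` and the `FakeResolver` and `FakeDocument` classes are hypothetical stand-ins, and it assumes the including class exposes `resolver` and `patches` as the duplicated method already expects.

```ruby
# Hypothetical module; the method body is the one quoted in the refactoring note.
module FreshObjectNumber
  def next_fresh_object_number
    max_obj_num = 0
    resolver.each_object { |ref, _| max_obj_num = [max_obj_num, ref[0]].max }
    patches.each { |p| max_obj_num = [max_obj_num, p[:ref][0]].max }
    max_obj_num + 1
  end
end

# Stand-in for the gem's resolver, for illustration only.
class FakeResolver
  def initialize(refs)
    @refs = refs
  end

  def each_object
    @refs.each { |ref| yield ref, "" }
  end
end

# Stand-in for a class that needs fresh object numbers.
class FakeDocument
  include FreshObjectNumber
  attr_reader :resolver, :patches

  def initialize(resolver, patches)
    @resolver = resolver
    @patches = patches
  end
end

doc = FakeDocument.new(FakeResolver.new([[1, 0], [4, 0]]), [{ ref: [9, 0], body: "" }])
p doc.next_fresh_object_number
```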
+---
+## 6. Unused Methods
+### Issue
+Some methods are defined but never called.
+### Locations
+- `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
+  - Defined but never used
+  - `extract_rect` is used instead, which provides the same information
+### Suggested Refactor
+- Remove `get_widget_rect_dimensions` if it's truly unused
+- Or: Verify if it was intended for future use and document it
+### Benefits
+- Cleaner codebase
+- Less confusion about which method to use
+---
+## 7. Duplicated Base64 Decoding Logic
+### Issue
+`AddSignatureAppearance` has two similar methods for decoding base64 data.
+### Locations
+- `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
+- `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
+### Pattern
+Both methods handle base64 decoding, with slightly different logic. Could potentially be unified.
+### Suggested Refactor
+- Consider merging into a single method that handles both cases
+- Or: Document the distinction if both are needed
+### Benefits
+- Simpler API
+- Less code duplication
+---
+## 8. Duplicated Regex Pattern for Object Reference
+### Issue
+The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
+### Locations
+Throughout the codebase, used in:
+- Extracting `/Parent` references
+- Extracting `/P` (page) references
+- Extracting `/Pages` references
+- Extracting `/Fields` array references
+- And many more...
+### Suggested Refactor
+Create a utility method:
+```ruby
+def DictScan.extract_object_ref(str)
+  # Extract object reference from string
+  # Returns [obj_num, gen_num] or nil
+end
+```
+### Benefits
+- Consistent reference extraction
+- Easier to update if PDF reference format changes
+- More readable code
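The stub above leaves the body open. A minimal sketch of one way to fill it in is shown below, using the same `(\d+)\s+(\d+)\s+R` pattern quoted in the issue. The module name `DictScanSketch` is hypothetical so that the example does not claim to be the gem's `DictScan`, and returning only the first reference in the string is an assumption.

```ruby
# Hypothetical stand-in for DictScan, used only for this illustration.
module DictScanSketch
  OBJECT_REF = /(\d+)\s+(\d+)\s+R/

  # Returns [obj_num, gen_num] for the first reference in str, or nil if none.
  def self.extract_object_ref(str)
    match = OBJECT_REF.match(str.to_s)
    match && [match[1].to_i, match[2].to_i]
  end
end

p DictScanSketch.extract_object_ref("/Parent 12 0 R")
p DictScanSketch.extract_object_ref("/Rect [0 0 10 10]")
```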
+---
+## Priority Recommendations
+### High Priority
+1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
+2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
+### Medium Priority
+3. **Page-Finding Logic (#1)** - Used in multiple places, but less frequently
+4. **Box Parsing Logic (#4)** - Simple duplication, easy to refactor
+### Low Priority
+5. **next_fresh_object_number (#5)** - Simple duplication
+6. **Object Reference Extraction (#8)** - Could improve consistency
+7. **Unused Methods (#6)** - Cleanup task
+8. **Base64 Decoding (#7)** - Minor duplication
+---
+## Notes
+- All refactoring should be accompanied by tests to ensure behavior doesn't change
+- Consider backward compatibility if any methods are moved between modules
+- Some duplication may be intentional for performance reasons (avoid method call overhead) - evaluate before refactoring

data/lib/acro_that/actions/add_field.rb CHANGED

@@ -157,10 +157,10 @@ module AcroThat
         resolver.each_object do |ref, body|
           next unless body
-          # Check for /Type /Page with or without space, or /Type/Page
+          # Check for /Type /Page (actual page, not /Type/Pages)
+          # Must match /Type /Page or /Type/Page but NOT /Type/Pages
           is_page = body.include?("/Type /Page") ||
-                    body.include?("/Type/Page") ||
-                    (body.include?("/Type") && body.include?("/Page") && body =~ %r{/Type\s*/Page})
+                    (body =~ %r{/Type\s*/Page(?!s)\b})
           next unless is_page
           page_objects << ref
@@ -183,8 +183,8 @@ module AcroThat
                 kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
                   kid_ref = [num_str.to_i, gen_str.to_i]
                   kid_body = resolver.object_body(kid_ref)
-                  # Check if this kid is a page or another Pages node
-                  if kid_body && (kid_body.include?("/Type /Page") || kid_body.include?("/Type/Page") || (kid_body.include?("/Type") && kid_body.include?("/Page")))
+                  # Check if this kid is a page (not /Type/Pages)
+                  if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
                     page_objects << kid_ref
                   elsif kid_body && kid_body.include?("/Type /Pages")
                     # Recursively find pages in this Pages node
@@ -192,7 +192,7 @@ module AcroThat
                       kid_body[::Regexp.last_match(0)..].scan(/(\d+)\s+(\d+)\s+R/) do |n, g|
                         grandkid_ref = [n.to_i, g.to_i]
                         grandkid_body = resolver.object_body(grandkid_ref)
-                        if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body.include?("/Type/Page"))
+                        if grandkid_body && (grandkid_body.include?("/Type /Page") || grandkid_body =~ %r{/Type\s*/Page(?!s)\b})
                           page_objects << grandkid_ref
                         end
                       end

data/lib/acro_that/actions/add_signature_appearance.rb CHANGED

@@ -336,10 +336,10 @@ module AcroThat
       end
       def create_form_xobject(_obj_num, image_obj_num, field_width, field_height, _scale_factor, scaled_width,
-                               scaled_height)
+                              scaled_height)
         # Calculate offset to left-align the image horizontally and center vertically
-        offset_x = 0.0  # Left-aligned (no horizontal offset)
-        offset_y = (field_height - scaled_height) / 2.0  # Center vertically
+        offset_x = 0.0 # Left-aligned (no horizontal offset)
+        offset_y = (field_height - scaled_height) / 2.0 # Center vertically
         # PDF content stream that draws the image
         # q = save graphics state

data/lib/acro_that/document.rb CHANGED

@@ -71,62 +71,213 @@ module AcroThat
       self
     end
-    # Return an array of Field(name, value, type, ref)
-    def list_fields
-      fields = []
-      field_widgets = {}
-      widgets_by_name = {}
+    # Return an array of page information (page number, width, height, ref, metadata)
+    def list_pages
+      pages = []
+      page_objects = []
-      # First pass: collect widget information
-      @resolver.each_object do |ref, body|
-        next unless DictScan.is_widget?(body)
+      # Try to get pages in document order via page tree first
+      root_ref = @resolver.root_ref
+      if root_ref
+        catalog_body = @resolver.object_body(root_ref)
+        if catalog_body && catalog_body =~ %r{/Pages\s+(\d+)\s+(\d+)\s+R}
+          pages_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
-        # Extract position from widget
-        rect_tok = DictScan.value_token_after("/Rect", body)
-        next unless rect_tok && rect_tok.start_with?("[")
+          # Recursively collect pages from page tree
+          collect_pages_from_tree(pages_ref, page_objects)
+        end
+      end
-        # Parse [x y x+width y+height] format
-        rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
-        next unless rect_values.length == 4
+      # Fallback: collect all page objects if page tree didn't work
+      if page_objects.empty?
+        @resolver.each_object do |ref, body|
+          next unless body
-        x, y, x2, y2 = rect_values
-        width = x2 - x
-        height = y2 - y
+          # Match /Type /Page or /Type/Page but NOT /Type/Pages
+          is_page = body.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
+          next unless is_page
-        page_num = nil
-        if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
-          page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
-          page_num = find_page_number_for_ref(page_ref)
+          page_objects << ref unless page_objects.include?(ref)
         end
-        widget_info = {
-          x: x, y: y, width: width, height: height, page: page_num
-        }
+        # Sort by object number as fallback
+        page_objects.sort_by! { |ref| ref[0] }
+      end
-        if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
-          parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+      # Second pass: extract information from each page
+      page_objects.each_with_index do |ref, index|
+        body = @resolver.object_body(ref)
+        next unless body
+        # Extract MediaBox, CropBox, or ArtBox for dimensions
+        width = nil
+        height = nil
+        media_box = nil
+        crop_box = nil
+        art_box = nil
+        bleed_box = nil
+        trim_box = nil
+        # Try MediaBox first (most common)
+        if body =~ %r{/MediaBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            width = urx - llx
+            height = ury - lly
+            media_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
+        end
-          field_widgets[parent_ref] ||= []
-          field_widgets[parent_ref] << widget_info
+        # Try CropBox
+        if body =~ %r{/CropBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            crop_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
         end
-        next unless body.include?("/T")
+        # Try ArtBox
+        if body =~ %r{/ArtBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            art_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
+        end
-        t_tok = DictScan.value_token_after("/T", body)
-        if t_tok
-          widget_name = DictScan.decode_pdf_string(t_tok)
-          if widget_name && !widget_name.empty?
-            widgets_by_name[widget_name] ||= []
-            widgets_by_name[widget_name] << widget_info
+        # Try BleedBox
+        if body =~ %r{/BleedBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            bleed_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
+        end
+        # Try TrimBox
+        if body =~ %r{/TrimBox\s*\[(.*?)\]}
+          box_values = ::Regexp.last_match(1).scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+          if box_values.length == 4
+            llx, lly, urx, ury = box_values
+            trim_box = { llx: llx, lly: lly, urx: urx, ury: ury }
+          end
+        end
+        # Extract rotation
+        rotate = nil
+        if body =~ %r{/Rotate\s+(\d+)}
+          rotate = Integer(::Regexp.last_match(1))
+        end
+        # Extract Resources reference
+        resources_ref = nil
+        if body =~ %r{/Resources\s+(\d+)\s+(\d+)\s+R}
+          resources_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+        end
+        # Extract Parent reference
+        parent_ref = nil
+        if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+          parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+        end
+        # Extract Contents reference(s)
+        contents_refs = []
+        if body =~ %r{/Contents\s+(\d+)\s+(\d+)\s+R}
+          contents_refs << [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+        elsif body =~ %r{/Contents\s*\[(.*?)\]}
+          contents_array = ::Regexp.last_match(1)
+          contents_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
+            contents_refs << [num_str.to_i, gen_str.to_i]
           end
         end
+        # Build metadata hash
+        metadata = {
+          rotate: rotate,
+          media_box: media_box,
+          crop_box: crop_box,
+          art_box: art_box,
+          bleed_box: bleed_box,
+          trim_box: trim_box,
+          resources_ref: resources_ref,
+          parent_ref: parent_ref,
+          contents_refs: contents_refs
+        }
+        pages << {
+          page: index + 1, # Page number starting at 1
+          width: width,
+          height: height,
+          ref: ref,
+          metadata: metadata
+        }
       end
-      # Second pass: collect all fields (both field objects and widget annotations with /T)
+      pages
+    end
+    # Return an array of Field(name, value, type, ref)
+    def list_fields
+      fields = []
+      field_widgets = {}
+      widgets_by_name = {}
+      # First pass: collect widget information
       @resolver.each_object do |ref, body|
-        next unless body&.include?("/T")
+        next unless body
+        is_widget = DictScan.is_widget?(body)
+        # Collect widget information if this is a widget
+        if is_widget
+          # Extract position from widget
+          rect_tok = DictScan.value_token_after("/Rect", body)
+          if rect_tok && rect_tok.start_with?("[")
+            # Parse [x y x+width y+height] format
+            rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
+            if rect_values.length == 4
+              x, y, x2, y2 = rect_values
+              width = x2 - x
+              height = y2 - y
+              page_num = nil
+              if body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
+                page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+                page_num = find_page_number_for_ref(page_ref)
+              end
+              widget_info = {
+                x: x, y: y, width: width, height: height, page: page_num
+              }
+              if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+                parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
-        is_widget_field = DictScan.is_widget?(body)
+                field_widgets[parent_ref] ||= []
+                field_widgets[parent_ref] << widget_info
+              end
+              if body.include?("/T")
+                t_tok = DictScan.value_token_after("/T", body)
+                if t_tok
+                  widget_name = DictScan.decode_pdf_string(t_tok)
+                  if widget_name && !widget_name.empty?
+                    widgets_by_name[widget_name] ||= []
+                    widgets_by_name[widget_name] << widget_info
+                  end
+                end
+              end
+            end
+          end
+        end
+        # Second pass: collect all fields (both field objects and widget annotations with /T)
+        next unless body.include?("/T")
+        is_widget_field = is_widget
         hint = body.include?("/FT") || is_widget_field || body.include?("/Kids") || body.include?("/Parent")
         next unless hint
@@ -143,8 +294,7 @@ module AcroThat
         type = ft_tok
         position = {}
-        is_widget_annot = DictScan.is_widget?(body)
-        if is_widget_annot
+        if is_widget
           rect_tok = DictScan.value_token_after("/Rect", body)
           if rect_tok && rect_tok.start_with?("[")
             rect_values = rect_tok.scan(/[-+]?\d*\.?\d+/).map(&:to_f)
@@ -270,8 +420,261 @@ module AcroThat
       field.remove
     end
+    # Clean up the PDF by removing unwanted fields.
+    # Options:
+    #   - keep_fields: Array of field names to keep (all others removed)
+    #   - remove_fields: Array of field names to remove
+    #   - remove_pattern: Regex pattern - fields matching this are removed
+    #   - block: Given field name, return true to keep, false to remove
+    # This rewrites the entire PDF (like flatten) but excludes the unwanted fields.
+    def clear(keep_fields: nil, remove_fields: nil, remove_pattern: nil)
+      root_ref = @resolver.root_ref
+      raise "Cannot clear: no /Root found" unless root_ref
+      # Build a set of fields to remove
+      fields_to_remove = Set.new
+      # Get all current fields
+      all_fields = list_fields
+      if block_given?
+        # Use block to determine which fields to keep
+        all_fields.each do |field|
+          fields_to_remove.add(field.name) unless yield(field.name)
+        end
+      elsif keep_fields
+        # Keep only specified fields
+        keep_set = Set.new(keep_fields.map(&:to_s))
+        all_fields.each do |field|
+          fields_to_remove.add(field.name) unless keep_set.include?(field.name)
+        end
+      elsif remove_fields
+        # Remove specified fields
+        remove_set = Set.new(remove_fields.map(&:to_s))
+        all_fields.each do |field|
+          fields_to_remove.add(field.name) if remove_set.include?(field.name)
+        end
+      elsif remove_pattern
+        # Remove fields matching pattern
+        all_fields.each do |field|
+          fields_to_remove.add(field.name) if field.name =~ remove_pattern
+        end
+      else
+        # No criteria specified, return original
+        return @raw
+      end
+      # Build sets of refs to exclude
+      field_refs_to_remove = Set.new
+      widget_refs_to_remove = Set.new
+      all_fields.each do |field|
+        next unless fields_to_remove.include?(field.name)
+        field_refs_to_remove.add(field.ref) if field.valid_ref?
+        # Find all widget annotations for this field
+        @resolver.each_object do |widget_ref, body|
+          next unless body && DictScan.is_widget?(body)
+          next if widget_ref == field.ref
+          # Match by /Parent reference
+          if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+            widget_parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+            if widget_parent_ref == field.ref
+              widget_refs_to_remove.add(widget_ref)
+              next
+            end
+          end
+          # Also match by field name (/T)
+          next unless body.include?("/T")
+          t_tok = DictScan.value_token_after("/T", body)
+          next unless t_tok
+          widget_name = DictScan.decode_pdf_string(t_tok)
+          if widget_name && widget_name == field.name
+            widget_refs_to_remove.add(widget_ref)
+          end
+        end
+      end
+      # Collect objects to write (excluding removed fields and widgets)
+      objects = []
+      @resolver.each_object do |ref, body|
+        next if field_refs_to_remove.include?(ref)
+        next if widget_refs_to_remove.include?(ref)
+        next unless body
+        objects << { ref: ref, body: body }
+      end
+      # Process AcroForm to remove field references from /Fields array
+      af_ref = acroform_ref
+      if af_ref
+        # Find the AcroForm object in our objects list
+        af_obj = objects.find { |o| o[:ref] == af_ref }
+        if af_obj
+          af_body = af_obj[:body]
+          fields_array_ref = DictScan.value_token_after("/Fields", af_body)
+          if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
+            # /Fields points to separate array object
+            arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+            arr_obj = objects.find { |o| o[:ref] == arr_ref }
+            if arr_obj
+              arr_body = arr_obj[:body]
+              field_refs_to_remove.each do |field_ref|
+                arr_body = DictScan.remove_ref_from_array(arr_body, field_ref)
+              end
+              # Clean up empty array
+              arr_body = arr_body.strip.gsub(/\[\s+\]/, "[]")
+              arr_obj[:body] = arr_body
+            end
+          elsif af_body.include?("/Fields")
+            # Inline /Fields array
+            field_refs_to_remove.each do |field_ref|
+              af_body = DictScan.remove_ref_from_inline_array(af_body, "/Fields", field_ref)
+            end
+            af_obj[:body] = af_body
+          end
+        end
+      end
+      # Process page objects to remove widget references from /Annots arrays
+      # Also remove any orphaned widget references (widgets that reference non-existent fields)
+      objects_in_file = Set.new(objects.map { |o| o[:ref] })
+      field_refs_in_file = Set.new
+      objects.each do |obj|
+        body = obj[:body]
+        # Check if this is a field object
+        if body&.include?("/FT") && body.include?("/T")
+          field_refs_in_file.add(obj[:ref])
+        end
+        body = obj[:body]
+        # Match /Type /Page or /Type/Page but NOT /Type/Pages
+        next unless body&.include?("/Type /Page") || body =~ %r{/Type\s*/Page(?!s)\b}
+        # Handle inline /Annots array
+        if body =~ %r{/Annots\s*\[(.*?)\]}
+          annots_array_str = ::Regexp.last_match(1)
+          # Remove widgets that match removed fields
+          widget_refs_to_remove.each do |widget_ref|
+            annots_array_str = annots_array_str.gsub(/\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/, "").strip
+            annots_array_str = annots_array_str.gsub(/\s+/, " ")
+          end
+          # Also remove orphaned widget references (widgets not in objects_in_file or pointing to non-existent fields)
+          annots_refs = annots_array_str.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
+          annots_refs.each do |annot_ref|
+            # Check if this annotation is a widget that should be removed
+            if objects_in_file.include?(annot_ref)
+              # Widget exists - check if it's an orphaned widget (references non-existent field)
+              widget_obj = objects.find { |o| o[:ref] == annot_ref }
+              if widget_obj && DictScan.is_widget?(widget_obj[:body])
+                widget_body = widget_obj[:body]
+                # Check if widget references a parent field that doesn't exist
+                if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+                  parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+                  unless field_refs_in_file.include?(parent_ref)
+                    # Parent field doesn't exist - orphaned widget, remove it
+                    annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
+                    annots_array_str = annots_array_str.gsub(/\s+/, " ")
+                  end
+                end
+              end
+            else
+              # Widget object doesn't exist - remove it
+              annots_array_str = annots_array_str.gsub(/\b#{annot_ref[0]}\s+#{annot_ref[1]}\s+R\b/, "").strip
+              annots_array_str = annots_array_str.gsub(/\s+/, " ")
+            end
+          end
+          new_annots = if annots_array_str.empty? || annots_array_str.strip.empty?
+                         "[]"
+                       else
+                         "[#{annots_array_str}]"
+                       end
+          new_body = body.sub(%r{/Annots\s*\[.*?\]}, "/Annots #{new_annots}")
+          obj[:body] = new_body
+        # Handle indirect /Annots array reference
+        elsif body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+          annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+          annots_obj = objects.find { |o| o[:ref] == annots_array_ref }
+          if annots_obj
+            annots_body = annots_obj[:body]
+            # Remove widgets that match removed fields
+            widget_refs_to_remove.each do |widget_ref|
+              annots_body = DictScan.remove_ref_from_array(annots_body, widget_ref)
+            end
+            # Also remove orphaned widget references
+            annots_refs = annots_body.scan(/(\d+)\s+(\d+)\s+R/).map { |n, g| [Integer(n), Integer(g)] }
+            annots_refs.each do |annot_ref|
+              if objects_in_file.include?(annot_ref)
+                widget_obj = objects.find { |o| o[:ref] == annot_ref }
+                if widget_obj && DictScan.is_widget?(widget_obj[:body])
+                  widget_body = widget_obj[:body]
+                  if widget_body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+                    parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+                    unless field_refs_in_file.include?(parent_ref)
+                      annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
+                    end
+                  end
+                end
+              else
+                annots_body = DictScan.remove_ref_from_array(annots_body, annot_ref)
+              end
+            end
+            annots_obj[:body] = annots_body
+          end
+        end
+      end
+      # Sort objects by object number
+      objects.sort_by! { |obj| obj[:ref][0] }
+      # Write the cleaned PDF
+      writer = PDFWriter.new
+      writer.write_header
+      objects.each do |obj|
+        writer.write_object(obj[:ref], obj[:body])
+      end
+      writer.write_xref
+      trailer_dict = @resolver.trailer_dict
+      info_ref = nil
+      if trailer_dict =~ %r{/Info\s+(\d+)\s+(\d+)\s+R}
+        info_ref = [::Regexp.last_match(1).to_i, ::Regexp.last_match(2).to_i]
+      end
+      # Write trailer
+      max_obj_num = objects.map { |obj| obj[:ref][0] }.max || 0
+      writer.write_trailer(max_obj_num + 1, root_ref, info_ref)
+      writer.output
+    end
+    # Clean up in-place (mutates current instance)
+    def clear!(...)
+      cleaned_content = clear(...)
+      @raw = cleaned_content
+      @resolver = AcroThat::ObjectResolver.new(cleaned_content)
+      @patches = []
+      self
+    end
     # Write out with an incremental update
-    def write(path_out = nil, flatten: false)
+    def write(path_out = nil, flatten: true)
       deduped_patches = @patches.reverse.uniq { |p| p[:ref] }.reverse
       writer = AcroThat::IncrementalWriter.new(@raw, deduped_patches)
       @raw = writer.render
@@ -290,6 +693,29 @@ module AcroThat
     private
+    def collect_pages_from_tree(pages_ref, page_objects)
+      pages_body = @resolver.object_body(pages_ref)
+      return unless pages_body
+      # Extract /Kids array from Pages object
+      if pages_body =~ %r{/Kids\s*\[(.*?)\]}m
+        kids_array = ::Regexp.last_match(1)
+        # Extract all object references from Kids array in order
+        kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
+          kid_ref = [num_str.to_i, gen_str.to_i]
+          kid_body = @resolver.object_body(kid_ref)
+          # Check if this kid is a page (not /Type/Pages)
+          if kid_body && (kid_body.include?("/Type /Page") || kid_body =~ %r{/Type\s*/Page(?!s)\b})
+            page_objects << kid_ref unless page_objects.include?(kid_ref)
+          elsif kid_body && kid_body.include?("/Type /Pages")
+            # Recursively find pages in this Pages node
+            collect_pages_from_tree(kid_ref, page_objects)
+          end
+        end
+      end
+    end
     def find_page_number_for_ref(page_ref)
       page_objects = []
       @resolver.each_object do |ref, body|
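
Note on the new `clear` method in this file: the if/elsif chain in the diff implies a precedence among its criteria: a block wins over `keep_fields`, which wins over `remove_fields`, which wins over `remove_pattern`. The following is a minimal sketch of that selection step only, not the gem's implementation. `Field` here is a hypothetical stand-in struct and `names_to_remove` is a hypothetical helper name. The PDF rewriting itself is left out.

```ruby
require "set"

# Hypothetical stand-in for the gem's field objects; only `name` is used here.
Field = Struct.new(:name)

# Returns the Set of field names that would be removed, or nil when no
# criterion is given (the diff's `clear` returns the raw PDF in that case).
# Hypothetical helper: it mirrors the if/elsif order shown in the diff.
def names_to_remove(fields, keep_fields: nil, remove_fields: nil, remove_pattern: nil)
  names = fields.map(&:name)
  if block_given?
    # The block returns true to keep a field, so reject the kept names.
    Set.new(names.reject { |n| yield(n) })
  elsif keep_fields
    keep = Set.new(keep_fields.map(&:to_s))
    Set.new(names.reject { |n| keep.include?(n) })
  elsif remove_fields
    remove = Set.new(remove_fields.map(&:to_s))
    Set.new(names.select { |n| remove.include?(n) })
  elsif remove_pattern
    Set.new(names.select { |n| n =~ remove_pattern })
  end
end

fields = %w[first_name last_name temp_1 temp_2].map { |n| Field.new(n) }

p names_to_remove(fields, keep_fields: [:first_name]).to_a
# prints ["last_name", "temp_1", "temp_2"]
p names_to_remove(fields, remove_pattern: /\Atemp_/).to_a
# prints ["temp_1", "temp_2"]
```

Because of this ordering, passing both a block and a keyword criterion silently ignores the keyword, so callers are probably best served by supplying exactly one criterion per call.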

data/lib/acro_that/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module AcroThat
-  VERSION = "0.1.1"
+  VERSION = "0.1.2"
 end

data/lib/acro_that.rb CHANGED

@@ -4,6 +4,7 @@ require "strscan"
 require "stringio"
 require "zlib"
 require "base64"
+require "set"
 require_relative "acro_that/dict_scan"
 require_relative "acro_that/object_resolver"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: acro_that
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.1.2
 platform: ruby
 authors:
 - Michael Wynkoop
@@ -98,9 +98,12 @@ files:
 - Rakefile
 - acro_that.gemspec
 - docs/README.md
+- docs/clear_fields.md
 - docs/dict_scan_explained.md
 - docs/object_streams.md
 - docs/pdf_structure.md
+- issues/README.md
+- issues/refactoring-opportunities.md
 - lib/acro_that.rb
 - lib/acro_that/actions/add_field.rb
 - lib/acro_that/actions/add_signature_appearance.rb