RubyGems - corp_pdf - Versions diffs - 1.0.5 - Mend

corp_pdf 1.0.5

Files changed (40)

checksums.yaml +7 -0
data/.gitignore +13 -0
data/.rubocop.yml +78 -0
data/CHANGELOG.md +122 -0
data/Gemfile +5 -0
data/Gemfile.lock +90 -0
data/README.md +518 -0
data/Rakefile +18 -0
data/corp_pdf.gemspec +35 -0
data/docs/README.md +111 -0
data/docs/clear_fields.md +202 -0
data/docs/dict_scan_explained.md +341 -0
data/docs/object_streams.md +311 -0
data/docs/pdf_structure.md +251 -0
data/issues/README.md +59 -0
data/issues/memory-benchmark-results.md +551 -0
data/issues/memory-improvements.md +388 -0
data/issues/memory-optimization-summary.md +204 -0
data/issues/refactoring-opportunities.md +259 -0
data/lib/corp_pdf/actions/add_field.rb +73 -0
data/lib/corp_pdf/actions/base.rb +48 -0
data/lib/corp_pdf/actions/remove_field.rb +154 -0
data/lib/corp_pdf/actions/update_field.rb +663 -0
data/lib/corp_pdf/dict_scan.rb +523 -0
data/lib/corp_pdf/document.rb +782 -0
data/lib/corp_pdf/field.rb +145 -0
data/lib/corp_pdf/fields/base.rb +384 -0
data/lib/corp_pdf/fields/checkbox.rb +164 -0
data/lib/corp_pdf/fields/radio.rb +220 -0
data/lib/corp_pdf/fields/signature.rb +393 -0
data/lib/corp_pdf/fields/text.rb +31 -0
data/lib/corp_pdf/incremental_writer.rb +245 -0
data/lib/corp_pdf/object_resolver.rb +381 -0
data/lib/corp_pdf/objstm.rb +75 -0
data/lib/corp_pdf/page.rb +90 -0
data/lib/corp_pdf/pdf_writer.rb +133 -0
data/lib/corp_pdf/version.rb +5 -0
data/lib/corp_pdf.rb +35 -0
data/publish +183 -0
metadata +169 -0

data/issues/refactoring-opportunities.md ADDED

@@ -0,0 +1,259 @@
+# Refactoring Opportunities
+This document identifies code duplication and unused methods that could be refactored to improve maintainability.
+## 1. Duplicated Page-Finding Logic ✅ **COMPLETED**
+### Status
+**RESOLVED** - This refactoring has been completed:
+- ✅ `DictScan.is_page?(body)` exists (line 320 in dict_scan.rb)
+- ✅ `Document#find_all_pages` exists (line 693 in document.rb)
+- ✅ `Document#find_page_by_number(page_num)` exists (line 725 in document.rb)
+- ✅ `Base#find_page_by_number` delegates to Document
+- ✅ `AddField#find_page_ref` now uses `find_page_by_number` (line 288)
+### Original Issue
+Multiple methods had similar logic for finding page objects in a PDF document.
+### Resolution
+All page-finding logic has been unified into `DictScan.is_page?` and `Document#find_all_pages` / `find_page_by_number`.
+---
+## 2. Duplicated Widget-Matching Logic
+### Issue
+Multiple methods have similar logic for finding widgets that belong to a field. Widgets can be matched by:
+1. `/Parent` reference pointing to the field
+2. `/T` (field name) matching the field name
+### Locations
+- `Document#list_fields` (lines 222-327) - Finds widgets and matches them to fields
+- `Document#clear` (lines 472-495) - Finds widgets for removed fields
+- `UpdateField#update_widget_annotations_for_field` (lines 220-247) - Finds widgets by /Parent
+- `UpdateField#update_widget_names_for_field` (lines 249-280) - Finds widgets by /Parent and /T
+- `RemoveField#remove_widget_annotations_from_pages` (lines 55-103) - Finds widgets by /Parent and /T
+- `AddSignatureAppearance#find_widget_annotation` (lines 164-206) - Finds widgets by /Parent
+### Pattern
+The pattern of checking `/Parent` reference and matching by `/T` field name is repeated throughout.
+### Suggested Refactor
+Create utility methods in `Base` or a new `WidgetMatcher` module:
+- `find_widgets_by_parent(field_ref)` - Find widgets with /Parent pointing to field_ref
+- `find_widgets_by_name(field_name)` - Find widgets with /T matching field_name
+- `find_widgets_for_field(field_ref, field_name)` - Find all widgets for a field (by parent or name)
+### Benefits
+- Centralized widget matching logic
+- Consistent widget finding behavior
+- Easier to extend matching criteria
+---
+## 3. Duplicated /Annots Array Manipulation
+### Issue
+Multiple methods handle adding or removing widget references from page `/Annots` arrays. The logic needs to handle:
+1. Inline `/Annots` arrays: `/Annots [...]`
+2. Indirect `/Annots` arrays: `/Annots X Y R` (reference to separate array object)
+### Locations
+- `AddField#add_widget_to_page` (lines 213-275) - Adds widget to /Annots
+- `RemoveField#remove_widget_from_page_annots` (lines 125-155) - Removes widget from /Annots
+- `Document#clear` (lines 555-633) - Removes widgets from /Annots during cleanup
+### Pattern
+All three methods have similar conditional logic:
+```ruby
+if page_body =~ %r{/Annots\s*\[(.*?)\]}m
+  # Handle inline array
+elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+  # Handle indirect array
+else
+  # Create new /Annots array
+end
+```
+### Suggested Refactor
+Extend `DictScan` with methods:
+- `DictScan.add_to_annots_array(page_body, widget_ref)` - Unified method to add widget to /Annots
+- `DictScan.remove_from_annots_array(page_body, widget_ref)` - Unified method to remove widget from /Annots
+- `DictScan.get_annots_array(page_body)` - Extract /Annots array (handles both inline and indirect)
+### Benefits
+- Single implementation of /Annots manipulation logic
+- Consistent handling of edge cases
+- Easier to test /Annots operations
+---
+## 4. Duplicated Box Parsing Logic ✅ **COMPLETED**
+### Status
+**RESOLVED** - This refactoring has been completed:
+- ✅ `DictScan.parse_box(body, box_type)` exists (line 340 in dict_scan.rb)
+- ✅ `Document#list_pages` now uses `parse_box` for all box types (lines 89-99 in document.rb)
+### Original Issue
+`Document#list_pages` had repeated code blocks for parsing different box types (MediaBox, CropBox, ArtBox, BleedBox, TrimBox).
+### Resolution
+Extracted the common box parsing logic into the `DictScan.parse_box` helper method. All box type parsing in `Document#list_pages` now uses this shared method, reducing the duplicated code from ~45 lines to ~10 lines while maintaining existing functionality.
+---
+## 5. Duplicated next_fresh_object_number Implementation
+### Issue
+The `next_fresh_object_number` method is implemented identically in two places.
+### Locations
+- `Document#next_fresh_object_number` (lines 745-754)
+- `Base#next_fresh_object_number` (lines 28-37)
+### Pattern
+Both methods have identical implementations. However, `Document` doesn't include `Base`, so each currently has to exist independently.
+### Suggested Refactor
+- Consider whether `Document` should use `Base`'s implementation via delegation
+- Or: Keep both implementations if Document needs independent access
+### Benefits
+- Single implementation
+- Consistent object numbering logic
+### Note
+This may be intentional since `Document` doesn't include `Base` - both classes need this functionality independently.
+---
+## 6. Unused Methods
+### Issue
+Some methods are defined but never called.
+### Locations
+- `AddSignatureAppearance#get_widget_rect_dimensions` (lines 218-223)
+  - Defined but never used
+  - `extract_rect` is used instead, which provides the same information
+### Suggested Refactor
+- Remove `get_widget_rect_dimensions` if it's truly unused
+- Or: Verify if it was intended for future use and document it
+### Benefits
+- Cleaner codebase
+- Less confusion about which method to use
+---
+## 7. Duplicated Base64 Decoding Logic
+### Issue
+`AddSignatureAppearance` has two similar methods for decoding base64 data.
+### Locations
+- `AddSignatureAppearance#decode_base64_data_uri` (lines 101-106)
+- `AddSignatureAppearance#decode_base64_if_needed` (lines 108-119)
+### Pattern
+Both methods handle base64 decoding with slightly different logic and could potentially be unified.
+### Suggested Refactor
+- Consider merging into a single method that handles both cases
+- Or: Document the distinction if both are needed
+### Benefits
+- Simpler API
+- Less code duplication
+---
+## 8. Duplicated Regex Pattern for Object Reference
+### Issue
+The pattern for extracting object references `(\d+)\s+(\d+)\s+R` appears in many places.
+### Locations
+Throughout the codebase, used in:
+- Extracting `/Parent` references
+- Extracting `/P` (page) references
+- Extracting `/Pages` references
+- Extracting `/Fields` array references
+- And many more...
+### Suggested Refactor
+Create a utility method:
+```ruby
+def DictScan.extract_object_ref(str)
+  # Extract object reference from string
+  # Returns [obj_num, gen_num] or nil
+end
+```
+### Benefits
+- Consistent reference extraction
+- Easier to update if PDF reference format changes
+- More readable code
+---
+## Priority Recommendations
+### High Priority
+1. **Widget Matching Logic (#2)** - Most duplicated, used in many critical operations
+2. **/Annots Array Manipulation (#3)** - Complex logic that's error-prone when duplicated
+### Low Priority
+3. **next_fresh_object_number (#5)** - Simple duplication (may be intentional)
+4. **Object Reference Extraction (#8)** - Could improve consistency
+5. **Unused Methods (#6)** - Cleanup task (`get_widget_rect_dimensions`)
+6. **Base64 Decoding (#7)** - Minor duplication
+### Completed ✅
+- **Page-Finding Logic (#1)** - Successfully refactored into `DictScan.is_page?` and unified page-finding methods
+- **Checkbox Appearance Creation (#9)** - Extracted common Form XObject building logic into the `build_form_xobject` helper method
+- **Box Parsing Logic (#4)** - Extracted common box parsing logic into the `DictScan.parse_box` helper method
+- **PDF Metadata Formatting (#10)** - Moved `format_pdf_key` and `format_pdf_value` to `DictScan` module as shared utilities
+---
+## 9. Duplicated Checkbox Appearance Creation Logic ✅ **COMPLETED**
+### Status
+**RESOLVED** - This refactoring has been completed:
+- ✅ `AddField#build_form_xobject` exists (line 472 in add_field.rb)
+- ✅ `AddField#create_checkbox_yes_appearance` now uses `build_form_xobject` (line 458)
+- ✅ `AddField#create_checkbox_off_appearance` now uses `build_form_xobject` (line 469)
+### Original Issue
+The `create_checkbox_yes_appearance` and `create_checkbox_off_appearance` methods had duplicated Form XObject dictionary building logic.
+### Resolution
+Extracted the common Form XObject dictionary building logic into the `build_form_xobject` helper method. Both checkbox appearance methods now use this shared method, reducing duplication while maintaining existing functionality.
+---
+## 10. PDF Metadata Formatting Methods Could Be Shared ✅ **COMPLETED**
+### Status
+**RESOLVED** - This refactoring has been completed:
+- ✅ `DictScan.format_pdf_key(key)` exists (line 134 in dict_scan.rb)
+- ✅ `DictScan.format_pdf_value(value)` exists (line 140 in dict_scan.rb)
+- ✅ `AddField` now uses `DictScan.format_pdf_key` and `DictScan.format_pdf_value` (lines 145-146, 195-196)
+### Original Issue
+The `format_pdf_key` and `format_pdf_value` methods in `AddField` were useful utility functions that could be shared across the codebase.
+### Resolution
+Moved `format_pdf_key` and `format_pdf_value` from `AddField` to the `DictScan` module as module functions. This makes them reusable throughout the codebase and provides a single source of truth for PDF formatting rules. `AddField` now uses these shared utilities, maintaining existing functionality while improving code reusability.
+---
+## Notes
+- All refactoring should be accompanied by tests to ensure behavior doesn't change
+- Consider backward compatibility if any methods are moved between modules
+- Some duplication may be intentional for performance reasons (to avoid method call overhead), so evaluate before refactoring
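
Section 8 of the document above proposes `DictScan.extract_object_ref(str)` but leaves its body empty. One way it might be filled in is sketched below. This is an illustration only: the `RefSketch` module, the `extract_ref_after` helper, and the `REF_PATTERN` constant are hypothetical names and are not part of corp_pdf.

```ruby
# Hypothetical sketch of the reference-extraction helper suggested in section 8.
# None of these names exist in corp_pdf; they are assumptions for illustration.
module RefSketch
  # Matches an indirect reference such as "12 0 R".
  REF_PATTERN = /(\d+)\s+(\d+)\s+R\b/

  module_function

  # Returns [obj_num, gen_num] for the first indirect reference in str, or nil.
  def extract_object_ref(str)
    match = REF_PATTERN.match(str.to_s)
    match && [match[1].to_i, match[2].to_i]
  end

  # Returns the reference that directly follows a dictionary key
  # (for example "/Parent" or "/P"), or nil when the key has no reference value.
  def extract_ref_after(key, body)
    match = /#{Regexp.escape(key)}\s+(\d+)\s+(\d+)\s+R\b/.match(body.to_s)
    match && [match[1].to_i, match[2].to_i]
  end
end

widget = "<< /Type /Annot /Subtype /Widget /Parent 3 0 R /P 9 0 R >>"
parent_ref = RefSketch.extract_ref_after("/Parent", widget)
page_ref = RefSketch.extract_ref_after("/P", widget)
puts "parent=#{parent_ref.inspect} page=#{page_ref.inspect}"
```

Requiring whitespace after the key keeps `/P` from matching inside `/Parent`, which is the kind of detail that is easy to get wrong when the pattern is retyped at each call site.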

data/lib/corp_pdf/actions/add_field.rb ADDED

@@ -0,0 +1,73 @@
+# frozen_string_literal: true
+module CorpPdf
+  module Actions
+    # Action to add a new field to a PDF document
+    # Delegates to field-specific classes for actual field creation
+    class AddField
+      include Base
+      attr_reader :field_obj_num, :field_type, :field_value
+      def initialize(document, name, options = {})
+        @document = document
+        @name = name
+        @options = normalize_hash_keys(options)
+        @metadata = normalize_hash_keys(@options[:metadata] || {})
+      end
+      def call
+        type_input = @options[:type] || "/Tx"
+        @options[:group_id]
+        # Auto-set radio button flags if type is :radio and flags not explicitly set
+        # MUST set this BEFORE creating the field handler so it gets passed correctly
+        if [:radio, "radio"].include?(type_input) && !@metadata[:Ff]
+          @metadata[:Ff] = 49_152
+        end
+        # Determine field type and create appropriate field handler
+        field_handler = create_field_handler(type_input)
+        # Call the field handler
+        field_handler.call
+        # Store field_obj_num from handler for compatibility
+        @field_obj_num = field_handler.field_obj_num
+        @field_type = field_handler.field_type
+        @field_value = field_handler.field_value
+        true
+      end
+      private
+      def normalize_hash_keys(hash)
+        return hash unless hash.is_a?(Hash)
+        hash.each_with_object({}) do |(key, value), normalized|
+          sym_key = key.is_a?(Symbol) ? key : key.to_sym
+          normalized[sym_key] = value.is_a?(Hash) ? normalize_hash_keys(value) : value
+        end
+      end
+      def create_field_handler(type_input)
+        is_radio = [:radio, "radio"].include?(type_input)
+        group_id = @options[:group_id]
+        is_button = [:button, "button", "/Btn", "/btn"].include?(type_input)
+        if is_radio && group_id
+          CorpPdf::Fields::Radio.new(@document, @name, @options.merge(metadata: @metadata))
+        elsif [:signature, "signature", "/Sig"].include?(type_input)
+          CorpPdf::Fields::Signature.new(@document, @name, @options.merge(metadata: @metadata))
+        elsif [:checkbox, "checkbox"].include?(type_input) || is_button
+          # :button type maps to /Btn which are checkboxes by default (unless radio flag is set)
+          CorpPdf::Fields::Checkbox.new(@document, @name, @options.merge(metadata: @metadata))
+        else
+          # Default to text field
+          CorpPdf::Fields::Text.new(@document, @name, @options.merge(metadata: @metadata))
+        end
+      end
+    end
+  end
+end
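
The value `49_152` that `AddField#call` assigns to `/Ff` for radio fields is the sum of two button field flags defined by the PDF specification: NoToggleToOff (bit position 15, value 16384) and Radio (bit position 16, value 32768). The sketch below decodes such a value. The constant and method names are hypothetical and do not appear in corp_pdf.

```ruby
# Button field flag bits as defined by the PDF specification.
# Bit positions are 1-based in the spec, so position 15 is 1 << 14.
# The constant and method names here are assumptions for illustration.
NO_TOGGLE_TO_OFF = 1 << 14 # bit position 15
RADIO            = 1 << 15 # bit position 16
PUSHBUTTON       = 1 << 16 # bit position 17

# Returns the names of the button flags that are set in an /Ff value.
def button_flag_names(ff)
  names = []
  names << :no_toggle_to_off if (ff & NO_TOGGLE_TO_OFF) != 0
  names << :radio if (ff & RADIO) != 0
  names << :pushbutton if (ff & PUSHBUTTON) != 0
  names
end

puts button_flag_names(49_152).inspect
```

With neither Radio nor Pushbutton set, a `/Btn` field is a checkbox, which matches the comment in `create_field_handler` about `:button` mapping to checkboxes by default.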

data/lib/corp_pdf/actions/base.rb ADDED

@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+module CorpPdf
+  module Actions
+    module Base
+      def resolver
+        @document.instance_variable_get(:@resolver)
+      end
+      def patches
+        @document.instance_variable_get(:@patches)
+      end
+      def get_object_body_with_patch(ref)
+        body = resolver.object_body(ref)
+        existing_patch = patches.find { |p| p[:ref] == ref }
+        existing_patch ? existing_patch[:body] : body
+      end
+      def apply_patch(ref, body, original_body = nil)
+        original_body ||= resolver.object_body(ref)
+        return if body == original_body
+        patches.reject! { |p| p[:ref] == ref }
+        patches << { ref: ref, body: body }
+      end
+      def next_fresh_object_number
+        max_obj_num = 0
+        resolver.each_object do |ref, _|
+          max_obj_num = [max_obj_num, ref[0]].max
+        end
+        patches.each do |p|
+          max_obj_num = [max_obj_num, p[:ref][0]].max
+        end
+        max_obj_num + 1
+      end
+      def acroform_ref
+        @document.send(:acroform_ref)
+      end
+      def find_page_by_number(page_num)
+        @document.send(:find_page_by_number, page_num)
+      end
+    end
+  end
+end
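
The `Base` module above layers pending patches over the resolver: reads prefer a patched body, writes replace any earlier patch for the same reference, and new object numbers are allocated past both sets. A self-contained sketch of that pattern follows. `SketchResolver` and `PatchOverlay` are hypothetical stand-ins, and the no-op check is simplified to compare against the current effective body rather than taking an `original_body` argument.

```ruby
# Hypothetical stand-in for the resolver: maps [obj_num, gen_num] => body.
class SketchResolver
  def initialize(objects)
    @objects = objects
  end

  def object_body(ref)
    @objects[ref]
  end

  def each_object(&block)
    @objects.each(&block)
  end
end

# Hypothetical overlay that mirrors the read/patch/allocate helpers in Base.
class PatchOverlay
  attr_reader :patches

  def initialize(resolver)
    @resolver = resolver
    @patches = []
  end

  # Prefer the pending patch for ref, falling back to the stored body.
  def body_for(ref)
    patch = @patches.find { |p| p[:ref] == ref }
    patch ? patch[:body] : @resolver.object_body(ref)
  end

  # Record a patch unless it would not change the effective body.
  def apply(ref, body)
    return if body == body_for(ref)

    @patches.reject! { |p| p[:ref] == ref }
    @patches << { ref: ref, body: body }
  end

  # One more than the highest object number seen in storage or in patches.
  def next_fresh_object_number
    nums = []
    @resolver.each_object { |ref, _body| nums << ref[0] }
    @patches.each { |p| nums << p[:ref][0] }
    (nums.max || 0) + 1
  end
end

overlay = PatchOverlay.new(SketchResolver.new({ [1, 0] => "<< /A 1 >>", [4, 0] => "<< /B 2 >>" }))
overlay.apply([4, 0], "<< /B 3 >>")
puts overlay.body_for([4, 0])
puts overlay.next_fresh_object_number
```

Keeping at most one patch per reference means a later incremental write only needs to emit each changed object once.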

data/lib/corp_pdf/actions/remove_field.rb ADDED

@@ -0,0 +1,154 @@
+# frozen_string_literal: true
+module CorpPdf
+  module Actions
+    # Action to remove a field from a PDF document
+    class RemoveField
+      include Base
+      def initialize(document, field)
+        @document = document
+        @field = field
+      end
+      def call
+        af_ref = acroform_ref
+        return false unless af_ref
+        # Step 1: Remove widget annotations from pages' /Annots arrays
+        remove_widget_annotations_from_pages
+        # Step 2: Remove from /Fields array
+        remove_from_fields_array(af_ref)
+        # Step 3: Mark the field object as deleted by setting /T to empty
+        mark_field_deleted
+        true
+      end
+      private
+      def remove_from_fields_array(af_ref)
+        af_body = get_object_body_with_patch(af_ref)
+        fields_array_ref = DictScan.value_token_after("/Fields", af_body)
+        if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
+          arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+          arr_body = get_object_body_with_patch(arr_ref)
+          filtered = DictScan.remove_ref_from_array(arr_body, @field.ref)
+          apply_patch(arr_ref, filtered, arr_body)
+        else
+          filtered_af = DictScan.remove_ref_from_inline_array(af_body, "/Fields", @field.ref)
+          apply_patch(af_ref, filtered_af, af_body) if filtered_af
+        end
+      end
+      def mark_field_deleted
+        fld_body = get_object_body_with_patch(@field.ref)
+        return unless fld_body
+        deleted_body = DictScan.replace_key_value(fld_body, "/T", "()")
+        apply_patch(@field.ref, deleted_body, fld_body)
+      end
+      def remove_widget_annotations_from_pages
+        widget_refs_to_remove = []
+        field_body = get_object_body_with_patch(@field.ref)
+        if field_body && DictScan.is_widget?(field_body)
+          widget_refs_to_remove << @field.ref
+        end
+        resolver.each_object do |widget_ref, body|
+          next unless body
+          next if widget_ref == @field.ref
+          next unless DictScan.is_widget?(body)
+          body = get_object_body_with_patch(widget_ref)
+          # Match by /Parent reference
+          if body =~ %r{/Parent\s+(\d+)\s+(\d+)\s+R}
+            widget_parent_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+            if widget_parent_ref == @field.ref
+              widget_refs_to_remove << widget_ref
+              next
+            end
+          end
+          # Also match by field name (/T) - some widgets might not have /Parent
+          next unless body.include?("/T") && @field.name
+          t_tok = DictScan.value_token_after("/T", body)
+          next unless t_tok
+          widget_name = DictScan.decode_pdf_string(t_tok)
+          if widget_name && widget_name == @field.name
+            widget_refs_to_remove << widget_ref
+          end
+        end
+        return if widget_refs_to_remove.empty?
+        widget_refs_to_remove.each do |widget_ref|
+          widget_body = get_object_body_with_patch(widget_ref)
+          if widget_body && widget_body =~ %r{/P\s+(\d+)\s+(\d+)\s+R}
+            page_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+            remove_widget_from_page_annots(page_ref, widget_ref)
+          else
+            find_and_remove_widget_from_all_pages(widget_ref)
+          end
+        end
+      end
+      def find_and_remove_widget_from_all_pages(widget_ref)
+        # Find all page objects and check their /Annots arrays
+        page_objects = []
+        resolver.each_object do |ref, body|
+          next unless body
+          next unless DictScan.is_page?(body)
+          page_objects << ref
+        end
+        # Check each page's /Annots array
+        page_objects.each do |page_ref|
+          remove_widget_from_page_annots(page_ref, widget_ref)
+        end
+      end
+      def remove_widget_from_page_annots(page_ref, widget_ref)
+        page_body = get_object_body_with_patch(page_ref)
+        return unless page_body
+        # Handle inline /Annots array
+        if page_body =~ %r{/Annots\s*\[(.*?)\]}m
+          annots_array_str = ::Regexp.last_match(1)
+          # Remove the widget reference from the array
+          filtered_array = annots_array_str.gsub(/\b#{widget_ref[0]}\s+#{widget_ref[1]}\s+R\b/, "").strip
+          # Clean up extra spaces
+          filtered_array.gsub!(/\s+/, " ")
+          new_annots = if filtered_array.empty?
+                         "[]"
+                       else
+                         "[#{filtered_array}]"
+                       end
+          new_page_body = page_body.sub(%r{/Annots\s*\[.*?\]}m, "/Annots #{new_annots}")
+          apply_patch(page_ref, new_page_body, page_body)
+        # Handle indirect /Annots array reference
+        elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+          annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+          annots_array_body = get_object_body_with_patch(annots_array_ref)
+          if annots_array_body
+            filtered_body = DictScan.remove_ref_from_array(annots_array_body, widget_ref)
+            apply_patch(annots_array_ref, filtered_body, annots_array_body)
+          end
+        end
+      end
+    end
+  end
+end
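
The inline branch of `remove_widget_from_page_annots` can be exercised in isolation. The sketch below is a hypothetical standalone function, not corp_pdf API. It uses the block form of `String#sub` so that a single multiline pattern both finds and rewrites the array, which keeps the match and the replacement from drifting apart when `/Annots` spans several lines.

```ruby
# Hypothetical standalone version of the inline /Annots removal step.
# page_body is the raw dictionary text; ref is [obj_num, gen_num].
def remove_ref_from_inline_annots(page_body, ref)
  obj_num, gen_num = ref
  # The m flag lets "." cross newlines, so multi-line arrays are handled.
  page_body.sub(%r{/Annots\s*\[(.*?)\]}m) do
    kept = Regexp.last_match(1)
                 .gsub(/\b#{obj_num}\s+#{gen_num}\s+R\b/, "")
                 .split
                 .join(" ")
    "/Annots [#{kept}]"
  end
end

page = "<< /Type /Page /Annots [5 0 R\n 12 0 R 7 0 R] >>"
puts remove_ref_from_inline_annots(page, [12, 0])
```

The leading `\b` prevents a reference such as `5 0 R` from matching inside `15 0 R`. A page without an inline `/Annots` array is returned unchanged.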