RubyGems - acro_that - Versions diffs - 0.1.0 - Mend

acro_that 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (28)

checksums.yaml +7 -0
data/.DS_Store +0 -0
data/.gitignore +8 -0
data/.rubocop.yml +78 -0
data/Gemfile +5 -0
data/Gemfile.lock +86 -0
data/README.md +360 -0
data/Rakefile +18 -0
data/acro_that.gemspec +34 -0
data/docs/README.md +99 -0
data/docs/dict_scan_explained.md +341 -0
data/docs/object_streams.md +311 -0
data/docs/pdf_structure.md +251 -0
data/lib/acro_that/actions/add_field.rb +278 -0
data/lib/acro_that/actions/add_signature_appearance.rb +422 -0
data/lib/acro_that/actions/base.rb +44 -0
data/lib/acro_that/actions/remove_field.rb +158 -0
data/lib/acro_that/actions/update_field.rb +301 -0
data/lib/acro_that/dict_scan.rb +413 -0
data/lib/acro_that/document.rb +331 -0
data/lib/acro_that/field.rb +143 -0
data/lib/acro_that/incremental_writer.rb +244 -0
data/lib/acro_that/object_resolver.rb +376 -0
data/lib/acro_that/objstm.rb +75 -0
data/lib/acro_that/pdf_writer.rb +97 -0
data/lib/acro_that/version.rb +5 -0
data/lib/acro_that.rb +24 -0
metadata +143 -0

data/docs/pdf_structure.md ADDED

@@ -0,0 +1,251 @@
+# PDF File Structure
+## Overview
+PDF (Portable Document Format) files have a reputation for being complex binary formats, but at their core, they are **text-based files with a structured syntax**. Understanding this fundamental fact is key to understanding how PDF works.
+While PDFs can contain binary data (like compressed streams, images, and fonts), the **structure** of a PDF—its objects, dictionaries, arrays, and references—is defined using plain text syntax.
+## PDF File Anatomy
+A PDF file consists of several main parts:
+1. **Header**: `%PDF-1.4` (or similar version)
+2. **Body**: A collection of PDF objects (the actual content)
+3. **Cross-Reference Table (xref)**: Maps each object number to the byte offset of that object in the file
+4. **Trailer**: Contains the root object reference and metadata, followed by `startxref` and the byte offset of the xref table
+5. **EOF Marker**: `%%EOF`
+### PDF Objects
+The body contains PDF objects. Each object has:
+- An object number and generation number (e.g., `5 0 obj`)
+- Content (dictionary, array, stream, etc.)
+- An `endobj` marker
+Example:
+```
+5 0 obj
+<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>
+endobj
+```
+## PDF Dictionaries
+**PDF dictionaries are the heart of PDF structure.** They're defined using angle brackets:
+```
+<< /Key1 value1 /Key2 value2 /Key3 value3 >>
+```
+Think of them like JSON objects or Ruby hashes, but with PDF-specific syntax:
+- Keys are PDF names (always start with `/`)
+- Values can be: strings, numbers, booleans, names, `null`, arrays, dictionaries, or object references
+- Whitespace separates tokens but is otherwise ignored; it can be omitted where a delimiter already marks the boundary (e.g. `/Type/Page`)
+### Dictionary Examples
+**Simple dictionary:**
+```
+<< /Type /Page /Width 612 /Height 792 >>
+```
+**Nested dictionary:**
+```
+<<
+  /Type /Annot
+  /Subtype /Widget
+  /Rect [100 500 200 520]
+  /AP <<
+    /N <<
+      /Yes 10 0 R
+      /Off 11 0 R
+    >>
+  >>
+>>
+```
+**Dictionary with array:**
+```
+<< /Kids [5 0 R 6 0 R 7 0 R] >>
+```
+**Dictionary with string values:**
+```
+<< /Title (My Document) /Author (John Doe) >>
+```
+The parentheses `()` denote literal strings in PDF syntax. Hex strings use single angle brackets, e.g. `<48656C6C6F>`.
+## PDF Text-Based Syntax
+Despite being "binary" files, PDFs use text-based syntax for their structure. This means:
+1. **Dictionaries are text**: `<< ... >>` are just character sequences
+2. **Arrays are text**: `[ ... ]` are just character sequences
+3. **References are text**: `5 0 R` means "object 5, generation 0"
+4. **Strings can be text or hex**: `(Hello)` or `<48656C6C6F>`
+### Why This Matters
+Because PDF dictionaries are just text with delimiters (`<<`, `>>`), we can parse them using **simple text traversal algorithms**—no complex parser generator, no AST construction, just:
+1. Find opening `<<`
+2. Track nesting depth by counting `<<` and `>>`
+3. When depth reaches zero, we've found a complete dictionary
+4. Repeat
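The four steps above can be sketched in a few lines of Ruby. This is a minimal illustration, not the gem's `each_dictionary` implementation: the helper name `top_level_dictionaries` is hypothetical, and the sketch assumes stream bodies have already been removed, that no string value contains `<<` or `>>`, and that no hex string is immediately followed by `>>`.

```ruby
# Hypothetical helper, not AcroThat's implementation: returns every
# top-level << ... >> span found in +text+.
def top_level_dictionaries(text)
  found = []
  depth = 0
  start = nil
  i = 0
  while i < text.length - 1
    pair = text[i, 2]
    if pair == "<<"
      start = i if depth.zero? # remember where the outermost dictionary opens
      depth += 1
      i += 2
    elsif pair == ">>" && depth.positive?
      depth -= 1
      # Depth is back to zero, so the span start..i+1 is one complete dictionary
      found << text[start..i + 1] if depth.zero?
      i += 2
    else
      i += 1
    end
  end
  found
end

sample = "1 0 obj << /A << /B 1 >> >> endobj 2 0 obj << /C 2 >> endobj"
top_level_dictionaries(sample)
# => ["<< /A << /B 1 >> >>", "<< /C 2 >>"]
```

Nested dictionaries stay inside their parent's span, which is why only two results come back for the sample.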
+## PDF Object References
+PDFs use references to link objects together:
+```
+5 0 R
+```
+This means:
+- Object number: `5`
+- Generation number: `0` (almost always 0; it is higher only when an object number has been freed and reused)
+- `R` means "reference"
+When you see `/Parent 5 0 R`, it means the `Parent` key references object 5.
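Because a reference is just three whitespace-separated tokens, it can be picked out with a regular expression. The sketch below is illustrative only; `REF_PATTERN` and `parse_refs` are hypothetical names, and the pattern assumes the text contains no string values that happen to look like references.

```ruby
# Hypothetical pattern: object number, generation number, then the keyword R
REF_PATTERN = /(\d+)\s+(\d+)\s+R\b/

# Returns every reference in +text+ as an [object_number, generation] pair
def parse_refs(text)
  text.scan(REF_PATTERN).map { |num, gen| [num.to_i, gen.to_i] }
end

parse_refs("/Parent 5 0 R")             # => [[5, 0]]
parse_refs("/Kids [5 0 R 6 0 R 7 0 R]") # => [[5, 0], [6, 0], [7, 0]]
```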
+## PDF Arrays
+Arrays are space-separated lists in square brackets:
+```
+[0 0 612 792]
+```
+Arrays can contain any PDF value type:
+```
+[5 0 R 6 0 R]
+[/Yes /Off]
+[(Hello) (World)]
+```
+## PDF Strings
+PDF strings come in two flavors:
+### Literal Strings (parentheses)
+```
+(Hello World)
+(Line 1\nLine 2)
+```
+Literal strings can contain escape sequences: `\n`, `\r`, `\t`, `\(`, `\)`, `\\`, and octal codes such as `\123`.
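As a rough illustration of how these escapes can be undone, here is a small sketch. `unescape_literal` and `SIMPLE_ESCAPES` are hypothetical names, not the gem's API, and the sketch assumes the token is a complete literal string including its outer parentheses.

```ruby
# Escapes that map to a control character; a backslash before a newline
# is a line continuation and produces nothing.
SIMPLE_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f", "\n" => ""
}.freeze

# Hypothetical helper: turns "(Line 1\nLine 2)" into its unescaped text
def unescape_literal(token)
  inner = token[1..-2] # drop the surrounding parentheses
  inner.gsub(/\\([0-7]{1,3}|.)/m) do
    code = Regexp.last_match(1)
    if code.match?(/\A[0-7]+\z/)
      (code.to_i(8) & 0xFF).chr      # octal escape such as \123
    else
      SIMPLE_ESCAPES.fetch(code, code) # \( \) \\ become the bare character
    end
  end
end

unescape_literal('(Line 1\nLine 2)')    # => "Line 1\nLine 2" (with a real newline)
unescape_literal('(a \(b\) \\\\ \123)') # => "a (b) \\ S"
```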
+### Hex Strings (angle brackets)
+```
+<48656C6C6F>
+<FEFF00480065006C006C006F>
+```
+The hex string `<FEFF...>` with BOM indicates UTF-16BE encoding.
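Decoding a hex string is mostly a matter of packing the digits back into bytes and checking for the BOM. The following is a sketch under those assumptions; `decode_hex_string` is a hypothetical name and is not taken from the gem.

```ruby
# Hypothetical helper: decodes a PDF hex string token such as <48656C6C6F>
def decode_hex_string(token)
  hex = token.delete("<>").gsub(/\s+/, "")
  hex += "0" if hex.length.odd? # a missing final digit is treated as 0
  bytes = [hex].pack("H*")      # binary string built from the hex digits

  if bytes.start_with?("\xFE\xFF".b)
    # BOM present: the remaining bytes are UTF-16BE text
    bytes.byteslice(2, bytes.bytesize - 2).force_encoding("UTF-16BE").encode("UTF-8")
  else
    bytes
  end
end

decode_hex_string("<48656C6C6F>")               # => "Hello"
decode_hex_string("<FEFF00480065006C006C006F>") # => "Hello"
```

Without a BOM the result is left as raw bytes, since the intended encoding depends on context the token alone does not carry.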
+## PDF Names
+PDF names start with `/`:
+```
+/Type
+/Subtype
+/Widget
+```
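+Names can contain most characters; whitespace and the delimiters `(`, `)`, `<`, `>`, `[`, `]`, `{`, `}`, `/`, and `%` must be written as `#xx` hex escapes (for example, `/A#20B` is the name `A B`).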
+## Stream Objects
+Some PDF objects contain **streams** (binary or text data):
+```
+10 0 obj
+<< /Length 100 /Filter /FlateDecode >>
+stream
+[compressed binary data here]
+endstream
+endobj
+```
+For parsing structure (dictionaries), we typically strip or ignore stream bodies because they can contain arbitrary binary data that would confuse text-based parsing.
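One simple way to do this stripping is a regular expression over the raw bytes. The sketch below is an assumption about how it could be done, not a description of the gem's code: `strip_stream_bodies` is a hypothetical name, and a robust implementation would honor `/Length` rather than search for `endstream`, since stream data may itself contain that keyword.

```ruby
# Hypothetical helper: replaces every stream body with nothing, keeping the
# stream/endstream keywords so the object structure stays recognizable.
def strip_stream_bodies(pdf_bytes)
  # .b gives a binary copy, so arbitrary bytes cannot cause encoding errors
  pdf_bytes.b.gsub(/\bstream\r?\n.*?endstream/m, "stream\nendstream")
end

raw = "10 0 obj\n<< /Length 6 >>\nstream\n\x00\xFF<<>>\nendstream\nendobj\n"
strip_stream_bodies(raw)
# => "10 0 obj\n<< /Length 6 >>\nstream\nendstream\nendobj\n"
```

In the sample, the stream body contains `<<>>`, which would otherwise be picked up as a dictionary by a text scan.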
+## Why AcroThat Works
+`AcroThat` works because **PDF dictionaries are just text patterns**. Despite looking complicated, the algorithms are straightforward:
+### Finding Dictionaries
+The `each_dictionary` method:
+1. Searches for `<<` (start of dictionary)
+2. Tracks nesting depth: `<<` increments, `>>` decrements
+3. When depth returns to 0, we've found a complete dictionary
+4. Yield it and continue searching
+This is **pure text traversal**—no PDF-specific knowledge beyond "dictionaries use `<<` and `>>`".
+### Extracting Values
+The `value_token_after` method:
+1. Finds a key (like `/V`)
+2. Skips whitespace
+3. Based on the next character, extracts the value:
+   - `(` → Extract literal string (handle escaping)
+   - `<` → Extract dictionary if the next character is also `<`, otherwise hex string
+   - `[` → Extract array (match brackets)
+   - `/` → Extract name
+   - Otherwise → Extract atom (number, reference, etc.)
+Again, this is just **text pattern matching** with some bracket/depth tracking.
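The dispatch on the first character can be sketched as follows. This is a simplified illustration and not the gem's `value_token_after`: the names `sketch_value_token_after` and `balanced_span` are hypothetical, and the sketch assumes the key does not also appear inside a string value and that strings nested in arrays contain no unbalanced brackets.

```ruby
# Hypothetical helper: returns the span starting at +start+ up to the
# delimiter that balances the opening one (handles nesting and backslashes).
def balanced_span(text, start, open, close)
  depth = 0
  i = start
  while i < text.length
    if text[i] == "\\"
      i += 2 # skip the backslash and the character it escapes
    elsif text[i, open.length] == open
      depth += 1
      i += open.length
    elsif text[i, close.length] == close
      depth -= 1
      i += close.length
      return text[start...i] if depth.zero?
    else
      i += 1
    end
  end
  nil
end

# Hypothetical sketch of the lookup: find the key, skip whitespace, then
# pick an extraction strategy from the first character of the value.
def sketch_value_token_after(key, dict)
  match = dict.match(/#{Regexp.escape(key)}(?![\w.-])\s*/)
  return nil unless match

  i = match.end(0)
  case dict[i]
  when "(" then balanced_span(dict, i, "(", ")")
  when "[" then balanced_span(dict, i, "[", "]")
  when "<"
    if dict[i, 2] == "<<"
      balanced_span(dict, i, "<<", ">>")
    else
      dict[i..dict.index(">", i)] # hex string: up to the closing >
    end
  when "/" then dict[i..][%r{\A/[^\s/<>\[\]()]*}]
  else dict[i..][/\A\d+\s+\d+\s+R\b|\A[^\s\/<>\[\]()]+/] # reference or atom
  end
end

dict = "<< /Type /Annot /V (Hello World) /Rect [100 500 200 520] /Parent 5 0 R >>"
sketch_value_token_after("/V", dict)      # => "(Hello World)"
sketch_value_token_after("/Rect", dict)   # => "[100 500 200 520]"
sketch_value_token_after("/Parent", dict) # => "5 0 R"
```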
+### Why It Seems Complicated
+The complexity comes from:
+1. **Handling edge cases**: Escaped characters, nested structures, various value types
+2. **Preserving exact formatting**: When replacing values, we must maintain valid PDF syntax
+3. **Encoding/decoding**: PDF strings have special encoding rules (UTF-16BE BOM, escapes)
+4. **Safety checks**: Verifying dictionaries are still valid after modification
+But the **core concept** is simple: PDF dictionaries are text, so we can parse them with text traversal.
+## Example: Walking Through a PDF Dictionary
+Given this PDF dictionary text:
+```
+<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>
+```
+How `AcroThat` would parse it:
+1. **`each_dictionary` finds it:**
+   - Finds `<<` at position 0
+   - Depth: 0 → 1 (after `<<`)
+   - Scans forward...
+   - Finds `>>` at position 74
+   - Depth: 1 → 0
+   - Yields: `"<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>"`
+2. **`value_token_after("/V", dict)` extracts value:**
+   - Finds `/V` (followed by space)
+   - Skips whitespace
+   - Next char is `(`, so extract literal string
+   - Scan forward, handle escaping, match closing `)`
+   - Returns: `"(Hello World)"`
+3. **`decode_pdf_string("(Hello World)")` decodes:**
+   - Starts with `(`, ends with `)`
+   - Extract inner: `"Hello World"`
+   - Unescape (no escapes here)
+   - Check for UTF-16BE BOM (none)
+   - Return: `"Hello World"`
+## Conclusion
+PDF files are **structured text files** with binary data embedded in streams. The structure itself—dictionaries, arrays, strings, references—is all text-based syntax. This is why `AcroThat` can use simple text traversal to parse and modify PDF dictionaries without needing a full PDF parser.
+The apparent complexity in `AcroThat` comes from:
+- Handling PDF's various value types
+- Proper encoding/decoding of strings
+- Careful preservation of structure during edits
+- Edge case handling (escaping, nesting, etc.)
+But the **fundamental approach** is elegantly simple: treat PDF dictionaries as text patterns and parse them with character-by-character traversal.

data/lib/acro_that/actions/add_field.rb ADDED

@@ -0,0 +1,278 @@
+# frozen_string_literal: true
+module AcroThat
+  module Actions
+    # Action to add a new field to a PDF document
+    class AddField
+      include Base
+      attr_reader :field_obj_num, :field_type, :field_value
+      def initialize(document, name, options = {})
+        @document = document
+        @name = name
+        @options = options
+      end
+      def call
+        x = @options[:x] || 100
+        y = @options[:y] || 500
+        width = @options[:width] || 100
+        height = @options[:height] || 20
+        page_num = @options[:page] || 1
+        # Normalize field type: accept symbols or strings, convert to PDF format
+        type_input = @options[:type] || "/Tx"
+        @field_type = case type_input
+                      when :text, "text", "/Tx", "/tx"
+                        "/Tx"
+                      when :button, "button", "/Btn", "/btn"
+                        "/Btn"
+                      when :choice, "choice", "/Ch", "/ch"
+                        "/Ch"
+                      when :signature, "signature", "/Sig", "/sig"
+                        "/Sig"
+                      else
+                        type_input.to_s # Use as-is if it's already in PDF format
+                      end
+        @field_value = @options[:value] || ""
+        # Create a proper field dictionary + a widget annotation that references it via /Parent
+        @field_obj_num = next_fresh_object_number
+        widget_obj_num = @field_obj_num + 1
+        field_body = create_field_dictionary(@field_value, @field_type)
+        # Find the page ref for /P on widget (must happen before we create patches)
+        page_ref = find_page_ref(page_num)
+        # Create widget with page reference
+        widget_body = create_widget_annotation_with_parent(widget_obj_num, [@field_obj_num, 0], page_ref, x, y, width,
+                                                           height, @field_type, @field_value)
+        # Queue objects
+        patches << { ref: [@field_obj_num, 0], body: field_body }
+        patches << { ref: [widget_obj_num, 0], body: widget_body }
+        # Add field reference (not widget) to AcroForm /Fields AND ensure defaults in ONE patch
+        add_field_to_acroform_with_defaults(@field_obj_num)
+        # Add widget to the target page's /Annots
+        add_widget_to_page(widget_obj_num, page_num)
+        true
+      end
+      private
+      def create_field_dictionary(value, type)
+        dict = "<<\n"
+        dict += "  /FT #{type}\n"
+        dict += "  /T #{DictScan.encode_pdf_string(@name)}\n"
+        dict += "  /Ff 0\n"
+        dict += "  /DA (/Helv 0 Tf 0 g)\n"
+        dict += "  /V #{DictScan.encode_pdf_string(value)}\n" if value && !value.empty?
+        dict += ">>"
+        dict
+      end
+      def create_widget_annotation_with_parent(_widget_obj_num, parent_ref, page_ref, x, y, width, height, type, value)
+        rect_array = "[#{x} #{y} #{x + width} #{y + height}]"
+        widget = "<<\n"
+        widget += "  /Type /Annot\n"
+        widget += "  /Subtype /Widget\n"
+        widget += "  /Parent #{parent_ref[0]} #{parent_ref[1]} R\n"
+        widget += "  /P #{page_ref[0]} #{page_ref[1]} R\n" if page_ref
+        widget += "  /FT #{type}\n"
+        widget += "  /Rect #{rect_array}\n"
+        widget += "  /F 4\n"
+        widget += "  /DA (/Helv 0 Tf 0 g)\n"
+        widget += "  /V #{DictScan.encode_pdf_string(value)}\n" if value && !value.empty?
+        widget += ">>"
+        widget
+      end
+      def add_field_to_acroform_with_defaults(field_obj_num)
+        af_ref = acroform_ref
+        return false unless af_ref
+        af_body = get_object_body_with_patch(af_ref)
+        patched = af_body.dup
+        # Step 1: Add field to /Fields array
+        fields_array_ref = DictScan.value_token_after("/Fields", patched)
+        if fields_array_ref && fields_array_ref =~ /\A(\d+)\s+(\d+)\s+R/
+          # Reference case: /Fields points to a separate array object
+          arr_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+          arr_body = get_object_body_with_patch(arr_ref)
+          new_body = DictScan.add_ref_to_array(arr_body, [field_obj_num, 0])
+          apply_patch(arr_ref, new_body, arr_body)
+        elsif patched.include?("/Fields")
+          # Inline array case: use DictScan utility
+          patched = DictScan.add_ref_to_inline_array(patched, "/Fields", [field_obj_num, 0])
+        else
+          # No /Fields exists - add it with the field reference
+          patched = DictScan.upsert_key_value(patched, "/Fields", "[#{field_obj_num} 0 R]")
+        end
+        # Step 2: Ensure /NeedAppearances true
+        unless patched.include?("/NeedAppearances")
+          patched = DictScan.upsert_key_value(patched, "/NeedAppearances", "true")
+        end
+        # Step 3: Ensure /DR /Font has /Helv mapping
+        unless patched.include?("/DR") && patched.include?("/Helv")
+          font_obj_num = next_fresh_object_number
+          font_body = "<<\n  /Type /Font\n  /Subtype /Type1\n  /BaseFont /Helvetica\n>>"
+          patches << { ref: [font_obj_num, 0], body: font_body }
+          if patched.include?("/DR")
+            # /DR exists - try to add /Font if it doesn't exist
+            dr_tok = DictScan.value_token_after("/DR", patched)
+            if dr_tok && dr_tok.start_with?("<<")
+              # Check if /Font already exists in /DR
+              unless dr_tok.include?("/Font")
+                # Add /Font to existing /DR dictionary
+                new_dr_tok = dr_tok.chomp(">>") + "  /Font << /Helv #{font_obj_num} 0 R >>\n>>"
+                patched = patched.sub(dr_tok) { |_| new_dr_tok }
+              end
+            else
+              # /DR exists but isn't a dictionary - replace it
+              patched = DictScan.replace_key_value(patched, "/DR", "<< /Font << /Helv #{font_obj_num} 0 R >> >>")
+            end
+          else
+            # No /DR exists - add it
+            patched = DictScan.upsert_key_value(patched, "/DR", "<< /Font << /Helv #{font_obj_num} 0 R >> >>")
+          end
+        end
+        apply_patch(af_ref, patched, af_body)
+        true
+      end
+      def find_page_ref(page_num)
+        page_objects = []
+        resolver.each_object do |ref, body|
+          next unless body
+          # Match /Type /Page (with or without whitespace) but not /Type /Pages tree nodes
+          is_page = body.match?(%r{/Type\s*/Page\b})
+          next unless is_page
+          page_objects << ref
+        end
+        # If still no pages found, try to find them via the page tree
+        if page_objects.empty?
+          # Find the document catalog's /Pages entry
+          root_ref = resolver.root_ref
+          if root_ref
+            catalog_body = resolver.object_body(root_ref)
+            if catalog_body && catalog_body =~ %r{/Pages\s+(\d+)\s+(\d+)\s+R}
+              pages_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+              pages_body = resolver.object_body(pages_ref)
+              # Extract /Kids array from Pages object
+              if pages_body && pages_body =~ %r{/Kids\s*\[(.*?)\]}m
+                kids_array = ::Regexp.last_match(1)
+                # Extract all object references from Kids array
+                kids_array.scan(/(\d+)\s+(\d+)\s+R/) do |num_str, gen_str|
+                  kid_ref = [num_str.to_i, gen_str.to_i]
+                  kid_body = resolver.object_body(kid_ref)
+                  # Check if this kid is a page or another Pages node
+                  if kid_body&.match?(%r{/Type\s*/Page\b})
+                    page_objects << kid_ref
+                  elsif kid_body&.match?(%r{/Type\s*/Pages\b})
+                    # Descend one more level into this intermediate Pages node
+                    if kid_body =~ %r{/Kids\s*\[(.*?)\]}m
+                      ::Regexp.last_match(1).scan(/(\d+)\s+(\d+)\s+R/) do |n, g|
+                        grandkid_ref = [n.to_i, g.to_i]
+                        grandkid_body = resolver.object_body(grandkid_ref)
+                        if grandkid_body&.match?(%r{/Type\s*/Page\b})
+                          page_objects << grandkid_ref
+                        end
+                      end
+                    end
+                  end
+                end
+              end
+            end
+          end
+        end
+        return nil if page_objects.empty?
+        return page_objects[page_num - 1] if page_num.positive? && page_num <= page_objects.length
+        page_objects[0]
+      end
+      def add_widget_to_page(widget_obj_num, page_num)
+        # Find the specific page using the same logic as find_page_ref
+        target_page_ref = find_page_ref(page_num)
+        return false unless target_page_ref
+        page_body = get_object_body_with_patch(target_page_ref)
+        # Use DictScan utility to safely add reference to /Annots array
+        new_body = if page_body =~ %r{/Annots\s*\[(.*?)\]}m
+                     # Inline array - add to it
+                     result = DictScan.add_ref_to_inline_array(page_body, "/Annots", [widget_obj_num, 0])
+                     if result && result != page_body
+                       result
+                     else
+                       # Fallback: use string manipulation
+                       annots_array = ::Regexp.last_match(1)
+                       ref_token = "#{widget_obj_num} 0 R"
+                       new_annots = if annots_array.strip.empty?
+                                      "[#{ref_token}]"
+                                    else
+                                      "[#{annots_array} #{ref_token}]"
+                                    end
+                       page_body.sub(%r{/Annots\s*\[.*?\]}m) { "/Annots #{new_annots}" }
+                     end
+                   elsif page_body =~ %r{/Annots\s+(\d+)\s+(\d+)\s+R}
+                     # Indirect array reference - need to read and modify the array object
+                     annots_array_ref = [Integer(::Regexp.last_match(1)), Integer(::Regexp.last_match(2))]
+                     annots_array_body = get_object_body_with_patch(annots_array_ref)
+                     ref_token = "#{widget_obj_num} 0 R"
+                     if annots_array_body
+                       new_annots_body = if annots_array_body.strip == "[]"
+                                           "[#{ref_token}]"
+                                         elsif annots_array_body.strip.start_with?("[") && annots_array_body.strip.end_with?("]")
+                                           without_brackets = annots_array_body.strip[1..-2].strip
+                                           "[#{without_brackets} #{ref_token}]"
+                                         else
+                                           "[#{annots_array_body} #{ref_token}]"
+                                         end
+                       apply_patch(annots_array_ref, new_annots_body, annots_array_body)
+                       # Page body doesn't need to change (still references the same array object)
+                       page_body
+                     else
+                       # Array object not found - fallback to creating inline array
+                       page_body.sub(%r{/Annots\s+\d+\s+\d+\s+R}, "/Annots [#{ref_token}]")
+                     end
+                   else
+                     # No /Annots exists - add it with the widget reference
+                     # Insert /Annots before the closing >> of the dictionary
+                     ref_token = "#{widget_obj_num} 0 R"
+                     closing = page_body.rindex(">>")
+                     if closing
+                       # Insert /Annots before the last >> (the one closing the outermost dictionary)
+                       "#{page_body[0...closing]}/Annots [#{ref_token}]#{page_body[closing..]}"
+                     else
+                       "#{page_body} /Annots [#{ref_token}]"
+                     end
+                   end
+        apply_patch(target_page_ref, new_body, page_body) if new_body && new_body != page_body
+        true
+      end
+    end
+  end
+end
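The page-tree fallback in `find_page_ref` descends at most two levels of `/Kids`. For comparison, a recursive walk that handles any depth could look like the sketch below. It is illustrative only: `collect_pages` is a hypothetical name, the `objects` hash stands in for whatever `resolver.object_body` returns, and `/Kids` is assumed to be an inline array rather than an indirect reference.

```ruby
# Hypothetical helper: returns the refs of all leaf pages under +ref+, in
# document order. +objects+ maps [object_number, generation] to a body string.
def collect_pages(objects, ref, seen = {})
  return [] if seen[ref] # guard against cyclic /Kids references

  seen[ref] = true
  body = objects[ref]
  return [] unless body

  if body.match?(%r{/Type\s*/Pages\b})
    # Intermediate node: recurse into each kid
    kids = body[%r{/Kids\s*\[(.*?)\]}m, 1].to_s
    kids.scan(/(\d+)\s+(\d+)\s+R\b/).flat_map do |num, gen|
      collect_pages(objects, [num.to_i, gen.to_i], seen)
    end
  elsif body.match?(%r{/Type\s*/Page\b})
    [ref] # leaf page
  else
    []
  end
end

objects = {
  [1, 0] => "<< /Type /Pages /Kids [2 0 R 3 0 R] /Count 3 >>",
  [2, 0] => "<< /Type /Page /Parent 1 0 R >>",
  [3, 0] => "<< /Type /Pages /Parent 1 0 R /Kids [4 0 R 5 0 R] /Count 2 >>",
  [4, 0] => "<< /Type/Page /Parent 3 0 R >>",
  [5, 0] => "<< /Type /Page /Parent 3 0 R >>"
}
collect_pages(objects, [1, 0])
# => [[2, 0], [4, 0], [5, 0]]
```

Walking the tree from the catalog's `/Pages` entry also yields pages in reading order, which a scan over all objects does not guarantee.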