RubyGems - corp_pdf - Versions diffs - 1.0.5 - Mend

corp_pdf 1.0.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (40)

checksums.yaml +7 -0
data/.gitignore +13 -0
data/.rubocop.yml +78 -0
data/CHANGELOG.md +122 -0
data/Gemfile +5 -0
data/Gemfile.lock +90 -0
data/README.md +518 -0
data/Rakefile +18 -0
data/corp_pdf.gemspec +35 -0
data/docs/README.md +111 -0
data/docs/clear_fields.md +202 -0
data/docs/dict_scan_explained.md +341 -0
data/docs/object_streams.md +311 -0
data/docs/pdf_structure.md +251 -0
data/issues/README.md +59 -0
data/issues/memory-benchmark-results.md +551 -0
data/issues/memory-improvements.md +388 -0
data/issues/memory-optimization-summary.md +204 -0
data/issues/refactoring-opportunities.md +259 -0
data/lib/corp_pdf/actions/add_field.rb +73 -0
data/lib/corp_pdf/actions/base.rb +48 -0
data/lib/corp_pdf/actions/remove_field.rb +154 -0
data/lib/corp_pdf/actions/update_field.rb +663 -0
data/lib/corp_pdf/dict_scan.rb +523 -0
data/lib/corp_pdf/document.rb +782 -0
data/lib/corp_pdf/field.rb +145 -0
data/lib/corp_pdf/fields/base.rb +384 -0
data/lib/corp_pdf/fields/checkbox.rb +164 -0
data/lib/corp_pdf/fields/radio.rb +220 -0
data/lib/corp_pdf/fields/signature.rb +393 -0
data/lib/corp_pdf/fields/text.rb +31 -0
data/lib/corp_pdf/incremental_writer.rb +245 -0
data/lib/corp_pdf/object_resolver.rb +381 -0
data/lib/corp_pdf/objstm.rb +75 -0
data/lib/corp_pdf/page.rb +90 -0
data/lib/corp_pdf/pdf_writer.rb +133 -0
data/lib/corp_pdf/version.rb +5 -0
data/lib/corp_pdf.rb +35 -0
data/publish +183 -0
metadata +169 -0

data/README.md ADDED

@@ -0,0 +1,518 @@
+# CorpPdf
+A minimal pure Ruby library for parsing and editing PDF AcroForm fields.
+## Features
+- ✅ **Pure Ruby** - Minimal dependencies (only `chunky_png` for PNG image processing)
+- ✅ **StringIO Only** - Works entirely in memory, no temp files
+- ✅ **PDF AcroForm Support** - Parse, list, add, remove, and modify form fields
+- ✅ **Signature Field Images** - Add image appearances to signature fields (JPEG and PNG support)
+- ✅ **Minimal PDF Engine** - Basic PDF parser/writer for AcroForm manipulation
+- ✅ **Ruby 3.1+** - Modern Ruby support
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'corp_pdf'
+```
+And then execute:
+```bash
+bundle install
+```
+Or install it directly:
+```bash
+gem install corp_pdf
+```
+## Usage
+### Basic Usage
+```ruby
+require 'corp_pdf'
+# Create a document from a file path or StringIO
+doc = CorpPdf::Document.new("form.pdf")
+# Or from StringIO
+require 'stringio'
+pdf_data = File.binread("form.pdf")
+io = StringIO.new(pdf_data)
+doc = CorpPdf::Document.new(io)
+# List all form fields
+fields = doc.list_fields
+fields.each do |field|
+  type_info = field.type_key ? "#{field.type} (:#{field.type_key})" : field.type
+  puts "#{field.name} (#{type_info}) = #{field.value}"
+end
+# Add a new field (using symbol key for type)
+new_field = doc.add_field("NameField",
+  value: "John Doe",
+  x: 100,
+  y: 500,
+  width: 200,
+  height: 20,
+  page: 1,
+  type: :text  # Optional: :text, :button, :choice, :signature (or "/Tx", "/Btn", etc.)
+)
+# Or using the PDF type string directly
+button_field = doc.add_field("CheckBox",
+  type: "/Btn",  # Or use :button symbol
+  x: 100,
+  y: 600,
+  width: 20,
+  height: 20,
+  page: 1
+)
+# Update a field value
+doc.update_field("ExistingField", "New Value")
+# Rename a field while updating it
+doc.update_field("OldName", "New Value", new_name: "NewName")
+# Remove a field
+doc.remove_field("FieldToRemove")
+# Write the modified PDF to a file
+doc.write("output.pdf")
+# Or write with flattening (removes incremental updates)
+doc.write("output.pdf", flatten: true)
+# Or get PDF bytes as a String (returns String, not StringIO)
+pdf_bytes = doc.write
+File.binwrite("output.pdf", pdf_bytes)
+```
+### Advanced Usage
+#### Working with Field Objects
+```ruby
+doc = CorpPdf::Document.new("form.pdf")
+fields = doc.list_fields
+# Access field properties
+field = fields.first
+puts field.name        # Field name
+puts field.value       # Field value
+puts field.type        # Field type (e.g., "/Tx")
+puts field.type_key    # Symbol key (e.g., :text) or nil if not mapped
+puts field.x           # X position
+puts field.y           # Y position
+puts field.width       # Width
+puts field.height      # Height
+puts field.page        # Page number
+# Fields default to "/Tx" if type is missing from PDF
+# Update a field directly
+field.update("New Value")
+# Update and rename a field
+field.update("New Value", new_name: "NewName")
+# Remove a field directly
+field.remove
+# Check field type
+field.text_field?      # true for text fields
+field.button_field?    # true for button/checkbox fields
+field.choice_field?    # true for choice/dropdown fields
+field.signature_field? # true for signature fields
+# Check if field has a value
+field.has_value?
+# Check if field has position information
+field.has_position?
+```
+#### Signature Fields with Image Appearances
+Signature fields can be enhanced with image appearances (signature images). When you update a signature field with image data (base64-encoded JPEG or PNG), CorpPdf will automatically add the image as the field's appearance.
+```ruby
+require 'base64'
+doc = CorpPdf::Document.new("form.pdf")
+# Add a signature field
+sig_field = doc.add_field("MySignature",
+  type: :signature,
+  x: 100,
+  y: 500,
+  width: 200,
+  height: 100,
+  page: 1
+)
+# Update signature field with base64-encoded image data
+# JPEG example:
+jpeg_base64 = Base64.strict_encode64(File.binread("signature.jpg"))
+doc.update_field("MySignature", jpeg_base64)
+# PNG example (requires chunky_png gem):
+png_base64 = Base64.strict_encode64(File.binread("signature.png"))
+doc.update_field("MySignature", png_base64)
+# Or using data URI format:
+data_uri = "data:image/png;base64,#{png_base64}"
+doc.update_field("MySignature", data_uri)
+# Write the PDF with the signature appearance
+doc.write("form_with_signature.pdf")
+```
+**Note**: PNG image processing requires the `chunky_png` gem, which is included as a dependency. JPEG images can be processed without any additional dependencies.
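The note above separates JPEG input from PNG input. One way such a distinction could be made is by checking the decoded bytes against each format's magic number. The sketch below is only an illustration: `sniff_image_format` is a hypothetical helper, not part of CorpPdf's documented API, and the README does not state how the library itself detects image data.

```ruby
require "base64"

# Hypothetical helper (an assumption for illustration, not CorpPdf API).
JPEG_MAGIC = "\xFF\xD8\xFF".b          # JPEG files start with these bytes
PNG_MAGIC  = "\x89PNG\r\n\x1A\n".b     # 8-byte PNG signature

# Returns :jpeg, :png, or nil for a base64 string or a base64 data URI.
def sniff_image_format(value)
  # Drop an optional prefix such as "data:image/png;base64,".
  payload = value.sub(%r{\Adata:image/[a-z0-9.+-]+;base64,}i, "")
  bytes = Base64.decode64(payload) # lenient decode, returns a binary String
  return :jpeg if bytes.start_with?(JPEG_MAGIC)
  return :png if bytes.start_with?(PNG_MAGIC)

  nil
end
```

A plain text value such as a name decodes to bytes that match neither signature, so the helper returns `nil` and the caller could fall back to treating the value as ordinary text.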
+#### Radio Buttons
+Radio buttons allow users to select a single option from a group of mutually exclusive choices. Radio buttons in CorpPdf are created using the `:radio` type and require a `group_id` to group related buttons together.
+```ruby
+doc = CorpPdf::Document.new("form.pdf")
+# Create a radio button group with multiple options
+# All buttons in the same group must share the same group_id
+# First radio button in the group (creates the parent field)
+doc.add_field("Option1",
+  type: :radio,
+  group_id: "my_radio_group",
+  value: "option1",  # Export value for this button
+  x: 100,
+  y: 500,
+  width: 20,
+  height: 20,
+  page: 1,
+  selected: true  # This button will be selected by default
+)
+# Second radio button in the same group
+doc.add_field("Option2",
+  type: :radio,
+  group_id: "my_radio_group",  # Same group_id as above
+  value: "option2",
+  x: 100,
+  y: 470,
+  width: 20,
+  height: 20,
+  page: 1
+)
+# Third radio button in the same group
+doc.add_field("Option3",
+  type: :radio,
+  group_id: "my_radio_group",  # Same group_id
+  value: "option3",
+  x: 100,
+  y: 440,
+  width: 20,
+  height: 20,
+  page: 1
+)
+# Write the PDF with radio buttons
+doc.write("form_with_radio.pdf")
+```
+**Key Points:**
+- **`group_id`**: Required. All radio buttons that should be mutually exclusive must share the same `group_id`. This can be any string or identifier.
+- **`type: :radio`**: Required. Specifies that this is a radio button field.
+- **`value`**: The export value for this specific button. This is what gets returned when the button is selected. If not provided, a unique value will be generated automatically.
+- **`selected`**: Optional boolean (`true` or `false`, or string `"true"`). If set to `true`, this button will be selected by default. Only one button in a group should have `selected: true`. If not specified, the button defaults to unselected.
+- **Positioning**: Each radio button needs its own `x`, `y`, `width`, `height`, and `page` values to position it on the form.
+**Example with multiple groups:**
+```ruby
+doc = CorpPdf::Document.new("form.pdf")
+# First radio button group (e.g., "Gender")
+doc.add_field("Male", type: :radio, group_id: "gender", value: "male", x: 100, y: 500, width: 20, height: 20, page: 1, selected: true)
+doc.add_field("Female", type: :radio, group_id: "gender", value: "female", x: 100, y: 470, width: 20, height: 20, page: 1)
+doc.add_field("Other", type: :radio, group_id: "gender", value: "other", x: 100, y: 440, width: 20, height: 20, page: 1)
+# Second radio button group (e.g., "Age Range")
+doc.add_field("18-25", type: :radio, group_id: "age", value: "18-25", x: 200, y: 500, width: 20, height: 20, page: 1)
+doc.add_field("26-35", type: :radio, group_id: "age", value: "26-35", x: 200, y: 470, width: 20, height: 20, page: 1, selected: true)
+doc.add_field("36+", type: :radio, group_id: "age", value: "36+", x: 200, y: 440, width: 20, height: 20, page: 1)
+doc.write("form_with_multiple_groups.pdf")
+```
+**Note:** Radio buttons are automatically configured with the correct PDF flags to enable mutual exclusivity within a group. When a user selects one radio button, all others in the same group are automatically deselected.
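The "correct PDF flags" mentioned above live in the button field's `/Ff` entry. The PDF specification assigns bit 16 to Radio, bit 15 to NoToggleToOff, and bit 17 to Pushbutton (bit positions are 1-based). The sketch below shows the arithmetic only. The helper names are hypothetical, and whether CorpPdf sets NoToggleToOff is an assumption that the README does not confirm.

```ruby
# Field flag bits for button fields, per the PDF specification.
NO_TOGGLE_TO_OFF = 1 << 14 # bit 15: one button in the group stays selected
RADIO            = 1 << 15 # bit 16: the field is a set of radio buttons
PUSHBUTTON       = 1 << 16 # bit 17: must be clear for radio buttons

# Hypothetical helper: builds an /Ff value for a radio group parent field.
def radio_flags(no_toggle_to_off: true)
  flags = RADIO
  flags |= NO_TOGGLE_TO_OFF if no_toggle_to_off
  flags
end

# Hypothetical helper: true when an /Ff value describes a radio group.
def radio_group?(ff)
  (ff & RADIO) != 0 && (ff & PUSHBUTTON).zero?
end
```

A checkbox is also a `/Btn` field but leaves both the Radio and Pushbutton bits clear, which is why `radio_group?(0)` is false.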
+#### Flattening PDFs
+```ruby
+# Flatten a PDF to remove incremental updates
+doc = CorpPdf::Document.new("form.pdf")
+doc.flatten!  # Modifies the document in-place
+# Or write a flattened copy to a new file (returns the output path)
+CorpPdf::Document.flatten_pdf("input.pdf", "output.pdf")
+# Or get a new flattened Document (no output path given)
+flattened_doc = CorpPdf::Document.flatten_pdf("input.pdf")
+```
+#### Clearing Fields
+The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
+```ruby
+doc = CorpPdf::Document.new("form.pdf")
+# Remove all fields matching a pattern
+doc.clear!(remove_pattern: /^text-/)
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+# Remove specific fields
+doc.clear!(remove_fields: ["OldField1", "OldField2"])
+# Use a block to determine which fields to keep
+doc.clear! { |name| !name.start_with?("temp_") }
+# Write the cleared PDF
+doc.write("cleared.pdf", flatten: true)
+```
+**Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/clear_fields.md) for detailed information.
+### API Reference
+#### `CorpPdf::Document.new(path_or_io)`
+Creates a PDF document from a file path (String) or StringIO object.
+```ruby
+doc = CorpPdf::Document.new("path/to/file.pdf")
+doc = CorpPdf::Document.new(StringIO.new(pdf_bytes))
+```
+#### `#list_fields`
+Returns an array of `Field` objects representing all form fields in the document.
+```ruby
+fields = doc.list_fields
+fields.each do |field|
+  puts field.name
+end
+```
+#### `#list_pages`
+Returns an array of `Page` objects representing all pages in the document. Each `Page` object provides page information and methods to add fields to that specific page.
+```ruby
+pages = doc.list_pages
+pages.each do |page|
+  puts "Page #{page.page_number}: #{page.width}x#{page.height}"
+end
+# Add fields to specific pages - the page is automatically set!
+first_page = pages[0]
+first_page.add_field("Name", x: 100, y: 700, width: 200, height: 20)
+second_page = pages[1]
+second_page.add_field("Email", x: 100, y: 650, width: 200, height: 20)
+```
+**Page Object Methods:**
+- `page.page_number` - Returns the page number (1-indexed)
+- `page.width` - Page width in points
+- `page.height` - Page height in points
+- `page.ref` - Page object reference `[obj_num, gen_num]`
+- `page.metadata` - Hash containing page metadata (rotation, boxes, etc.)
+- `page.add_field(name, options)` - Add a field to this page (page number is automatically set)
+- `page.to_h` - Convert to hash for backward compatibility
+#### `#add_field(name, options)`
+Adds a new form field to the document. Options include:
+- `value`: Default value for the field (String)
+- `x`: X coordinate (Integer, default: 100)
+- `y`: Y coordinate (Integer, default: 500)
+- `width`: Field width (Integer, default: 100)
+- `height`: Field height (Integer, default: 20)
+- `page`: Page number to add the field to (Integer, default: 1)
+- `type`: Field type (Symbol or String, default: `"/Tx"`). Options:
+  - Symbol keys: `:text`, `:button`, `:choice`, `:signature`, `:radio`
+  - PDF type strings: `"/Tx"`, `"/Btn"`, `"/Ch"`, `"/Sig"`
+- `group_id`: Required for radio buttons. String or identifier to group radio buttons together. All radio buttons in the same group must share the same `group_id`.
+- `selected`: Optional for radio buttons. Boolean (`true` or `false`, or string `"true"`). If set to `true`, this radio button will be selected by default.
+Returns a `Field` object if successful.
+```ruby
+# Using symbol keys (recommended)
+field = doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1, type: :text)
+# Using PDF type strings
+field = doc.add_field("ButtonField", type: "/Btn", x: 100, y: 500, width: 20, height: 20, page: 1)
+# Radio button example
+field = doc.add_field("Option1", type: :radio, group_id: "my_group", value: "option1", x: 100, y: 500, width: 20, height: 20, page: 1, selected: true)
+```
+#### `#update_field(name, new_value, new_name: nil)`
+Updates a field's value and optionally renames it. For signature fields, if `new_value` looks like image data (base64-encoded JPEG/PNG or a data URI), it will automatically add the image as the field's appearance. Returns `true` if successful, `false` if field not found.
+```ruby
+doc.update_field("FieldName", "New Value")
+doc.update_field("OldName", "New Value", new_name: "NewName")
+# For signature fields with images:
+doc.update_field("SignatureField", base64_image_data)  # Base64-encoded JPEG or PNG
+doc.update_field("SignatureField", "data:image/png;base64,...")  # Data URI format
+```
+#### `#remove_field(name_or_field)`
+Removes a form field by name (String) or Field object. Returns `true` if successful, `false` if field not found.
+```ruby
+doc.remove_field("FieldName")
+doc.remove_field(field_object)
+```
+#### `#write(path_out = nil, flatten: false)`
+Writes the modified PDF. If `path_out` is provided, writes to that file path and returns `true`. If no path is provided, returns the PDF bytes as a String. The `flatten` option removes incremental updates from the PDF.
+```ruby
+doc.write("output.pdf")              # Write to file
+doc.write("output.pdf", flatten: true) # Write flattened PDF to file
+pdf_bytes = doc.write                 # Get PDF bytes as String
+```
+#### `#flatten`
+Returns flattened PDF bytes (removes incremental updates) without modifying the document.
+```ruby
+flattened_bytes = doc.flatten
+```
+#### `#flatten!`
+Flattens the PDF in-place (modifies the current document instance).
+```ruby
+doc.flatten!
+```
+#### `CorpPdf::Document.flatten_pdf(input_path, output_path = nil)`
+Class method to flatten a PDF. If `output_path` is provided, writes to that path and returns the path. Otherwise returns a new `Document` instance with the flattened content.
+```ruby
+CorpPdf::Document.flatten_pdf("input.pdf", "output.pdf")
+flattened_doc = CorpPdf::Document.flatten_pdf("input.pdf")
+```
+#### `#clear(options = {})` and `#clear!(options = {})`
+Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
+- `keep_fields`: Array of field names to keep (all others removed)
+- `remove_fields`: Array of field names to remove
+- `remove_pattern`: Regex pattern - fields matching this are removed
+- Block: Given field name, return `true` to keep, `false` to remove
+```ruby
+# Remove all fields
+cleared = doc.clear(remove_pattern: /.*/)
+# Remove fields matching pattern (in-place)
+doc.clear!(remove_pattern: /^text-/)
+# Keep only specific fields
+doc.clear!(keep_fields: ["Name", "Email"])
+# Use block to filter fields (block receives the field name; return true to keep)
+doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
+```
+**Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/clear_fields.md) for detailed information.
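The selection options listed above can be read as a single keep-or-remove decision per field name. The sketch below is one possible way to combine them. `keep_field?` is a hypothetical helper, and the precedence between options (keep list, then remove list, then pattern, then block) is an assumption, since the README does not define what happens when several options are passed together.

```ruby
# Hypothetical helper (not CorpPdf API): decides whether a field name survives.
# The order in which the options are checked is an assumption.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  return false if keep_fields && !keep_fields.include?(name)
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)
  return !!block.call(name) if block # block returns true to keep

  true
end

names = ["Name", "Email", "text-1", "temp_x"]
kept_by_pattern = names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
kept_by_block   = names.select { |n| keep_field?(n) { |f| !f.start_with?("temp_") } }
```

With no options and no block, every field is kept, which matches the idea that `clear` only removes what it is told to remove.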
+### Field Object
+Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
+#### Attributes
+- `name`: Field name (String)
+- `value`: Field value (String or nil)
+- `type`: Field type (String, e.g., "/Tx", "/Btn", "/Ch", "/Sig"). Defaults to "/Tx" if missing from PDF.
+- `ref`: Object reference array `[object_number, generation]`
+- `x`: X coordinate (Float or nil)
+- `y`: Y coordinate (Float or nil)
+- `width`: Field width (Float or nil)
+- `height`: Field height (Float or nil)
+- `page`: Page number (Integer or nil)
+#### Methods
+- `#update(new_value, new_name: nil)`: Update the field's value and optionally rename it
+- `#remove`: Remove the field from the document
+- `#type_key`: Returns the symbol key for the type (e.g., `:text` for `"/Tx"`) or `nil` if not mapped
+- `#text_field?`: Returns true if field is a text field
+- `#button_field?`: Returns true if field is a button/checkbox field
+- `#choice_field?`: Returns true if field is a choice/dropdown field
+- `#signature_field?`: Returns true if field is a signature field
+- `#has_value?`: Returns true if field has a non-empty value
+- `#has_position?`: Returns true if field has position information
+- `#object_number`: Returns the object number (first element of ref)
+- `#generation`: Returns the generation number (second element of ref)
+- `#valid_ref?`: Returns true if field has a valid reference (not a placeholder)
+**Note**: When reading fields from a PDF, if the type is missing or empty, it defaults to `"/Tx"` (text field). The `type_key` method allows you to get the symbol representation (e.g., `:text`) from the type string.
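The relationship between type strings and symbol keys described above can be pictured as a small lookup table. The sketch below uses the four pairs named in this README. `TYPE_KEYS`, `normalize_type`, and `type_key_for` are hypothetical names chosen for illustration, and `:radio` is left out because the README does not state which type string it maps to.

```ruby
# Hypothetical lookup table built from the pairs listed in the README.
TYPE_KEYS = { text: "/Tx", button: "/Btn", choice: "/Ch", signature: "/Sig" }.freeze

# Turns a symbol key or PDF type string into a PDF type string,
# defaulting to "/Tx" when the type is missing or empty.
def normalize_type(type)
  return "/Tx" if type.nil? || type.to_s.empty?

  type.is_a?(Symbol) ? TYPE_KEYS.fetch(type) : type.to_s
end

# Reverse lookup: returns the symbol key, or nil when the string is not mapped.
def type_key_for(type_string)
  TYPE_KEYS.key(type_string)
end
```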
+## Example
+For complete working examples, see the test files in the `spec/` directory:
+- `spec/document_spec.rb` - Basic document operations
+- `spec/form_editing_spec.rb` - Form field editing examples
+- `spec/field_editor_spec.rb` - Field object manipulation
+## Architecture
+CorpPdf is built as a minimal PDF engine with the following components:
+- **ObjectResolver**: Resolves and extracts PDF objects from the document
+- **DictScan**: Parses PDF dictionaries and extracts field information
+- **IncrementalWriter**: Handles incremental PDF updates (appends changes)
+- **PDFWriter**: Writes complete PDF files (for flattening)
+- **Actions**: Modular actions for adding, updating, and removing fields (`AddField`, `UpdateField`, `RemoveField`)
+- **Document**: Main orchestration class that coordinates all operations
+- **Field**: Represents a form field with its properties and methods
+## Limitations
+This is a minimal implementation focused on AcroForm manipulation. It does not support:
+- Complex PDF features (images, fonts, advanced graphics, etc.)
+- PDF compression/decompression (streams are preserved as-is)
+- Full PDF rendering or display
+- Digital signatures (though signature fields can be added)
+- JavaScript or other interactive features
+- Form submission/validation logic
+## Dependencies
+- **chunky_png** (~> 1.4): Required for PNG image processing in signature field appearances. JPEG images can be processed without this dependency, but PNG support requires it.
+- **i18n** (~> 1.14): Declared as a runtime dependency in the gemspec.
+## Development
+After checking out the repo, run `bundle install` to install dependencies. Then, run `bundle exec rspec` to run the tests.
+## Contributing
+Bug reports and pull requests are welcome on GitHub.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

data/Rakefile ADDED

@@ -0,0 +1,18 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+desc "Run RuboCop"
+task :rubocop do
+  sh "bundle exec rubocop"
+end
+desc "Run RuboCop with autocorrect"
+task "rubocop:fix" do
+  sh "bundle exec rubocop --autocorrect"
+end
+task default: :spec

data/corp_pdf.gemspec ADDED

@@ -0,0 +1,35 @@
+# frozen_string_literal: true
+require_relative 'lib/corp_pdf/version'
+Gem::Specification.new do |spec|
+  spec.name          = "corp_pdf"
+  spec.version       = CorpPdf::VERSION
+  spec.authors       = ["Michael Wynkoop"]
+  spec.email         = ["michaelwynkoop@corporatetools.com"]
+  spec.summary       = "Pure Ruby PDF AcroForm editing library"
+  spec.description   = "A minimal pure Ruby library for parsing and editing PDF AcroForm fields using only stdlib"
+  spec.homepage      = "https://github.com/corporatetools/corp_pdf"
+  spec.license       = "MIT"
+  spec.required_ruby_version = Gem::Requirement.new(">= 3.1.0")
+  spec.metadata["homepage_uri"] = spec.homepage
+  spec.metadata["source_code_uri"] = "https://github.com/corporatetools/corp_pdf"
+  spec.metadata["changelog_uri"] = "https://github.com/corporatetools/corp_pdf/blob/main/CHANGELOG.md"
+  # Specify which files should be added to the gem when it is released.
+  spec.files         = Dir.chdir(File.expand_path('..', __FILE__)) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  spec.add_runtime_dependency "chunky_png", "~> 1.4"
+  spec.add_runtime_dependency "i18n", "~> 1.14"
+  spec.add_development_dependency "rspec", "~> 3.0"
+  spec.add_development_dependency "pry", "~> 0.14"
+  spec.add_development_dependency "rubocop", "~> 1.50"
+  spec.add_development_dependency "rubocop-rspec", "~> 2.20"
+end

data/docs/README.md ADDED

@@ -0,0 +1,111 @@
+# CorpPdf Documentation
+This directory contains detailed documentation about how `CorpPdf` works, with a focus on explaining the text-based nature of PDFs and how the library uses simple text traversal to parse and modify them.
+## Documentation Overview
+### [PDF Structure](./pdf_structure.md)
+Explains the fundamental structure of PDF files, including:
+- PDFs as text-based files with structured syntax
+- PDF dictionaries (`<< ... >>`)
+- PDF objects, references, arrays, and strings
+- Why PDF structure is parseable with text traversal
+- Examples of PDF dictionary structure
+**Key insight:** PDFs may contain binary data in streams, but their **structure**—dictionaries, arrays, strings, references—is all text-based syntax.
+### [DictScan Explained](./dict_scan_explained.md)
+A detailed walkthrough of the `DictScan` module:
+- How each function works
+- Why text traversal is the core approach
+- Step-by-step algorithm explanations
+- Common patterns for using `DictScan`
+- Examples showing how text traversal parses PDF dictionaries
+**Key insight:** Despite appearing complicated, `DictScan` is fundamentally **text traversal**—finding delimiters (`<<`, `>>`, `(`, `)`, etc.) and tracking depth to extract values.
+### [Object Streams](./object_streams.md)
+Explains how PDF object streams work and how `CorpPdf` parses them:
+- What object streams are and why they're used
+- Object stream structure (header + data sections)
+- How `ObjectResolver` identifies objects in streams
+- The `ObjStm.parse` algorithm
+- Stream decoding (compression, PNG predictor)
+- Lazy loading and caching
+**Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
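The "space-separated numbers and substrings by offset" idea can be sketched in a few lines. In a decoded object stream, the header holds `/N` pairs of integers (object number, byte offset), and each offset is relative to the position given by `/First`. The function below is a hypothetical illustration, not the actual `ObjStm.parse` implementation, and it assumes plain Flate compression with no predictor.

```ruby
require "zlib"

# Hypothetical sketch: `n` and `first` stand for the stream's /N and /First.
# Returns a Hash of object number => raw object text.
def parse_object_stream(decoded, n:, first:)
  header = decoded.byteslice(0, first).split.map { |tok| Integer(tok, 10) }
  pairs  = header.each_slice(2).first(n) # [[obj_num, offset], ...]
  body   = decoded.byteslice(first, decoded.bytesize - first)

  result = {}
  pairs.each_with_index do |(obj_num, offset), i|
    # An object ends where the next one starts, or at the end of the body.
    stop = i + 1 < pairs.length ? pairs[i + 1][1] : body.bytesize
    result[obj_num] = body.byteslice(offset, stop - offset).strip
  end
  result
end

# Build a tiny stream, compress it, then decode and parse it again.
header  = "11 0 12 16\n"
body    = "<< /T (Name) >>\n<< /T (Email) >>"
decoded = Zlib::Inflate.inflate(Zlib::Deflate.deflate(header + body))
objects = parse_object_stream(decoded, n: 2, first: header.bytesize)
```

After decompression nothing here depends on PDF semantics: the parser only splits integers and slices bytes.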
+### [Clearing Fields](./clear_fields.md)
+Documentation for the `clear` and `clear!` methods:
+- How to remove unwanted fields completely
+- Difference between `clear` and `remove_field`
+- Pattern matching and field selection
+- Removing orphaned widget references
+- Best practices for clearing PDFs
+**Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
+## Common Themes
+Throughout all documentation, you'll see these recurring themes:
+1. **PDFs are text-based**: Despite being "binary" files, PDF structure uses text syntax
+2. **Text traversal works**: Simple character-by-character scanning can parse PDF dictionaries
+3. **Depth tracking**: Nested structures (dictionaries, arrays, strings) use depth counting
+4. **Position-based replacement**: Using exact byte positions is safer than regex replacement
+5. **Minimal parsing**: We don't need a full PDF parser—just enough to find dictionaries and extract/replace values
+## How to Read These Docs
+**If you're new to PDFs:**
+1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
+2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
+3. Read [Object Streams](./object_streams.md) to understand compression features
+4. Read [Clearing Fields](./clear_fields.md) to learn how to remove unwanted fields
+**If you're debugging:**
+- [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
+- [Object Streams](./object_streams.md) explains how object streams are parsed
+**If you're contributing:**
+- All docs include code examples and algorithm explanations
+- Each document explains **why** the approach works, not just **how**
+## Technical Details
+### Why Text Traversal Works
+PDF dictionaries use distinct delimiters:
+- `<<` `>>` for dictionaries
+- `[` `]` for arrays
+- `(` `)` for literal strings
+- `<` `>` for hex strings
+- `/` for names
+These delimiters let a scanner determine a value's type from its first character, with one character of lookahead to tell `<<` (dictionary) from `<` (hex string). Depth tracking (counting `<<`/`>>`, `[`/`]`, etc.) handles nested structures.
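Depth tracking over these delimiters can be sketched as a small scanner that finds where a dictionary ends. The code below is a hypothetical illustration, not the `DictScan` API. It assumes the input is already free of stream data and comments, and it skips literal and hex strings so that delimiters inside them are not counted.

```ruby
# Hypothetical sketch: returns the index just past the `>>` that closes the
# dictionary opening at `start`, or nil if it is never closed.
def find_dict_end(src, start)
  raise ArgumentError, "expected << at #{start}" unless src[start, 2] == "<<"

  depth = 0
  i = start
  while i < src.length
    if src[i] == "("
      i = skip_literal_string(src, i)
    elsif src[i, 2] == "<<"
      depth += 1
      i += 2
    elsif src[i, 2] == ">>"
      depth -= 1
      i += 2
      return i if depth.zero?
    elsif src[i] == "<" # hex string: jump past its closing ">"
      close = src.index(">", i)
      i = close ? close + 1 : src.length
    else
      i += 1
    end
  end
  nil
end

# Skips a literal string starting at `(`, honoring nesting and backslashes.
def skip_literal_string(src, i)
  depth = 0
  while i < src.length
    case src[i]
    when "\\" then i += 1 # skip the escaped character
    when "(" then depth += 1
    when ")"
      depth -= 1
      return i + 1 if depth.zero?
    end
    i += 1
  end
  src.length
end
```

Checking `<<` before the single `<` is the lookahead step: a hex string such as `<48>` directly followed by `>>` would otherwise be mistaken for a dictionary close.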
+### Performance
+**Why text traversal is fast:**
+- No AST construction
+- No full PDF parsing
+- Direct string manipulation
+- Minimal memory allocation
+**Trade-offs:**
+- Doesn't validate entire PDF structure
+- Assumes dictionaries are well-formed
+- Some preprocessing needed (stream stripping)
+### Safety
+**Position-based replacement** (using exact byte positions) avoids regex edge cases and preserves formatting. The code verifies dictionaries remain valid after modification.
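Position-based replacement can be illustrated with a byte-range splice. Once a scanner has reported where a value starts and how long it is, only that range is swapped out. `replace_span` below is a hypothetical helper, not a CorpPdf method, and the example locates the value with `String#index` purely to stay short.

```ruby
# Hypothetical helper: returns a copy of `bytes` with the byte range
# [start, start + length) replaced. The input string is not modified.
def replace_span(bytes, start, length, replacement)
  out = bytes.b # binary copy, so indices are byte offsets
  out[start, length] = replacement.b
  out
end

dict      = "<< /T (Name) /V (Old (x)) >>"
old_value = "(Old (x))"
start     = dict.index(old_value) # in practice this comes from the scanner
updated   = replace_span(dict, start, old_value.bytesize, "(New)")
```

A pattern such as `/\(.*?\)/` would match `(Name)` first and would also stop too early inside `(Old (x))`, which is the kind of edge case that exact positions avoid.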
+## Questions?
+If you have questions about how `CorpPdf` works, these docs should answer them. The code is also well-commented, so reading the source alongside the docs is recommended.