corp_pdf 1.0.5
- checksums.yaml +7 -0
- data/.gitignore +13 -0
- data/.rubocop.yml +78 -0
- data/CHANGELOG.md +122 -0
- data/Gemfile +5 -0
- data/Gemfile.lock +90 -0
- data/README.md +518 -0
- data/Rakefile +18 -0
- data/corp_pdf.gemspec +35 -0
- data/docs/README.md +111 -0
- data/docs/clear_fields.md +202 -0
- data/docs/dict_scan_explained.md +341 -0
- data/docs/object_streams.md +311 -0
- data/docs/pdf_structure.md +251 -0
- data/issues/README.md +59 -0
- data/issues/memory-benchmark-results.md +551 -0
- data/issues/memory-improvements.md +388 -0
- data/issues/memory-optimization-summary.md +204 -0
- data/issues/refactoring-opportunities.md +259 -0
- data/lib/corp_pdf/actions/add_field.rb +73 -0
- data/lib/corp_pdf/actions/base.rb +48 -0
- data/lib/corp_pdf/actions/remove_field.rb +154 -0
- data/lib/corp_pdf/actions/update_field.rb +663 -0
- data/lib/corp_pdf/dict_scan.rb +523 -0
- data/lib/corp_pdf/document.rb +782 -0
- data/lib/corp_pdf/field.rb +145 -0
- data/lib/corp_pdf/fields/base.rb +384 -0
- data/lib/corp_pdf/fields/checkbox.rb +164 -0
- data/lib/corp_pdf/fields/radio.rb +220 -0
- data/lib/corp_pdf/fields/signature.rb +393 -0
- data/lib/corp_pdf/fields/text.rb +31 -0
- data/lib/corp_pdf/incremental_writer.rb +245 -0
- data/lib/corp_pdf/object_resolver.rb +381 -0
- data/lib/corp_pdf/objstm.rb +75 -0
- data/lib/corp_pdf/page.rb +90 -0
- data/lib/corp_pdf/pdf_writer.rb +133 -0
- data/lib/corp_pdf/version.rb +5 -0
- data/lib/corp_pdf.rb +35 -0
- data/publish +183 -0
- metadata +169 -0
|
@@ -0,0 +1,202 @@
|
|
|
1
|
+
# Clearing Fields with `clear` and `clear!`
|
|
2
|
+
|
|
3
|
+
The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
|
|
4
|
+
|
|
5
|
+
- Remove multiple layers of added fields
|
|
6
|
+
- Clear a PDF that has accumulated many unwanted fields
|
|
7
|
+
- Get back to a base file without certain fields
|
|
8
|
+
- Remove orphaned or invalid field references
|
|
9
|
+
|
|
10
|
+
Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
|
|
11
|
+
|
|
12
|
+
- Field objects are completely removed (not just marked as deleted)
|
|
13
|
+
- Widget annotations are removed from page `/Annots` arrays
|
|
14
|
+
- Orphaned widget references are cleaned up
|
|
15
|
+
- The AcroForm `/Fields` array is updated
|
|
16
|
+
- All references to removed fields are eliminated
|
|
17
|
+
|
|
18
|
+
## Methods
|
|
19
|
+
|
|
20
|
+
### `clear(options = {})`
|
|
21
|
+
|
|
22
|
+
Returns a new PDF with unwanted fields removed. Does not modify the current document.
|
|
23
|
+
|
|
24
|
+
**Options:**
|
|
25
|
+
- `keep_fields`: Array of field names to keep (all others removed)
|
|
26
|
+
- `remove_fields`: Array of field names to remove
|
|
27
|
+
- `remove_pattern`: Regex pattern - fields matching this are removed
|
|
28
|
+
- Block: Given field name, return `true` to keep, `false` to remove
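
The selection rules above can be sketched in plain Ruby. This is only an illustration, not the gem's implementation: `names_to_remove` is a hypothetical helper, and the precedence between the options (block first, then `keep_fields`, `remove_fields`, `remove_pattern`) is an assumption made for the sketch.

```ruby
# Hypothetical helper: returns the field names a clear-style call would remove.
# The order in which the criteria are checked is an assumption.
def names_to_remove(all_names, keep_fields: nil, remove_fields: nil, remove_pattern: nil)
  all_names.select do |name|
    if block_given?
      !yield(name)                    # block returns true to keep
    elsif keep_fields
      !keep_fields.include?(name)     # everything not listed is removed
    elsif remove_fields
      remove_fields.include?(name)
    elsif remove_pattern
      name.match?(remove_pattern)
    else
      false                           # no criteria: remove nothing
    end
  end
end

names = ["Name", "Email", "text-1", "temp_notes"]
by_keep    = names_to_remove(names, keep_fields: ["Name", "Email"])
by_pattern = names_to_remove(names, remove_pattern: /^text-/)
by_block   = names_to_remove(names) { |name| !name.start_with?("temp_") }
```

Note that the block form is the inverse of the other removal options: it describes what to keep, so a "remove if" condition has to be negated.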
|
|
29
|
+
|
|
30
|
+
### `clear!(options = {})`
|
|
31
|
+
|
|
32
|
+
Same as `clear`, but modifies the current document in-place. Mutates the document instance.
|
|
33
|
+
|
|
34
|
+
## Usage Examples
|
|
35
|
+
|
|
36
|
+
### Remove All Fields
|
|
37
|
+
|
|
38
|
+
```ruby
|
|
39
|
+
doc = CorpPdf::Document.new("form.pdf")
|
|
40
|
+
|
|
41
|
+
# Remove all fields
|
|
42
|
+
cleared_pdf = doc.clear(remove_pattern: /.*/)
|
|
43
|
+
|
|
44
|
+
# Or in-place
|
|
45
|
+
doc.clear!(remove_pattern: /.*/)
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
### Remove Fields Matching a Pattern
|
|
49
|
+
|
|
50
|
+
```ruby
|
|
51
|
+
# Remove all fields starting with "text-"
|
|
52
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
53
|
+
|
|
54
|
+
# Remove generated fields (the block returns true for names to keep)
|
|
55
|
+
doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
### Keep Only Specific Fields
|
|
59
|
+
|
|
60
|
+
```ruby
|
|
61
|
+
# Keep only these fields, remove all others
|
|
62
|
+
doc.clear!(keep_fields: ["Name", "Email", "Phone"])
|
|
63
|
+
|
|
64
|
+
# Write the cleared PDF
|
|
65
|
+
doc.write("cleared.pdf", flatten: true)
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
### Remove Specific Fields
|
|
69
|
+
|
|
70
|
+
```ruby
|
|
71
|
+
# Remove specific unwanted fields
|
|
72
|
+
doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
### Complex Selection with Block
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
# Remove fields matching certain criteria
|
|
79
|
+
doc.clear! do |name|
|
|
80
|
+
  # Keep a field only if it does not look generated
|
|
81
|
+
  !(name.start_with?("text-") ||
|
|
82
|
+
    name.match?(/^[a-f0-9]{20,}/))
|
|
83
|
+
end
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
## How It Works
|
|
87
|
+
|
|
88
|
+
The `clear` method:
|
|
89
|
+
|
|
90
|
+
1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
|
|
91
|
+
|
|
92
|
+
2. **Finds related widgets** for each field to be removed:
|
|
93
|
+
- Widgets that reference the field via `/Parent`
|
|
94
|
+
- Widgets that have the same name via `/T`
|
|
95
|
+
|
|
96
|
+
3. **Collects objects to write**, excluding:
|
|
97
|
+
- Field objects that should be removed
|
|
98
|
+
- Widget annotation objects that should be removed
|
|
99
|
+
|
|
100
|
+
4. **Updates AcroForm structure**:
|
|
101
|
+
- Removes field references from the `/Fields` array
|
|
102
|
+
- Handles both inline and indirect array references
|
|
103
|
+
|
|
104
|
+
5. **Clears page annotations**:
|
|
105
|
+
- Removes widget references from page `/Annots` arrays
|
|
106
|
+
- Removes orphaned widget references (widgets pointing to non-existent fields)
|
|
107
|
+
- Removes references to widgets that don't exist in the cleared PDF
|
|
108
|
+
|
|
109
|
+
6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
|
|
110
|
+
|
|
111
|
+
## Key Differences from `remove_field`
|
|
112
|
+
|
|
113
|
+
| Feature | `remove_field` | `clear` |
|
|
114
|
+
|---------|---------------|---------|
|
|
115
|
+
| Update Type | Incremental update | Complete rewrite |
|
|
116
|
+
| Object Removal | Marks as deleted | Completely excluded |
|
|
117
|
+
| PDF Structure | Preserves all objects | Only includes selected objects |
|
|
118
|
+
| Use Case | Remove one/a few fields | Remove many fields or clean up |
|
|
119
|
+
| Performance | Fast (append only) | Slower (full rewrite) |
|
|
120
|
+
|
|
121
|
+
## Best Practices
|
|
122
|
+
|
|
123
|
+
1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, a single `clear` rewrite is more efficient than many incremental `remove_field` calls and produces cleaner output.
|
|
124
|
+
|
|
125
|
+
2. **Always flatten after clearing**: Since `clear` rewrites the PDF, consider using `write(..., flatten: true)` to ensure compatibility with all PDF viewers:
|
|
126
|
+
|
|
127
|
+
```ruby
|
|
128
|
+
doc.clear!(remove_pattern: /^text-/)
|
|
129
|
+
doc.write("output.pdf", flatten: true)
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
3. **Combine with field addition**: After clearing, you can add new fields:
|
|
133
|
+
|
|
134
|
+
```ruby
|
|
135
|
+
doc.clear!(remove_pattern: /.*/)
|
|
136
|
+
doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
|
|
137
|
+
doc.write("output.pdf", flatten: true)
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
|
|
141
|
+
|
|
142
|
+
```ruby
|
|
143
|
+
# Remove all UUID-like fields
|
|
144
|
+
doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
|
|
145
|
+
|
|
146
|
+
# Remove all fields containing "temp" or "test"
|
|
147
|
+
doc.clear!(remove_pattern: /temp|test/i)
|
|
148
|
+
```
|
|
149
|
+
|
|
150
|
+
## Technical Details
|
|
151
|
+
|
|
152
|
+
### Orphaned Widget Removal
|
|
153
|
+
|
|
154
|
+
The `clear` method automatically identifies and removes orphaned widget references:
|
|
155
|
+
|
|
156
|
+
- **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
|
|
157
|
+
- **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
|
|
158
|
+
|
|
159
|
+
This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
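
A minimal sketch of the two checks, under stated assumptions: objects are modelled as a Hash from `"num gen"` to dictionary text, and `clean_annots` is a hypothetical helper, not part of the gem.

```ruby
# Sketch only: keeps an annotation reference when its object exists and,
# if it names a /Parent, when that parent also exists.
def clean_annots(annot_refs, objects)
  annot_refs.select do |ref|
    body = objects[ref]
    next false if body.nil?                        # non-existent widget

    parent = body[/\/Parent\s+(\d+\s+\d+)\s+R/, 1]
    parent.nil? || objects.key?(parent)            # orphaned if parent is gone
  end
end

objects = {
  "5 0" => "<< /FT /Tx /T (Name) >>",
  "6 0" => "<< /Subtype /Widget /Parent 5 0 R >>",
  "8 0" => "<< /Subtype /Widget /Parent 7 0 R >>"  # parent 7 0 was removed
}
kept = clean_annots(["6 0", "8 0", "9 0"], objects)
```

Here `8 0` is dropped because its parent is missing, and `9 0` is dropped because the object itself does not exist.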
|
|
160
|
+
|
|
161
|
+
### Page Detection
|
|
162
|
+
|
|
163
|
+
The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
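
One way to make that distinction in a text scan is a negative lookahead after `/Page`. The pattern and the `page_object?` helper below are illustrative assumptions, not necessarily what the gem uses.

```ruby
# Matches "/Type /Page" but not "/Type /Pages" (or any longer name).
PAGE_TYPE = %r{/Type\s*/Page(?![A-Za-z])}

# Hypothetical helper: true when the dictionary text describes a single page.
def page_object?(dict_src)
  dict_src.match?(PAGE_TYPE)
end

page  = page_object?("<< /Type /Page /Parent 2 0 R >>")
pages = page_object?("<< /Type /Pages /Kids [3 0 R] >>")
```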
|
|
164
|
+
|
|
165
|
+
### AcroForm Structure
|
|
166
|
+
|
|
167
|
+
The method properly handles both:
|
|
168
|
+
- **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
|
|
169
|
+
- **Indirect `/Fields` arrays**: Arrays referenced as separate objects
|
|
170
|
+
|
|
171
|
+
Both are updated to remove references to deleted fields.
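
As a rough illustration, the two layouts can be told apart by looking at what follows `/Fields`. The `fields_kind` helper and its return symbols are assumptions made for this sketch.

```ruby
# Hypothetical helper: classifies how an AcroForm dictionary stores /Fields.
def fields_kind(acroform_src)
  case acroform_src
  when %r{/Fields\s*\[}            then :inline    # array written in place
  when %r{/Fields\s+\d+\s+\d+\s+R} then :indirect  # reference to an array object
  else :missing
  end
end

inline   = fields_kind("<< /Fields [5 0 R 6 0 R] /NeedAppearances true >>")
indirect = fields_kind("<< /Fields 12 0 R >>")
```

With an inline array the AcroForm dictionary itself is rewritten; with an indirect array the referenced object is the one that changes.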
|
|
172
|
+
|
|
173
|
+
## Example: Complete Clearing Workflow
|
|
174
|
+
|
|
175
|
+
```ruby
|
|
176
|
+
require 'corp_pdf'
|
|
177
|
+
|
|
178
|
+
# Load PDF with many unwanted fields
|
|
179
|
+
doc = CorpPdf::Document.new("messy_form.pdf")
|
|
180
|
+
|
|
181
|
+
# Remove all generated/UUID-like fields
|
|
182
|
+
doc.clear! { |name|
|
|
183
|
+
  # Keep a field only if it does not look generated or temporary
|
|
184
|
+
  !(name.match?(/^[a-f0-9-]{30,}/) || # UUID-like
|
|
185
|
+
    name.start_with?("temp_") || # Temporary
|
|
186
|
+
    name.empty?) # Empty name
|
|
187
|
+
}
|
|
188
|
+
|
|
189
|
+
# Add new fields
|
|
190
|
+
doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
|
|
191
|
+
doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
|
|
192
|
+
|
|
193
|
+
# Write cleared and updated PDF
|
|
194
|
+
doc.write("cleared_form.pdf", flatten: true)
|
|
195
|
+
```
|
|
196
|
+
|
|
197
|
+
## See Also
|
|
198
|
+
|
|
199
|
+
- [`flatten` and `flatten!`](./README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
|
|
200
|
+
- [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
|
|
201
|
+
- [Main README](../README.md) - General usage and API reference
|
|
202
|
+
|
|
@@ -0,0 +1,341 @@
|
|
|
1
|
+
# DictScan Explained: Text Traversal in Action
|
|
2
|
+
|
|
3
|
+
## The Big Picture
|
|
4
|
+
|
|
5
|
+
`DictScan` is a module that appears complicated at first glance, but it's fundamentally just **text traversal**—walking through PDF files character by character to find and extract dictionary structures.
|
|
6
|
+
|
|
7
|
+
This document explains how each function in `DictScan` works and why the text-traversal approach is both powerful and straightforward.
|
|
8
|
+
|
|
9
|
+
## Core Principle
|
|
10
|
+
|
|
11
|
+
**PDF dictionaries are text patterns.** They use `<<` and `>>` as delimiters, just like how programming languages use `{` and `}` or `[` and `]`. Once you recognize this, parsing becomes a matter of tracking depth and matching delimiters.
|
|
12
|
+
|
|
13
|
+
## Function-by-Function Guide
|
|
14
|
+
|
|
15
|
+
### `strip_stream_bodies(pdf)`
|
|
16
|
+
|
|
17
|
+
**Purpose:** Remove binary stream data that would confuse text parsing.
|
|
18
|
+
|
|
19
|
+
**How it works:**
|
|
20
|
+
- Finds all `stream...endstream` blocks using regex
|
|
21
|
+
- Replaces the binary content with a placeholder
|
|
22
|
+
- Preserves the stream structure markers
|
|
23
|
+
|
|
24
|
+
**Why:** Streams can contain arbitrary binary data (compressed images, fonts, etc.) that would break our text-based parsing. We strip them out since we're only interested in the dictionary structure.
|
|
25
|
+
|
|
26
|
+
```ruby
|
|
27
|
+
pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
|
|
28
|
+
```
|
|
29
|
+
|
|
30
|
+
This is regex-based, but it's necessary preprocessing before we can safely do text traversal.
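
A small usage sketch shows why this matters. The method body is the regex shown above; the sample PDF fragment is made up for illustration and contains a stray `<<` inside the stream body.

```ruby
def strip_stream_bodies(pdf)
  pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
end

# Made-up object whose stream body happens to contain "<<".
pdf = "1 0 obj\n<< /Length 9 >>\nstream\n\x00\x01<<junk\nendstream\nendobj\n"
stripped = strip_stream_bodies(pdf)

# Only the real dictionary opener remains after stripping.
openers = stripped.scan("<<").length
```

Without the stripping step, the `<<` inside the stream would look like the start of a dictionary to the delimiter-based scan.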
|
|
31
|
+
|
|
32
|
+
---
|
|
33
|
+
|
|
34
|
+
### `each_dictionary(str)`
|
|
35
|
+
|
|
36
|
+
**Purpose:** Iterate through all dictionaries in a string.
|
|
37
|
+
|
|
38
|
+
**Algorithm:**
|
|
39
|
+
1. Find the first `<<` at position `i`
|
|
40
|
+
2. Initialize depth counter to 0
|
|
41
|
+
3. Scan forward:
|
|
42
|
+
- If we see `<<`, increment depth
|
|
43
|
+
- If we see `>>`, decrement depth
|
|
44
|
+
- If depth reaches 0, we've found a complete dictionary
|
|
45
|
+
4. Yield the dictionary substring
|
|
46
|
+
5. Continue from where we left off
|
|
47
|
+
|
|
48
|
+
**Example:**
|
|
49
|
+
```
|
|
50
|
+
Input: "<< /A 1 >> << /B 2 >>"
|
|
51
|
+
i=0: find "<<"
|
|
52
|
+
depth=1, scan forward
|
|
53
|
+
see ">>", depth=0 → found "<< /A 1 >>"
|
|
54
|
+
yield and continue from i=10
|
|
55
|
+
i=11: find "<<"
|
|
56
|
+
depth=1, scan forward
|
|
57
|
+
see ">>", depth=0 → found "<< /B 2 >>"
|
|
58
|
+
yield and continue
|
|
59
|
+
```
|
|
60
|
+
|
|
61
|
+
**Why it works:** This is classic **bracket matching**. No PDF-specific knowledge needed—just counting delimiters.
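
The algorithm above can be sketched as follows. This is a minimal reimplementation for illustration; the gem's actual method may differ in details such as how unbalanced input is handled.

```ruby
# Yields each top-level "<< ... >>" substring, counting delimiters for nesting.
def each_dictionary(str)
  i = 0
  while (open = str.index("<<", i))
    depth = 0
    j = open
    found = nil
    while j < str.length
      if str[j, 2] == "<<"
        depth += 1
        j += 2
      elsif str[j, 2] == ">>"
        depth -= 1
        j += 2
        if depth.zero?
          found = str[open...j]
          break
        end
      else
        j += 1
      end
    end
    break unless found # unbalanced input: stop scanning (an assumption)

    yield found
    i = j              # continue right after the dictionary just yielded
  end
end

dicts = []
each_dictionary("<< /A 1 >> << /B << /C 2 >> >>") { |d| dicts << d }
```

In this sketch a nested dictionary is yielded as part of its parent rather than on its own, because scanning resumes after the outer closing `>>`.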
|
|
62
|
+
|
|
63
|
+
---
|
|
64
|
+
|
|
65
|
+
### `unescape_literal(s)`
|
|
66
|
+
|
|
67
|
+
**Purpose:** Decode PDF escape sequences in literal strings.
|
|
68
|
+
|
|
69
|
+
**PDF escapes:**
|
|
70
|
+
- `\n` → newline
|
|
71
|
+
- `\r` → carriage return
|
|
72
|
+
- `\t` → tab
|
|
73
|
+
- `\b` → backspace
|
|
74
|
+
- `\f` → form feed
|
|
75
|
+
- `\(` → literal `(`
|
|
76
|
+
- `\)` → literal `)`
|
|
77
|
+
- `\123` → octal character (up to 3 digits)
|
|
78
|
+
|
|
79
|
+
**Algorithm:** Character-by-character scan:
|
|
80
|
+
1. If we see `\`, look ahead one character
|
|
81
|
+
2. Map escape sequences to actual characters
|
|
82
|
+
3. Handle octal sequences (1-3 digits)
|
|
83
|
+
4. Otherwise, copy character as-is
|
|
84
|
+
|
|
85
|
+
**Why it works:** This is standard escape sequence handling, identical to how many programming languages handle string literals.
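
A minimal sketch of that scan, working on bytes so that octal escapes above 127 do not cause encoding errors. The handling of a trailing backslash and of unknown escapes is an assumption of this sketch.

```ruby
ESCAPES = { "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f" }.freeze

def unescape_literal(s)
  s = s.b            # operate bytewise
  out = String.new   # binary output buffer
  i = 0
  while i < s.length
    ch = s[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end

    nxt = s[i + 1]
    if nxt.nil?
      i += 1                                   # trailing backslash: drop it
    elsif ESCAPES.key?(nxt)
      out << ESCAPES[nxt]
      i += 2
    elsif nxt.match?(/[0-7]/)
      digits = s[i + 1, 3][/\A[0-7]{1,3}/]     # 1 to 3 octal digits
      out << (digits.to_i(8) & 0xFF).chr
      i += 1 + digits.length
    else
      out << nxt                               # \( \) \\ and unknown escapes
      i += 2
    end
  end
  out
end

plain = unescape_literal('Hello\nWorld')
```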
|
|
86
|
+
|
|
87
|
+
---
|
|
88
|
+
|
|
89
|
+
### `decode_pdf_string(token)`
|
|
90
|
+
|
|
91
|
+
**Purpose:** Decode a PDF string token into a Ruby string.
|
|
92
|
+
|
|
93
|
+
**PDF string types:**
|
|
94
|
+
1. **Literal:** `(Hello World)` or `(Hello\nWorld)`
|
|
95
|
+
2. **Hex:** `<48656C6C6F>` or `<FEFF00480065006C006C006F>`
|
|
96
|
+
|
|
97
|
+
**Algorithm:**
|
|
98
|
+
1. Check if token starts with `(` → literal string
|
|
99
|
+
- Extract content between parentheses
|
|
100
|
+
- Unescape using `unescape_literal`
|
|
101
|
+
- Check for UTF-16BE BOM (`FE FF`)
|
|
102
|
+
- Decode accordingly
|
|
103
|
+
2. Check if token starts with `<` → hex string
|
|
104
|
+
- Remove spaces, pad if odd length
|
|
105
|
+
- Convert hex to bytes
|
|
106
|
+
- Check for UTF-16BE BOM
|
|
107
|
+
- Decode accordingly
|
|
108
|
+
3. Otherwise, return as-is (name, number, reference, etc.)
|
|
109
|
+
|
|
110
|
+
**Why it works:** PDF strings have well-defined formats. We just pattern-match on the delimiters and decode accordingly.
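
The hex-string branch can be sketched like this. `decode_hex_token` and `decode_bytes` are hypothetical names, and returning the raw bytes when there is no BOM is a simplification.

```ruby
UTF16_BOM = "\xFE\xFF".b

# Decodes raw bytes: UTF-16BE when a BOM is present, otherwise returned as-is.
def decode_bytes(bytes)
  return bytes unless bytes.start_with?(UTF16_BOM)

  bytes[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
end

# Handles "<...>" tokens: strip whitespace, pad odd length, convert to bytes.
def decode_hex_token(token)
  hex = token[1...-1].gsub(/\s+/, "")
  hex << "0" if hex.length.odd?
  decode_bytes([hex].pack("H*"))
end

ascii   = decode_hex_token("<48656C6C6F>")
unicode = decode_hex_token("<FEFF00480065006C006C006F>")
```

Both tokens from the list above decode to the same text; only the byte layout differs.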
|
|
111
|
+
|
|
112
|
+
---
|
|
113
|
+
|
|
114
|
+
### `encode_pdf_string(val)`
|
|
115
|
+
|
|
116
|
+
**Purpose:** Encode a Ruby value into a PDF string token.
|
|
117
|
+
|
|
118
|
+
**Handles:**
|
|
119
|
+
- `true` → `"true"`
|
|
120
|
+
- `false` → `"false"`
|
|
121
|
+
- `Symbol` → `"/symbol_name"`
|
|
122
|
+
- `String`:
|
|
123
|
+
- ASCII-only → literal string `(value)`
|
|
124
|
+
- Non-ASCII → hex string with UTF-16BE encoding
|
|
125
|
+
|
|
126
|
+
**Why it works:** Reverse of `decode_pdf_string`—we know the target format and encode accordingly.
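
A sketch of those cases follows. Escaping `\`, `(` and `)` inside the literal form and falling back to `to_s` for other values are assumptions of this sketch, not confirmed details of the gem.

```ruby
def encode_pdf_string(val)
  case val
  when true   then "true"
  when false  then "false"
  when Symbol then "/#{val}"
  when String
    if val.ascii_only?
      # Assumption: escape characters that would break the literal syntax.
      "(" + val.gsub(/([\\()])/) { "\\#{Regexp.last_match(1)}" } + ")"
    else
      # Non-ASCII: UTF-16BE bytes as hex, prefixed with the FEFF BOM.
      "<FEFF" + val.encode("UTF-16BE").unpack1("H*").upcase + ">"
    end
  else
    val.to_s # assumption: numbers and other values pass through as text
  end
end

literal = encode_pdf_string("Hello")
hex     = encode_pdf_string("\u00E9")
```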
|
|
127
|
+
|
|
128
|
+
---
|
|
129
|
+
|
|
130
|
+
### `value_token_after(key, dict_src)`
|
|
131
|
+
|
|
132
|
+
**Purpose:** Extract the value token that follows a key in a dictionary.
|
|
133
|
+
|
|
134
|
+
**This is the heart of text traversal.** Here's how it works:
|
|
135
|
+
|
|
136
|
+
1. **Find the key:**
|
|
137
|
+
```ruby
|
|
138
|
+
match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
|
|
139
|
+
```
|
|
140
|
+
Use regex to ensure the key is followed by a delimiter (whitespace, `(`, `<`, `[`, or `/`). This prevents partial matches.
|
|
141
|
+
|
|
142
|
+
2. **Skip whitespace:**
|
|
143
|
+
```ruby
|
|
144
|
+
i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
|
|
145
|
+
```
|
|
146
|
+
|
|
147
|
+
3. **Switch on the next character:**
|
|
148
|
+
- **`(` → Literal string:**
|
|
149
|
+
- Track depth of parentheses
|
|
150
|
+
- Handle escaped characters (skip `\` and next char)
|
|
151
|
+
- Match closing `)` when depth returns to 0
|
|
152
|
+
- **`<` → Hex string or dictionary:**
|
|
153
|
+
- If `<<` → return `"<<"` (nested dictionary marker)
|
|
154
|
+
- Otherwise, find matching `>`
|
|
155
|
+
- **`[` → Array:**
|
|
156
|
+
- Track depth of brackets
|
|
157
|
+
- Match closing `]` when depth returns to 0
|
|
158
|
+
- **`/` → PDF name:**
|
|
159
|
+
- Extract until whitespace or delimiter
|
|
160
|
+
- **Otherwise → Atom:**
|
|
161
|
+
- Extract until whitespace or delimiter (number, reference, boolean, etc.)
|
|
162
|
+
|
|
163
|
+
**Why it works:** PDF has well-defined token syntax. Each value type has distinct delimiters, so we can pattern-match on the first character and extract accordingly.
|
|
164
|
+
|
|
165
|
+
**Example:**
|
|
166
|
+
```
|
|
167
|
+
Dict: "<< /V (Hello) /R [1 2 3] >>"
|
|
168
|
+
value_token_after("/V", dict):
|
|
169
|
+
→ Finds "/V" at position 3
|
|
170
|
+
→ Skips space
|
|
171
|
+
→ Next char is "("
|
|
172
|
+
→ Extracts "(Hello)" using paren matching
|
|
173
|
+
→ Returns "(Hello)"
|
|
174
|
+
```
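
Putting the three steps together gives roughly the following. It is a simplified sketch: the key regex and the whitespace skip are taken from the snippets above, while treating `num gen R` as a single atom is an assumption.

```ruby
def value_token_after(key, dict_src)
  match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
  return nil unless match

  i = match.end(0)
  i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
  return nil if i >= dict_src.length

  case dict_src[i]
  when "(" # literal string: track paren depth, skip escaped characters
    depth = 0
    j = i
    while j < dict_src.length
      ch = dict_src[j]
      if ch == "\\"
        j += 2
        next
      end
      depth += 1 if ch == "("
      depth -= 1 if ch == ")"
      j += 1
      return dict_src[i...j] if depth.zero?
    end
    nil
  when "<" # nested dictionary marker or hex string
    return "<<" if dict_src[i, 2] == "<<"

    j = dict_src.index(">", i)
    j && dict_src[i..j]
  when "[" # array: track bracket depth
    depth = 0
    j = i
    while j < dict_src.length
      depth += 1 if dict_src[j] == "["
      depth -= 1 if dict_src[j] == "]"
      j += 1
      return dict_src[i...j] if depth.zero?
    end
    nil
  when "/" # name
    dict_src[i..-1][%r{\A/[^\s<>\[\]()/]+}]
  else     # atom; a reference "5 0 R" is kept whole (assumption)
    rest = dict_src[i..-1]
    rest[/\A\d+\s+\d+\s+R\b/] || rest[/\A[^\s<>\[\]()\/]+/]
  end
end

dict = "<< /V (Hello) /R [1 2 3] >>"
v = value_token_after("/V", dict)
r = value_token_after("/R", dict)
```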
|
|
175
|
+
|
|
176
|
+
---
|
|
177
|
+
|
|
178
|
+
### `replace_key_value(dict_src, key, new_token)`
|
|
179
|
+
|
|
180
|
+
**Purpose:** Replace a key's value in a dictionary string.
|
|
181
|
+
|
|
182
|
+
**Algorithm:**
|
|
183
|
+
1. Find the key using pattern matching
|
|
184
|
+
2. Extract the existing value token using `value_token_after`
|
|
185
|
+
3. Find exact byte positions:
|
|
186
|
+
- Key start/end
|
|
187
|
+
- Value start/end
|
|
188
|
+
4. Replace using string slicing: `before + new_token + after`
|
|
189
|
+
5. Verify dictionary is still valid (contains `<<` and `>>`)
|
|
190
|
+
|
|
191
|
+
**Why it works:** We use **precise byte positions** rather than regex replacement. This:
|
|
192
|
+
- Preserves exact formatting (whitespace, etc.)
|
|
193
|
+
- Avoids regex edge cases
|
|
194
|
+
- Is deterministic and safe
|
|
195
|
+
|
|
196
|
+
**Example:**
|
|
197
|
+
```
|
|
198
|
+
Input: "<< /V (Old) /X 1 >>"
|
|
199
|
+
before: "<< /V "
|
|
200
|
+
value: "(Old)"
|
|
201
|
+
after: " /X 1 >>"
|
|
202
|
+
Output: "<< /V (New) /X 1 >>"
|
|
203
|
+
```
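
The position-based replacement can be sketched as below. To stay short, this version locates the old value with a simplified `TOKEN` pattern (an assumption) instead of calling `value_token_after`, so it only handles flat values: non-nested literal strings, hex strings, names, references, and atoms.

```ruby
# Simplified value pattern, for illustration only.
TOKEN = %r{\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|/[^\s<>\[\]()/]+|\d+\s+\d+\s+R\b|[^\s<>\[\]()/]+}

def replace_key_value(dict_src, key, new_token)
  key_match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
  return dict_src unless key_match

  # \G anchors the value search at the position right after the key.
  value_match = dict_src.match(/\G\s*(#{TOKEN})/, key_match.end(0))
  return dict_src unless value_match

  v_start = value_match.begin(1)
  v_end   = value_match.end(1)
  result  = dict_src[0...v_start] + new_token + dict_src[v_end..-1]

  # Sanity check: fall back to the input if the delimiters were lost.
  result.include?("<<") && result.include?(">>") ? result : dict_src
end

updated = replace_key_value("<< /V (Old) /X 1 >>", "/V", "(New)")
```

Everything outside the value's start and end offsets is copied unchanged, which is what preserves the original spacing.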
|
|
204
|
+
|
|
205
|
+
---
|
|
206
|
+
|
|
207
|
+
### `upsert_key_value(dict_src, key, token)`
|
|
208
|
+
|
|
209
|
+
**Purpose:** Insert a key-value pair if the key doesn't exist.
|
|
210
|
+
|
|
211
|
+
**Algorithm:**
|
|
212
|
+
- If key not found, insert after the opening `<<`
|
|
213
|
+
- Uses simple string substitution: `"<<#{key} #{token}"`
|
|
214
|
+
|
|
215
|
+
**Why it works:** Simple string manipulation when we know the key doesn't exist.
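
The insert path is small enough to show in full. `insert_if_missing` is a hypothetical name for this sketch; the key check reuses the regex from `value_token_after`, and the substitution is the one quoted above.

```ruby
# Inserts "key token" right after the opening "<<" when the key is absent.
def insert_if_missing(dict_src, key, token)
  return dict_src if dict_src.match?(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})

  dict_src.sub("<<") { "<<#{key} #{token}" }
end

with_value = insert_if_missing("<< /FT /Tx >>", "/V", "(Hello)")
unchanged  = insert_if_missing("<< /V (Old) >>", "/V", "(Hello)")
```

Only the first `<<` is touched, so nested dictionaries later in the text are left alone.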
|
|
216
|
+
|
|
217
|
+
---
|
|
218
|
+
|
|
219
|
+
### `remove_ref_from_array(array_body, ref)` and `add_ref_to_array(array_body, ref)`
|
|
220
|
+
|
|
221
|
+
**Purpose:** Manipulate object references in PDF arrays.
|
|
222
|
+
|
|
223
|
+
**Algorithm:**
|
|
224
|
+
- Use regex/gsub to find and replace reference patterns: `"5 0 R"`
|
|
225
|
+
- Handle edge cases: empty arrays, spacing
|
|
226
|
+
|
|
227
|
+
**Why it works:** Object references have a fixed format (`num gen R`), so we can pattern-match and replace.
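
A minimal sketch of both helpers, assuming a reference is passed as a `[num, gen]` pair (the gem's actual argument shape is not shown here) and that `array_body` includes the surrounding brackets.

```ruby
# Removes "num gen R" from the array text and tidies the leftover spacing.
def remove_ref_from_array(array_body, ref)
  num, gen = ref
  array_body
    .gsub(/\b#{num}\s+#{gen}\s+R\b/, "")
    .gsub(/\s+/, " ")
    .sub(/\[\s+/, "[")
    .sub(/\s+\]/, "]")
end

# Appends "num gen R", handling the empty-array case.
def add_ref_to_array(array_body, ref)
  num, gen = ref
  entry = "#{num} #{gen} R"
  inner = array_body.strip[1...-1].strip
  inner.empty? ? "[#{entry}]" : "[#{inner} #{entry}]"
end

removed = remove_ref_from_array("[5 0 R 6 0 R 15 0 R]", [5, 0])
added   = add_ref_to_array("[5 0 R]", [7, 0])
```

The leading `\b` keeps `5 0 R` from matching inside `15 0 R`.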
|
|
228
|
+
|
|
229
|
+
---
|
|
230
|
+
|
|
231
|
+
## Why This Approach Works
|
|
232
|
+
|
|
233
|
+
### 1. PDF Structure is Text-Based
|
|
234
|
+
|
|
235
|
+
PDF dictionaries, arrays, strings, and references are all defined using text syntax. No binary parsing needed for structure.
|
|
236
|
+
|
|
237
|
+
### 2. Delimiters Are Unique
|
|
238
|
+
|
|
239
|
+
Each PDF value type has distinct delimiters:
|
|
240
|
+
- Dictionaries: `<<` `>>`
|
|
241
|
+
- Arrays: `[` `]`
|
|
242
|
+
- Literal strings: `(` `)`
|
|
243
|
+
- Hex strings: `<` `>`
|
|
244
|
+
- Names: `/`
|
|
245
|
+
- References: `R`
|
|
246
|
+
|
|
247
|
+
We can pattern-match on these to extract values.
|
|
248
|
+
|
|
249
|
+
### 3. Depth Tracking is Simple
|
|
250
|
+
|
|
251
|
+
Nested structures (dictionaries, arrays, strings) can be parsed by tracking depth—increment on open, decrement on close. Standard algorithm from compiler theory.
|
|
252
|
+
|
|
253
|
+
### 4. Position-Based Replacement is Safe
|
|
254
|
+
|
|
255
|
+
When modifying dictionaries, we use exact byte positions rather than regex replacement. This:
|
|
256
|
+
- Preserves formatting
|
|
257
|
+
- Avoids edge cases
|
|
258
|
+
- Is predictable
|
|
259
|
+
|
|
260
|
+
### 5. No Full Parser Needed
|
|
261
|
+
|
|
262
|
+
We don't need to:
|
|
263
|
+
- Build an AST
|
|
264
|
+
- Validate the entire PDF structure
|
|
265
|
+
- Handle all PDF features
|
|
266
|
+
|
|
267
|
+
We only need to:
|
|
268
|
+
- Find dictionaries
|
|
269
|
+
- Extract values
|
|
270
|
+
- Replace values
|
|
271
|
+
- Preserve structure
|
|
272
|
+
|
|
273
|
+
This is a **minimal parser** that does exactly what we need.
|
|
274
|
+
|
|
275
|
+
## Common Patterns
|
|
276
|
+
|
|
277
|
+
### Pattern 1: Find and Extract
|
|
278
|
+
|
|
279
|
+
```ruby
|
|
280
|
+
# Find all dictionaries
|
|
281
|
+
each_dictionary(pdf_text) do |dict|
|
|
282
|
+
# Extract a value
|
|
283
|
+
value_token = value_token_after("/V", dict)
|
|
284
|
+
value = decode_pdf_string(value_token)
|
|
285
|
+
puts value
|
|
286
|
+
end
|
|
287
|
+
```
|
|
288
|
+
|
|
289
|
+
### Pattern 2: Find and Replace
|
|
290
|
+
|
|
291
|
+
```ruby
|
|
292
|
+
# Get dictionary
|
|
293
|
+
dict = "<< /V (Old) >>"
|
|
294
|
+
|
|
295
|
+
# Replace value
|
|
296
|
+
new_dict = replace_key_value(dict, "/V", "(New)")
|
|
297
|
+
|
|
298
|
+
# Result: "<< /V (New) >>"
|
|
299
|
+
```
|
|
300
|
+
|
|
301
|
+
### Pattern 3: Encode and Insert
|
|
302
|
+
|
|
303
|
+
```ruby
|
|
304
|
+
# Prepare new value
|
|
305
|
+
token = encode_pdf_string("Hello")
|
|
306
|
+
|
|
307
|
+
# Insert into dictionary
|
|
308
|
+
dict = upsert_key_value(dict, "/V", token)
|
|
309
|
+
```
|
|
310
|
+
|
|
311
|
+
## Performance Considerations
|
|
312
|
+
|
|
313
|
+
**Why this is fast:**
|
|
314
|
+
- No AST construction
|
|
315
|
+
- No full PDF parsing
|
|
316
|
+
- Direct string manipulation
|
|
317
|
+
- Minimal memory allocation
|
|
318
|
+
|
|
319
|
+
**Trade-offs:**
|
|
320
|
+
- Doesn't validate entire PDF structure
|
|
321
|
+
- Assumes dictionaries are well-formed
|
|
322
|
+
- Stream stripping is regex-based (could be optimized)
|
|
323
|
+
|
|
324
|
+
## Conclusion
|
|
325
|
+
|
|
326
|
+
`DictScan` appears complicated because it handles many edge cases and value types, but the **core approach is elegantly simple**:
|
|
327
|
+
|
|
328
|
+
1. PDF dictionaries are text patterns
|
|
329
|
+
2. Parse them with character-by-character traversal
|
|
330
|
+
3. Track depth for nested structures
|
|
331
|
+
4. Use precise positions for replacement
|
|
332
|
+
|
|
333
|
+
No magic, no complex parsers—just careful text traversal with attention to PDF syntax rules.
|
|
334
|
+
|
|
335
|
+
The complexity you see is:
|
|
336
|
+
- **Edge case handling** (escaping, nesting, encoding)
|
|
337
|
+
- **Safety checks** (verification, error handling)
|
|
338
|
+
- **Support for multiple value types** (strings, arrays, dictionaries, references)
|
|
339
|
+
|
|
340
|
+
But the **fundamental algorithm** is straightforward: find delimiters, track depth, extract substrings, replace substrings.
|
|
341
|
+
|