corp_pdf 1.0.5

@@ -0,0 +1,202 @@
1
+ # Clearing Fields with `clear` and `clear!`
2
+
3
+ The `clear` method allows you to completely remove unwanted form fields from a PDF by rewriting the entire document, rather than using incremental updates. This is useful when you want to:
4
+
5
+ - Remove multiple layers of added fields
6
+ - Clear a PDF that has accumulated many unwanted fields
7
+ - Get back to a base file without certain fields
8
+ - Remove orphaned or invalid field references
9
+
10
+ Unlike `remove_field`, which uses incremental updates, `clear` rewrites the entire PDF (similar to `flatten`) but excludes the unwanted fields entirely. This ensures that:
11
+
12
+ - Field objects are completely removed (not just marked as deleted)
13
+ - Widget annotations are removed from page `/Annots` arrays
14
+ - Orphaned widget references are cleaned up
15
+ - The AcroForm `/Fields` array is updated
16
+ - All references to removed fields are eliminated
17
+
18
+ ## Methods
19
+
20
+ ### `clear(options = {})`
21
+
22
+ Returns a new PDF with unwanted fields removed. Does not modify the current document.
23
+
24
+ **Options:**
25
+ - `keep_fields`: Array of field names to keep (all others removed)
26
+ - `remove_fields`: Array of field names to remove
27
+ - `remove_pattern`: Regex pattern - fields matching this are removed
28
+ - Block: Given field name, return `true` to keep, `false` to remove
29
+
30
+ ### `clear!(options = {})`
31
+
32
+ Same as `clear`, but mutates the current document instance in place instead of returning a new PDF.
33
+
34
+ ## Usage Examples
35
+
36
+ ### Remove All Fields
37
+
38
+ ```ruby
39
+ doc = CorpPdf::Document.new("form.pdf")
40
+
41
+ # Remove all fields
42
+ cleared_pdf = doc.clear(remove_pattern: /.*/)
43
+
44
+ # Or in-place
45
+ doc.clear!(remove_pattern: /.*/)
46
+ ```
47
+
48
+ ### Remove Fields Matching a Pattern
49
+
50
+ ```ruby
51
+ # Remove all fields starting with "text-"
52
+ doc.clear!(remove_pattern: /^text-/)
53
+
54
+ # Remove generated fields ("text-" names or long hex names); the block returns true to keep a field
55
+ doc.clear! { |name| !(name =~ /text-/ || name =~ /^[a-f0-9]{20,}/) }
56
+ ```
57
+
58
+ ### Keep Only Specific Fields
59
+
60
+ ```ruby
61
+ # Keep only these fields, remove all others
62
+ doc.clear!(keep_fields: ["Name", "Email", "Phone"])
63
+
64
+ # Write the cleared PDF
65
+ doc.write("cleared.pdf", flatten: true)
66
+ ```
67
+
68
+ ### Remove Specific Fields
69
+
70
+ ```ruby
71
+ # Remove specific unwanted fields
72
+ doc.clear!(remove_fields: ["OldField1", "OldField2", "GeneratedField3"])
73
+ ```
74
+
75
+ ### Complex Selection with Block
76
+
77
+ ```ruby
78
+ # The block receives each field name; return true to keep, false to remove
79
+ doc.clear! do |name|
80
+   # Drop fields that look generated, keep everything else
81
+   !(name.start_with?("text-") ||
82
+     name.match?(/^[a-f0-9]{20,}/))
83
+ end
84
+ ```
85
+
86
+ ## How It Works
87
+
88
+ The `clear` method:
89
+
90
+ 1. **Identifies fields to remove** based on the provided criteria (pattern, list, or block)
91
+
92
+ 2. **Finds related widgets** for each field to be removed:
93
+ - Widgets that reference the field via `/Parent`
94
+ - Widgets that have the same name via `/T`
95
+
96
+ 3. **Collects objects to write**, excluding:
97
+ - Field objects that should be removed
98
+ - Widget annotation objects that should be removed
99
+
100
+ 4. **Updates AcroForm structure**:
101
+ - Removes field references from the `/Fields` array
102
+ - Handles both inline and indirect array references
103
+
104
+ 5. **Clears page annotations**:
105
+ - Removes widget references from page `/Annots` arrays
106
+ - Removes orphaned widget references (widgets pointing to non-existent fields)
107
+ - Removes references to widgets that don't exist in the cleared PDF
108
+
109
+ 6. **Rewrites the entire PDF** from scratch (like `flatten`) with only the selected objects
110
+
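The selection in step 1 can be sketched in plain Ruby. This is a minimal illustration, not the gem's code: `surviving_field_names` is a hypothetical helper, and the precedence between the block and the options is an assumption, since the documentation above does not say what happens when several criteria are combined.

```ruby
# Hypothetical helper, not part of CorpPdf: returns the field names that survive a clear.
# Assumed precedence: block, then keep_fields, then remove_fields, then remove_pattern.
def surviving_field_names(names, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  names.select do |name|
    if block
      block.call(name)                  # block: true means keep
    elsif keep_fields
      keep_fields.include?(name)        # keep only the listed names
    elsif remove_fields
      !remove_fields.include?(name)     # drop the listed names
    elsif remove_pattern
      !name.match?(remove_pattern)      # drop names matching the pattern
    else
      true                              # no criteria: keep everything
    end
  end
end

names = ["Name", "Email", "text-1", "temp_note"]
puts surviving_field_names(names, remove_pattern: /^text-/).inspect
puts surviving_field_names(names, keep_fields: ["Name"]).inspect
```

Everything not returned by such a selection would then be excluded from the rewritten file, together with its widgets.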
111
+ ## Key Differences from `remove_field`
112
+
113
+ | Feature | `remove_field` | `clear` |
114
+ |---------|---------------|---------|
115
+ | Update Type | Incremental update | Complete rewrite |
116
+ | Object Removal | Marks as deleted | Completely excluded |
117
+ | PDF Structure | Preserves all objects | Only includes selected objects |
118
+ | Use Case | Remove one/a few fields | Remove many fields or clean up |
119
+ | Performance | Fast (append only) | Slower (full rewrite) |
120
+
121
+ ## Best Practices
122
+
123
+ 1. **Use `clear` when removing many fields**: If you need to remove a large number of fields, a single `clear` pass produces cleaner output than repeated `remove_field` calls, although the full rewrite is slower than one incremental update.
124
+
125
+ 2. **Consider flattening after clearing**: Since `clear` rewrites the PDF, writing with `write(..., flatten: true)` helps ensure compatibility with PDF viewers:
126
+
127
+ ```ruby
128
+ doc.clear!(remove_pattern: /^text-/)
129
+ doc.write("output.pdf", flatten: true)
130
+ ```
131
+
132
+ 3. **Combine with field addition**: After clearing, you can add new fields:
133
+
134
+ ```ruby
135
+ doc.clear!(remove_pattern: /.*/)
136
+ doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1)
137
+ doc.write("output.pdf", flatten: true)
138
+ ```
139
+
140
+ 4. **Use patterns for generated fields**: If you have fields with predictable naming patterns (e.g., UUID-based names), use regex patterns:
141
+
142
+ ```ruby
143
+ # Remove all UUID-like fields
144
+ doc.clear!(remove_pattern: /^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/)
145
+
146
+ # Remove all fields containing "temp" or "test"
147
+ doc.clear!(remove_pattern: /temp|test/i)
148
+ ```
149
+
150
+ ## Technical Details
151
+
152
+ ### Orphaned Widget Removal
153
+
154
+ The `clear` method automatically identifies and removes orphaned widget references:
155
+
156
+ - **Non-existent widgets**: Widget references in `/Annots` arrays that point to objects that don't exist
157
+ - **Orphaned widgets**: Widgets that reference parent fields that don't exist in the cleaned PDF
158
+
159
+ This ensures that page annotation arrays don't contain invalid references that could confuse PDF viewers.
160
+
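The two orphan checks above can be illustrated with a small filter. This is only a sketch under an assumed data model: `prune_annots` is a hypothetical name, and representing objects as a hash keyed by `[num, gen]` pairs with a `:parent` entry is an assumption made for the example, not how the library stores them.

```ruby
# Hypothetical data model: objects maps [num, gen] refs to simplified widget/field hashes.
def prune_annots(annot_refs, objects)
  annot_refs.select do |ref|
    widget = objects[ref]
    next false if widget.nil?               # reference points at a missing object

    parent = widget[:parent]
    parent.nil? || objects.key?(parent)     # drop widgets whose parent field is gone
  end
end

objects = {
  [2, 0] => { type: :field },
  [5, 0] => { type: :widget, parent: [2, 0] },
  [6, 0] => { type: :widget, parent: [9, 0] } # parent 9 0 was removed
}
annots = [[5, 0], [6, 0], [7, 0]]             # 7 0 does not exist at all
puts prune_annots(annots, objects).inspect
```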
161
+ ### Page Detection
162
+
163
+ The method correctly identifies actual page objects (`/Type /Page`) and avoids matching page container objects (`/Type /Pages`), ensuring widgets are properly associated with the correct page.
164
+
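One way to tell `/Type /Page` apart from `/Type /Pages` in dictionary text is a negative lookahead. The constant and method names below are illustrative assumptions, not identifiers from the gem.

```ruby
# Matches "/Type /Page" but not "/Type /Pages" (or any longer name starting with "Page").
PAGE_TYPE_PATTERN = %r{/Type\s*/Page(?![A-Za-z])}

def page_object?(dict_src)
  dict_src.match?(PAGE_TYPE_PATTERN)
end

puts page_object?("<< /Type /Page /Parent 2 0 R >>")   # a real page
puts page_object?("<< /Type /Pages /Kids [3 0 R] >>")  # a page tree node
```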
165
+ ### AcroForm Structure
166
+
167
+ The method properly handles both:
168
+ - **Inline `/Fields` arrays**: Arrays directly in the AcroForm dictionary
169
+ - **Indirect `/Fields` arrays**: Arrays referenced as separate objects
170
+
171
+ Both are updated to remove references to deleted fields.
172
+
173
+ ## Example: Complete Clearing Workflow
174
+
175
+ ```ruby
176
+ require 'corp_pdf'
177
+
178
+ # Load PDF with many unwanted fields
179
+ doc = CorpPdf::Document.new("messy_form.pdf")
180
+
181
+ # Remove all generated/UUID-like fields (the block returns true to keep a field)
182
+ doc.clear! { |name|
183
+   # Keep only fields that do not look generated or temporary
184
+   !(name.match?(/^[a-f0-9-]{30,}/) ||   # UUID-like
185
+     name.start_with?("temp_") ||        # Temporary
186
+     name.empty?)                        # Empty name
187
+ }
188
+
189
+ # Add new fields
190
+ doc.add_field("Name", value: "", x: 100, y: 700, width: 200, height: 20, page: 1, type: :text)
191
+ doc.add_field("Email", value: "", x: 100, y: 670, width: 200, height: 20, page: 1, type: :text)
192
+
193
+ # Write cleared and updated PDF
194
+ doc.write("cleared_form.pdf", flatten: true)
195
+ ```
196
+
197
+ ## See Also
198
+
199
+ - [`flatten` and `flatten!`](../README.md#flattening-pdfs) - Similar rewrite approach for removing incremental updates
200
+ - [`remove_field`](../README.md#remove_field) - Incremental removal of single fields
201
+ - [Main README](../README.md) - General usage and API reference
202
+
@@ -0,0 +1,341 @@
1
+ # DictScan Explained: Text Traversal in Action
2
+
3
+ ## The Big Picture
4
+
5
+ `DictScan` is a module that appears complicated at first glance, but it's fundamentally just **text traversal**—walking through PDF files character by character to find and extract dictionary structures.
6
+
7
+ This document explains how each function in `DictScan` works and why the text-traversal approach is both powerful and straightforward.
8
+
9
+ ## Core Principle
10
+
11
+ **PDF dictionaries are text patterns.** They use `<<` and `>>` as delimiters, just like how programming languages use `{` and `}` or `[` and `]`. Once you recognize this, parsing becomes a matter of tracking depth and matching delimiters.
12
+
13
+ ## Function-by-Function Guide
14
+
15
+ ### `strip_stream_bodies(pdf)`
16
+
17
+ **Purpose:** Remove binary stream data that would confuse text parsing.
18
+
19
+ **How it works:**
20
+ - Finds all `stream...endstream` blocks using regex
21
+ - Replaces the binary content with a placeholder
22
+ - Preserves the stream structure markers
23
+
24
+ **Why:** Streams can contain arbitrary binary data (compressed images, fonts, etc.) that would break our text-based parsing. We strip them out since we're only interested in the dictionary structure.
25
+
26
+ ```ruby
27
+ pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
28
+ ```
29
+
30
+ This is regex-based, but it's necessary preprocessing before we can safely do text traversal.
31
+
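To see why this preprocessing matters, the snippet below wraps the expression above in a method (the `_sketch` name is ours) and runs it on a made-up object whose stream body happens to contain `>>`. Without stripping, that stray `>>` would throw off any delimiter counting.

```ruby
# Wraps the regex shown above; the method name is illustrative.
def strip_stream_bodies_sketch(pdf)
  pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
end

# A made-up object: the binary body contains ">>", which is not a dictionary close.
sample = "1 0 obj\n<< /Length 5 >>\nstream\n\x00\x01>>\xFF\nendstream\nendobj\n".b
stripped = strip_stream_bodies_sketch(sample)

puts stripped.scan(">>").length   # only the real dictionary close remains
```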
32
+ ---
33
+
34
+ ### `each_dictionary(str)`
35
+
36
+ **Purpose:** Iterate through all dictionaries in a string.
37
+
38
+ **Algorithm:**
39
+ 1. Find the first `<<` at position `i`
40
+ 2. Initialize depth counter to 0
41
+ 3. Scan forward:
42
+ - If we see `<<`, increment depth
43
+ - If we see `>>`, decrement depth
44
+ - If depth reaches 0, we've found a complete dictionary
45
+ 4. Yield the dictionary substring
46
+ 5. Continue from where we left off
47
+
48
+ **Example:**
49
+ ```
50
+ Input: "<< /A 1 >> << /B 2 >>"
51
+ i=0: find "<<"
52
+ depth=1, scan forward
53
+ see ">>", depth=0 → found "<< /A 1 >>"
54
+ yield and continue from i=11
55
+ i=11: find "<<"
56
+ depth=1, scan forward
57
+ see ">>", depth=0 → found "<< /B 2 >>"
58
+ yield and continue
59
+ ```
60
+
61
+ **Why it works:** This is classic **bracket matching**. No PDF-specific knowledge needed—just counting delimiters.
62
+
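The algorithm above can be sketched as follows. This is a minimal illustration of the depth-counting idea rather than the library's implementation: the `_sketch` name is ours, it yields only top-level dictionaries, and it assumes stream bodies were already stripped and that no string contains a stray `<<` or `>>`.

```ruby
# Yields each top-level "<< ... >>" span in str, using a depth counter.
def each_dictionary_sketch(str)
  i = 0
  while (start = str.index("<<", i))
    depth = 0
    j = start
    finish = nil
    while j < str.length - 1
      case str[j, 2]
      when "<<"
        depth += 1
        j += 2
      when ">>"
        depth -= 1
        j += 2
        if depth.zero?
          finish = j          # index just past the matching ">>"
          break
        end
      else
        j += 1
      end
    end
    break if finish.nil?      # unbalanced input: stop rather than guess

    yield str[start...finish]
    i = finish                # continue from where we left off
  end
end

found = []
each_dictionary_sketch("<< /A 1 >> << /B << /C 2 >> >>") { |dict| found << dict }
puts found.inspect
```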
63
+ ---
64
+
65
+ ### `unescape_literal(s)`
66
+
67
+ **Purpose:** Decode PDF escape sequences in literal strings.
68
+
69
+ **PDF escapes:**
70
+ - `\n` → newline
71
+ - `\r` → carriage return
72
+ - `\t` → tab
73
+ - `\b` → backspace
74
+ - `\f` → form feed
75
+ - `\(` → literal `(`
76
+ - `\)` → literal `)`
77
+ - `\123` → octal character (up to 3 digits)
78
+
79
+ **Algorithm:** Character-by-character scan:
80
+ 1. If we see `\`, look ahead one character
81
+ 2. Map escape sequences to actual characters
82
+ 3. Handle octal sequences (1-3 digits)
83
+ 4. Otherwise, copy character as-is
84
+
85
+ **Why it works:** This is standard escape sequence handling, identical to how many programming languages handle string literals.
86
+
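A character-by-character decoder along these lines might look like the sketch below. The names are illustrative, the `\\` (literal backslash) entry comes from the PDF specification rather than the list above, and line continuations (a backslash followed by a newline) are deliberately left out.

```ruby
# Illustrative escape table; "\\" => "\\" maps an escaped backslash to one backslash.
SIMPLE_ESCAPES_SKETCH = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f",
  "(" => "(", ")" => ")", "\\" => "\\"
}.freeze

def unescape_literal_sketch(s)
  src = s.b                 # work on raw bytes
  out = String.new          # binary buffer
  i = 0
  while i < src.length
    ch = src[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end
    nxt = src[i + 1]
    if nxt.nil?
      i += 1                                        # trailing lone backslash: drop it
    elsif SIMPLE_ESCAPES_SKETCH.key?(nxt)
      out << SIMPLE_ESCAPES_SKETCH[nxt]
      i += 2
    elsif nxt =~ /[0-7]/
      digits = src[i + 1, 3][/\A[0-7]{1,3}/]        # up to three octal digits
      out << (digits.to_i(8) & 0xFF).chr
      i += 1 + digits.length
    else
      out << nxt                                    # unknown escape: keep the character
      i += 2
    end
  end
  out
end

puts unescape_literal_sketch('a\(b\) \101\102')
```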
87
+ ---
88
+
89
+ ### `decode_pdf_string(token)`
90
+
91
+ **Purpose:** Decode a PDF string token into a Ruby string.
92
+
93
+ **PDF string types:**
94
+ 1. **Literal:** `(Hello World)` or `(Hello\nWorld)`
95
+ 2. **Hex:** `<48656C6C6F>` or `<FEFF00480065006C006C006F>`
96
+
97
+ **Algorithm:**
98
+ 1. Check if token starts with `(` → literal string
99
+ - Extract content between parentheses
100
+ - Unescape using `unescape_literal`
101
+ - Check for UTF-16BE BOM (`FE FF`)
102
+ - Decode accordingly
103
+ 2. Check if token starts with `<` → hex string
104
+ - Remove spaces, pad if odd length
105
+ - Convert hex to bytes
106
+ - Check for UTF-16BE BOM
107
+ - Decode accordingly
108
+ 3. Otherwise, return as-is (name, number, reference, etc.)
109
+
110
+ **Why it works:** PDF strings have well-defined formats. We just pattern-match on the delimiters and decode accordingly.
111
+
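The dispatch on the first character can be sketched like this. It is a simplified illustration with assumed names: the literal branch skips the unescape step (so it only suits strings without backslashes), and bytes without a BOM are treated as UTF-8 with a Latin-1 fallback, which is our simplification of the PDF text encoding rules.

```ruby
# Decodes raw bytes, honouring a UTF-16BE byte order mark (FE FF) if present.
def decode_bytes_sketch(bytes)
  if bytes.start_with?("\xFE\xFF".b)
    bytes[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
  else
    text = bytes.dup.force_encoding("UTF-8")
    text.valid_encoding? ? text : bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
  end
end

def decode_pdf_string_sketch(token)
  if token.start_with?("(") && token.end_with?(")")
    # Escape handling omitted here; assumes the content has no backslashes.
    decode_bytes_sketch(token[1...-1].b)
  elsif token.start_with?("<") && !token.start_with?("<<") && token.end_with?(">")
    hex = token[1...-1].gsub(/\s+/, "")
    hex << "0" if hex.length.odd?            # pad an odd number of digits
    decode_bytes_sketch([hex].pack("H*"))    # hex digits to bytes
  else
    token                                    # name, number, reference, etc.
  end
end

puts decode_pdf_string_sketch("<48656C6C6F>")
puts decode_pdf_string_sketch("<FEFF00480065006C006C006F>")
```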
112
+ ---
113
+
114
+ ### `encode_pdf_string(val)`
115
+
116
+ **Purpose:** Encode a Ruby value into a PDF string token.
117
+
118
+ **Handles:**
119
+ - `true` → `"true"`
120
+ - `false` → `"false"`
121
+ - `Symbol` → `"/symbol_name"`
122
+ - `String`:
123
+ - ASCII-only → literal string `(value)`
124
+ - Non-ASCII → hex string with UTF-16BE encoding
125
+
126
+ **Why it works:** Reverse of `decode_pdf_string`—we know the target format and encode accordingly.
127
+
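A sketch of the encoding direction, under stated assumptions: the method name is ours, escaping of `\`, `(` and `)` in the literal branch and the `FEFF` prefix on the hex branch are not spelled out in the list above and are added here because the decoder described earlier expects them.

```ruby
def encode_pdf_string_sketch(val)
  case val
  when true then "true"
  when false then "false"
  when Symbol then "/#{val}"
  when String
    if val.ascii_only?
      # Assumption: backslashes and parentheses are escaped in literal strings.
      escaped = val.gsub(/([\\()])/) { "\\#{$1}" }
      "(#{escaped})"
    else
      # Assumption: the hex form carries a UTF-16BE byte order mark (FEFF).
      utf16 = val.encode("UTF-16BE")
      "<FEFF#{utf16.unpack1('H*').upcase}>"
    end
  else
    val.to_s
  end
end

puts encode_pdf_string_sketch("Hello")
puts encode_pdf_string_sketch("\u00E9")
puts encode_pdf_string_sketch(:Yes)
```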
128
+ ---
129
+
130
+ ### `value_token_after(key, dict_src)`
131
+
132
+ **Purpose:** Extract the value token that follows a key in a dictionary.
133
+
134
+ **This is the heart of text traversal.** Here's how it works:
135
+
136
+ 1. **Find the key:**
137
+ ```ruby
138
+ match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
139
+ ```
140
+ Use regex to ensure the key is followed by a delimiter (whitespace, `(`, `<`, `[`, or `/`). This prevents partial matches.
141
+
142
+ 2. **Skip whitespace:**
143
+ ```ruby
144
+ i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
145
+ ```
146
+
147
+ 3. **Switch on the next character:**
148
+ - **`(` → Literal string:**
149
+ - Track depth of parentheses
150
+ - Handle escaped characters (skip `\` and next char)
151
+ - Match closing `)` when depth returns to 0
152
+ - **`<` → Hex string or dictionary:**
153
+ - If `<<` → return `"<<"` (nested dictionary marker)
154
+ - Otherwise, find matching `>`
155
+ - **`[` → Array:**
156
+ - Track depth of brackets
157
+ - Match closing `]` when depth returns to 0
158
+ - **`/` → PDF name:**
159
+ - Extract until whitespace or delimiter
160
+ - **Otherwise → Atom:**
161
+ - Extract until whitespace or delimiter (number, reference, boolean, etc.)
162
+
163
+ **Why it works:** PDF has well-defined token syntax. Each value type has distinct delimiters, so we can pattern-match on the first character and extract accordingly.
164
+
165
+ **Example:**
166
+ ```
167
+ Dict: "<< /V (Hello) /R [1 2 3] >>"
168
+ value_token_after("/V", dict):
169
+ → Finds "/V" at position 3
170
+ → Skips space
171
+ → Next char is "("
172
+ → Extracts "(Hello)" using paren matching
173
+ → Returns "(Hello)"
174
+ ```
175
+
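Putting the three steps together, a compact version could look like the sketch below. It is an illustrative re-implementation, not the library's code: the `_sketch` and `balanced_span` names are ours, and matching a whole reference such as `5 0 R` before falling back to a bare atom is an assumption about how references are kept intact.

```ruby
# Returns the src[start..] span from an opening delimiter to its matching close.
def balanced_span(src, start, open, close, escapes:)
  depth = 0
  j = start
  while j < src.length
    ch = src[j]
    if escapes && ch == "\\"
      j += 2                      # skip the backslash and the escaped character
      next
    end
    depth += 1 if ch == open
    depth -= 1 if ch == close
    j += 1
    return src[start...j] if depth.zero?
  end
  nil                             # unbalanced input
end

def value_token_after_sketch(key, dict_src)
  match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
  return nil unless match

  i = match.end(0)
  i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
  return nil if i >= dict_src.length

  case dict_src[i]
  when "(" then balanced_span(dict_src, i, "(", ")", escapes: true)
  when "[" then balanced_span(dict_src, i, "[", "]", escapes: false)
  when "<"
    return "<<" if dict_src[i, 2] == "<<"   # nested dictionary marker

    close = dict_src.index(">", i)
    close && dict_src[i..close]
  else
    rest = dict_src[i..-1]
    # Assumption: a reference ("5 0 R") is matched whole before a bare name or atom.
    rest[/\A\d+\s+\d+\s+R\b/] || rest[%r{\A/?[^\s<>\[\]()/]+}]
  end
end

puts value_token_after_sketch("/V", "<< /V (Hello) /R [1 2 3] >>")
puts value_token_after_sketch("/R", "<< /V (Hello) /R [1 2 3] >>")
```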
176
+ ---
177
+
178
+ ### `replace_key_value(dict_src, key, new_token)`
179
+
180
+ **Purpose:** Replace a key's value in a dictionary string.
181
+
182
+ **Algorithm:**
183
+ 1. Find the key using pattern matching
184
+ 2. Extract the existing value token using `value_token_after`
185
+ 3. Find exact byte positions:
186
+ - Key start/end
187
+ - Value start/end
188
+ 4. Replace using string slicing: `before + new_token + after`
189
+ 5. Verify dictionary is still valid (contains `<<` and `>>`)
190
+
191
+ **Why it works:** We use **precise byte positions** rather than regex replacement. This:
192
+ - Preserves exact formatting (whitespace, etc.)
193
+ - Avoids regex edge cases
194
+ - Is deterministic and safe
195
+
196
+ **Example:**
197
+ ```
198
+ Input: "<< /V (Old) /X 1 >>"
199
+ before: "<< /V "
200
+ value: "(Old)"
201
+ after: " /X 1 >>"
202
+ Output: "<< /V (New) /X 1 >>"
203
+ ```
204
+
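The slicing step can be shown with a deliberately narrow sketch. Here the value is assumed to be a literal string without nested or escaped parentheses, so a plain search for `)` stands in for `value_token_after`; the method name is ours, and Ruby string indices are used where the text above speaks of byte positions.

```ruby
def replace_key_value_sketch(dict_src, key, new_token)
  key_match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
  return dict_src unless key_match

  value_start = key_match.end(0)
  value_start += 1 while value_start < dict_src.length && dict_src[value_start] =~ /\s/

  # Simplification: only simple "(...)" values are handled in this sketch.
  return dict_src unless dict_src[value_start] == "("

  value_end = dict_src.index(")", value_start)
  return dict_src unless value_end

  before = dict_src[0...value_start]
  after = dict_src[(value_end + 1)..-1]
  result = before + new_token + after

  # Sanity check: the result must still look like a dictionary.
  result.include?("<<") && result.include?(">>") ? result : dict_src
end

puts replace_key_value_sketch("<< /V (Old) /X 1 >>", "/V", "(New)")
```

Because only the value span is swapped, the whitespace around the key and the remaining entries come through untouched.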
205
+ ---
206
+
207
+ ### `upsert_key_value(dict_src, key, token)`
208
+
209
+ **Purpose:** Insert a key-value pair if the key doesn't exist.
210
+
211
+ **Algorithm:**
212
+ - If key not found, insert after the opening `<<`
213
+ - Uses simple string substitution: `"<<#{key} #{token}"`
214
+
215
+ **Why it works:** Simple string manipulation when we know the key doesn't exist.
216
+
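The insert branch is short enough to show in full. In this sketch (the name is ours) an existing key leaves the input unchanged; how the library treats that case is not described above, so it is left out rather than guessed.

```ruby
def upsert_key_value_sketch(dict_src, key, token)
  # Existing key: not covered by this sketch, return the input unchanged.
  return dict_src if dict_src =~ %r{#{Regexp.escape(key)}(?=[\s(<\[/])}

  # Block form avoids backslash interpretation in the replacement text.
  dict_src.sub("<<") { "<<#{key} #{token}" }
end

puts upsert_key_value_sketch("<< /T (Name) >>", "/V", "(Hello)")
```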
217
+ ---
218
+
219
+ ### `remove_ref_from_array(array_body, ref)` and `add_ref_to_array(array_body, ref)`
220
+
221
+ **Purpose:** Manipulate object references in PDF arrays.
222
+
223
+ **Algorithm:**
224
+ - Use regex/gsub to find and replace reference patterns: `"5 0 R"`
225
+ - Handle edge cases: empty arrays, spacing
226
+
227
+ **Why it works:** Object references have a fixed format (`num gen R`), so we can pattern-match and replace.
228
+
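A regex-based sketch of both operations follows. The method names and the choice to pass `ref` as a `[num, gen]` pair are assumptions for illustration; the lookbehind keeps `5 0 R` from matching inside `15 0 R`.

```ruby
# ref is assumed to be a [object_number, generation] pair, e.g. [5, 0].
def ref_pattern_sketch(ref)
  num, gen = ref
  /(?<!\d)#{num}\s+#{gen}\s+R\b/
end

def remove_ref_from_array_sketch(array_body, ref)
  array_body
    .gsub(ref_pattern_sketch(ref), "")
    .gsub(/\[\s+/, "[")       # tidy space after the opening bracket
    .gsub(/\s+\]/, "]")       # tidy space before the closing bracket
    .gsub(/\s{2,}/, " ")      # collapse the gap left behind
end

def add_ref_to_array_sketch(array_body, ref)
  return array_body if array_body =~ ref_pattern_sketch(ref)   # already present

  num, gen = ref
  inner = array_body.strip.sub(/\A\[/, "").sub(/\]\z/, "").strip
  token = "#{num} #{gen} R"
  inner.empty? ? "[#{token}]" : "[#{inner} #{token}]"
end

puts remove_ref_from_array_sketch("[4 0 R 5 0 R 15 0 R]", [5, 0])
puts add_ref_to_array_sketch("[4 0 R]", [7, 0])
```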
229
+ ---
230
+
231
+ ## Why This Approach Works
232
+
233
+ ### 1. PDF Structure is Text-Based
234
+
235
+ PDF dictionaries, arrays, strings, and references are all defined using text syntax. No binary parsing needed for structure.
236
+
237
+ ### 2. Delimiters Are Unique
238
+
239
+ Each PDF value type has distinct delimiters:
240
+ - Dictionaries: `<<` `>>`
241
+ - Arrays: `[` `]`
242
+ - Literal strings: `(` `)`
243
+ - Hex strings: `<` `>`
244
+ - Names: `/`
245
+ - References: `R`
246
+
247
+ We can pattern-match on these to extract values.
248
+
249
+ ### 3. Depth Tracking is Simple
250
+
251
+ Nested structures (dictionaries, arrays, strings) can be parsed by tracking depth: increment on open, decrement on close. This is the same bracket-matching technique used by the functions above.
252
+
253
+ ### 4. Position-Based Replacement is Safe
254
+
255
+ When modifying dictionaries, we use exact byte positions rather than regex replacement. This:
256
+ - Preserves formatting
257
+ - Avoids edge cases
258
+ - Is predictable
259
+
260
+ ### 5. No Full Parser Needed
261
+
262
+ We don't need to:
263
+ - Build an AST
264
+ - Validate the entire PDF structure
265
+ - Handle all PDF features
266
+
267
+ We only need to:
268
+ - Find dictionaries
269
+ - Extract values
270
+ - Replace values
271
+ - Preserve structure
272
+
273
+ This is a **minimal parser** that does exactly what we need.
274
+
275
+ ## Common Patterns
276
+
277
+ ### Pattern 1: Find and Extract
278
+
279
+ ```ruby
280
+ # Find all dictionaries
281
+ each_dictionary(pdf_text) do |dict|
282
+ # Extract a value
283
+ value_token = value_token_after("/V", dict)
284
+ value = decode_pdf_string(value_token)
285
+ puts value
286
+ end
287
+ ```
288
+
289
+ ### Pattern 2: Find and Replace
290
+
291
+ ```ruby
292
+ # Get dictionary
293
+ dict = "<< /V (Old) >>"
294
+
295
+ # Replace value
296
+ new_dict = replace_key_value(dict, "/V", "(New)")
297
+
298
+ # Result: "<< /V (New) >>"
299
+ ```
300
+
301
+ ### Pattern 3: Encode and Insert
302
+
303
+ ```ruby
304
+ # Prepare new value
305
+ token = encode_pdf_string("Hello")
306
+
307
+ # Insert into dictionary
308
+ dict = upsert_key_value(dict, "/V", token)
309
+ ```
310
+
311
+ ## Performance Considerations
312
+
313
+ **Why this is fast:**
314
+ - No AST construction
315
+ - No full PDF parsing
316
+ - Direct string manipulation
317
+ - Minimal memory allocation
318
+
319
+ **Trade-offs:**
320
+ - Doesn't validate entire PDF structure
321
+ - Assumes dictionaries are well-formed
322
+ - Stream stripping is regex-based (could be optimized)
323
+
324
+ ## Conclusion
325
+
326
+ `DictScan` appears complicated because it handles many edge cases and value types, but the **core approach is elegantly simple**:
327
+
328
+ 1. PDF dictionaries are text patterns
329
+ 2. Parse them with character-by-character traversal
330
+ 3. Track depth for nested structures
331
+ 4. Use precise positions for replacement
332
+
333
+ No magic, no complex parsers—just careful text traversal with attention to PDF syntax rules.
334
+
335
+ The complexity you see is:
336
+ - **Edge case handling** (escaping, nesting, encoding)
337
+ - **Safety checks** (verification, error handling)
338
+ - **Support for multiple value types** (strings, arrays, dictionaries, references)
339
+
340
+ But the **fundamental algorithm** is straightforward: find delimiters, track depth, extract substrings, replace substrings.
341
+