corp_pdf 1.0.5

data/README.md ADDED
@@ -0,0 +1,518 @@
1
+ # CorpPdf
2
+
3
+ A minimal pure Ruby library for parsing and editing PDF AcroForm fields.
4
+
5
+ ## Features
6
+
7
+ - ✅ **Pure Ruby** - Minimal dependencies (`chunky_png` for PNG image processing; the gemspec also declares `i18n`)
8
+ - ✅ **StringIO Only** - Works entirely in memory, no temp files
9
+ - ✅ **PDF AcroForm Support** - Parse, list, add, remove, and modify form fields
10
+ - ✅ **Signature Field Images** - Add image appearances to signature fields (JPEG and PNG support)
11
+ - ✅ **Minimal PDF Engine** - Basic PDF parser/writer for AcroForm manipulation
12
+ - ✅ **Ruby 3.1+** - Modern Ruby support
13
+
14
+ ## Installation
15
+
16
+ Add this line to your application's Gemfile:
17
+
18
+ ```ruby
19
+ gem 'corp_pdf'
20
+ ```
21
+
22
+ And then execute:
23
+
24
+ ```bash
25
+ bundle install
26
+ ```
27
+
28
+ Or install it directly:
29
+
30
+ ```bash
31
+ gem install corp_pdf
32
+ ```
33
+
34
+ ## Usage
35
+
36
+ ### Basic Usage
37
+
38
+ ```ruby
39
+ require 'corp_pdf'
40
+
41
+ # Create a document from a file path
42
+ doc = CorpPdf::Document.new("form.pdf")
43
+
44
+ # Or from StringIO
45
+ require 'stringio'
46
+ pdf_data = File.binread("form.pdf")
47
+ io = StringIO.new(pdf_data)
48
+ doc = CorpPdf::Document.new(io)
49
+
50
+ # List all form fields
51
+ fields = doc.list_fields
52
+ fields.each do |field|
53
+ type_info = field.type_key ? "#{field.type} (:#{field.type_key})" : field.type
54
+ puts "#{field.name} (#{type_info}) = #{field.value}"
55
+ end
56
+
57
+ # Add a new field (using symbol key for type)
58
+ new_field = doc.add_field("NameField",
59
+ value: "John Doe",
60
+ x: 100,
61
+ y: 500,
62
+ width: 200,
63
+ height: 20,
64
+ page: 1,
65
+ type: :text # Optional: :text, :button, :choice, :signature (or "/Tx", "/Btn", etc.)
66
+ )
67
+
68
+ # Or using the PDF type string directly
69
+ button_field = doc.add_field("CheckBox",
70
+ type: "/Btn", # Or use :button symbol
71
+ x: 100,
72
+ y: 600,
73
+ width: 20,
74
+ height: 20,
75
+ page: 1
76
+ )
77
+
78
+ # Update a field value
79
+ doc.update_field("ExistingField", "New Value")
80
+
81
+ # Rename a field while updating it
82
+ doc.update_field("OldName", "New Value", new_name: "NewName")
83
+
84
+ # Remove a field
85
+ doc.remove_field("FieldToRemove")
86
+
87
+ # Write the modified PDF to a file
88
+ doc.write("output.pdf")
89
+
90
+ # Or write with flattening (removes incremental updates)
91
+ doc.write("output.pdf", flatten: true)
92
+
93
+ # Or get PDF bytes as a String (returns String, not StringIO)
94
+ pdf_bytes = doc.write
95
+ File.binwrite("output.pdf", pdf_bytes)
96
+ ```
97
+
98
+ ### Advanced Usage
99
+
100
+ #### Working with Field Objects
101
+
102
+ ```ruby
103
+ doc = CorpPdf::Document.new("form.pdf")
104
+ fields = doc.list_fields
105
+
106
+ # Access field properties
107
+ field = fields.first
108
+ puts field.name # Field name
109
+ puts field.value # Field value
110
+ puts field.type # Field type (e.g., "/Tx")
111
+ puts field.type_key # Symbol key (e.g., :text) or nil if not mapped
112
+ puts field.x # X position
113
+ puts field.y # Y position
114
+ puts field.width # Width
115
+ puts field.height # Height
116
+ puts field.page # Page number
117
+
118
+ # Fields default to "/Tx" if type is missing from PDF
119
+
120
+ # Update a field directly
121
+ field.update("New Value")
122
+
123
+ # Update and rename a field
124
+ field.update("New Value", new_name: "NewName")
125
+
126
+ # Remove a field directly
127
+ field.remove
128
+
129
+ # Check field type
130
+ field.text_field? # true for text fields
131
+ field.button_field? # true for button/checkbox fields
132
+ field.choice_field? # true for choice/dropdown fields
133
+ field.signature_field? # true for signature fields
134
+
135
+ # Check if field has a value
136
+ field.has_value?
137
+
138
+ # Check if field has position information
139
+ field.has_position?
140
+ ```
141
+
142
+ #### Signature Fields with Image Appearances
143
+
144
+ Signature fields can be enhanced with image appearances (signature images). When you update a signature field with image data (base64-encoded JPEG or PNG), CorpPdf will automatically add the image as the field's appearance.
145
+
146
+ ```ruby
147
+ doc = CorpPdf::Document.new("form.pdf")
148
+
149
+ # Add a signature field
150
+ sig_field = doc.add_field("MySignature",
151
+ type: :signature,
152
+ x: 100,
153
+ y: 500,
154
+ width: 200,
155
+ height: 100,
156
+ page: 1
157
+ )
158
+
159
+ # Update signature field with base64-encoded image data (Base64 needs `require 'base64'`)
160
+ # JPEG example:
161
+ jpeg_base64 = Base64.encode64(File.binread("signature.jpg")).strip
162
+ doc.update_field("MySignature", jpeg_base64)
163
+
164
+ # PNG example (requires chunky_png gem):
165
+ png_base64 = Base64.encode64(File.binread("signature.png")).strip
166
+ doc.update_field("MySignature", png_base64)
167
+
168
+ # Or using data URI format:
169
+ data_uri = "data:image/png;base64,#{png_base64}"
170
+ doc.update_field("MySignature", data_uri)
171
+
172
+ # Write the PDF with the signature appearance
173
+ doc.write("form_with_signature.pdf")
174
+ ```
175
+
176
+ **Note**: PNG image processing requires the `chunky_png` gem, which is included as a dependency. JPEG images can be processed without any additional dependencies.
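The README says a signature value is treated as an image when it "looks like image data". As a rough illustration of what such a check could involve, the sketch below strips an optional data URI prefix, decodes the base64 payload, and inspects the leading magic bytes. The helper name `image_format` and both constants are hypothetical and are not part of CorpPdf; the library's own detection may work differently.

```ruby
# Sketch only: guesses the image format of a base64 string or data URI
# from its leading magic bytes. Not CorpPdf's actual detection code.
PNG_MAGIC  = "\x89PNG\r\n\x1A\n".b
JPEG_MAGIC = "\xFF\xD8\xFF".b

def image_format(value)
  # Drop a leading "data:image/...;base64," prefix if one is present.
  payload = value.sub(%r{\Adata:image/[a-z+]+;base64,}i, "")
  bytes = payload.unpack1("m") # lenient base64 decode from core Ruby
  return :png if bytes.start_with?(PNG_MAGIC)
  return :jpeg if bytes.start_with?(JPEG_MAGIC)

  nil
end

# Build tiny fake payloads that only contain the magic bytes.
png_b64  = [PNG_MAGIC + "rest".b].pack("m0")
jpeg_b64 = [JPEG_MAGIC + "\xE0rest".b].pack("m0")

p image_format("data:image/png;base64,#{png_b64}")
p image_format(jpeg_b64)
p image_format("John Doe")
```

Using `pack("m0")` and `unpack1("m")` keeps the sketch free of any extra `require`; `Base64.strict_encode64` would serve equally well for encoding.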
177
+
178
+ #### Radio Buttons
179
+
180
+ Radio buttons allow users to select a single option from a group of mutually exclusive choices. Radio buttons in CorpPdf are created using the `:radio` type and require a `group_id` to group related buttons together.
181
+
182
+ ```ruby
183
+ doc = CorpPdf::Document.new("form.pdf")
184
+
185
+ # Create a radio button group with multiple options
186
+ # All buttons in the same group must share the same group_id
187
+
188
+ # First radio button in the group (creates the parent field)
189
+ doc.add_field("Option1",
190
+ type: :radio,
191
+ group_id: "my_radio_group",
192
+ value: "option1", # Export value for this button
193
+ x: 100,
194
+ y: 500,
195
+ width: 20,
196
+ height: 20,
197
+ page: 1,
198
+ selected: true # This button will be selected by default
199
+ )
200
+
201
+ # Second radio button in the same group
202
+ doc.add_field("Option2",
203
+ type: :radio,
204
+ group_id: "my_radio_group", # Same group_id as above
205
+ value: "option2",
206
+ x: 100,
207
+ y: 470,
208
+ width: 20,
209
+ height: 20,
210
+ page: 1
211
+ )
212
+
213
+ # Third radio button in the same group
214
+ doc.add_field("Option3",
215
+ type: :radio,
216
+ group_id: "my_radio_group", # Same group_id
217
+ value: "option3",
218
+ x: 100,
219
+ y: 440,
220
+ width: 20,
221
+ height: 20,
222
+ page: 1
223
+ )
224
+
225
+ # Write the PDF with radio buttons
226
+ doc.write("form_with_radio.pdf")
227
+ ```
228
+
229
+ **Key Points:**
230
+ - **`group_id`**: Required. All radio buttons that should be mutually exclusive must share the same `group_id`. This can be any string or identifier.
231
+ - **`type: :radio`**: Required. Specifies that this is a radio button field.
232
+ - **`value`**: The export value for this specific button. This is what gets returned when the button is selected. If not provided, a unique value will be generated automatically.
233
+ - **`selected`**: Optional boolean (`true` or `false`, or string `"true"`). If set to `true`, this button will be selected by default. Only one button in a group should have `selected: true`. If not specified, the button defaults to unselected.
234
+ - **Positioning**: Each radio button needs its own `x`, `y`, `width`, `height`, and `page` values to position it on the form.
235
+
236
+ **Example with multiple groups:**
237
+
238
+ ```ruby
239
+ doc = CorpPdf::Document.new("form.pdf")
240
+
241
+ # First radio button group (e.g., "Gender")
242
+ doc.add_field("Male", type: :radio, group_id: "gender", value: "male", x: 100, y: 500, width: 20, height: 20, page: 1, selected: true)
243
+ doc.add_field("Female", type: :radio, group_id: "gender", value: "female", x: 100, y: 470, width: 20, height: 20, page: 1)
244
+ doc.add_field("Other", type: :radio, group_id: "gender", value: "other", x: 100, y: 440, width: 20, height: 20, page: 1)
245
+
246
+ # Second radio button group (e.g., "Age Range")
247
+ doc.add_field("18-25", type: :radio, group_id: "age", value: "18-25", x: 200, y: 500, width: 20, height: 20, page: 1)
248
+ doc.add_field("26-35", type: :radio, group_id: "age", value: "26-35", x: 200, y: 470, width: 20, height: 20, page: 1, selected: true)
249
+ doc.add_field("36+", type: :radio, group_id: "age", value: "36+", x: 200, y: 440, width: 20, height: 20, page: 1)
250
+
251
+ doc.write("form_with_multiple_groups.pdf")
252
+ ```
253
+
254
+ **Note:** Radio buttons are automatically configured with the correct PDF flags to enable mutual exclusivity within a group. When a user selects one radio button, all others in the same group are automatically deselected.
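Because only one button per group should be created with `selected: true`, it can help to check the option hashes before passing them to `add_field`. The helper below is a hypothetical pre-flight check written for this README, not a CorpPdf method; it works on plain hashes that use the same keys as the examples above.

```ruby
# Sketch only: returns the group_ids that have more than one button
# marked as selected (true or the string "true", as the README allows).
def groups_with_multiple_selected(buttons)
  buttons
    .group_by { |button| button[:group_id] }
    .select { |_group, members| members.count { |b| [true, "true"].include?(b[:selected]) } > 1 }
    .keys
end

buttons = [
  { name: "Male",   group_id: "gender", selected: true },
  { name: "Female", group_id: "gender", selected: "true" },
  { name: "18-25",  group_id: "age" },
  { name: "26-35",  group_id: "age", selected: true }
]

p groups_with_multiple_selected(buttons)
```

An empty result means every group has at most one default selection.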
255
+
256
+ #### Flattening PDFs
257
+
258
+ ```ruby
259
+ # Flatten a PDF to remove incremental updates
260
+ doc = CorpPdf::Document.new("form.pdf")
261
+ doc.flatten! # Modifies the document in-place
262
+
263
+ # Or write a flattened copy to a new file (returns the output path)
263
+ output_path = CorpPdf::Document.flatten_pdf("input.pdf", "output.pdf")
264
+
265
+ # Or get a new flattened Document instance (no output path given)
266
+ flattened_doc = CorpPdf::Document.flatten_pdf("input.pdf")
268
+ ```
269
+
270
+ #### Clearing Fields
271
+
272
+ The `clear` and `clear!` methods allow you to completely remove unwanted fields by rewriting the entire PDF:
273
+
274
+ ```ruby
275
+ doc = CorpPdf::Document.new("form.pdf")
276
+
277
+ # Remove all fields matching a pattern
278
+ doc.clear!(remove_pattern: /^text-/)
279
+
280
+ # Keep only specific fields
281
+ doc.clear!(keep_fields: ["Name", "Email"])
282
+
283
+ # Remove specific fields
284
+ doc.clear!(remove_fields: ["OldField1", "OldField2"])
285
+
286
+ # Use a block to determine which fields to keep
287
+ doc.clear! { |name| !name.start_with?("temp_") }
288
+
289
+ # Write the cleared PDF
290
+ doc.write("cleared.pdf", flatten: true)
291
+ ```
292
+
293
+ **Note:** Unlike `remove_field`, which uses incremental updates, `clear` completely rewrites the PDF to exclude unwanted fields. This is more efficient when removing many fields and ensures complete removal. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
294
+
295
+ ### API Reference
296
+
297
+ #### `CorpPdf::Document.new(path_or_io)`
298
+ Creates a PDF document from a file path (String) or StringIO object.
299
+
300
+ ```ruby
301
+ doc = CorpPdf::Document.new("path/to/file.pdf")
302
+ doc = CorpPdf::Document.new(StringIO.new(pdf_bytes))
303
+ ```
304
+
305
+ #### `#list_fields`
306
+ Returns an array of `Field` objects representing all form fields in the document.
307
+
308
+ ```ruby
309
+ fields = doc.list_fields
310
+ fields.each do |field|
311
+ puts field.name
312
+ end
313
+ ```
314
+
315
+ #### `#list_pages`
316
+ Returns an array of `Page` objects representing all pages in the document. Each `Page` object provides page information and methods to add fields to that specific page.
317
+
318
+ ```ruby
319
+ pages = doc.list_pages
320
+ pages.each do |page|
321
+ puts "Page #{page.page_number}: #{page.width}x#{page.height}"
322
+ end
323
+
324
+ # Add fields to specific pages - the page is automatically set!
325
+ first_page = pages[0]
326
+ first_page.add_field("Name", x: 100, y: 700, width: 200, height: 20)
327
+
328
+ second_page = pages[1]
329
+ second_page.add_field("Email", x: 100, y: 650, width: 200, height: 20)
330
+ ```
331
+
332
+ **Page Object Methods:**
333
+ - `page.page_number` - Returns the page number (1-indexed)
334
+ - `page.width` - Page width in points
335
+ - `page.height` - Page height in points
336
+ - `page.ref` - Page object reference `[obj_num, gen_num]`
337
+ - `page.metadata` - Hash containing page metadata (rotation, boxes, etc.)
338
+ - `page.add_field(name, options)` - Add a field to this page (page number is automatically set)
339
+ - `page.to_h` - Convert to hash for backward compatibility
340
+
341
+ #### `#add_field(name, options)`
342
+ Adds a new form field to the document. Options include:
343
+ - `value`: Default value for the field (String)
344
+ - `x`: X coordinate (Integer, default: 100)
345
+ - `y`: Y coordinate (Integer, default: 500)
346
+ - `width`: Field width (Integer, default: 100)
347
+ - `height`: Field height (Integer, default: 20)
348
+ - `page`: Page number to add the field to (Integer, default: 1)
349
+ - `type`: Field type (Symbol or String, default: `"/Tx"`). Options:
350
+ - Symbol keys: `:text`, `:button`, `:choice`, `:signature`, `:radio`
351
+ - PDF type strings: `"/Tx"`, `"/Btn"`, `"/Ch"`, `"/Sig"`
352
+ - `group_id`: Required for radio buttons. String or identifier to group radio buttons together. All radio buttons in the same group must share the same `group_id`.
353
+ - `selected`: Optional for radio buttons. Boolean (`true` or `false`, or string `"true"`). If set to `true`, this radio button will be selected by default.
354
+
355
+ Returns a `Field` object if successful.
356
+
357
+ ```ruby
358
+ # Using symbol keys (recommended)
359
+ field = doc.add_field("NewField", value: "Value", x: 100, y: 500, width: 200, height: 20, page: 1, type: :text)
360
+
361
+ # Using PDF type strings
362
+ field = doc.add_field("ButtonField", type: "/Btn", x: 100, y: 500, width: 20, height: 20, page: 1)
363
+
364
+ # Radio button example
365
+ field = doc.add_field("Option1", type: :radio, group_id: "my_group", value: "option1", x: 100, y: 500, width: 20, height: 20, page: 1, selected: true)
366
+ ```
367
+
368
+ #### `#update_field(name, new_value, new_name: nil)`
369
+ Updates a field's value and optionally renames it. For signature fields, if `new_value` looks like image data (base64-encoded JPEG/PNG or a data URI), it will automatically add the image as the field's appearance. Returns `true` if successful, `false` if field not found.
370
+
371
+ ```ruby
372
+ doc.update_field("FieldName", "New Value")
373
+ doc.update_field("OldName", "New Value", new_name: "NewName")
374
+
375
+ # For signature fields with images:
376
+ doc.update_field("SignatureField", base64_image_data) # Base64-encoded JPEG or PNG
377
+ doc.update_field("SignatureField", "data:image/png;base64,...") # Data URI format
378
+ ```
379
+
380
+ #### `#remove_field(name_or_field)`
381
+ Removes a form field by name (String) or Field object. Returns `true` if successful, `false` if field not found.
382
+
383
+ ```ruby
384
+ doc.remove_field("FieldName")
385
+ doc.remove_field(field_object)
386
+ ```
387
+
388
+ #### `#write(path_out = nil, flatten: false)`
389
+ Writes the modified PDF. If `path_out` is provided, writes to that file path and returns `true`. If no path is provided, returns the PDF bytes as a String. The `flatten` option removes incremental updates from the PDF.
390
+
391
+ ```ruby
392
+ doc.write("output.pdf") # Write to file
393
+ doc.write("output.pdf", flatten: true) # Write flattened PDF to file
394
+ pdf_bytes = doc.write # Get PDF bytes as String
395
+ ```
396
+
397
+ #### `#flatten`
398
+ Returns flattened PDF bytes (removes incremental updates) without modifying the document.
399
+
400
+ ```ruby
401
+ flattened_bytes = doc.flatten
402
+ ```
403
+
404
+ #### `#flatten!`
405
+ Flattens the PDF in-place (modifies the current document instance).
406
+
407
+ ```ruby
408
+ doc.flatten!
409
+ ```
410
+
411
+ #### `CorpPdf::Document.flatten_pdf(input_path, output_path = nil)`
412
+ Class method to flatten a PDF. If `output_path` is provided, writes to that path and returns the path. Otherwise returns a new `Document` instance with the flattened content.
413
+
414
+ ```ruby
415
+ CorpPdf::Document.flatten_pdf("input.pdf", "output.pdf")
416
+ flattened_doc = CorpPdf::Document.flatten_pdf("input.pdf")
417
+ ```
418
+
419
+ #### `#clear(options = {})` and `#clear!(options = {})`
420
+ Removes unwanted fields by rewriting the entire PDF. `clear` returns cleared PDF bytes without modifying the document, while `clear!` modifies the document in-place. Options include:
421
+
422
+ - `keep_fields`: Array of field names to keep (all others removed)
423
+ - `remove_fields`: Array of field names to remove
424
+ - `remove_pattern`: Regex pattern - fields matching this are removed
425
+ - Block: Given field name, return `true` to keep, `false` to remove
426
+
427
+ ```ruby
428
+ # Remove all fields
429
+ cleared = doc.clear(remove_pattern: /.*/)
430
+
431
+ # Remove fields matching pattern (in-place)
432
+ doc.clear!(remove_pattern: /^text-/)
433
+
434
+ # Keep only specific fields
435
+ doc.clear!(keep_fields: ["Name", "Email"])
436
+
437
+ # Use a block to filter fields by name (return true to keep)
438
+ doc.clear! { |name| !name.match?(/^[a-f0-9-]{30,}/) }
439
+ ```
440
+
441
+ **Note:** This completely rewrites the PDF (like `flatten`), so it's more efficient than using `remove_field` multiple times. See [Clearing Fields Documentation](docs/cleaning_fields.md) for detailed information.
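To make the selection options concrete, here is a small sketch of how the documented options could be combined into a single keep-or-remove decision per field name. The method `keep_field?` is hypothetical, and the order in which the options are applied is an assumption; CorpPdf may resolve combinations of options differently.

```ruby
# Sketch only: one plausible way to combine the documented options.
# Returns true when a field with this name would be kept.
def keep_field?(name, keep_fields: nil, remove_fields: nil, remove_pattern: nil, &block)
  return false if keep_fields && !keep_fields.include?(name)
  return false if remove_fields&.include?(name)
  return false if remove_pattern && name.match?(remove_pattern)
  return false if block && !block.call(name) # block returns true to keep

  true
end

names = ["Name", "Email", "text-1", "temp_notes"]

p names.select { |n| keep_field?(n, remove_pattern: /^text-/) }
p names.select { |n| keep_field?(n, keep_fields: ["Name", "Email"]) }
p names.select { |n| keep_field?(n) { |field_name| !field_name.start_with?("temp_") } }
```

Running the decision over a plain list of names like this is a cheap way to preview which fields a given set of options would leave in place.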
442
+
443
+ ### Field Object
444
+
445
+ Each field returned by `#list_fields` is a `Field` object with the following attributes and methods:
446
+
447
+ #### Attributes
448
+ - `name`: Field name (String)
449
+ - `value`: Field value (String or nil)
450
+ - `type`: Field type (String, e.g., "/Tx", "/Btn", "/Ch", "/Sig"). Defaults to "/Tx" if missing from PDF.
451
+ - `ref`: Object reference array `[object_number, generation]`
452
+ - `x`: X coordinate (Float or nil)
453
+ - `y`: Y coordinate (Float or nil)
454
+ - `width`: Field width (Float or nil)
455
+ - `height`: Field height (Float or nil)
456
+ - `page`: Page number (Integer or nil)
457
+
458
+ #### Methods
459
+ - `#update(new_value, new_name: nil)`: Update the field's value and optionally rename it
460
+ - `#remove`: Remove the field from the document
461
+ - `#type_key`: Returns the symbol key for the type (e.g., `:text` for `"/Tx"`) or `nil` if not mapped
462
+ - `#text_field?`: Returns true if field is a text field
463
+ - `#button_field?`: Returns true if field is a button/checkbox field
464
+ - `#choice_field?`: Returns true if field is a choice/dropdown field
465
+ - `#signature_field?`: Returns true if field is a signature field
466
+ - `#has_value?`: Returns true if field has a non-empty value
467
+ - `#has_position?`: Returns true if field has position information
468
+ - `#object_number`: Returns the object number (first element of ref)
469
+ - `#generation`: Returns the generation number (second element of ref)
470
+ - `#valid_ref?`: Returns true if field has a valid reference (not a placeholder)
471
+
472
+ **Note**: When reading fields from a PDF, if the type is missing or empty, it defaults to `"/Tx"` (text field). The `type_key` method allows you to get the symbol representation (e.g., `:text`) from the type string.
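The relationship between type strings and symbol keys can be pictured as a simple lookup with a `"/Tx"` fallback. The constant `TYPE_KEYS` and both helper methods below are hypothetical names used for illustration; the table inside the library may contain more entries or be organised differently.

```ruby
# Sketch only: a hypothetical lookup table, not CorpPdf's internal one.
TYPE_KEYS = {
  "/Tx"  => :text,
  "/Btn" => :button,
  "/Ch"  => :choice,
  "/Sig" => :signature
}.freeze

# A missing or empty type falls back to "/Tx", as the note above describes.
def normalize_type(raw)
  raw.nil? || raw.empty? ? "/Tx" : raw
end

# Returns the symbol key, or nil when the type string is not mapped.
def type_key_for(raw)
  TYPE_KEYS[normalize_type(raw)]
end

p type_key_for("/Btn")
p type_key_for(nil)
p type_key_for("/Unknown")
```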
473
+
474
+ ## Example
475
+
476
+ For complete working examples, see the test files in the `spec/` directory:
477
+ - `spec/document_spec.rb` - Basic document operations
478
+ - `spec/form_editing_spec.rb` - Form field editing examples
479
+ - `spec/field_editor_spec.rb` - Field object manipulation
480
+
481
+ ## Architecture
482
+
483
+ CorpPdf is built as a minimal PDF engine with the following components:
484
+
485
+ - **ObjectResolver**: Resolves and extracts PDF objects from the document
486
+ - **DictScan**: Parses PDF dictionaries and extracts field information
487
+ - **IncrementalWriter**: Handles incremental PDF updates (appends changes)
488
+ - **PDFWriter**: Writes complete PDF files (for flattening)
489
+ - **Actions**: Modular actions for adding, updating, and removing fields (`AddField`, `UpdateField`, `RemoveField`)
490
+ - **Document**: Main orchestration class that coordinates all operations
491
+ - **Field**: Represents a form field with its properties and methods
492
+
493
+ ## Limitations
494
+
495
+ This is a minimal implementation focused on AcroForm manipulation. It does not support:
496
+
497
+ - Complex PDF features (images, fonts, advanced graphics, etc.)
498
+ - PDF compression/decompression (streams are preserved as-is)
499
+ - Full PDF rendering or display
500
+ - Digital signatures (though signature fields can be added)
501
+ - JavaScript or other interactive features
502
+ - Form submission/validation logic
503
+
504
+ ## Dependencies
505
+
506
+ - **chunky_png** (~> 1.4): Required for PNG image processing in signature field appearances. JPEG images can be processed without this dependency, but PNG support requires it.
507
+
508
+ ## Development
509
+
510
+ After checking out the repo, run `bundle install` to install dependencies. Then, run `bundle exec rspec` to run the tests.
511
+
512
+ ## Contributing
513
+
514
+ Bug reports and pull requests are welcome on GitHub.
515
+
516
+ ## License
517
+
518
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "rspec/core/rake_task"
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ desc "Run RuboCop"
9
+ task :rubocop do
10
+ sh "bundle exec rubocop"
11
+ end
12
+
13
+ desc "Run RuboCop with auto-correct"
14
+ task "rubocop:fix" do
15
+ sh "bundle exec rubocop --autocorrect"
16
+ end
17
+
18
+ task default: :spec
data/corp_pdf.gemspec ADDED
@@ -0,0 +1,35 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'lib/corp_pdf/version'
4
+
5
+ Gem::Specification.new do |spec|
6
+ spec.name = "corp_pdf"
7
+ spec.version = CorpPdf::VERSION
8
+ spec.authors = ["Michael Wynkoop"]
9
+ spec.email = ["michaelwynkoop@corporatetools.com"]
10
+
11
+ spec.summary = "Pure Ruby PDF AcroForm editing library"
12
+ spec.description = "A minimal pure Ruby library for parsing and editing PDF AcroForm fields with minimal dependencies"
13
+ spec.homepage = "https://github.com/corporatetools/corp_pdf"
14
+ spec.license = "MIT"
15
+ spec.required_ruby_version = Gem::Requirement.new(">= 3.1.0")
16
+
17
+ spec.metadata["homepage_uri"] = spec.homepage
18
+ spec.metadata["source_code_uri"] = "https://github.com/corporatetools/corp_pdf"
19
+ spec.metadata["changelog_uri"] = "https://github.com/corporatetools/corp_pdf/blob/main/CHANGELOG.md"
20
+
21
+ # Specify which files should be added to the gem when it is released.
22
+ spec.files = Dir.chdir(__dir__) do
23
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
24
+ end
25
+ spec.bindir = "exe"
26
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
27
+ spec.require_paths = ["lib"]
28
+
29
+ spec.add_runtime_dependency "chunky_png", "~> 1.4"
30
+ spec.add_runtime_dependency "i18n", "~> 1.14"
31
+ spec.add_development_dependency "rspec", "~> 3.0"
32
+ spec.add_development_dependency "pry", "~> 0.14"
33
+ spec.add_development_dependency "rubocop", "~> 1.50"
34
+ spec.add_development_dependency "rubocop-rspec", "~> 2.20"
35
+ end
data/docs/README.md ADDED
@@ -0,0 +1,111 @@
1
+ # CorpPdf Documentation
2
+
3
+ This directory contains detailed documentation about how `CorpPdf` works, with a focus on explaining the text-based nature of PDFs and how the library uses simple text traversal to parse and modify them.
4
+
5
+ ## Documentation Overview
6
+
7
+ ### [PDF Structure](./pdf_structure.md)
8
+
9
+ Explains the fundamental structure of PDF files, including:
10
+ - PDFs as text-based files with structured syntax
11
+ - PDF dictionaries (`<< ... >>`)
12
+ - PDF objects, references, arrays, and strings
13
+ - Why PDF structure is parseable with text traversal
14
+ - Examples of PDF dictionary structure
15
+
16
+ **Key insight:** PDFs may contain binary data in streams, but their **structure**—dictionaries, arrays, strings, references—is all text-based syntax.
17
+
18
+ ### [DictScan Explained](./dict_scan_explained.md)
19
+
20
+ A detailed walkthrough of the `DictScan` module:
21
+ - How each function works
22
+ - Why text traversal is the core approach
23
+ - Step-by-step algorithm explanations
24
+ - Common patterns for using `DictScan`
25
+ - Examples showing how text traversal parses PDF dictionaries
26
+
27
+ **Key insight:** Despite appearing complicated, `DictScan` is fundamentally **text traversal**—finding delimiters (`<<`, `>>`, `(`, `)`, etc.) and tracking depth to extract values.
28
+
29
+ ### [Object Streams](./object_streams.md)
30
+
31
+ Explains how PDF object streams work and how `CorpPdf` parses them:
32
+ - What object streams are and why they're used
33
+ - Object stream structure (header + data sections)
34
+ - How `ObjectResolver` identifies objects in streams
35
+ - The `ObjStm.parse` algorithm
36
+ - Stream decoding (compression, PNG predictor)
37
+ - Lazy loading and caching
38
+
39
+ **Key insight:** Object streams compress multiple objects together, but parsing them is still **text traversal**—once decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
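As a rough sketch of that idea, the code below parses an already-decompressed object stream body. It assumes the standard layout in which the first `/First` bytes hold `/N` pairs of integers (object number, then byte offset relative to `/First`). The method name `parse_object_stream` is hypothetical and this is not the library's `ObjStm.parse` implementation.

```ruby
# Sketch only: `data` is the decompressed stream body, `n` is the /N entry
# and `first` is the /First entry of the stream dictionary.
def parse_object_stream(data, n:, first:)
  header = data.byteslice(0, first).split.map { |token| Integer(token) }
  pairs = header.each_slice(2).first(n) # [[object_number, offset], ...]

  pairs.each_with_index.to_h do |(obj_num, offset), idx|
    # Each object runs until the next object's offset, or to the end of the data.
    next_offset = idx + 1 < pairs.length ? pairs[idx + 1][1] : data.bytesize - first
    [obj_num, data.byteslice(first + offset, next_offset - offset).strip]
  end
end

# Build a small stream by hand so the offsets are guaranteed to be correct.
objects = { 10 => "<< /FT /Tx /T (Name) >>", 11 => "<< /FT /Btn /T (Agree) >>" }
body = +""
header = objects.map do |num, text|
  entry = "#{num} #{body.bytesize}"
  body << text << "\n"
  entry
end.join(" ") + "\n"

parsed = parse_object_stream(header + body, n: objects.size, first: header.bytesize)
parsed.each { |num, text| puts "#{num}: #{text}" }
```

Decompression and predictor handling are deliberately left out; the point is only that the remaining work is integer parsing plus substring extraction by offset.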
40
+
41
+ ### [Clearing Fields](./cleaning_fields.md)
42
+
43
+ Documentation for the `clear` and `clear!` methods:
44
+ - How to remove unwanted fields completely
45
+ - Difference between `clear` and `remove_field`
46
+ - Pattern matching and field selection
47
+ - Removing orphaned widget references
48
+ - Best practices for clearing PDFs
49
+
50
+ **Key insight:** `clear` rewrites the entire PDF to exclude unwanted fields, ensuring complete removal rather than just marking fields as deleted.
51
+
52
+ ## Common Themes
53
+
54
+ Throughout all documentation, you'll see these recurring themes:
55
+
56
+ 1. **PDFs are text-based**: Despite being "binary" files, PDF structure uses text syntax
57
+ 2. **Text traversal works**: Simple character-by-character scanning can parse PDF dictionaries
58
+ 3. **Depth tracking**: Nested structures (dictionaries, arrays, strings) use depth counting
59
+ 4. **Position-based replacement**: Using exact byte positions is safer than regex replacement
60
+ 5. **Minimal parsing**: We don't need a full PDF parser—just enough to find dictionaries and extract/replace values
61
+
62
+ ## How to Read These Docs
63
+
64
+ **If you're new to PDFs:**
65
+ 1. Start with [PDF Structure](./pdf_structure.md) to understand PDFs at a high level
66
+ 2. Read [DictScan Explained](./dict_scan_explained.md) to see how text traversal works
67
+ 3. Read [Object Streams](./object_streams.md) to understand compression features
68
+ 4. Read [Clearing Fields](./cleaning_fields.md) to learn how to remove unwanted fields
69
+
70
+ **If you're debugging:**
71
+ - [DictScan Explained](./dict_scan_explained.md) has function-by-function walkthroughs
72
+ - [Object Streams](./object_streams.md) explains how object streams are parsed
73
+
74
+ **If you're contributing:**
75
+ - All docs include code examples and algorithm explanations
76
+ - Each document explains **why** the approach works, not just **how**
77
+
78
+ ## Technical Details
79
+
80
+ ### Why Text Traversal Works
81
+
82
+ PDF dictionaries use distinct delimiters:
83
+ - `<<` `>>` for dictionaries
84
+ - `[` `]` for arrays
85
+ - `(` `)` for literal strings
86
+ - `<` `>` for hex strings
87
+ - `/` for names
88
+
89
+ These distinct delimiters make it possible to determine a value's type from its first character (or its first two characters, to tell `<<` from `<`). Depth tracking (counting `<<`/`>>`, `[`/`]`, etc.) handles nested structures.
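A minimal sketch of depth tracking, assuming the input is plain dictionary text with streams already stripped: it finds the end of a dictionary by counting `<<` and `>>`, and skips over literal strings so that delimiters inside `( ... )` are ignored. The method names are hypothetical and this is not the actual `DictScan` code; hex strings and comments are not handled here.

```ruby
# Sketch only: returns the index just past the `>>` that closes the
# dictionary opening at `start`, or nil if it is never closed.
def dict_end(src, start)
  depth = 0
  i = start
  while i < src.length
    two = src[i, 2]
    if two == "<<"
      depth += 1
      i += 2
    elsif two == ">>"
      depth -= 1
      i += 2
      return i if depth.zero?
    elsif src[i] == "("
      i = skip_literal_string(src, i)
    else
      i += 1
    end
  end
  nil
end

# Returns the index just past the `)` that closes the literal string at `i`,
# honouring nested parentheses and backslash escapes.
def skip_literal_string(src, i)
  depth = 0
  while i < src.length
    case src[i]
    when "\\" then i += 1 # skip the escaped character
    when "(" then depth += 1
    when ")"
      depth -= 1
      return i + 1 if depth.zero?
    end
    i += 1
  end
  src.length
end

sample = "<< /T (Name >> tricky) /Kids [ << /FT /Tx >> ] >> trailing"
puts sample[0...dict_end(sample, 0)]
```

The `>>` inside the literal string and the nested dictionary inside the array are both passed over, so only the outer dictionary is returned.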
90
+
91
+ ### Performance
92
+
93
+ **Why text traversal is fast:**
94
+ - No AST construction
95
+ - No full PDF parsing
96
+ - Direct string manipulation
97
+ - Minimal memory allocation
98
+
99
+ **Trade-offs:**
100
+ - Doesn't validate entire PDF structure
101
+ - Assumes dictionaries are well-formed
102
+ - Some preprocessing needed (stream stripping)
103
+
104
+ ### Safety
105
+
106
+ **Position-based replacement** (using exact byte positions) avoids regex edge cases and preserves formatting. The code verifies dictionaries remain valid after modification.
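The idea can be sketched in a few lines: once scanning has produced the exact start and end byte positions of a value, only that slice is swapped out. The helper `replace_bytes` is a hypothetical name for illustration, not a CorpPdf method.

```ruby
# Sketch only: replaces the bytes in [start, finish) with `replacement`
# and leaves everything outside that range untouched.
def replace_bytes(src, start, finish, replacement)
  bin = src.b # work on raw bytes so positions are byte offsets
  bin.byteslice(0, start) + replacement.b + bin.byteslice(finish, bin.bytesize - finish)
end

dict = "<< /T (Name) /V (old) /T2 (old) >>"
start = dict.b.index("(old)")          # position of the /V value found by scanning
finish = start + "(old)".bytesize

puts replace_bytes(dict, start, finish, "(new value)")
```

A global substitution such as `gsub("(old)", ...)` would also have rewritten the `/T2` value; targeting the byte range changes only the value that was located.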
107
+
108
+ ## Questions?
109
+
110
+ If you have questions about how `CorpPdf` works, these docs should answer them. The code is also well-commented, so reading the source alongside the docs is recommended.
111
+