corp_pdf 1.0.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +13 -0
- data/.rubocop.yml +78 -0
- data/CHANGELOG.md +122 -0
- data/Gemfile +5 -0
- data/Gemfile.lock +90 -0
- data/README.md +518 -0
- data/Rakefile +18 -0
- data/corp_pdf.gemspec +35 -0
- data/docs/README.md +111 -0
- data/docs/clear_fields.md +202 -0
- data/docs/dict_scan_explained.md +341 -0
- data/docs/object_streams.md +311 -0
- data/docs/pdf_structure.md +251 -0
- data/issues/README.md +59 -0
- data/issues/memory-benchmark-results.md +551 -0
- data/issues/memory-improvements.md +388 -0
- data/issues/memory-optimization-summary.md +204 -0
- data/issues/refactoring-opportunities.md +259 -0
- data/lib/corp_pdf/actions/add_field.rb +73 -0
- data/lib/corp_pdf/actions/base.rb +48 -0
- data/lib/corp_pdf/actions/remove_field.rb +154 -0
- data/lib/corp_pdf/actions/update_field.rb +663 -0
- data/lib/corp_pdf/dict_scan.rb +523 -0
- data/lib/corp_pdf/document.rb +782 -0
- data/lib/corp_pdf/field.rb +145 -0
- data/lib/corp_pdf/fields/base.rb +384 -0
- data/lib/corp_pdf/fields/checkbox.rb +164 -0
- data/lib/corp_pdf/fields/radio.rb +220 -0
- data/lib/corp_pdf/fields/signature.rb +393 -0
- data/lib/corp_pdf/fields/text.rb +31 -0
- data/lib/corp_pdf/incremental_writer.rb +245 -0
- data/lib/corp_pdf/object_resolver.rb +381 -0
- data/lib/corp_pdf/objstm.rb +75 -0
- data/lib/corp_pdf/page.rb +90 -0
- data/lib/corp_pdf/pdf_writer.rb +133 -0
- data/lib/corp_pdf/version.rb +5 -0
- data/lib/corp_pdf.rb +35 -0
- data/publish +183 -0
- metadata +169 -0
@@ -0,0 +1,311 @@
# PDF Object Streams Explained

## Overview

PDF Object Streams (also called "ObjStm") are a compression feature in PDFs that allows multiple PDF objects to be stored together in a single compressed stream. This reduces file size and improves performance. `CorpPdf` handles object streams transparently, so you don't need to worry about them when working with PDF objects, but understanding how they work helps explain how the library parses PDFs.

## What Are Object Streams?

Instead of storing objects individually in the PDF body:

```
5 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

6 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

7 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj
```

Object streams allow multiple objects to be **packed together** in a single compressed stream:

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 >>
stream
[compressed header + object bodies]
endstream
endobj
```

Where:
- `/N 3` means there are 3 objects in this stream
- `/First 20` means the object data starts at byte offset 20 of the decoded stream (the first 20 bytes are the header)
- The stream contains: header (object numbers + offsets) + object bodies

## Object Stream Structure

An object stream consists of two parts:

### 1. Header Section (the first `/First` bytes)

The header is a list of whitespace-separated integers:
```
5 0 6 10 7 25
```

This means:
- Object 5 starts at offset 0 (relative to the start of object data)
- Object 6 starts at offset 10 (relative to the start of object data)
- Object 7 starts at offset 25 (relative to the start of object data)

Format: `obj_num offset obj_num offset ...` (pairs of object number and offset)

### 2. Object Data Section (Starting at /First)

The object data section contains the actual object bodies, concatenated together:
```
<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>
```

The offsets in the header tell us where each object body starts within this data section.

### Complete Example

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 /Filter /FlateDecode >>
stream
[Compressed bytes containing:]
Header: "5 0 6 10 7 25 "
Data: "<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>"
endstream
endobj
```

After decompression:
- Bytes 0-19: Header (`"5 0 6 10 7 25 "`, 14 bytes, followed by whitespace padding up to byte 19)
- Bytes 20+: Object data (`"<< /Type /Annot /Subtype /Widget >>..."`)
- Object 5 body: bytes 20-29 (offset 0 + 20 = 20, next object at offset 10)
- Object 6 body: bytes 30-44 (offset 10 + 20 = 30, next object at offset 25)
- Object 7 body: bytes 45+ (offset 25 + 20 = 45)

Note that the offsets 0, 10, and 25 are simplified for readability: each body shown above is 35 bytes long, so three of them packed back to back would really use offsets 0, 35, and 70.

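To make the layout concrete, here is a minimal Ruby sketch that builds an object stream payload from three bodies and slices it back apart. `build_objstm_payload` is a hypothetical helper written for this illustration and is not part of `CorpPdf`. It packs the bodies back to back, so the offsets for three 35-byte widget dictionaries come out as 0, 35, and 70.

```ruby
require "zlib"

# Hypothetical helper (not part of CorpPdf): packs [object_number, body]
# pairs into the decoded payload of an object stream.
def build_objstm_payload(objects)
  offset = 0
  pairs = objects.map do |num, body|
    pair = "#{num} #{offset}"
    offset += body.bytesize # the next body starts where this one ends
    pair
  end
  header = "#{pairs.join(' ')} "
  { n: objects.size, first: header.bytesize, payload: header + objects.map(&:last).join }
end

widget = "<< /Type /Annot /Subtype /Widget >>"
stm = build_objstm_payload([[5, widget], [6, widget], [7, widget]])

# /Filter /FlateDecode means the payload is stored zlib-compressed.
compressed = Zlib::Deflate.deflate(stm[:payload])
decoded = Zlib::Inflate.inflate(compressed)

header = decoded.byteslice(0, stm[:first])
second_body = decoded.byteslice(stm[:first] + 35, 35) # object 6 sits at offset 35

puts header.inspect # "5 0 6 35 7 70 "
puts stm[:first]    # 14
puts second_body
```

Here `/First` is simply the byte length of the header, which is why no padding is needed when the writer computes it directly.
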
## How CorpPdf Parses Object Streams

### Step 1: Cross-Reference Table Parsing

When `ObjectResolver` parses the cross-reference table or xref stream, it identifies objects stored in object streams. These are described by type 2 records in an xref stream:

```ruby
# In parse_xref_stream_records
when 2 then @entries[ref] ||= Entry.new(type: :in_objstm, objstm_num: f1, objstm_index: f2)
```

Where:
- `f1` = object stream number (the container object)
- `f2` = index within the object stream (0 = first object, 1 = second object, etc.)

Example: If object 5 is at index 0 in object stream 10, then `Entry` stores:
- `type: :in_objstm`
- `objstm_num: 10` (the object stream container)
- `objstm_index: 0` (first object in the stream)

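As a rough illustration of where `f1` and `f2` come from, the sketch below decodes fixed-width xref stream records into entries. It assumes the three-field record layout given by the stream's `/W` array (type, field 2, field 3, each a big-endian integer). `decode_xref_records` and this `Entry` struct are hypothetical names for the example and may differ from the library's internals.

```ruby
# Hypothetical stand-in for the library's Entry; field names follow the text above.
Entry = Struct.new(:type, :offset, :objstm_num, :objstm_index, keyword_init: true)

# data:   decoded xref stream bytes
# widths: the /W array, e.g. [1, 2, 1]
def decode_xref_records(data, widths, first_obj: 0)
  entries = {}
  data.bytes.each_slice(widths.sum).with_index do |record, i|
    pos = 0
    type, f1, f2 = widths.map do |w|
      value = record[pos, w].inject(0) { |acc, byte| (acc << 8) | byte } # big-endian
      pos += w
      value
    end
    type = 1 if widths[0].zero? # a zero-width type field defaults to type 1
    num = first_obj + i
    case type
    when 1 then entries[[num, f2]] ||= Entry.new(type: :in_file, offset: f1)
    when 2 then entries[[num, 0]] ||= Entry.new(type: :in_objstm, objstm_num: f1, objstm_index: f2)
    end # type 0 marks a free object and is skipped in this sketch
  end
  entries
end

records = [
  0, 0, 0, 0,       # object 0: free
  1, 0x01, 0x2C, 0, # object 1: in the file at byte offset 300
  2, 0, 10, 0,      # object 2: index 0 of object stream 10
  2, 0, 10, 1       # object 3: index 1 of object stream 10
].pack("C*")

entries = decode_xref_records(records, [1, 2, 1])
p entries[[2, 0]].to_h
```
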
### Step 2: Lazy Loading

When you request an object body that's in an object stream, `ObjectResolver` calls `load_objstm`:

```ruby
def load_objstm(container_ref)
  return if @objstm_cache.key?(container_ref) # Already cached

  # Get the object stream container's body
  body = object_body(container_ref)

  # Extract dictionary to get /N and /First
  dict_src = extract_dictionary(body)
  n = DictScan.value_token_after("/N", dict_src).to_i
  first = DictScan.value_token_after("/First", dict_src).to_i

  # Extract and decode stream data
  raw = decode_stream_data(dict_src, extract_stream_body(body))

  # Parse the object stream
  parsed = CorpPdf::ObjStm.parse(raw, n: n, first: first)

  # Cache the result
  @objstm_cache[container_ref] = parsed
end
```

### Step 3: Stream Decoding

The `decode_stream_data` method handles:
1. **Extracting stream body**: Removes `stream` and `endstream` keywords
2. **Decompression**: If `/Filter /FlateDecode` is present, decompress using zlib
3. **PNG Predictor**: If `/Predictor` is present (10-15), apply PNG predictor decoding

Example (simplified):
```ruby
def decode_stream_data(dict_src, stream_chunk)
  # Extract body between stream...endstream
  body = extract_stream_body(stream_chunk)

  # Decompress if FlateDecode
  data = if dict_src =~ %r{/Filter\s*/FlateDecode}
           Zlib::Inflate.inflate(body)
         else
           body
         end

  # Apply PNG predictor if /Predictor is in the PNG range (10-15)
  if dict_src =~ %r{/Predictor\s+(\d+)} && Regexp.last_match(1).to_i >= 10
    # Row width comes from /Columns
    columns = dict_src[%r{/Columns\s+(\d+)}, 1].to_i
    # Decode PNG predictor (Sub, Up, Average, Paeth)
    data = apply_png_predictor(data, columns)
  end

  data
end
```

### Step 4: Object Stream Parsing (`ObjStm.parse`)

The `ObjStm.parse` method is the heart of object stream parsing:

```ruby
def self.parse(bytes, n:, first:)
  # Extract header (the first `first` bytes)
  head = bytes[0...first]

  # Parse whitespace-separated integers: obj_num offset obj_num offset ...
  entries = head.strip.split(/\s+/).map!(&:to_i)

  # Extract each object body
  refs = []
  n.times do |i|
    obj = entries[2 * i]       # Object number
    off = entries[(2 * i) + 1] # Offset in data section

    # Calculate next offset (or end of data)
    next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : (bytes.bytesize - first)

    # Extract object body: start at (first + off), length is (next_off - off)
    body = bytes[first + off, next_off - off]

    refs << { ref: [obj, 0], body: body }
  end

  refs
end
```

**Step-by-step example:**

Given:
- `bytes` = decompressed stream data
- `n = 3` (3 objects)
- `first = 20` (data starts at byte 20)
- Header: `"5 0 6 10 7 25 "` (14 bytes, padded to 20)

Processing:
1. Extract header: `bytes[0...20]` → `"5 0 6 10 7 25 "` plus padding
2. Parse entries: `[5, 0, 6, 10, 7, 25]`
3. For `i=0` (first object):
   - `obj = entries[0]` = 5
   - `off = entries[1]` = 0
   - `next_off = entries[3]` = 10 (offset of next object)
   - `body = bytes[20 + 0, 10 - 0]` = bytes[20...30]
4. For `i=1` (second object):
   - `obj = entries[2]` = 6
   - `off = entries[3]` = 10
   - `next_off = entries[5]` = 25
   - `body = bytes[20 + 10, 25 - 10]` = bytes[30...45]
5. For `i=2` (third object):
   - `obj = entries[4]` = 7
   - `off = entries[5]` = 25
   - `next_off = bytes.bytesize - first` (end of data)
   - `body = bytes[20 + 25, ...]` = bytes[45...end]

Result (bodies shown in full; with real offsets each slice covers one complete dictionary):
```ruby
[
  { ref: [5, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [6, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [7, 0], body: "<< /Type /Annot /Subtype /Widget >>" }
]
```

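The same logic can be exercised end to end with a standalone sketch. `ObjStmSketch` is a hypothetical module that mirrors the parsing steps shown above. It uses `byteslice` so that offsets are always counted in bytes, and the sample header uses offsets that match the sample bodies.

```ruby
# Hypothetical standalone version of the parsing steps (not the library's module).
module ObjStmSketch
  def self.parse(bytes, n:, first:)
    # Header: pairs of "object_number offset", separated by whitespace.
    entries = bytes.byteslice(0, first).split.map(&:to_i)

    Array.new(n) do |i|
      obj = entries[2 * i]
      off = entries[(2 * i) + 1]
      # A body ends where the next one starts, or at the end of the data.
      next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : bytes.bytesize - first
      { ref: [obj, 0], body: bytes.byteslice(first + off, next_off - off) }
    end
  end
end

header = "5 0 6 11 7 23 "
data = "<< /A 1 >> << /B 22 >> << /C 333 >>"
parsed = ObjStmSketch.parse(header + data, n: 3, first: header.bytesize)

parsed.each { |entry| puts "#{entry[:ref].inspect} => #{entry[:body].strip}" }
```

Each body keeps any trailing whitespace that separated it from the next object, so callers that compare bodies may want to strip them first.
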
### Step 5: Object Retrieval

When `object_body(ref)` is called for an object in an object stream:

```ruby
def object_body(ref)
  case (e = @entries[ref])&.type
  when :in_file
    # Regular object: extract from file
    extract_from_file(e.offset)
  when :in_objstm
    # Object stream: load stream (if not cached), then get object by index
    load_objstm([e.objstm_num, 0])
    @objstm_cache[[e.objstm_num, 0]][e.objstm_index][:body]
  end
end
```

The index (`objstm_index`) tells us which object in the parsed array to return.

## Why Object Streams Matter

### Benefits

1. **File Size**: Compressing multiple objects together is more efficient than compressing each individually
2. **Performance**: Fewer objects to parse when opening the PDF
3. **Common in Modern PDFs**: Object streams were introduced in PDF 1.5, and many PDFs created by modern tools use them

### Transparency in CorpPdf

`CorpPdf` handles object streams automatically:
- You don't need to know if an object is in a stream or not
- `object_body(ref)` returns the object body the same way regardless
- Object streams are cached after first load (no repeated parsing)
- The same `DictScan` methods work on extracted object bodies

## Cross-Reference Streams vs Object Streams

**Important distinction:**

1. **XRef Streams** (`/Type /XRef`): Used to find where objects are located in the PDF
   - Contains byte offsets or references to object streams
   - Replaces classic xref tables

2. **Object Streams** (`/Type /ObjStm`): Used to store actual object bodies
   - Contains compressed object dictionaries
   - Objects inside them are located through type 2 entries in an xref stream (a classic xref table cannot point into an object stream)

Both use the same stream format (compressed, potentially with PNG predictor), but serve different purposes.

## PNG Predictor

PNG prediction is a filter applied before compression: each byte is stored as its difference from a predicted value, which makes the data compress better. Each encoded row starts with one byte that names the filter type used for that row. `CorpPdf` supports all 5 PNG predictor types:

1. **Type 0 (None)**: No prediction
2. **Type 1 (Sub)**: Predict from left
3. **Type 2 (Up)**: Predict from above
4. **Type 3 (Average)**: Predict from average of left and above
5. **Type 4 (Paeth)**: Predict using Paeth algorithm

The `apply_png_predictor` method decodes predictor-encoded data row by row, using the `/Columns` parameter to determine row width.

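A minimal sketch of row-by-row decoding follows. It assumes one byte per sample (`bpp = 1`, which is typical for xref and object stream data) and that each encoded row is one filter-type byte followed by `columns` data bytes. `png_unpredict` and `paeth` are hypothetical names, and the library's `apply_png_predictor` may be organized differently.

```ruby
# Paeth predictor: pick whichever of left / up / upper-left is closest
# to the estimate left + up - up_left.
def paeth(left, up, up_left)
  estimate = left + up - up_left
  dl = (estimate - left).abs
  du = (estimate - up).abs
  dul = (estimate - up_left).abs
  if dl <= du && dl <= dul then left
  elsif du <= dul then up
  else up_left
  end
end

# Hypothetical decoder sketch: reverses PNG row filters 0-4.
def png_unpredict(data, columns, bpp = 1)
  prev = Array.new(columns, 0) # the row "above" the first row is all zeros
  out = []
  data.bytes.each_slice(columns + 1) do |filter, *raw|
    cur = []
    raw.each_with_index do |byte, i|
      left = i >= bpp ? cur[i - bpp] : 0
      up = prev[i] || 0
      up_left = i >= bpp ? (prev[i - bpp] || 0) : 0
      predicted =
        case filter
        when 0 then 0
        when 1 then left
        when 2 then up
        when 3 then (left + up) / 2
        when 4 then paeth(left, up, up_left)
        else raise ArgumentError, "unknown PNG filter type #{filter}"
        end
      cur << ((byte + predicted) & 0xFF) # stored byte is the difference mod 256
    end
    out.concat(cur)
    prev = cur
  end
  out.pack("C*")
end

encoded = [2, 1, 2, 3, 2, 1, 1, 1].pack("C*") # two rows, both filter type 2 (Up)
p png_unpredict(encoded, 3).bytes # => [1, 2, 3, 2, 3, 4]
```
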
## Summary

Object streams allow PDFs to store multiple objects in compressed streams. `CorpPdf` handles them by:

1. **Identifying** objects in streams via xref parsing
2. **Lazy loading** stream containers when needed
3. **Decoding** compressed stream data (zlib + PNG predictor)
4. **Parsing** the header to extract object offsets
5. **Extracting** individual object bodies by offset
6. **Caching** parsed streams for performance

The parsing itself is straightforward:
- Header is whitespace-separated integers (object numbers and offsets)
- Object data follows the header
- Extract each object body using its offset

Just like `DictScan`, object stream parsing is **text traversal**: once the stream is decompressed, it's just parsing whitespace-separated numbers and extracting substrings by offset.

@@ -0,0 +1,251 @@
# PDF File Structure

## Overview

PDF (Portable Document Format) files have a reputation for being complex binary formats, but at their core, they are **text-based files with a structured syntax**. Understanding this fundamental fact is key to understanding how PDF works.

While PDFs can contain binary data (like compressed streams, images, and fonts), the **structure** of a PDF (its objects, dictionaries, arrays, and references) is defined using plain text syntax.

## PDF File Anatomy

A PDF file consists of several main parts:

1. **Header**: `%PDF-1.4` (or similar version)
2. **Body**: A collection of PDF objects (the actual content)
3. **Cross-Reference Table (xref)**: Points to byte offsets of objects
4. **Trailer**: Contains the root object reference and metadata
5. **EOF Marker**: `%%EOF`

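The five parts can be seen in a tiny hand-built file. The sketch below assembles a skeletal PDF string (structurally illustrative only: it has no pages, so a viewer would reject it) and then locates the header version, the xref table via the `startxref` offset, and the `/Root` reference with plain regular expressions. All variable names are made up for the example.

```ruby
body = "%PDF-1.4\n" \
       "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = body.bytesize # the xref table starts right after the body

pdf = body +
      "xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n" \
      "trailer\n<< /Size 2 /Root 1 0 R >>\n" \
      "startxref\n#{xref_offset}\n%%EOF\n"

version = pdf[/\A%PDF-(\d+\.\d+)/, 1]
startxref = pdf[/startxref\s+(\d+)\s+%%EOF\s*\z/, 1].to_i
root_num = pdf[%r{/Root\s+(\d+)\s+\d+\s+R}, 1].to_i

puts version                     # 1.4
puts pdf.byteslice(startxref, 4) # xref
puts root_num                    # 1
```

A reader works backwards in the same way: it finds `%%EOF` and `startxref` at the end of the file, jumps to the cross-reference data, and follows `/Root` from the trailer.
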
### PDF Objects

The body contains PDF objects. Each object has:
- An object number and generation number (e.g., `5 0 obj`)
- Content (dictionary, array, stream, etc.)
- An `endobj` marker

Example:
```
5 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>
endobj
```

## PDF Dictionaries

**PDF dictionaries are the heart of PDF structure.** They're delimited by double angle brackets:

```
<< /Key1 value1 /Key2 value2 /Key3 value3 >>
```

Think of them like JSON objects or Ruby hashes, but with PDF-specific syntax:
- Keys are PDF names (always start with `/`)
- Values can be: names, strings, numbers, booleans, `null`, arrays, dictionaries, or object references
- Whitespace is generally ignored (but required between tokens)

### Dictionary Examples

**Simple dictionary:**
```
<< /Type /Page /Width 612 /Height 792 >>
```

**Nested dictionary:**
```
<<
  /Type /Annot
  /Subtype /Widget
  /Rect [100 500 200 520]
  /AP <<
    /N <<
      /Yes 10 0 R
      /Off 11 0 R
    >>
  >>
>>
```

**Dictionary with array:**
```
<< /Kids [5 0 R 6 0 R 7 0 R] >>
```

**Dictionary with string values:**
```
<< /Title (My Document) /Author (John Doe) >>
```

The parentheses `()` denote literal strings in PDF syntax. Hex strings use single angle brackets: `<...>`.

## PDF Text-Based Syntax

Despite being "binary" files, PDFs use text-based syntax for their structure. This means:

1. **Dictionaries are text**: `<< ... >>` are just character sequences
2. **Arrays are text**: `[ ... ]` are just character sequences
3. **References are text**: `5 0 R` means "object 5, generation 0"
4. **Strings can be text or hex**: `(Hello)` or `<48656C6C6F>`

### Why This Matters

Because PDF dictionaries are just text with delimiters (`<<`, `>>`), we can parse them using **simple text traversal algorithms**. There is no complex parser generator and no AST construction, just these steps:

1. Find opening `<<`
2. Track nesting depth by counting `<<` and `>>`
3. When depth reaches zero, we've found a complete dictionary
4. Repeat

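These four steps translate almost directly into code. The sketch below is a simplified illustration, not the library's implementation: `each_dictionary_sketch` is a hypothetical name, and it deliberately ignores `<<` or `>>` sequences that might appear inside strings or stream data.

```ruby
def each_dictionary_sketch(text)
  return enum_for(:each_dictionary_sketch, text) unless block_given?

  pos = 0
  while (start = text.index("<<", pos)) # 1. find an opening <<
    depth = 0
    i = start
    finish = nil
    while i < text.length
      case text[i, 2]
      when "<<" # 2. track nesting depth
        depth += 1
        i += 2
      when ">>"
        depth -= 1
        i += 2
        if depth.zero? # 3. depth is back at zero: complete dictionary
          finish = i
          break
        end
      else
        i += 1
      end
    end
    break unless finish # unbalanced input: stop scanning

    yield text[start...finish]
    pos = finish # 4. repeat after this dictionary
  end
end

text = "1 0 obj\n<< /A << /B 1 >> /C 2 >>\nendobj\n2 0 obj\n<< /D 3 >>\nendobj"
each_dictionary_sketch(text) { |dict| puts dict }
```

Nested dictionaries are returned as part of their parent, so the example yields two top-level dictionaries rather than three.
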
## PDF Object References

PDFs use references to link objects together:
```
5 0 R
```

This means:
- Object number: `5`
- Generation number: `0` (it stays 0 unless the object number has been freed and reused, so it is 0 in most files)
- `R` means "reference"

When you see `/Parent 5 0 R`, it means the `Parent` key references object 5.

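For illustration, indirect references can be picked out of a dictionary with a small regular expression. `REF_PATTERN` and `parse_refs` are hypothetical names for this sketch, which does not try to skip strings or stream data.

```ruby
REF_PATTERN = /\b(\d+)\s+(\d+)\s+R\b/

# Returns [object_number, generation] pairs for every "N G R" found in text.
def parse_refs(text)
  text.scan(REF_PATTERN).map { |num, gen| [num.to_i, gen.to_i] }
end

p parse_refs("<< /Parent 5 0 R /Kids [6 0 R 7 0 R] >>")
# => [[5, 0], [6, 0], [7, 0]]
```

The `[number, generation]` pair has the same shape as the `ref` keys used elsewhere in these docs, for example `[5, 0]`.
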
## PDF Arrays

Arrays are whitespace-separated lists in square brackets:
```
[0 0 612 792]
```

They can contain any PDF value type:
```
[5 0 R 6 0 R]
[/Yes /Off]
[(Hello) (World)]
```

## PDF Strings

PDF strings come in two flavors:

### Literal Strings (parentheses)
```
(Hello World)
(Line 1\nLine 2)
```

Can contain escape sequences: `\n`, `\r`, `\t`, `\b`, `\f`, `\(`, `\)`, `\\`, and octal character codes such as `\123`.

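A minimal sketch of unescaping a literal string follows, assuming the token still includes its surrounding parentheses. `unescape_literal` and `ESCAPES` are hypothetical names. The sketch handles the escapes listed above and is only exercised on ASCII input.

```ruby
ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f",
  "(" => "(", ")" => ")", "\\" => "\\"
}.freeze

def unescape_literal(token)
  inner = token[1...-1] # drop the surrounding parentheses
  inner.gsub(/\\([0-7]{1,3}|\r?\n|.)/m) do
    code = Regexp.last_match(1)
    if code.match?(/\A[0-7]+\z/)
      (code.to_i(8) & 0xFF).chr # octal character code, e.g. \123
    elsif code.end_with?("\n")
      ""                        # backslash + newline continues the line
    else
      ESCAPES.fetch(code, code) # unknown escape: the backslash is dropped
    end
  end
end

puts unescape_literal('(Line 1\nLine 2)') # prints two lines
puts unescape_literal('(a \(b\) \123)')   # a (b) S
```
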
### Hex Strings (angle brackets)
```
<48656C6C6F>
<FEFF00480065006C006C006F>
```

A string whose first two bytes are the byte order mark `FEFF` is encoded as UTF-16BE. Both examples above decode to "Hello".

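Both forms can be decoded with a few lines of Ruby. This is a sketch under simple assumptions: `decode_hex_string` is a hypothetical helper, it only recognizes the UTF-16BE byte order mark, and it returns other strings as raw bytes without applying PDFDocEncoding.

```ruby
def decode_hex_string(token)
  hex = token.delete_prefix("<").delete_suffix(">").gsub(/\s+/, "")
  hex += "0" if hex.length.odd? # a missing final digit is treated as 0
  bytes = [hex].pack("H*")      # binary string

  if bytes.start_with?("\xFE\xFF".b)
    # Drop the BOM, then transcode UTF-16BE to UTF-8.
    bytes.byteslice(2..-1).force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  else
    bytes
  end
end

puts decode_hex_string("<48656C6C6F>")               # Hello
puts decode_hex_string("<FEFF00480065006C006C006F>") # Hello
```
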
## PDF Names

PDF names start with `/`:
```
/Type
/Subtype
/Widget
```

Names can contain most characters except whitespace and the delimiters `( ) < > [ ] { } / %`. Other characters can be written as `#` followed by two hex digits.

## Stream Objects

Some PDF objects contain **streams** (binary or text data):
```
10 0 obj
<< /Length 100 /Filter /FlateDecode >>
stream
[compressed binary data here]
endstream
endobj
```

For parsing structure (dictionaries), we typically strip or ignore stream bodies because they can contain arbitrary binary data that would confuse text-based parsing.

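One way to do that is to blank out everything between `stream` and `endstream` before scanning. The helper below is a hypothetical sketch. It relies on the keywords alone, whereas a stricter reader would use the dictionary's `/Length` value, since compressed data could in principle contain the bytes `endstream`.

```ruby
# Removes each stream body, keeping the two keywords as markers.
# Works on a binary copy so arbitrary bytes in the stream cannot raise encoding errors.
def strip_stream_bodies(pdf_bytes)
  pdf_bytes.b.gsub(/\bstream\r?\n.*?\r?\nendstream\b/mn, "stream\nendstream")
end

raw = "10 0 obj\n<< /Length 6 >>\nstream\n\x00>>\xFF<<\nendstream\nendobj\n"
puts strip_stream_bodies(raw)
```

In this example the stream body contains stray `>>` and `<<` bytes, which is exactly the kind of data that would mislead a depth-counting dictionary scan if it were left in place.
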
## Why CorpPdf Works

`CorpPdf` works because **PDF dictionaries are just text patterns**. Despite looking complicated, the algorithms are straightforward:

### Finding Dictionaries

The `each_dictionary` method:
1. Searches for `<<` (start of dictionary)
2. Tracks nesting depth: `<<` increments, `>>` decrements
3. When depth returns to 0, we've found a complete dictionary
4. Yield it and continue searching

This is **pure text traversal**, with no PDF-specific knowledge beyond "dictionaries use `<<` and `>>`".

### Extracting Values

The `value_token_after` method:
1. Finds a key (like `/V`)
2. Skips whitespace
3. Based on the next character, extracts the value:
   - `(` → Extract literal string (handle escaping)
   - `<` → Extract hex string or dictionary
   - `[` → Extract array (match brackets)
   - `/` → Extract name
   - Otherwise → Extract atom (number, reference, etc.)

Again, this is just **text pattern matching** with some bracket/depth tracking.

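The dispatch on the first character can be sketched as follows. This is an illustration of the idea under simplifying assumptions, not the library's code: `value_token_after_sketch`, `balanced_span`, and `literal_string_span` are hypothetical names, and nested brackets are matched without looking inside strings.

```ruby
# Returns the text from `start` through the delimiter that balances it, or nil.
def balanced_span(text, start, open, close)
  depth = 0
  i = start
  while i < text.length
    if text[i, open.length] == open
      depth += 1
      i += open.length
    elsif text[i, close.length] == close
      depth -= 1
      i += close.length
      return text[start...i] if depth.zero?
    else
      i += 1
    end
  end
  nil
end

# Returns a literal string token, honoring backslash escapes and nested parentheses.
def literal_string_span(text, start)
  depth = 0
  i = start
  while i < text.length
    case text[i]
    when "\\" then i += 1 # skip the escaped character
    when "(" then depth += 1
    when ")"
      depth -= 1
      return text[start..i] if depth.zero?
    end
    i += 1
  end
  nil
end

def value_token_after_sketch(key, dict)
  # 1. Find the key, followed by whitespace or a delimiter (so /V does not match /Value).
  match = /#{Regexp.escape(key)}(?=[\s(<\[\/])/.match(dict)
  return nil unless match

  i = match.end(0)
  i += 1 while dict[i]&.match?(/\s/) # 2. skip whitespace
  rest = dict[i..-1].to_s

  # 3. Dispatch on the first character of the value.
  case dict[i]
  when "(" then literal_string_span(dict, i)
  when "<" then dict[i, 2] == "<<" ? balanced_span(dict, i, "<<", ">>") : rest[/\A<[^>]*>/]
  when "[" then balanced_span(dict, i, "[", "]")
  when "/" then rest[%r{\A/[^\s<>\[\]()/]+}]
  else rest[/\A\d+\s+\d+\s+R\b|\A[^\s<>\[\]()\/]+/]
  end
end

dict = "<< /Type /Annot /V (Hello World) /Rect [100 500 200 520] " \
       "/Parent 3 0 R /AP << /N << /Yes 10 0 R >> >> >>"
%w[/Type /V /Rect /Parent /AP].each do |key|
  puts "#{key} -> #{value_token_after_sketch(key, dict)}"
end
```
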
### Why It Seems Complicated

The complexity comes from:
1. **Handling edge cases**: Escaped characters, nested structures, various value types
2. **Preserving exact formatting**: When replacing values, we must maintain valid PDF syntax
3. **Encoding/decoding**: PDF strings have special encoding rules (UTF-16BE BOM, escapes)
4. **Safety checks**: Verifying dictionaries are still valid after modification

But the **core concept** is simple: PDF dictionaries are text, so we can parse them with text traversal.

## Example: Walking Through a PDF Dictionary

Given this PDF dictionary text:
```
<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>
```

How `CorpPdf` would parse it:

1. **`each_dictionary` finds it:**
   - Finds `<<` at position 0
   - Depth: 0 → 1 (after `<<`)
   - Scans forward...
   - Finds `>>` at position 74
   - Depth: 1 → 0
   - Yields: `"<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>"`

2. **`value_token_after("/V", dict)` extracts value:**
   - Finds `/V` (followed by space)
   - Skips whitespace
   - Next char is `(`, so extract literal string
   - Scan forward, handle escaping, match closing `)`
   - Returns: `"(Hello World)"`

3. **`decode_pdf_string("(Hello World)")` decodes:**
   - Starts with `(`, ends with `)`
   - Extract inner: `"Hello World"`
   - Unescape (no escapes here)
   - Check for UTF-16BE BOM (none)
   - Return: `"Hello World"`

## Conclusion

PDF files are **structured text files** with binary data embedded in streams. The structure itself (dictionaries, arrays, strings, references) is all text-based syntax. This is why `CorpPdf` can use simple text traversal to parse and modify PDF dictionaries without needing a full PDF parser.

The apparent complexity in `CorpPdf` comes from:
- Handling PDF's various value types
- Proper encoding/decoding of strings
- Careful preservation of structure during edits
- Edge case handling (escaping, nesting, etc.)

But the **fundamental approach** is elegantly simple: treat PDF dictionaries as text patterns and parse them with character-by-character traversal.

data/issues/README.md
ADDED
@@ -0,0 +1,59 @@
# Code Review Issues

This folder contains documentation of code cleanup, refactoring opportunities, and improvement tasks found in the codebase.

## Files

- **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
- **[memory-improvements.md](./memory-improvements.md)** - Memory usage issues and optimization opportunities for handling larger PDF documents

## Summary

### High Priority Issues
1. **Widget Matching Logic** - Duplicated across 6+ locations
2. **/Annots Array Manipulation** - Complex logic duplicated in 3 locations

### Medium Priority Issues
3. **Box Parsing Logic** - Repeated code blocks for 5 box types
4. **Checkbox Appearance Creation** - Significant duplication in new code
5. **PDF Metadata Formatting** - Could benefit from being shared utilities

### Low Priority Issues
6. Duplicated `next_fresh_object_number` implementation (may be intentional)
7. Object reference extraction pattern duplication
8. Unused method: `get_widget_rect_dimensions`
9. Base64 decoding logic duplication

### Completed ✅
- **Page-Finding Logic** - Successfully refactored into `DictScan.is_page?` and unified page-finding methods

## Quick Stats

- **10 refactoring opportunities** identified (1 completed, 9 remaining)
- **6+ locations** with widget matching duplication
- **3 locations** with /Annots array manipulation duplication
- **1 unused method** found
- **2 new issues** identified in recent code additions

## Memory & Performance

### Memory Improvement Opportunities

See **[memory-improvements.md](./memory-improvements.md)** for detailed analysis of memory usage and optimization strategies.

**Key Issues:**
- Duplicate PDF loading (2x memory usage)
- Stream decompression cache retention
- All-objects-in-memory operations
- Multiple full PDF copies during write operations

**Estimated Impact:** Typical usage is 50-90 MB for a 10 MB PDF, and it can exceed 100-200 MB for larger or more complex PDFs (39+ pages).

## Next Steps

1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
2. Review [memory-improvements.md](./memory-improvements.md) for memory optimization strategies
3. Prioritize improvements based on maintenance and performance needs
4. Create test coverage before refactoring
5. Implement improvements incrementally, starting with high-priority items