acro_that 0.1.0

@@ -0,0 +1,341 @@
1
+ # DictScan Explained: Text Traversal in Action
2
+
3
+ ## The Big Picture
4
+
5
+ `DictScan` is a module that appears complicated at first glance, but it's fundamentally just **text traversal**—walking through PDF files character by character to find and extract dictionary structures.
6
+
7
+ This document explains how each function in `DictScan` works and why the text-traversal approach is both powerful and straightforward.
8
+
9
+ ## Core Principle
10
+
11
+ **PDF dictionaries are text patterns.** They use `<<` and `>>` as delimiters, just like how programming languages use `{` and `}` or `[` and `]`. Once you recognize this, parsing becomes a matter of tracking depth and matching delimiters.
12
+
13
+ ## Function-by-Function Guide
14
+
15
+ ### `strip_stream_bodies(pdf)`
16
+
17
+ **Purpose:** Remove binary stream data that would confuse text parsing.
18
+
19
+ **How it works:**
20
+ - Finds all `stream...endstream` blocks using regex
21
+ - Replaces the binary content with a placeholder
22
+ - Preserves the stream structure markers
23
+
24
+ **Why:** Streams can contain arbitrary binary data (compressed images, fonts, etc.) that would break our text-based parsing. We strip them out since we're only interested in the dictionary structure.
25
+
26
+ ```ruby
27
+ pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
28
+ ```
29
+
30
+ This is regex-based, but it's necessary preprocessing before we can safely do text traversal.
31
+
32
+ ---
33
+
34
+ ### `each_dictionary(str)`
35
+
36
+ **Purpose:** Iterate through all dictionaries in a string.
37
+
38
+ **Algorithm:**
39
+ 1. Find the first `<<` at position `i`
40
+ 2. Initialize depth counter to 0
41
+ 3. Scan forward:
42
+ - If we see `<<`, increment depth
43
+ - If we see `>>`, decrement depth
44
+ - If depth reaches 0, we've found a complete dictionary
45
+ 4. Yield the dictionary substring
46
+ 5. Continue from where we left off
47
+
48
+ **Example:**
49
+ ```
50
+ Input: "<< /A 1 >> << /B 2 >>"
51
+ i=0: find "<<"
52
+ depth=1, scan forward
53
+ see ">>", depth=0 → found "<< /A 1 >>"
54
+ yield and continue from i=11
55
+ i=11: find "<<"
56
+ depth=1, scan forward
57
+ see ">>", depth=0 → found "<< /B 2 >>"
58
+ yield and continue
59
+ ```
60
+
61
+ **Why it works:** This is classic **bracket matching**. No PDF-specific knowledge needed—just counting delimiters.
62
+
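The depth-counting loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: `each_dictionary_sketch` is a hypothetical name, nested dictionaries are yielded only as part of their outermost dictionary, and `<<` or `>>` inside literal strings (or a `>` that closes a hex string right before `>>`) are not handled.

```ruby
# Hypothetical helper (not DictScan's own code): yields every outermost
# "<< ... >>" span in +str+ by counting delimiter depth.
def each_dictionary_sketch(str)
  i = 0
  while (start = str.index("<<", i))
    depth = 0
    j = start
    finish = nil
    while j < str.length
      pair = str[j, 2]
      if pair == "<<"
        depth += 1
        j += 2
      elsif pair == ">>"
        depth -= 1
        j += 2
        if depth.zero?
          finish = j # exclusive end of the dictionary
          break
        end
      else
        j += 1
      end
    end
    break unless finish # unbalanced input: stop rather than guess

    yield str[start...finish]
    i = finish
  end
end

dicts = []
each_dictionary_sketch("<< /A 1 >> << /B << /C 2 >> >>") { |d| dicts << d }
```

Here `dicts` ends up holding two strings: the flat dictionary and the outer dictionary with its nested one still inside.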
63
+ ---
64
+
65
+ ### `unescape_literal(s)`
66
+
67
+ **Purpose:** Decode PDF escape sequences in literal strings.
68
+
69
+ **PDF escapes:**
70
+ - `\n` → newline
71
+ - `\r` → carriage return
72
+ - `\t` → tab
73
+ - `\b` → backspace
74
+ - `\f` → form feed
75
+ - `\(` → literal `(`
75
+ - `\)` → literal `)`
77
+ - `\123` → octal character (up to 3 digits)
78
+
79
+ **Algorithm:** Character-by-character scan:
80
+ 1. If we see `\`, look ahead one character
81
+ 2. Map escape sequences to actual characters
82
+ 3. Handle octal sequences (1-3 digits)
83
+ 4. Otherwise, copy character as-is
84
+
85
+ **Why it works:** This is standard escape sequence handling, identical to how many programming languages handle string literals.
86
+
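As a rough sketch of the character-by-character scan described above (an illustration only; `unescape_literal_sketch` and `SIMPLE_ESCAPES` are hypothetical names, and the real `unescape_literal` may treat unknown escapes or line continuations differently):

```ruby
# Hypothetical sketch of PDF literal-string unescaping.
# Works on raw bytes so that octal escapes above 0x7F are safe to append.
SIMPLE_ESCAPES = { "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f" }.freeze

def unescape_literal_sketch(s)
  src = s.b          # binary copy of the input
  out = String.new   # empty binary string
  i = 0
  while i < src.length
    ch = src[i]
    if ch != "\\"
      out << ch
      i += 1
      next
    end

    nxt = src[i + 1]
    if nxt.nil?
      i += 1 # trailing lone backslash: drop it
    elsif SIMPLE_ESCAPES.key?(nxt)
      out << SIMPLE_ESCAPES[nxt]
      i += 2
    elsif nxt.match?(/[0-7]/)
      digits = src[i + 1, 3][/\A[0-7]{1,3}/] # 1 to 3 octal digits
      out << (digits.to_i(8) & 0xFF).chr
      i += 1 + digits.length
    else
      out << nxt # covers \( \) \\ and any unknown escape
      i += 2
    end
  end
  out
end

decoded = unescape_literal_sketch('Hello\nWorld \(x\) \101')
```

The result is a binary-encoded string; a caller would still decide how to interpret the bytes (for example, checking for a UTF-16BE BOM as the next section describes).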
87
+ ---
88
+
89
+ ### `decode_pdf_string(token)`
90
+
91
+ **Purpose:** Decode a PDF string token into a Ruby string.
92
+
93
+ **PDF string types:**
94
+ 1. **Literal:** `(Hello World)` or `(Hello\nWorld)`
95
+ 2. **Hex:** `<48656C6C6F>` or `<FEFF00480065006C006C006F>`
96
+
97
+ **Algorithm:**
98
+ 1. Check if token starts with `(` → literal string
99
+ - Extract content between parentheses
100
+ - Unescape using `unescape_literal`
101
+ - Check for UTF-16BE BOM (`FE FF`)
102
+ - Decode accordingly
103
+ 2. Check if token starts with `<` → hex string
104
+ - Remove spaces, pad if odd length
105
+ - Convert hex to bytes
106
+ - Check for UTF-16BE BOM
107
+ - Decode accordingly
108
+ 3. Otherwise, return as-is (name, number, reference, etc.)
109
+
110
+ **Why it works:** PDF strings have well-defined formats. We just pattern-match on the delimiters and decode accordingly.
111
+
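The hex-string branch of the algorithm above might look like the following. This is a sketch under assumptions: `decode_hex_token_sketch` is a hypothetical name, and returning the raw bytes when no BOM is present is only one plausible reading of "decode accordingly".

```ruby
# Hypothetical sketch: decode a hex-string token such as "<48656C6C6F>",
# honoring a leading UTF-16BE byte order mark (FE FF).
def decode_hex_token_sketch(token)
  hex = token[1...-1].gsub(/\s+/, "")   # drop the angle brackets and any whitespace
  hex << "0" if hex.length.odd?         # PDF treats a missing final digit as 0
  bytes = [hex].pack("H*")              # hex digits -> binary string
  if bytes.start_with?("\xFE\xFF".b)
    text = bytes.byteslice(2, bytes.bytesize - 2)
    text.force_encoding("UTF-16BE").encode("UTF-8")
  else
    bytes
  end
end

plain   = decode_hex_token_sketch("<48656C6C6F>")
unicode = decode_hex_token_sketch("<FEFF00480065006C006C006F>")
```

Both calls decode to the same five characters; only the second goes through the UTF-16BE path.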
112
+ ---
113
+
114
+ ### `encode_pdf_string(val)`
115
+
116
+ **Purpose:** Encode a Ruby value into a PDF string token.
117
+
118
+ **Handles:**
119
+ - `true` → `"true"`
120
+ - `false` → `"false"`
121
+ - `Symbol` → `"/symbol_name"`
122
+ - `String`:
123
+ - ASCII-only → literal string `(value)`
124
+ - Non-ASCII → hex string with UTF-16BE encoding
125
+
126
+ **Why it works:** Reverse of `decode_pdf_string`—we know the target format and encode accordingly.
127
+
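A minimal sketch of the mapping listed above, assuming a hypothetical name `encode_pdf_string_sketch`. Escaping of backslashes and parentheses in the ASCII branch and the `to_s` fallback are assumptions added so the output stays well-formed; the source does not say whether the real method does the same.

```ruby
# Hypothetical sketch of encoding a Ruby value as a PDF token.
def encode_pdf_string_sketch(val)
  case val
  when true then "true"
  when false then "false"
  when Symbol then "/#{val}"
  when String
    if val.ascii_only?
      # Assumption: escape \ ( ) so the literal string stays balanced
      "(" + val.gsub(/[\\()]/) { |c| "\\#{c}" } + ")"
    else
      # Non-ASCII: hex string, UTF-16BE with a leading BOM (FE FF)
      hex = val.encode("UTF-16BE").bytes.map { |b| format("%02X", b) }.join
      "<FEFF#{hex}>"
    end
  else
    val.to_s # assumption: anything else is emitted as-is
  end
end

ascii_token   = encode_pdf_string_sketch("Hello")
unicode_token = encode_pdf_string_sketch("\u00E9")
```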
128
+ ---
129
+
130
+ ### `value_token_after(key, dict_src)`
131
+
132
+ **Purpose:** Extract the value token that follows a key in a dictionary.
133
+
134
+ **This is the heart of text traversal.** Here's how it works:
135
+
136
+ 1. **Find the key:**
137
+ ```ruby
138
+ match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
139
+ ```
140
+ Use regex to ensure the key is followed by a delimiter (whitespace, `(`, `<`, `[`, or `/`). This prevents partial matches.
141
+
142
+ 2. **Skip whitespace:**
143
+ ```ruby
144
+ i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
145
+ ```
146
+
147
+ 3. **Switch on the next character:**
148
+ - **`(` → Literal string:**
149
+ - Track depth of parentheses
150
+ - Handle escaped characters (skip `\` and next char)
151
+ - Match closing `)` when depth returns to 0
152
+ - **`<` → Hex string or dictionary:**
153
+ - If `<<` → return `"<<"` (nested dictionary marker)
154
+ - Otherwise, find matching `>`
155
+ - **`[` → Array:**
156
+ - Track depth of brackets
157
+ - Match closing `]` when depth returns to 0
158
+ - **`/` → PDF name:**
159
+ - Extract until whitespace or delimiter
160
+ - **Otherwise → Atom:**
161
+ - Extract until whitespace or delimiter (number, reference, boolean, etc.)
162
+
163
+ **Why it works:** PDF has well-defined token syntax. Each value type has distinct delimiters, so we can pattern-match on the first character and extract accordingly.
164
+
165
+ **Example:**
166
+ ```
167
+ Dict: "<< /V (Hello) /R [1 2 3] >>"
168
+ value_token_after("/V", dict):
169
+ → Finds "/V" at position 3
170
+ → Skips space
171
+ → Next char is "("
172
+ → Extracts "(Hello)" using paren matching
173
+ → Returns "(Hello)"
174
+ ```
175
+
176
+ ---
177
+
178
+ ### `replace_key_value(dict_src, key, new_token)`
179
+
180
+ **Purpose:** Replace a key's value in a dictionary string.
181
+
182
+ **Algorithm:**
183
+ 1. Find the key using pattern matching
184
+ 2. Extract the existing value token using `value_token_after`
185
+ 3. Find exact byte positions:
186
+ - Key start/end
187
+ - Value start/end
188
+ 4. Replace using string slicing: `before + new_token + after`
189
+ 5. Verify dictionary is still valid (contains `<<` and `>>`)
190
+
191
+ **Why it works:** We use **precise byte positions** rather than regex replacement. This:
192
+ - Preserves exact formatting (whitespace, etc.)
193
+ - Avoids regex edge cases
194
+ - Is deterministic and safe
195
+
196
+ **Example:**
197
+ ```
198
+ Input: "<< /V (Old) /X 1 >>"
199
+ before: "<< /V "
200
+ value: "(Old)"
201
+ after: " /X 1 >>"
202
+ Output: "<< /V (New) /X 1 >>"
203
+ ```
204
+
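The position-based splice can be sketched as follows. This is a simplified illustration, not the library's code: `replace_value_sketch` is a hypothetical name, it only understands literal-string values and simple atoms or names, and it works with character indices (which equal byte offsets for ASCII input).

```ruby
# Hypothetical sketch: replace the value that follows +key+ by slicing
# around its exact start and end positions.
def replace_value_sketch(dict_src, key, new_token)
  m = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
  return dict_src unless m

  # Value starts after the key and any whitespace
  v_start = m.end(0)
  v_start += 1 while v_start < dict_src.length && dict_src[v_start].match?(/\s/)

  v_end =
    if dict_src[v_start] == "("
      depth = 0
      i = v_start
      while i < dict_src.length
        case dict_src[i]
        when "\\" then i += 1 # skip the escaped character
        when "(" then depth += 1
        when ")"
          depth -= 1
          break if depth.zero?
        end
        i += 1
      end
      i + 1 # one past the closing parenthesis
    else
      # Atom or name: runs until whitespace or a delimiter
      dict_src.index(%r{[\s<>\[\]()/]}, v_start + 1) || dict_src.length
    end

  dict_src[0...v_start] + new_token + dict_src[v_end..-1].to_s
end

updated = replace_value_sketch("<< /V (Old) /X 1 >>", "/V", "(New)")
```

Everything outside the value span is copied through untouched, which is why the surrounding whitespace survives.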
205
+ ---
206
+
207
+ ### `upsert_key_value(dict_src, key, token)`
208
+
209
+ **Purpose:** Insert a key-value pair if the key doesn't exist.
210
+
211
+ **Algorithm:**
212
+ - If key not found, insert after the opening `<<`
213
+ - Uses simple string substitution: `"<<#{key} #{token}"`
214
+
215
+ **Why it works:** Simple string manipulation when we know the key doesn't exist.
216
+
217
+ ---
218
+
219
+ ### `remove_ref_from_array(array_body, ref)` and `add_ref_to_array(array_body, ref)`
220
+
221
+ **Purpose:** Manipulate object references in PDF arrays.
222
+
223
+ **Algorithm:**
224
+ - Use regex/gsub to find and replace reference patterns: `"5 0 R"`
225
+ - Handle edge cases: empty arrays, spacing
226
+
227
+ **Why it works:** Object references have a fixed format (`num gen R`), so we can pattern-match and replace.
228
+
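One way to sketch these two helpers, assuming `ref` is an `[object_number, generation]` pair and `array_body` includes the surrounding brackets. Both assumptions, and the names `remove_ref_sketch` and `add_ref_sketch`, are hypothetical; the real methods may accept different shapes.

```ruby
# Hypothetical sketch: remove "num gen R" from an array body and tidy spacing.
def remove_ref_sketch(array_body, ref)
  num, gen = ref
  array_body
    .gsub(/(?<!\d)#{num}\s+#{gen}\s+R\b/, "") # lookbehind avoids matching "15 0 R" for 5
    .gsub(/\[\s+/, "[")
    .gsub(/\s+\]/, "]")
    .gsub(/\s{2,}/, " ")
end

# Hypothetical sketch: append "num gen R", handling the empty-array case.
def add_ref_sketch(array_body, ref)
  num, gen = ref
  entry = "#{num} #{gen} R"
  inner = array_body.strip[1...-1].strip
  inner.empty? ? "[#{entry}]" : "[#{inner} #{entry}]"
end

without_five = remove_ref_sketch("[5 0 R 15 0 R 6 0 R]", [5, 0])
with_seven   = add_ref_sketch(without_five, [7, 0])
```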
229
+ ---
230
+
231
+ ## Why This Approach Works
232
+
233
+ ### 1. PDF Structure is Text-Based
234
+
235
+ PDF dictionaries, arrays, strings, and references are all defined using text syntax. No binary parsing needed for structure.
236
+
237
+ ### 2. Delimiters Are Unique
238
+
239
+ Each PDF value type has distinct delimiters:
240
+ - Dictionaries: `<<` `>>`
241
+ - Arrays: `[` `]`
242
+ - Literal strings: `(` `)`
243
+ - Hex strings: `<` `>`
244
+ - Names: `/`
245
+ - References: trailing `R` keyword (as in `5 0 R`)
246
+
247
+ We can pattern-match on these to extract values.
248
+
249
+ ### 3. Depth Tracking is Simple
250
+
251
+ Nested structures (dictionaries, arrays, strings) can be parsed by tracking depth—increment on open, decrement on close. Standard algorithm from compiler theory.
252
+
253
+ ### 4. Position-Based Replacement is Safe
254
+
255
+ When modifying dictionaries, we use exact byte positions rather than regex replacement. This:
256
+ - Preserves formatting
257
+ - Avoids edge cases
258
+ - Is predictable
259
+
260
+ ### 5. No Full Parser Needed
261
+
262
+ We don't need to:
263
+ - Build an AST
264
+ - Validate the entire PDF structure
265
+ - Handle all PDF features
266
+
267
+ We only need to:
268
+ - Find dictionaries
269
+ - Extract values
270
+ - Replace values
271
+ - Preserve structure
272
+
273
+ This is a **minimal parser** that does exactly what we need.
274
+
275
+ ## Common Patterns
276
+
277
+ ### Pattern 1: Find and Extract
278
+
279
+ ```ruby
280
+ # Find all dictionaries
281
+ each_dictionary(pdf_text) do |dict|
282
+ # Extract a value
283
+ value_token = value_token_after("/V", dict)
284
+ value = decode_pdf_string(value_token)
285
+ puts value
286
+ end
287
+ ```
288
+
289
+ ### Pattern 2: Find and Replace
290
+
291
+ ```ruby
292
+ # Get dictionary
293
+ dict = "<< /V (Old) >>"
294
+
295
+ # Replace value
296
+ new_dict = replace_key_value(dict, "/V", "(New)")
297
+
298
+ # Result: "<< /V (New) >>"
299
+ ```
300
+
301
+ ### Pattern 3: Encode and Insert
302
+
303
+ ```ruby
304
+ # Prepare new value
305
+ token = encode_pdf_string("Hello")
306
+
307
+ # Insert into dictionary
308
+ dict = upsert_key_value(dict, "/V", token)
309
+ ```
310
+
311
+ ## Performance Considerations
312
+
313
+ **Why this is fast:**
314
+ - No AST construction
315
+ - No full PDF parsing
316
+ - Direct string manipulation
317
+ - Minimal memory allocation
318
+
319
+ **Trade-offs:**
320
+ - Doesn't validate entire PDF structure
321
+ - Assumes dictionaries are well-formed
322
+ - Stream stripping is regex-based (could be optimized)
323
+
324
+ ## Conclusion
325
+
326
+ `DictScan` appears complicated because it handles many edge cases and value types, but the **core approach is elegantly simple**:
327
+
328
+ 1. PDF dictionaries are text patterns
329
+ 2. Parse them with character-by-character traversal
330
+ 3. Track depth for nested structures
331
+ 4. Use precise positions for replacement
332
+
333
+ No magic, no complex parsers—just careful text traversal with attention to PDF syntax rules.
334
+
335
+ The complexity you see is:
336
+ - **Edge case handling** (escaping, nesting, encoding)
337
+ - **Safety checks** (verification, error handling)
338
+ - **Support for multiple value types** (strings, arrays, dictionaries, references)
339
+
340
+ But the **fundamental algorithm** is straightforward: find delimiters, track depth, extract substrings, replace substrings.
341
+
@@ -0,0 +1,311 @@
1
+ # PDF Object Streams Explained
2
+
3
+ ## Overview
4
+
5
+ PDF Object Streams (also called "ObjStm") are a compression feature in PDFs that allows multiple PDF objects to be stored together in a single compressed stream. This reduces file size and improves performance. `AcroThat` handles object streams transparently, so you don't need to worry about them when working with PDF objects—but understanding how they work helps explain how the library parses PDFs.
6
+
7
+ ## What Are Object Streams?
8
+
9
+ Instead of storing objects individually in the PDF body:
10
+
11
+ ```
12
+ 5 0 obj
13
+ << /Type /Annot /Subtype /Widget >>
14
+ endobj
15
+
16
+ 6 0 obj
17
+ << /Type /Annot /Subtype /Widget >>
18
+ endobj
19
+
20
+ 7 0 obj
21
+ << /Type /Annot /Subtype /Widget >>
22
+ endobj
23
+ ```
24
+
25
+ Object streams allow multiple objects to be **packed together** in a single compressed stream:
26
+
27
+ ```
28
+ 10 0 obj
29
+ << /Type /ObjStm /N 3 /First 20 >>
30
+ stream
31
+ [compressed header + object bodies]
32
+ endstream
33
+ endobj
34
+ ```
35
+
36
+ Where:
37
+ - `/N 3` means there are 3 objects in this stream
38
+ - `/First 20` means the object data starts at byte offset 20 (the first 20 bytes are the header)
39
+ - The stream contains: header (object numbers + offsets) + object bodies
40
+
41
+ ## Object Stream Structure
42
+
43
+ An object stream consists of two parts:
44
+
45
+ ### 1. Header Section (the first `/First` bytes)
46
+
47
+ The header is a list of space-separated integers:
48
+ ```
49
+ 5 0 6 10 7 25
50
+ ```
51
+
52
+ This means:
53
+ - Object 5 starts at offset 0 (relative to the start of object data)
54
+ - Object 6 starts at offset 10 (relative to the start of object data)
55
+ - Object 7 starts at offset 25 (relative to the start of object data)
56
+
57
+ Format: `obj_num offset obj_num offset ...` (pairs of object number and offset)
58
+
59
+ ### 2. Object Data Section (Starting at /First)
60
+
61
+ The object data section contains the actual object bodies, concatenated together:
62
+ ```
63
+ << /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>
64
+ ```
65
+
66
+ The offsets in the header tell us where each object body starts within this data section.
67
+
68
+ ### Complete Example
69
+
70
+ ```
71
+ 10 0 obj
72
+ << /Type /ObjStm /N 3 /First 20 /Filter /FlateDecode >>
73
+ stream
74
+ [Compressed bytes containing:]
75
+ Header: "5 0 6 10 7 25 "
76
+ Data: "<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>"
77
+ endstream
78
+ endobj
79
+ ```
80
+
81
+ After decompression:
82
+ - Bytes 0-19: Header (`"5 0 6 10 7 25 "`)
83
+ - Bytes 20+: Object data (`"<< /Type /Annot /Subtype /Widget >>..."`)
84
+ - Object 5 body: bytes 20-29 (offset 0 + 20 = 20, next object at offset 10)
85
+ - Object 6 body: bytes 30-44 (offset 10 + 20 = 30, next object at offset 25)
86
+ - Object 7 body: bytes 45+ (offset 25 + 20 = 45)
87
+
88
+ ## How AcroThat Parses Object Streams
89
+
90
+ ### Step 1: Cross-Reference Table Parsing
91
+
92
+ When `ObjectResolver` parses the cross-reference table or xref stream, it identifies objects stored in object streams:
93
+
94
+ ```ruby
95
+ # In parse_xref_stream_records
96
+ when 2 then @entries[ref] ||= Entry.new(type: :in_objstm, objstm_num: f1, objstm_index: f2)
97
+ ```
98
+
99
+ Where:
100
+ - `f1` = object stream number (the container object)
101
+ - `f2` = index within the object stream (0 = first object, 1 = second object, etc.)
102
+
103
+ Example: If object 5 is at index 0 in object stream 10, then `Entry` stores:
104
+ - `type: :in_objstm`
105
+ - `objstm_num: 10` (the object stream container)
106
+ - `objstm_index: 0` (first object in the stream)
107
+
108
+ ### Step 2: Lazy Loading
109
+
110
+ When you request an object body that's in an object stream, `ObjectResolver` calls `load_objstm`:
111
+
112
+ ```ruby
113
+ def load_objstm(container_ref)
114
+ return if @objstm_cache.key?(container_ref) # Already cached
115
+
116
+ # Get the object stream container's body
117
+ body = object_body(container_ref)
118
+
119
+ # Extract dictionary to get /N and /First
120
+ dict_src = extract_dictionary(body)
121
+ n = DictScan.value_token_after("/N", dict_src).to_i
122
+ first = DictScan.value_token_after("/First", dict_src).to_i
123
+
124
+ # Extract and decode stream data
125
+ raw = decode_stream_data(dict_src, extract_stream_body(body))
126
+
127
+ # Parse the object stream
128
+ parsed = AcroThat::ObjStm.parse(raw, n: n, first: first)
129
+
130
+ # Cache the result
131
+ @objstm_cache[container_ref] = parsed
132
+ end
133
+ ```
134
+
135
+ ### Step 3: Stream Decoding
136
+
137
+ The `decode_stream_data` method handles:
138
+ 1. **Extracting stream body**: Removes `stream` and `endstream` keywords
139
+ 2. **Decompression**: If `/Filter /FlateDecode` is present, decompress using zlib
140
+ 3. **PNG Predictor**: If `/Predictor` is present (10-15), apply PNG predictor decoding
141
+
142
+ Example:
143
+ ```ruby
144
+ def decode_stream_data(dict_src, stream_chunk)
145
+ # Extract body between stream...endstream
146
+ body = extract_stream_body(stream_chunk)
147
+
148
+ # Decompress if FlateDecode
149
+ data = if dict_src =~ %r{/Filter\s*/FlateDecode}
150
+ Zlib::Inflate.inflate(body)
151
+ else
152
+ body
153
+ end
154
+
155
+ # Apply PNG predictor if present
156
+ if dict_src =~ %r{/Predictor\s+(\d+)}
157
+ # Decode PNG predictor (Sub, Up, Average, Paeth); `columns` is the row width taken from /Columns
158
+ data = apply_png_predictor(data, columns)
159
+ end
160
+
161
+ data
162
+ end
163
+ ```
164
+
165
+ ### Step 4: Object Stream Parsing (`ObjStm.parse`)
166
+
167
+ The `ObjStm.parse` method is the heart of object stream parsing:
168
+
169
+ ```ruby
170
+ def self.parse(bytes, n:, first:)
171
+ # Extract header (everything before byte offset `first`)
172
+ head = bytes[0...first]
173
+
174
+ # Parse space-separated integers: obj_num offset obj_num offset ...
175
+ entries = head.strip.split(/\s+/).map!(&:to_i)
176
+
177
+ # Extract each object body
178
+ refs = []
179
+ n.times do |i|
180
+ obj = entries[2 * i] # Object number
181
+ off = entries[(2 * i) + 1] # Offset in data section
182
+
183
+ # Calculate next offset (or end of data)
184
+ next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : (bytes.bytesize - first)
185
+
186
+ # Extract object body: start at (first + off), length is (next_off - off)
187
+ body = bytes[first + off, next_off - off]
188
+
189
+ refs << { ref: [obj, 0], body: body }
190
+ end
191
+
192
+ refs
193
+ end
194
+ ```
195
+
196
+ **Step-by-step example:**
197
+
198
+ Given:
199
+ - `bytes` = decompressed stream data
200
+ - `n = 3` (3 objects)
201
+ - `first = 20` (data starts at byte 20)
202
+ - Header: `"5 0 6 10 7 25 "` (14 bytes including the trailing space, padded to 20)
203
+
204
+ Processing:
205
+ 1. Extract header: `bytes[0...20]` → `"5 0 6 10 7 25 "`
206
+ 2. Parse entries: `[5, 0, 6, 10, 7, 25]`
207
+ 3. For `i=0` (first object):
208
+ - `obj = entries[0]` = 5
209
+ - `off = entries[1]` = 0
210
+ - `next_off = entries[3]` = 10 (offset of next object)
211
+ - `body = bytes[20 + 0, 10 - 0]` = bytes[20...30]
212
+ 4. For `i=1` (second object):
213
+ - `obj = entries[2]` = 6
214
+ - `off = entries[3]` = 10
215
+ - `next_off = entries[5]` = 25
216
+ - `body = bytes[20 + 10, 25 - 10]` = bytes[30...45]
217
+ 5. For `i=2` (third object):
218
+ - `obj = entries[4]` = 7
219
+ - `off = entries[5]` = 25
220
+ - `next_off = bytes.bytesize - first` (end of data)
221
+ - `body = bytes[20 + 25, ...]` = bytes[45...end]
222
+
223
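+ Result (bodies are shown in full for readability; with the illustrative offsets above, the first two slices would be only 10 and 15 bytes long):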
224
+ ```ruby
225
+ [
226
+ { ref: [5, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
227
+ { ref: [6, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
228
+ { ref: [7, 0], body: "<< /Type /Annot /Subtype /Widget >>" }
229
+ ]
230
+ ```
231
+
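To see the offset arithmetic line up on real data, the parsing loop can be run against a small hand-built stream. The sketch below is a stand-alone copy of the logic shown earlier under the hypothetical name `parse_objstm_sketch`; the three tiny object bodies are invented for the example, and the header offsets are computed from their actual lengths. It indexes characters, which equals byte indexing here because the data is ASCII.

```ruby
# Hypothetical stand-alone copy of the object stream parsing loop.
def parse_objstm_sketch(bytes, n:, first:)
  entries = bytes[0...first].strip.split(/\s+/).map(&:to_i)
  Array.new(n) do |i|
    off = entries[(2 * i) + 1]
    # Next object's offset, or the end of the data section for the last one
    next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : bytes.bytesize - first
    { ref: [entries[2 * i], 0], body: bytes[first + off, next_off - off] }
  end
end

# Build a consistent stream: each offset is the sum of the preceding body lengths
bodies  = ["<< /A 1 >>", "<< /B 22 >>", "<< /C 333 >>"]
offsets = bodies.each_index.map { |i| bodies[0...i].sum(&:bytesize) }
header  = [5, 6, 7].zip(offsets).flatten.join(" ") + " "
stream  = header + bodies.join

parsed = parse_objstm_sketch(stream, n: 3, first: header.bytesize)
```

With bodies of 10, 11, and 12 bytes the header becomes `"5 0 6 10 7 21 "`, `first` is 14, and each parsed body matches the string it was built from.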
232
+ ### Step 5: Object Retrieval
233
+
234
+ When `object_body(ref)` is called for an object in an object stream:
235
+
236
+ ```ruby
237
+ def object_body(ref)
238
+ case (e = @entries[ref])&.type
239
+ when :in_file
240
+ # Regular object: extract from file
241
+ extract_from_file(e.offset)
242
+ when :in_objstm
243
+ # Object stream: load stream (if not cached), then get object by index
244
+ load_objstm([e.objstm_num, 0])
245
+ @objstm_cache[[e.objstm_num, 0]][e.objstm_index][:body]
246
+ end
247
+ end
248
+ ```
249
+
250
+ The index (`objstm_index`) tells us which object in the parsed array to return.
251
+
252
+ ## Why Object Streams Matter
253
+
254
+ ### Benefits
255
+
256
+ 1. **File Size**: Compressing multiple objects together is more efficient than compressing each individually
257
+ 2. **Performance**: Fewer objects to parse when opening the PDF
258
+ 3. **Common in Modern PDFs**: Most PDFs created by modern tools use object streams
259
+
260
+ ### Transparency in AcroThat
261
+
262
+ `AcroThat` handles object streams automatically:
263
+ - You don't need to know if an object is in a stream or not
264
+ - `object_body(ref)` returns the object body the same way regardless
265
+ - Object streams are cached after first load (no repeated parsing)
266
+ - The same `DictScan` methods work on extracted object bodies
267
+
268
+ ## Cross-Reference Streams vs Object Streams
269
+
270
+ **Important distinction:**
271
+
272
+ 1. **XRef Streams** (`/Type /XRef`): Used to find where objects are located in the PDF
273
+ - Contains byte offsets or references to object streams
274
+ - Replaces classic xref tables
275
+
276
+ 2. **Object Streams** (`/Type /ObjStm`): Used to store actual object bodies
277
+ - Contains compressed object dictionaries
278
+ - Referenced by xref streams or classic xref tables
279
+
280
+ Both use the same stream format (compressed, potentially with PNG predictor), but serve different purposes.
281
+
282
+ ## PNG Predictor
283
+
284
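+ PNG prediction is a filtering step applied before compression: each byte is stored as its difference from a value predicted from neighboring bytes, which makes the data compress better. Each row starts with a byte naming the filter type used for that row. `AcroThat` supports all 5 PNG filter types: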
285
+
286
+ 1. **Type 0 (None)**: No prediction
287
+ 2. **Type 1 (Sub)**: Predict from left
288
+ 3. **Type 2 (Up)**: Predict from above
289
+ 4. **Type 3 (Average)**: Predict from average of left and above
290
+ 5. **Type 4 (Paeth)**: Predict using Paeth algorithm
291
+
292
+ The `apply_png_predictor` method decodes predictor-encoded data row by row, using the `/Columns` parameter to determine row width.
293
+
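Row-by-row decoding can be sketched as below. This is an illustration under assumptions rather than the library's `apply_png_predictor`: the names `png_unpredict_sketch` and `paeth_sketch` are hypothetical, and the code assumes one byte per sample, so "left" means the previous byte in the same row.

```ruby
# Hypothetical sketch: undo PNG row filters. Each row is one filter-type
# byte followed by +columns+ data bytes.
def png_unpredict_sketch(data, columns)
  prev = Array.new(columns, 0) # the row above the first row is all zeros
  out = []
  data.bytes.each_slice(columns + 1) do |filter, *row|
    cur = []
    row.each_with_index do |byte, x|
      left    = x.zero? ? 0 : cur[x - 1]
      up      = prev[x]
      up_left = x.zero? ? 0 : prev[x - 1]
      predicted =
        case filter
        when 0 then 0
        when 1 then left
        when 2 then up
        when 3 then (left + up) / 2
        when 4 then paeth_sketch(left, up, up_left)
        else raise ArgumentError, "unknown PNG filter #{filter}"
        end
      cur << ((byte + predicted) & 0xFF)
    end
    out.concat(cur)
    prev = cur
  end
  out.pack("C*")
end

# Paeth predictor: pick whichever neighbor is closest to left + up - up_left
def paeth_sketch(a, b, c)
  est = a + b - c
  pa = (est - a).abs
  pb = (est - b).abs
  pc = (est - c).abs
  if pa <= pb && pa <= pc then a
  elsif pb <= pc then b
  else c
  end
end

# Three rows of width 3, using filters None, Up, and Sub
encoded = [0, 1, 2, 3, 2, 1, 1, 1, 1, 5, 1, 1].pack("C*")
decoded = png_unpredict_sketch(encoded, 3).bytes
```

In this example the decoded rows are `1 2 3`, `2 3 4`, and `5 6 7`.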
294
+ ## Summary
295
+
296
+ Object streams allow PDFs to store multiple objects in compressed streams. `AcroThat` handles them by:
297
+
298
+ 1. **Identifying** objects in streams via xref parsing
299
+ 2. **Lazy loading** stream containers when needed
300
+ 3. **Decoding** compressed stream data (zlib + PNG predictor)
301
+ 4. **Parsing** the header to extract object offsets
302
+ 5. **Extracting** individual object bodies by offset
303
+ 6. **Caching** parsed streams for performance
304
+
305
+ The parsing itself is straightforward:
306
+ - Header is space-separated integers (object numbers and offsets)
307
+ - Object data follows the header
308
+ - Extract each object body using its offset
309
+
310
+ Just like `DictScan`, object stream parsing is **text traversal**—once the stream is decompressed, it's just parsing space-separated numbers and extracting substrings by offset.
311
+