acro_that 0.1.0
- checksums.yaml +7 -0
- data/.DS_Store +0 -0
- data/.gitignore +8 -0
- data/.rubocop.yml +78 -0
- data/Gemfile +5 -0
- data/Gemfile.lock +86 -0
- data/README.md +360 -0
- data/Rakefile +18 -0
- data/acro_that.gemspec +34 -0
- data/docs/README.md +99 -0
- data/docs/dict_scan_explained.md +341 -0
- data/docs/object_streams.md +311 -0
- data/docs/pdf_structure.md +251 -0
- data/lib/acro_that/actions/add_field.rb +278 -0
- data/lib/acro_that/actions/add_signature_appearance.rb +422 -0
- data/lib/acro_that/actions/base.rb +44 -0
- data/lib/acro_that/actions/remove_field.rb +158 -0
- data/lib/acro_that/actions/update_field.rb +301 -0
- data/lib/acro_that/dict_scan.rb +413 -0
- data/lib/acro_that/document.rb +331 -0
- data/lib/acro_that/field.rb +143 -0
- data/lib/acro_that/incremental_writer.rb +244 -0
- data/lib/acro_that/object_resolver.rb +376 -0
- data/lib/acro_that/objstm.rb +75 -0
- data/lib/acro_that/pdf_writer.rb +97 -0
- data/lib/acro_that/version.rb +5 -0
- data/lib/acro_that.rb +24 -0
- metadata +143 -0
@@ -0,0 +1,341 @@
# DictScan Explained: Text Traversal in Action

## The Big Picture

`DictScan` is a module that appears complicated at first glance, but it's fundamentally just **text traversal**—walking through PDF files character by character to find and extract dictionary structures.

This document explains how each function in `DictScan` works and why the text-traversal approach is both powerful and straightforward.

## Core Principle

**PDF dictionaries are text patterns.** They use `<<` and `>>` as delimiters, just like how programming languages use `{` and `}` or `[` and `]`. Once you recognize this, parsing becomes a matter of tracking depth and matching delimiters.

## Function-by-Function Guide

### `strip_stream_bodies(pdf)`

**Purpose:** Remove binary stream data that would confuse text parsing.

**How it works:**
- Finds all `stream...endstream` blocks using a regex
- Replaces the binary content with a placeholder
- Preserves the stream structure markers

**Why:** Streams can contain arbitrary binary data (compressed images, fonts, etc.) that would break text-based parsing. We strip them out since we're only interested in the dictionary structure.

```ruby
pdf.gsub(/stream\r?\n.*?endstream/mi) { "stream\nENDSTREAM_STRIPPED\nendstream" }
```

This is regex-based, but it's necessary preprocessing before we can safely do text traversal.

---

### `each_dictionary(str)`

**Purpose:** Iterate through all dictionaries in a string.

**Algorithm:**
1. Find the first `<<` at position `i`
2. Initialize a depth counter to 0
3. Scan forward:
   - If we see `<<`, increment depth
   - If we see `>>`, decrement depth
   - If depth reaches 0, we've found a complete dictionary
4. Yield the dictionary substring
5. Continue from where we left off

**Example:**
```
Input: "<< /A 1 >> << /B 2 >>"
i=0:  find "<<"
      depth=1, scan forward
      see ">>", depth=0 → found "<< /A 1 >>"
      yield and continue from i=11
i=11: find "<<"
      depth=1, scan forward
      see ">>", depth=0 → found "<< /B 2 >>"
      yield and continue
```

**Why it works:** This is classic **bracket matching**. No PDF-specific knowledge needed—just counting delimiters.
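The bracket-matching loop can be sketched in a few lines. This is a simplified illustration, not necessarily the gem's exact code, and it ignores `<<`/`>>` sequences that might appear inside string literals:

```ruby
# Simplified sketch of depth-tracked dictionary scanning.
def each_dictionary(str)
  i = 0
  while (start = str.index("<<", i))
    depth = 0
    j = start
    while j < str.length
      if str[j, 2] == "<<"
        depth += 1
        j += 2
      elsif str[j, 2] == ">>"
        depth -= 1
        j += 2
        break if depth.zero? # matched the outermost << >>
      else
        j += 1
      end
    end
    yield str[start...j]
    i = j # continue scanning after this dictionary
  end
end

each_dictionary("<< /A 1 >> << /B << /C 2 >> >>") { |d| puts d }
# prints "<< /A 1 >>" then "<< /B << /C 2 >> >>"
```

Note that the nested `<< /C 2 >>` is kept inside its parent rather than yielded separately, because depth only returns to zero at the outermost `>>`.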
---

### `unescape_literal(s)`

**Purpose:** Decode PDF escape sequences in literal strings.

**PDF escapes:**
- `\n` → newline
- `\r` → carriage return
- `\t` → tab
- `\b` → backspace
- `\f` → form feed
- `\(` → literal `(`
- `\)` → literal `)`
- `\123` → octal character code (up to 3 digits)

**Algorithm:** Character-by-character scan:
1. If we see `\`, look ahead one character
2. Map escape sequences to actual characters
3. Handle octal sequences (1-3 digits)
4. Otherwise, copy the character as-is

**Why it works:** This is standard escape-sequence handling, identical to how many programming languages handle string literals.
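That scan can be sketched as follows (a simplified version; the real method may differ in how it treats malformed escapes):

```ruby
# Escape table for PDF literal strings.
ESCAPES = { "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b",
            "f" => "\f", "(" => "(", ")" => ")", "\\" => "\\" }.freeze

def unescape_literal(s)
  out = +""
  i = 0
  while i < s.length
    if s[i] == "\\"
      nxt = s[i + 1]
      if nxt =~ /[0-7]/                       # octal escape: 1-3 digits
        digits = s[i + 1, 3][/\A[0-7]{1,3}/]
        out << digits.to_i(8).chr
        i += 1 + digits.length
      elsif ESCAPES.key?(nxt)
        out << ESCAPES[nxt]
        i += 2
      else                                    # unknown escape: keep the char
        out << nxt.to_s
        i += 2
      end
    else
      out << s[i]
      i += 1
    end
  end
  out
end

unescape_literal('Hello\\nWorld \\(ok\\) \\101')  # => "Hello\nWorld (ok) A"
```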
---

### `decode_pdf_string(token)`

**Purpose:** Decode a PDF string token into a Ruby string.

**PDF string types:**
1. **Literal:** `(Hello World)` or `(Hello\nWorld)`
2. **Hex:** `<48656C6C6F>` or `<FEFF00480065006C006C006F>`

**Algorithm:**
1. Check if the token starts with `(` → literal string
   - Extract the content between the parentheses
   - Unescape it using `unescape_literal`
   - Check for a UTF-16BE BOM (`FE FF`)
   - Decode accordingly
2. Check if the token starts with `<` → hex string
   - Remove spaces, pad if odd length
   - Convert hex to bytes
   - Check for a UTF-16BE BOM
   - Decode accordingly
3. Otherwise, return it as-is (name, number, reference, etc.)

**Why it works:** PDF strings have well-defined formats. We just pattern-match on the delimiters and decode accordingly.
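The hex branch can be sketched like this (`decode_hex_string` is a hypothetical helper name for illustration; the library handles this inside `decode_pdf_string`):

```ruby
def decode_hex_string(token)
  hex = token.delete("<> \t\r\n")      # strip delimiters and whitespace
  hex += "0" if hex.length.odd?        # spec: pad a trailing odd digit with 0
  bytes = [hex].pack("H*")             # hex digits -> raw bytes
  if bytes.start_with?("\xFE\xFF".b)   # UTF-16BE BOM
    bytes[2..].force_encoding("UTF-16BE").encode("UTF-8")
  else
    bytes
  end
end

decode_hex_string("<48656C6C6F>")                # => "Hello"
decode_hex_string("<FEFF00480065006C006C006F>")  # => "Hello" (UTF-16BE)
```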
---

### `encode_pdf_string(val)`

**Purpose:** Encode a Ruby value into a PDF string token.

**Handles:**
- `true` → `"true"`
- `false` → `"false"`
- `Symbol` → `"/symbol_name"`
- `String`:
  - ASCII-only → literal string `(value)`
  - Non-ASCII → hex string with UTF-16BE encoding

**Why it works:** Reverse of `decode_pdf_string`—we know the target format and encode accordingly.
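A sketch of those encoding rules (an assumed shape, not the gem's exact code): ASCII strings become literal strings with `(`, `)`, and `\` escaped; anything else becomes a BOM-prefixed UTF-16BE hex string.

```ruby
def encode_pdf_string(val)
  case val
  when true, false then val.to_s
  when Symbol      then "/#{val}"
  when String
    if val.ascii_only?
      # escape backslashes and parentheses inside the literal
      "(#{val.gsub(/[\\()]/) { |c| "\\#{c}" }})"
    else
      utf16 = "\xFE\xFF".b + val.encode("UTF-16BE").b  # BOM + UTF-16BE bytes
      "<#{utf16.unpack1("H*").upcase}>"
    end
  end
end

encode_pdf_string(true)   # => "true"
encode_pdf_string(:Off)   # => "/Off"
encode_pdf_string("é")    # => "<FEFF00E9>"
```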
---

### `value_token_after(key, dict_src)`

**Purpose:** Extract the value token that follows a key in a dictionary.

**This is the heart of text traversal.** Here's how it works:

1. **Find the key:**
   ```ruby
   match = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])})
   ```
   The regex ensures the key is followed by a delimiter (whitespace, `(`, `<`, `[`, or `/`). This prevents partial matches, such as `/V` matching inside `/Value`.

2. **Skip whitespace:**
   ```ruby
   i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
   ```

3. **Switch on the next character:**
   - **`(` → Literal string:**
     - Track parenthesis depth
     - Handle escaped characters (skip `\` and the next char)
     - Match the closing `)` when depth returns to 0
   - **`<` → Hex string or dictionary:**
     - If `<<` → return `"<<"` (nested dictionary marker)
     - Otherwise, find the matching `>`
   - **`[` → Array:**
     - Track bracket depth
     - Match the closing `]` when depth returns to 0
   - **`/` → PDF name:**
     - Extract until whitespace or a delimiter
   - **Otherwise → Atom:**
     - Extract until whitespace or a delimiter (number, reference, boolean, etc.)

**Why it works:** PDF has well-defined token syntax. Each value type has distinct delimiters, so we can pattern-match on the first character and extract accordingly.

**Example:**
```
Dict: "<< /V (Hello) /R [1 2 3] >>"
value_token_after("/V", dict):
  → Finds "/V" at position 3
  → Skips the space
  → Next char is "("
  → Extracts "(Hello)" using paren matching
  → Returns "(Hello)"
```
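A condensed sketch of that dispatch, covering literal strings, arrays, names, and atoms (the real method also handles hex strings and nested dictionaries):

```ruby
def value_token_after(key, dict_src)
  m = dict_src.match(%r{#{Regexp.escape(key)}(?=[\s(<\[/])}) or return nil
  i = m.end(0)
  i += 1 while i < dict_src.length && dict_src[i] =~ /\s/
  case dict_src[i]
  when "("                              # literal string: track paren depth
    j = i
    depth = 0
    loop do
      case dict_src[j]
      when "\\" then j += 1             # skip the escaped character
      when "("  then depth += 1
      when ")"  then depth -= 1
      end
      j += 1
      break if depth.zero?
    end
    dict_src[i...j]
  when "["                              # array: track bracket depth
    j = i
    depth = 0
    loop do
      depth += 1 if dict_src[j] == "["
      depth -= 1 if dict_src[j] == "]"
      j += 1
      break if depth.zero?
    end
    dict_src[i...j]
  when "/"                              # PDF name
    dict_src[i..][%r{\A/[^\s\]>/(<\[]*}]
  else                                  # atom: number, reference, boolean, ...
    dict_src[i..][%r{\A[^\s\]>/]+}]
  end
end

dict = "<< /V (Hello) /R [1 2 3] >>"
value_token_after("/V", dict)  # => "(Hello)"
value_token_after("/R", dict)  # => "[1 2 3]"
```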
---

### `replace_key_value(dict_src, key, new_token)`

**Purpose:** Replace a key's value in a dictionary string.

**Algorithm:**
1. Find the key using pattern matching
2. Extract the existing value token using `value_token_after`
3. Find exact byte positions:
   - Key start/end
   - Value start/end
4. Replace using string slicing: `before + new_token + after`
5. Verify the dictionary is still valid (contains `<<` and `>>`)

**Why it works:** We use **precise byte positions** rather than regex replacement. This:
- Preserves exact formatting (whitespace, etc.)
- Avoids regex edge cases
- Is deterministic and safe

**Example:**
```
Input:  "<< /V (Old) /X 1 >>"
before: "<< /V "
value:  "(Old)"
after:  " /X 1 >>"
Output: "<< /V (New) /X 1 >>"
```
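A deliberately naive sketch of the position-based splice. For brevity it treats the old value as a whitespace-delimited token; the real method extracts it with `value_token_after`, so values containing spaces are handled correctly.

```ruby
def replace_key_value(dict_src, key, new_token)
  key_at = dict_src.index(key) or return dict_src
  rest = dict_src[(key_at + key.length)..]
  ws        = rest[/\A\s*/]            # whitespace between key and value
  old_token = rest[/\A\s*\S+/].to_s    # naive token grab, illustration only
  value_start = key_at + key.length + ws.length
  value_end   = key_at + key.length + old_token.length
  # splice by exact byte positions: before + new_token + after
  dict_src[0...value_start] + new_token + dict_src[value_end..]
end

replace_key_value("<< /V (Old) /X 1 >>", "/V", "(New)")
# => "<< /V (New) /X 1 >>"
```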
---

### `upsert_key_value(dict_src, key, token)`

**Purpose:** Insert a key-value pair if the key doesn't exist.

**Algorithm:**
- If the key is not found, insert the pair right after the opening `<<`
- Uses simple string substitution: `"<<#{key} #{token}"`

**Why it works:** Simple string manipulation when we know the key doesn't exist.
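A sketch of the upsert (assumed shape): leave the dictionary untouched when the key is already present, otherwise splice the pair in after the opening `<<`.

```ruby
def upsert_key_value(dict_src, key, token)
  return dict_src if dict_src.include?(key)  # key exists: nothing to insert
  dict_src.sub("<<", "<< #{key} #{token}")
end

upsert_key_value("<< /X 1 >>", "/V", "(New)")
# => "<< /V (New) /X 1 >>"
```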
---

### `remove_ref_from_array(array_body, ref)` and `add_ref_to_array(array_body, ref)`

**Purpose:** Manipulate object references in PDF arrays.

**Algorithm:**
- Use regex/`gsub` to find and replace reference patterns: `"5 0 R"`
- Handle edge cases: empty arrays, spacing

**Why it works:** Object references have a fixed format (`num gen R`), so we can pattern-match and replace.
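Sketches of both helpers, assuming `ref` is a `[num, gen]` pair (illustrative shapes, not the gem's exact code):

```ruby
def remove_ref_from_array(array_body, ref)
  num, gen = ref
  # \b keeps "1 0 R" from matching inside "21 0 R"
  array_body.gsub(/\s*\b#{num}\s+#{gen}\s+R\b/, "").sub("[ ", "[")
end

def add_ref_to_array(array_body, ref)
  num, gen = ref
  inner = array_body[1..-2].strip          # contents between [ and ]
  entry = "#{num} #{gen} R"
  "[#{inner.empty? ? entry : "#{inner} #{entry}"}]"
end

remove_ref_from_array("[1 0 R 2 0 R]", [1, 0])  # => "[2 0 R]"
add_ref_to_array("[2 0 R]", [5, 0])             # => "[2 0 R 5 0 R]"
```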
---

## Why This Approach Works

### 1. PDF Structure is Text-Based

PDF dictionaries, arrays, strings, and references are all defined using text syntax. No binary parsing is needed for the structure.

### 2. Delimiters Are Unique

Each PDF value type has distinct delimiters:
- Dictionaries: `<<` `>>`
- Arrays: `[` `]`
- Literal strings: `(` `)`
- Hex strings: `<` `>`
- Names: `/`
- References: `R`

We can pattern-match on these to extract values.

### 3. Depth Tracking is Simple

Nested structures (dictionaries, arrays, strings) can be parsed by tracking depth—increment on open, decrement on close. It's a standard algorithm from compiler theory.

### 4. Position-Based Replacement is Safe

When modifying dictionaries, we use exact byte positions rather than regex replacement. This:
- Preserves formatting
- Avoids edge cases
- Is predictable

### 5. No Full Parser Needed

We don't need to:
- Build an AST
- Validate the entire PDF structure
- Handle all PDF features

We only need to:
- Find dictionaries
- Extract values
- Replace values
- Preserve structure

This is a **minimal parser** that does exactly what we need.

## Common Patterns

### Pattern 1: Find and Extract

```ruby
# Find all dictionaries
each_dictionary(pdf_text) do |dict|
  # Extract a value
  value_token = value_token_after("/V", dict)
  value = decode_pdf_string(value_token)
  puts value
end
```

### Pattern 2: Find and Replace

```ruby
# Get dictionary
dict = "<< /V (Old) >>"

# Replace value
new_dict = replace_key_value(dict, "/V", "(New)")

# Result: "<< /V (New) >>"
```

### Pattern 3: Encode and Insert

```ruby
# Prepare new value
token = encode_pdf_string("Hello")

# Insert into dictionary
dict = upsert_key_value(dict, "/V", token)
```

## Performance Considerations

**Why this is fast:**
- No AST construction
- No full PDF parsing
- Direct string manipulation
- Minimal memory allocation

**Trade-offs:**
- Doesn't validate the entire PDF structure
- Assumes dictionaries are well-formed
- Stream stripping is regex-based (could be optimized)

## Conclusion

`DictScan` appears complicated because it handles many edge cases and value types, but the **core approach is elegantly simple**:

1. PDF dictionaries are text patterns
2. Parse them with character-by-character traversal
3. Track depth for nested structures
4. Use precise positions for replacement

No magic, no complex parsers—just careful text traversal with attention to PDF syntax rules.

The complexity you see is:
- **Edge case handling** (escaping, nesting, encoding)
- **Safety checks** (verification, error handling)
- **Support for multiple value types** (strings, arrays, dictionaries, references)

But the **fundamental algorithm** is straightforward: find delimiters, track depth, extract substrings, replace substrings.
@@ -0,0 +1,311 @@
# PDF Object Streams Explained

## Overview

PDF Object Streams (also called "ObjStm") are a compression feature that allows multiple PDF objects to be stored together in a single compressed stream. This reduces file size and improves performance. `AcroThat` handles object streams transparently, so you don't need to worry about them when working with PDF objects—but understanding how they work helps explain how the library parses PDFs.

## What Are Object Streams?

Instead of storing objects individually in the PDF body:

```
5 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

6 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

7 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj
```

object streams allow multiple objects to be **packed together** in a single compressed stream:

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 >>
stream
[compressed header + object bodies]
endstream
endobj
```

Where:
- `/N 3` means there are 3 objects in this stream
- `/First 20` means the object data starts at byte offset 20 (the first 20 bytes are the header)
- The stream contains the header (object numbers + offsets) followed by the object bodies

## Object Stream Structure

An object stream consists of two parts:

### 1. Header Section (First `/First` Bytes)

The header is a list of space-separated integers:
```
5 0 6 10 7 25
```

This means:
- Object 5 starts at offset 0 (relative to the start of the object data)
- Object 6 starts at offset 10
- Object 7 starts at offset 25

Format: `obj_num offset obj_num offset ...` (pairs of object number and offset)

### 2. Object Data Section (Starting at /First)

The object data section contains the actual object bodies, concatenated together:
```
<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>
```

The offsets in the header tell us where each object body starts within this data section.

### Complete Example

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 /Filter /FlateDecode >>
stream
[Compressed bytes containing:]
Header: "5 0 6 10 7 25 "
Data: "<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>"
endstream
endobj
```

After decompression:
- Bytes 0-19: Header (`"5 0 6 10 7 25 "`)
- Bytes 20+: Object data (`"<< /Type /Annot /Subtype /Widget >>..."`)
- Object 5 body: bytes 20-29 (offset 0 + 20 = 20, next object at offset 10)
- Object 6 body: bytes 30-44 (offset 10 + 20 = 30, next object at offset 25)
- Object 7 body: bytes 45+ (offset 25 + 20 = 45)

## How AcroThat Parses Object Streams

### Step 1: Cross-Reference Table Parsing

When `ObjectResolver` parses the cross-reference table or xref stream, it identifies objects stored in object streams:

```ruby
# In parse_xref_stream_records
when 2 then @entries[ref] ||= Entry.new(type: :in_objstm, objstm_num: f1, objstm_index: f2)
```

Where:
- `f1` = object stream number (the container object)
- `f2` = index within the object stream (0 = first object, 1 = second object, etc.)

Example: If object 5 is at index 0 in object stream 10, then `Entry` stores:
- `type: :in_objstm`
- `objstm_num: 10` (the object stream container)
- `objstm_index: 0` (first object in the stream)

### Step 2: Lazy Loading

When you request an object body that's in an object stream, `ObjectResolver` calls `load_objstm`:

```ruby
def load_objstm(container_ref)
  return if @objstm_cache.key?(container_ref) # Already cached

  # Get the object stream container's body
  body = object_body(container_ref)

  # Extract dictionary to get /N and /First
  dict_src = extract_dictionary(body)
  n = DictScan.value_token_after("/N", dict_src).to_i
  first = DictScan.value_token_after("/First", dict_src).to_i

  # Extract and decode stream data
  raw = decode_stream_data(dict_src, extract_stream_body(body))

  # Parse the object stream
  parsed = AcroThat::ObjStm.parse(raw, n: n, first: first)

  # Cache the result
  @objstm_cache[container_ref] = parsed
end
```

### Step 3: Stream Decoding

The `decode_stream_data` method handles:
1. **Extracting the stream body**: Removes the `stream` and `endstream` keywords
2. **Decompression**: If `/Filter /FlateDecode` is present, decompress using zlib
3. **PNG predictor**: If `/Predictor` is present (10-15), apply PNG predictor decoding

Example:
```ruby
def decode_stream_data(dict_src, stream_chunk)
  # Extract body between stream...endstream
  body = extract_stream_body(stream_chunk)

  # Decompress if FlateDecode
  data = if dict_src =~ %r{/Filter\s*/FlateDecode}
           Zlib::Inflate.inflate(body)
         else
           body
         end

  # Apply PNG predictor if present
  if dict_src =~ %r{/Predictor\s+(\d+)}
    # Decode PNG predictor (Sub, Up, Average, Paeth)
    data = apply_png_predictor(data, columns)
  end

  data
end
```
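The `/FlateDecode` filter is the zlib deflate format, so the decompression step really is a single `Zlib::Inflate` call. A round trip shows the idea:

```ruby
require "zlib"

# A toy object-stream payload: header followed by concatenated bodies.
data = "5 0 6 10 7 25 " + ("<< /Type /Annot /Subtype /Widget >>" * 3)
compressed = Zlib::Deflate.deflate(data)
Zlib::Inflate.inflate(compressed) == data  # => true
```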
### Step 4: Object Stream Parsing (`ObjStm.parse`)

The `ObjStm.parse` method is the heart of object stream parsing:

```ruby
def self.parse(bytes, n:, first:)
  # Extract the header (first `first` bytes)
  head = bytes[0...first]

  # Parse space-separated integers: obj_num offset obj_num offset ...
  entries = head.strip.split(/\s+/).map!(&:to_i)

  # Extract each object body
  refs = []
  n.times do |i|
    obj = entries[2 * i]       # Object number
    off = entries[(2 * i) + 1] # Offset in data section

    # Calculate next offset (or end of data)
    next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : (bytes.bytesize - first)

    # Extract object body: start at (first + off), length is (next_off - off)
    body = bytes[first + off, next_off - off]

    refs << { ref: [obj, 0], body: body }
  end

  refs
end
```

**Step-by-step example:**

Given:
- `bytes` = decompressed stream data
- `n = 3` (3 objects)
- `first = 20` (data starts at byte 20)
- Header: `"5 0 6 10 7 25 "` (14 bytes, padded to 20)

Processing:
1. Extract the header: `bytes[0...20]` → `"5 0 6 10 7 25 "`
2. Parse the entries: `[5, 0, 6, 10, 7, 25]`
3. For `i=0` (first object):
   - `obj = entries[0]` = 5
   - `off = entries[1]` = 0
   - `next_off = entries[3]` = 10 (offset of the next object)
   - `body = bytes[20 + 0, 10 - 0]` = bytes[20...30]
4. For `i=1` (second object):
   - `obj = entries[2]` = 6
   - `off = entries[3]` = 10
   - `next_off = entries[5]` = 25
   - `body = bytes[20 + 10, 25 - 10]` = bytes[30...45]
5. For `i=2` (third object):
   - `obj = entries[4]` = 7
   - `off = entries[5]` = 25
   - `next_off = bytes.bytesize - first` (end of data)
   - `body = bytes[20 + 25, ...]` = bytes[45...end]

Result:
```ruby
[
  { ref: [5, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [6, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [7, 0], body: "<< /Type /Annot /Subtype /Widget >>" }
]
```
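The parse logic above runs as-is. Here is a self-contained demo (module name is illustrative) using a toy stream built so that the header offsets line up exactly with three 10-byte bodies:

```ruby
module ObjStmDemo
  def self.parse(bytes, n:, first:)
    entries = bytes[0...first].strip.split(/\s+/).map!(&:to_i)
    Array.new(n) do |i|
      obj = entries[2 * i]
      off = entries[(2 * i) + 1]
      next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : (bytes.bytesize - first)
      { ref: [obj, 0], body: bytes[first + off, next_off - off] }
    end
  end
end

bodies = ["<< /A 1 >>", "<< /B 2 >>", "<< /C 3 >>"]  # 10 bytes each
header = "5 0 6 10 7 20".ljust(20)                   # padded to /First = 20
stream = header + bodies.join

ObjStmDemo.parse(stream, n: 3, first: 20)
# => refs [5,0], [6,0], [7,0], each paired with its 10-byte body
```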
### Step 5: Object Retrieval

When `object_body(ref)` is called for an object in an object stream:

```ruby
def object_body(ref)
  case (e = @entries[ref])&.type
  when :in_file
    # Regular object: extract from the file
    extract_from_file(e.offset)
  when :in_objstm
    # Object stream: load the stream (if not cached), then get the object by index
    load_objstm([e.objstm_num, 0])
    @objstm_cache[[e.objstm_num, 0]][e.objstm_index][:body]
  end
end
```

The index (`objstm_index`) tells us which object in the parsed array to return.

## Why Object Streams Matter

### Benefits

1. **File size**: Compressing multiple objects together is more efficient than compressing each individually
2. **Performance**: Fewer objects to parse when opening the PDF
3. **Common in modern PDFs**: Most PDFs created by modern tools use object streams

### Transparency in AcroThat

`AcroThat` handles object streams automatically:
- You don't need to know whether an object is in a stream or not
- `object_body(ref)` returns the object body the same way regardless
- Object streams are cached after the first load (no repeated parsing)
- The same `DictScan` methods work on extracted object bodies

## Cross-Reference Streams vs Object Streams

**Important distinction:**

1. **XRef streams** (`/Type /XRef`): Used to find where objects are located in the PDF
   - Contain byte offsets or references to object streams
   - Replace classic xref tables

2. **Object streams** (`/Type /ObjStm`): Used to store the actual object bodies
   - Contain compressed object dictionaries
   - Are referenced by xref streams or classic xref tables

Both use the same stream format (compressed, potentially with a PNG predictor), but they serve different purposes.

## PNG Predictor

The PNG predictor is a pre-compression filter that encodes each byte relative to neighboring bytes so that the data compresses better. `AcroThat` supports all 5 PNG predictor types:

1. **Type 0 (None)**: No prediction
2. **Type 1 (Sub)**: Predict from the byte to the left
3. **Type 2 (Up)**: Predict from the byte above
4. **Type 3 (Average)**: Predict from the average of left and above
5. **Type 4 (Paeth)**: Predict using the Paeth algorithm

The `apply_png_predictor` method decodes predictor-encoded data row by row, using the `/Columns` parameter to determine the row width.
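A minimal sketch of one row pass, handling only the Up predictor (type 2), which is the variant most commonly emitted for xref streams; the gem's `apply_png_predictor` handles all five types. Each encoded row is one filter-type byte followed by `columns` data bytes, and decoding adds each byte to the byte directly above it.

```ruby
def decode_up_predictor(data, columns)
  rows = data.bytes.each_slice(columns + 1).to_a  # each row: filter byte + columns
  prev = Array.new(columns, 0)                    # the row "above" the first row is zeros
  out = []
  rows.each do |row|
    filter, *bytes = row
    raise "only Up (2) handled in this sketch" unless filter == 2
    prev = bytes.each_with_index.map { |b, i| (b + prev[i]) & 0xFF }
    out.concat(prev)
  end
  out.pack("C*")
end

# Row 1 decodes to 1, 2, 3; row 2 adds its deltas to the row above.
decode_up_predictor([2, 1, 2, 3, 2, 1, 1, 1].pack("C*"), 3)
# => "\x01\x02\x03\x02\x03\x04"
```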
## Summary

Object streams allow PDFs to store multiple objects in compressed streams. `AcroThat` handles them by:

1. **Identifying** objects in streams via xref parsing
2. **Lazy loading** stream containers when needed
3. **Decoding** compressed stream data (zlib + PNG predictor)
4. **Parsing** the header to extract object offsets
5. **Extracting** individual object bodies by offset
6. **Caching** parsed streams for performance

The parsing itself is straightforward:
- The header is space-separated integers (object numbers and offsets)
- The object data follows the header
- Each object body is extracted using its offset

Just like `DictScan`, object stream parsing is **text traversal**—once the stream is decompressed, it's just parsing space-separated numbers and extracting substrings by offset.