corp_pdf 1.0.5

# PDF Object Streams Explained

## Overview

PDF object streams (also called "ObjStm") are a compression feature in PDFs that allows multiple PDF objects to be stored together in a single compressed stream. This reduces file size and improves performance. `CorpPdf` handles object streams transparently, so you don't need to worry about them when working with PDF objects—but understanding how they work helps explain how the library parses PDFs.

## What Are Object Streams?

Instead of storing objects individually in the PDF body:

```
5 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

6 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj

7 0 obj
<< /Type /Annot /Subtype /Widget >>
endobj
```

object streams allow multiple objects to be **packed together** in a single compressed stream:

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 >>
stream
[compressed header + object bodies]
endstream
endobj
```

Where:
- `/N 3` means there are 3 objects in this stream
- `/First 20` means the object data starts at byte offset 20 (the first 20 bytes of the decompressed stream are the header)
- The stream contains a header (object numbers + offsets) followed by the object bodies

## Object Stream Structure

An object stream consists of two parts:

### 1. Header Section (the first `/First` bytes)

The header is a list of space-separated integers:
```
5 0 6 35 7 70
```

This means:
- Object 5 starts at offset 0 (relative to the start of the object data)
- Object 6 starts at offset 35
- Object 7 starts at offset 70

Format: `obj_num offset obj_num offset ...` (pairs of object number and offset)
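
The header format can be parsed with ordinary string operations. A minimal sketch (the numbers below are made up for illustration):

```ruby
# Split an object-stream header into [object_number, offset] pairs.
# The header values here are invented for this example.
header = "12 0 14 42 9 77"
pairs = header.split.map(&:to_i).each_slice(2).to_a
# pairs is [[12, 0], [14, 42], [9, 77]]
```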

### 2. Object Data Section (starting at /First)

The object data section contains the actual object bodies, concatenated together:
```
<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>
```

Each body here is 35 bytes long, which is why the header offsets are 0, 35, and 70. The offsets in the header tell us where each object body starts within this data section.

### Complete Example

```
10 0 obj
<< /Type /ObjStm /N 3 /First 20 /Filter /FlateDecode >>
stream
[Compressed bytes containing:]
Header: "5 0 6 35 7 70 "
Data: "<< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >><< /Type /Annot /Subtype /Widget >>"
endstream
endobj
```

After decompression:
- Bytes 0-19: Header (`"5 0 6 35 7 70 "`, padded out to /First = 20)
- Bytes 20+: Object data (`"<< /Type /Annot /Subtype /Widget >>..."`)
- Object 5 body: bytes 20-54 (offset 0 + 20 = 20; the next object starts at offset 35)
- Object 6 body: bytes 55-89 (offset 35 + 20 = 55; the next object starts at offset 70)
- Object 7 body: bytes 90+ (offset 70 + 20 = 90)
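
The byte arithmetic can be checked mechanically. A self-contained sketch (not library code) that rebuilds the decompressed stream from three identical 35-byte widget bodies and slices one back out:

```ruby
# Rebuild a decompressed object stream and extract one body by offset.
widget = "<< /Type /Annot /Subtype /Widget >>"  # one 35-byte object body
first  = 20
# Offsets are multiples of the body length because the bodies are identical.
header = "5 0 6 #{widget.bytesize} 7 #{2 * widget.bytesize} ".ljust(first)
bytes  = header + widget * 3

# Object 6 starts at first + its offset in the data section.
obj6 = bytes[first + widget.bytesize, widget.bytesize]
# obj6 == widget
```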

## How CorpPdf Parses Object Streams

### Step 1: Cross-Reference Table Parsing

When `ObjectResolver` parses the cross-reference table or xref stream, it identifies objects stored in object streams:

```ruby
# In parse_xref_stream_records
when 2 then @entries[ref] ||= Entry.new(type: :in_objstm, objstm_num: f1, objstm_index: f2)
```

Where:
- `f1` = object stream number (the container object)
- `f2` = index within the object stream (0 = first object, 1 = second object, etc.)

Example: if object 5 is at index 0 in object stream 10, then `Entry` stores:
- `type: :in_objstm`
- `objstm_num: 10` (the object stream container)
- `objstm_index: 0` (first object in the stream)

### Step 2: Lazy Loading

When you request an object body that's in an object stream, `ObjectResolver` calls `load_objstm`:

```ruby
def load_objstm(container_ref)
  return if @objstm_cache.key?(container_ref) # Already cached

  # Get the object stream container's body
  body = object_body(container_ref)

  # Extract the dictionary to get /N and /First
  dict_src = extract_dictionary(body)
  n = DictScan.value_token_after("/N", dict_src).to_i
  first = DictScan.value_token_after("/First", dict_src).to_i

  # Extract and decode the stream data
  raw = decode_stream_data(dict_src, extract_stream_body(body))

  # Parse the object stream
  parsed = CorpPdf::ObjStm.parse(raw, n: n, first: first)

  # Cache the result
  @objstm_cache[container_ref] = parsed
end
```

### Step 3: Stream Decoding

The `decode_stream_data` method handles:
1. **Decompression**: if `/Filter /FlateDecode` is present, decompress using zlib
2. **PNG predictor**: if `/Predictor` is present (10-15), apply PNG predictor decoding

Simplified example (it receives the bytes already extracted from between `stream` and `endstream`):
```ruby
def decode_stream_data(dict_src, body)
  # Decompress if FlateDecode
  data = if dict_src =~ %r{/Filter\s*/FlateDecode}
           Zlib::Inflate.inflate(body)
         else
           body
         end

  # Apply the PNG predictor if present; row width comes from /Columns
  if dict_src =~ %r{/Predictor\s+(\d+)} && $1.to_i >= 10
    columns = dict_src[%r{/Columns\s+(\d+)}, 1].to_i
    data = apply_png_predictor(data, columns)
  end

  data
end
```
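
The FlateDecode step is plain zlib. A quick round trip showing what `Zlib::Inflate.inflate` does to a deflated stream body:

```ruby
require "zlib"

# Deflate some bytes, then inflate them back -- the same call used
# above for /Filter /FlateDecode stream bodies.
raw  = "<< /Type /Annot /Subtype /Widget >>"
body = Zlib::Deflate.deflate(raw)
Zlib::Inflate.inflate(body) == raw  # true
```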

### Step 4: Object Stream Parsing (`ObjStm.parse`)

The `ObjStm.parse` method is the heart of object stream parsing:

```ruby
def self.parse(bytes, n:, first:)
  # Extract the header (the first `first` bytes)
  head = bytes[0...first]

  # Parse space-separated integers: obj_num offset obj_num offset ...
  entries = head.strip.split(/\s+/).map!(&:to_i)

  # Extract each object body
  refs = []
  n.times do |i|
    obj = entries[2 * i]        # Object number
    off = entries[(2 * i) + 1]  # Offset in the data section

    # Calculate the next object's offset (or the end of the data)
    next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : (bytes.bytesize - first)

    # Extract the object body: start at (first + off), length (next_off - off)
    body = bytes[first + off, next_off - off]

    refs << { ref: [obj, 0], body: body }
  end

  refs
end
```

**Step-by-step example:**

Given:
- `bytes` = decompressed stream data
- `n = 3` (3 objects)
- `first = 20` (data starts at byte 20)
- Header: `"5 0 6 35 7 70 "` (14 bytes, padded to 20)

Processing:
1. Extract the header: `bytes[0...20]` → `"5 0 6 35 7 70 "`
2. Parse the entries: `[5, 0, 6, 35, 7, 70]`
3. For `i=0` (first object):
   - `obj = entries[0]` = 5
   - `off = entries[1]` = 0
   - `next_off = entries[3]` = 35 (offset of the next object)
   - `body = bytes[20 + 0, 35 - 0]` = bytes[20...55]
4. For `i=1` (second object):
   - `obj = entries[2]` = 6
   - `off = entries[3]` = 35
   - `next_off = entries[5]` = 70
   - `body = bytes[20 + 35, 70 - 35]` = bytes[55...90]
5. For `i=2` (third object):
   - `obj = entries[4]` = 7
   - `off = entries[5]` = 70
   - `next_off = bytes.bytesize - first` = 105 (end of data)
   - `body = bytes[20 + 70, 105 - 70]` = bytes[90...125]

Result:
```ruby
[
  { ref: [5, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [6, 0], body: "<< /Type /Annot /Subtype /Widget >>" },
  { ref: [7, 0], body: "<< /Type /Annot /Subtype /Widget >>" }
]
```
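
This can be exercised end to end. Below, the parse method is re-typed from the listing above as a free-standing `objstm_parse` (renamed to keep the sketch self-contained), and the input is built so that three 35-byte widget bodies follow a 20-byte header:

```ruby
# Re-typed from the ObjStm.parse listing, as a standalone method.
def objstm_parse(bytes, n:, first:)
  entries = bytes[0...first].strip.split(/\s+/).map!(&:to_i)
  Array.new(n) do |i|
    obj = entries[2 * i]
    off = entries[(2 * i) + 1]
    next_off = i + 1 < n ? entries[(2 * (i + 1)) + 1] : bytes.bytesize - first
    { ref: [obj, 0], body: bytes[first + off, next_off - off] }
  end
end

widget = "<< /Type /Annot /Subtype /Widget >>"    # 35 bytes
bytes  = "5 0 6 35 7 70 ".ljust(20) + widget * 3  # header padded to /First = 20
result = objstm_parse(bytes, n: 3, first: 20)
result.map { |r| r[:ref] }              # [[5, 0], [6, 0], [7, 0]]
result.all? { |r| r[:body] == widget }  # true
```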

### Step 5: Object Retrieval

When `object_body(ref)` is called for an object in an object stream:

```ruby
def object_body(ref)
  case (e = @entries[ref])&.type
  when :in_file
    # Regular object: extract from the file by byte offset
    extract_from_file(e.offset)
  when :in_objstm
    # Object stream: load the stream (if not cached), then get the object by index
    load_objstm([e.objstm_num, 0])
    @objstm_cache[[e.objstm_num, 0]][e.objstm_index][:body]
  end
end
```

The index (`objstm_index`) tells us which entry in the parsed array to return.

## Why Object Streams Matter

### Benefits

1. **File size**: compressing multiple objects together is more efficient than compressing each individually
2. **Performance**: fewer top-level objects to parse when opening the PDF
3. **Common in modern PDFs**: most PDFs created by modern tools use object streams

### Transparency in CorpPdf

`CorpPdf` handles object streams automatically:
- You don't need to know whether an object is in a stream or not
- `object_body(ref)` returns the object body the same way regardless
- Object streams are cached after the first load (no repeated parsing)
- The same `DictScan` methods work on extracted object bodies

## Cross-Reference Streams vs Object Streams

**Important distinction:**

1. **XRef streams** (`/Type /XRef`): used to find where objects are located in the PDF
   - Contain byte offsets or references into object streams
   - Replace classic xref tables

2. **Object streams** (`/Type /ObjStm`): used to store the actual object bodies
   - Contain compressed object dictionaries
   - Are referenced by xref streams or classic xref tables

Both use the same stream machinery (compression, potentially with a PNG predictor), but they serve different purposes.

## PNG Predictor

A PNG predictor is a pre-compression filter that predicts each value from neighboring values so the residuals compress better. `CorpPdf` supports all 5 PNG predictor types:

1. **Type 0 (None)**: no prediction
2. **Type 1 (Sub)**: predict from the byte to the left
3. **Type 2 (Up)**: predict from the byte above
4. **Type 3 (Average)**: predict from the average of left and above
5. **Type 4 (Paeth)**: predict using the Paeth algorithm

The `apply_png_predictor` method decodes predictor-encoded data row by row, using the `/Columns` parameter to determine the row width.
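
As an illustration, here is a minimal decoder for just the Up predictor (type 2), the variant most often seen for xref and object streams. This is a sketch assuming 1-byte components, not the library's `apply_png_predictor`:

```ruby
# Decode PNG "Up" rows: each byte is stored as a delta from the byte
# directly above it. Every row starts with a filter-type byte (2 = Up).
def decode_up_rows(data, columns)
  prev = Array.new(columns, 0)
  out  = []
  data.bytes.each_slice(columns + 1) do |row|
    row.shift # drop the per-row filter byte
    prev = row.each_with_index.map { |b, i| (b + prev[i]) & 0xFF }
    out.concat(prev)
  end
  out.pack("C*")
end

# Two 3-byte rows [1 2 3] and [4 5 6], Up-encoded as deltas:
encoded = [2, 1, 2, 3, 2, 3, 3, 3].pack("C*")
decode_up_rows(encoded, 3).bytes  # => [1, 2, 3, 4, 5, 6]
```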

## Summary

Object streams allow PDFs to store multiple objects in compressed streams. `CorpPdf` handles them by:

1. **Identifying** objects in streams via xref parsing
2. **Lazy loading** stream containers when needed
3. **Decoding** compressed stream data (zlib + PNG predictor)
4. **Parsing** the header to extract object offsets
5. **Extracting** individual object bodies by offset
6. **Caching** parsed streams for performance

The parsing itself is straightforward:
- The header is space-separated integers (object numbers and offsets)
- The object data follows the header
- Each object body is extracted using its offset

Just like `DictScan`, object stream parsing is **text traversal**—once the stream is decompressed, it's just parsing space-separated numbers and extracting substrings by offset.

# PDF File Structure

## Overview

PDF (Portable Document Format) files have a reputation for being complex binary formats, but at their core they are **text-based files with a structured syntax**. Understanding this fundamental fact is key to understanding how PDF works.

While PDFs can contain binary data (like compressed streams, images, and fonts), the **structure** of a PDF—its objects, dictionaries, arrays, and references—is defined using plain text syntax.

## PDF File Anatomy

A PDF file consists of several main parts:

1. **Header**: `%PDF-1.4` (or a similar version line)
2. **Body**: a collection of PDF objects (the actual content)
3. **Cross-reference table (xref)**: points to the byte offsets of objects
4. **Trailer**: contains the root object reference and metadata
5. **EOF marker**: `%%EOF`

### PDF Objects

The body contains PDF objects. Each object has:
- An object number and generation number (e.g., `5 0 obj`)
- Content (dictionary, array, stream, etc.)
- An `endobj` marker

Example:
```
5 0 obj
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 612 792] >>
endobj
```

## PDF Dictionaries

**PDF dictionaries are the heart of PDF structure.** They're delimited with double angle brackets:

```
<< /Key1 value1 /Key2 value2 /Key3 value3 >>
```

Think of them like JSON objects or Ruby hashes, but with PDF-specific syntax:
- Keys are PDF names (they always start with `/`)
- Values can be strings, numbers, booleans, arrays, dictionaries, or object references
- Whitespace is generally insignificant (but required between tokens)

### Dictionary Examples

**Simple dictionary:**
```
<< /Type /Page /Width 612 /Height 792 >>
```

**Nested dictionary:**
```
<<
  /Type /Annot
  /Subtype /Widget
  /Rect [100 500 200 520]
  /AP <<
    /N <<
      /Yes 10 0 R
      /Off 11 0 R
    >>
  >>
>>
```

**Dictionary with an array:**
```
<< /Kids [5 0 R 6 0 R 7 0 R] >>
```

**Dictionary with string values:**
```
<< /Title (My Document) /Author (John Doe) >>
```

The parentheses `()` delimit literal strings in PDF syntax; hex strings use single angle brackets: `<...>`.

## PDF Text-Based Syntax

Despite being "binary" files, PDFs use text-based syntax for their structure. This means:

1. **Dictionaries are text**: `<< ... >>` is just a character sequence
2. **Arrays are text**: `[ ... ]` is just a character sequence
3. **References are text**: `5 0 R` means "object 5, generation 0"
4. **Strings can be literal or hex**: `(Hello)` or `<48656C6C6F>`

### Why This Matters

Because PDF dictionaries are just text with delimiters (`<<`, `>>`), we can parse them using **simple text traversal algorithms**—no complex parser generator, no AST construction, just:

1. Find an opening `<<`
2. Track nesting depth by counting `<<` and `>>`
3. When the depth reaches zero, we've found a complete dictionary
4. Repeat
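
The four steps above can be sketched directly. This is a simplified illustration, not `CorpPdf`'s actual `each_dictionary`; the corner case of `<<` appearing inside a literal string is not handled:

```ruby
# Yield each top-level << ... >> span in src, tracking nesting depth.
def each_dictionary(src)
  i = 0
  while (start = src.index("<<", i))
    depth = 0
    j = start
    while j < src.length
      if src[j, 2] == "<<"
        depth += 1
        j += 2
      elsif src[j, 2] == ">>"
        depth -= 1
        j += 2
        break if depth.zero?
      else
        j += 1
      end
    end
    yield src[start...j] if depth.zero?
    i = j
  end
end

each_dictionary("noise << /A << /B 1 >> >> noise << /C 2 >>") { |d| puts d }
```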

## PDF Object References

PDFs use references to link objects together:
```
5 0 R
```

This means:
- Object number: `5`
- Generation number: `0` (usually 0 for non-incrementally-updated PDFs)
- `R` means "reference"

When you see `/Parent 5 0 R`, it means the `Parent` key references object 5.

## PDF Arrays

Arrays are space-separated lists in square brackets:
```
[0 0 612 792]
```

They can contain any PDF value type:
```
[5 0 R 6 0 R]
[/Yes /Off]
[(Hello) (World)]
```

## PDF Strings

PDF strings come in two flavors:

### Literal Strings (parentheses)
```
(Hello World)
(Line 1\nLine 2)
```

Literal strings can contain escape sequences: `\n`, `\r`, `\t`, `\(`, `\)`, `\\`, and octal escapes like `\123`.

### Hex Strings (angle brackets)
```
<48656C6C6F>
<FEFF00480065006C006C006F>
```

A hex string beginning with the BOM `FEFF` indicates UTF-16BE encoding.
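
Decoding a hex string is a `pack("H*")` away. A sketch (a hypothetical helper, not the library's API) that also honors the UTF-16BE BOM:

```ruby
# Decode a PDF hex string token like "<48656C6C6F>".
def decode_hex_string(tok)
  bytes = [tok.delete("<> \t\r\n")].pack("H*")
  if bytes.getbyte(0) == 0xFE && bytes.getbyte(1) == 0xFF
    # UTF-16BE with BOM: drop the BOM, transcode to UTF-8
    bytes[2..].force_encoding(Encoding::UTF_16BE).encode(Encoding::UTF_8)
  else
    bytes
  end
end

decode_hex_string("<48656C6C6F>")                # => "Hello"
decode_hex_string("<FEFF00480065006C006C006F>")  # => "Hello"
```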

## PDF Names

PDF names start with `/`:
```
/Type
/Subtype
/Widget
```

Names can contain most characters except whitespace and the delimiter characters.

## Stream Objects

Some PDF objects contain **streams** (binary or text data):
```
10 0 obj
<< /Length 100 /Filter /FlateDecode >>
stream
[compressed binary data here]
endstream
endobj
```

When parsing structure (dictionaries), we typically strip or skip stream bodies, because they can contain arbitrary binary data that would confuse text-based parsing.

## Why CorpPdf Works

`CorpPdf` works because **PDF dictionaries are just text patterns**. Despite looking complicated, the algorithms are straightforward.

### Finding Dictionaries

The `each_dictionary` method:
1. Searches for `<<` (the start of a dictionary)
2. Tracks nesting depth: `<<` increments, `>>` decrements
3. When the depth returns to 0, a complete dictionary has been found
4. Yields it and continues searching

This is **pure text traversal**—no PDF-specific knowledge beyond "dictionaries use `<<` and `>>`".

### Extracting Values

The `value_token_after` method:
1. Finds a key (like `/V`)
2. Skips whitespace
3. Dispatches on the next character to extract the value:
   - `(` → extract a literal string (handling escapes)
   - `<` → extract a hex string or nested dictionary
   - `[` → extract an array (matching brackets)
   - `/` → extract a name
   - otherwise → extract an atom (number, reference, etc.)

Again, this is just **text pattern matching** with some bracket/depth tracking.
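
The dispatch can be sketched as follows. This is a simplified stand-in for `DictScan.value_token_after`: nested parentheses, hex strings, nested dictionaries, and multi-token references are deliberately not handled:

```ruby
# Find a key, skip whitespace, then dispatch on the value's first character.
def value_after(key, dict)
  i = dict.index(key + " ") or return nil
  i += key.length
  i += 1 while dict[i] == " "
  case dict[i]
  when "(" # literal string: scan to the matching unescaped ")"
    j = i + 1
    j += (dict[j] == "\\" ? 2 : 1) while dict[j] != ")"
    dict[i..j]
  when "[" # array: match square brackets
    depth = 0
    j = i
    loop do
      depth += 1 if dict[j] == "["
      depth -= 1 if dict[j] == "]"
      break if depth.zero?
      j += 1
    end
    dict[i..j]
  when "/" # name: runs until whitespace or a delimiter
    dict[i..][%r{\A/[^\s/\[\]<>()]*}]
  else # atom: number, boolean, or the first token of a reference
    dict[i..][%r{\A[^\s/\[\]<>()]+}]
  end
end

dict = "<< /Type /Annot /V (Hello World) /Rect [100 500 200 520] >>"
value_after("/V", dict)     # => "(Hello World)"
value_after("/Rect", dict)  # => "[100 500 200 520]"
value_after("/Type", dict)  # => "/Annot"
```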

### Why It Seems Complicated

The complexity comes from:
1. **Handling edge cases**: escaped characters, nested structures, the various value types
2. **Preserving exact formatting**: when replacing values, the output must remain valid PDF syntax
3. **Encoding/decoding**: PDF strings have special encoding rules (UTF-16BE BOM, escapes)
4. **Safety checks**: verifying dictionaries are still valid after modification

But the **core concept** is simple: PDF dictionaries are text, so we can parse them with text traversal.

## Example: Walking Through a PDF Dictionary

Given this PDF dictionary text:
```
<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>
```

Here is how `CorpPdf` would parse it:

1. **`each_dictionary` finds it:**
   - Finds `<<` at position 0
   - Depth: 0 → 1 (after `<<`)
   - Scans forward...
   - Finds the matching `>>` at position 74
   - Depth: 1 → 0
   - Yields: `"<< /Type /Annot /Subtype /Widget /V (Hello World) /Rect [100 500 200 520] >>"`

2. **`value_token_after("/V", dict)` extracts the value:**
   - Finds `/V` (followed by a space)
   - Skips whitespace
   - The next character is `(`, so it extracts a literal string
   - Scans forward, handling escapes, to the matching closing `)`
   - Returns: `"(Hello World)"`

3. **`decode_pdf_string("(Hello World)")` decodes it:**
   - Starts with `(`, ends with `)`
   - Extracts the inner text: `"Hello World"`
   - Unescapes (no escapes here)
   - Checks for a UTF-16BE BOM (none)
   - Returns: `"Hello World"`
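
Step 3 can be sketched in a few lines (a hypothetical helper; octal escapes and UTF-16BE handling are omitted):

```ruby
# Map PDF escape sequences to the characters they encode.
PDF_ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t",
  "(" => "(", ")" => ")", "\\" => "\\"
}.freeze

# Strip the surrounding parentheses, then resolve backslash escapes.
def decode_literal(tok)
  tok[1..-2].gsub(/\\([nrt()\\])/) { PDF_ESCAPES[$1] }
end

decode_literal("(Hello World)")      # => "Hello World"
decode_literal("(Line 1\\nLine 2)")  # => "Line 1\nLine 2"
```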

## Conclusion

PDF files are **structured text files** with binary data embedded in streams. The structure itself—dictionaries, arrays, strings, references—is all text-based syntax. This is why `CorpPdf` can use simple text traversal to parse and modify PDF dictionaries without needing a full PDF parser.

The apparent complexity in `CorpPdf` comes from:
- Handling PDF's various value types
- Proper encoding/decoding of strings
- Careful preservation of structure during edits
- Edge-case handling (escaping, nesting, etc.)

But the **fundamental approach** is elegantly simple: treat PDF dictionaries as text patterns and parse them with character-by-character traversal.

data/issues/README.md

# Code Review Issues

This folder contains documentation of code cleanup, refactoring opportunities, and improvement tasks found in the codebase.

## Files

- **[refactoring-opportunities.md](./refactoring-opportunities.md)** - Detailed list of code duplication and refactoring opportunities
- **[memory-improvements.md](./memory-improvements.md)** - Memory usage issues and optimization opportunities for handling larger PDF documents

## Summary

### High Priority Issues
1. **Widget matching logic** - duplicated across 6+ locations
2. **/Annots array manipulation** - complex logic duplicated in 3 locations

### Medium Priority Issues
3. **Box parsing logic** - repeated code blocks for 5 box types
4. **Checkbox appearance creation** - significant duplication in new code
5. **PDF metadata formatting** - could be extracted into shared utilities

### Low Priority Issues
6. Duplicated `next_fresh_object_number` implementation (may be intentional)
7. Object reference extraction pattern duplication
8. Unused method: `get_widget_rect_dimensions`
9. Base64 decoding logic duplication

### Completed ✅
- **Page-finding logic** - successfully refactored into `DictScan.is_page?` and unified page-finding methods

## Quick Stats

- **10 refactoring opportunities** identified (1 completed, 9 remaining)
- **6+ locations** with widget-matching duplication
- **3 locations** with /Annots array-manipulation duplication
- **1 unused method** found
- **2 new issues** identified in recent code additions

## Memory & Performance

### Memory Improvement Opportunities

See **[memory-improvements.md](./memory-improvements.md)** for a detailed analysis of memory usage and optimization strategies.

**Key issues:**
- Duplicate PDF loading (2x memory usage)
- Stream decompression cache retention
- All-objects-in-memory operations
- Multiple full PDF copies during write operations

**Estimated impact:** 50-90 MB typical usage for a 10 MB PDF; can exceed 100-200 MB for larger or more complex PDFs (39+ pages).

## Next Steps

1. Review [refactoring-opportunities.md](./refactoring-opportunities.md) for detailed information
2. Review [memory-improvements.md](./memory-improvements.md) for memory optimization strategies
3. Prioritize improvements based on maintenance and performance needs
4. Create test coverage before refactoring
5. Implement improvements incrementally, starting with high-priority items