acro_that 0.1.5 → 0.1.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,388 @@
# Memory Improvement Opportunities

This document identifies memory usage issues and opportunities to reduce memory consumption when handling larger PDF documents.

## Overview

Currently, `acro_that` loads entire PDF files into memory and creates multiple copies during processing. For small to medium PDFs (<20MB) this is acceptable, but for larger documents (39+ pages, especially those with images or compressed streams), memory usage can become problematic.

### Current Memory Footprint

For a typical 10MB PDF:
- **Initial load**: ~10MB (Document `@raw`)
- **ObjectResolver**: ~10MB (`@bytes`, a separate copy)
- **Decompressed streams**: ~20-50MB (cached in `@objstm_cache`)
- **Operations (flatten/clear)**: ~10-20MB (new PDF buffer)
- **Total peak**: ~50-90MB

For larger PDFs (39+ pages with images), peak memory can easily exceed **100-200MB**.
---

## 1. Duplicate Full PDF in Memory

### Issue
The PDF file is loaded twice: once into `Document#@raw` and again into `ObjectResolver#@bytes`.

### Current Implementation
```ruby
# document.rb, lines 21-26
@raw = File.binread(path_or_io)                # First copy: ~10MB
@resolver = AcroThat::ObjectResolver.new(@raw) # Second copy: ~10MB
```

### Suggested Improvement
**Option A: Shared String Buffer**
- Use frozen strings so Ruby can share the underlying memory
- Or: pass a reference instead of copying

**Option B: Lazy Loading with File IO**
- Keep the file handle open
- Read chunks on demand instead of loading the entire file
- Use `IO#seek` and `IO#read` for object access

**Option C: Memory-Mapped Files** (Advanced)
- Use `mmap` to map the file into memory without loading it
- Read-only access via memory mapping

### Benefits
- **Immediate**: ~50% reduction in base memory (eliminates the duplicate)
- **Impact**: High - affects every operation

### Priority
**HIGH** - This is the easiest win with immediate impact.
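Option A can be sketched as follows. `TinyResolver` is a stand-in for the gem's `ObjectResolver` (not its real implementation), shown only to illustrate that storing a reference to a frozen buffer does not copy the bytes:

```ruby
# Illustrative sketch of Option A; TinyResolver is not the gem's class.
class TinyResolver
  attr_reader :bytes

  def initialize(bytes)
    @bytes = bytes # stores a reference only; no byte-level copy is made
  end
end

# A literal stands in for File.binread so the example is self-contained.
raw = "%PDF-1.7 fake document bytes".b.freeze

resolver = TinyResolver.new(raw)
puts raw.equal?(resolver.bytes) # true: both names point at one String object
```

Because the buffer is frozen, neither side can mutate it behind the other's back, which is what makes sharing a single copy safe.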
---

## 2. Stream Decompression Cache Retention

### Issue
Decompressed object streams are cached in `@objstm_cache` and never cleared, even after they are no longer needed.

### Current Implementation
```ruby
# object_resolver.rb, lines 357-374
def load_objstm(container_ref)
  return if @objstm_cache.key?(container_ref) # Cached forever
  # ... decompress stream ...
  @objstm_cache[container_ref] = parsed # Never cleared
end
```

### Suggested Improvement
**Option A: Cache Size Limits**
- Implement an LRU (Least Recently Used) cache with a maximum size
- Evict old entries when the cache exceeds the threshold

**Option B: Lazy Caching**
- Only cache streams that are accessed multiple times
- Clear the cache after operations complete

**Option C: Cache Clearing API**
- Add a `Document#clear_cache` method
- Allow manual cache management
- Auto-clear after `flatten`, `clear`, or `write` operations

### Benefits
- **Immediate**: can free 20-50MB+ for large PDFs with many streams
- **Impact**: Medium-High - especially important for PDFs with object streams

### Priority
**MEDIUM-HIGH** - Significant memory savings, relatively easy to implement.
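A minimal sketch of Option A, assuming a small standalone class rather than anything in `acro_that`: Ruby hashes preserve insertion order, so re-inserting a key on access keeps the first key as the least recently used one.

```ruby
# Minimal LRU cache sketch (Option A); class name and API are illustrative.
class LruCache
  def initialize(max_size)
    @max_size = max_size
    @store = {}
  end

  def fetch(key)
    if @store.key?(key)
      @store[key] = @store.delete(key) # move to most-recently-used position
      @store[key]
    else
      value = yield
      @store[key] = value
      @store.shift if @store.size > @max_size # evict the least recently used
      value
    end
  end

  def size
    @store.size
  end
end

cache = LruCache.new(2)
cache.fetch(:stream_a) { "decompressed A" }
cache.fetch(:stream_b) { "decompressed B" }
cache.fetch(:stream_c) { "decompressed C" } # :stream_a is evicted here
puts cache.size # 2
```

With `@objstm_cache` behind such a structure, an evicted stream would simply be decompressed again on the next access, trading a little CPU for a bounded footprint.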
---

## 3. All-Objects-in-Memory Operations

### Issue
Operations like `flatten` and `clear` load ALL objects into in-memory arrays before processing.

### Current Implementation
```ruby
# document.rb, lines 35-38 (flatten)
objects = []
@resolver.each_object do |ref, body|
  objects << { ref: ref, body: body } # All objects loaded!
end
```

### Suggested Improvement
**Option A: Streaming Write**
- Write objects directly to the output buffer as they are processed
- Don't collect all objects first
- Process and write in a single pass

**Option B: Chunked Processing**
- Process objects in batches (e.g., 100 at a time)
- Write batches incrementally
- Reduces peak memory

**Option C: Two-Pass Approach**
- First pass: collect object references and metadata only
- Second pass: read and write object bodies on demand
- Keep object bodies in the original file; only read them when writing

### Benefits
- **Immediate**: eliminates the need for a full object array
- **Impact**: High - especially for PDFs with many objects (1000+)

### Priority
**HIGH** - Core operations (`flatten`, `clear`) are memory-intensive.
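Option B can be sketched as below. The method name, batch size, and object format are illustrative, not gem API; the point is that peak memory is bounded by one batch rather than the whole object list.

```ruby
# Chunked-processing sketch (Option B); names are illustrative only.
BATCH_SIZE = 100

def write_in_batches(objects, out)
  batch = []
  objects.each do |obj|
    batch << obj
    if batch.size >= BATCH_SIZE
      out << batch.join("\n") << "\n"
      batch.clear # frees the batch before the next one is built
    end
  end
  out << batch.join("\n") << "\n" unless batch.empty?
  out
end

bodies = (1..250).map { |i| "#{i} 0 obj ... endobj" }
buffer = +""
write_in_batches(bodies, buffer)
puts buffer.lines.size # 250
```

In a real `flatten`, `out` would be the output file or PDF writer, so the collected-objects array disappears entirely.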
---

## 4. Multiple Full PDF Copies During Write

### Issue
`write` and `flatten` operations build complete new PDFs in memory, doubling memory usage.

### Current Implementation
```ruby
# document.rb, lines 66-67 (flatten!)
flattened_content = flatten # New PDF in memory: ~10-20MB
@raw = flattened_content # Replace original
@resolver = AcroThat::ObjectResolver.new(flattened_content) # Another copy!
```

### Suggested Improvement
**Option A: Write Directly to File**
- Stream output directly to a file instead of building it in memory
- Only buffer small chunks at a time

**Option B: Incremental Flattening**
- Rebuild the PDF by reading from the original and writing to an output file
- Never hold both in memory simultaneously

**Option C: Temp File for Large Operations**
- For documents >10MB, use a temp file
- Stream to the temp file, then replace the original
- Fall back to in-memory processing for small files

### Benefits
- **Immediate**: 50% reduction during write operations
- **Impact**: Medium - affects write-heavy workflows

### Priority
**MEDIUM** - Important for write operations, but less critical than load-time memory.
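A sketch of Option C, with a hypothetical function name and chunk size. The rebuilt PDF streams through a 64KB window, so the full document never needs a second in-memory copy; the copy loop stands in for real object-by-object rewriting.

```ruby
# Temp-file rewrite sketch (Option C); illustrative, not gem code.
def rewrite_via_temp_file(source_path, dest_path)
  tmp_path = "#{dest_path}.tmp"
  File.open(tmp_path, "wb") do |tmp|
    File.open(source_path, "rb") do |src|
      while (chunk = src.read(64 * 1024)) # 64KB window, never the whole PDF
        tmp.write(chunk) # a real flatten would transform content here
      end
    end
  end
  File.rename(tmp_path, dest_path) # atomic replace on the same filesystem
end
```

The `.tmp`-then-rename pattern also means a crash mid-write never leaves a half-written PDF at the destination path.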
---

## 5. IncrementalWriter Duplicate Original

### Issue
`IncrementalWriter#render` duplicates the entire original PDF before appending patches.

### Current Implementation
```ruby
# incremental_writer.rb, line 19
original_with_newline = @orig.dup # Full copy: ~10-20MB
```

### Suggested Improvement
**Option A: Append Mode**
- Write patches directly to the original file (if writable)
- Don't duplicate it in memory
- Use file append operations

**Option B: Streaming Append**
- Read the original file in chunks
- Write chunks plus patches directly to the output
- Never hold the full original in memory

**Option C: Reference Original**
- Only duplicate if the original is frozen/immutable
- Use `+""` instead of `dup` for better memory sharing

### Benefits
- **Immediate**: eliminates ~10-20MB during incremental updates
- **Impact**: Medium - affects `write` operations

### Priority
**MEDIUM** - Good optimization, but incremental updates are typically small operations.
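Option B can be sketched like this, with illustrative names: the original is copied to the output in 64KB chunks and the incremental patch is appended afterwards, so at most one chunk of the original is resident at a time.

```ruby
require "stringio"

# Streaming-append sketch (Option B); illustrative, not the gem's writer.
def streaming_append(original_io, patch_bytes, out_io)
  last_byte = nil
  while (chunk = original_io.read(64 * 1024))
    last_byte = chunk[-1]
    out_io.write(chunk)
  end
  out_io.write("\n") unless last_byte == "\n" # updates start on a fresh line
  out_io.write(patch_bytes)
  out_io
end

out = StringIO.new(+"", "wb")
streaming_append(StringIO.new("%PDF-1.4 original".b),
                 "1 0 obj patched endobj".b,
                 out)
puts out.string.end_with?("endobj") # true
```

`StringIO` stands in for file handles here; with real files the memory profile is the same, one chunk plus the (typically small) patch.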
---

## 6. Object Body String Slicing

### Issue
Every `object_body` call creates new string slices from the original buffer, potentially preventing garbage collection of unused portions.

### Current Implementation
```ruby
# object_resolver.rb, lines 57-62
hdr = /\bobj\b/m.match(@bytes, i)
after = hdr.end(0)
j = @bytes.index(/\bendobj\b/m, after)
@bytes[after...j] # New string slice
```

### Suggested Improvement
**Option A: Weak References**
- Use weak references for object bodies
- Allow GC to reclaim the original buffer once all references are gone

**Option B: Substring Views** (if available)
- Use substring views instead of copying
- Only create a copy when the string is modified

**Option C: Minimal Caching**
- Don't cache object bodies unless they are accessed multiple times
- Re-read from the file when needed (if streaming)

### Benefits
- **Immediate**: helps GC reclaim memory faster
- **Impact**: Low-Medium - affects GC efficiency more than peak memory

### Priority
**LOW-MEDIUM** - Optimization that helps over time, but less critical.
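Option A can use Ruby's stdlib `WeakRef`, sketched below with an illustrative cache shape: the cache holds weak references, so GC is free to reclaim a body once no strong reference to it remains.

```ruby
require "weakref"

# Weak-reference cache sketch (Option A); cache shape is illustrative.
body = +"1 0 obj << /Type /Annot >> endobj"
cache = { ref_1_0: WeakRef.new(body) }

puts cache[:ref_1_0].weakref_alive?         # true while `body` is strongly held
puts cache[:ref_1_0].start_with?("1 0 obj") # WeakRef delegates to the string
```

Callers must be prepared for `WeakRef::RefError` (or a `weakref_alive?` check) on access, since the target may have been collected; that is the price of letting GC reclaim bodies early.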
---

## 7. No Memory Limits or Warnings

### Issue
The gem has no way to detect or warn about excessive memory usage before operations fail.

### Current Implementation
No memory monitoring or limits exist.

### Suggested Improvement
**Option A: Memory Estimation**
- Estimate memory usage before operations
- Warn if the estimate exceeds available memory
- Suggest alternatives (temp files, etc.)

**Option B: File Size Limits**
- Add configurable file size limits
- Raise an error if a file exceeds the limit
- Prevent loading files that will definitely OOM

**Option C: Memory Monitoring**
- Track peak memory usage during operations
- Log warnings for large memory spikes
- Provide metrics for monitoring

### Benefits
- **Immediate**: better user experience, fail-fast before OOM
- **Impact**: Medium - prevents crashes, but doesn't reduce memory

### Priority
**LOW-MEDIUM** - Nice to have, but doesn't fix the root issue.
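Option B is a few lines; the constant, default, and method name below are all hypothetical, not gem API. Called before `File.binread`, such a check fails fast instead of OOM-ing mid-parse.

```ruby
# File-size limit sketch (Option B); names and default are illustrative.
MAX_PDF_BYTES = 100 * 1024 * 1024 # hypothetical default limit: 100MB

def assert_loadable!(path, limit: MAX_PDF_BYTES)
  size = File.size(path)
  if size > limit
    raise ArgumentError, "PDF is #{size} bytes, above the #{limit}-byte limit"
  end
  size
end
```

`File.size` only stats the file, so the check itself costs no meaningful memory.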
---

## 8. Field Listing Memory Usage

### Issue
`list_fields` iterates through ALL objects and builds arrays of widget information before returning fields.

### Current Implementation
```ruby
# document.rb, lines 163-208
@resolver.each_object do |ref, body| # Iterates ALL objects
  # ... collect widget info in hashes ...
  field_widgets[parent_ref] ||= []
  field_widgets[parent_ref] << widget_info
  # ... more arrays and hashes ...
end
```

### Suggested Improvement
**Option A: Lazy Field Enumeration**
- Return an enumerable instead of an array
- Calculate field info on demand
- Only build the full array when needed (e.g., `.to_a`)

**Option B: Stream Field Objects**
- Yield fields one at a time instead of collecting them
- Process fields as they are discovered
- Offer an `each_field` method alongside `list_fields`

**Option C: Field Index**
- Build a lightweight index (refs only) on the first call
- Fetch full field data on demand
- Cache only frequently accessed fields

### Benefits
- **Immediate**: reduces memory for documents with many objects
- **Impact**: Medium - helps when scanning large PDFs

### Priority
**MEDIUM** - Good optimization, but `list_fields` may need to return an array for compatibility.
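Option A can be sketched with `Enumerator.new`; the `each_field` name and object format below are illustrative. The enumerator defers all work until enumeration, so `.first` examines only as many objects as needed and `.to_a` builds the array only on request.

```ruby
# Lazy enumeration sketch (Option A); names and shapes are illustrative.
def each_field(objects)
  Enumerator.new do |yielder|
    objects.each do |obj|
      yielder << obj[:name] if obj[:type] == :field
    end
  end
end

objects = [
  { type: :widget },
  { type: :field, name: "email" },
  { type: :field, name: "signature" }
]

puts each_field(objects).first        # "email"
puts each_field(objects).to_a.inspect # ["email", "signature"]
```

A compatible migration path is to keep `list_fields` returning an array while adding an enumerator-based `each_field` next to it.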
---

## Priority Recommendations

### Critical (Do First)
1. **Duplicate Full PDF (#1)** - easiest win, immediate 50% reduction
2. **All-Objects-in-Memory Operations (#3)** - core operations, highest impact

### High Priority
3. **Stream Decompression Cache (#2)** - significant savings for PDFs with object streams
4. **Multiple Full PDF Copies (#4)** - affects write operations

### Medium Priority
5. **IncrementalWriter Duplicate (#5)** - affects incremental updates
6. **Field Listing Memory (#8)** - optimize field scanning

### Low Priority
7. **Object Body String Slicing (#6)** - GC optimization, less critical
8. **Memory Limits/Warnings (#7)** - nice to have, doesn't reduce memory

---

## Implementation Strategy

### Phase 1: Quick Wins (Low Risk, High Impact)
1. Eliminate duplicate PDF loading (#1)
2. Clear cache after operations (#2)
3. Add memory estimation/warnings (#7)

### Phase 2: Core Operations (Medium Risk, High Impact)
4. Streaming write for `flatten` (#3)
5. Streaming write for `clear` (#3)
6. Eliminate duplicate during `flatten!` (#4)

### Phase 3: Advanced Optimizations (Higher Risk, Medium Impact)
7. Streaming `IncrementalWriter` (#5)
8. Lazy field enumeration (#8)
9. Memory-mapped files for large documents (#1, Option C)
---

## Testing Considerations

### Memory Profiling
- Use `ObjectSpace.memsize_of` and `GC.stat` to measure improvements
- Profile before/after with real-world PDFs (10MB, 50MB, 100MB+)
- Test with various PDF types (text-only, images, object streams)

### Compatibility
- Ensure all optimizations maintain the existing API
- No breaking changes to public methods
- Maintain backward compatibility

### Performance
- Measure the impact on processing speed
- Some optimizations (streaming) may slightly reduce speed
- Balance memory vs. performance trade-offs
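A rough before/after heap comparison using the tools named above might look like this. The absolute numbers vary by Ruby version and GC state, so treat them as indicative rather than exact.

```ruby
require "objspace"

# Measurement sketch: compare live heap slots around a simulated allocation.
GC.start
before = GC.stat(:heap_live_slots)

buffers = Array.new(10_000) { "x" * 64 } # stand-in for decompressed streams

after = GC.stat(:heap_live_slots)
puts "live slots grew by ~#{after - before}"
puts "one buffer: #{ObjectSpace.memsize_of(buffers.first)} bytes"
```

For allocation-by-call-site breakdowns, the `memory_profiler` gem linked in the references gives far more detail than raw `GC.stat` deltas.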
---

## Notes

- **Ruby String Memory**: Ruby strings have overhead (~24 bytes per string object)
- **GC Pressure**: multiple large string copies increase GC pressure
- **File Size vs. Memory**: decompressed streams can be 5-20x larger than their compressed size
- **Real-World Limits**: consider typical server environments (512MB-2GB available)
- **Backward Compatibility**: the API must be maintained, but internals can be optimized

---

## References

- [Ruby Memory Profiling](https://github.com/SamSaffron/memory_profiler)
- [ObjectSpace Documentation](https://ruby-doc.org/core-3.2.2/ObjectSpace.html)
- [PDF Specification - Object Streams](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf)
@@ -0,0 +1,204 @@
# Memory Optimization Summary

## Overview

This document summarizes the memory optimizations implemented for `acro_that` based on the analysis in `memory-improvements.md`.

## Optimizations Implemented

### 1. Freeze @raw to Guarantee Memory Sharing ✅

**Implementation:**
- Freeze `@raw` after the initial load in `Document#initialize`
- Freeze `@raw` on reassignment in `flatten!`, `clear!`, and `write`

**Files Modified:**
- `lib/acro_that/document.rb`

**Benefits:**
- Guarantees memory sharing between `Document#@raw` and `ObjectResolver#@bytes`
- Prevents accidental modification of the PDF buffer
- Lets Ruby optimize memory usage for frozen strings

**Code Changes:**
```ruby
# Before
@raw = File.binread(path_or_io)

# After
@raw = File.binread(path_or_io).freeze
```
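A quick standalone sanity check (not gem code) of the sharing claim: assigning a frozen string copies only the reference, so both names report the same object and `ObjectSpace` counts a single buffer.

```ruby
require "objspace"

# Illustrative check: one frozen ~1MB buffer, two references to it.
raw = ("x" * 1_000_000).freeze
bytes = raw

puts bytes.equal?(raw)                        # true: same object, no copy
puts ObjectSpace.memsize_of(raw) >= 1_000_000 # a single buffer holds the data
```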
---

### 2. Clear Object Stream Cache After Operations ✅

**Implementation:**
- Added a `clear_cache` method to `ObjectResolver`
- Call `clear_cache` before creating new resolver instances in `flatten!`, `clear!`, and `write`

**Files Modified:**
- `lib/acro_that/object_resolver.rb` - added the `clear_cache` method
- `lib/acro_that/document.rb` - call `clear_cache` before creating new resolvers

**Benefits:**
- Prevents memory retention in the object stream cache
- Frees decompressed stream data after operations complete
- Reduces the memory footprint for documents with many object streams

**Code Changes:**
```ruby
# In ObjectResolver
def clear_cache
  @objstm_cache.clear
end

# In Document
def flatten!
  flattened_content = flatten.freeze
  @raw = flattened_content
  @resolver.clear_cache # Clear cache before new resolver
  @resolver = AcroThat::ObjectResolver.new(flattened_content)
  # ...
end
```
---

### 3. Optimize IncrementalWriter to Avoid dup ✅

**Implementation:**
- Replaced `@orig.dup` plus in-place modification with a single string concatenation
- Avoids the extra intermediate duplicate of the original PDF

**Files Modified:**
- `lib/acro_that/incremental_writer.rb`

**Benefits:**
- Eliminates the `dup`-created duplicate of the original PDF during incremental updates
- Reduces memory usage during `write` operations
- More efficient string operations

**Code Changes:**
```ruby
# Before
original_with_newline = @orig.dup
original_with_newline << "\n" unless @orig.end_with?("\n")

# After
newline_if_needed = @orig.end_with?("\n") ? "".b : "\n".b
original_with_newline = @orig + newline_if_needed
```
---

## Benchmark Results

### Key Improvements

| Operation | Before | After | Improvement |
|-----------|--------|-------|-------------|
| **write** | 1.25 MB | 0.89 MB | **-29%** ✅ |
| **flatten!** | 0.19 MB | 0.13 MB | **-32%** ✅ |
| **clear** | 0.44 MB | 0.33 MB | **-25%** ✅ |
| **Peak (flatten)** | 0.39 MB | 0.03 MB | **-92%** ✅✅ |

### Overall Impact

- **Total memory savings**: ~0.52 MB per typical workflow (write + flatten!)
- **Peak memory reduction**: 92% during the flatten operation
- **Cache management**: cleanup after operations prevents memory retention
- **Memory sharing**: guaranteed via frozen strings

See `memory-benchmark-results.md` for a detailed before/after comparison.
---

## Testing

All existing tests pass:
- ✅ 61 examples, 0 failures
- ✅ All functionality preserved
- ✅ No breaking changes to the public API

### Running Memory Benchmarks

```bash
# Run all memory benchmarks
BENCHMARK=true bundle exec rspec spec/memory_benchmark_spec.rb

# Run a specific benchmark
BENCHMARK=true bundle exec rspec spec/memory_benchmark_spec.rb:12
```
---

## Future Optimization Opportunities

Based on `memory-improvements.md`, additional optimizations could include:

1. **Streaming writes for `flatten` and `clear`** (Issue #3)
   - Stream objects directly to PDFWriter instead of collecting them in an array
   - High impact for PDFs with many objects (1000+)

2. **Reuse resolver in `flatten!`** (Issue #4)
   - Avoid creating a new resolver when possible
   - Medium impact for write-heavy workflows

3. **Lazy field enumeration** (Issue #8)
   - Return an enumerable instead of an array
   - Medium impact for large PDFs

---

## Files Changed

1. `lib/acro_that/document.rb`
   - Freeze `@raw` after loading and on reassignment
   - Call `clear_cache` before creating new resolvers

2. `lib/acro_that/object_resolver.rb`
   - Add `clear_cache` method

3. `lib/acro_that/incremental_writer.rb`
   - Avoid `dup` by using string concatenation

4. `spec/memory_benchmark_helper.rb` (new)
   - Memory benchmarking utilities

5. `spec/memory_benchmark_spec.rb` (new)
   - Memory benchmark tests

6. `issues/memory-benchmark-results.md` (new)
   - Before/after benchmark results

7. `issues/memory-optimization-summary.md` (this file)
   - Summary of optimizations

---

## Backward Compatibility

✅ **All changes are backward compatible**
- No changes to the public API
- No breaking changes
- All existing functionality preserved
- Internal optimizations only

---

## Notes

- Freezing strings has minimal overhead but provides memory-sharing guarantees
- Cache clearing happens automatically after operations - no manual intervention is needed
- The peak memory reduction (92%) is the largest single improvement
- Some operations show slight variance in measurements (normal for memory profiling)

---

## References

- [Memory Improvements Analysis](./memory-improvements.md)
- [Memory Benchmark Results](./memory-benchmark-results.md)
- [Ruby Memory Profiling](https://github.com/SamSaffron/memory_profiler)