pdf-reader 1.0.0.rc1 → 1.0.0
- data/CHANGELOG +4 -0
- data/README.rdoc +12 -17
- data/TODO +6 -17
- data/lib/pdf/reader.rb +7 -7
- data/lib/pdf/reader/filter.rb +1 -1
- data/lib/pdf/reader/form_xobject.rb +1 -1
- data/lib/pdf/reader/page.rb +1 -1
- data/lib/pdf/reader/page_text_receiver.rb +2 -14
- data/lib/pdf/reader/standard_security_handler.rb +5 -3
- metadata +23 -24
data/CHANGELOG
CHANGED
@@ -1,3 +1,7 @@
+v1.0.0 (16th January 2012)
+- support a new encryption variation
+- bugfix in PageTextRender (thanks Paul Gallagher)
+
 v1.0.0.rc1 (19th December 2011)
 - performance optimisations (all by Bernerd Schaefer)
 - some improvements to text extraction from form xobjects
data/README.rdoc
CHANGED
@@ -1,18 +1,3 @@
-= !PLEASE NOTE!
-
-All the examples below are for the latest (pre-release) version of the gem (0.11)
-
-If you have installed the gem via the rubygems with the command:
-
-$ gem install pdf-reader
-
-Then the examples below *will not work* for you. Please check the examples that
-come with previous version of the gem (0.10).
-
-If you want to install the latest version of this gem use the command:
-
-$ gem install pdf-reader --prerelease
-
 = Release Notes
 
 The PDF::Reader library implements a PDF parser conforming as much as possible
@@ -59,7 +44,8 @@ an IO stream:
 puts reader.info
 
 If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
-mode to ensure the file isn't mangled by ruby being 'helpful'.
+mode to ensure the file isn't mangled by ruby being 'helpful'. This is
+particularly important on windows and MRI >= 1.9.2.
 
 File.open("somefile.pdf", "rb") do |io|
 reader = PDF::Reader.new(io)
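The README's advice about "rb" mode can be tried without a PDF or the gem. The following is a minimal sketch using only the Ruby standard library. The sample bytes, the temporary file name and the helper names `read_binary` and `round_trip` are made up for illustration and are not part of PDF::Reader.

```ruby
require "tempfile"

# Illustrative bytes only: a PDF-style header followed by bytes that are not
# valid UTF-8, plus CRLF pairs that text mode may translate on Windows.
SAMPLE_BYTES = "%PDF-1.4\r\n\xE2\xE3\xCF\xD3\r\n".b

# Read a whole file in binary mode. "rb" turns off newline translation and
# tags the resulting String as binary (ASCII-8BIT).
def read_binary(path)
  File.open(path, "rb") { |io| io.read }
end

# Write the given bytes to a temporary file and read them back
# (hypothetical helper, used only by this sketch).
def round_trip(bytes)
  Tempfile.create("rb-mode-sketch") do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    read_binary(tmp.path)
  end
end

data = round_trip(SAMPLE_BYTES)
puts data.encoding == Encoding::BINARY # tagged as binary rather than UTF-8
puts data.bytes == SAMPLE_BYTES.bytes  # byte-for-byte identical
```

Passing an IO opened this way to `PDF::Reader.new(io)`, as the README snippet does, means the parser sees the file's bytes unmodified.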
@@ -111,6 +97,15 @@ to UTF-8 before it is passed back from PDF::Reader.
 Strings that contain binary data (like font blobs) will be marked as such on
 M17N aware VMs.
 
+= Former API
+
+Version 1.0.0 of PDF::Reader introduced a new page-based API that provides
+efficient and easy access to any page.
+
+The previous API is marked as deprecated but will continue to work for the
+time being. Eventually calls to the old API will begin triggering deprecation
+warnings before it is completely removed in version 2.0.0.
+
 = Exceptions
 
 There are two key exceptions that you will need to watch out for when processing a
@@ -119,7 +114,7 @@ PDF file:
 MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
 file should be valid, or that a corrupt file didn't raise an exception, please
 forward a copy of the file to the maintainers (preferably via the google group)
-and we
+and we will attempt to improve the code.
 
 UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
 support. Again, we welcome submissions of PDF files that exhibit these features to help
data/TODO
CHANGED
@@ -1,27 +1,19 @@
-
--
-- list implemented features
-- encrypted? tagged? bookmarks? annotated? optimised?
-- Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
+This stuff would be great
+- improved access to document level objects and data
 - bookmarks?
 - outline?
 - articles?
 - viewer prefs?
--
+- Improve the speed of Encoding#to_utf8
 - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
 poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
 from the Original encoding to Unicode.
 - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
 - Improve interpretation of non content stream data (ie metadata). recognise dates, etc
-- Fix inheritance of page attributes. Resources has been done, but plenty of other attributes
-are inheritable. See table 3.2.7 in the spec
 
-v0.9
-- Add a way to extract raster images
-- see XObjects section of spec (section 4.7)
-- Add a way to extract font data?
 
-
+
+This might be useful, more research required
 - Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
 - Will require significantly improved handling of CMaps, including creating a bunch of predefined ones
 
@@ -30,10 +22,7 @@ Sometime
 - Ship some extra receivers in the standard package, particuarly ones that are useful for running
 rspec over generated PDF files
 
--
-sensible way to convert them to unicode
-
-- Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
+- Add support for additional filters: CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode
 
 - Add support for additional encodings:
 - Identity-V(I *think* this relates to vertical text. Not sure how we'd support it sensibly)
data/lib/pdf/reader.rb
CHANGED
@@ -159,7 +159,7 @@ module PDF
 yield PDF::Reader.new(input, opts)
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 #
@@ -171,7 +171,7 @@ module PDF
 end
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the given string, sending events to the given receiver.
@@ -182,7 +182,7 @@ module PDF
 end
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the file with the given name, returning an unmarshalled ruby version of
@@ -194,7 +194,7 @@ module PDF
 }
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the given string, returning an unmarshalled ruby version of represents
@@ -245,7 +245,7 @@ module PDF
 end
 
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Given an IO object that contains PDF data, parse it.
@@ -263,7 +263,7 @@ module PDF
 self
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Given an IO object that contains PDF data, return the contents of a single object
@@ -276,7 +276,7 @@ module PDF
 
 private
 
-# recursively convert strings from outside a content stream
+# recursively convert strings from outside a content stream into UTF-8
 #
 def doc_strings_to_utf8(obj)
 case obj
data/lib/pdf/reader/filter.rb
CHANGED
data/lib/pdf/reader/page.rb
CHANGED
data/lib/pdf/reader/page_text_receiver.rb
CHANGED
@@ -1,12 +1,6 @@
 # coding: utf-8
 
 require 'matrix'
-require 'yaml'
-
-begin
-require 'psych'
-rescue LoadError
-end
 
 module PDF
 class Reader
@@ -32,7 +26,7 @@ module PDF
 @font_stack = [build_fonts(page.fonts)]
 @xobject_stack = [page.xobjects]
 @content = {}
-@stack = [DEFAULT_GRAPHICS_STATE]
+@stack = [DEFAULT_GRAPHICS_STATE.dup]
 end
 
 def content
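The diff does not say why `.dup` was added. One plausible reading is that pushing the constant itself onto the stack lets in-place state changes made for one page leak into every later receiver. The sketch below illustrates that reading under stated assumptions: `DEFAULT_STATE`, `StateStack` and the hash keys are hypothetical stand-ins, not the library's own definitions.

```ruby
# Hypothetical stand-in for the constant in the diff; the real contents differ.
DEFAULT_STATE = { :char_spacing => 0, :word_spacing => 0 }

# Minimal object that keeps a stack of graphics states (illustrative only).
class StateStack
  def initialize(share_constant)
    first = share_constant ? DEFAULT_STATE : DEFAULT_STATE.dup
    @stack = [first]
  end

  # Mutates the current state in place, as an operator handler might.
  def set(key, value)
    @stack.last[key] = value
  end

  def get(key)
    @stack.last[key]
  end
end

# Without dup: the first object's change is visible to a brand new object.
StateStack.new(true).set(:char_spacing, 5)
leaked = StateStack.new(true).get(:char_spacing)

DEFAULT_STATE[:char_spacing] = 0 # undo the damage before the second run

# With dup: each object mutates its own copy and the constant stays intact.
StateStack.new(false).set(:char_spacing, 5)
isolated = StateStack.new(false).get(:char_spacing)

puts "shared constant: #{leaked}, dup: #{isolated}"
```

Note that `dup` makes a shallow copy, so it only protects the top-level hash. Any nested mutable values would still be shared.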
@@ -235,8 +229,6 @@ module PDF
 # underlying device space.
 #
 def transform(point, z = 1)
-trm = text_rendering_matrix
-
 point.transform(text_rendering_matrix, z)
 end
 
@@ -286,7 +278,7 @@ module PDF
 end
 
 # private class for representing points on a cartesian plain. Used
-# to simplify maths
+# to simplify maths.
 #
 class Point < Struct.new(:x, :y)
 def transform(trm, z)
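The hunks around this point show only part of `Point#transform`: its signature and the expression for the second coordinate. The sketch below fills in the rest so the arithmetic can be tried on its own. The expression for the first coordinate is an assumption made by symmetry with the visible one, and the translation matrix is made-up sample data.

```ruby
require "matrix"

# Sketch of a point that can be multiplied by a 3x3 transformation matrix,
# using the row-vector convention [x y z] * trm.
class Point < Struct.new(:x, :y)
  def transform(trm, z)
    Point.new(
      (trm[0, 0] * x) + (trm[1, 0] * y) + (trm[2, 0] * z), # assumed x term
      (trm[0, 1] * x) + (trm[1, 1] * y) + (trm[2, 1] * z)  # y term as shown in the diff
    )
  end
end

# A pure translation by (5, 7): with row vectors the third row carries the offsets.
translate = Matrix[[1, 0, 0], [0, 1, 0], [5, 7, 1]]
moved = Point.new(2, 3).transform(translate, 1)
puts moved.x
puts moved.y
```

Passing `z = 1` applies the translation row in full, which is what the receiver's own `transform(point, z = 1)` default does.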
@@ -295,10 +287,6 @@ module PDF
 (trm[0,1] * x) + (trm[1,1] * y) + (trm[2,1] * z)
 )
 end
-
-def distance(point)
-Math.hypot(point.x - @x, point.y - @y)
-end
 end
 end
 end
data/lib/pdf/reader/standard_security_handler.rb
CHANGED
@@ -79,7 +79,8 @@ class PDF::Reader
 objKey = @encrypt_key.dup
 (0..2).each { |e| objKey << (ref.id >> e*8 & 0xFF ) }
 (0..1).each { |e| objKey << (ref.gen >> e*8 & 0xFF ) }
-
+length = objKey.length < 16 ? objKey.length : 16
+rc4 = RC4.new( Digest::MD5.digest(objKey)[(0...length)] )
 rc4.decrypt(buf)
 end
 
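The added lines cap the per-object RC4 key at 16 bytes, or at the length of the seed when that is shorter. A minimal sketch of that derivation follows, using only `Digest::MD5` from the standard library. The method name `object_key`, its signature and the sample keys are hypothetical; the library works on instance variables and a reference object instead.

```ruby
require "digest/md5"

# Derive the RC4 key for one indirect object (sketch of the step in the diff).
# file_key is the document-level key as a binary String; id and gen are the
# object's reference numbers. Name and signature are hypothetical.
def object_key(file_key, id, gen)
  seed = file_key.dup.force_encoding(Encoding::BINARY)
  3.times { |e| seed << ((id >> (e * 8)) & 0xFF) }  # low 3 bytes of the id
  2.times { |e| seed << ((gen >> (e * 8)) & 0xFF) } # low 2 bytes of the generation
  length = [seed.length, 16].min                    # cap at the 16-byte digest size
  Digest::MD5.digest(seed)[0, length]
end

short_key = "\x01\x02\x03\x04\x05".b # 40-bit file key: seed is 5 + 5 = 10 bytes
long_key  = ("\xAA" * 16).b          # 128-bit file key: seed is 16 + 5 = 21 bytes

puts object_key(short_key, 12, 0).bytesize
puts object_key(long_key, 12, 0).bytesize
```

With a 5-byte file key the result is 10 bytes long. With a 16-byte file key the seed is 21 bytes and the result is capped at 16.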
@@ -144,10 +145,11 @@ class PDF::Reader
 out = Digest::MD5.digest(PassPadBytes.pack("C*") + @file_id)
 #zero doesn't matter -> so from 0-19
 20.times{ |i| out=RC4.new(xor_each_byte(keyBegins, i)).decrypt(out) }
+pass = @user_key[(0...16)] == out
 else
-
+pass = RC4.new(keyBegins).encrypt(PassPadBytes.pack("C*")) == @user_key
 end
-
+pass ? keyBegins : nil
 end
 
 def make_file_key( user_pass )
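The `else` branch above accepts a candidate key when RC4-encrypting the password padding string with it reproduces the stored user key, and the method now returns the key or `nil`. The sketch below mirrors that branch under stated assumptions. It writes RC4 out by hand so it needs no gems (the library depends on the ruby-rc4 gem instead), and `rc4`, `PAD`, `check_key` and the sample key are hypothetical names and data.

```ruby
# Minimal RC4, written out so the sketch has no gem dependency.
# RC4 is symmetric: the same call encrypts and decrypts.
def rc4(key, data)
  s = (0..255).to_a
  k = key.bytes
  j = 0
  256.times do |n| # key scheduling
    j = (j + s[n] + k[n % k.length]) % 256
    s[n], s[j] = s[j], s[n]
  end
  i = j = 0
  out = data.bytes.map do |byte| # keystream generation and XOR
    i = (i + 1) % 256
    j = (j + s[i]) % 256
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) % 256]
  end
  out.pack("C*")
end

# 32-byte password padding string defined by the PDF specification.
PAD = [0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
       0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
       0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
       0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A].pack("C*")

# Return the candidate key when it reproduces the stored user key, else nil,
# mirroring the `pass ? keyBegins : nil` line in the diff.
def check_key(candidate, user_key)
  rc4(candidate, PAD) == user_key ? candidate : nil
end

good = "\x10\x20\x30\x40\x50".b
stored_u = rc4(good, PAD) # what a PDF writer would have stored for this key
puts check_key(good, stored_u).inspect
puts check_key("wrong".b, stored_u).inspect
```

Returning `nil` for a failed check lets a caller tell a wrong password apart from a usable key without a separate boolean.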
metadata
CHANGED
@@ -1,19 +1,19 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-version: 1.0.0
-prerelease:
+version: 1.0.0
+prerelease:
 platform: ruby
 authors:
 - James Healy
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2012-01-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: rake
-requirement: &
+requirement: &24844240 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
 version: '0'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24844240
 - !ruby/object:Gem::Dependency
 name: roodi
-requirement: &
+requirement: &24843780 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
 version: '0'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24843780
 - !ruby/object:Gem::Dependency
 name: rspec
-requirement: &
+requirement: &24843280 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -43,10 +43,10 @@ dependencies:
 version: '2.3'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24843280
 - !ruby/object:Gem::Dependency
 name: ZenTest
-requirement: &
+requirement: &24842780 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -54,10 +54,10 @@ dependencies:
 version: 4.4.2
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24842780
 - !ruby/object:Gem::Dependency
 name: Ascii85
-requirement: &
+requirement: &24842320 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -65,10 +65,10 @@ dependencies:
 version: 1.0.0
 type: :runtime
 prerelease: false
-version_requirements: *
+version_requirements: *24842320
 - !ruby/object:Gem::Dependency
 name: ruby-rc4
-requirement: &
+requirement: &24841940 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -76,7 +76,7 @@ dependencies:
 version: '0'
 type: :runtime
 prerelease: false
-version_requirements: *
+version_requirements: *24841940
 description: The PDF::Reader library implements a PDF parser conforming as much as
 possible to the PDF specification from Adobe
 email:
@@ -152,13 +152,12 @@ files:
 - bin/pdf_callbacks
 homepage: http://github.com/yob/pdf-reader
 licenses: []
-post_install_message: ! "\n ********************************************\n\n
-
-
-
-
-
-new API for a spin, please send any feedback my way.\n\n ********************************************\n\n"
+post_install_message: ! "\n ********************************************\n\n v1.0.0
+of PDF::Reader introduced a new page-based API. There are extensive\n examples
+showing how to use it in the README and examples directory.\n\n For detailed documentation,
+check the rdocs for the PDF::Reader,\n PDF::Reader::Page and PDF::Reader::ObjectHash
+classes.\n\n The old API is marked as deprecated but will continue to work with
+no\n visible warnings for now.\n\n ********************************************\n\n"
 rdoc_options:
 - --title
 - PDF::Reader Documentation
@@ -176,9 +175,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
 required_rubygems_version: !ruby/object:Gem::Requirement
 none: false
 requirements:
-- - ! '
+- - ! '>='
 - !ruby/object:Gem::Version
-version:
+version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.11