pdf-reader 1.0.0.rc1 → 1.0.0
- data/CHANGELOG +4 -0
- data/README.rdoc +12 -17
- data/TODO +6 -17
- data/lib/pdf/reader.rb +7 -7
- data/lib/pdf/reader/filter.rb +1 -1
- data/lib/pdf/reader/form_xobject.rb +1 -1
- data/lib/pdf/reader/page.rb +1 -1
- data/lib/pdf/reader/page_text_receiver.rb +2 -14
- data/lib/pdf/reader/standard_security_handler.rb +5 -3
- metadata +23 -24
data/CHANGELOG
CHANGED
@@ -1,3 +1,7 @@
+v1.0.0 (16th January 2012)
+- support a new encryption variation
+- bugfix in PageTextRender (thanks Paul Gallagher)
+
 v1.0.0.rc1 (19th December 2011)
 - performance optimisations (all by Bernerd Schaefer)
 - some improvements to text extraction from form xobjects
data/README.rdoc
CHANGED
@@ -1,18 +1,3 @@
-= !PLEASE NOTE!
-
-All the examples below are for the latest (pre-release) version of the gem (0.11)
-
-If you have installed the gem via the rubygems with the command:
-
-$ gem install pdf-reader
-
-Then the examples below *will not work* for you. Please check the examples that
-come with previous version of the gem (0.10).
-
-If you want to install the latest version of this gem use the command:
-
-$ gem install pdf-reader --prerelease
-
 = Release Notes
 
 The PDF::Reader library implements a PDF parser conforming as much as possible
@@ -59,7 +44,8 @@ an IO stream:
 puts reader.info
 
 If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
-mode to ensure the file isn't mangled by ruby being 'helpful'.
+mode to ensure the file isn't mangled by ruby being 'helpful'. This is
+particularly important on windows and MRI >= 1.9.2.
 
 File.open("somefile.pdf", "rb") do |io|
 reader = PDF::Reader.new(io)
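To see what the "rb" advice guards against, here is a small sketch that uses only Ruby's standard library and makes no PDF::Reader calls. The payload and the temporary file are made up for illustration. It writes a few raw bytes, including a CRLF pair and a byte that is not valid UTF-8, then reads them back in binary mode.

```ruby
require 'tempfile'

# Hypothetical payload: a PDF-style header, a CRLF pair and a raw high byte.
RAW_BYTES = "%PDF-1.4\r\n\xE2".b

# Reads a whole file in binary mode, as the README recommends.
def read_binary(path)
  File.open(path, "rb") { |io| io.read }
end

# Writes the payload to a temporary file and reads it back.
def binary_round_trip
  Tempfile.create("rb-demo") do |file|
    file.binmode
    file.write(RAW_BYTES)
    file.flush
    read_binary(file.path)
  end
end

data = binary_round_trip
puts data.encoding       # binary reads are tagged with the binary encoding
puts data == RAW_BYTES   # the bytes come back untouched
```

In text mode the same read could translate line endings on Windows or tag the data with a text encoding, which is the kind of "help" the README warns about.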
@@ -111,6 +97,15 @@ to UTF-8 before it is passed back from PDF::Reader.
 Strings that contain binary data (like font blobs) will be marked as such on
 M17N aware VMs.
 
+= Former API
+
+Version 1.0.0 of PDF::Reader introduced a new page-based API that provides
+efficient and easy access to any page.
+
+The previous API is marked as deprecated but will continue to work for the
+time being. Eventually calls to the old API will begin triggering deprecation
+warnings before it is completely removed in version 2.0.0.
+
 = Exceptions
 
 There are two key exceptions that you will need to watch out for when processing a
@@ -119,7 +114,7 @@ PDF file:
 MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
 file should be valid, or that a corrupt file didn't raise an exception, please
 forward a copy of the file to the maintainers (preferably via the google group)
-and we
+and we will attempt to improve the code.
 
 UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
 support. Again, we welcome submissions of PDF files that exhibit these features to help
data/TODO
CHANGED
@@ -1,27 +1,19 @@
-
--
-- list implemented features
-- encrypted? tagged? bookmarks? annotated? optimised?
-- Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
+This stuff would be great
+- improved access to document level objects and data
 - bookmarks?
 - outline?
 - articles?
 - viewer prefs?
--
+- Improve the speed of Encoding#to_utf8
 - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
 poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
 from the Original encoding to Unicode.
 - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
 - Improve interpretation of non content stream data (ie metadata). recognise dates, etc
-- Fix inheritance of page attributes. Resources has been done, but plenty of other attributes
-are inheritable. See table 3.2.7 in the spec
 
-v0.9
-- Add a way to extract raster images
-- see XObjects section of spec (section 4.7)
-- Add a way to extract font data?
 
-
+
+This might be useful, more research required
 - Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
 - Will require significantly improved handling of CMaps, including creating a bunch of predefined ones
 
@@ -30,10 +22,7 @@ Sometime
 - Ship some extra receivers in the standard package, particuarly ones that are useful for running
 rspec over generated PDF files
 
--
-sensible way to convert them to unicode
-
-- Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
+- Add support for additional filters: CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode
 
 - Add support for additional encodings:
 - Identity-V(I *think* this relates to vertical text. Not sure how we'd support it sensibly)
data/lib/pdf/reader.rb
CHANGED
@@ -159,7 +159,7 @@ module PDF
 yield PDF::Reader.new(input, opts)
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 #
@@ -171,7 +171,7 @@ module PDF
 end
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the given string, sending events to the given receiver.
@@ -182,7 +182,7 @@ module PDF
 end
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the file with the given name, returning an unmarshalled ruby version of
@@ -194,7 +194,7 @@ module PDF
 }
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Parse the given string, returning an unmarshalled ruby version of represents
@@ -245,7 +245,7 @@ module PDF
 end
 
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Given an IO object that contains PDF data, parse it.
@@ -263,7 +263,7 @@ module PDF
 self
 end
 
-# DEPRECATED: this method was deprecated in version 0.
+# DEPRECATED: this method was deprecated in version 1.0.0 and will
 # eventually be removed
 #
 # Given an IO object that contains PDF data, return the contents of a single object
@@ -276,7 +276,7 @@ module PDF
 
 private
 
-# recursively convert strings from outside a content stream
+# recursively convert strings from outside a content stream into UTF-8
 #
 def doc_strings_to_utf8(obj)
 case obj
data/lib/pdf/reader/filter.rb
CHANGED
data/lib/pdf/reader/page.rb
CHANGED
data/lib/pdf/reader/page_text_receiver.rb
CHANGED
@@ -1,12 +1,6 @@
 # coding: utf-8
 
 require 'matrix'
-require 'yaml'
-
-begin
-require 'psych'
-rescue LoadError
-end
 
 module PDF
 class Reader
@@ -32,7 +26,7 @@ module PDF
 @font_stack = [build_fonts(page.fonts)]
 @xobject_stack = [page.xobjects]
 @content = {}
-@stack = [DEFAULT_GRAPHICS_STATE]
+@stack = [DEFAULT_GRAPHICS_STATE.dup]
 end
 
 def content
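The `.dup` in this hunk matters because a constant pushed straight onto the stack is the same object for every receiver, so a change made while processing one page could leak into the next. The sketch below illustrates that with a hypothetical `DEFAULT_STATE` hash and a toy `Receiver` class. Neither is the gem's real code, and the real graphics state has different keys.

```ruby
# Hypothetical stand-in for the gem's DEFAULT_GRAPHICS_STATE constant.
DEFAULT_STATE = { :font_size => 0, :leading => 0 }

# Toy receiver with a flag to switch between sharing and copying the default.
class Receiver
  attr_reader :stack

  def initialize(copy_default)
    @stack = [copy_default ? DEFAULT_STATE.dup : DEFAULT_STATE]
  end

  def font_size=(size)
    @stack.last[:font_size] = size
  end
end

shared = Receiver.new(false)
shared.font_size = 12
puts DEFAULT_STATE[:font_size]   # the constant itself was modified

DEFAULT_STATE[:font_size] = 0    # reset for the second half of the demo

copied = Receiver.new(true)
copied.font_size = 12
puts DEFAULT_STATE[:font_size]   # the constant is left alone
```

`dup` makes a shallow copy, which is enough in this sketch only because the values are plain numbers.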
@@ -235,8 +229,6 @@ module PDF
 # underlying device space.
 #
 def transform(point, z = 1)
-trm = text_rendering_matrix
-
 point.transform(text_rendering_matrix, z)
 end
 
@@ -286,7 +278,7 @@ module PDF
 end
 
 # private class for representing points on a cartesian plain. Used
-# to simplify maths
+# to simplify maths.
 #
 class Point < Struct.new(:x, :y)
 def transform(trm, z)
@@ -295,10 +287,6 @@ module PDF
 (trm[0,1] * x) + (trm[1,1] * y) + (trm[2,1] * z)
 )
 end
-
-def distance(point)
-Math.hypot(point.x - @x, point.y - @y)
-end
 end
 end
 end
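As a rough illustration of what `Point#transform` computes, the sketch below multiplies the row vector `[x, y, z]` by a 3x3 matrix. Only the expression for the second coordinate is visible in this diff; the first coordinate is assumed to follow the same pattern using column 0. Nested arrays indexed as `trm[row][col]` stand in for the `Matrix` object that the gem indexes as `trm[row, col]`.

```ruby
# Sketch of a 2D point transformed by a 3x3 matrix stored as nested arrays.
Point = Struct.new(:x, :y) do
  def transform(trm, z = 1)
    Point.new(
      (trm[0][0] * x) + (trm[1][0] * y) + (trm[2][0] * z),  # assumed x term
      (trm[0][1] * x) + (trm[1][1] * y) + (trm[2][1] * z)   # mirrors the diff
    )
  end
end

# Translation by (5, 7): the offsets live in the third row.
TRANSLATE = [[1, 0, 0],
             [0, 1, 0],
             [5, 7, 1]]

# Scale x by 2 and y by 3.
SCALE = [[2, 0, 0],
         [0, 3, 0],
         [0, 0, 1]]

moved  = Point.new(10, 20).transform(TRANSLATE)
scaled = Point.new(10, 20).transform(SCALE)
puts "#{moved.x}, #{moved.y}"
puts "#{scaled.x}, #{scaled.y}"
```

With `z = 1` the third row of the matrix acts as a translation, which is why the default argument in the receiver's `transform(point, z = 1)` is 1.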
data/lib/pdf/reader/standard_security_handler.rb
CHANGED
@@ -79,7 +79,8 @@ class PDF::Reader
 objKey = @encrypt_key.dup
 (0..2).each { |e| objKey << (ref.id >> e*8 & 0xFF ) }
 (0..1).each { |e| objKey << (ref.gen >> e*8 & 0xFF ) }
-
+length = objKey.length < 16 ? objKey.length : 16
+rc4 = RC4.new( Digest::MD5.digest(objKey)[(0...length)] )
 rc4.decrypt(buf)
 end
 
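The two added lines cap the RC4 key at 16 bytes of the MD5 digest, or fewer when the extended key itself is shorter than 16 bytes. A minimal sketch of that derivation, using only `Digest::MD5` from the standard library, might look like this. The method name `object_key` and the sample file keys are illustrative assumptions; the gem passes the result to `RC4.new` instead of returning it.

```ruby
require 'digest/md5'

# Derives the per-object key bytes: the file key plus the low three bytes of
# the object id and the low two bytes of the generation number, hashed with
# MD5 and truncated to min(extended key length, 16) bytes.
def object_key(file_key, obj_id, obj_gen)
  key = file_key.dup.force_encoding(Encoding::BINARY)
  (0..2).each { |e| key << (obj_id  >> (e * 8) & 0xFF) }
  (0..1).each { |e| key << (obj_gen >> (e * 8) & 0xFF) }
  length = [key.length, 16].min
  Digest::MD5.digest(key)[0...length]
end

short_key = "\x01\x02\x03\x04\x05".b   # hypothetical 40-bit file key
long_key  = ("\xAA" * 16).b            # hypothetical 128-bit file key

puts object_key(short_key, 12, 0).bytesize
puts object_key(long_key, 12, 0).bytesize
```

A 5-byte file key grows to 10 bytes once the object id and generation are appended, so only 10 digest bytes are used; a 16-byte file key grows to 21 bytes and hits the 16-byte cap.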
@@ -144,10 +145,11 @@ class PDF::Reader
 out = Digest::MD5.digest(PassPadBytes.pack("C*") + @file_id)
 #zero doesn't matter -> so from 0-19
 20.times{ |i| out=RC4.new(xor_each_byte(keyBegins, i)).decrypt(out) }
+pass = @user_key[(0...16)] == out
 else
-
+pass = RC4.new(keyBegins).encrypt(PassPadBytes.pack("C*")) == @user_key
 end
-
+pass ? keyBegins : nil
 end
 
 def make_file_key( user_pass )
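The control flow of this hunk (compute a candidate value, compare it with the stored user key, return the file key or nil) can be sketched without the ruby-rc4 gem. Everything below is an illustration under assumptions: `rc4` is a plain-Ruby RC4 written for this example, `xor_each_byte` is a guess at what the gem's helper of that name does, and `check_user_key` with its `revision` argument is a hypothetical wrapper around the two branches shown in the diff, whose actual condition is not visible here.

```ruby
require 'digest/md5'

# The 32-byte password padding string from the PDF specification.
PASS_PAD = [
  0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
  0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
  0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
  0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A
].pack("C*")

# Plain-Ruby RC4; encryption and decryption are the same operation.
def rc4(key, data)
  s = (0..255).to_a
  k = key.bytes
  j = 0
  256.times do |i|
    j = (j + s[i] + k[i % k.length]) % 256
    s[i], s[j] = s[j], s[i]
  end
  i = j = 0
  out = data.bytes.map do |byte|
    i = (i + 1) % 256
    j = (j + s[i]) % 256
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) % 256]
  end
  out.pack("C*")
end

# Assumed behaviour of the helper: XOR every key byte with the round number.
def xor_each_byte(key, round)
  key.bytes.map { |b| b ^ round }.pack("C*")
end

# Mirrors the 20-round loop in the diff for the first branch.
def revision3_value(key, file_id)
  out = Digest::MD5.digest(PASS_PAD + file_id)
  20.times { |i| out = rc4(xor_each_byte(key, i), out) }
  out
end

# Returns the key when it reproduces the stored user key, otherwise nil.
def check_user_key(key, user_key, file_id, revision)
  pass = if revision >= 3
           user_key[0...16] == revision3_value(key, file_id)
         else
           rc4(key, PASS_PAD) == user_key
         end
  pass ? key : nil
end

file_key = "\x01\x02\x03\x04\x05".b     # hypothetical file key
file_id  = "hypothetical-file-id".b     # hypothetical document id

rev2_user_key = rc4(file_key, PASS_PAD)
rev3_user_key = revision3_value(file_key, file_id) + ("\x00" * 16).b

p check_user_key(file_key, rev2_user_key, file_id, 2) == file_key
p check_user_key(file_key, rev3_user_key, file_id, 3) == file_key
p check_user_key("wrong".b, rev2_user_key, file_id, 2).nil?
```

Only the first 16 bytes of the stored value are compared in the first branch, which is why the trailing 16 bytes of `rev3_user_key` can be arbitrary in this demo.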
metadata
CHANGED
@@ -1,19 +1,19 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-version: 1.0.0
-prerelease:
+version: 1.0.0
+prerelease:
 platform: ruby
 authors:
 - James Healy
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2012-01-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: rake
-requirement: &
+requirement: &24844240 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
 version: '0'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24844240
 - !ruby/object:Gem::Dependency
 name: roodi
-requirement: &
+requirement: &24843780 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
 version: '0'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24843780
 - !ruby/object:Gem::Dependency
 name: rspec
-requirement: &
+requirement: &24843280 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -43,10 +43,10 @@ dependencies:
 version: '2.3'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24843280
 - !ruby/object:Gem::Dependency
 name: ZenTest
-requirement: &
+requirement: &24842780 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -54,10 +54,10 @@ dependencies:
 version: 4.4.2
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *24842780
 - !ruby/object:Gem::Dependency
 name: Ascii85
-requirement: &
+requirement: &24842320 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -65,10 +65,10 @@ dependencies:
 version: 1.0.0
 type: :runtime
 prerelease: false
-version_requirements: *
+version_requirements: *24842320
 - !ruby/object:Gem::Dependency
 name: ruby-rc4
-requirement: &
+requirement: &24841940 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ! '>='
@@ -76,7 +76,7 @@ dependencies:
 version: '0'
 type: :runtime
 prerelease: false
-version_requirements: *
+version_requirements: *24841940
 description: The PDF::Reader library implements a PDF parser conforming as much as
 possible to the PDF specification from Adobe
 email:
@@ -152,13 +152,12 @@ files:
 - bin/pdf_callbacks
 homepage: http://github.com/yob/pdf-reader
 licenses: []
-post_install_message: ! "\n ********************************************\n\n
-
-
-
-
-
-new API for a spin, please send any feedback my way.\n\n ********************************************\n\n"
+post_install_message: ! "\n ********************************************\n\n v1.0.0
+of PDF::Reader introduced a new page-based API. There are extensive\n examples
+showing how to use it in the README and examples directory.\n\n For detailed documentation,
+check the rdocs for the PDF::Reader,\n PDF::Reader::Page and PDF::Reader::ObjectHash
+classes.\n\n The old API is marked as deprecated but will continue to work with
+no\n visible warnings for now.\n\n ********************************************\n\n"
 rdoc_options:
 - --title
 - PDF::Reader Documentation
@@ -176,9 +175,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
 required_rubygems_version: !ruby/object:Gem::Requirement
 none: false
 requirements:
-- - ! '
+- - ! '>='
 - !ruby/object:Gem::Version
-version:
+version: '0'
 requirements: []
 rubyforge_project:
 rubygems_version: 1.8.11