pdf-reader 1.0.0.rc1 → 1.0.0

data/CHANGELOG CHANGED
@@ -1,3 +1,7 @@
+ v1.0.0 (16th January 2012)
+ - support a new encryption variation
+ - bugfix in PageTextRender (thanks Paul Gallagher)
+
  v1.0.0.rc1 (19th December 2011)
  - performance optimisations (all by Bernerd Schaefer)
  - some improvements to text extraction from form xobjects
@@ -1,18 +1,3 @@
- = !PLEASE NOTE!
-
- All the examples below are for the latest (pre-release) version of the gem (0.11)
-
- If you have installed the gem via the rubygems with the command:
-
- $ gem install pdf-reader
-
- Then the examples below *will not work* for you. Please check the examples that
- come with previous version of the gem (0.10).
-
- If you want to install the latest version of this gem use the command:
-
- $ gem install pdf-reader --prerelease
-
  = Release Notes
  
  The PDF::Reader library implements a PDF parser conforming as much as possible
@@ -59,7 +44,8 @@ an IO stream:
  puts reader.info
  
  If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
- mode to ensure the file isn't mangled by ruby being 'helpful'.
+ mode to ensure the file isn't mangled by ruby being 'helpful'. This is
+ particularly important on windows and MRI >= 1.9.2.
  
  File.open("somefile.pdf", "rb") do |io|
  reader = PDF::Reader.new(io)
@@ -111,6 +97,15 @@ to UTF-8 before it is passed back from PDF::Reader.
  Strings that contain binary data (like font blobs) will be marked as such on
  M17N aware VMs.
  
+ = Former API
+
+ Version 1.0.0 of PDF::Reader introduced a new page-based API that provides
+ efficient and easy access to any page.
+
+ The previous API is marked as deprecated but will continue to work for the
+ time being. Eventually calls to the old API will begin triggering deprecation
+ warnings before it is completely removed in version 2.0.0.
+
  = Exceptions
  
  There are two key exceptions that you will need to watch out for when processing a
@@ -119,7 +114,7 @@ PDF file:
  MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
  file should be valid, or that a corrupt file didn't raise an exception, please
  forward a copy of the file to the maintainers (preferably via the google group)
- and we can attempt to improve the code.
+ and we will attempt to improve the code.
  
  UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
  support. Again, we welcome submissions of PDF files that exhibit these features to help
data/TODO CHANGED
@@ -1,27 +1,19 @@
- v0.8
- - add extra callbacks
- - list implemented features
- - encrypted? tagged? bookmarks? annotated? optimised?
- - Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
+ This stuff would be great
+ - improved access to document level objects and data
  - bookmarks?
  - outline?
  - articles?
  - viewer prefs?
- - Don't remove comment when tokenising in the middle of a string
+ - Improve the speed of Encoding#to_utf8
  - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
  poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
  from the Original encoding to Unicode.
  - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
  - Improve interpretation of non content stream data (ie metadata). recognise dates, etc
- - Fix inheritance of page attributes. Resources has been done, but plenty of other attributes
- are inheritable. See table 3.2.7 in the spec
  
- v0.9
- - Add a way to extract raster images
- - see XObjects section of spec (section 4.7)
- - Add a way to extract font data?
  
- Sometime
+
+ This might be useful, more research required
  - Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
  - Will require significantly improved handling of CMaps, including creating a bunch of predefined ones
  
@@ -30,10 +22,7 @@ Sometime
  - Ship some extra receivers in the standard package, particuarly ones that are useful for running
  rspec over generated PDF files
  
- - When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
- sensible way to convert them to unicode
-
- - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
+ - Add support for additional filters: CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode
  
  - Add support for additional encodings:
  - Identity-V(I *think* this relates to vertical text. Not sure how we'd support it sensibly)
@@ -159,7 +159,7 @@ module PDF
  yield PDF::Reader.new(input, opts)
  end
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  #
@@ -171,7 +171,7 @@ module PDF
  end
  end
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  # Parse the given string, sending events to the given receiver.
@@ -182,7 +182,7 @@ module PDF
  end
  end
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  # Parse the file with the given name, returning an unmarshalled ruby version of
@@ -194,7 +194,7 @@ module PDF
  }
  end
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  # Parse the given string, returning an unmarshalled ruby version of represents
@@ -245,7 +245,7 @@ module PDF
  end
  
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  # Given an IO object that contains PDF data, parse it.
@@ -263,7 +263,7 @@ module PDF
  self
  end
  
- # DEPRECATED: this method was deprecated in version 0.11.0 and will
+ # DEPRECATED: this method was deprecated in version 1.0.0 and will
  # eventually be removed
  #
  # Given an IO object that contains PDF data, return the contents of a single object
@@ -276,7 +276,7 @@ module PDF
  
  private
  
- # recursively convert strings from outside a content stream intop UTF-8
+ # recursively convert strings from outside a content stream into UTF-8
  #
  def doc_strings_to_utf8(obj)
  case obj
@@ -272,7 +272,7 @@ class PDF::Reader
  row += 1
  end
  
- pixels.map { |row| row.flatten.pack("C*") }.join("")
+ pixels.map { |bytes| bytes.flatten.pack("C*") }.join("")
  end
  end
  end
@@ -76,7 +76,7 @@ module PDF
  params << token
  end
  end
- rescue EOFError => e
+ rescue EOFError
  raise MalformedPDFError, "End Of File while processing a content stream"
  end
  end
@@ -133,7 +133,7 @@ module PDF
  params << token
  end
  end
- rescue EOFError => e
+ rescue EOFError
  raise MalformedPDFError, "End Of File while processing a content stream"
  end
  
@@ -1,12 +1,6 @@
  # coding: utf-8
  
  require 'matrix'
- require 'yaml'
-
- begin
- require 'psych'
- rescue LoadError
- end
  
  module PDF
  class Reader
@@ -32,7 +26,7 @@ module PDF
  @font_stack = [build_fonts(page.fonts)]
  @xobject_stack = [page.xobjects]
  @content = {}
- @stack = [DEFAULT_GRAPHICS_STATE]
+ @stack = [DEFAULT_GRAPHICS_STATE.dup]
  end
  
  def content
@@ -235,8 +229,6 @@ module PDF
  # underlying device space.
  #
  def transform(point, z = 1)
- trm = text_rendering_matrix
-
  point.transform(text_rendering_matrix, z)
  end
  
@@ -286,7 +278,7 @@ module PDF
  end
  
  # private class for representing points on a cartesian plain. Used
- # to simplify maths in the MinPpi class.
+ # to simplify maths.
  #
  class Point < Struct.new(:x, :y)
  def transform(trm, z)
@@ -295,10 +287,6 @@ module PDF
  (trm[0,1] * x) + (trm[1,1] * y) + (trm[2,1] * z)
  )
  end
-
- def distance(point)
- Math.hypot(point.x - @x, point.y - @y)
- end
  end
  end
  end
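
The `.dup` added to `@stack = [DEFAULT_GRAPHICS_STATE.dup]` in the hunks above appears to stop state leaking between receiver instances through the shared constant. A minimal sketch of that failure mode follows. `DEFAULT_STATE` and `Receiver` are hypothetical, simplified stand-ins and not the gem's actual code.

```ruby
# Hypothetical, simplified stand-in for the gem's default state constant.
DEFAULT_STATE = { :char_spacing => 0, :leading => 0 }

# Hypothetical receiver; `copy` toggles whether the default is duplicated.
class Receiver
  def initialize(copy)
    @stack = [copy ? DEFAULT_STATE.dup : DEFAULT_STATE]
  end

  # Mutates the state on top of the stack, as a content stream operator might.
  def set_leading(value)
    @stack.last[:leading] = value
  end
end

# With dup, the mutation stays inside the receiver's own copy.
Receiver.new(true).set_leading(12)
puts DEFAULT_STATE[:leading]

# Without dup, the shared constant itself is changed for every later user.
Receiver.new(false).set_leading(12)
puts DEFAULT_STATE[:leading]
```

Note that `dup` is a shallow copy, so it only protects the top-level keys; any mutable objects stored as values would still be shared.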
@@ -79,7 +79,8 @@ class PDF::Reader
  objKey = @encrypt_key.dup
  (0..2).each { |e| objKey << (ref.id >> e*8 & 0xFF ) }
  (0..1).each { |e| objKey << (ref.gen >> e*8 & 0xFF ) }
- rc4 = RC4.new( Digest::MD5.digest(objKey) )
+ length = objKey.length < 16 ? objKey.length : 16
+ rc4 = RC4.new( Digest::MD5.digest(objKey)[(0...length)] )
  rc4.decrypt(buf)
  end
  
@@ -144,10 +145,11 @@ class PDF::Reader
  out = Digest::MD5.digest(PassPadBytes.pack("C*") + @file_id)
  #zero doesn't matter -> so from 0-19
  20.times{ |i| out=RC4.new(xor_each_byte(keyBegins, i)).decrypt(out) }
+ pass = @user_key[(0...16)] == out
  else
- out = RC4.new(keyBegins).encrypt(PassPadBytes.pack("C*"))
+ pass = RC4.new(keyBegins).encrypt(PassPadBytes.pack("C*")) == @user_key
  end
- @user_key[(0...16)] == out ? keyBegins : nil
+ pass ? keyBegins : nil
  end
  
  def make_file_key( user_pass )
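
The key-length change above matches my reading of the PDF specification's general encryption algorithm: the per-object RC4 key is the first n + 5 bytes of the MD5 digest, capped at 16, not the whole digest. A minimal sketch using only Ruby's standard library follows. `object_key` and its parameters are hypothetical names, not the gem's API, and the RC4 step is left out.

```ruby
require 'digest/md5'

# Hypothetical helper: derive the per-object key from a file key.
# file_key is a String of raw bytes; obj_id and gen are Integers.
def object_key(file_key, obj_id, gen)
  buf = file_key.b                              # binary copy of the file key
  buf << [obj_id & 0xFFFFFF].pack("V")[0, 3]    # low 3 bytes, little-endian
  buf << [gen & 0xFFFF].pack("v")               # low 2 bytes, little-endian
  # MD5 always yields 16 bytes; keep at most buf.length of them.
  length = [buf.length, 16].min
  Digest::MD5.digest(buf)[0, length]
end

puts object_key("\x01\x02\x03\x04\x05", 7, 0).length  # 40-bit file key
puts object_key("k" * 16, 7, 0).length                # 128-bit file key
```

With a 5-byte (40-bit) file key the derived key is 10 bytes long, which is the case the untruncated 16-byte digest in the old line would have got wrong.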
metadata CHANGED
@@ -1,19 +1,19 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
- version: 1.0.0.rc1
- prerelease: 6
+ version: 1.0.0
+ prerelease:
  platform: ruby
  authors:
  - James Healy
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2011-12-19 00:00:00.000000000 Z
+ date: 2012-01-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: rake
- requirement: &19650680 !ruby/object:Gem::Requirement
+ requirement: &24844240 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
  version: '0'
  type: :development
  prerelease: false
- version_requirements: *19650680
+ version_requirements: *24844240
  - !ruby/object:Gem::Dependency
  name: roodi
- requirement: &19650220 !ruby/object:Gem::Requirement
+ requirement: &24843780 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
  version: '0'
  type: :development
  prerelease: false
- version_requirements: *19650220
+ version_requirements: *24843780
  - !ruby/object:Gem::Dependency
  name: rspec
- requirement: &19649720 !ruby/object:Gem::Requirement
+ requirement: &24843280 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
  version: '2.3'
  type: :development
  prerelease: false
- version_requirements: *19649720
+ version_requirements: *24843280
  - !ruby/object:Gem::Dependency
  name: ZenTest
- requirement: &19649220 !ruby/object:Gem::Requirement
+ requirement: &24842780 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -54,10 +54,10 @@ dependencies:
  version: 4.4.2
  type: :development
  prerelease: false
- version_requirements: *19649220
+ version_requirements: *24842780
  - !ruby/object:Gem::Dependency
  name: Ascii85
- requirement: &19648740 !ruby/object:Gem::Requirement
+ requirement: &24842320 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
  version: 1.0.0
  type: :runtime
  prerelease: false
- version_requirements: *19648740
+ version_requirements: *24842320
  - !ruby/object:Gem::Dependency
  name: ruby-rc4
- requirement: &19648280 !ruby/object:Gem::Requirement
+ requirement: &24841940 !ruby/object:Gem::Requirement
  none: false
  requirements:
  - - ! '>='
@@ -76,7 +76,7 @@ dependencies:
  version: '0'
  type: :runtime
  prerelease: false
- version_requirements: *19648280
+ version_requirements: *24841940
  description: The PDF::Reader library implements a PDF parser conforming as much as
  possible to the PDF specification from Adobe
  email:
@@ -152,13 +152,12 @@ files:
  - bin/pdf_callbacks
  homepage: http://github.com/yob/pdf-reader
  licenses: []
- post_install_message: ! "\n ********************************************\n\n This
- is a beta release of PDF::Reader to gather feedback on the proposed\n API changes.\n\n
- \ The old API is marked as deprecated but will continue to work with no\n visible
- warnings for now.\n\n The new API is documented in the README and in rdoc for the
- PDF::Reader,\n PDF::Reader::Page and PDF::Reader::ObjectHash classes.\n\n Do not
- use this in production, stick to stable releases for that. If you do\n take the
- new API for a spin, please send any feedback my way.\n\n ********************************************\n\n"
+ post_install_message: ! "\n ********************************************\n\n v1.0.0
+ of PDF::Reader introduced a new page-based API. There are extensive\n examples
+ showing how to use it in the README and examples directory.\n\n For detailed documentation,
+ check the rdocs for the PDF::Reader,\n PDF::Reader::Page and PDF::Reader::ObjectHash
+ classes.\n\n The old API is marked as deprecated but will continue to work with
+ no\n visible warnings for now.\n\n ********************************************\n\n"
  rdoc_options:
  - --title
  - PDF::Reader Documentation
@@ -176,9 +175,9 @@ required_ruby_version: !ruby/object:Gem::Requirement
  required_rubygems_version: !ruby/object:Gem::Requirement
  none: false
  requirements:
- - - ! '>'
+ - - ! '>='
  - !ruby/object:Gem::Version
- version: 1.3.1
+ version: '0'
  requirements: []
  rubyforge_project:
  rubygems_version: 1.8.11
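
The `version`, `prerelease` and `required_rubygems_version` changes above are consistent with 1.0.0 no longer being a prerelease. A small sketch with RubyGems' `Gem::Version` shows how the two version strings compare. The link to the old `> 1.3.1` requirement is my assumption about how RubyGems treats prerelease gems; the diff itself does not state it.

```ruby
require 'rubygems'

rc    = Gem::Version.new("1.0.0.rc1")
final = Gem::Version.new("1.0.0")

puts rc.prerelease?     # a letter in the version string marks a prerelease
puts final.prerelease?
puts rc < final         # the release candidate sorts before the final release
```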