pdf-reader 0.7.7 → 0.8.0

data/CHANGELOG CHANGED
@@ -1,3 +1,7 @@
+ v0.8.0 (20th November 2009)
+ - Added PDF::Hash. It provides direct access to objects from a PDF file
+   with an API that emulates the standard Ruby hash
+
  v0.7.7 (11th September 2009)
  - Trigger callbacks contained in Form XObjects when we encounter them in a
    content stream
@@ -17,9 +17,7 @@ spare time to dedicate to adding new features.
  The code as it is works fairly well, and I offer it "as is". All patches, bug
  reports and sample PDFs are welcome - I will work on them when I can. If anyone
  is interested in adding features to PDF::Reader in their own effort to learn
- the PDF file format, I'll happy offer help qand support.
-
- I STRONGLY RECOMMEND NOT USING PDF::READER FOR YOUR PRODUCTION CODE.
+ the PDF file format, I'll happily offer help and support.

  = Installation

@@ -42,6 +40,10 @@ For a full list of the supported callback methods and a description of when they
  will be called, refer to PDF::Reader::Content. See the code examples below for a
  way to print a list of all the callbacks generated by a file to STDOUT.

+ There is also a class called PDF::Hash. This provides direct access to the objects
+ in a PDF file using a Ruby hash-like API. Check out the documentation for the class
+ for further information.
+
  = Text Encoding

  Internally, text can be stored inside a PDF in various encodings, including
data/Rakefile CHANGED
@@ -6,7 +6,7 @@ require 'rake/testtask'
  require "rake/gempackagetask"
  require 'spec/rake/spectask'

- PKG_VERSION = "0.7.7"
+ PKG_VERSION = "0.8.0"
  PKG_NAME = "pdf-reader"
  PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"

@@ -16,10 +16,11 @@ task :default => [ :spec ]
  # run all rspecs
  desc "Run all rspec files"
  Spec::Rake::SpecTask.new("spec") do |t|
-   t.spec_files = FileList['specs/**/*.rb']
-   t.rcov = true
-   t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
-   # t.rcov_opts = ["--exclude","spec.*\.rb"]
+   t.spec_files = FileList['specs/**/*.rb']
+   t.rcov = true
+   t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
+   t.ruby_opts << "-w"
+   # t.rcov_opts = ["--exclude","spec.*\.rb"]
  end

  # generate specdocs
@@ -0,0 +1,12 @@
+ #!/usr/bin/env ruby
+ # coding: utf-8
+
+ # get direct access to PDF objects
+ #
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+
+ require 'pdf/reader'
+
+ filename = File.dirname(__FILE__) + "/../specs/data/cairo-unicode.pdf"
+ hash = PDF::Hash.new(filename)
+ puts hash[3]
@@ -0,0 +1,202 @@
+ module PDF
+   # Provides low level access to the objects in a PDF file via a hash-like
+   # object.
+   #
+   # A PDF file can be viewed as a large hash map. It is a series of objects
+   # stored at exact byte offsets, and a table that maps object IDs to byte
+   # offsets. Given an object ID, looking up an object is an O(1) operation.
+   #
+   # Each PDF object can be mapped to a ruby object, so by passing an object
+   # ID to the [] method, a ruby representation of that object will be
+   # retrieved.
+   #
+   # The class behaves much like a standard Ruby hash, including the use of
+   # the Enumerable mixin. The key difference is no []= method - the hash
+   # is read only.
+   #
+   # == Basic Usage
+   #
+   #     h = PDF::Hash.new("somefile.pdf")
+   #     h[1]
+   #     => 3469
+   #
+   #     h[PDF::Reader::Reference.new(1,0)]
+   #     => 3469
+   #
+   class Hash
+     include Enumerable
+
+     attr_accessor :default
+     attr_reader :trailer
+
+     # Creates a new PDF::Hash object. input can be a string with a valid
+     # filename or an IO object.
+     #
+     def initialize(input)
+       if input.kind_of?(IO) || input.kind_of?(StringIO)
+         io = input
+       elsif File.file?(input.to_s)
+         if File.respond_to?(:binread)
+           input = File.binread(input.to_s)
+         else
+           input = File.read(input.to_s)
+         end
+         io = StringIO.new(input)
+       else
+         raise ArgumentError, "input must be an IO-like object or a filename"
+       end
+       buffer = PDF::Reader::Buffer.new(io)
+       @xref = PDF::Reader::XRef.new(buffer)
+       @trailer = @xref.load
+     end
+
+     # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
+     # object.
+     #
+     # If an int is used, the object with that ID and a generation number of 0 will
+     # be returned.
+     #
+     # If a PDF::Reader::Reference object is used the exact ID and generation number
+     # can be specified.
+     #
+     def [](key)
+       return default if key.to_i <= 0
+
+       begin
+         unless key.kind_of?(PDF::Reader::Reference)
+           key = PDF::Reader::Reference.new(key.to_i, 0)
+         end
+         @xref.object(key)
+       rescue
+         return default
+       end
+     end
+
+     # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
+     # object.
+     #
+     # If an int is used, the object with that ID and a generation number of 0 will
+     # be returned.
+     #
+     # If a PDF::Reader::Reference object is used the exact ID and generation number
+     # can be specified.
+     #
+     # local_default is the object that will be returned if the requested key doesn't
+     # exist.
+     #
+     def fetch(key, local_default = nil)
+       obj = self[key]
+       if obj
+         return obj
+       elsif local_default
+         return local_default
+       else
+         raise IndexError, "#{key} is invalid" if key.to_i <= 0
+       end
+     end
+
+     # iterate over each key, value. Just like a ruby hash.
+     #
+     def each(&block)
+       @xref.each do |ref, obj|
+         yield ref, obj
+       end
+     end
+     alias :each_pair :each
+
+     # iterate over each key. Just like a ruby hash.
+     #
+     def each_key(&block)
+       each do |id, obj|
+         yield id
+       end
+     end
+
+     # iterate over each value. Just like a ruby hash.
+     #
+     def each_value(&block)
+       each do |id, obj|
+         yield obj
+       end
+     end
+
+     # return the number of objects in the file. An object with multiple generations
+     # is counted once.
+     def size
+       @xref.size
+     end
+     alias :length :size
+
+     # return true if there are no objects in this file
+     #
+     def empty?
+       size == 0 ? true : false
+     end
+
+     # return true if the specified key exists in the file. key
+     # can be an int or a PDF::Reader::Reference
+     #
+     def has_key?(check_key)
+       # TODO update from O(n) to O(1)
+       each_key do |key|
+         if check_key.kind_of?(PDF::Reader::Reference)
+           return true if check_key == key
+         else
+           return true if check_key.to_i == key.id
+         end
+       end
+       return false
+     end
+     alias :include? :has_key?
+     alias :key? :has_key?
+     alias :member? :has_key?
+
+     # return true if the specified value exists in the file
+     #
+     def has_value?(value)
+       # TODO update from O(n) to O(1)
+       each_value do |obj|
+         return true if obj == value
+       end
+       return false
+     end
+     alias :value? :has_value?
+
+     def to_s
+       "<PDF::Hash size: #{self.size}>"
+     end
+
+     # return an array of all keys in the file
+     #
+     def keys
+       ret = []
+       each_key { |k| ret << k }
+       ret
+     end
+
+     # return an array of all values in the file
+     #
+     def values
+       ret = []
+       each_value { |v| ret << v }
+       ret
+     end
+
+     # return an array of all values from the specified keys
+     #
+     def values_at(*ids)
+       ids.map { |id| self[id] }
+     end
+
+     # return an array of arrays. Each sub array contains a key/value pair.
+     #
+     def to_a
+       ret = []
+       each do |id, obj|
+         ret << [id, obj]
+       end
+       ret
+     end
+
+   end
+ end
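The read-only, Enumerable-backed design described in the class comment above can be tried without a PDF file. The following is a minimal sketch, not the gem's code: `ReadOnlyTable` is a hypothetical class that wraps an in-memory Ruby hash instead of an xref table, and its `has_key?` assumes a direct lookup on that backing hash, which is one possible way to address the O(n) TODO.

```ruby
# Hypothetical stand-in for PDF::Hash: a read-only, hash-like wrapper.
# The backing store is a plain Ruby hash rather than an xref table.
class ReadOnlyTable
  include Enumerable

  attr_accessor :default

  def initialize(objects)
    @objects = objects.dup.freeze
  end

  # Return the stored object, or the default when the key is unknown.
  def [](key)
    @objects.fetch(key.to_i, default)
  end

  # Enumerable only needs #each; map, select, to_a and friends follow.
  def each
    @objects.each { |id, obj| yield id, obj }
  end

  # O(1) membership test, delegating to the backing hash.
  def has_key?(key)
    @objects.key?(key.to_i)
  end

  def size
    @objects.size
  end
end

table = ReadOnlyTable.new(1 => 3469, 2 => "a string object")
table[1]                     # looks up object 1
table.map { |id, _obj| id }  # Enumerable methods work via #each
table.respond_to?(:[]=)      # false: there is no writer method
```

Because only `each` is defined by hand, everything else (`map`, `select`, `to_a`) comes from the mixin, which is the same trade-off the class above makes.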
@@ -116,6 +116,7 @@ require 'pdf/reader/stream'
  require 'pdf/reader/text_receiver'
  require 'pdf/reader/token'
  require 'pdf/reader/xref'
+ require 'pdf/hash'

  class PDF::Reader
    ################################################################################
@@ -265,7 +265,10 @@ class PDF::Reader
  callback(:metadata, [info]) if info

  # new style xml metadata
- callback(:xml_metadata,@xref.object(root[:Metadata])) if root[:Metadata]
+ if root[:Metadata]
+   stream = @xref.object(root[:Metadata])
+   callback(:xml_metadata,stream.unfiltered_data)
+ end

  # page count
  if (pages = @xref.object(root[:Pages]))
@@ -327,7 +330,7 @@ class PDF::Reader
      callback(:begin_form_xobject)
      resources = @xref.object(xobject.hash[:Resources])
      walk_resources(resources) if resources
-     content_stream(xobject.to_s)
+     content_stream(xobject)
      callback(:end_form_xobject)
    end
  end
@@ -346,9 +349,10 @@ class PDF::Reader
  # Reads a PDF content stream and calls all the appropriate callback methods for the operators
  # it contains
  def content_stream (instructions)
-   @buffer = Buffer.new(StringIO.new(instructions))
-   @parser = Parser.new(@buffer, @xref)
-   @params = [] if @params.nil?
+   instructions = instructions.unfiltered_data if instructions.kind_of?(PDF::Reader::Stream)
+   @buffer = Buffer.new(StringIO.new(instructions))
+   @parser = Parser.new(@buffer, @xref)
+   @params ||= []

    while (token = @parser.parse_token(OPERATORS))
      if token.kind_of?(Token) and OPERATORS.has_key?(token)
@@ -437,7 +441,8 @@ class PDF::Reader
  if desc[:ToUnicode]
    # this stream is a cmap
    begin
-     @fonts[label].tounicode = PDF::Reader::CMap.new(desc[:ToUnicode])
+     stream = desc[:ToUnicode]
+     @fonts[label].tounicode = PDF::Reader::CMap.new(stream.unfiltered_data)
    rescue
      # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
    end
@@ -113,9 +113,10 @@ class PDF::Reader
  array_orig.each do |num|
    if tounicode && (code = tounicode.decode(num))
      array_enc << code
-   elsif tounicode || (tounicode.nil? && @to_unicode_required)
+   elsif tounicode || ( tounicode.nil? && defined?(@to_unicode_required) &&
+                        @to_unicode_required )
      array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
-   elsif @mapping && @mapping[num]
+   elsif defined?(@mapping) && @mapping && @mapping[num]
      array_enc << @mapping[num]
    else
      array_enc << num
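The `defined?` guards in this hunk presumably go with the `-w` flag added to the spec task in the Rakefile: on Ruby versions before 3.0, reading an instance variable that was never assigned prints a "not initialized" warning in verbose mode. A minimal sketch of how the guard behaves, using a hypothetical `Lookup` class that is not part of pdf-reader:

```ruby
# Hypothetical class, used only to show how defined? behaves on
# instance variables; it is not part of pdf-reader.
class Lookup
  def configure(mapping)
    @mapping = mapping
  end

  # defined? returns nil for an unassigned ivar, without reading it.
  def mapping_state
    defined?(@mapping)
  end

  # Same guard shape as the hunk above: test defined? first, then use.
  def translate(num)
    if defined?(@mapping) && @mapping && @mapping[num]
      @mapping[num]
    else
      num
    end
  end
end

lookup = Lookup.new
before = lookup.mapping_state   # nil
lookup.configure(65 => "A")
after = lookup.mapping_state    # "instance-variable"
```

An alternative with the same effect would be to assign `@mapping = nil` in the constructor, which is what the new `Stream#initialize` does for `@udata`.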
@@ -207,18 +207,6 @@ class PDF::Reader
    Error.str_assert(parse_token, "endstream")
    Error.str_assert(parse_token, "endobj")

-   if dict.has_key?(:Filter)
-     options = []
-
-     if dict.has_key?(:DecodeParms)
-       options = Array(dict[:DecodeParms])
-     end
-
-     Array(dict[:Filter]).each_with_index do |filter, index|
-       data = Filter.new(filter, options[index]).filter(data)
-     end
-   end
-
    PDF::Reader::Stream.new(dict, data)
  end
  ################################################################################
@@ -49,6 +49,21 @@ class PDF::Reader
      [self]
    end
    ################################################################################
+   # returns the ID of this reference. Use with caution, ignores the generation id
+   def to_i
+     self.id
+   end
+   def ==(obj)
+     return false unless obj.kind_of?(PDF::Reader::Reference)
+
+     if obj.id == self.id && obj.gen == self.gen
+       true
+     else
+       false
+     end
+   end
+   alias :eql? :==
+   ################################################################################
    end
    ################################################################################
  end
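The value-equality rule added to the reference class above (same id and same generation) can be sketched in isolation. `Ref` below is a hypothetical class, not the gem's `PDF::Reader::Reference`. The `hash` method is an addition that does not appear in this diff: Ruby hashes compare keys with `hash` and then `eql?`, so it is needed if equal references should find the same hash entry.

```ruby
# Hypothetical Ref class mirroring the id/gen comparison in the hunk above.
class Ref
  attr_reader :id, :gen

  def initialize(id, gen)
    @id  = id
    @gen = gen
  end

  # Two references are equal when both id and generation match.
  def ==(other)
    other.kind_of?(Ref) && other.id == id && other.gen == gen
  end
  alias_method :eql?, :==

  # Not in the diff: Ruby hashes call #hash before #eql?, so equal
  # references need equal hash codes to work as hash keys.
  def hash
    [id, gen].hash
  end

  # Like the original, this drops the generation number.
  def to_i
    id
  end
end

a = Ref.new(1, 0)
b = Ref.new(1, 0)
c = Ref.new(1, 1)
offsets = { a => 3469 }
offsets[b]  # finds the entry stored under a
```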
@@ -25,18 +25,40 @@

  class PDF::Reader
    ################################################################################
-   # An internal PDF::Reader class that represents a single token from a PDF file.
+   # An internal PDF::Reader class that represents a stream object from a PDF. Stream
+   # objects have 2 components, a dictionary that describes the content (size,
+   # compression, etc) and a stream of bytes.
    #
-   # Behaves exactly like a Ruby String - it basically exists for convenience.
-   class Stream < String
+   class Stream
      attr_accessor :hash
+     attr_reader :data
      ################################################################################
-     # Creates a new token with the specified value
-     def initialize (hash, val)
+     # Creates a new stream with the specified dictionary and data. The dictionary
+     # should be a standard ruby hash, the data should be a standard ruby string.
+     def initialize (hash, data)
        @hash = hash
-       super val
+       @data = data
+       @udata = nil
      end
      ################################################################################
+     # apply this stream's filters to its data and return the result.
+     def unfiltered_data
+       return @udata if @udata
+       @udata = data.dup
+
+       if hash.has_key?(:Filter)
+         options = []
+
+         if hash.has_key?(:DecodeParms)
+           options = Array(hash[:DecodeParms])
+         end
+
+         Array(hash[:Filter]).each_with_index do |filter, index|
+           @udata = Filter.new(filter, options[index]).filter(@udata)
+         end
+       end
+       @udata
+     end
    end
    ################################################################################
  end
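This hunk moves filter decoding out of the parser and into the stream object, where it runs on demand and is cached. A minimal sketch of that pattern follows, using only Ruby's standard `zlib` library. `LazyStream` and its `dict` reader are hypothetical names, the sketch handles only `:FlateDecode`, and it ignores `:DecodeParms`, so it is not a substitute for the gem's `Filter` class.

```ruby
require 'zlib'

# Hypothetical stand-in for PDF::Reader::Stream: keeps the raw bytes and
# decodes them only on request, caching the result.
class LazyStream
  attr_reader :dict, :data

  def initialize(dict, data)
    @dict  = dict
    @data  = data
    @udata = nil
  end

  # Apply each filter named in the dictionary, in order, at most once.
  def unfiltered_data
    return @udata if @udata

    @udata = Array(dict[:Filter]).inject(data.dup) do |bytes, filter|
      case filter
      when :FlateDecode then Zlib::Inflate.inflate(bytes)
      else raise ArgumentError, "unsupported filter: #{filter}"
      end
    end
  end
end

raw    = Zlib::Deflate.deflate("BT /F1 12 Tf (Hello) Tj ET")
stream = LazyStream.new({ :Filter => :FlateDecode, :Length => raw.size }, raw)
stream.data             # still compressed
stream.unfiltered_data  # inflated on first call, cached afterwards
```

Keeping the raw bytes available is what lets callers such as `PDF::Hash` hand back a stream object untouched, while `content_stream` and the CMap loader ask for the decoded form explicitly.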
@@ -36,6 +36,9 @@ class PDF::Reader
      @buffer = buffer
      @xref = {}
    end
+   def size
+     @xref.size
+   end
    ################################################################################
    # returns the PDF version of the current document. Technically this isn't part of the XRef
    # table, but it is one of the lowest level data items in the file, so we've lumped it in
@@ -136,6 +139,16 @@ class PDF::Reader
      raise InvalidObjectError, "Object #{ref.id}, Generation #{ref.gen} is invalid"
    end
    ################################################################################
+   # iterate over each object in the xref table
+   def each(&block)
+     ids = @xref.keys.sort
+     ids.each do |id|
+       gen = @xref[id].keys.sort[-1]
+       ref = PDF::Reader::Reference.new(id, gen)
+       yield ref, object(ref)
+     end
+   end
+   ################################################################################
    # Stores an offset value for a particular PDF object ID and revision number
    def store (id, gen, offset)
      (@xref[id] ||= {})[gen] ||= offset
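The new `each` walks a nested table of the form `{ id => { gen => offset } }` and yields only the highest generation of each object, which is why `size` counts an object with several generations once. A minimal sketch of that traversal on made-up data follows. `each_latest` is a hypothetical helper and the offsets are invented for illustration.

```ruby
# Hypothetical xref table shaped like the one built by XRef#store:
# { object id => { generation => byte offset } }. Offsets are made up.
xref = {
  3 => { 0 => 120 },
  1 => { 0 => 15, 1 => 980 },
  2 => { 0 => 64 }
}

# Visit ids in ascending order, keeping only the newest generation of each.
def each_latest(xref)
  xref.keys.sort.each do |id|
    gen = xref[id].keys.max
    yield id, gen, xref[id][gen]
  end
end

visited = []
each_latest(xref) { |id, gen, offset| visited << [id, gen, offset] }
# visited => [[1, 1, 980], [2, 0, 64], [3, 0, 120]]
```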
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
-   version: 0.7.7
+   version: 0.8.0
  platform: ruby
  authors:
  - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2009-09-11 00:00:00 +10:00
+ date: 2009-11-20 00:00:00 +11:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
@@ -36,38 +36,40 @@ extra_rdoc_files:
  - CHANGELOG
  - MIT-LICENSE
  files:
- - examples/extract_bates.rb
- - examples/text.rb
  - examples/page_counter_naive.rb
- - examples/callbacks.rb
+ - examples/rspec.rb
  - examples/metadata.rb
+ - examples/extract_bates.rb
+ - examples/hash.rb
+ - examples/callbacks.rb
+ - examples/text.rb
  - examples/page_counter_improved.rb
- - examples/rspec.rb
- - lib/pdf/reader.rb
- - lib/pdf/reader/buffer.rb
- - lib/pdf/reader/cmap.rb
+ - lib/pdf/reader/glyphlist.txt
  - lib/pdf/reader/content.rb
- - lib/pdf/reader/encoding.rb
  - lib/pdf/reader/error.rb
- - lib/pdf/reader/explore.rb
- - lib/pdf/reader/filter.rb
  - lib/pdf/reader/font.rb
- - lib/pdf/reader/glyphlist.txt
- - lib/pdf/reader/parser.rb
- - lib/pdf/reader/xref.rb
+ - lib/pdf/reader/print_receiver.rb
  - lib/pdf/reader/reference.rb
- - lib/pdf/reader/register_receiver.rb
+ - lib/pdf/reader/filter.rb
  - lib/pdf/reader/text_receiver.rb
+ - lib/pdf/reader/encoding.rb
+ - lib/pdf/reader/stream.rb
+ - lib/pdf/reader/register_receiver.rb
  - lib/pdf/reader/token.rb
- - lib/pdf/reader/encodings/mac_expert.txt
- - lib/pdf/reader/encodings/mac_roman.txt
- - lib/pdf/reader/encodings/pdf_doc.txt
+ - lib/pdf/reader/xref.rb
+ - lib/pdf/reader/cmap.rb
+ - lib/pdf/reader/buffer.rb
+ - lib/pdf/reader/explore.rb
+ - lib/pdf/reader/encodings/zapf_dingbats.txt
  - lib/pdf/reader/encodings/standard.txt
- - lib/pdf/reader/encodings/symbol.txt
+ - lib/pdf/reader/encodings/mac_roman.txt
+ - lib/pdf/reader/encodings/mac_expert.txt
  - lib/pdf/reader/encodings/win_ansi.txt
- - lib/pdf/reader/encodings/zapf_dingbats.txt
- - lib/pdf/reader/stream.rb
- - lib/pdf/reader/print_receiver.rb
+ - lib/pdf/reader/encodings/symbol.txt
+ - lib/pdf/reader/encodings/pdf_doc.txt
+ - lib/pdf/reader/parser.rb
+ - lib/pdf/hash.rb
+ - lib/pdf/reader.rb
  - Rakefile
  - README.rdoc
  - TODO