pdf-reader 0.7.7 → 0.8.0

data/CHANGELOG CHANGED
@@ -1,3 +1,7 @@
1
+ v0.8.0 (20th November 2009)
2
+ - Added PDF::Hash. It provides direct access to objects from a PDF file
3
+ with an API that emulates the standard Ruby hash
4
+
1
5
  v0.7.7 (11th September 2009)
2
6
  - Trigger callbacks contained in Form XObjects when we encounter them in a
3
7
  content stream
@@ -17,9 +17,7 @@ spare time to dedicate to adding new features.
17
17
  The code as it is works fairly well, and I offer it "as is". All patches, bug
18
18
  reports and sample PDFs are welcome - I will work on them when I can. If anyone
19
19
  is interested in adding features to PDF::Reader in their own effort to learn
20
- the PDF file format, I'll happy offer help qand support.
21
-
22
- I STRONGLY RECOMMEND NOT USING PDF::READER FOR YOUR PRODUCTION CODE.
20
+ the PDF file format, I'll happily offer help and support.
23
21
 
24
22
  = Installation
25
23
 
@@ -42,6 +40,10 @@ For a full list of the supported callback methods and a description of when they
42
40
  will be called, refer to PDF::Reader::Content. See the code examples below for a
43
41
  way to print a list of all the callbacks generated by a file to STDOUT.
44
42
 
43
+ There is also a class called PDF::Hash. This provides direct access to the objects
44
+ in a PDF file using a ruby hash-like API. Check out the documentation for the class
45
+ for further information.
46
+
45
47
  = Text Encoding
46
48
 
47
49
  Internally, text can be stored inside a PDF in various encodings, including
data/Rakefile CHANGED
@@ -6,7 +6,7 @@ require 'rake/testtask'
6
6
  require "rake/gempackagetask"
7
7
  require 'spec/rake/spectask'
8
8
 
9
- PKG_VERSION = "0.7.7"
9
+ PKG_VERSION = "0.8.0"
10
10
  PKG_NAME = "pdf-reader"
11
11
  PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
12
12
 
@@ -16,10 +16,11 @@ task :default => [ :spec ]
16
16
  # run all rspecs
17
17
  desc "Run all rspec files"
18
18
  Spec::Rake::SpecTask.new("spec") do |t|
19
- t.spec_files = FileList['specs/**/*.rb']
20
- t.rcov = true
21
- t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
22
- # t.rcov_opts = ["--exclude","spec.*\.rb"]
19
+ t.spec_files = FileList['specs/**/*.rb']
20
+ t.rcov = true
21
+ t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
22
+ t.ruby_opts << "-w"
23
+ # t.rcov_opts = ["--exclude","spec.*\.rb"]
23
24
  end
24
25
 
25
26
  # generate specdocs
@@ -0,0 +1,12 @@
1
+ #!/usr/bin/env ruby
2
+ # coding: utf-8
3
+
4
+ # get direct access to PDF objects
5
+ #
6
+ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
7
+
8
+ require 'pdf/reader'
9
+
10
+ filename = File.dirname(__FILE__) + "/../specs/data/cairo-unicode.pdf"
11
+ hash = PDF::Hash.new(filename)
12
+ puts hash[3]
@@ -0,0 +1,202 @@
1
+ module PDF
2
+ # Provides low level access to the objects in a PDF file via a hash-like
3
+ # object.
4
+ #
5
+ # A PDF file can be viewed as a large hash map. It is a series of objects
6
+ # stored at exact byte offsets, and a table that maps object IDs to byte
7
+ # offsets. Given an object ID, looking up an object is an O(1) operation.
8
+ #
9
+ # Each PDF object can be mapped to a ruby object, so by passing an object
10
+ # ID to the [] method, a ruby representation of that object will be
11
+ # retrieved.
12
+ #
13
+ # The class behaves much like a standard Ruby hash, including the use of
14
+ # the Enumerable mixin. The key difference is no []= method - the hash
15
+ # is read only.
16
+ #
17
+ # == Basic Usage
18
+ #
19
+ # h = PDF::Hash.new("somefile.pdf")
20
+ # h[1]
21
+ # => 3469
22
+ #
23
+ # h[PDF::Reader::Reference.new(1,0)]
24
+ # => 3469
25
+ #
26
+ class Hash
27
+ include Enumerable
28
+
29
+ attr_accessor :default
30
+ attr_reader :trailer
31
+
32
+ # Creates a new PDF::Hash object. input can be a string with a valid filename,
33
+ # a string containing a PDF file, or an IO object.
34
+ #
35
+ def initialize(input)
36
+ if input.kind_of?(IO) || input.kind_of?(StringIO)
37
+ io = input
38
+ elsif File.file?(input.to_s)
39
+ if File.respond_to?(:binread)
40
+ input = File.binread(input.to_s)
41
+ else
42
+ input = File.read(input.to_s)
43
+ end
44
+ io = StringIO.new(input)
45
+ else
46
+ raise ArgumentError, "input must be an IO-like object or a filename"
47
+ end
48
+ buffer = PDF::Reader::Buffer.new(io)
49
+ @xref = PDF::Reader::XRef.new(buffer)
50
+ @trailer = @xref.load
51
+ end
52
+
53
+ # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
54
+ # object.
55
+ #
56
+ # If an int is used, the object with that ID and a generation number of 0 will
57
+ # be returned.
58
+ #
59
+ # If a PDF::Reader::Reference object is used the exact ID and generation number
60
+ # can be specified.
61
+ #
62
+ def [](key)
63
+ return default if key.to_i <= 0
64
+
65
+ begin
66
+ unless key.kind_of?(PDF::Reader::Reference)
67
+ key = PDF::Reader::Reference.new(key.to_i, 0)
68
+ end
69
+ @xref.object(key)
70
+ rescue
71
+ return default
72
+ end
73
+ end
74
+
75
+ # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
76
+ # object.
77
+ #
78
+ # If an int is used, the object with that ID and a generation number of 0 will
79
+ # be returned.
80
+ #
81
+ # If a PDF::Reader::Reference object is used the exact ID and generation number
82
+ # can be specified.
83
+ #
84
+ # local_default is the object that will be returned if the requested key doesn't
85
+ # exist.
86
+ #
87
+ def fetch(key, local_default = nil)
88
+ obj = self[key]
89
+ if obj
90
+ return obj
91
+ elsif local_default
92
+ return local_default
93
+ else
94
+ raise IndexError, "#{key} is invalid" if key.to_i <= 0
95
+ end
96
+ end
97
+
98
+ # iterate over each key, value. Just like a ruby hash.
99
+ #
100
+ def each(&block)
101
+ @xref.each do |ref, obj|
102
+ yield ref, obj
103
+ end
104
+ end
105
+ alias :each_pair :each
106
+
107
+ # iterate over each key. Just like a ruby hash.
108
+ #
109
+ def each_key(&block)
110
+ each do |id, obj|
111
+ yield id
112
+ end
113
+ end
114
+
115
+ # iterate over each value. Just like a ruby hash.
116
+ #
117
+ def each_value(&block)
118
+ each do |id, obj|
119
+ yield obj
120
+ end
121
+ end
122
+
123
+ # return the number of objects in the file. An object with multiple generations
124
+ # is counted once.
125
+ def size
126
+ @xref.size
127
+ end
128
+ alias :length :size
129
+
130
+ # return true if there are no objects in this file
131
+ #
132
+ def empty?
133
+ size == 0 ? true : false
134
+ end
135
+
136
+ # return true if the specified key exists in the file. key
137
+ # can be an int or a PDF::Reader::Reference
138
+ #
139
+ def has_key?(check_key)
140
+ # TODO update from O(n) to O(1)
141
+ each_key do |key|
142
+ if check_key.kind_of?(PDF::Reader::Reference)
143
+ return true if check_key == key
144
+ else
145
+ return true if check_key.to_i == key.id
146
+ end
147
+ end
148
+ return false
149
+ end
150
+ alias :include? :has_key?
151
+ alias :key? :has_key?
152
+ alias :member? :has_key?
153
+
154
+ # return true if the specified value exists in the file
155
+ #
156
+ def has_value?(value)
157
+ # TODO update from O(n) to O(1)
158
+ each_value do |obj|
159
+ return true if obj == value
160
+ end
161
+ return false
162
+ end
163
+ alias :value? :has_key?
164
+
165
+ def to_s
166
+ "<PDF::Hash size: #{self.size}>"
167
+ end
168
+
169
+ # return an array of all keys in the file
170
+ #
171
+ def keys
172
+ ret = []
173
+ each_key { |k| ret << k }
174
+ ret
175
+ end
176
+
177
+ # return an array of all values in the file
178
+ #
179
+ def values
180
+ ret = []
181
+ each_value { |v| ret << v }
182
+ ret
183
+ end
184
+
185
+ # return an array of all values from the specified keys
186
+ #
187
+ def values_at(*ids)
188
+ ids.map { |id| self[id] }
189
+ end
190
+
191
+ # return an array of arrays. Each sub array contains a key/value pair.
192
+ #
193
+ def to_a
194
+ ret = []
195
+ each do |id, obj|
196
+ ret << [id, obj]
197
+ end
198
+ ret
199
+ end
200
+
201
+ end
202
+ end
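
The read-only, hash-like shape described in the class comment above can be sketched without a PDF file. This is a minimal stand-in, not PDF::Hash itself: the class name `ReadOnlyTable` and its in-memory object table are assumptions made for illustration. It shows how defining `each` and including Enumerable supplies the iteration helpers, and how omitting `[]=` keeps the object read only. Note that the diff aliases `value?` to `has_key?`; the sketch aliases it to `has_value?` instead, on the assumption that this was the intent.

```ruby
# Hypothetical stand-in for illustration; this is not PDF::Hash itself.
class ReadOnlyTable
  include Enumerable

  attr_accessor :default

  # objects is assumed to be a plain Ruby hash of { integer id => object }.
  def initialize(objects)
    @objects = objects.dup.freeze
  end

  # Look up by integer ID; missing or non-positive IDs return the default.
  def [](key)
    return default if key.to_i <= 0
    @objects.fetch(key.to_i, default)
  end

  # Yield id/object pairs in ascending ID order. Enumerable builds on this.
  def each
    @objects.keys.sort.each { |id| yield id, @objects[id] }
  end

  def has_value?(value)
    each { |_id, obj| return true if obj == value }
    false
  end
  alias value? has_value?
end

table = ReadOnlyTable.new({ 1 => 3469, 2 => "Catalog" })
puts table[1]
puts table.map { |id, _obj| id }.inspect
puts table.value?("Catalog")
puts table.respond_to?(:[]=)
```

Because there is no `[]=` method, callers can only read from the table, which matches the "hash is read only" note in the class comment.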
@@ -116,6 +116,7 @@ require 'pdf/reader/stream'
116
116
  require 'pdf/reader/text_receiver'
117
117
  require 'pdf/reader/token'
118
118
  require 'pdf/reader/xref'
119
+ require 'pdf/hash'
119
120
 
120
121
  class PDF::Reader
121
122
  ################################################################################
@@ -265,7 +265,10 @@ class PDF::Reader
265
265
  callback(:metadata, [info]) if info
266
266
 
267
267
  # new style xml metadata
268
- callback(:xml_metadata,@xref.object(root[:Metadata])) if root[:Metadata]
268
+ if root[:Metadata]
269
+ stream = @xref.object(root[:Metadata])
270
+ callback(:xml_metadata,stream.unfiltered_data)
271
+ end
269
272
 
270
273
  # page count
271
274
  if (pages = @xref.object(root[:Pages]))
@@ -327,7 +330,7 @@ class PDF::Reader
327
330
  callback(:begin_form_xobject)
328
331
  resources = @xref.object(xobject.hash[:Resources])
329
332
  walk_resources(resources) if resources
330
- content_stream(xobject.to_s)
333
+ content_stream(xobject)
331
334
  callback(:end_form_xobject)
332
335
  end
333
336
  end
@@ -346,9 +349,10 @@ class PDF::Reader
346
349
  # Reads a PDF content stream and calls all the appropriate callback methods for the operators
347
350
  # it contains
348
351
  def content_stream (instructions)
349
- @buffer = Buffer.new(StringIO.new(instructions))
350
- @parser = Parser.new(@buffer, @xref)
351
- @params = [] if @params.nil?
352
+ instructions = instructions.unfiltered_data if instructions.kind_of?(PDF::Reader::Stream)
353
+ @buffer = Buffer.new(StringIO.new(instructions))
354
+ @parser = Parser.new(@buffer, @xref)
355
+ @params ||= []
352
356
 
353
357
  while (token = @parser.parse_token(OPERATORS))
354
358
  if token.kind_of?(Token) and OPERATORS.has_key?(token)
@@ -437,7 +441,8 @@ class PDF::Reader
437
441
  if desc[:ToUnicode]
438
442
  # this stream is a cmap
439
443
  begin
440
- @fonts[label].tounicode = PDF::Reader::CMap.new(desc[:ToUnicode])
444
+ stream = desc[:ToUnicode]
445
+ @fonts[label].tounicode = PDF::Reader::CMap.new(stream.unfiltered_data)
441
446
  rescue
442
447
  # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
443
448
  end
@@ -113,9 +113,10 @@ class PDF::Reader
113
113
  array_orig.each do |num|
114
114
  if tounicode && (code = tounicode.decode(num))
115
115
  array_enc << code
116
- elsif tounicode || (tounicode.nil? && @to_unicode_required)
116
+ elsif tounicode || ( tounicode.nil? && defined?(@to_unicode_required) &&
117
+ @to_unicode_required )
117
118
  array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
118
- elsif @mapping && @mapping[num]
119
+ elsif defined?(@mapping) && @mapping && @mapping[num]
119
120
  array_enc << @mapping[num]
120
121
  else
121
122
  array_enc << num
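
The `defined?` guards in this hunk appear to go together with the `-w` flag added in the Rakefile hunk: on the Ruby versions of that era, reading an instance variable that was never assigned produced a "not initialized" warning when warnings were enabled. A minimal sketch of the guard follows; the class `MappingSketch` and its methods are hypothetical and are not part of pdf-reader.

```ruby
# Hypothetical class for illustration; not part of pdf-reader.
class MappingSketch
  # Translate num through @mapping when one has been loaded, else pass it through.
  def translate(num)
    # defined? returns nil for an instance variable that was never assigned,
    # so @mapping is only read once it actually exists.
    if defined?(@mapping) && @mapping && @mapping[num]
      @mapping[num]
    else
      num
    end
  end

  def load_mapping
    @mapping = { 1 => 0x2022 }
  end
end

sketch = MappingSketch.new
before = sketch.translate(1)
sketch.load_mapping
after = sketch.translate(1)
puts before
puts after
```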
@@ -207,18 +207,6 @@ class PDF::Reader
207
207
  Error.str_assert(parse_token, "endstream")
208
208
  Error.str_assert(parse_token, "endobj")
209
209
 
210
- if dict.has_key?(:Filter)
211
- options = []
212
-
213
- if dict.has_key?(:DecodeParms)
214
- options = Array(dict[:DecodeParms])
215
- end
216
-
217
- Array(dict[:Filter]).each_with_index do |filter, index|
218
- data = Filter.new(filter, options[index]).filter(data)
219
- end
220
- end
221
-
222
210
  PDF::Reader::Stream.new(dict, data)
223
211
  end
224
212
  ################################################################################
@@ -49,6 +49,21 @@ class PDF::Reader
49
49
  [self]
50
50
  end
51
51
  ################################################################################
52
+ # returns the ID of this reference. Use with caution, ignores the generation id
53
+ def to_i
54
+ self.id
55
+ end
56
+ def ==(obj)
57
+ return false unless obj.kind_of?(PDF::Reader::Reference)
58
+
59
+ if obj.id == self.id && obj.gen == self.gen
60
+ true
61
+ else
62
+ false
63
+ end
64
+ end
65
+ alias :eql? :==
66
+ ################################################################################
52
67
  end
53
68
  ################################################################################
54
69
  end
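
The value-equality rule added in this hunk (two references are equal when both ID and generation match) can be sketched with a stand-in class. `SketchRef` is hypothetical and is not PDF::Reader::Reference. One assumption goes beyond the hunk: Ruby's Hash uses both `hash` and `eql?` when looking up keys, so the sketch also defines `hash`. Whether the real class needs that depends on code outside this diff.

```ruby
# Hypothetical stand-in for illustration; not PDF::Reader::Reference itself.
class SketchRef
  attr_reader :id, :gen

  def initialize(id, gen)
    @id = id
    @gen = gen
  end

  # Returns only the ID; the generation number is ignored.
  def to_i
    id
  end

  # Equal only when the other object is a reference with the same id and gen.
  def ==(other)
    other.kind_of?(SketchRef) && other.id == id && other.gen == gen
  end
  alias eql? ==

  # Assumption: added so that equal references also work as Hash keys.
  def hash
    [id, gen].hash
  end
end

a = SketchRef.new(3, 0)
b = SketchRef.new(3, 0)
c = SketchRef.new(3, 1)
puts a == b
puts a == c
puts({ a => "page" }[b])
```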
@@ -25,18 +25,40 @@
25
25
 
26
26
  class PDF::Reader
27
27
  ################################################################################
28
- # An internal PDF::Reader class that represents a single token from a PDF file.
28
+ # An internal PDF::Reader class that represents a stream object from a PDF. Stream
29
+ # objects have 2 components, a dictionary that describes the content (size,
30
+ # compression, etc) and a stream of bytes.
29
31
  #
30
- # Behaves exactly like a Ruby String - it basically exists for convenience.
31
- class Stream < String
32
+ class Stream
32
33
  attr_accessor :hash
34
+ attr_reader :data
33
35
  ################################################################################
34
- # Creates a new token with the specified value
35
- def initialize (hash, val)
36
+ # Creates a new stream with the specified dictionary and data. The dictionary
37
+ # should be a standard ruby hash, the data should be a standard ruby string.
38
+ def initialize (hash, data)
36
39
  @hash = hash
37
- super val
40
+ @data = data
41
+ @udata = nil
38
42
  end
39
43
  ################################################################################
44
+ # apply this stream's filters to its data and return the result.
45
+ def unfiltered_data
46
+ return @udata if @udata
47
+ @udata = data.dup
48
+
49
+ if hash.has_key?(:Filter)
50
+ options = []
51
+
52
+ if hash.has_key?(:DecodeParms)
53
+ options = Array(hash[:DecodeParms])
54
+ end
55
+
56
+ Array(hash[:Filter]).each_with_index do |filter, index|
57
+ @udata = Filter.new(filter, options[index]).filter(@udata)
58
+ end
59
+ end
60
+ @udata
61
+ end
40
62
  end
41
63
  ################################################################################
42
64
  end
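
The lazy, cached decoding introduced by `unfiltered_data` can be sketched with Ruby's standard Zlib library. `SketchStream` below is a hypothetical stand-in, not PDF::Reader::Stream: it assumes only the FlateDecode filter and calls `Zlib::Inflate` directly instead of going through PDF::Reader::Filter.

```ruby
require 'zlib'

# Hypothetical stand-in for illustration; not PDF::Reader::Stream itself.
class SketchStream
  attr_reader :hash, :data

  def initialize(hash, data)
    @hash  = hash
    @data  = data
    @udata = nil
  end

  # Decode on first use and cache the result; the raw bytes stay in #data.
  def unfiltered_data
    return @udata if @udata
    @udata = data.dup
    Array(hash[:Filter]).each do |filter|
      # Assumption: only FlateDecode is handled in this sketch.
      raise ArgumentError, "unsupported filter #{filter}" unless filter == :FlateDecode
      @udata = Zlib::Inflate.inflate(@udata)
    end
    @udata
  end
end

raw    = Zlib::Deflate.deflate("BT /F1 12 Tf (Hello) Tj ET")
stream = SketchStream.new({ :Filter => :FlateDecode, :Length => raw.size }, raw)
puts stream.unfiltered_data
puts stream.data == raw
```

Keeping the raw bytes and decoding only on request means callers that never need the decoded form, such as code that only inspects the dictionary, do not pay for decompression.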
@@ -36,6 +36,9 @@ class PDF::Reader
36
36
  @buffer = buffer
37
37
  @xref = {}
38
38
  end
39
+ def size
40
+ @xref.size
41
+ end
39
42
  ################################################################################
40
43
  # returns the PDF version of the current document. Technically this isn't part of the XRef
41
44
  # table, but it is one of the lowest level data items in the file, so we've lumped it in
@@ -136,6 +139,16 @@ class PDF::Reader
136
139
  raise InvalidObjectError, "Object #{ref.id}, Generation #{ref.gen} is invalid"
137
140
  end
138
141
  ################################################################################
142
+ # iterate over each object in the xref table
143
+ def each(&block)
144
+ ids = @xref.keys.sort
145
+ ids.each do |id|
146
+ gen = @xref[id].keys.sort[-1]
147
+ ref = PDF::Reader::Reference.new(id, gen)
148
+ yield ref, object(ref)
149
+ end
150
+ end
151
+ ################################################################################
139
152
  # Stores an offset value for a particular PDF object ID and revision number
140
153
  def store (id, gen, offset)
141
154
  (@xref[id] ||= {})[gen] ||= offset
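
The iteration order used by the new `each` method (ascending object ID, highest generation only) can be sketched on a plain nested hash of the same shape that `store` builds. The IDs and offsets below are made-up sample data, not values from a real file.

```ruby
# Hypothetical data in the shape built by store:
# { object id => { generation => byte offset } }
xref = {
  7 => { 0 => 1200 },
  3 => { 0 => 15, 1 => 940 },
  5 => { 0 => 310 }
}

# Visit IDs in ascending order and keep only the highest generation of each.
latest = xref.keys.sort.map do |id|
  gen = xref[id].keys.sort[-1]
  [id, gen, xref[id][gen]]
end

latest.each { |id, gen, offset| puts "#{id} #{gen} R -> offset #{offset}" }
```

This is also why `size` counts an object with several generations once: the outer hash has one entry per object ID.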
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pdf-reader
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.7
4
+ version: 0.8.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
11
 
12
- date: 2009-09-11 00:00:00 +10:00
12
+ date: 2009-11-20 00:00:00 +11:00
13
13
  default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
@@ -36,38 +36,40 @@ extra_rdoc_files:
36
36
  - CHANGELOG
37
37
  - MIT-LICENSE
38
38
  files:
39
- - examples/extract_bates.rb
40
- - examples/text.rb
41
39
  - examples/page_counter_naive.rb
42
- - examples/callbacks.rb
40
+ - examples/rspec.rb
43
41
  - examples/metadata.rb
42
+ - examples/extract_bates.rb
43
+ - examples/hash.rb
44
+ - examples/callbacks.rb
45
+ - examples/text.rb
44
46
  - examples/page_counter_improved.rb
45
- - examples/rspec.rb
46
- - lib/pdf/reader.rb
47
- - lib/pdf/reader/buffer.rb
48
- - lib/pdf/reader/cmap.rb
47
+ - lib/pdf/reader/glyphlist.txt
49
48
  - lib/pdf/reader/content.rb
50
- - lib/pdf/reader/encoding.rb
51
49
  - lib/pdf/reader/error.rb
52
- - lib/pdf/reader/explore.rb
53
- - lib/pdf/reader/filter.rb
54
50
  - lib/pdf/reader/font.rb
55
- - lib/pdf/reader/glyphlist.txt
56
- - lib/pdf/reader/parser.rb
57
- - lib/pdf/reader/xref.rb
51
+ - lib/pdf/reader/print_receiver.rb
58
52
  - lib/pdf/reader/reference.rb
59
- - lib/pdf/reader/register_receiver.rb
53
+ - lib/pdf/reader/filter.rb
60
54
  - lib/pdf/reader/text_receiver.rb
55
+ - lib/pdf/reader/encoding.rb
56
+ - lib/pdf/reader/stream.rb
57
+ - lib/pdf/reader/register_receiver.rb
61
58
  - lib/pdf/reader/token.rb
62
- - lib/pdf/reader/encodings/mac_expert.txt
63
- - lib/pdf/reader/encodings/mac_roman.txt
64
- - lib/pdf/reader/encodings/pdf_doc.txt
59
+ - lib/pdf/reader/xref.rb
60
+ - lib/pdf/reader/cmap.rb
61
+ - lib/pdf/reader/buffer.rb
62
+ - lib/pdf/reader/explore.rb
63
+ - lib/pdf/reader/encodings/zapf_dingbats.txt
65
64
  - lib/pdf/reader/encodings/standard.txt
66
- - lib/pdf/reader/encodings/symbol.txt
65
+ - lib/pdf/reader/encodings/mac_roman.txt
66
+ - lib/pdf/reader/encodings/mac_expert.txt
67
67
  - lib/pdf/reader/encodings/win_ansi.txt
68
- - lib/pdf/reader/encodings/zapf_dingbats.txt
69
- - lib/pdf/reader/stream.rb
70
- - lib/pdf/reader/print_receiver.rb
68
+ - lib/pdf/reader/encodings/symbol.txt
69
+ - lib/pdf/reader/encodings/pdf_doc.txt
70
+ - lib/pdf/reader/parser.rb
71
+ - lib/pdf/hash.rb
72
+ - lib/pdf/reader.rb
71
73
  - Rakefile
72
74
  - README.rdoc
73
75
  - TODO