pdf-reader 0.8.6 → 0.9.0

data/CHANGELOG CHANGED
@@ -1,3 +1,20 @@
1
+ v0.9.0 (19th November 2010)
2
+ - support for pdf 1.5+ files that use object and xref streams
3
+ - support streams that use a flate filter with the predictor option
4
+ - ensure all content instructions are parsed when split over multiple streams
5
+ - thanks to Jack Rusher for reporting
6
+ - Various string parsing bugs
7
+ - some character conversions to utf-8 were failing (thanks Andrea Barisani)
8
+ - hashes with nested hex strings were tokenising wrongly (thanks Evan Arnold)
9
+ - escaping bug in tokenising of literal strings (thanks David Westerink)
10
+ - Fix a bug that prevented PDFs with white space after the EOF marker from loading
11
+ - thanks to Solomon White for reporting the issue
12
+ - Add support for de-filtering some LZW compressed streams
13
+ - thanks to Jose Ignacio Rubio Iradi for the patch
14
+ - some small speed improvements
15
+ - API CHANGE: PDF::Hash renamed to PDF::Reader::ObjectHash
16
+ - having a class named Hash was confusing for users
17
+
1
18
  v0.8.6 (27th August 2010)
2
19
  - new method: hash#page_references
3
20
  - returns references to all page objects, gives rapid access to objects
@@ -8,17 +8,6 @@ The PDF 1.7 specification is a weighty document and not all aspects are
8
8
  currently supported. I welcome submission of PDF files that exhibit
9
9
  unsupported aspects of the spec to assist with improving our support.
10
10
 
11
- = Development Status
12
-
13
- I adopted this library in 2007 when I was learning the fundamentals of the PDF
14
- spec. I do not currently use it in my day to day work and I just don't have the
15
- spare time to dedicate to adding new features.
16
-
17
- The code as it is works fairly well, and I offer it "as is". All patches, bug
18
- reports and sample PDFs are welcome - I will work on them when I can. If anyone
19
- is interested in adding features to PDF::Reader in their own effort to learn
20
- the PDF file format, I'll happy offer help and support.
21
-
22
11
  = Installation
23
12
 
24
13
  The recommended installation method is via Rubygems.
@@ -37,12 +26,12 @@ methods do is entirely up to you - save the text, extract images, count pages,
37
26
  read metadata, whatever.
38
27
 
39
28
  For a full list of the supported callback methods and a description of when they
40
- will be called, refer to PDF::Reader::Content. See the code examples below for a
29
+ will be called, refer to PDF::Reader::PagesStrategy. See the examples directory for a
41
30
  way to print a list of all the callbacks generated by a file to STDOUT.
42
31
 
43
- There is also a class called PDF::Hash. This provides direct access to the objects
44
- in a PDF file using a ruby hash-like API. Checkout the documentation for the class
45
- for further information.
32
+ There is also a class called PDF::Reader::ObjectHash. This provides direct
33
+ access to the objects in a PDF file using a ruby hash-like API. Checkout the
34
+ documentation for the class for further information.
46
35
 
47
36
  = Text Encoding
48
37
 
@@ -50,6 +39,9 @@ Internally, text can be stored inside a PDF in various encodings, including
50
39
  zingbats, win-1252, mac roman and a form of Unicode. To avoid confusion, all
51
40
  text will be converted to UTF-8 before it is passed back from PDF::Reader.
52
41
 
42
+ Strings that contain binary data (like font blobs) will be marked as such on
43
+ M17N aware VMs.
44
+
53
45
  = Exceptions
54
46
 
55
47
  There are two key exceptions that you will need to watch out for when processing a
data/Rakefile CHANGED
@@ -1,43 +1,21 @@
1
1
  require "rubygems"
2
+ require "bundler"
3
+ Bundler.setup
4
+
2
5
  require 'rake'
3
- require 'rake/clean'
4
6
  require 'rake/rdoctask'
5
- require 'rake/testtask'
6
- require "rake/gempackagetask"
7
- require 'spec/rake/spectask'
8
-
9
- PKG_VERSION = "0.8.6"
10
- PKG_NAME = "pdf-reader"
11
- PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
7
+ require 'rspec/core/rake_task'
8
+ require 'roodi'
9
+ require 'roodi_task'
12
10
 
13
11
  desc "Default Task"
14
12
  task :default => [ :spec ]
15
13
 
16
14
  # run all rspecs
17
15
  desc "Run all rspec files"
18
- Spec::Rake::SpecTask.new("spec") do |t|
19
- t.spec_files = FileList['specs/**/*.rb']
20
- t.rcov = true
21
- t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
22
- t.ruby_opts << "-w"
23
- # t.rcov_opts = ["--exclude","spec.*\.rb"]
24
- end
25
-
26
- # generate specdocs
27
- desc "Generate Specdocs"
28
- Spec::Rake::SpecTask.new("specdocs") do |t|
29
- t.spec_files = FileList['specs/**/*.rb']
30
- t.spec_opts = ["--format", "rdoc"]
31
- t.out = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/specdoc.rd'
32
- end
33
-
34
- # generate failing spec report
35
- desc "Generate failing spec report"
36
- Spec::Rake::SpecTask.new("spec_report") do |t|
37
- t.spec_files = FileList['specs/**/*.rb']
38
- t.spec_opts = ["--format", "html", "--diff"]
39
- t.out = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/spec_report.html'
40
- t.fail_on_error = false
16
+ RSpec::Core::RakeTask.new("spec") do |t|
17
+ t.rspec_opts = ["--color", "--format progress"]
18
+ t.ruby_opts = "-w"
41
19
  end
42
20
 
43
21
  # Generate the RDoc documentation
@@ -53,35 +31,4 @@ Rake::RDocTask.new("doc") do |rdoc|
53
31
  rdoc.options << "--inline-source"
54
32
  end
55
33
 
56
- # a gemspec for packaging this library
57
- # RSpec files aren't included, as they depend on the PDF files,
58
- # which will make the gem filesize irritatingly large
59
- spec = Gem::Specification.new do |spec|
60
- spec.name = PKG_NAME
61
- spec.version = PKG_VERSION
62
- spec.platform = Gem::Platform::RUBY
63
- spec.summary = "A library for accessing the content of PDF files"
64
- spec.files = Dir.glob("{examples,lib}/**/**/*") + ["Rakefile"]
65
- spec.require_path = "lib"
66
- spec.bindir = "bin"
67
- spec.executables << "pdf_object"
68
- spec.executables << "pdf_text"
69
- spec.executables << "pdf_list_callbacks"
70
- spec.has_rdoc = true
71
- spec.extra_rdoc_files = %w{README.rdoc TODO CHANGELOG MIT-LICENSE }
72
- spec.rdoc_options << '--title' << 'PDF::Reader Documentation' <<
73
- '--main' << 'README.rdoc' << '-q'
74
- spec.author = "James Healy"
75
- spec.email = "jimmy@deefa.com"
76
- spec.rubyforge_project = "pdf-reader"
77
- spec.homepage = "http://github.com/yob/pdf-reader"
78
- spec.description = "The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe"
79
- spec.add_dependency('Ascii85', '>=0.9')
80
- end
81
-
82
- # package the library into a gem
83
- desc "Generate a gem for pdf-reader"
84
- Rake::GemPackageTask.new(spec) do |pkg|
85
- pkg.need_zip = true
86
- pkg.need_tar = true
87
- end
34
+ RoodiTask.new 'roodi', ['lib/**/*.rb']
data/TODO CHANGED
@@ -1,8 +1,6 @@
1
1
  v0.8
2
- - optimise PDF::Reader::Reference#from_buffer
3
- - ruby-prof shows the match() call in this function is a real killer
4
2
  - add extra callbacks
5
- - list implemented features
3
+ - list implemented features
6
4
  - encrypted? tagged? bookmarks? annotated? optimised?
7
5
  - Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
8
6
  - bookmarks?
@@ -15,7 +13,6 @@ v0.8
15
13
  from the Original encoding to Unicode.
16
14
  - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
17
15
  - Improve interpretation of non content stream data (ie metadata). recognise dates, etc
18
- - Support Cross Reference Streams (spec 3.4.7)
19
16
  - Fix inheritance of page attributes. Resources has been done, but plenty of other attributes
20
17
  are inheritable. See table 3.2.7 in the spec
21
18
 
@@ -33,15 +30,16 @@ Sometime
33
30
  - Ship some extra receivers in the standard package, particularly ones that are useful for running
34
31
  rspec over generated PDF files
35
32
 
36
- - When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
33
+ - When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
37
34
  sensible way to convert them to unicode
38
35
 
39
36
  - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
40
37
 
41
- - Add support for additional encodings:
42
- - PDFDocEncoding
38
+ - Add support for additional encodings:
43
39
  - Identity-V(I *think* this relates to vertical text. Not sure how we'd support it sensibly)
44
40
 
45
41
  - Investigate how R->L text is handled
46
42
 
47
- - Add support for object streams (spec section 3.4.6)
43
+ - fix all callbacks to only ever return basic ruby objects (strings, ints,
44
+ arrays, symbols, hashes, etc). No PDF::Reader::Reference or
45
+ PDF::Reader::Font, etc.
@@ -30,6 +30,9 @@ begin
30
30
  case obj
31
31
  when Hash, Array
32
32
  puts obj.inspect
33
+ when PDF::Reader::Stream
34
+ puts obj.hash.inspect
35
+ puts obj.unfiltered_data
33
36
  else
34
37
  puts obj
35
38
  end
@@ -17,9 +17,11 @@ class PageTextReceiver
17
17
  end
18
18
  end
19
19
 
20
- def show_text(string, *params)
20
+ def show_text(*params)
21
21
  @content = "" if @content.nil?
22
- @content << string
22
+ params.each do |str|
23
+ @content << str.to_s
24
+ end
23
25
  end
24
26
 
25
27
  # there's a few text callbacks, so make sure we process them all
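The revised show_text in the hunk above no longer assumes the first argument is the string; it appends every argument it receives. A minimal standalone sketch of that pattern, with no PDF parsing involved (the class name TextCollector and the sample arguments are made up for illustration and are not part of pdf-reader):

```ruby
# Hypothetical stand-in for the example receiver; not part of pdf-reader.
class TextCollector
  attr_reader :content

  def initialize
    @content = String.new
  end

  # Accept any number of arguments and append each one as a string,
  # mirroring the splat-based signature introduced in the diff.
  def show_text(*params)
    params.each { |str| @content << str.to_s }
  end
end

collector = TextCollector.new
collector.show_text("Hello, ")
collector.show_text("wor", "ld")
puts collector.content
```

Calling to_s on each argument keeps the receiver tolerant of non-string parameters, which is presumably why the example was changed this way.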
@@ -0,0 +1,108 @@
1
+ # coding: utf-8
2
+
3
+ # This demonstrates a way to extract some images (those based on the JPG or
4
+ # TIFF formats) from a PDF. There are other ways to store images, so
5
+ # it may need to be expanded for real world usage, but it should serve
6
+ # as a good guide.
7
+ #
8
+ # Thanks to Jack Rusher for the initial version of this example.
9
+ #
10
+ # USAGE:
11
+ #
12
+ # ruby extract_images.rb somefile.pdf
13
+
14
+ require 'pdf/reader'
15
+
16
+ module ExtractImages
17
+
18
+ class Receiver
19
+ attr_reader :count
20
+
21
+ def initialize
22
+ @count = 0
23
+ end
24
+
25
+ def resource_xobject(name, stream)
26
+ return unless stream.hash[:Subtype] == :Image
27
+ increment_count
28
+
29
+ case stream.hash[:Filter]
30
+ when :CCITTFaxDecode
31
+ ExtractImages::Tiff.new(stream).save("#{count}-#{name}.tif")
32
+ when :DCTDecode
33
+ ExtractImages::Jpg.new(stream).save("#{count}-#{name}.jpg")
34
+ else
35
+ $stderr.puts "unrecognized image filter '#{stream.hash[:Filter]}'!"
36
+ end
37
+ end
38
+
39
+ def increment_count
40
+ @count += 1
41
+ end
42
+ private :increment_count
43
+
44
+ end
45
+
46
+ class Jpg
47
+ attr_reader :stream
48
+
49
+ def initialize(stream)
50
+ @stream = stream
51
+ end
52
+
53
+ def save(filename)
54
+ w = stream.hash[:Width]
55
+ h = stream.hash[:Height]
56
+ puts "#{filename}: h=#{h}, w=#{w}"
57
+ File.open(filename, "wb") { |file| file.write stream.data }
58
+ end
59
+ end
60
+
61
+ class Tiff
62
+ attr_reader :stream
63
+
64
+ def initialize(stream)
65
+ @stream = stream
66
+ end
67
+
68
+ def save(filename)
69
+ if stream.hash[:DecodeParms][:K] <= 0
70
+ save_group_four(filename)
71
+ else
72
+ $stderr.puts "#{filename}: CCITT non-group 4/2D image."
73
+ end
74
+ end
75
+
76
+ private
77
+
78
+ # Group 4, 2D
79
+ def save_group_four(filename)
80
+ k = stream.hash[:DecodeParms][:K]
81
+ h = stream.hash[:Height]
82
+ w = stream.hash[:Width]
83
+ bpc = stream.hash[:BitsPerComponent]
84
+ mask = stream.hash[:ImageMask]
85
+ len = stream.hash[:Length]
86
+ cols = stream.hash[:DecodeParms][:Columns]
87
+ puts "#{filename}: h=#{h}, w=#{w}, bpc=#{bpc}, mask=#{mask}, len=#{len}, cols=#{cols}, k=#{k}"
88
+
89
+ # Synthesize a TIFF header
90
+ long_tag = lambda {|tag, value| [ tag, 4, 1, value ].pack( "ssII" ) }
91
+ short_tag = lambda {|tag, value| [ tag, 3, 1, value ].pack( "ssII" ) }
92
+ # header = byte order, version magic, offset of directory, directory count,
93
+ # followed by a series of tags containing metadata: 259 is a magic number for
94
+ # the compression type; 273 is the offset of the image data.
95
+ tiff = [ 73, 73, 42, 8, 5 ].pack("ccsIs") \
96
+ + short_tag.call( 256, cols ) \
97
+ + short_tag.call( 257, h ) \
98
+ + short_tag.call( 259, 4 ) \
99
+ + long_tag.call( 273, (10 + (5*12)) ) \
100
+ + long_tag.call( 279, len) \
101
+ + stream.data
102
+ File.open(filename, "wb") { |file| file.write tiff }
103
+ end
104
+ end
105
+ end
106
+
107
+ receiver = ExtractImages::Receiver.new
108
+ PDF::Reader.file(ARGV[0], receiver)
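The offset `10 + (5*12)` passed to tag 273 in save_group_four above is the size of the synthesized header plus five directory entries. A small sketch that checks the arithmetic, assuming explicit little-endian pack directives ("C", "v", "V") as stand-ins for the native "c", "s" and "I" used in the example; the sizes agree on typical platforms with 16-bit shorts and 32-bit ints, and the width value 100 is arbitrary:

```ruby
# Header: byte order "II" (73, 73), magic number 42, directory offset 8,
# then the directory entry count (5).
header = [73, 73, 42, 8, 5].pack("CCvVv")

# One directory entry: tag 256 (ImageWidth), type 3 (SHORT), count 1, value.
# The value 100 is an arbitrary placeholder.
entry = [256, 3, 1, 100].pack("vvVV")

puts header.bytesize                       # size of the header portion
puts entry.bytesize                        # size of one directory entry
puts header.bytesize + 5 * entry.bytesize  # where the image data begins
```

With five entries (tags 256, 257, 259, 273 and 279) the image data starts immediately after the last entry, which is the offset the example writes into tag 273.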
@@ -8,5 +8,5 @@ $LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
8
8
  require 'pdf/reader'
9
9
 
10
10
  filename = File.dirname(__FILE__) + "/../specs/data/cairo-unicode.pdf"
11
- hash = PDF::Hash.new(filename)
11
+ hash = PDF::Reader::ObjectHash.new(filename)
12
12
  puts hash[3]
@@ -3,6 +3,9 @@
3
3
 
4
4
  # Extract all text from a single PDF
5
5
 
6
+ require 'rubygems'
7
+ require 'pdf/reader'
8
+
6
9
  class PageTextReceiver
7
10
  attr_accessor :content
8
11
 
@@ -1,232 +1,15 @@
1
- module PDF
2
- # Provides low level access to the objects in a PDF file via a hash-like
3
- # object.
4
- #
5
- # A PDF file can be viewed as a large hash map. It is a series of objects
6
- # stored at an exact byte offsets, and a table that maps object IDs to byte
7
- # offsets. Given an object ID, looking up an object is an O(1) operation.
8
- #
9
- # Each PDF object can be mapped to a ruby object, so by passing an object
10
- # ID to the [] method, a ruby representation of that object will be
11
- # retrieved.
12
- #
13
- # The class behaves much like a standard Ruby hash, including the use of
14
- # the Enumerable mixin. The key difference is no []= method - the hash
15
- # is read only.
16
- #
17
- # == Basic Usage
18
- #
19
- # h = PDF::Hash.new("somefile.pdf")
20
- # h[1]
21
- # => 3469
22
- #
23
- # h[PDF::Reader::Reference.new(1,0)]
24
- # => 3469
25
- #
26
- class Hash
27
- include Enumerable
28
-
29
- attr_accessor :default
30
- attr_reader :trailer, :version
1
+ # coding: utf-8
31
2
 
32
- # Creates a new PDF:Hash object. input can be a string with a valid filename,
33
- # a string containing a PDF file, or an IO object.
34
- #
3
+ module PDF
4
+ class Hash < PDF::Reader::ObjectHash # :nodoc:
35
5
  def initialize(input)
36
- if input.kind_of?(IO) || input.kind_of?(StringIO)
37
- io = input
38
- elsif File.file?(input.to_s)
39
- if File.respond_to?(:binread)
40
- input = File.binread(input.to_s)
41
- else
42
- input = File.read(input.to_s)
43
- end
44
- io = StringIO.new(input)
45
- else
46
- raise ArgumentError, "input must be an IO-like object or a filename"
47
- end
48
- @version = read_version(io)
49
- @xref = PDF::Reader::XRef.new(io)
50
- @trailer = @xref.load
51
- end
52
-
53
- # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
54
- # object.
55
- #
56
- # If an int is used, the object with that ID and a generation number of 0 will
57
- # be returned.
58
- #
59
- # If a PDF::Reader::Reference object is used the exact ID and generation number
60
- # can be specified.
61
- #
62
- def [](key)
63
- return default if key.to_i <= 0
64
-
65
- begin
66
- unless key.kind_of?(PDF::Reader::Reference)
67
- key = PDF::Reader::Reference.new(key.to_i, 0)
68
- end
69
- @xref.object(key)
70
- rescue
71
- return default
72
- end
73
- end
74
-
75
- # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
76
- # object.
77
- #
78
- # If an int is used, the object with that ID and a generation number of 0 will
79
- # be returned.
80
- #
81
- # If a PDF::Reader::Reference object is used the exact ID and generation number
82
- # can be specified.
83
- #
84
- # local_deault is the object that will be returned if the requested key doesn't
85
- # exist.
86
- #
87
- def fetch(key, local_default = nil)
88
- obj = self[key]
89
- if obj
90
- return obj
91
- elsif local_default
92
- return local_default
93
- else
94
- raise IndexError, "#{key} is invalid" if key.to_i <= 0
95
- end
96
- end
97
-
98
- # iterate over each key, value. Just like a ruby hash.
99
- #
100
- def each(&block)
101
- @xref.each do |ref, obj|
102
- yield ref, obj
103
- end
104
- end
105
- alias :each_pair :each
106
-
107
- # iterate over each key. Just like a ruby hash.
108
- #
109
- def each_key(&block)
110
- each do |id, obj|
111
- yield id
112
- end
113
- end
114
-
115
- # iterate over each value. Just like a ruby hash.
116
- #
117
- def each_value(&block)
118
- each do |id, obj|
119
- yield obj
120
- end
121
- end
122
-
123
- # return the number of objects in the file. An object with multiple generations
124
- # is counted once.
125
- def size
126
- @xref.size
127
- end
128
- alias :length :size
129
-
130
- # return true if there are no objects in this file
131
- #
132
- def empty?
133
- size == 0 ? true : false
134
- end
135
-
136
- # return true if the specified key exists in the file. key
137
- # can be an int or a PDF::Reader::Reference
138
- #
139
- def has_key?(check_key)
140
- # TODO update from O(n) to O(1)
141
- each_key do |key|
142
- if check_key.kind_of?(PDF::Reader::Reference)
143
- return true if check_key == key
144
- else
145
- return true if check_key.to_i == key.id
146
- end
147
- end
148
- return false
6
+ warn "DEPRECATION NOTICE: PDF::Hash has been deprecated, use PDF::Reader::ObjectHash instead"
7
+ super
149
8
  end
150
- alias :include? :has_key?
151
- alias :key? :has_key?
152
- alias :member? :has_key?
153
9
 
154
- # return true if the specifiedvalue exists in the file
155
- #
156
- def has_value?(value)
157
- # TODO update from O(n) to O(1)
158
- each_value do |obj|
159
- return true if obj == value
160
- end
161
- return false
10
+ def version
11
+ warn "DEPRECATION NOTICE: PDF::Hash#version has been deprecated, use PDF::Reader::ObjectHash#pdf_version instead"
12
+ pdf_version
162
13
  end
163
- alias :value? :has_key?
164
-
165
- def to_s
166
- "<PDF::Hash size: #{self.size}>"
167
- end
168
-
169
- # return an array of all keys in the file
170
- #
171
- def keys
172
- ret = []
173
- each_key { |k| ret << k }
174
- ret
175
- end
176
-
177
- # return an array of all values in the file
178
- #
179
- def values
180
- ret = []
181
- each_value { |v| ret << v }
182
- ret
183
- end
184
-
185
- # return an array of all values from the specified keys
186
- #
187
- def values_at(*ids)
188
- ids.map { |id| self[id] }
189
- end
190
-
191
- # return an array of arrays. Each sub array contains a key/value pair.
192
- #
193
- def to_a
194
- ret = []
195
- each do |id, obj|
196
- ret << [id, obj]
197
- end
198
- ret
199
- end
200
-
201
- # returns an array of PDF::Reader::References. Each reference in the
202
- # array points a Page object, one for each page in the PDF. The first
203
- # reference is page 1, second reference is page 2, etc.
204
- #
205
- def page_references
206
- root = fetch(trailer[:Root])
207
- @page_references ||= get_page_objects(root[:Pages]).flatten
208
- end
209
-
210
- private
211
-
212
- # returns a nested array of object references for all pages in this object store.
213
- #
214
- def get_page_objects(ref)
215
- obj = fetch(ref)
216
-
217
- if obj[:Type] == :Page
218
- ref
219
- elsif obj[:Type] == :Pages
220
- obj[:Kids].map { |kid| get_page_objects(kid) }
221
- end
222
- end
223
-
224
- def read_version(io)
225
- io.seek(0)
226
- m, version = *io.read(10).match(/PDF-(\d.\d)/)
227
- io.seek(0)
228
- version
229
- end
230
-
231
14
  end
232
15
  end
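The replacement file above keeps PDF::Hash alive as a thin subclass that warns and then defers to PDF::Reader::ObjectHash. The same deprecation-shim pattern can be sketched without the gem; NewStore and OldStore below are hypothetical stand-ins, and the "1.4" return value is a fixed placeholder rather than anything read from a file:

```ruby
# Hypothetical stand-in for the renamed class.
class NewStore
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def pdf_version
    "1.4" # placeholder value for illustration only
  end
end

# Hypothetical stand-in for the deprecated name: it inherits everything,
# warns on construction, and maps the old method name onto the new one.
class OldStore < NewStore
  def initialize(source)
    warn "DEPRECATION NOTICE: OldStore has been deprecated, use NewStore instead"
    super # bare super forwards the original arguments
  end

  def version
    warn "DEPRECATION NOTICE: OldStore#version has been deprecated, use NewStore#pdf_version instead"
    pdf_version
  end
end

store = OldStore.new("somefile.pdf")
puts store.version
```

Existing callers keep working and see a warning on STDERR, while new code can move to the renamed class and method at its own pace.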