RubyGems - pdf-reader - Versions diffs - 1.0.0.beta1 → 1.0.0.rc1 - Mend

pdf-reader 1.0.0.beta1 → 1.0.0.rc1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

data/CHANGELOG +7 -0
data/README.rdoc +38 -4
data/Rakefile +45 -1
data/examples/extract_fonts.rb +1 -0
data/examples/extract_images.rb +9 -14
data/lib/pdf/reader.rb +50 -2
data/lib/pdf/reader/buffer.rb +20 -20
data/lib/pdf/reader/cmap.rb +2 -0
data/lib/pdf/reader/encoding.rb +16 -17
data/lib/pdf/reader/filter.rb +1 -1
data/lib/pdf/reader/font.rb +3 -4
data/lib/pdf/reader/form_xobject.rb +8 -7
data/lib/pdf/reader/glyph_hash.rb +1 -0
data/lib/pdf/reader/glyphlist.txt +122 -0
data/lib/pdf/reader/lzw.rb +2 -2
data/lib/pdf/reader/object_hash.rb +30 -4
data/lib/pdf/reader/page.rb +10 -58
data/lib/pdf/reader/page_text_receiver.rb +26 -17
data/lib/pdf/reader/pages_strategy.rb +1 -1
data/lib/pdf/reader/parser.rb +40 -21
data/lib/pdf/reader/resource_methods.rb +60 -0
data/lib/pdf/reader/xref.rb +1 -1
metadata +75 -104

data/CHANGELOG CHANGED

@@ -1,3 +1,10 @@
+v1.0.0.rc1 (19th December 2011)
+- performance optimisations (all by Bernerd Schaefer)
+- some improvements to text extraction from form xobjects
+- assume invalid font encodings are StandardEncoding
+- use binary mode when opening PDFs to stop ruby being helpful and transcoding
+    bytes for us
 v1.0.0.beta1 (6th October 2011)
 - ensure inline images that contain "EI" are correctly parsed
   (thanks Bernard Schaefer)

data/README.rdoc CHANGED

@@ -1,3 +1,20 @@
+= !PLEASE NOTE!
+All the examples below are for the latest (pre-release) version of the gem (0.11)
+If you have installed the gem via the rubygems with the command:
+    $ gem install pdf-reader
+Then the examples below *will not work* for you. Please check the examples that
+come with previous version of the gem (0.10).
+If you want to install the latest version of this gem use the command:
+    $ gem install pdf-reader --prerelease
+= Release Notes
 The PDF::Reader library implements a PDF parser conforming as much as possible
 to the PDF specification from Adobe.
@@ -32,7 +49,7 @@ this object.
     puts reader.metadata
     puts reader.page_count
-PDF::Reader.new can accept an IO stream or a filename. Here's an example with
+PDF::Reader.new accepts an IO stream or a filename. Here's an example with
 an IO stream:
     require 'open-uri'
@@ -41,6 +58,14 @@ an IO stream:
     reader = PDF::Reader.new(io)
     puts reader.info
+If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
+mode to ensure the file isn't mangled by ruby being 'helpful'.
+    File.open("somefile.pdf", "rb") do |io|
+      reader = PDF::Reader.new(io)
+      puts reader.info
+    end
 PDF is a page based file format, so most visible information is available via
 page-based iteration
@@ -80,9 +105,8 @@ The second method is preferred to increase the effectiveness of internal caching
 = Text Encoding
-Internally, text can be stored inside a PDF in various encodings, including
-zingbats, win-1252, mac roman and a form of Unicode. To avoid confusion, all
-text will be converted to UTF-8 before it is passed back from PDF::Reader.
+Regardless of the internal encoding used in the PDF all text will be converted
+to UTF-8 before it is passed back from PDF::Reader.
 Strings that contain binary data (like font blobs) will be marked as such on
 M17N aware VMs.
@@ -107,6 +131,16 @@ don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 Any other exceptions should be considered bugs in either PDF::Reader (please
 report it!).
+= PDF Integrity
+Windows developers may run into problems when running specs due to MalformedPDFError's
+This is usually because CRLF characters are automatically added to some of the PDF's in
+the spec folder when you checkout a branch from Git.
+To remove any invalid CRLF characters added while checking out a branch from Git, run:
+    rake fix_integrity
 = Maintainers
 - James Healy <mailto:jimmy@deefa.com>
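
Note on the "rb" recommendation added to the README above: the sketch below is not part of the gem. It assumes Ruby 1.9 or later, and read_raw_bytes is a hypothetical helper name. It shows that a file opened in binary mode comes back byte-for-byte, tagged with the binary encoding rather than a text encoding.

```ruby
require "tempfile"

# Hypothetical helper (not part of pdf-reader): read a file's raw bytes.
# "rb" disables newline translation and marks the result as binary.
def read_raw_bytes(path)
  File.open(path, "rb") { |io| io.read }
end

# A few bytes that look like the start of a PDF, including a CRLF pair and
# two high bytes that are not valid UTF-8 on their own.
SAMPLE_BYTES = "%PDF-1.4\r\n\xE2\xE3".b

Tempfile.create("sample") do |tmp|
  tmp.binmode
  tmp.write(SAMPLE_BYTES)
  tmp.flush

  bytes = read_raw_bytes(tmp.path)
  puts bytes == SAMPLE_BYTES                 # content is unchanged
  puts bytes.encoding == Encoding::BINARY    # tagged as binary data
end
```

The same block form works when handing the IO object to PDF::Reader.new, as the README hunk shows.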

data/Rakefile CHANGED

@@ -18,7 +18,7 @@ RSpec::Core::RakeTask.new("spec") do |t|
   t.ruby_opts = "-w"
 end
-# Genereate the RDoc documentation
+# Generate the RDoc documentation
 desc "Create documentation"
 Rake::RDocTask.new("doc") do |rdoc|
   rdoc.title = "pdf-reader"
@@ -32,3 +32,47 @@ Rake::RDocTask.new("doc") do |rdoc|
 end
 RoodiTask.new 'roodi', ['lib/**/*.rb']
+desc "Create a YAML file of integrity info for PDFs in the spec suite"
+task :integrity_yaml do
+  data = {}
+  Dir.glob("spec/data/**/*.*").each do |path|
+    path_without_spec = path.gsub("spec/","")
+    data[path_without_spec] = {
+      :bytes => File.size(path),
+      :md5   => `md5sum "#{path}"`.split.first
+    } if File.file?(path)
+  end
+  File.open("spec/integrity.yml","wb") { |f| f.write YAML.dump(data)}
+end
+desc "Remove any CRLF characters added by Git"
+task :fix_integrity do
+  yaml_path = File.expand_path("spec/integrity.yml",File.dirname(__FILE__))
+  integrity = YAML.load_file(yaml_path)
+  Dir.glob("spec/data/**/*.pdf").each do |path|
+    path_relative_to_spec_folder = path[/.+(data\/.+)/,1]
+    item = integrity[path_relative_to_spec_folder]
+    if File.file?(path)
+      file_contents = File.open(path, "rb") { |f| f.read }
+      md5 = Digest::MD5.hexdigest(file_contents)
+      unless md5 == item[:md5]
+        #file md5 does not match what was checked into Git
+        if Digest::MD5.hexdigest(file_contents.gsub(/\r\n/, "\n")) == item[:md5]
+          #pdf file is fixable by swapping CRLF characters
+          File.open(path, "wb") do |f|
+            f.write(file_contents.gsub(/\r\n/, "\n"))
+          end
+          puts "Replaced CRLF characters in: #{path}"
+        else
+          puts "Failed to fix: #{path}"
+        end
+      end
+    end
+  end
+end
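
Note on the fix_integrity task above: the decision it makes for each file can be isolated as a small function. This is a sketch under assumptions, not code from the gem; repair_crlf is a hypothetical name, and the sample bytes are invented for illustration.

```ruby
require "digest/md5"

# Hypothetical helper mirroring the per-file decision in fix_integrity.
# Returns the bytes that should be on disk, or nil when replacing CRLF
# with LF does not reproduce the recorded checksum.
def repair_crlf(contents, expected_md5)
  # Already intact: nothing to do.
  return contents if Digest::MD5.hexdigest(contents) == expected_md5

  # Try undoing a CRLF conversion and re-check the checksum.
  candidate = contents.gsub(/\r\n/, "\n")
  Digest::MD5.hexdigest(candidate) == expected_md5 ? candidate : nil
end

original = "%PDF-1.4\n1 0 obj\nendobj\n".b
expected = Digest::MD5.hexdigest(original)
mangled  = original.gsub("\n", "\r\n")   # what a CRLF-converting checkout might produce

puts repair_crlf(mangled, expected) == original
puts repair_crlf("unrelated bytes".b, expected).nil?
```

Comparing against a recorded checksum, rather than blindly replacing CRLF pairs, matters because a PDF stream may legitimately contain the byte pair 0x0D 0x0A.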

data/examples/extract_fonts.rb CHANGED

@@ -1,3 +1,4 @@
+#!/usr/bin/env ruby
 # coding: utf-8
 # This demonstrates a way to extract TTF fonts from a PDF. It could be expanded

data/examples/extract_images.rb CHANGED

@@ -1,3 +1,4 @@
+#!/usr/bin/env ruby
 # coding: utf-8
 # This demonstrates a way to extract some images (those based on the JPG or
@@ -14,9 +15,7 @@ module ExtractImages
   class Extractor
     def page(page)
-      count = 0
-      process_resources(page, page.resources, count)
+      process_page(page, 0)
     end
     private
@@ -25,17 +24,13 @@ module ExtractImages
       @complete_refs ||= {}
     end
-    def process_resources(page, resources, count)
-      xobjects = resources[:XObject]
-      return count if xobjects.nil?
+    def process_page(page, count)
+      xobjects = page.xobjects
+      return count if xobjects.empty?
       xobjects.each do |name, stream|
-        next if complete_refs[stream]
-        complete_refs[stream] = true
-        stream = page.objects.deref(stream)
-        if stream.hash[:Subtype] == :Image
+        case stream.hash[:Subtype]
+        when :Image then
           count += 1
           case stream.hash[:Filter]
@@ -46,8 +41,8 @@ module ExtractImages
           else
             ExtractImages::Raw.new(stream).save("#{page.number}-#{count}-#{name}.tif")
           end
-        elsif stream.hash[:Subtype] == :Form
-          count = process_resources(page, PDF::Reader::FormXObject.new(page, stream).resources, count)
+        when :Form then
+          count = process_page(PDF::Reader::FormXObject.new(page, stream), count)
         end
       end
       count

data/lib/pdf/reader.rb CHANGED

@@ -118,12 +118,19 @@ module PDF
     end
     def info
-      @objects.deref(@objects.trailer[:Info])
+      dict = @objects.deref(@objects.trailer[:Info])
+      doc_strings_to_utf8(dict)
     end
     def metadata
       stream = @objects.deref(root[:Metadata])
-      stream ? stream.unfiltered_data : nil
+      if stream.nil?
+        nil
+      else
+        xml = stream.unfiltered_data
+        xml.force_encoding("utf-8") if xml.respond_to?(:force_encoding)
+        xml
+      end
     end
     def page_count
@@ -269,6 +276,46 @@ module PDF
     private
+    # recursively convert strings from outside a content stream intop UTF-8
+    #
+    def doc_strings_to_utf8(obj)
+      case obj
+      when ::Hash then
+        {}.tap { |new_hash|
+          obj.each do |key, value|
+            new_hash[key] = doc_strings_to_utf8(value)
+          end
+        }
+      when Array then
+        obj.map { |item| doc_strings_to_utf8(item) }
+      when String then
+        if obj[0,2].unpack("C*") == [254, 255]
+          utf16_to_utf8(obj)
+        else
+          pdfdoc_to_utf8(obj)
+        end
+      else
+        obj
+      end
+    end
+    # TODO find a PDF I can use to spec this behaviour
+    #
+    def pdfdoc_to_utf8(obj)
+      obj.force_encoding("utf-8") if obj.respond_to?(:force_encoding)
+      obj
+    end
+    # one day we'll all run on a 1.9 compatible VM and I can just do this with
+    # String#encode
+    #
+    def utf16_to_utf8(obj)
+      str = obj[2, obj.size]
+      str = str.unpack("n*").pack("U*")
+      str.force_encoding("utf-8") if str.respond_to?(:force_encoding)
+      str
+    end
     def strategies
       @strategies ||= [
         ::PDF::Reader::MetadataStrategy,
@@ -284,6 +331,7 @@ module PDF
 end
 ################################################################################
+require 'pdf/reader/resource_methods'
 require 'pdf/reader/abstract_strategy'
 require 'pdf/reader/buffer'
 require 'pdf/reader/cmap'
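
Note on doc_strings_to_utf8 and utf16_to_utf8 above: the byte-order-mark check and the unpack/pack conversion can be tried in isolation. The sketch below is not the gem's code; the method names are hypothetical. The last method shows the String#encode route that the source comment alludes to, assuming a Ruby 1.9 or later VM.

```ruby
# Hypothetical stand-alone versions of the checks in the diff above.

# True when the string starts with the UTF-16BE byte order mark (0xFE 0xFF).
def utf16be_with_bom?(str)
  str[0, 2].unpack("C*") == [254, 255]
end

# Convert by reading 16-bit big-endian units and re-packing them as UTF-8.
# Each unit is treated as one code point, so surrogate pairs are not combined.
def utf16be_to_utf8(str)
  body = str[2, str.size]            # drop the 2-byte BOM
  body.unpack("n*").pack("U*")
end

# Alternative for M17N-aware Rubies: let String#encode do the transcoding.
def utf16be_to_utf8_encode(str)
  str[2, str.size].force_encoding("UTF-16BE").encode("UTF-8")
end

raw = "\xFE\xFF\x00H\x00i".b
puts utf16be_with_bom?(raw)
puts utf16be_to_utf8(raw)
puts utf16be_to_utf8_encode(raw)
```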

data/lib/pdf/reader/buffer.rb CHANGED

@@ -151,14 +151,11 @@ class PDF::Reader
     #
     def prepare_tokens
       10.times do
-        if state == :literal_string
-          prepare_literal_token
-        elsif state == :hex_string
-          prepare_hex_token
-        elsif state == :regular
-          prepare_regular_token
-        elsif state == :inline
-          prepare_inline_token
+        case state
+        when :literal_string then prepare_literal_token
+        when :hex_string     then prepare_hex_token
+        when :regular        then prepare_regular_token
+        when :inline         then prepare_inline_token
         end
       end
@@ -169,14 +166,12 @@ class PDF::Reader
     # Determine the current context/state by examining the last token we found
     #
     def state
-      if @tokens[-1] == "("
-        :literal_string
-      elsif @tokens[-1] == "<"
-        :hex_string
-      elsif @tokens[-1] == "stream"
-        :stream
-      elsif in_content_stream? && @tokens[-1] == "ID"
-        :inline
+      case @tokens.last
+      when "(" then :literal_string
+      when "<" then :hex_string
+      when "stream" then :stream
+      when "ID"
+        in_content_stream? ? :inline : :regular
       else
         :regular
       end
@@ -209,13 +204,18 @@ class PDF::Reader
     def prepare_inline_token
       str = ""
-      while str !~ /\sEI$/
+      buffer = []
+      until buffer[0] =~ /\s/ && buffer[1, 2] == ["E", "I"]
         chr = @io.read(1)
-        break if chr.nil?
-        str << chr
+        buffer << chr
+        if buffer.length > 3
+          str << buffer.shift
+        end
       end
-      @tokens << string_token(str[0..-3].strip)
+      @tokens << string_token(str.strip)
       @io.seek(-3, IO::SEEK_CUR) unless chr.nil?
     end
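
Note on prepare_inline_token above: the new loop replaces a regex match against an ever-growing string with a three-character sliding window. The sketch below illustrates that idea only; it is not the gem's implementation. read_inline_image_data is a hypothetical name, and the explicit end-of-input break and length check are additions made for this sketch.

```ruby
require "stringio"

# Hypothetical sketch of the sliding-window scan. Reads one character at a
# time and stops once the window holds whitespace followed by "E" and "I".
# Everything that falls out of the window is image data.
def read_inline_image_data(io)
  data   = +""
  window = []
  until window.length == 3 && window[0] =~ /\s/ && window[1, 2] == ["E", "I"]
    chr = io.read(1)
    break if chr.nil?                # end of input (added for this sketch)
    window << chr
    data << window.shift if window.length > 3
  end
  data
end

puts read_inline_image_data(StringIO.new("abc EI Q"))
puts read_inline_image_data(StringIO.new("xEIy EI Q"))
```

Because only the three most recent characters are inspected on each step, the cost per character stays constant, whereas matching a regex against the whole accumulated string grows with the size of the image data.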

data/lib/pdf/reader/cmap.rb CHANGED

@@ -26,6 +26,8 @@
 class PDF::Reader
   class CMap # :nodoc:
+    attr_reader :map
     def initialize(data)
       @map = {}
       process_data(data)

data/lib/pdf/reader/encoding.rb CHANGED

@@ -137,24 +137,23 @@ class PDF::Reader
     end
     def get_mapping_file(enc)
-      return File.dirname(__FILE__) + "/encodings/standard.txt" if enc.nil?
-      files = {
-        :"Identity-H"      => nil,
-        :"Identity-V"      => nil,
-        :MacRomanEncoding  => File.dirname(__FILE__) + "/encodings/mac_roman.txt",
-        :MacExpertEncoding => File.dirname(__FILE__) + "/encodings/mac_expert.txt",
-        :PDFDocEncoding    => File.dirname(__FILE__) + "/encodings/pdf_doc.txt",
-        :StandardEncoding  => File.dirname(__FILE__) + "/encodings/standard.txt",
-        :SymbolEncoding    => File.dirname(__FILE__) + "/encodings/symbol.txt",
-        :UTF16Encoding     => nil,
-        :WinAnsiEncoding   => File.dirname(__FILE__) + "/encodings/win_ansi.txt",
-        :ZapfDingbatsEncoding => File.dirname(__FILE__) + "/encodings/zapf_dingbats.txt"
-      }
-      if files.has_key?(enc)
-        files[enc]
+      case enc
+      when :"Identity-H", :"Identity-V", :UTF16Encoding then
+        nil
+      when :MacRomanEncoding then
+        File.dirname(__FILE__) + "/encodings/mac_roman.txt"
+      when :MacExpertEncoding then
+        File.dirname(__FILE__) + "/encodings/mac_expert.txt"
+      when :PDFDocEncoding then
+        File.dirname(__FILE__) + "/encodings/pdf_doc.txt"
+      when :SymbolEncoding then
+        File.dirname(__FILE__) + "/encodings/symbol.txt"
+      when :WinAnsiEncoding then
+        File.dirname(__FILE__) + "/encodings/win_ansi.txt"
+      when :ZapfDingbatsEncoding then
+        File.dirname(__FILE__) + "/encodings/zapf_dingbats.txt"
       else
-        raise UnsupportedFeatureError, "#{enc} is not currently a supported encoding"
+        File.dirname(__FILE__) + "/encodings/standard.txt"
       end
     end

data/lib/pdf/reader/filter.rb CHANGED

@@ -201,7 +201,7 @@ class PDF::Reader
       data = data.unpack("C*")
-      pixel_bytes     = 1 #pixel_bitlength / 8
+      pixel_bytes     = opts[:Colors] || 1
       scanline_length = (pixel_bytes * opts[:Columns]) + 1
       row = 0
       pixels = []
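
Note on the pixel_bytes change above: the scanline length now scales with the number of colour components, plus one leading byte per row for the predictor tag. The arithmetic can be checked with a small sketch. scanline_length is a hypothetical helper, and the sketch assumes 8 bits per component, as the diff's calculation also appears to.

```ruby
# Hypothetical helper reproducing the arithmetic from the diff above.
# One byte per colour component per column, plus one predictor byte per row.
def scanline_length(columns, colors = nil)
  pixel_bytes = colors || 1          # same default as opts[:Colors] || 1
  (pixel_bytes * columns) + 1
end

puts scanline_length(4)       # one component per pixel
puts scanline_length(4, 3)    # three components per pixel, e.g. RGB
```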

data/lib/pdf/reader/font.rb CHANGED

@@ -39,6 +39,8 @@ class PDF::Reader
       extract_base_info(obj)
       extract_descriptor(obj)
       extract_descendants(obj)
+      @encoding ||= PDF::Reader::Encoding.new(:StandardEncoding)
     end
     def basefont=(font)
@@ -59,10 +61,7 @@ class PDF::Reader
       raise UnsupportedFeatureError, "font encoding '#{encoding}' currently unsupported" if encoding.kind_of?(String)
       if params.class == String
-        # translate the bytestram into a UTF-8 string.
-        # If an encoding hasn't been specified, assume the text using this
-        # font is in Adobe Standard Encoding.
-        (encoding || PDF::Reader::Encoding.new(:StandardEncoding)).to_utf8(params, tounicode)
+        encoding.to_utf8(params, tounicode)
       elsif params.class == Array
         params.collect { |param| self.to_utf8(param) }
       else

data/lib/pdf/reader/form_xobject.rb CHANGED

@@ -11,6 +11,7 @@ module PDF
     # This behaves and looks much like a limited PDF::Reader::Page class.
     #
     class FormXObject
+      include ResourceMethods
       def initialize(page, xobject)
         @page    = page
@@ -18,12 +19,6 @@ module PDF
         @xobject = @objects.deref(xobject)
       end
-      # Returns the resources that accompany this form.
-      #
-      def resources
-        @resources ||= @objects.deref(@xobject.hash[:Resources]) || {}
-      end
       # return a hash of fonts used on this form.
       #
       # The keys are the font labels used within the form content stream.
@@ -31,7 +26,7 @@ module PDF
       # The values are a PDF::Reader::Font instances that provide access
       # to most available metrics for each font.
       #
-      def fonts
+      def font_objects
         raw_fonts = @objects.deref(resources[:Font] || {})
         ::Hash[raw_fonts.map { |label, font|
           [label, PDF::Reader::Font.new(@objects, @objects.deref(font))]
@@ -56,6 +51,12 @@ module PDF
       private
+      # Returns the resources that accompany this form.
+      #
+      def resources
+        @resources ||= @objects.deref(@xobject.hash[:Resources]) || {}
+      end
       def callback(receivers, name, params=[])
         receivers.each do |receiver|
           receiver.send(name, *params) if receiver.respond_to?(name)

data/lib/pdf/reader/glyph_hash.rb CHANGED

@@ -48,6 +48,7 @@ class PDF::Reader
     def [](name)
       return nil unless name.is_a?(Symbol)
+      name = name.to_s.gsub('_', '').intern
       str = name.to_s
       if @adobe.has_key?(name)

data/lib/pdf/reader/glyphlist.txt CHANGED

@@ -4281,3 +4281,125 @@ zstroke;01B6
 zuhiragana;305A
 zukatakana;30BA
 #--end
+#--start wingdings
+scissors;2702
+scissorscutting;2701
+telephonesolid;260E
+telhandsetcirc;2706
+envelopeback;2709
+hourglass;231B
+keyboard;2328
+tapereel;2707
+handwrite;270D
+handv;270C
+handptleft;261C
+handptright;261E
+handptup;261D
+handptdown;261F
+handhalt;270B
+frownface;2639
+skullcrossbones;2620
+flag;2690
+airplane;2708
+sunshine;263C
+snowflake;2744
+crossshadow;271E
+crossmaltese;2720
+starofdavid;2721
+crescentstar;262A
+om;0950
+wheel;2638
+aries;2648
+taurus;2649
+gemini;264A
+cancer;264B
+leo;264C
+virgo;264D
+libra;264E
+scorpio;264F
+saggitarius;2650
+capricorn;2651
+aquarius;2652
+pisces;2653
+ampersanditlc;0026
+ampersandit;0026
+circle6;25CF
+circleshadowdwn;274D
+square6;25A0
+box3;25A1
+boxshadowdwn;2751
+boxshadowup;2752
+lozenge4;2B27
+lozenge6;29EB
+rhombus6;25C6
+xrhombus;2756
+rhombus4;2B25
+escape;2353
+command;2318
+rosette;2740
+rosettesolid;273F
+quotedbllftbld;275D
+quotedblrtbld;275E
+.notdef;25AF
+zerosans;24EA
+onesans;2460
+twosans;2461
+threesans;2462
+foursans;2463
+fivesans;2464
+sixsans;2465
+sevensans;2466
+eightsans;2467
+ninesans;2468
+tensans;2469
+zerosansinv;24FF
+onesansinv;2776
+twosansinv;2777
+threesansinv;2778
+foursansinv;2779
+fivesansinv;277A
+sixsansinv;277B
+sevensansinv;277C
+eightsansinv;277D
+ninesansinv;277E
+tensansinv;277F
+circle2;00B7
+circle4;2022
+square2;25AA
+ring2;25CB
+ring4;2B55
+ringbutton2;25C9
+target;25CE
+square4;25AA
+box2;25FB
+crosstar2;2726
+pentastar2;2605
+hexstar2;2736
+octastar2;2734
+dodecastar3;2739
+octastar4;2735
+registercircle;2316
+cuspopen;27E1
+cuspopen1;2311
+circlestar;272A
+starshadow;2730
+head2right;27A2
+circleright;27B2
+barb4right;2794
+bleft;21E6
+bright;21E8
+bup;21E7
+bdown;21E9
+bleftright;2B04
+bupdown;21F3
+bnw;2B00
+bne;2B01
+bsw;2B03
+bse;2B02
+bdash1;25AD
+bdash2;25AB
+xmarkbld;2717
+checkbld;2713
+boxxmarkbld;2612
+boxcheckbld;2611
+#--end wingdings

data/lib/pdf/reader/lzw.rb CHANGED

@@ -37,7 +37,7 @@ module PDF
           while bits_left_in_chunk > 0 and @current_pos < @data.size
             chunk = 0 if chunk.nil?
             codepoint = @data[@current_pos, 1].unpack("C*")[0]
-            current_byte = codepoint & (2**@bits_left_in_byte -1) #clear consumed bits
+            current_byte = codepoint & (2**@bits_left_in_byte - 1) #clear consumed bits
             dif = bits_left_in_chunk - @bits_left_in_byte
             if dif > 0 then  current_byte <<= dif
             elsif dif < 0 then  current_byte >>= dif.abs
@@ -82,7 +82,7 @@ module PDF
       def self.decode(data)
         stream = BitStream.new data.to_s, 9 # size of codes between 9 and 12 bits
         result = ''
-        while not (code = stream.read) == CODE_EOD
+        until (code = stream.read) == CODE_EOD
           if code == CODE_CLEAR_TABLE
             string_table = StringTable.new
             code = stream.read

data/lib/pdf/reader/object_hash.rb CHANGED

@@ -30,6 +30,7 @@ class PDF::Reader
     attr_accessor :default
     attr_reader :trailer, :pdf_version
+    attr_reader :sec_handler
     # Creates a new ObjectHash object. Input can be a string with a valid filename
     # or an IO-like object.
@@ -97,6 +98,27 @@ class PDF::Reader
     end
     alias :deref :object
+    # Recursively dereferences the object refered to be +key+. If +key+ is not
+    # a PDF::Reader::Reference, the key is returned unchanged.
+    #
+    def deref!(key)
+      case object = deref(key)
+      when Hash
+        {}.tap { |hash|
+          object.each do |k, value|
+            hash[k] = deref!(value)
+          end
+        }
+      when PDF::Reader::Stream
+        object.hash = deref!(object.hash)
+        object
+      when Array
+        object.map { |value| deref!(value) }
+      else
+        object
+      end
+    end
     # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
     # object.
     #
@@ -238,6 +260,10 @@ class PDF::Reader
       trailer.has_key?(:Encrypt)
     end
+    def sec_handler?
+      !!sec_handler
+    end
     private
     def build_security_handler(opts = {})
@@ -253,11 +279,11 @@ class PDF::Reader
     end
     def decrypt(ref, obj)
-      return obj if @sec_handler.nil?
+      return obj unless sec_handler?
       case obj
       when PDF::Reader::Stream then
-        obj.data = @sec_handler.decrypt(obj.data, ref)
+        obj.data = sec_handler.decrypt(obj.data, ref)
         obj
       when Hash                then
         arr = obj.map { |key,val| [key, decrypt(ref, val)] }.flatten(1)
@@ -265,7 +291,7 @@ class PDF::Reader
       when Array               then
         obj.collect { |item| decrypt(ref, item) }
       when String
-        @sec_handler.decrypt(obj, ref)
+        sec_handler.decrypt(obj, ref)
       else
         obj
       end
@@ -316,7 +342,7 @@ class PDF::Reader
       if File.respond_to?(:binread)
         File.binread(input.to_s)
       else
-        File.read(input.to_s)
+        File.open(input.to_s,"rb") { |f| f.read }
       end
     end
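
Note on deref! above: the difference between deref (one level) and deref! (all levels) is easiest to see with a toy object store. The classes below are stand-ins invented for this sketch, not PDF::Reader::Reference or PDF::Reader::ObjectHash, and the sketch omits the Stream branch.

```ruby
# Toy stand-ins, not the library's classes.
Ref = Struct.new(:id)

class ToyStore
  def initialize(objects)
    @objects = objects
  end

  # One level: resolve a reference, pass anything else through unchanged.
  def deref(key)
    key.is_a?(Ref) ? @objects[key.id] : key
  end

  # All levels: resolve the object, then every value nested inside it.
  def deref!(key)
    case obj = deref(key)
    when Hash  then obj.each_with_object({}) { |(k, v), h| h[k] = deref!(v) }
    when Array then obj.map { |v| deref!(v) }
    else obj
    end
  end
end

store = ToyStore.new({
  1 => { :Font => Ref.new(2) },
  2 => { :F1 => Ref.new(3) },
  3 => { :BaseFont => :Helvetica }
})

p store.deref(Ref.new(1))     # nested value is still a reference
p store.deref!(Ref.new(1))    # nested references are resolved
```

A recursive walk like this one does not guard against reference cycles, so in this sketch it is only safe on structures that are known to be acyclic.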

data/lib/pdf/reader/page.rb CHANGED

@@ -12,6 +12,7 @@ module PDF
     # objects accessor to help walk the page dictionary in any useful way.
     #
     class Page
+      include ResourceMethods
       # lowlevel hash-like access to all objects in the underlying PDF
       attr_reader :objects
@@ -45,73 +46,17 @@ module PDF
         "<PDF::Reader::Page page: #{@pagenum}>"
       end
-      # Returns the attributes that accompany this page. Includes
+      # Returns the attributes that accompany this page, including
       # attributes inherited from parents.
       #
       def attributes
-        {}.tap { |hash|
+        @attributes ||= {}.tap { |hash|
           page_with_ancestors.reverse.each do |obj|
             hash.merge!(@objects.deref(obj))
           end
         }
       end
-      # Returns the resources that accompany this page. Includes
-      # resources inherited from parents.
-      #
-      def resources
-        @resources ||= @objects.deref(attributes[:Resources]) || {}
-      end
-      # Returns a Hash of color spaces that are available to this page
-      #
-      def color_spaces
-        @objects.deref(resources[:ColorSpace]) || {}
-      end
-      # Returns a Hash of fonts that are available to this page
-      #
-      def fonts
-        @objects.deref(resources[:Font]) || {}
-      end
-      # Returns a Hash of external graphic states that are available to this
-      # page
-      #
-      def graphic_states
-        @objects.deref(resources[:ExtGState]) || {}
-      end
-      # Returns a Hash of patterns that are available to this page
-      #
-      def patterns
-        @objects.deref(resources[:Pattern]) || {}
-      end
-      # Returns an Array of procedure sets that are available to this page
-      #
-      def procedure_sets
-        @objects.deref(resources[:ProcSet]) || []
-      end
-      # Returns a Hash of properties sets that are available to this page
-      #
-      def properties
-        @objects.deref(resources[:Properties]) || {}
-      end
-      # Returns a Hash of shadings that are available to this page
-      #
-      def shadings
-        @objects.deref(resources[:Shading]) || {}
-      end
-      # Returns a Hash of XObjects that are available to this page
-      #
-      def xobjects
-        @objects.deref(resources[:XObject]) || {}
-      end
       # returns the plain text content of this page encoded as UTF-8. Any
       # characters that can't be translated will be returned as a ▯
       #
@@ -168,6 +113,13 @@ module PDF
         root ||= objects.deref(@objects.trailer[:Root])
       end
+      # Returns the resources that accompany this page. Includes
+      # resources inherited from parents.
+      #
+      def resources
+        @resources ||= @objects.deref(attributes[:Resources]) || {}
+      end
       def content_stream(receivers, instructions)
         buffer       = Buffer.new(StringIO.new(instructions), :content_stream => true)
         parser       = Parser.new(buffer, @objects)

data/lib/pdf/reader/page_text_receiver.rb CHANGED

@@ -29,8 +29,8 @@ module PDF
       def page=(page)
         @page    = page
         @objects = page.objects
-        @fonts   = build_fonts(page.fonts)
-        @form_fonts = {}
+        @font_stack    = [build_fonts(page.fonts)]
+        @xobject_stack = [page.xobjects]
         @content = {}
         @stack   = [DEFAULT_GRAPHICS_STATE]
       end
@@ -109,6 +109,10 @@ module PDF
         state[:text_font_size] = size
       end
+      def font_size
+        state[:text_font_size] * @text_matrix[0,0]
+      end
       def set_text_leading(leading)
         state[:text_leading] = leading
       end
@@ -194,17 +198,23 @@ module PDF
       #####################################################
       def invoke_xobject(label)
         save_graphics_state
-        xobject = @objects.deref(@page.xobjects[label])
+        dict = @xobject_stack.detect { |xobjects|
+          xobjects.has_key?(label)
+        }
+        xobject = dict ? dict[label] : nil
+        raise MalformedPDFError, "XObject #{label} not found" if xobject.nil?
         matrix = xobject.hash[:Matrix]
         concatenate_matrix(*matrix) if matrix
         if xobject.hash[:Subtype] == :Form
           form = PDF::Reader::FormXObject.new(@page, xobject)
-          @form_fonts = form.fonts
+          @font_stack.unshift(form.font_objects)
+          @xobject_stack.unshift(form.xobjects)
           form.walk(self)
+          @font_stack.shift
+          @xobject_stack.shift
         end
-        @form_fonts = {}
         restore_graphics_state
       end
@@ -232,10 +242,10 @@ module PDF
       def text_rendering_matrix
         state_matrix = Matrix[
-                         [state[:text_font_size] * state[:h_scaling], 0, 0],
-                         [0, state[:text_font_size], 0],
-                         [0, state[:text_rise], 1]
-                       ]
+          [font_size * state[:h_scaling], 0, 0],
+          [0, font_size, 0],
+          [0, state[:text_rise], 1]
+        ]
         state_matrix * @text_matrix * ctm
       end
@@ -251,21 +261,17 @@ module PDF
       # This returns a deep clone of the current state, ensuring changes are
       # keep separate from earlier states.
       #
-      # YAML is used to round-trip the state through a string to easily perform
-      # the deep clone. Kinda hacky, but effective.
+      # Marshal is used to round-trip the state through a string to easily
+      # perform the deep clone. Kinda hacky, but effective.
       #
       def clone_state
         if @stack.empty?
           {}
         else
-          yaml_lib.load yaml_lib.dump(@stack.last)
+          Marshal.load Marshal.dump(@stack.last)
         end
       end
-      def yaml_lib
-        Kernel.const_defined?("Psych") ? Psych : YAML
-      end
       # return the current transformation matrix
       #
       def ctm
@@ -273,7 +279,10 @@ module PDF
       end
       def current_font
-        @form_fonts[state[:text_font]] || @fonts[state[:text_font]]
+        dict = @font_stack.detect { |fonts|
+          fonts.has_key?(state[:text_font])
+        }
+        dict ? dict[state[:text_font]] : nil
       end
       # private class for representing points on a cartesian plain. Used
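
Note on the font and xobject stacks above: nested form xobjects push their own resources on the front of a stack, and lookups take the first dictionary that has the label. The sketch below shows that lookup, along with the Marshal round-trip used by clone_state. The helper names and sample data are hypothetical.

```ruby
# Hypothetical helper: return the value for label from the first dictionary
# in the stack that defines it, or nil when none does.
def lookup(stack, label)
  dict = stack.detect { |d| d.has_key?(label) }
  dict ? dict[label] : nil
end

# Deep copy via Marshal, as clone_state does in the diff above.
def deep_clone(state)
  Marshal.load(Marshal.dump(state))
end

page_fonts = { :F1 => "page font", :F2 => "page-only font" }
form_fonts = { :F1 => "form font" }

stack = [page_fonts]
stack.unshift(form_fonts)          # entering a form xobject
p lookup(stack, :F1)               # the form's definition shadows the page's
p lookup(stack, :F2)               # falls through to the page
stack.shift                        # leaving the form xobject
p lookup(stack, :F1)

state = { :matrix => [1, 0, 0, 1, 0, 0] }
copy  = deep_clone(state)
copy[:matrix] << 99                # does not touch the original
p state[:matrix]
```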

data/lib/pdf/reader/pages_strategy.rb CHANGED

@@ -350,7 +350,7 @@ class PDF::Reader
       while (token = parser.parse_token(OPERATORS))
         if token.kind_of?(Token) and OPERATORS.has_key?(token)
-           if OPERATORS[token] == :set_text_font_and_size
+          if OPERATORS[token] == :set_text_font_and_size
             current_font = params.first
             if fonts[current_font].nil?
               raise MalformedPDFError, "Unknown font #{current_font}"

data/lib/pdf/reader/parser.rb CHANGED

@@ -28,6 +28,31 @@ class PDF::Reader
   # An internal PDF::Reader class that reads objects from the PDF file and converts
   # them into useable ruby objects (hash's, arrays, true, false, etc)
   class Parser
+    TOKEN_STRATEGY = proc { |parser, token| Token.new(token) }
+    STRATEGIES = {
+      "/"  => proc { |parser, token| parser.send(:pdf_name) },
+      "<<" => proc { |parser, token| parser.send(:dictionary) },
+      "["  => proc { |parser, token| parser.send(:array) },
+      "("  => proc { |parser, token| parser.send(:string) },
+      "<"  => proc { |parser, token| parser.send(:hex_string) },
+      nil     => proc { nil },
+      "true"  => proc { true },
+      "false" => proc { false },
+      "null"  => proc { nil },
+      "obj"       => TOKEN_STRATEGY,
+      "endobj"    => TOKEN_STRATEGY,
+      "stream"    => TOKEN_STRATEGY,
+      "endstream" => TOKEN_STRATEGY,
+      ">>"        => TOKEN_STRATEGY,
+      "]"         => TOKEN_STRATEGY,
+      ">"         => TOKEN_STRATEGY,
+      ")"         => TOKEN_STRATEGY
+    }
     ################################################################################
     # Create a new parser around a PDF::Reader::Buffer object
     #
@@ -45,25 +70,20 @@ class PDF::Reader
     def parse_token (operators={})
       token = @buffer.token
-      case token
-      when PDF::Reader::Reference, nil then return token
-      when "/"                         then return pdf_name()
-      when "<<"                        then return dictionary()
-      when "["                         then return array()
-      when "("                         then return string()
-      when "<"                         then return hex_string()
-      when "true"                      then return true
-      when "false"                     then return false
-      when "null"                      then return nil
-      when "obj", "endobj", "stream", "endstream" then return Token.new(token)
-      when "stream", "endstream"       then return Token.new(token)
-      when ">>", "]", ">", ")"         then return Token.new(token)
+      if STRATEGIES.has_key? token
+        STRATEGIES[token].call(self, token)
+      elsif token.is_a? PDF::Reader::Reference
+        token
+      elsif token.is_a? Token
+        token
+      elsif operators.has_key? token
+        Token.new(token)
+      elsif token.respond_to?(:to_token)
+        token.to_token
+      elsif token =~ /\d*\.\d/
+        token.to_f
       else
-        if token.respond_to?(:to_token) then return token.to_token
-        elsif operators.has_key?(token)   then return Token.new(token)
-        elsif token =~ /\d*\.\d/       then return token.to_f
-        else                           return token.to_i
-        end
+        token.to_i
       end
     end
     ################################################################################
@@ -110,9 +130,8 @@ class PDF::Reader
     # reads a PDF name from the buffer and converts it to a Ruby Symbol
     def pdf_name
       tok = @buffer.token
-      tok.scan(/#([A-Fa-f0-9]{2})/).each do |find|
-        replace = find[0].hex.chr
-        tok.gsub!("#"+find[0], replace)
+      tok.gsub!(/#([A-Fa-f0-9]{2})/) do |match|
+        match[1, 2].hex.chr
       end
       tok.to_sym
     end
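
Note on pdf_name above: the rewritten version decodes every #XX escape in a single gsub pass with a block. The sketch below is not the gem's code; decode_pdf_name is a hypothetical name, and it uses the non-mutating gsub so that frozen input strings also work.

```ruby
# Hypothetical stand-alone version of the escape decoding in pdf_name.
# Each "#XX" sequence (two hex digits) is replaced by the byte it encodes.
def decode_pdf_name(tok)
  decoded = tok.gsub(/#([A-Fa-f0-9]{2})/) do |match|
    match[1, 2].hex.chr            # "#20" -> "20" -> 32 -> " "
  end
  decoded.to_sym
end

p decode_pdf_name("Adobe#20Green")
p decode_pdf_name("A#42C")
p decode_pdf_name("Plain")
```

The examples stick to escapes below 0x80. For higher byte values the result depends on the encoding of the input token, which this sketch does not attempt to model.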

data/lib/pdf/reader/resource_methods.rb ADDED

@@ -0,0 +1,60 @@
+# coding: utf-8
+module PDF
+  class Reader
+    # mixin for common methods in Page and FormXobjects
+    #
+    module ResourceMethods
+      # Returns a Hash of color spaces that are available to this page
+      #
+      def color_spaces
+        @objects.deref!(resources[:ColorSpace]) || {}
+      end
+      # Returns a Hash of fonts that are available to this page
+      #
+      def fonts
+        @objects.deref!(resources[:Font]) || {}
+      end
+      # Returns a Hash of external graphic states that are available to this
+      # page
+      #
+      def graphic_states
+        @objects.deref!(resources[:ExtGState]) || {}
+      end
+      # Returns a Hash of patterns that are available to this page
+      #
+      def patterns
+        @objects.deref!(resources[:Pattern]) || {}
+      end
+      # Returns an Array of procedure sets that are available to this page
+      #
+      def procedure_sets
+        @objects.deref!(resources[:ProcSet]) || []
+      end
+      # Returns a Hash of properties sets that are available to this page
+      #
+      def properties
+        @objects.deref!(resources[:Properties]) || {}
+      end
+      # Returns a Hash of shadings that are available to this page
+      #
+      def shadings
+        @objects.deref!(resources[:Shading]) || {}
+      end
+      # Returns a Hash of XObjects that are available to this page
+      #
+      def xobjects
+        @objects.deref!(resources[:XObject]) || {}
+      end
+    end
+  end
+end
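
Note on the file above: every method in the mixin relies on the including class providing an `@objects` instance variable and a `resources` method. The sketch below shows that contract with stand-ins; `FakeObjects` and `FakePage` are hypothetical, and this `deref!` simply returns its argument, whereas the real one is assumed to resolve indirect references.

```ruby
# Stand-in for the object store. This deref! returns its argument as-is.
class FakeObjects
  def deref!(obj)
    obj
  end
end

# Cut-down copy of the mixin with two of its methods.
module ResourceMethods
  def fonts
    @objects.deref!(resources[:Font]) || {}
  end

  def procedure_sets
    @objects.deref!(resources[:ProcSet]) || []
  end
end

# Hypothetical includer: it must supply @objects and #resources.
class FakePage
  include ResourceMethods

  def initialize(objects, resources)
    @objects = objects
    @resources = resources
  end

  def resources
    @resources
  end
end

page = FakePage.new(FakeObjects.new, { :Font => { :F1 => "Helvetica" } })
page.fonts           # the Hash stored under :Font
page.procedure_sets  # empty Array, because :ProcSet is absent
```

The `|| {}` and `|| []` fallbacks mean callers can iterate over the result without a nil check when a resource category is missing.
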

data/lib/pdf/reader/xref.rb CHANGED

@@ -146,7 +146,7 @@ class PDF::Reader
     # Read a XReaf stream from the underlying buffer instead of a traditional xref table.
     #
     def load_xref_stream(stream)
-      unless stream.hash[:Type] == :XRef
+      unless stream.is_a?(PDF::Reader::Stream) && stream.hash[:Type] == :XRef
         raise PDF::Reader::MalformedPDFError, "xref stream not found when expected"
       end
       trailer = Hash[stream.hash.select { |key, value|

metadata CHANGED

@@ -1,122 +1,98 @@
---- !ruby/object:Gem::Specification
+--- !ruby/object:Gem::Specification
 name: pdf-reader
-version: !ruby/object:Gem::Version
-  prerelease: true
-  segments:
-  - 1
-  - 0
-  - 0
-  - beta1
-  version: 1.0.0.beta1
+version: !ruby/object:Gem::Version
+  version: 1.0.0.rc1
+  prerelease: 6
 platform: ruby
-authors:
+authors:
 - James Healy
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2011-10-06 00:00:00 +11:00
-default_executable:
-dependencies:
-- !ruby/object:Gem::Dependency
+date: 2011-12-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
   name: rake
-  prerelease: false
-  requirement: &id001 !ruby/object:Gem::Requirement
+  requirement: &19650680 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        segments:
-        - 0
-        version: "0"
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :development
-  version_requirements: *id001
-- !ruby/object:Gem::Dependency
-  name: roodi
   prerelease: false
-  requirement: &id002 !ruby/object:Gem::Requirement
+  version_requirements: *19650680
+- !ruby/object:Gem::Dependency
+  name: roodi
+  requirement: &19650220 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        segments:
-        - 0
-        version: "0"
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :development
-  version_requirements: *id002
-- !ruby/object:Gem::Dependency
-  name: rspec
   prerelease: false
-  requirement: &id003 !ruby/object:Gem::Requirement
+  version_requirements: *19650220
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: &19649720 !ruby/object:Gem::Requirement
     none: false
-    requirements:
+    requirements:
     - - ~>
-      - !ruby/object:Gem::Version
-        segments:
-        - 2
-        - 3
-        version: "2.3"
+      - !ruby/object:Gem::Version
+        version: '2.3'
   type: :development
-  version_requirements: *id003
-- !ruby/object:Gem::Dependency
-  name: ZenTest
   prerelease: false
-  requirement: &id004 !ruby/object:Gem::Requirement
+  version_requirements: *19649720
+- !ruby/object:Gem::Dependency
+  name: ZenTest
+  requirement: &19649220 !ruby/object:Gem::Requirement
     none: false
-    requirements:
+    requirements:
     - - ~>
-      - !ruby/object:Gem::Version
-        segments:
-        - 4
-        - 4
-        - 2
+      - !ruby/object:Gem::Version
         version: 4.4.2
   type: :development
-  version_requirements: *id004
-- !ruby/object:Gem::Dependency
-  name: Ascii85
   prerelease: false
-  requirement: &id005 !ruby/object:Gem::Requirement
+  version_requirements: *19649220
+- !ruby/object:Gem::Dependency
+  name: Ascii85
+  requirement: &19648740 !ruby/object:Gem::Requirement
     none: false
-    requirements:
+    requirements:
     - - ~>
-      - !ruby/object:Gem::Version
-        segments:
-        - 1
-        - 0
-        - 0
+      - !ruby/object:Gem::Version
         version: 1.0.0
   type: :runtime
-  version_requirements: *id005
-- !ruby/object:Gem::Dependency
-  name: ruby-rc4
   prerelease: false
-  requirement: &id006 !ruby/object:Gem::Requirement
+  version_requirements: *19648740
+- !ruby/object:Gem::Dependency
+  name: ruby-rc4
+  requirement: &19648280 !ruby/object:Gem::Requirement
     none: false
-    requirements:
-    - - ">="
-      - !ruby/object:Gem::Version
-        segments:
-        - 0
-        version: "0"
+    requirements:
+    - - ! '>='
+      - !ruby/object:Gem::Version
+        version: '0'
   type: :runtime
-  version_requirements: *id006
-description: The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe
-email:
+  prerelease: false
+  version_requirements: *19648280
+description: The PDF::Reader library implements a PDF parser conforming as much as
+  possible to the PDF specification from Adobe
+email:
 - jimmy@deefa.com
-executables:
+executables:
 - pdf_object
 - pdf_text
 - pdf_list_callbacks
 - pdf_callbacks
 extensions: []
-extra_rdoc_files:
+extra_rdoc_files:
 - README.rdoc
 - TODO
 - CHANGELOG
 - MIT-LICENSE
-files:
+files:
 - examples/metadata.rb
 - examples/extract_images.rb
 - examples/extract_bates.rb
@@ -161,6 +137,7 @@ files:
 - lib/pdf/reader/encodings/zapf_dingbats.txt
 - lib/pdf/reader/encodings/pdf_doc.txt
 - lib/pdf/reader/encodings/mac_expert.txt
+- lib/pdf/reader/resource_methods.rb
 - lib/pdf/reader/metadata_strategy.rb
 - lib/pdf/reader/token.rb
 - lib/pdf-reader.rb
@@ -173,45 +150,39 @@ files:
 - bin/pdf_text
 - bin/pdf_list_callbacks
 - bin/pdf_callbacks
-has_rdoc: true
 homepage: http://github.com/yob/pdf-reader
 licenses: []
-post_install_message: "\n  ********************************************\n\n  This is a beta release of PDF::Reader to gather feedback on the proposed\n  API changes.\n\n  The old API is marked as deprecated but will continue to work with no\n  visible warnings for now.\n\n  The new API is documented in the README and in rdoc for the PDF::Reader,\n  PDF::Reader::Page and PDF::Reader::ObjectHash classes.\n\n  Do not use this in production, stick to stable releases for that. If you do\n  take the new API for a spin, please send any feedback my way.\n\n  ********************************************\n\n"
-rdoc_options:
+post_install_message: ! "\n  ********************************************\n\n  This
+  is a beta release of PDF::Reader to gather feedback on the proposed\n  API changes.\n\n
+  \ The old API is marked as deprecated but will continue to work with no\n  visible
+  warnings for now.\n\n  The new API is documented in the README and in rdoc for the
+  PDF::Reader,\n  PDF::Reader::Page and PDF::Reader::ObjectHash classes.\n\n  Do not
+  use this in production, stick to stable releases for that. If you do\n  take the
+  new API for a spin, please send any feedback my way.\n\n  ********************************************\n\n"
+rdoc_options:
 - --title
 - PDF::Reader Documentation
 - --main
 - README.rdoc
 - -q
-require_paths:
+require_paths:
 - lib
-required_ruby_version: !ruby/object:Gem::Requirement
+required_ruby_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ">="
-    - !ruby/object:Gem::Version
-      segments:
-      - 1
-      - 8
-      - 7
+  requirements:
+  - - ! '>='
+    - !ruby/object:Gem::Version
       version: 1.8.7
-required_rubygems_version: !ruby/object:Gem::Requirement
+required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
-  requirements:
-  - - ">"
-    - !ruby/object:Gem::Version
-      segments:
-      - 1
-      - 3
-      - 1
+  requirements:
+  - - ! '>'
+    - !ruby/object:Gem::Version
       version: 1.3.1
 requirements: []
 rubyforge_project:
-rubygems_version: 1.3.7
+rubygems_version: 1.8.11
 signing_key:
 specification_version: 3
 summary: A library for accessing the content of PDF files
 test_files: []