pdf-reader 0.7.7 → 0.8.0

Files changed (13)

data/CHANGELOG CHANGED

@@ -1,3 +1,7 @@
+v0.8.0 (20th November 2009)
+- Added PDF::Hash. It provides direct access to objects from a PDF file
+  with an API that emulates the standard Ruby hash
 v0.7.7 (11th September 2009)
 - Trigger callbacks contained in Form XObjects when we encounter them in a
   content stream

data/README.rdoc CHANGED

@@ -17,9 +17,7 @@ spare time to dedicate to adding new features.
 The code as it is works fairly well, and I offer it "as is". All patches, bug
 reports and sample PDFs are welcome - I will work on them when I can. If anyone
 is interested in adding features to PDF::Reader in their own effort to learn
-the PDF file format, I'll happy offer help qand support.
-I STRONGLY RECOMMEND NOT USING PDF::READER FOR YOUR PRODUCTION CODE.
+the PDF file format, I'll happy offer help and support.
 = Installation
@@ -42,6 +40,10 @@ For a full list of the supported callback methods and a description of when they
 will be called, refer to PDF::Reader::Content. See the code examples below for a
 way to print a list of all the callbacks generated by a file to STDOUT.
+There is also a class called PDF::Hash. This provides direct access to the objects
+in a PDF file using a ruby hash-like API. Check out the documentation for the class
+for further information.
 = Text Encoding
 Internally, text can be stored inside a PDF in various encodings, including

data/Rakefile CHANGED

@@ -6,7 +6,7 @@ require 'rake/testtask'
 require "rake/gempackagetask"
 require 'spec/rake/spectask'
-PKG_VERSION = "0.7.7"
+PKG_VERSION = "0.8.0"
 PKG_NAME = "pdf-reader"
 PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
@@ -16,10 +16,11 @@ task :default => [ :spec ]
 # run all rspecs
 desc "Run all rspec files"
 Spec::Rake::SpecTask.new("spec") do |t|
-  t.spec_files = FileList['specs/**/*.rb']
-  t.rcov = true
-  t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
- # t.rcov_opts = ["--exclude","spec.*\.rb"]
+  t.spec_files =  FileList['specs/**/*.rb']
+  t.rcov       =  true
+  t.rcov_dir   =  (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
+  t.ruby_opts  << "-w"
+  # t.rcov_opts = ["--exclude","spec.*\.rb"]
 end
 # generate specdocs

data/examples/hash.rb ADDED

@@ -0,0 +1,12 @@
+#!/usr/bin/env ruby
+# coding: utf-8
+# get direct access to PDF objects
+#
+$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
+require 'pdf/reader'
+filename = File.dirname(__FILE__) + "/../specs/data/cairo-unicode.pdf"
+hash = PDF::Hash.new(filename)
+puts hash[3]

data/lib/pdf/hash.rb ADDED

@@ -0,0 +1,202 @@
+module PDF
+  # Provides low level access to the objects in a PDF file via a hash-like
+  # object.
+  #
+  # A PDF file can be viewed as a large hash map. It is a series of objects
+  # stored at exact byte offsets, and a table that maps object IDs to byte
+  # offsets. Given an object ID, looking up an object is an O(1) operation.
+  #
+  # Each PDF object can be mapped to a ruby object, so by passing an object
+  # ID to the [] method, a ruby representation of that object will be
+  # retrieved.
+  #
+  # The class behaves much like a standard Ruby hash, including the use of
+  # the Enumerable mixin. The key difference is no []= method - the hash
+  # is read only.
+  #
+  # == Basic Usage
+  #
+  #     h = PDF::Hash.new("somefile.pdf")
+  #     h[1]
+  #     => 3469
+  #
+  #     h[PDF::Reader::Reference.new(1,0)]
+  #     => 3469
+  #
+  class Hash
+    include Enumerable
+    attr_accessor :default
+    attr_reader :trailer
+    # Creates a new PDF::Hash object. input can be a string with a valid filename,
+    # a string containing a PDF file, or an IO object.
+    #
+    def initialize(input)
+      if input.kind_of?(IO) || input.kind_of?(StringIO)
+        io = input
+      elsif File.file?(input.to_s)
+        if File.respond_to?(:binread)
+          input = File.binread(input.to_s)
+        else
+          input = File.read(input.to_s)
+        end
+        io = StringIO.new(input)
+      else
+        raise ArgumentError, "input must be an IO-like object or a filename"
+      end
+      buffer = PDF::Reader::Buffer.new(io)
+      @xref  = PDF::Reader::XRef.new(buffer)
+      @trailer = @xref.load
+    end
+    # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
+    # object.
+    #
+    # If an int is used, the object with that ID and a generation number of 0 will
+    # be returned.
+    #
+    # If a PDF::Reader::Reference object is used the exact ID and generation number
+    # can be specified.
+    #
+    def [](key)
+      return default if key.to_i <= 0
+      begin
+        unless key.kind_of?(PDF::Reader::Reference)
+          key = PDF::Reader::Reference.new(key.to_i, 0)
+        end
+        @xref.object(key)
+      rescue
+        return default
+      end
+    end
+    # Access an object from the PDF. key can be an int or a PDF::Reader::Reference
+    # object.
+    #
+    # If an int is used, the object with that ID and a generation number of 0 will
+    # be returned.
+    #
+    # If a PDF::Reader::Reference object is used the exact ID and generation number
+    # can be specified.
+    #
+    # local_default is the object that will be returned if the requested key doesn't
+    # exist.
+    #
+    def fetch(key, local_default = nil)
+      obj = self[key]
+      if obj
+        return obj
+      elsif local_default
+        return local_default
+      else
+        raise IndexError, "#{key} is invalid" if key.to_i <= 0
+      end
+    end
+    # iterate over each key, value. Just like a ruby hash.
+    #
+    def each(&block)
+      @xref.each do |ref, obj|
+        yield ref, obj
+      end
+    end
+    alias :each_pair :each
+    # iterate over each key. Just like a ruby hash.
+    #
+    def each_key(&block)
+      each do |id, obj|
+        yield id
+      end
+    end
+    # iterate over each value. Just like a ruby hash.
+    #
+    def each_value(&block)
+      each do |id, obj|
+        yield obj
+      end
+    end
+    # return the number of objects in the file. An object with multiple generations
+    # is counted once.
+    def size
+      @xref.size
+    end
+    alias :length :size
+    # return true if there are no objects in this file
+    #
+    def empty?
+      size == 0 ? true : false
+    end
+    # return true if the specified key exists in the file. key
+    # can be an int or a PDF::Reader::Reference
+    #
+    def has_key?(check_key)
+      # TODO update from O(n) to O(1)
+      each_key do |key|
+        if check_key.kind_of?(PDF::Reader::Reference)
+          return true if check_key == key
+        else
+          return true if check_key.to_i == key.id
+        end
+      end
+      return false
+    end
+    alias :include? :has_key?
+    alias :key? :has_key?
+    alias :member? :has_key?
+    # return true if the specified value exists in the file
+    #
+    def has_value?(value)
+      # TODO update from O(n) to O(1)
+      each_value do |obj|
+        return true if obj == value
+      end
+      return false
+    end
+    alias :value? :has_value?
+    def to_s
+      "<PDF::Hash size: #{self.size}>"
+    end
+    # return an array of all keys in the file
+    #
+    def keys
+      ret = []
+      each_key { |k| ret << k }
+      ret
+    end
+    # return an array of all values in the file
+    #
+    def values
+      ret = []
+      each_value { |v| ret << v }
+      ret
+    end
+    # return an array of all values from the specified keys
+    #
+    def values_at(*ids)
+      ids.map { |id| self[id] }
+    end
+    # return an array of arrays. Each sub array contains a key/value pair.
+    #
+    def to_a
+      ret = []
+      each do |id, obj|
+        ret << [id, obj]
+      end
+      ret
+    end
+  end
+end
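
The class comments above describe a read-only, hash-like wrapper that mixes in Enumerable, and the TODO notes in has_key? and has_value? flag their O(n) scans. A minimal sketch of the same shape follows, assuming a plain Ruby Hash in place of the xref table. ReadOnlyObjectTable and its sample data are hypothetical, and no PDF is parsed.

```ruby
# Hypothetical stand-in for PDF::Hash: a read-only, hash-like view over a
# plain Ruby Hash that maps integer object IDs to values.
class ReadOnlyObjectTable
  include Enumerable

  attr_accessor :default

  def initialize(objects)
    @objects = objects.dup.freeze
    @default = nil
  end

  # Look up by integer ID; anything that is not a positive ID yields the default.
  def [](key)
    id = key.to_i
    return default if id <= 0
    @objects.fetch(id, default)
  end

  # Unlike [], fetch raises when the key is missing and no fallback is given.
  def fetch(key, *fallback)
    id = key.to_i
    return @objects[id] if @objects.key?(id)
    return fallback.first unless fallback.empty?
    raise IndexError, "#{key} is invalid"
  end

  # Enumerable builds map, select, to_a and friends on top of each.
  def each
    @objects.keys.sort.each { |id| yield id, @objects[id] }
  end

  def size
    @objects.size
  end

  # O(1) membership test, delegated to the underlying Hash.
  def has_key?(key)
    @objects.key?(key.to_i)
  end

  def has_value?(value)
    @objects.value?(value)
  end

  alias :key? :has_key?
  alias :value? :has_value?
end

table = ReadOnlyObjectTable.new({ 1 => 3469, 2 => "catalog" })
puts table[1]
puts table.fetch(9, :missing)
puts table.map { |id, _obj| id }.inspect
```

There is no []= method, so callers cannot modify the table. Raising for every missing key in fetch mirrors how Ruby's own Hash#fetch behaves; it is a choice made for this sketch, not a description of the gem.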

data/lib/pdf/reader.rb CHANGED

@@ -116,6 +116,7 @@ require 'pdf/reader/stream'
 require 'pdf/reader/text_receiver'
 require 'pdf/reader/token'
 require 'pdf/reader/xref'
+require 'pdf/hash'
 class PDF::Reader
   ################################################################################

data/lib/pdf/reader/content.rb CHANGED

@@ -265,7 +265,10 @@ class PDF::Reader
       callback(:metadata, [info]) if info
       # new style xml metadata
-      callback(:xml_metadata,@xref.object(root[:Metadata])) if root[:Metadata]
+      if root[:Metadata]
+        stream = @xref.object(root[:Metadata])
+        callback(:xml_metadata,stream.unfiltered_data)
+      end
       # page count
       if (pages = @xref.object(root[:Pages]))
@@ -327,7 +330,7 @@ class PDF::Reader
         callback(:begin_form_xobject)
         resources = @xref.object(xobject.hash[:Resources])
         walk_resources(resources) if resources
-        content_stream(xobject.to_s)
+        content_stream(xobject)
         callback(:end_form_xobject)
       end
     end
@@ -346,9 +349,10 @@ class PDF::Reader
     # Reads a PDF content stream and calls all the appropriate callback methods for the operators
     # it contains
     def content_stream (instructions)
-      @buffer = Buffer.new(StringIO.new(instructions))
-      @parser = Parser.new(@buffer, @xref)
-      @params = [] if @params.nil?
+      instructions = instructions.unfiltered_data if instructions.kind_of?(PDF::Reader::Stream)
+      @buffer =   Buffer.new(StringIO.new(instructions))
+      @parser =   Parser.new(@buffer, @xref)
+      @params ||= []
       while (token = @parser.parse_token(OPERATORS))
         if token.kind_of?(Token) and OPERATORS.has_key?(token)
@@ -437,7 +441,8 @@ class PDF::Reader
           if desc[:ToUnicode]
             # this stream is a cmap
             begin
-              @fonts[label].tounicode = PDF::Reader::CMap.new(desc[:ToUnicode])
+              stream = desc[:ToUnicode]
+              @fonts[label].tounicode = PDF::Reader::CMap.new(stream.unfiltered_data)
             rescue
               # if the CMap fails to parse, don't worry too much. Means we can't translate the text properly
             end

data/lib/pdf/reader/encoding.rb CHANGED

@@ -113,9 +113,10 @@ class PDF::Reader
       array_orig.each do |num|
         if tounicode && (code = tounicode.decode(num))
           array_enc << code
-        elsif tounicode || (tounicode.nil? && @to_unicode_required)
+        elsif tounicode || ( tounicode.nil? && defined?(@to_unicode_required) &&
+                                               @to_unicode_required )
           array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
-        elsif @mapping && @mapping[num]
+        elsif defined?(@mapping) && @mapping && @mapping[num]
           array_enc << @mapping[num]
         else
           array_enc << num

data/lib/pdf/reader/parser.rb CHANGED

@@ -207,18 +207,6 @@ class PDF::Reader
       Error.str_assert(parse_token, "endstream")
       Error.str_assert(parse_token, "endobj")
-      if dict.has_key?(:Filter)
-        options = []
-        if dict.has_key?(:DecodeParms)
-          options = Array(dict[:DecodeParms])
-        end
-        Array(dict[:Filter]).each_with_index do |filter, index|
-          data = Filter.new(filter, options[index]).filter(data)
-        end
-      end
       PDF::Reader::Stream.new(dict, data)
     end
     ################################################################################

data/lib/pdf/reader/reference.rb CHANGED

@@ -49,6 +49,21 @@ class PDF::Reader
       [self]
     end
     ################################################################################
+    # returns the ID of this reference. Use with caution, ignores the generation id
+    def to_i
+      self.id
+    end
+    def ==(obj)
+      return false unless obj.kind_of?(PDF::Reader::Reference)
+      if obj.id == self.id && obj.gen == self.gen
+        true
+      else
+        false
+      end
+    end
+    alias :eql? :==
+    ################################################################################
   end
   ################################################################################
 end
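
The hunk above adds == and eql? to Reference. Ruby's Hash calls hash first and eql? second when looking up a key, so a value object meant to work as a hash key usually overrides hash as well; the diff does not show whether Reference does so elsewhere. A minimal sketch with a hypothetical ObjectRef class:

```ruby
# Hypothetical value object modelled on the id/gen pair in the diff above.
class ObjectRef
  attr_reader :id, :gen

  def initialize(id, gen)
    @id = id
    @gen = gen
  end

  def ==(other)
    other.is_a?(ObjectRef) && other.id == id && other.gen == gen
  end
  alias :eql? :==

  # Equal references must also produce equal hash codes, otherwise Hash
  # lookups would miss them even though eql? returns true.
  def hash
    [id, gen].hash
  end
end

offsets = { ObjectRef.new(3, 0) => 1024 }
puts offsets[ObjectRef.new(3, 0)]
puts ObjectRef.new(3, 0) == ObjectRef.new(3, 1)
```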

data/lib/pdf/reader/stream.rb CHANGED

@@ -25,18 +25,40 @@
 class PDF::Reader
   ################################################################################
-  # An internal PDF::Reader class that represents a single token from a PDF file.
+  # An internal PDF::Reader class that represents a stream object from a PDF. Stream
+  # objects have 2 components, a dictionary that describes the content (size,
+  # compression, etc) and a stream of bytes.
   #
-  # Behaves exactly like a Ruby String - it basically exists for convenience.
-  class Stream < String
+  class Stream
     attr_accessor :hash
+    attr_reader :data
     ################################################################################
-    # Creates a new token with the specified value
-    def initialize (hash, val)
+    # Creates a new stream with the specified dictionary and data. The dictionary
+    # should be a standard ruby hash, the data should be a standard ruby string.
+    def initialize (hash, data)
       @hash = hash
-      super val
+      @data = data
+      @udata = nil
     end
     ################################################################################
+    # apply this stream's filters to its data and return the result.
+    def unfiltered_data
+      return @udata if @udata
+      @udata = data.dup
+      if hash.has_key?(:Filter)
+        options = []
+        if hash.has_key?(:DecodeParms)
+          options = Array(hash[:DecodeParms])
+        end
+        Array(hash[:Filter]).each_with_index do |filter, index|
+          @udata = Filter.new(filter, options[index]).filter(@udata)
+        end
+      end
+      @udata
+    end
   end
   ################################################################################
 end
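
This release moves filter decoding out of the parser and into Stream#unfiltered_data, so the raw bytes are kept and decoding happens only on request, with the result cached. A minimal sketch of that pattern follows. LazyStream is a hypothetical stand-in that only understands FlateDecode through Ruby's zlib library, and it ignores :DecodeParms.

```ruby
require "zlib"

# Hypothetical stand-in for PDF::Reader::Stream: a dictionary plus raw bytes.
class LazyStream
  attr_reader :dict, :data

  def initialize(dict, data)
    @dict = dict
    @data = data
    @decoded = nil
  end

  # Decode on first use and cache the result. Filters are applied in the
  # order they are listed; a stream without :Filter is returned as stored.
  def unfiltered_data
    @decoded ||= Array(dict[:Filter]).inject(data) do |bytes, filter|
      case filter
      when :FlateDecode then Zlib::Inflate.inflate(bytes)
      else raise ArgumentError, "unsupported filter: #{filter}"
      end
    end
  end
end

raw = Zlib::Deflate.deflate("BT /F1 12 Tf (Hello) Tj ET")
stream = LazyStream.new({ :Filter => :FlateDecode, :Length => raw.size }, raw)
puts stream.unfiltered_data
```

Keeping data and the decoded copy separate lets a caller still reach the stored bytes, which a String subclass holding already-decoded content could not offer.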

data/lib/pdf/reader/xref.rb CHANGED

@@ -36,6 +36,9 @@ class PDF::Reader
       @buffer = buffer
       @xref = {}
     end
+    def size
+      @xref.size
+    end
     ################################################################################
     # returns the PDF version of the current document. Technically this isn't part of the XRef
     # table, but it is one of the lowest level data items in the file, so we've lumped it in
@@ -136,6 +139,16 @@ class PDF::Reader
       raise InvalidObjectError, "Object #{ref.id}, Generation #{ref.gen} is invalid"
     end
     ################################################################################
+    # iterate over each object in the xref table
+    def each(&block)
+      ids = @xref.keys.sort
+      ids.each do |id|
+        gen = @xref[id].keys.sort[-1]
+        ref = PDF::Reader::Reference.new(id, gen)
+        yield ref, object(ref)
+      end
+    end
+    ################################################################################
     # Stores an offset value for a particular PDF object ID and revision number
     def store (id, gen, offset)
       (@xref[id] ||= {})[gen] ||= offset

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version: 0.7.7
+  version: 0.8.0
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2009-09-11 00:00:00 +10:00
+date: 2009-11-20 00:00:00 +11:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -36,38 +36,40 @@ extra_rdoc_files:
 - CHANGELOG
 - MIT-LICENSE
 files:
-- examples/extract_bates.rb
-- examples/text.rb
 - examples/page_counter_naive.rb
-- examples/callbacks.rb
+- examples/rspec.rb
 - examples/metadata.rb
+- examples/extract_bates.rb
+- examples/hash.rb
+- examples/callbacks.rb
+- examples/text.rb
 - examples/page_counter_improved.rb
-- examples/rspec.rb
-- lib/pdf/reader.rb
-- lib/pdf/reader/buffer.rb
-- lib/pdf/reader/cmap.rb
+- lib/pdf/reader/glyphlist.txt
 - lib/pdf/reader/content.rb
-- lib/pdf/reader/encoding.rb
 - lib/pdf/reader/error.rb
-- lib/pdf/reader/explore.rb
-- lib/pdf/reader/filter.rb
 - lib/pdf/reader/font.rb
-- lib/pdf/reader/glyphlist.txt
-- lib/pdf/reader/parser.rb
-- lib/pdf/reader/xref.rb
+- lib/pdf/reader/print_receiver.rb
 - lib/pdf/reader/reference.rb
-- lib/pdf/reader/register_receiver.rb
+- lib/pdf/reader/filter.rb
 - lib/pdf/reader/text_receiver.rb
+- lib/pdf/reader/encoding.rb
+- lib/pdf/reader/stream.rb
+- lib/pdf/reader/register_receiver.rb
 - lib/pdf/reader/token.rb
-- lib/pdf/reader/encodings/mac_expert.txt
-- lib/pdf/reader/encodings/mac_roman.txt
-- lib/pdf/reader/encodings/pdf_doc.txt
+- lib/pdf/reader/xref.rb
+- lib/pdf/reader/cmap.rb
+- lib/pdf/reader/buffer.rb
+- lib/pdf/reader/explore.rb
+- lib/pdf/reader/encodings/zapf_dingbats.txt
 - lib/pdf/reader/encodings/standard.txt
-- lib/pdf/reader/encodings/symbol.txt
+- lib/pdf/reader/encodings/mac_roman.txt
+- lib/pdf/reader/encodings/mac_expert.txt
 - lib/pdf/reader/encodings/win_ansi.txt
-- lib/pdf/reader/encodings/zapf_dingbats.txt
-- lib/pdf/reader/stream.rb
-- lib/pdf/reader/print_receiver.rb
+- lib/pdf/reader/encodings/symbol.txt
+- lib/pdf/reader/encodings/pdf_doc.txt
+- lib/pdf/reader/parser.rb
+- lib/pdf/hash.rb
+- lib/pdf/reader.rb
 - Rakefile
 - README.rdoc
 - TODO