RubyGems - pdf-reader - Versions diffs - 0.7.2 → 0.7.3 - Mend

pdf-reader 0.7.2 → 0.7.3

Files changed (25)

data/CHANGELOG +12 -2
data/{README → README.rdoc} +27 -47
data/Rakefile +5 -4
data/TODO +3 -1
data/bin/pdf_list_callbacks +1 -5
data/bin/pdf_object +43 -0
data/bin/pdf_text +1 -0
data/lib/pdf/reader.rb +25 -7
data/lib/pdf/reader/buffer.rb +3 -1
data/lib/pdf/reader/content.rb +56 -48
data/lib/pdf/reader/encoding.rb +82 -1088
data/lib/pdf/reader/encodings/mac_expert.txt +159 -0
data/lib/pdf/reader/encodings/mac_roman.txt +128 -0
data/lib/pdf/reader/encodings/pdf_doc.txt +40 -0
data/lib/pdf/reader/encodings/standard.txt +47 -0
data/lib/pdf/reader/encodings/symbol.txt +154 -0
data/lib/pdf/reader/encodings/win_ansi.txt +29 -0
data/lib/pdf/reader/encodings/zapf_dingbats.txt +201 -0
data/lib/pdf/reader/error.rb +1 -0
data/lib/pdf/reader/font.rb +4 -3
data/lib/pdf/reader/parser.rb +1 -0
data/lib/pdf/reader/print_receiver.rb +19 -0
data/lib/pdf/reader/xref.rb +12 -0
metadata +26 -17
data/lib/pdf/reader/parser.rb.rej +0 -29

data/CHANGELOG CHANGED

@@ -1,3 +1,13 @@
+v0.7.3 (UNRELESED)
+- Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
+- Fix a hard loop bug caused by a content stream that is missing a final operator
+- Significantly simplified the internal code for encoding conversions
+  - Fixes YACC parsing bug that occurs on Fedora 8's ruby VM
+- New callbacks
+  - page_count
+  - pdf_version
+- Fix a bug that prevented a font's BaseFont from being recorded correctly
 v0.7.2 (20th May 2008)
 - Throw an UnsupportedFeatureError if we try to open an encrypted/secure PDF
 - Correctly handle page content instruction sets with trailing whitespace
@@ -16,7 +26,7 @@ v0.7 (6th May 2008)
 - Behave as expected if the Contents key in a Page Dict is a reference
 - Include some basic metadata callbacks
 - Don't interpret a comment token (%) inside a string as a comment
-- Small fixes to improve 1.9 compatability
+- Small fixes to improve 1.9 compatibility
 - Improved our Zlib deflating to make it slightly more robust - still some more issues to work out though
 - Throw an UnsupportedFeatureError if a pdf that uses XRef streams is opened
 - Added an option to PDF::Reader#file and PDF::Reader#string to enable parsing of only parts of a PDF file(ie. only metadata, etc)
@@ -36,7 +46,7 @@ v0.6.1 (12th March 2008)
   just replace each character with a little box.
 - Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
   NoMethodError.
-- Added a method to RegisterReceiver that returns all occurances of a callback
+- Added a method to RegisterReceiver that returns all occurrences of a callback
 v0.6.0 (27th February 2008)
 - all text is now transparently converted to UTF-8 before being passed to the callbacks.

data/{README → README.rdoc} RENAMED

@@ -48,8 +48,11 @@ UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't curren
 support. Again, we welcome submissions of PDF files that exhibit these features to help
 us with future code improvements.
-Any other exceptions should be considered bugs and should be reported (unless they originate
-inside your receiver, in which case you're on your own)
+MalformedPDFError has some subclasses if you want to detect finer grained issues. If you
+don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
+Any other exceptions should be considered bugs in either PDF::Reader (please
+report it!) your receiver (please don't report it!).
 = Maintainers
@@ -66,7 +69,7 @@ http://groups.google.com/group/pdf-reader
 The easiest way to explain how this works in practice is to show some examples.
-== Page Counter
+== Naïve Page Counter
 A simple app to count the number of pages in a PDF File.
@@ -127,6 +130,26 @@ it through less or to a text file.
   puts receiver.regular.inspect
   puts receiver.xml.inspect
+== Improved Page Counter
+A simple app to display the number of pages in a PDF File.
+  require 'rubygems'
+  require 'pdf/reader'
+  class PageReceiver
+    attr_accessor :pages
+    # Called when page parsing ends
+    def page_count(arg)
+      @pages = arg
+    end
+  end
+  receiver = PageReceiver.new
+  pdf = PDF::Reader.file("somefile.pdf", receiver, :pages => false)
+  puts "#{receiver.pages} pages"
 == Basic RSpec of a generated PDF
   require 'rubygems'
@@ -182,49 +205,6 @@ it through less or to a text file.
     end
   end
-== Extract ISBNs
-Parse all text in the requested PDF file and print out any valid book ISBNs.
-Requires the rbook-isbn gem.
-  require 'rubygems'
-  require 'pdf/reader'
-  require 'rbook/isbn'
-  class ISBNReceiver
-    # there's a few text callbacks, so make sure we process them all
-    def show_text(string, *params)
-      process_words(string.split(/\W+/))
-    end
-    def super_show_text(string, *params)
-      process_words(string.split(/\W+/))
-    end
-    def move_to_next_line_and_show_text (string)
-      process_words(string.split(/\W+/))
-    end
-    def set_spacing_next_line_show_text (aw, ac, string)
-      process_words(string.split(/\W+/))
-    end
-    private
-    # check if any items in the supplied array are a valid ISBN, and print any
-    # that are to console
-    def process_words(words)
-      words.each do |word|
-        word.strip!
-        puts "#{RBook::ISBN.convert_to_isbn13(word)}" if RBook::ISBN.valid_isbn?(word)
-      end
-    end
-  end
-  receiver = ISBNReceiver.new
-  PDF::Reader.file("somefile.pdf", receiver)
 = Known Limitations
 The order of the callbacks is unpredicable, and is dependent on the internal
@@ -238,7 +218,7 @@ little UTF-8 friendly box to indicate an unrecognisable character.
 = Resources
-- PDF::Reader Homepage: http://software.pmade.com/pdfreader
+- PDF::Reader Code Repository: http://github.com/yob/pdf-reader
 - PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
 - PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
 - PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
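
The README text added in this release says that rescuing MalformedPDFError also catches its subclasses. A minimal sketch of that behaviour follows. The classes in the Sketch module are hypothetical stand-ins, not the library's own definitions, and treating InvalidObjectError as one of the subclasses is an assumption based on the rescue order used in bin/pdf_object.

```ruby
# Hypothetical stand-ins for the library's error classes (assumption:
# InvalidObjectError is one of the finer-grained MalformedPDFError subclasses).
module Sketch
  class MalformedPDFError < RuntimeError; end
  class InvalidObjectError < MalformedPDFError; end
  class UnsupportedFeatureError < RuntimeError; end
end

# Runs the block and returns a label describing how it failed, or :ok.
# The most specific rescue clause has to come first.
def classify_failure
  yield
  :ok
rescue Sketch::InvalidObjectError
  :invalid_object
rescue Sketch::MalformedPDFError
  :malformed
rescue Sketch::UnsupportedFeatureError
  :unsupported
end

# Rescuing only the parent class still catches the subclass.
def malformed?
  yield
  false
rescue Sketch::MalformedPDFError
  true
end

puts classify_failure { raise Sketch::InvalidObjectError, "object 7 missing" }
puts malformed? { raise Sketch::InvalidObjectError, "object 7 missing" }
```

Callers that do not care about the finer-grained cases can keep a single rescue clause for the parent class.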

data/Rakefile CHANGED

@@ -6,7 +6,7 @@ require 'rake/testtask'
 require "rake/gempackagetask"
 require 'spec/rake/spectask'
-PKG_VERSION = "0.7.2"
+PKG_VERSION = "0.7.3"
 PKG_NAME = "pdf-reader"
 PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
@@ -44,7 +44,7 @@ desc "Create documentation"
 Rake::RDocTask.new("doc") do |rdoc|
   rdoc.title = "pdf-reader"
   rdoc.rdoc_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/rdoc'
-  rdoc.rdoc_files.include('README')
+  rdoc.rdoc_files.include('README.rdoc')
   rdoc.rdoc_files.include('TODO')
   rdoc.rdoc_files.include('CHANGELOG')
   #rdoc.rdoc_files.include('COPYING')
@@ -66,12 +66,13 @@ spec = Gem::Specification.new do |spec|
   spec.require_path = "lib"
   spec.bindir = "bin"
+  spec.executables << "pdf_object"
   spec.executables << "pdf_text"
   spec.executables << "pdf_list_callbacks"
 	spec.has_rdoc = true
-	spec.extra_rdoc_files = %w{README TODO CHANGELOG}
+	spec.extra_rdoc_files = %w{README.rdoc TODO CHANGELOG}
 	spec.rdoc_options << '--title' << 'PDF::Reader Documentation' <<
-	                     '--main'  << 'README' << '-q'
+	                     '--main'  << 'README.rdoc' << '-q'
   spec.author = "Peter Jones"
 	spec.email = "pjones@pmade.com"
 	spec.rubyforge_project = "pdf-reader"

data/TODO CHANGED

@@ -1,4 +1,7 @@
 v0.8
+- add extra callbacks
+  - list implemented features
+    - encrypted? tagged? bookmarks? annotated? optimised?
 - Allow more than just page content and metadata to be parsed (see spec section 3.6.1)
   - bookmarks?
   - outline?
@@ -9,7 +12,6 @@ v0.8
   poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
   from the Original encoding to Unicode.
 - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
-- Provide a way to get raw access to a particular object. Good for testing purposes
 - Improve interpretation of non content stream data (ie metadata). recognise dates, etc
 - Support Cross Reference Streams (spec 3.4.7)

data/bin/pdf_list_callbacks CHANGED

@@ -4,14 +4,10 @@ $LOAD_PATH.unshift(File.dirname(__FILE__) + "/../lib")
 require 'pdf/reader'
-receiver = PDF::Reader::RegisterReceiver.new
+receiver = PDF::Reader::PrintReceiver.new
 if ARGV.empty?
   PDF::Reader.new.parse($stdin, receiver)
 else
   PDF::Reader.file(ARGV[0], receiver)
 end
-receiver.callbacks.each do |callback|
-  puts "#{callback[:name]} - #{callback[:args].inspect}"
-end

data/bin/pdf_object ADDED

@@ -0,0 +1,43 @@
+#!/usr/bin/env ruby
+$LOAD_PATH.unshift(File.dirname(__FILE__) + "/../lib")
+USAGE = "USAGE: " + File.basename(__FILE__) + " <file> <object id> [generation]"
+require 'pdf/reader'
+filename, id, gen = *ARGV
+if filename.nil? || id.nil?
+  puts USAGE
+  exit 1
+elsif !File.file?(filename)
+  $stderr.puts "#{filename} does not exist"
+  exit 1
+end
+# tweak the users options
+id  =  id.to_i
+gen ||= 0
+gen = gen.to_i
+# make magic happen
+begin
+  obj = PDF::Reader.object_file(filename, id, gen)
+  case obj
+  when Hash, Array
+    puts obj.inspect
+  else
+    puts obj
+  end
+rescue PDF::Reader::InvalidObjectError
+  $stderr.puts "Error retreiving object #{id}, gen #{gen}. Does it exist?"
+  exit 1
+rescue PDF::Reader::MalformedPDFError => e
+  $stderr.puts "Malformed PDF file: #{e.message}"
+  exit 1
+rescue PDF::Reader::UnsupportedFeatureError => e
+  $stderr.puts "PDF file implements a feature unsupported by PDF::Reader: #{e.message}"
+  exit 1
+end

data/bin/pdf_text CHANGED

@@ -11,6 +11,7 @@ class PageTextReceiver
   def end_page(arg = nil)
     if @content
       puts @content
+      @content = nil
       puts
     end
   end

data/lib/pdf/reader.rb CHANGED

@@ -70,19 +70,31 @@ module PDF
   class Reader
     ################################################################################
     # Parse the file with the given name, sending events to the given receiver.
-    def self.file (name, receiver, opts = {})
+    def self.file(name, receiver, opts = {})
       File.open(name,"rb") do |f|
         new.parse(f, receiver, opts)
       end
     end
     ################################################################################
     # Parse the given string, sending events to the given receiver.
-    def self.string (str, receiver, opts = {})
+    def self.string(str, receiver, opts = {})
       StringIO.open(str) do |s|
         new.parse(s, receiver, opts)
       end
     end
     ################################################################################
+    def self.object_file(name, id, gen)
+      File.open(name,"rb") do |f|
+        new.object(f, id, gen)
+      end
+    end
+    ################################################################################
+    def self.object_string(name, id, gen)
+      StringIO.open(str) do |s|
+        new.object(s, id, gen)
+      end
+    end
+    ################################################################################
   end
   ################################################################################
 end
@@ -96,6 +108,7 @@ require 'pdf/reader/error'
 require 'pdf/reader/filter'
 require 'pdf/reader/font'
 require 'pdf/reader/parser'
+require 'pdf/reader/print_receiver'
 require 'pdf/reader/reference'
 require 'pdf/reader/register_receiver'
 require 'pdf/reader/stream'
@@ -104,10 +117,6 @@ require 'pdf/reader/token'
 require 'pdf/reader/xref'
 class PDF::Reader
-  ################################################################################
-  # Initialize a new PDF::Reader
-  def initialize
-  end
   ################################################################################
   # Given an IO object that contains PDF data, parse it.
   def parse (io, receiver, opts = {})
@@ -121,10 +130,19 @@ class PDF::Reader
     trailer = @xref.load
     raise PDF::Reader::UnsupportedFeatureError, 'PDF::Reader cannot read encrypted PDF files' if trailer[:Encrypt]
-    @content.metadata(@xref.object(trailer[:Info])) if options[:metadata]
+    @content.metadata(@xref.object(trailer[:Root]), @xref.object(trailer[:Info])) if options[:metadata]
     @content.document(@xref.object(trailer[:Root])) if options[:pages]
     self
   end
   ################################################################################
+  # Given an IO object that contains PDF data, return the contents of a single object
+  def object (io, id, gen)
+    @buffer   = Buffer.new(io)
+    @xref     = XRef.new(@buffer)
+    @xref.load
+    @xref.object(Reference.new(id, gen))
+  end
+  ################################################################################
 end
 ################################################################################
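
As shown in the hunk above, object_string declares its first parameter as name but hands str to StringIO.open, so a call would raise NameError. The sketch below shows the same wrapper shape with the parameter named consistently. TinyReader is a hypothetical stand-in for PDF::Reader, its object method only reports what it receives, and the default of 0 for gen is an assumption that mirrors the gen ||= 0 line in bin/pdf_object.

```ruby
require 'stringio'

# Hypothetical stand-in for PDF::Reader; it does no real PDF parsing.
class TinyReader
  # Reports the arguments it was given instead of resolving an object.
  def object(io, id, gen)
    "#{io.read.bytesize} bytes, object #{id} gen #{gen}"
  end

  # String-based entry point. The parameter is called str so that it
  # matches the variable passed to StringIO.open.
  def self.object_string(str, id, gen = 0)
    StringIO.open(str) do |s|
      new.object(s, id, gen)
    end
  end
end

puts TinyReader.object_string("%PDF-1.4", 7)
```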

data/lib/pdf/reader/buffer.rb CHANGED

@@ -118,7 +118,9 @@ class PDF::Reader
       strip_space = !(i == 0 and @buffer[0,1] == '(')
       tok = head(token_chars, strip_space)
-      if tok[0,1] == "%"
+      if tok == ""
+        nil
+      elsif tok[0,1] == "%"
         @buffer = ""
         token
       else
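
With this change the buffer returns nil once no token is left, and the rewritten loop in content.rb stops on that nil. The CHANGELOG describes the result as fixing a hard loop on content streams that lack a final operator. A minimal sketch of the idea follows. TinyTokens and run are hypothetical, tokens are split on whitespace rather than by PDF lexical rules, and OPERATORS is a three-entry sample, not the library's table.

```ruby
# Small sample of operator-to-callback mappings (illustrative subset only).
OPERATORS = {
  "BT" => :begin_text_object,
  "Tj" => :show_text,
  "ET" => :end_text_object
}

# Hypothetical tokenizer: hands out whitespace-separated tokens and
# returns nil once the input is exhausted.
class TinyTokens
  def initialize(str)
    @tokens = str.split
  end

  def next_token
    @tokens.shift # Array#shift returns nil on an empty array
  end
end

# Collects operands until an operator appears, then records the event.
# The loop ends as soon as next_token returns nil, even if operands
# are still waiting for an operator that never comes.
def run(stream)
  tokens = TinyTokens.new(stream)
  params = []
  events = []
  while (token = tokens.next_token)
    if OPERATORS.key?(token)
      events << [OPERATORS[token], params.dup]
      params.clear
    else
      params << token
    end
  end
  events
end

p run("BT (hi) Tj ET")
p run("BT (hi) Tj 12") # trailing operand without an operator still terminates
```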

data/lib/pdf/reader/content.rb CHANGED

@@ -9,10 +9,10 @@
 # distribute, sublicense, and/or sell copies of the Software, and to
 # permit persons to whom the Software is furnished to do so, subject to
 # the following conditions:
-#
+#
 # The above copyright notice and this permission notice shall be
 # included in all copies or substantial portions of the Software.
-#
+#
 # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 # MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
@@ -27,20 +27,20 @@ require 'stringio'
 class PDF::Reader
   ################################################################################
   # Walks the PDF file and calls the appropriate callback methods when something of interest is
-  # found.
+  # found.
   #
   # The callback methods should exist on the receiver object passed into the constructor. Whenever
-  # some content is found that will trigger a callback, the receiver is checked to see if the callback
+  # some content is found that will trigger a callback, the receiver is checked to see if the callback
   # is defined.
   #
   # If it is defined it will be called. If not, processing will continue.
   #
   # = Available Callbacks
-  # The following callbacks are available and should be methods defined on your receiver class. Only
+  # The following callbacks are available and should be methods defined on your receiver class. Only
   # implement the ones you need - the rest will be ignored.
   #
   # Some callbacks will include parameters which will be passed in as an array. For callbacks that supply no
-  # paramters, or where you don't need them, the *params argument can be left off. Some example callback
+  # paramters, or where you don't need them, the *params argument can be left off. Some example callback
   # method definitions are:
   #
   #   def begin_document
@@ -49,14 +49,14 @@ class PDF::Reader
   #   def fill_stroke(*params)
   #
   # You should be able to infer the basic command the callback is reporting based on the name. For
-  # further experimentation, define the callback with just a *params parameter, then print out the
+  # further experimentation, define the callback with just a *params parameter, then print out the
   # contents of the array using something like:
   #
   #   puts params.inspect
   #
   # == Text Callbacks
   #
-  # All text passed into these callbacks will be encoded as UTF-8. Depending on where (and when) the
+  # All text passed into these callbacks will be encoded as UTF-8. Depending on where (and when) the
   # PDF was generated, there's a good chance the text is NOT stored as UTF-8 internally so be careful
   # when doing a comparison on strings returned from PDF::Reader (when doing unit tests for example). The
   # string may not be byte-by-byte identical with the string that was originally written to the PDF.
@@ -146,6 +146,7 @@ class PDF::Reader
   # - end_page
   # - metadata
   # - xml_metadata
+  # - page_count
   #
   # == Resource Callbacks
   #
@@ -155,8 +156,8 @@ class PDF::Reader
   # on a page:
   #
   # In most cases, these callbacks associate a name with each resource, allowing it
-  # to be referred to by name in the page content. For example, an XObject can hold an image.
-  # If it gets mapped to the name "IM1", then it can be placed on the page using
+  # to be referred to by name in the page content. For example, an XObject can hold an image.
+  # If it gets mapped to the name "IM1", then it can be placed on the page using
   # invoke_xobject "IM1".
   #
   # - resource_procset
@@ -252,25 +253,37 @@ class PDF::Reader
     end
     ################################################################################
     # Begin processing the document metadata
-    def metadata (info)
+    def metadata (root, info)
       info = decode_strings(info)
+      # may be useful to some people
+      callback(:pdf_version, @xref.pdf_version)
+      # ye olde metadata
       callback(:metadata, [info]) if info
+      # new style xml metadata
+      callback(:xml_metadata,@xref.object(root[:Metadata])) if root[:Metadata]
+      # page count
+      if (pages = @xref.object(root[:Pages]))
+        if (count = @xref.object(pages[:Count]))
+          callback(:page_count, count.to_i)
+        end
+      end
     end
     ################################################################################
     # Begin processing the document
     def document (root)
-      if root[:Metadata]
-        callback(:xml_metadata,@xref.object(root[:Metadata]))
-      end
       callback(:begin_document, [root])
       walk_pages(@xref.object(root[:Pages]))
       callback(:end_document)
     end
     ################################################################################
-    # Walk over all pages in the PDF file, calling the appropriate callbacks for each page and all
+    # Walk over all pages in the PDF file, calling the appropriate callbacks for each page and all
     # its content
     def walk_pages (page)
       if page[:Resources]
         res = page[:Resources]
         page.delete(:Resources)
@@ -293,7 +306,7 @@ class PDF::Reader
         else
           contents = [page[:Contents]]
         end
         contents.each do |content|
           obj = @xref.object(content)
           content_stream(obj)
@@ -310,32 +323,27 @@ class PDF::Reader
       @parser = Parser.new(@buffer, @xref)
       @params = [] if @params.nil?
-      until @buffer.eof?
-        loop do
-          token = @parser.parse_token(OPERATORS)
-          if token.kind_of?(Token) and OPERATORS.has_key?(token)
-            @current_font = @params.first if OPERATORS[token] == :set_text_font_and_size
+      while (token = @parser.parse_token(OPERATORS))
+        if token.kind_of?(Token) and OPERATORS.has_key?(token)
+          @current_font = @params.first if OPERATORS[token] == :set_text_font_and_size
-            # handle special cases in response to certain operators
-            if OPERATORS[token].to_s.include?("show_text") && @fonts[@current_font]
-              # convert any text to utf-8
-              @params = @fonts[@current_font].to_utf8(@params)
-            elsif token == "ID"
-              # inline image data, first convert the current params into a more familiar hash
-              map = {}
-              @params.each_slice(2) do |a|
-                map[a.first] = a.last
-              end
-              @params = [map]
-              # read the raw image data from the buffer without tokenising
-              @params << @buffer.read_until("EI")
+          # handle special cases in response to certain operators
+          if OPERATORS[token].to_s.include?("show_text") && @fonts[@current_font]
+            # convert any text to utf-8
+            @params = @fonts[@current_font].to_utf8(@params)
+          elsif token == "ID"
+            # inline image data, first convert the current params into a more familiar hash
+            map = {}
+            @params.each_slice(2) do |a|
+              map[a.first] = a.last
             end
-            callback(OPERATORS[token], @params)
-            @params.clear
-            break
+            @params = [map]
+            # read the raw image data from the buffer without tokenising
+            @params << @buffer.read_until("EI")
           end
+          callback(OPERATORS[token], @params)
+          @params.clear
+        else
           @params << token
         end
       end
@@ -345,7 +353,7 @@ class PDF::Reader
     ################################################################################
     def walk_resources(resources)
       resources = resolve_references(resources)
       # extract any procset information
       if resources[:ProcSet]
         callback(:resource_procset, resources[:ProcSet])
@@ -387,7 +395,7 @@ class PDF::Reader
           @fonts[label].label = label
           @fonts[label].subtype = desc[:Subtype] if desc[:Subtype]
           @fonts[label].basefont = desc[:BaseFont] if desc[:BaseFont]
-          @fonts[label].encoding = PDF::Reader::Encoding.factory(@xref.object(desc[:Encoding]))
+          @fonts[label].encoding = PDF::Reader::Encoding.new(@xref.object(desc[:Encoding]))
           @fonts[label].descendantfonts = desc[:DescendantFonts] if desc[:DescendantFonts]
           if desc[:ToUnicode]
             # this stream is a cmap
@@ -402,13 +410,13 @@ class PDF::Reader
       end
     end
     ################################################################################
-    # Convert any PDF::Reader::Resource objects into a real object
+    # Convert any PDF::Reader::Resource objects into a real object
     def resolve_references(obj)
       case obj
-      when PDF::Reader::Stream then
+      when PDF::Reader::Stream then
         obj.hash = resolve_references(obj.hash)
         obj
-      when PDF::Reader::Reference then
+      when PDF::Reader::Reference then
         resolve_references(@xref.object(obj))
       when Hash                   then obj.each { |key,val| obj[key] = resolve_references(val) }
       when Array                  then obj.collect { |item| resolve_references(item) }
@@ -426,11 +434,11 @@ class PDF::Reader
     # strings outside of page content should be in either PDFDocEncoding or UTF-16.
     def decode_strings(obj)
       case obj
-      when String then
+      when String then
         if obj[0,2] == "\376\377"
-          PDF::Reader::Encoding::UTF16Encoding.new.to_utf8(obj)
+          PDF::Reader::Encoding.new(:UTF16Encoding).to_utf8(obj[2, obj.size])
         else
-          PDF::Reader::Encoding::PDFDocEncoding.new.to_utf8(obj)
+          PDF::Reader::Encoding.new(:PDFDocEncoding).to_utf8(obj)
         end
       when Hash   then obj.each { |key,val| obj[key] = decode_strings(val) }
       when Array  then obj.collect { |item| decode_strings(item) }
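
The decode_strings change above passes obj[2, obj.size] to the UTF-16 converter, that is, the string without its two-byte "\376\377" (0xFE 0xFF) byte order mark. The sketch below shows the same detect-and-strip step. It uses String#encode from Ruby 1.9 and later rather than the library's own conversion tables, and it approximates PDFDocEncoding with ISO-8859-1, which differs for some code points. decode_metadata_string is a hypothetical name.

```ruby
# Big-endian UTF-16 byte order mark, the same bytes as "\376\377".
UTF16_BOM = "\xFE\xFF".b

# Hypothetical helper: convert a metadata string to UTF-8.
def decode_metadata_string(raw)
  bytes = raw.b
  if bytes[0, 2] == UTF16_BOM
    # Drop the two BOM bytes, then read the rest as big-endian UTF-16.
    bytes[2, bytes.size].force_encoding("UTF-16BE").encode("UTF-8")
  else
    # Assumption: ISO-8859-1 as a rough approximation of PDFDocEncoding.
    bytes.force_encoding("ISO-8859-1").encode("UTF-8")
  end
end

puts decode_metadata_string("\xFE\xFF\x00H\x00i".b)
puts decode_metadata_string("Hi")
```

Stripping the mark before conversion keeps it from turning into a stray U+FEFF character at the start of the decoded string.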