pdf-reader 0.10.1 → 0.11.0.alpha (RubyGems)


Files changed (19)

data/CHANGELOG +1 -4
data/README.rdoc +30 -21
data/bin/pdf_text +5 -35
data/examples/callbacks.rb +9 -4
data/examples/extract_bates.rb +15 -29
data/lib/pdf/reader.rb +150 -37
data/lib/pdf/reader/abstract_strategy.rb +2 -0
data/lib/pdf/reader/buffer.rb +12 -13
data/lib/pdf/reader/font.rb +56 -0
data/lib/pdf/reader/glyphlist.txt +40 -1
data/lib/pdf/reader/metadata_strategy.rb +3 -0
data/lib/pdf/reader/object_cache.rb +85 -0
data/lib/pdf/reader/object_hash.rb +19 -5
data/lib/pdf/reader/page.rb +172 -0
data/lib/pdf/reader/page_text_receiver.rb +253 -0
data/lib/pdf/reader/pages_strategy.rb +3 -11
data/lib/pdf/reader/text_receiver.rb +3 -0
data/lib/pdf/reader/xref.rb +3 -4
metadata +41 -35

data/CHANGELOG CHANGED

@@ -1,7 +1,4 @@
-v0.10.1 (20th October 2011)
-- simple license change to glyph data file, no code changes
-v0.10.0 (6th July 2011)
+v0.9.4 (XXX)
 - support multiple receivers within a single pass over a source file
   - massive time saving when dealing with multiple receivers

data/README.rdoc CHANGED

@@ -6,7 +6,7 @@ degree of flexibility.
 The PDF 1.7 specification is a weighty document and not all aspects are
 currently supported. I welcome submission of PDF files that exhibit
-unsupported aspects of the spec to assist with improving out support.
+unsupported aspects of the spec to assist with improving our support.
 = Installation
@@ -16,22 +16,37 @@ The recommended installation method is via Rubygems.
 = Usage
-PDF::Reader is designed with a callback-style architecture. The basic concept
-is to build a receiver class and pass that into PDF::Reader along with the PDF
-to process.
+Begin by creating a PDF::Reader instance that points to a PDF file. Document
+level information (metadata, page count, bookmarks, etc) is available via
+this object.
-As PDF::Reader walks the file and encounters various objects (pages, text,
-images, shapes, etc) it will call methods on the receiver class.  What those
-methods do is entirely up to you - save the text, extract images, count pages,
-read metadata, whatever.
+    reader = PDF::Reader.new("somefile.pdf")
-For a full list of the supported callback methods and a description of when they
-will be called, refer to PDF::Reader::PagesStrategy. See the examples directory for a
-way to print a list of all the callbacks generated by a file to STDOUT.
+    puts reader.pdf_version
+    puts reader.info
+    puts reader.metadata
+    puts reader.page_count
-There is also a class called PDF::Reader::ObjectHash. This provides direct
-access to the objects in a PDF file using a ruby hash-like API. Checkout the
-documentation for the class for further information.
+PDF is a page based file format, so most visible information is available via
+page-based iteration
+    reader = PDF::Reader.new("somefile.pdf")
+    reader.pages.each do |page|
+      puts page.fonts
+      puts page.images
+      puts page.text
+    end
+For low level access to the objects in a PDF file, use the ObjectHash class. You can
+build an ObjectHash instance directly:
+    puts PDF::Reader::ObjectHash.new("somefile.pdf")
+or via a PDF::Reader instance:
+    reader  = PDF::Reader.new("somefile.pdf")
+    puts reader.objects
 = Text Encoding
@@ -60,7 +75,7 @@ MalformedPDFError has some subclasses if you want to detect finer grained issues
 don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 Any other exceptions should be considered bugs in either PDF::Reader (please
-report it!) or your receiver (please don't report it!).
+report it!).
 = Maintainers
@@ -86,11 +101,6 @@ Check out the examples/ directory for a few files.
 = Known Limitations
-The order of the callbacks is unpredictable, and is dependent on the internal
-layout of the file, not the order objects are displayed to the user. As a
-consequence of this it is highly unlikely that text will be completely in
-order.
 Occasionally some text cannot be extracted properly due to the way it has been
 stored, or the use of invalid bytes. In these cases PDF::Reader will output a
 little UTF-8 friendly box to indicate an unrecognisable character.
@@ -98,6 +108,5 @@ little UTF-8 friendly box to indicate an unrecognisable character.
 = Resources
 - PDF::Reader Code Repository: http://github.com/yob/pdf-reader
-- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
 - PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
 - PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html

data/bin/pdf_text CHANGED

@@ -5,41 +5,11 @@ $LOAD_PATH.unshift(File.dirname(__FILE__) + "/../lib")
 require 'pdf/reader'
-class PageTextReceiver
-  attr_accessor :content
-  # Called when page parsing starts
-  def end_page(arg = nil)
-    if @content
-      puts @content
-      @content = nil
-      puts
-    end
-  end
-  def show_text(*params)
-    @content = "" if @content.nil?
-    params.each do |str|
-      @content << str.to_s
-    end
-  end
-  # there's a few text callbacks, so make sure we process them all
-  alias :super_show_text :show_text
-  alias :move_to_next_line_and_show_text :show_text
-  alias :set_spacing_next_line_show_text :show_text
-  def show_text_with_positioning(*params)
-    params = params.first
-    params ||= []
-    params.each { |str| show_text(str) if str.kind_of?(String)}
-  end
-end
-receiver = PageTextReceiver.new
 if ARGV.empty?
-  PDF::Reader.new.parse($stdin, receiver)
+  browser = PDF::Reader.new($stdin)
 else
-  PDF::Reader.file(ARGV[0], receiver)
+  browser = PDF::Reader.new(ARGV[0])
+end
+browser.pages.each do |page|
+  puts page.text
 end
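
Note on the hunk above: the removed PageTextReceiver is a typical callback receiver, a plain object whose methods are named after content-stream operations. The sketch below mimics that pattern without the gem so it can be run standalone. The class name SketchTextReceiver and the hand-written event list are illustrative assumptions; they stand in for the callbacks the library would dispatch and are not part of pdf-reader.

```ruby
# A minimal sketch of the receiver pattern, assuming a hand-written event
# list in place of a real PDF. Nothing here comes from the pdf-reader gem.
class SketchTextReceiver
  attr_reader :pages

  def initialize
    @pages   = []
    @content = +""
  end

  # Plain text callback: append every argument as a string.
  def show_text(*params)
    params.each { |str| @content << str.to_s }
  end

  # Positioned text arrives as one array mixing strings and numeric offsets;
  # only the strings are kept.
  def show_text_with_positioning(*params)
    (params.first || []).each { |item| show_text(item) if item.is_a?(String) }
  end

  # Close off the current page and start a fresh buffer.
  def end_page(*)
    @pages << @content
    @content = +""
  end
end

# Hypothetical event stream: [callback name, arguments]
events = [
  [:show_text, ["Hello "]],
  [:show_text_with_positioning, [["Wor", -20, "ld"]]],
  [:end_page, []],
]

receiver = SketchTextReceiver.new
events.each do |name, args|
  # Skip callbacks the receiver does not implement.
  receiver.send(name, *args) if receiver.respond_to?(name)
end

puts receiver.pages.inspect
```

With the page-based API introduced in this release, the script no longer needs such a class: it calls page.text on each page instead.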

data/examples/callbacks.rb CHANGED

@@ -1,7 +1,7 @@
 #!/usr/bin/env ruby
 # coding: utf-8
-# List all callbacks generated by a single PDF
+# List all callbacks generated by each page
 #
 # WARNING: this will generate a *lot* of output, so you probably want to pipe
 #          it through less or to a text file.
@@ -10,7 +10,12 @@ require 'rubygems'
 require 'pdf/reader'
 receiver = PDF::Reader::RegisterReceiver.new
-pdf = PDF::Reader.file("somefile.pdf", receiver)
-receiver.callbacks.each do |cb|
-  puts cb
+PDF::Reader.open("somefile.pdf") do |reader|
+  reader.pages.each do |page|
+    page.walk(receiver)
+    receiver.callbacks.each do |cb|
+      puts cb
+    end
+  end
 end

data/examples/extract_bates.rb CHANGED

@@ -10,26 +10,20 @@
 # the number is to look for words that match a pattern.
 #
 # This example attempts to extract numbers using the Acrobat 9 syntax.
-# As a fall back, you can provide a regular expression that will be
-# used to look for words that look like the numbers you expect in the
-# page content.
+# As a fall back, you can use a regular expression to look for words
+# that match the numbers you expect in the page content.
 require 'rubygems'
 require 'pdf/reader'
 class BatesReceiver
-  def initialize(regexp = nil)
-    @numbers = []
-    @backup  = []
-    @regexp  = regexp
-  end
+  attr_reader :numbers
-  def numbers
-    @numbers.size > 0 ? @numbers : @backup
+  def initialize
+    @numbers = []
   end
-  # Called when page parsing starts
   def begin_marked_content(*args)
     return unless args.size >= 2
     return unless args.first == :Artifact
@@ -39,25 +33,17 @@ class BatesReceiver
   end
   alias :begin_marked_content_with_pl :begin_marked_content
-  # record text that is drawn on the page
-  def show_text(string, *params)
-    return if @regexp.nil?
-    string.scan(@regexp).each { |m| @backup << m }
-  end
+end
-  # there's a few text callbacks, so make sure we process them all
-  alias :super_show_text :show_text
-  alias :move_to_next_line_and_show_text :show_text
-  alias :set_spacing_next_line_show_text :show_text
-  # this final text callback takes slightly different arguments
-  def show_text_with_positioning(*params)
-    params = params.first
-    params.each { |str| show_text(str) if str.kind_of?(String)}
+PDF::Reader.open("bates.pdf") do |reader|
+  reader.pages.each do |page|
+    receiver = BatesReceiver.new
+    page.walk(receiver)
+    if receiver.numbers.empty?
+      puts page.scan(/CC.+/)
+    else
+      puts receiver.numbers.inspect
+    end
   end
 end
-receiver = BatesReceiver.new(/CC.+/)
-PDF::Reader.file("bates.pdf", receiver)
-puts receiver.numbers.inspect

data/lib/pdf/reader.rb CHANGED

@@ -1,6 +1,7 @@
 ################################################################################
 #
 # Copyright (C) 2006 Peter J Jones (pjones@pmade.com)
+# Copyright (C) 2011 James Healy
 #
 # Permission is hereby granted, free of charge, to any person obtaining
 # a copy of this software and associated documentation files (the
@@ -30,62 +31,102 @@ require 'ascii85'
 module PDF
   ################################################################################
-  # The Reader class serves as an entry point for parsing a PDF file. There are three
-  # ways to kick off processing - which one you pick will be based on personal preference
-  # and the situation.
+  # The Reader class serves as an entry point for parsing a PDF file.
   #
-  # For all examples, assume the receiver variable contains an object that will respond
-  # to various callbacks. Refer to the README and PDF::Reader::Content for more information
-  # on receivers.
+  # PDF is a page based file format. There is some data associated with the
+  # document (metadata, bookmarks, etc) but all visible content is stored
+  # under a Page object.
   #
-  # = Parsing a file
+  # In most use cases for extracting and examining the contents of a PDF it
+  # makes sense to traverse the information using page based iteration.
   #
-  #   PDF::Reader.file("somefile.pdf", receiver)
+  # In addition to the documentation here, check out the
+  # PDF::Reader::Page class.
   #
-  # = Parsing a String
+  # == File Metadata
   #
-  # This is useful for processing a PDF that is already in memory
+  #   reader = PDF::Reader.new("somefile.pdf")
   #
-  #   PDF::Reader.string(pdf_string, receiver)
+  #   puts reader.pdf_version
+  #   puts reader.info
+  #   puts reader.metadata
+  #   puts reader.page_count
   #
-  # = Parsing an IO object
+  # == Iterating over page content
   #
-  # This can be a useful alternative to the first 2 options in some situations
+  #   reader = PDF::Reader.new("somefile.pdf")
   #
-  #   pdf = PDF::Reader.new
-  #   pdf.parse(File.new("somefile.pdf"), receiver)
+  #   reader.pages.each do |page|
+  #     puts page.fonts
+  #     puts page.images
+  #     puts page.text
+  #   end
   #
-  # = Parsing parts of a file
+  # == Extracting all text
   #
-  # Both PDF::Reader#file and PDF::Reader#string accept a third argument that
-  # specifies which parts of the file to process. By default, all options are
-  # enabled, so this can be useful to cut down processing time if you're only
-  # interested in say, metadata.
+  #   reader = PDF::Reader.new("somefile.pdf")
   #
-  # As an example, the following call will disable parsing the contents of
-  # pages in the file, but explicitly enables processing metadata.
+  #   reader.pages.map(&:text)
   #
-  #   PDF::Reader.new("somefile.pdf", receiver, {:metadata => true, :pages => false})
+  # == Extracting content from a single page
   #
-  # Available options are currently:
+  #   reader = PDF::Reader.new("somefile.pdf")
   #
-  #   :metadata
-  #   :pages
-  #   :raw_text
+  #   page = reader.page(1)
+  #   puts page.fonts
+  #   puts page.images
+  #   puts page.text
   #
-  # = Processing with multiple receivers
+  # == Low level callbacks (ala current version of PDF::Reader)
   #
-  # If you wish to parse a PDF file with multiple simultaneous receivers, just
-  # pass an array of receivers as the second argument:
+  #   reader = PDF::Reader.new("somefile.pdf")
   #
-  #   pdf = PDF::Reader.new
-  #   pdf.parse(File.new("somefile.pdf"), [receiver_one, receiever_two])
-  #
-  # This saves a significant amount of time by limiting the work to a single pass
-  # over the source file.
+  #   page = reader.page(1)
+  #   page.walk(receiver)
   #
   class Reader
+    # lowlevel hash-like access to all objects in the underlying PDF
+    attr_reader :objects
+    attr_reader :page_count, :pdf_version, :info, :metadata
+    # creates a new document reader for the provided PDF.
+    #
+    # input can be an IO-ish object (StringIO, File, etc) containing a PDF
+    # or a filename
+    #
+    #   reader = PDF::Reader.new("somefile.pdf")
+    #
+    #   File.open("somefile.pdf","rb") do |file|
+    #     reader = PDF::Reader.new(file)
+    #   end
+    #
+    def initialize(input = nil)
+      if input # support the deprecated Reader API
+        @objects = PDF::Reader::ObjectHash.new(input)
+        @page_count  = get_page_count
+        @pdf_version = @objects.pdf_version
+        @info        = @objects.deref(@objects.trailer[:Info])
+        @metadata    = get_metadata
+      end
+    end
+    # syntactic sugar for opening a PDF file. Accepts the same arguments
+    # as new().
+    #
+    #   PDF::Reader.open("somefile.pdf") do |reader|
+    #     puts reader.pdf_version
+    #   end
+    #
+    def self.open(input, &block)
+      yield PDF::Reader.new(input)
+    end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
+    #
     # Parse the file with the given name, sending events to the given receiver.
     #
     def self.file(name, receivers, opts = {})
@@ -94,6 +135,9 @@ module PDF
       end
     end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
     # Parse the given string, sending events to the given receiver.
     #
     def self.string(str, receivers, opts = {})
@@ -102,6 +146,9 @@ module PDF
       end
     end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
     # Parse the file with the given name, returning an unmarshalled ruby version of
     # represents the requested pdf object
     #
@@ -111,6 +158,9 @@ module PDF
       }
     end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
     # Parse the given string, returning an unmarshalled ruby version of represents
     # the requested pdf object
     #
@@ -120,6 +170,48 @@ module PDF
       }
     end
+    # returns an array of PDF::Reader::Page objects, one for each
+    # page in the source PDF.
+    #
+    #   reader = PDF::Reader.new("somefile.pdf")
+    #
+    #   reader.pages.each do |page|
+    #     puts page.fonts
+    #     puts page.images
+    #     puts page.text
+    #   end
+    #
+    # See the docs for PDF::Reader::Page to read more about the
+    # methods available on each page
+    #
+    def pages
+      (1..@page_count).map { |num|
+        PDF::Reader::Page.new(@objects, num)
+      }
+    end
+    # returns a single PDF::Reader::Page for the specified page.
+    # Use this instead of pages method when you need to access just a single
+    # page
+    #
+    #   reader = PDF::Reader.new("somefile.pdf")
+    #   page   = reader.page(10)
+    #
+    #   puts page.text
+    #
+    # See the docs for PDF::Reader::Page to read more about the
+    # methods available on each page
+    #
+    def page(num)
+      num = num.to_i
+      raise ArgumentError, "valid pages are 1 .. #{@page_count}" if num < 1 || num > @page_count
+      PDF::Reader::Page.new(@objects, num)
+    end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
     # Given an IO object that contains PDF data, parse it.
     #
     def parse(io, receivers, opts = {})
@@ -139,12 +231,15 @@ module PDF
       self
     end
+    # DEPRECATED: this method was deprecated in version 0.11.0 and will
+    #             eventually be removed
+    #
     # Given an IO object that contains PDF data, return the contents of a single object
     #
     def object (io, id, gen)
-      @ohash = ObjectHash.new(io)
+      @objects = ObjectHash.new(io)
-      @ohash.object(Reference.new(id, gen))
+      @objects.deref(Reference.new(id, gen))
     end
     private
@@ -155,6 +250,21 @@ module PDF
         ::PDF::Reader::PagesStrategy
       ]
     end
+    def root
+      root ||= @objects.deref(@objects.trailer[:Root])
+    end
+    def get_metadata
+      stream = @objects.deref(root[:Metadata])
+      stream ? stream.unfiltered_data : nil
+    end
+    def get_page_count
+      pages = @objects.deref(root[:Pages])
+      pages[:Count]
+    end
   end
 end
 ################################################################################
@@ -168,6 +278,7 @@ require 'pdf/reader/filter'
 require 'pdf/reader/font'
 require 'pdf/reader/lzw'
 require 'pdf/reader/metadata_strategy'
+require 'pdf/reader/object_cache'
 require 'pdf/reader/object_hash'
 require 'pdf/reader/object_stream'
 require 'pdf/reader/pages_strategy'
@@ -177,6 +288,8 @@ require 'pdf/reader/reference'
 require 'pdf/reader/register_receiver'
 require 'pdf/reader/stream'
 require 'pdf/reader/text_receiver'
+require 'pdf/reader/page_text_receiver'
 require 'pdf/reader/token'
 require 'pdf/reader/xref'
+require 'pdf/reader/page'
 require 'pdf/hash'
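
Note on the private root helper added in data/lib/pdf/reader.rb: it is written as `root ||= ...`. In Ruby the left-hand side of that assignment is a fresh local variable, so the value is not kept between calls and the lookup runs every time. Whether that matters for this release is not stated in the diff. The standalone sketch below only demonstrates the language behaviour; RootDemo and expensive_lookup are hypothetical names, not part of pdf-reader.

```ruby
# Hypothetical stand-in that counts how often the lookup runs.
class RootDemo
  attr_reader :lookups

  def initialize
    @lookups = 0
  end

  # Same shape as the helper in the diff: the left-hand side is a new local
  # variable (initially nil), so nothing is remembered between calls.
  def root_local
    root_local ||= expensive_lookup
  end

  # Instance-variable memoization: the lookup runs once per object.
  def root_memoized
    @root ||= expensive_lookup
  end

  private

  def expensive_lookup
    @lookups += 1
    { :Type => :Catalog }
  end
end

local_demo = RootDemo.new
2.times { local_demo.root_local }

memo_demo = RootDemo.new
2.times { memo_demo.root_memoized }

puts "local variable form: #{local_demo.lookups} lookups"
puts "instance variable form: #{memo_demo.lookups} lookups"
```

Both forms return the same value; they differ only in how often the lookup is repeated.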