pdf-reader 0.5

Files changed (18)

data/CHANGELOG +2 -0
data/README +177 -0
data/Rakefile +84 -0
data/TODO +9 -0
data/lib/pdf/reader.rb +106 -0
data/lib/pdf/reader/buffer.rb +144 -0
data/lib/pdf/reader/content.rb +289 -0
data/lib/pdf/reader/error.rb +53 -0
data/lib/pdf/reader/explore.rb +116 -0
data/lib/pdf/reader/filter.rb +62 -0
data/lib/pdf/reader/name.rb +37 -0
data/lib/pdf/reader/parser.rb +203 -0
data/lib/pdf/reader/reference.rb +55 -0
data/lib/pdf/reader/register_receiver.rb +18 -0
data/lib/pdf/reader/text_receiver.rb +259 -0
data/lib/pdf/reader/token.rb +41 -0
data/lib/pdf/reader/xref.rb +101 -0
metadata +70 -0

data/CHANGELOG ADDED

@@ -0,0 +1,2 @@
+v0.5 (14th December 2007)
+- Initial Release

data/README ADDED

@@ -0,0 +1,177 @@
+The PDF::Reader library implements a PDF parser conforming as much as possible
+to the PDF specification from Adobe.
+It provides programmatic access to the contents of a PDF file with a high
+degree of flexibility.
+The PDF 1.7 specification is a weighty document and not all aspects are
+currently supported. We welcome submission of PDF files that exhibit
+unsupported aspects of the spec to assist with improving our support.
+= Installation
+The recommended installation method is via Rubygems.
+  gem install pdf-reader
+= Usage
+PDF::Reader is designed with a callback-style architecture. The basic concept
+is to build a receiver class and pass that into PDF::Reader along with the PDF
+to process.
+As PDF::Reader walks the file and encounters various objects (pages, text,
+images, shapes, etc.) it will call methods on the receiver class.  What those
+methods do is entirely up to you - save the text, extract images, count pages,
+read metadata, whatever.
+For a full list of the supported callback methods and a description of when they
+will be called, refer to PDF::Reader::Content.
+= Exceptions
+There are two key exceptions that you will need to watch out for when processing a
+PDF file:
+MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
+file should be valid, or that a corrupt file didn't raise an exception, please
+forward a copy of the file to the maintainers and we can attempt to improve the code.
+UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
+support. Again, we welcome submissions of PDF files that exhibit these features to help
+us with future code improvements.
+= Maintainers
+- Peter Jones <mailto:pjones@pmade.com>
+- James Healy <mailto:jimmy@deefa.com>
+= Examples
+The easiest way to explain how this works in practice is to show some examples.
+== Page Counter
+A simple app to count the number of pages in a PDF File.
+  require 'rubygems'
+  require 'pdf/reader'
+  class PageReceiver
+    attr_accessor :page_count
+    def initialize
+      @page_count = 0
+    end
+    # Called when page parsing ends
+    def end_page
+      @page_count += 1
+    end
+  end
+  receiver = PageReceiver.new
+  PDF::Reader.file("somefile.pdf", receiver)
+  puts "#{receiver.page_count} pages"
+== Basic RSpec of a generated PDF
+  require 'rubygems'
+  require 'pdf/reader'
+  require 'pdf/writer'
+  require 'spec'
+  class PageTextReceiver
+    attr_accessor :content
+    def initialize
+      @content = []
+    end
+    # Called when page parsing starts
+    def begin_page(arg = nil)
+      @content << ""
+    end
+    def show_text(string, *params)
+      @content.last << string.strip
+    end
+    # there are a few text callbacks, so make sure we process them all
+    alias :super_show_text :show_text
+    alias :move_to_next_line_and_show_text :show_text
+    alias :set_spacing_next_line_show_text :show_text
+  end
+  context "My generated PDF" do
+    specify "should have the correct text on 2 pages" do
+      # generate our PDF
+      pdf = PDF::Writer.new
+      pdf.text "Chunky", :font_size => 32, :justification => :center
+      pdf.start_new_page
+      pdf.text "Bacon", :font_size => 32, :justification => :center
+      pdf.save_as("chunkybacon.pdf")
+      # process the PDF
+      receiver = PageTextReceiver.new
+      PDF::Reader.file("chunkybacon.pdf", receiver)
+      # confirm the text appears on the correct pages
+      receiver.content.size.should eql(2)
+      receiver.content[0].should eql("Chunky")
+      receiver.content[1].should eql("Bacon")
+    end
+  end
+== Extract ISBNs
+Parse all text in the requested PDF file and print out any valid book ISBNs.
+Requires the rbook-isbn gem.
+  require 'rubygems'
+  require 'pdf/reader'
+  require 'rbook/isbn'
+  class ISBNReceiver
+    # there are a few text callbacks, so make sure we process them all
+    def show_text(string, *params)
+      process_words(string.split(/\W+/))
+    end
+    def super_show_text(string, *params)
+      process_words(string.split(/\W+/))
+    end
+    def move_to_next_line_and_show_text (string)
+      process_words(string.split(/\W+/))
+    end
+    def set_spacing_next_line_show_text (aw, ac, string)
+      process_words(string.split(/\W+/))
+    end
+    private
+    # check if any items in the supplied array are a valid ISBN, and print any
+    # that are to console
+    def process_words(words)
+      words.each do |word|
+        word.strip!
+        puts "#{RBook::ISBN.convert_to_isbn13(word)}" if RBook::ISBN.valid_isbn?(word)
+      end
+    end
+  end
+  receiver = ISBNReceiver.new
+  PDF::Reader.file("somefile.pdf", receiver)
+= Resources
+- PDF::Reader Homepage: http://software.pmade.com/pdfreader
+- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
+- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
+- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
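
The callback-style architecture described in the README's Usage section can be illustrated without the gem. The sketch below is not PDF::Reader's implementation: ToyWalker is a hypothetical stand-in for the document walker, and it assumes the walker simply skips callbacks the receiver does not define, which is what the README's receivers (each implementing only a few callbacks) suggest.

```ruby
# Hypothetical stand-in for a document walker; NOT PDF::Reader's real code.
class ToyWalker
  # events is an array of [callback_name, *args] entries
  def initialize(events)
    @events = events
  end

  # Replay every event against the receiver and return the receiver.
  def walk(receiver)
    @events.each do |name, *args|
      # Assumption: callbacks the receiver does not implement are skipped.
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
    receiver
  end
end

# Same shape as the README's PageReceiver: it only cares about end_page.
class PageCounter
  attr_reader :page_count

  def initialize
    @page_count = 0
  end

  def end_page
    @page_count += 1
  end
end

events = [
  [:begin_page],
  [:show_text, "Chunky"],
  [:end_page],
  [:begin_page],
  [:show_text, "Bacon"],
  [:end_page]
]

counter = ToyWalker.new(events).walk(PageCounter.new)
puts "#{counter.page_count} pages"
```

The count lives on the receiver, not on the object that walks the document, so results are read back from the receiver once the walk has finished.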

data/Rakefile ADDED

@@ -0,0 +1,84 @@
+require "rubygems"
+require 'rake'
+require 'rake/clean'
+require 'rake/rdoctask'
+require 'rake/testtask'
+require "rake/gempackagetask"
+require 'spec/rake/spectask'
+PKG_VERSION = "0.5"
+PKG_NAME = "pdf-reader"
+PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
+desc "Default Task"
+task :default => [ :spec ]
+# run all rspecs
+desc "Run all rspec files"
+Spec::Rake::SpecTask.new("spec") do |t|
+  t.spec_files = FileList['specs/**/*.rb']
+  t.rcov = true
+  t.rcov_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + "/rcov"
+ # t.rcov_opts = ["--exclude","spec.*\.rb"]
+end
+# generate specdocs
+desc "Generate Specdocs"
+Spec::Rake::SpecTask.new("specdocs") do |t|
+  t.spec_files = FileList['specs/**/*.rb']
+  t.spec_opts = ["--format", "rdoc"]
+  t.out = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/specdoc.rd'
+end
+# generate failing spec report
+desc "Generate failing spec report"
+Spec::Rake::SpecTask.new("spec_report") do |t|
+  t.spec_files = FileList['specs/**/*.rb']
+  t.spec_opts = ["--format", "html", "--diff"]
+  t.out = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/spec_report.html'
+  t.fail_on_error = false
+end
+# Generate the RDoc documentation
+desc "Create documentation"
+Rake::RDocTask.new("doc") do |rdoc|
+  rdoc.title = "pdf-reader"
+  rdoc.rdoc_dir = (ENV['CC_BUILD_ARTIFACTS'] || 'doc') + '/rdoc'
+  rdoc.rdoc_files.include('README')
+  rdoc.rdoc_files.include('TODO')
+  rdoc.rdoc_files.include('CHANGELOG')
+  #rdoc.rdoc_files.include('COPYING')
+  #rdoc.rdoc_files.include('LICENSE')
+  rdoc.rdoc_files.include('lib/**/*.rb')
+  rdoc.options << "--inline-source"
+end
+# a gemspec for packaging this library
+# RSpec files aren't included, as they depend on the PDF files,
+# which will make the gem filesize irritatingly large
+spec = Gem::Specification.new do |spec|
+	spec.name = PKG_NAME
+	spec.version = PKG_VERSION
+	spec.platform = Gem::Platform::RUBY
+	spec.summary = "A library for accessing the content of PDF files"
+	spec.files =  Dir.glob("{examples,lib}/**/**/*") +
+                      ["Rakefile"]
+  spec.require_path = "lib"
+	spec.has_rdoc = true
+	spec.extra_rdoc_files = %w{README TODO CHANGELOG}
+	spec.rdoc_options << '--title' << 'PDF::Reader Documentation' <<
+	                     '--main'  << 'README' << '-q'
+  spec.author = "Peter Jones"
+	spec.email = "pjones@pmade.com"
+	spec.rubyforge_project = "pdf-reader"
+	spec.homepage = "http://software.pmade.com/pdfreader"
+	spec.description = "The PDF::Reader library implements a PDF parser conforming as much as possible to the PDF specification from Adobe"
+end
+# package the library into a gem
+desc "Generate a gem for pdf-reader"
+Rake::GemPackageTask.new(spec) do |pkg|
+	pkg.need_zip = true
+	pkg.need_tar = true
+end

data/TODO ADDED

@@ -0,0 +1,9 @@
+Some ideas for future work
+- Allow the user to only process certain aspects of the PDF file. For example, if they're only
+  interested in metadata, there's no point in walking the pages tree.
+- Ship some extra receivers in the standard package, particularly ones that are useful for running
+  rspec over generated PDF files
+- Improve metadata support

data/lib/pdf/reader.rb ADDED

@@ -0,0 +1,106 @@
+################################################################################
+#
+# Copyright (C) 2006 Peter J Jones (pjones@pmade.com)
+#
+# Permission is hereby granted, free of charge, to any person obtaining
+# a copy of this software and associated documentation files (the
+# "Software"), to deal in the Software without restriction, including
+# without limitation the rights to use, copy, modify, merge, publish,
+# distribute, sublicense, and/or sell copies of the Software, and to
+# permit persons to whom the Software is furnished to do so, subject to
+# the following conditions:
+#
+# The above copyright notice and this permission notice shall be
+# included in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+#
+################################################################################
+require 'stringio'
+module PDF
+  ################################################################################
+  # The Reader class serves as an entry point for parsing a PDF file. There are three
+  # ways to kick off processing - which one you pick will be based on personal preference
+  # and the situation.
+  #
+  # For all examples, assume the receiver variable contains an object that will respond
+  # to various callbacks. Refer to the README and PDF::Reader::Content for more information
+  # on receivers.
+  #
+  # = Parsing a file
+  #
+  #   PDF::Reader.file("somefile.pdf", receiver)
+  #
+  # = Parsing a String
+  #
+  # This is useful for processing a PDF that is already in memory. Here pdf_data
+  # is a String holding the PDF content itself, not a filename.
+  #
+  #   PDF::Reader.string(pdf_data, receiver)
+  #
+  # = Parsing an IO object
+  #
+  # This can be a useful alternative to the first 2 options in some situations
+  #
+  #   pdf = PDF::Reader.new
+  #   pdf.parse(File.new("somefile.pdf"), receiver)
+  class Reader
+    ################################################################################
+    # Parse the file with the given name, sending events to the given receiver.
+    def self.file (name, receiver)
+      File.open(name) do |f|
+        new.parse(f, receiver)
+      end
+    end
+    ################################################################################
+    # Parse the given string, sending events to the given receiver.
+    def self.string (str, receiver)
+      StringIO.open(str) do |s|
+        new.parse(s, receiver)
+      end
+    end
+    ################################################################################
+  end
+  ################################################################################
+end
+################################################################################
+require 'pdf/reader/explore'
+require 'pdf/reader/buffer'
+require 'pdf/reader/content'
+require 'pdf/reader/error'
+require 'pdf/reader/filter'
+require 'pdf/reader/name'
+require 'pdf/reader/parser'
+require 'pdf/reader/reference'
+require 'pdf/reader/register_receiver'
+require 'pdf/reader/text_receiver'
+require 'pdf/reader/token'
+require 'pdf/reader/xref'
+class PDF::Reader
+  ################################################################################
+  # Initialize a new PDF::Reader
+  def initialize
+  end
+  ################################################################################
+  # Given an IO object that contains PDF data, parse it.
+  def parse (io, receiver)
+    @buffer   = Buffer.new(io)
+    @xref     = XRef.new(@buffer)
+    @parser   = Parser.new(@buffer, @xref)
+    @content  = (receiver == Explore ? Explore : Content).new(receiver, @xref)
+    trailer = @xref.load
+    @content.document(@xref.object(trailer['Root'])) || self
+  end
+  ################################################################################
+end
+################################################################################
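
Reader.file and Reader.string both end up in the same parse method because File and StringIO respond to the same IO calls. The sketch below shows that with the standard library only; header_line is a hypothetical helper that is not part of PDF::Reader, and the sample data is a made-up fragment rather than a valid PDF.

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper: rewind any IO-like object and return its first line.
def header_line(io)
  io.seek(0, IO::SEEK_SET)
  io.readline.chomp
end

# Made-up fragment for illustration only; not a complete PDF.
data = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n%%EOF\n"

# In-memory data wrapped in StringIO, as Reader.string does.
from_string = StringIO.open(data) { |s| header_line(s) }

# The same bytes read through a real file handle.
from_file = Tempfile.create("sketch") do |f|
  f.write(data)
  f.flush
  header_line(f)
end

puts from_string
puts from_string == from_file
```

Anything that supports seek, read, readline, eof? and pos in the same way should be usable wherever the buffer expects an IO object, though only File and StringIO are exercised here.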

data/lib/pdf/reader/buffer.rb ADDED

@@ -0,0 +1,144 @@
+################################################################################
+#
+# Copyright (C) 2006 Peter J Jones (pjones@pmade.com)
+#
+# Permission is hereby granted, free of charge, to any person obtaining
+# a copy of this software and associated documentation files (the
+# "Software"), to deal in the Software without restriction, including
+# without limitation the rights to use, copy, modify, merge, publish,
+# distribute, sublicense, and/or sell copies of the Software, and to
+# permit persons to whom the Software is furnished to do so, subject to
+# the following conditions:
+#
+# The above copyright notice and this permission notice shall be
+# included in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+#
+################################################################################
+class PDF::Reader
+  ################################################################################
+  # An internal PDF::Reader class that mediates access to the underlying PDF File or IO Stream
+  class Buffer
+    ################################################################################
+    # Creates a new buffer around the specified IO object
+    def initialize (io)
+      @io = io
+      @buffer = nil
+    end
+    ################################################################################
+    # Seek to the requested byte in the IO stream.
+    def seek (offset)
+      @io.seek(offset, IO::SEEK_SET)
+      @buffer = nil
+      self
+    end
+    ################################################################################
+    # reads the requested number of bytes from the underlying IO stream.
+    #
+    # length should be a positive integer.
+    def read (length)
+      out = ""
+      if @buffer and !@buffer.empty?
+        out << head(length)
+        length -= out.length
+      end
+      out << @io.read(length) if length > 0
+      out
+    end
+    ################################################################################
+    # returns true if the underlying IO object is at end and the internal buffer
+    # is empty
+    def eof?
+      if @buffer
+        @buffer.empty? && @io.eof?
+      else
+        @io.eof?
+      end
+    end
+    ################################################################################
+    def pos
+      @io.pos
+    end
+    ################################################################################
+    # PDF files are processed by tokenising the content into a series of objects and commands.
+    # This prepares the buffer for use by reading the next line of tokens into memory.
+    def ready_token (with_strip=true, skip_blanks=true)
+      while @buffer.nil? or @buffer.empty?
+        @buffer = @io.readline
+        @buffer.sub!(/%.*$/, '')
+        @buffer.chomp!
+        @buffer.lstrip! if with_strip
+        break unless skip_blanks
+      end
+    end
+    ################################################################################
+    # return the next token from the underlying IO stream
+    def token
+      ready_token
+      i = @buffer.index(/[\[\]()<>{}\s\/]/) || @buffer.size
+      token_chars =
+        if i == 0 and @buffer[i,2] == "<<"    then 2
+        elsif i == 0 and @buffer[i,2] == ">>" then 2
+        elsif i == 0                          then 1
+        else                                       i
+        end
+      strip_space = !(i == 0 and @buffer[0,1] == '(')
+      head(token_chars, strip_space)
+    end
+    ################################################################################
+    def head (chars, with_strip=true)
+      val = @buffer[0, chars]
+      @buffer = @buffer[chars .. -1] || ""
+      @buffer.lstrip! if with_strip
+      val
+    end
+    ################################################################################
+    # return the internal buffer used by this class when reading from the IO stream.
+    def raw
+      @buffer
+    end
+    ################################################################################
+    # The Xref table in a PDF file acts as an aid for finding the location of various
+    # objects in the file. This method attempts to locate the byte offset of the xref
+    # table in the underlying IO stream.
+    def find_first_xref_offset
+      @io.seek(-1024, IO::SEEK_END) rescue seek(0)
+      data = @io.read(1024)
+      # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
+      # To ensure we find the xref offset correctly, change all possible options to a
+      # standard format
+      data = data.gsub("\r\n","\n").gsub("\n\r","\n").gsub("\r","\n")
+      lines = data.split(/\n/).reverse
+      eof_index = nil
+      lines.each_with_index do |line, index|
+        if line =~ /^%%EOF\r?$/
+          eof_index = index
+          break
+        end
+      end
+      raise MalformedPDFError, "PDF does not contain EOF marker" if eof_index.nil?
+      raise MalformedPDFError, "PDF EOF marker does not follow offset" if eof_index >= lines.size-1
+      lines[eof_index+1].to_i
+    end
+    ################################################################################
+  end
+  ################################################################################
+end
+################################################################################
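
The tail scan in find_first_xref_offset can be tried in isolation. The following is a minimal standalone sketch of the same idea, not the library's method: the name startxref_offset is ours, it raises ArgumentError instead of MalformedPDFError so that it runs without the gem, and the sample input is a hand-written trailer rather than a real PDF.

```ruby
require 'stringio'

# Hypothetical standalone version of the tail scan; not part of PDF::Reader.
def startxref_offset(io)
  begin
    io.seek(-1024, IO::SEEK_END)   # look at the last 1024 bytes only
  rescue SystemCallError
    io.seek(0, IO::SEEK_SET)       # shorter than 1024 bytes: read it all
  end
  tail = io.read(1024).to_s

  # Normalise \r\n and bare \r to \n, then scan from the end backwards.
  lines = tail.gsub("\r\n", "\n").gsub("\r", "\n").split("\n").reverse
  eof_index = lines.index { |line| line =~ /\A%%EOF\z/ }

  raise ArgumentError, "no %%EOF marker found" if eof_index.nil?
  raise ArgumentError, "no offset before %%EOF" if eof_index >= lines.size - 1

  # In the reversed list, the line after %%EOF is the one written before it.
  lines[eof_index + 1].to_i
end

# Hand-written trailer with Windows line endings, for illustration only.
pdf_tail = "trailer\r\n<< /Size 1 >>\r\nstartxref\r\n1234\r\n%%EOF\r\n"
puts startxref_offset(StringIO.new(pdf_tail))
```

Reversing the lines means the first %%EOF found is the last one in the file, which matters for files that carry more than one marker.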