pdf-reader 0.11.0.alpha → 0.12.0.alpha (RubyGems)

Files changed (28)

data/CHANGELOG +17 -1
data/README.rdoc +31 -1
data/bin/pdf_list_callbacks +2 -0
data/examples/callbacks.rb +2 -1
data/examples/extract_bates.rb +3 -2
data/examples/extract_images.rb +146 -23
data/examples/hash.rb +5 -5
data/examples/metadata.rb +5 -16
data/examples/page_count.rb +13 -0
data/examples/rspec.rb +17 -41
data/examples/text.rb +4 -29
data/examples/version.rb +3 -15
data/lib/pdf/reader.rb +45 -27
data/lib/pdf/reader/encoding.rb +3 -3
data/lib/pdf/reader/error.rb +1 -0
data/lib/pdf/reader/filter.rb +64 -9
data/lib/pdf/reader/font.rb +0 -17
data/lib/pdf/reader/form_xobject.rb +83 -0
data/lib/pdf/reader/glyph_hash.rb +88 -0
data/lib/pdf/reader/glyphlist.txt +1 -1
data/lib/pdf/reader/object_hash.rb +42 -12
data/lib/pdf/reader/page.rb +63 -17
data/lib/pdf/reader/page_text_receiver.rb +38 -4
data/lib/pdf/reader/standard_security_handler.rb +186 -0
data/lib/pdf/reader/stream.rb +2 -2
metadata +39 -9
data/examples/page_counter_improved.rb +0 -23
data/examples/page_counter_naive.rb +0 -24

data/CHANGELOG CHANGED

@@ -1,4 +1,20 @@
-v0.9.4 (XXX)
+v0.12.0.alpha (28th August 2011)
+- small breaking changes to the page-based API - it's alpha for a reason
+  - resource related methods on Page object return raw PDF objects
+  - if the caller wants the resources wrapped in a more convenient
+    Ruby object (like PDF::Reader::Font or PDF::Reader::FormXObject), they
+    will need to do so themselves
+- add support for RunLengthDecode filters (thanks Bernerd Schaefer)
+- add support for standard PDF encryption (thanks Evan Brunner)
+- add support for decoding streams with TIFF prediction
+- new PDF::Reader::FormXObject class to simplify working with form XObjects
+v0.11.0.alpha (19th July 2011)
+- introduce experimental new page-based API
+  - old API is deprecated but will continue to work with no warnings
+- add transparent caching of common objects to ObjectHash
+v0.10.0 (6th July 2011)
 - support multiple receivers within a single pass over a source file
   - massive time saving when dealing with multiple receivers
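
The changelog entry for RunLengthDecode names the filter without showing the encoding. As a rough illustration, here is a minimal sketch of the decoding rule as described in the PDF specification: a length byte of 0..127 copies the next length + 1 bytes literally, 129..255 repeats the next byte 257 - length times, and 128 marks end of data. The method name `run_length_decode` is hypothetical and this is not the code added to data/lib/pdf/reader/filter.rb.

```ruby
# Sketch only: a standalone RunLengthDecode decoder following the rule in
# the PDF specification. Hypothetical helper, not pdf-reader's implementation.
def run_length_decode(data)
  out   = "".b          # binary output buffer
  bytes = data.bytes
  pos   = 0

  while pos < bytes.size
    length = bytes[pos]
    pos += 1

    case length
    when 0..127
      # literal run: copy the next (length + 1) bytes unchanged
      out << bytes[pos, length + 1].pack("C*")
      pos += length + 1
    when 128
      break             # end-of-data marker
    else
      # repeated run: emit the next byte (257 - length) times
      break if bytes[pos].nil?   # truncated input, stop quietly
      out << ([bytes[pos]] * (257 - length)).pack("C*")
      pos += 1
    end
  end

  out
end

# 2 => copy 3 literal bytes "abc"; 253 => repeat "x" 4 times; 128 => EOD
encoded = [2, 97, 98, 99, 253, 120, 128].pack("C*")
puts run_length_decode(encoded)   # prints "abcxxxx"
```

The sketch stops silently on truncated input; whether the library raises in that case is not visible in this part of the diff.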

data/README.rdoc CHANGED

@@ -8,6 +8,11 @@ The PDF 1.7 specification is a weighty document and not all aspects are
 currently supported. I welcome submission of PDF files that exhibit
 unsupported aspects of the spec to assist with improving our support.
+This is primarily a low-level library that should be used as the foundation for
+higher level functionality - it's not going to render a PDF for you. There are
+a few exceptions to support very common use cases like extracting text from a
+page.
 = Installation
 The recommended installation method is via Rubygems.
@@ -27,6 +32,15 @@ this object.
     puts reader.metadata
     puts reader.page_count
+PDF::Reader.new can accept an IO stream or a filename. Here's an example with
+an IO stream:
+    require 'open-uri'
+    io     = open('http://example.com/somefile.pdf')
+    reader = PDF::Reader.new(io)
+    puts reader.info
 PDF is a page based file format, so most visible information is available via
 page-based iteration
@@ -34,10 +48,24 @@ page-based iteration
     reader.pages.each do |page|
       puts page.fonts
-      puts page.images
       puts page.text
+      puts page.raw_content
     end
+If you need to access the full program for rendering a page, use the walk() method
+of PDF::Reader::Page.
+    class RedGreenBlue
+      def set_rgb_color_for_nonstroking(r, g, b)
+        puts "R: #{r}, G: #{g}, B: #{b}"
+      end
+    end
+    reader   = PDF::Reader.new("somefile.pdf")
+    page     = reader.page(1)
+    receiver = RedGreenBlue.new
+    page.walk(receiver)
 For low level access to the objects in a PDF file, use the ObjectHash class. You can
 build an ObjectHash instance directly:
@@ -48,6 +76,8 @@ or via a PDF::Reader instance:
     reader  = PDF::Reader.new("somefile.pdf")
     puts reader.objects
+The second method is preferred to increase the effectiveness of internal caching.
 = Text Encoding
 Internally, text can be stored inside a PDF in various encodings, including
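
The README change above describes walk() as handing the page's full rendering program to a receiver object. The following is a minimal sketch of that callback pattern, assuming the page simply calls each operator's method on any receiver that defines it. `FakePage` and its list of operations are hypothetical stand-ins so the example runs without a PDF file; they are not part of pdf-reader, and the real PDF::Reader::Page#walk may differ in detail.

```ruby
# Hypothetical stand-in for a page: holds a list of [callback_name, *args]
# pairs and replays them against receivers. Not pdf-reader's code.
class FakePage
  def initialize(operations)
    @operations = operations
  end

  # Call each operation on every receiver that implements it; receivers
  # only need to define the callbacks they care about.
  def walk(*receivers)
    @operations.each do |name, *args|
      receivers.each do |receiver|
        receiver.send(name, *args) if receiver.respond_to?(name)
      end
    end
  end
end

# Same shape as the README's RedGreenBlue receiver, but it records the
# colours instead of printing them.
class RedGreenBlue
  attr_reader :colors

  def initialize
    @colors = []
  end

  def set_rgb_color_for_nonstroking(r, g, b)
    @colors << [r, g, b]
  end
end

page = FakePage.new([
  [:begin_text_object],
  [:set_rgb_color_for_nonstroking, 1.0, 0.0, 0.0],
  [:show_text, "Chunky"],
  [:set_rgb_color_for_nonstroking, 0.0, 0.0, 1.0]
])

receiver = RedGreenBlue.new
page.walk(receiver)
p receiver.colors   # prints [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

Callbacks the receiver does not define (here begin_text_object and show_text) are skipped, which is why a receiver can stay as small as the one in the README.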

data/bin/pdf_list_callbacks CHANGED

@@ -1,5 +1,7 @@
 #!/usr/bin/env ruby
+# this executable is deprecated, use pdf_callbacks instead
 require 'rubygems'
 $LOAD_PATH.unshift(File.dirname(__FILE__) + "/../lib")

data/examples/callbacks.rb CHANGED

@@ -10,8 +10,9 @@ require 'rubygems'
 require 'pdf/reader'
 receiver = PDF::Reader::RegisterReceiver.new
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-basic.pdf"
-PDF::Reader.open("somefile.pdf") do |reader|
+PDF::Reader.open(filename) do |reader|
   reader.pages.each do |page|
     page.walk(receiver)
     receiver.callbacks.each do |cb|

data/examples/extract_bates.rb CHANGED

@@ -35,13 +35,14 @@ class BatesReceiver
 end
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-basic.pdf"
-PDF::Reader.open("bates.pdf") do |reader|
+PDF::Reader.open(filename) do |reader|
   reader.pages.each do |page|
     receiver = BatesReceiver.new
     page.walk(receiver)
     if receiver.numbers.empty?
-      puts page.scan(/CC.+/)
+      puts page.text.scan(/CC.+/)
     else
       puts receiver.numbers.inspect
     end

data/examples/extract_images.rb CHANGED

@@ -1,46 +1,164 @@
 # coding: utf-8
 # This demonstrates a way to extract some images (those based on the JPG or
-# TIFF formats) from a PDF. There are other ways to store images, so
+# TIFF formats) from a PDF. There are other ways to store images, so
 # it may need to be expanded for real world usage, but it should serve
 # as a good guide.
 #
 # Thanks to Jack Rusher for the initial version of this example.
-#
-# USAGE:
-#
-#   ruby extract_images.rb somefile.pdf
 require 'pdf/reader'
 module ExtractImages
-  class Receiver
-    attr_reader :count
+  class Extractor
+    def page(page)
+      count = 0
+      process_resources(page, page.resources, count)
+    end
+    private
-    def initialize
-      @count = 0
+    def complete_refs
+      @complete_refs ||= {}
     end
-    def resource_xobject(name, stream)
-      return unless stream.hash[:Subtype] == :Image
-      increment_count
+    def process_resources(page, resources, count)
+      xobjects = resources[:XObject]
+      return count if xobjects.nil?
+      xobjects.each do |name, stream|
+        next if complete_refs[stream]
+        complete_refs[stream] = true
+        stream = page.objects.deref(stream)
+        if stream.hash[:Subtype] == :Image
+          count += 1
+          case stream.hash[:Filter]
+          when :CCITTFaxDecode then
+            ExtractImages::Tiff.new(stream).save("#{page.number}-#{count}-#{name}.tif")
+          when :DCTDecode      then
+            ExtractImages::Jpg.new(stream).save("#{page.number}-#{count}-#{name}.jpg")
+          else
+            ExtractImages::Raw.new(stream).save("#{page.number}-#{count}-#{name}.tif")
+          end
+        elsif stream.hash[:Subtype] == :Form
+          count = process_resources(page, PDF::Reader::FormXObject.new(page, stream).resources, count)
+        end
+      end
+      count
+    end
+  end
+  class Raw
+    attr_reader :stream
-      case stream.hash[:Filter]
-      when :CCITTFaxDecode
-        ExtractImages::Tiff.new(stream).save("#{count}-#{name}.tif")
-      when :DCTDecode
-        ExtractImages::Jpg.new(stream).save("#{count}-#{name}.jpg")
+    def initialize(stream)
+      @stream = stream
+    end
+    def save(filename)
+      case @stream.hash[:ColorSpace]
+      when :DeviceCMYK then save_cmyk(filename)
+      when :DeviceGray then save_gray(filename)
+      when :DeviceRGB  then save_rgb(filename)
       else
-        $stderr.puts "unrecognized image filter '#{stream.hash[:Filter]}'!"
+        $stderr.puts "unsupported color space #{@stream.hash[:ColorSpace]} #{filename}"
       end
     end
-    def increment_count
-      @count += 1
+    private
+    def save_cmyk(filename)
+      h    = stream.hash[:Height]
+      w    = stream.hash[:Width]
+      bpc  = stream.hash[:BitsPerComponent]
+      len  = stream.hash[:Length]
+      puts "#{filename}: h=#{h}, w=#{w}, bpc=#{bpc}, len=#{len}"
+      # Synthesize a TIFF header
+      long_tag  = lambda {|tag, count, value| [ tag, 4, count, value ].pack( "ssII" ) }
+      short_tag = lambda {|tag, count, value| [ tag, 3, count, value ].pack( "ssII" ) }
+      # header = byte order, version magic, offset of directory, directory count,
+      # followed by a series of tags containing metadata.
+      tag_count = 10
+      header = [ 73, 73, 42, 8, tag_count ].pack("ccsIs")
+      tiff = header.dup
+      tiff << short_tag.call( 256, 1, w ) # image width
+      tiff << short_tag.call( 257, 1, h ) # image height
+      tiff << long_tag.call( 258, 4, (header.size + (tag_count*12))) # bits per pixel
+      tiff << short_tag.call( 259, 1, 1 ) # compression
+      tiff << short_tag.call( 262, 1, 5 ) # colorspace - separation
+      tiff << long_tag.call( 273, 1, (10 + (tag_count*12) + 16) ) # data offset
+      tiff << short_tag.call( 277, 1, 4 ) # samples per pixel
+      tiff << long_tag.call( 279, 1, stream.unfiltered_data.size) # data byte size
+      tiff << short_tag.call( 284, 1, 1 ) # planar config
+      tiff << long_tag.call( 332, 1, 1)   # inkset - CMYK
+      tiff << [bpc, bpc, bpc, bpc].pack("IIII")
+      tiff << stream.unfiltered_data
+      File.open(filename, "wb") { |file| file.write tiff }
+    end
+    def save_gray(filename)
+      h    = stream.hash[:Height]
+      w    = stream.hash[:Width]
+      bpc  = stream.hash[:BitsPerComponent]
+      len  = stream.hash[:Length]
+      puts "#{filename}: h=#{h}, w=#{w}, bpc=#{bpc}, len=#{len}"
+      # Synthesize a TIFF header
+      long_tag  = lambda {|tag, count, value| [ tag, 4, count, value ].pack( "ssII" ) }
+      short_tag = lambda {|tag, count, value| [ tag, 3, count, value ].pack( "ssII" ) }
+      # header = byte order, version magic, offset of directory, directory count,
+      # followed by a series of tags containing metadata.
+      tag_count = 9
+      header = [ 73, 73, 42, 8, tag_count ].pack("ccsIs")
+      tiff = header.dup
+      tiff << short_tag.call( 256, 1, w ) # image width
+      tiff << short_tag.call( 257, 1, h ) # image height
+      tiff << short_tag.call( 258, 1, 8 ) # bits per pixel
+      tiff << short_tag.call( 259, 1, 1 ) # compression
+      tiff << short_tag.call( 262, 1, 1 ) # colorspace - grayscale
+      tiff << long_tag.call( 273, 1, (10 + (tag_count*12)) ) # data offset
+      tiff << short_tag.call( 277, 1, 1 ) # samples per pixel
+      tiff << long_tag.call( 279, 1, stream.unfiltered_data.size) # data byte size
+      tiff << short_tag.call( 284, 1, 1 ) # planar config
+      tiff << stream.unfiltered_data
+      File.open(filename, "wb") { |file| file.write tiff }
     end
-    private :increment_count
+    def save_rgb(filename)
+      h    = stream.hash[:Height]
+      w    = stream.hash[:Width]
+      bpc  = stream.hash[:BitsPerComponent]
+      len  = stream.hash[:Length]
+      puts "#{filename}: h=#{h}, w=#{w}, bpc=#{bpc}, len=#{len}"
+      # Synthesize a TIFF header
+      long_tag  = lambda {|tag, count, value| [ tag, 4, count, value ].pack( "ssII" ) }
+      short_tag = lambda {|tag, count, value| [ tag, 3, count, value ].pack( "ssII" ) }
+      # header = byte order, version magic, offset of directory, directory count,
+      # followed by a series of tags containing metadata.
+      tag_count = 8
+      header = [ 73, 73, 42, 8, tag_count ].pack("ccsIs")
+      tiff = header.dup
+      tiff << short_tag.call( 256, 1, w ) # image width
+      tiff << short_tag.call( 257, 1, h ) # image height
+      tiff << long_tag.call( 258, 3, (header.size + (tag_count*12))) # bits per pixel
+      tiff << short_tag.call( 259, 1, 1 ) # compression
+      tiff << short_tag.call( 262, 1, 2 ) # colorspace - RGB
+      tiff << long_tag.call( 273, 1, (header.size + (tag_count*12) + 12) ) # data offset
+      tiff << short_tag.call( 277, 1, 3 ) # samples per pixel
+      tiff << long_tag.call( 279, 1, stream.unfiltered_data.size) # data byte size
+      tiff << [bpc, bpc, bpc].pack("III")
+      tiff << stream.unfiltered_data
+      File.open(filename, "wb") { |file| file.write tiff }
+    end
   end
   class Jpg
@@ -104,5 +222,10 @@ module ExtractImages
   end
 end
-receiver = ExtractImages::Receiver.new
-PDF::Reader.file(ARGV[0], receiver)
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/adobe_sample.pdf"
+extractor = ExtractImages::Extractor.new
+PDF::Reader.open(filename) do |reader|
+  page = reader.page(1)
+  extractor.page(page)
+end
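
The Raw class above wraps unfiltered image samples in a synthesized TIFF header built with Array#pack. The "s" and "I" directives it uses are native-endian, so the output depends on the host. Below is a minimal sketch of the same idea for the grayscale case using explicit little-endian directives ("v" for 16-bit, "V" for 32-bit). The method name `gray_tiff` is hypothetical, the sketch assumes 8 bits per sample in a single strip, and unlike the example in the diff it writes the 4-byte next-IFD offset that the TIFF 6.0 layout places after the directory entries.

```ruby
# Sketch only: wrap raw 8-bit grayscale samples in a minimal little-endian
# TIFF container. Hypothetical helper, not part of pdf-reader.
def gray_tiff(width, height, samples)
  # Each directory entry is 12 bytes:
  # tag (uint16), field type (uint16), value count (uint32), value (uint32)
  short = lambda { |tag, value| [tag, 3, 1, value].pack("vvVV") } # type 3 = SHORT
  long  = lambda { |tag, value| [tag, 4, 1, value].pack("vvVV") } # type 4 = LONG

  tag_count = 8
  # 8-byte file header + 2-byte entry count + entries + 4-byte next-IFD offset
  data_offset = 8 + 2 + (tag_count * 12) + 4

  tiff = "II".b                        # byte order marker: little-endian
  tiff << [42, 8].pack("vV")           # magic number, offset of first directory
  tiff << [tag_count].pack("v")        # number of directory entries
  tiff << short.call(256, width)       # ImageWidth
  tiff << short.call(257, height)      # ImageLength
  tiff << short.call(258, 8)           # BitsPerSample
  tiff << short.call(259, 1)           # Compression: none
  tiff << short.call(262, 1)           # PhotometricInterpretation: black is zero
  tiff << long.call(273, data_offset)  # StripOffsets
  tiff << short.call(277, 1)           # SamplesPerPixel
  tiff << long.call(279, samples.bytesize) # StripByteCounts
  tiff << [0].pack("V")                # no further directories
  tiff << samples.b                    # pixel data
  tiff
end

pixels = [0, 64, 128, 255].pack("C*")  # a 2x2 image
tiff   = gray_tiff(2, 2, pixels)
puts tiff.bytesize                     # prints 114
```

Writing the result with File.open(name, "wb"), as the example in the diff does, should give a file that TIFF readers accept; that has not been checked against any particular viewer here.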

data/examples/hash.rb CHANGED

@@ -2,11 +2,11 @@
 # coding: utf-8
 # get direct access to PDF objects
-#
-$LOAD_PATH.unshift(File.dirname(__FILE__) + '/../lib')
 require 'pdf/reader'
-filename = File.dirname(__FILE__) + "/../specs/data/cairo-unicode.pdf"
-hash = PDF::Reader::ObjectHash.new(filename)
-puts hash[3]
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"
+reader = PDF::Reader.new(filename)
+puts reader.objects[3]
+puts reader.objects[4]

data/examples/metadata.rb CHANGED

@@ -1,25 +1,14 @@
 #!/usr/bin/env ruby
 # coding: utf-8
 # Extract metadata only
 require 'rubygems'
 require 'pdf/reader'
-class MetaDataReceiver
-  attr_accessor :regular
-  attr_accessor :xml
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cross_ref_stream.pdf"
-  def metadata(data)
-    @regular = data
-  end
-  def metadata_xml(data)
-    @xml = data
-  end
+PDF::Reader.open(filename) do |reader|
+  puts reader.info.inspect
+  puts reader.metadata.inspect
 end
-receiver = MetaDataReceiver.new
-pdf = PDF::Reader.file(ARGV.shift, receiver, :pages => false, :metadata => true)
-puts receiver.regular.inspect
-puts receiver.xml.inspect

data/examples/page_count.rb ADDED

@@ -0,0 +1,13 @@
+#!/usr/bin/env ruby
+# coding: utf-8
+# A simple app to count the number of pages in a PDF File.
+require 'rubygems'
+require 'pdf/reader'
+filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cross_ref_stream.pdf"
+PDF::Reader.open(filename) do |reader|
+  puts "#{reader.page_count} page(s)"
+end

data/examples/rspec.rb CHANGED

@@ -2,56 +2,32 @@
 # coding: utf-8
 #  Basic RSpec of a generated PDF
+#
+#  USAGE: rspec -c examples/rspec.rb
 require 'rubygems'
 require 'pdf/reader'
-require 'pdf/writer'
-require 'spec'
+require 'rspec'
+require 'prawn'
+require 'stringio'
-class PageTextReceiver
-  attr_accessor :content
-  def initialize
-    @content = []
-  end
-  # Called when page parsing starts
-  def begin_page(arg = nil)
-    @content << ""
-  end
-  def show_text(string, *params)
-    @content.last << string.strip
-  end
-  # there's a few text callbacks, so make sure we process them all
-  alias :super_show_text :show_text
-  alias :move_to_next_line_and_show_text :show_text
-  alias :set_spacing_next_line_show_text :show_text
-  def show_text_with_positioning(*params)
-    params = params.first
-    params.each { |str| show_text(str) if str.kind_of?(String)}
-  end
-end
-context "My generated PDF" do
-  specify "should have the correct text on 2 pages" do
+describe "My generated PDF" do
+  it "should have the correct text on 2 pages" do
     # generate our PDF
-    pdf = PDF::Writer.new
-    pdf.text "Chunky", :font_size => 32, :justification => :center
+    pdf = Prawn::Document.new
+    pdf.text "Chunky"
     pdf.start_new_page
-    pdf.text "Bacon", :font_size => 32, :justification => :center
-    pdf.save_as("chunkybacon.pdf")
+    pdf.text "Bacon"
+    io = StringIO.new(pdf.render)
     # process the PDF
-    receiver = PageTextReceiver.new
-    PDF::Reader.file("chunkybacon.pdf", receiver)
+    PDF::Reader.open(io) do |reader|
+      reader.page_count.should eql(2)          # correct page count
+      reader.page(1).text.should eql("Chunky") # correct content
+      reader.page(2).text.should eql("Bacon")  # correct content
+    end
-    # confirm the text appears on the correct pages
-    receiver.content.size.should eql(2)
-    receiver.content[0].should eql("Chunky")
-    receiver.content[1].should eql("Bacon")
   end
 end