RubyGems - pdf-reader - Versions diffs - 1.4.1 → 2.0.0.beta1 - Mend

pdf-reader 1.4.1 → 2.0.0.beta1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (25)

checksums.yaml +4 -4
data/CHANGELOG +8 -3
data/{README.rdoc → README.md} +40 -23
data/Rakefile +2 -2
data/bin/pdf_object +4 -1
data/lib/pdf/reader.rb +7 -112
data/lib/pdf/reader/buffer.rb +2 -1
data/lib/pdf/reader/cmap.rb +26 -24
data/lib/pdf/reader/encoding.rb +4 -5
data/lib/pdf/reader/filter.rb +1 -0
data/lib/pdf/reader/filter/run_length.rb +1 -5
data/lib/pdf/reader/font.rb +1 -11
data/lib/pdf/reader/glyph_hash.rb +6 -2
data/lib/pdf/reader/lzw.rb +1 -1
data/lib/pdf/reader/object_hash.rb +35 -16
data/lib/pdf/reader/page_layout.rb +6 -17
data/lib/pdf/reader/pages_strategy.rb +1 -304
data/lib/pdf/reader/parser.rb +6 -4
data/lib/pdf/reader/standard_security_handler.rb +18 -14
data/lib/pdf/reader/text_run.rb +3 -9
metadata +14 -47
data/bin/pdf_list_callbacks +0 -17
data/lib/pdf/reader/abstract_strategy.rb +0 -81
data/lib/pdf/reader/metadata_strategy.rb +0 -56
data/lib/pdf/reader/text_receiver.rb +0 -265

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: fb8a5be7c95212f559bb4d26af5fbdb484d21e77
-  data.tar.gz: f8fe70bf868dfff03b47a0b81993d1e680593e84
+  metadata.gz: f4ea96ce79d9f4cc65a0a026ea7e50da7b33cd19
+  data.tar.gz: e2302c2d18cdc64cd81654f30658b5cf1e8ae3c3
 SHA512:
-  metadata.gz: b881cecddfa41e3ad15dcafd31d4109290c664d0cf06478f3af6769aa7ced108e3ba082db54c6759c117d7559cc118e0d3a971c17b59cb23bf4e50024089fa6b
-  data.tar.gz: 50d61b135d79840dce5e5ca712b5db5185deefeee5de13d2adc63c1a8e1eb4b383bb0e8bb491c03bea49d11c4edf130b0fdb3b2eafea63ee0b85ca0390e047a0
+  metadata.gz: 31da13f8b8e38dffbb19a33855beb1c85a14c2633137fd8e1957db14f3bac16a434c174e79bab6092c72954e6a6f87f0ab585562e79d39d5ede8f9398fd63f7b
+  data.tar.gz: e097e8aed8bbeb918676bacded748a538e109fdcf1e8740cfe1ce3d63105d891cbe2a90ad228bb87fb494cd796bd7e1c66138592ff21d62c97b87fa270eb0899
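The digests above cover the two archives packed inside the `.gem` file (`metadata.gz` and `data.tar.gz`). As a minimal sketch, assuming one of those archives has already been extracted to disk, its digest can be recomputed with Ruby's standard `digest` library and compared against the value listed in `checksums.yaml`. The helper name `digest_matches?` is hypothetical and is not part of RubyGems or pdf-reader.

```ruby
require "digest"

# Hypothetical helper: recompute a file's SHA512 and compare it with an
# expected hex digest, such as one of the values listed in checksums.yaml.
def digest_matches?(path, expected_hex)
  Digest::SHA512.file(path).hexdigest == expected_hex.strip.downcase
end
```

A call such as `digest_matches?("data.tar.gz", expected)` returns `true` only when the file on disk is byte-for-byte the archive that was published.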

data/CHANGELOG CHANGED

@@ -1,5 +1,10 @@
+v2.0.0.beta1 (15th February 2017)
+- BREAKING CHANGE: remove all methods that were deprecated in 1.0.0
+- Bug: Support extra encrypted PDF variants (thanks to Gyuchang Jun)
+- various bug fixes
 v1.4.1 (2nd January 2017)
-- improve compatability with ruby 2.4 (thanks Akira Matsuda)
+- improve compatibility with ruby 2.4 (thanks Akira Matsuda)
 - various bug fixes
 v1.4.0 (22nd February 2016)
@@ -91,10 +96,10 @@ v0.9.2 (24th April 2011)
 v0.9.1 (21st December 2010)
 - force gem to only install on ruby 1.8.7 or higher
-  - maintaining supprot for earlier versions takes more time than I have
+  - maintaining support for earlier versions takes more time than I have
     available at the moment
 - bug: fix parsing of obscure pdf name format
-- bug: fix behaviour when loaded in confunction with htmldoc gem
+- bug: fix behaviour when loaded in conjunction with htmldoc gem
 v0.9.0 (19th November 2010)
 - support for pdf 1.5+ files that use object and xref streams

data/{README.rdoc → README.md} RENAMED

@@ -1,4 +1,4 @@
-= Release Notes
+# Release Notes
 The PDF::Reader library implements a PDF parser conforming as much as possible
 to the PDF specification from Adobe.
@@ -15,46 +15,55 @@ higher level functionality - it's not going to render a PDF for you. There are
 a few exceptions to support very common use cases like extracting text from a
 page.
-= Installation
+# Installation
 The recommended installation method is via Rubygems.
+```ruby
   gem install pdf-reader
+```
-= Usage
+# Usage
 Begin by creating a PDF::Reader instance that points to a PDF file. Document
 level information (metadata, page count, bookmarks, etc) is available via
 this object.
+```ruby
     reader = PDF::Reader.new("somefile.pdf")
     puts reader.pdf_version
     puts reader.info
     puts reader.metadata
     puts reader.page_count
+ ```
 PDF::Reader.new accepts an IO stream or a filename. Here's an example with
 an IO stream:
+```ruby
     require 'open-uri'
     io     = open('http://example.com/somefile.pdf')
     reader = PDF::Reader.new(io)
     puts reader.info
+ ```
 If you open a PDF with File#open or IO#open, I strongly recommend using "rb"
 mode to ensure the file isn't mangled by ruby being 'helpful'. This is
 particularly important on windows and MRI >= 1.9.2.
+```ruby
     File.open("somefile.pdf", "rb") do |io|
       reader = PDF::Reader.new(io)
       puts reader.info
     end
+ ```
 PDF is a page based file format, so most visible information is available via
 page-based iteration
+```ruby
     reader = PDF::Reader.new("somefile.pdf")
     reader.pages.each do |page|
@@ -62,10 +71,12 @@ page-based iteration
       puts page.text
       puts page.raw_content
     end
+```
 If you need to access the full program for rendering a page, use the walk() method
 of PDF::Reader::Page.
+```ruby
     class RedGreenBlue
       def set_rgb_color_for_nonstroking(r, g, b)
         puts "R: #{r}, G: #{g}, B: #{b}"
@@ -76,31 +87,32 @@ of PDF::Reader::Page.
     page     = reader.page(1)
     receiver = RedGreenBlue.new
     page.walk(receiver)
+```
 For low level access to the objects in a PDF file, use the ObjectHash class like
 so:
+```ruby
     reader  = PDF::Reader.new("somefile.pdf")
     puts reader.objects.inspect
+```
-= Text Encoding
+# Text Encoding
 Regardless of the internal encoding used in the PDF all text will be converted
 to UTF-8 before it is passed back from PDF::Reader.
-Strings that contain binary data (like font blobs) will be marked as such on
-M17N aware VMs.
+Strings that contain binary data (like font blobs) will be marked as such.
-= Former API
+# Former API
 Version 1.0.0 of PDF::Reader introduced a new page-based API that provides
 efficient and easy access to any page.
-The previous API is marked as deprecated but will continue to work for the
-time being. Eventually calls to the old API will begin triggering deprecation
-warnings before it is completely removed in version 2.0.0.
+The pre-1.0 API was deprecated during the 1.x release series, and has been
+removed from 2.0.0.
-= Exceptions
+# Exceptions
 There are two key exceptions that you will need to watch out for when processing a
 PDF file:
@@ -120,7 +132,7 @@ don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 Any other exceptions should be considered bugs in either PDF::Reader (please
 report it!).
-= PDF Integrity
+# PDF Integrity
 Windows developers may run into problems when running specs due to MalformedPDFError's
 This is usually because CRLF characters are automatically added to some of the PDF's in
@@ -128,18 +140,20 @@ the spec folder when you checkout a branch from Git.
 To remove any invalid CRLF characters added while checking out a branch from Git, run:
+```ruby
     rake fix_integrity
+```
-= Maintainers
+# Maintainers
-- James Healy <mailto:jimmy@deefa.com>
+* James Healy <mailto:jimmy@deefa.com>
-= Licensing
+# Licensing
 This library is distributed under the terms of the MIT License. See the included file for
 more detail.
-= Mailing List
+# Mailing List
 Any questions or feedback should be sent to the PDF::Reader google group. It's
 better that any answers be available for others instead of hiding in someone's
@@ -147,20 +161,23 @@ inbox.
 http://groups.google.com/group/pdf-reader
-= Examples
+# Examples
 The easiest way to explain how this works in practice is to show some examples.
 Check out the examples/ directory for a few files.
-= Known Limitations
+# Known Limitations
 Occasionally some text cannot be extracted properly due to the way it has been
 stored, or the use of invalid bytes. In these cases PDF::Reader will output a
 little UTF-8 friendly box to indicate an unrecognisable character.
-= Resources
+# Resources
-- PDF::Reader Code Repository: http://github.com/yob/pdf-reader
-- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
-- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
-- Developing with PDF (book): http://shop.oreilly.com/product/0636920025269.do
+* PDF::Reader Code Repository: http://github.com/yob/pdf-reader
+* PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
+* PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
+* Developing with PDF (book): http://shop.oreilly.com/product/0636920025269.do

data/Rakefile CHANGED

@@ -14,7 +14,7 @@ desc "Run cane to check quality metrics"
 Cane::RakeTask.new(:quality) do |cane|
   cane.abc_max = 20
   cane.style_measure = 100
-  cane.max_violations = 93
+  cane.max_violations = 31
   cane.use Morecane::EncodingCheck, :encoding_glob => "{app,lib,spec}/**/*.rb"
 end
@@ -41,7 +41,7 @@ end
 desc "Create a YAML file of integrity info for PDFs in the spec suite"
 task :integrity_yaml do
   data = {}
-  Dir.glob("spec/data/**/*.*").each do |path|
+  Dir.glob("spec/data/**/*.*").sort.each do |path|
     path_without_spec = path.gsub("spec/","")
     data[path_without_spec] = {
       :bytes => File.size(path),
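The added `.sort` matters because, on Ruby versions before 3.0, `Dir.glob` returns paths in whatever order the filesystem provides, so the generated YAML could differ between machines. The sketch below imitates the task's loop under simplified assumptions: the helper name `sizes_by_path` is hypothetical, and it records only file sizes rather than the full integrity data.

```ruby
# Hypothetical helper mirroring the task's loop: map each relative path to its
# size, visiting paths in sorted order so the result is stable across runs.
def sizes_by_path(root)
  data = {}
  Dir.glob(File.join(root, "**", "*.*")).sort.each do |path|
    next unless File.file?(path)
    data[path.sub("#{root}/", "")] = File.size(path)
  end
  data
end
```

Because Ruby hashes keep insertion order, sorting the paths first is enough to make the serialized output deterministic.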

data/bin/pdf_object CHANGED

@@ -25,7 +25,10 @@ gen = gen.to_i
 # make magic happen
 begin
-  obj = PDF::Reader.object_file(filename, id, gen)
+  obj = nil
+  PDF::Reader.open(filename) do |pdf|
+    obj = pdf.objects[PDF::Reader::Reference.new(id, gen)]
+  end
   case obj
   when Hash, Array
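Callers of the removed `PDF::Reader.object_file` can follow the same pattern in their own code. The sketch below wraps the replacement shown in this hunk in a method; the name `fetch_pdf_object` is hypothetical, and the `require` sits inside the method only so that the file loads where the gem is not installed.

```ruby
# Hypothetical helper wrapping the 2.0-style replacement for the removed
# PDF::Reader.object_file(filename, id, gen) call.
def fetch_pdf_object(filename, id, gen = 0)
  require "pdf/reader" # loaded lazily; assumes the pdf-reader gem is installed
  obj = nil
  PDF::Reader.open(filename) do |pdf|
    # Look the object up by reference, as bin/pdf_object now does.
    obj = pdf.objects[PDF::Reader::Reference.new(id.to_i, gen.to_i)]
  end
  obj
end
```

For example, `fetch_pdf_object("somefile.pdf", 1)` would return the unmarshalled Ruby value for object 1, generation 0, of that file.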

data/lib/pdf/reader.rb CHANGED

@@ -110,16 +110,10 @@ module PDF
     #
     #   reader = PDF::Reader.new("somefile.pdf", :password => "apples")
     #
-    def initialize(input = nil, opts = {})
-      if input # support the deprecated Reader API
-        @cache   = PDF::Reader::ObjectCache.new
-        opts.merge!(:cache => @cache)
-        @objects = PDF::Reader::ObjectHash.new(input, opts)
-      else
-        msg  = "Calling PDF::Reader#new with no arguments is deprecated and will be removed "
-        msg += "in the 2.0 release"
-        $stderr.puts(msg)
-      end
+    def initialize(input, opts = {})
+      @cache   = PDF::Reader::ObjectCache.new
+      opts.merge!(:cache => @cache)
+      @objects = PDF::Reader::ObjectHash.new(input, opts)
     end
     def info
@@ -133,7 +127,7 @@ module PDF
         nil
       else
         xml = stream.unfiltered_data
-        xml.force_encoding("utf-8") if xml.respond_to?(:force_encoding)
+        xml.force_encoding("utf-8")
         xml
       end
     end
@@ -164,61 +158,6 @@ module PDF
       yield PDF::Reader.new(input, opts)
     end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    #
-    # Parse the file with the given name, sending events to the given receiver.
-    #
-    def self.file(name, receivers, opts = {})
-      msg  = "PDF::Reader#file is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      File.open(name,"rb") do |f|
-        new.parse(f, receivers, opts)
-      end
-    end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    # Parse the given string, sending events to the given receiver.
-    #
-    def self.string(str, receivers, opts = {})
-      msg  = "PDF::Reader#string is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      StringIO.open(str) do |s|
-        new.parse(s, receivers, opts)
-      end
-    end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    # Parse the file with the given name, returning an unmarshalled ruby version of
-    # represents the requested pdf object
-    #
-    def self.object_file(name, id, gen = 0)
-      msg  = "PDF::Reader#object_file is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      File.open(name,"rb") { |f|
-        new.object(f, id.to_i, gen.to_i)
-      }
-    end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    # Parse the given string, returning an unmarshalled ruby version of represents
-    # the requested pdf object
-    #
-    def self.object_string(str, id, gen = 0)
-      msg  = "PDF::Reader#object_string is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      StringIO.open(str) { |s|
-        new.object(s, id.to_i, gen.to_i)
-      }
-    end
     # returns an array of PDF::Reader::Page objects, one for each
     # page in the source PDF.
     #
@@ -259,40 +198,6 @@ module PDF
       PDF::Reader::Page.new(@objects, num, :cache => @cache)
     end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    # Given an IO object that contains PDF data, parse it.
-    #
-    def parse(io, receivers, opts = {})
-      msg  = "PDF::Reader#parse is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      ohash    = ObjectHash.new(io)
-      options = {:pages => true, :raw_text => false, :metadata => true}
-      options.merge!(opts)
-      strategies.each do |s|
-        s.new(ohash, receivers, options).process
-      end
-      self
-    end
-    # DEPRECATED: this method was deprecated in version 1.0.0 and will
-    #             eventually be removed
-    #
-    # Given an IO object that contains PDF data, return the contents of a single object
-    #
-    def object(io, id, gen)
-      msg  = "PDF::Reader#object is deprecated and will be removed in the 2.0 release"
-      $stderr.puts(msg)
-      @objects = ObjectHash.new(io)
-      @objects.deref(Reference.new(id, gen))
-    end
     private
     # recursively convert strings from outside a content stream into UTF-8
@@ -321,7 +226,7 @@ module PDF
     # TODO find a PDF I can use to spec this behaviour
     #
     def pdfdoc_to_utf8(obj)
-      obj.force_encoding("utf-8") if obj.respond_to?(:force_encoding)
+      obj.force_encoding("utf-8")
       obj
     end
@@ -331,17 +236,10 @@ module PDF
     def utf16_to_utf8(obj)
       str = obj[2, obj.size]
       str = str.unpack("n*").pack("U*")
-      str.force_encoding("utf-8") if str.respond_to?(:force_encoding)
+      str.force_encoding("utf-8")
       str
     end
-    def strategies
-      @strategies ||= [
-        ::PDF::Reader::MetadataStrategy,
-        ::PDF::Reader::PagesStrategy
-      ]
-    end
     def root
       @root ||= @objects.deref(@objects.trailer[:Root])
     end
@@ -351,7 +249,6 @@ end
 ################################################################################
 require 'pdf/reader/resource_methods'
-require 'pdf/reader/abstract_strategy'
 require 'pdf/reader/buffer'
 require 'pdf/reader/cid_widths'
 require 'pdf/reader/cmap'
@@ -370,7 +267,6 @@ require 'pdf/reader/font_descriptor'
 require 'pdf/reader/form_xobject'
 require 'pdf/reader/glyph_hash'
 require 'pdf/reader/lzw'
-require 'pdf/reader/metadata_strategy'
 require 'pdf/reader/object_cache'
 require 'pdf/reader/object_hash'
 require 'pdf/reader/object_stream'
@@ -381,7 +277,6 @@ require 'pdf/reader/reference'
 require 'pdf/reader/register_receiver'
 require 'pdf/reader/standard_security_handler'
 require 'pdf/reader/stream'
-require 'pdf/reader/text_receiver'
 require 'pdf/reader/text_run'
 require 'pdf/reader/page_state'
 require 'pdf/reader/page_text_receiver'

data/lib/pdf/reader/buffer.rb CHANGED

@@ -37,6 +37,7 @@ class PDF::Reader
   #
   class Buffer
     TOKEN_WHITESPACE=[0x00, 0x09, 0x0A, 0x0C, 0x0D, 0x20]
+    TOKEN_DELIMITER=[0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]
     # some strings for comparissons. Declaring them here avoids creating new
     # strings that need GC over and over
@@ -366,7 +367,7 @@ class PDF::Reader
           # PDF name, start of new token
           @tokens << tok if tok.size > 0
           @tokens << byte.chr
-          @tokens << "" if byte == 0x2F && [nil, 0x20, 0x0A].include?(peek_byte)
+          @tokens << "" if byte == 0x2F && ([nil, 0x20, 0x0A] + TOKEN_DELIMITER).include?(peek_byte)
           tok = ""
           break
         else
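The changed condition widens the set of bytes that, when they directly follow a `/`, cause an empty-string token to be pushed for the name. The standalone sketch below isolates that check; the predicate name `empty_name_after_slash?` is hypothetical, and the constant is copied from the hunk above.

```ruby
# Copied from the hunk above: bytes that delimit tokens in PDF syntax.
TOKEN_DELIMITER = [0x25, 0x3C, 0x3E, 0x28, 0x5B, 0x7B, 0x29, 0x5D, 0x7D, 0x2F]

# Hypothetical predicate: true when a "/" followed by next_byte should be
# tokenized as an empty name. nil means the end of input was reached.
def empty_name_after_slash?(next_byte)
  ([nil, 0x20, 0x0A] + TOKEN_DELIMITER).include?(next_byte)
end
```

Under the 1.4.1 condition only end of input, a space, or a line feed qualified; with the new constant a sequence such as `/>` or `/[` is also treated as an empty name followed by a delimiter.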

data/lib/pdf/reader/cmap.rb CHANGED

@@ -31,6 +31,17 @@ class PDF::Reader
   # extracting various useful information.
   #
   class CMap # :nodoc:
+    CMAP_KEYWORDS = {
+      "begincodespacerange" => 1,
+      "endcodespacerange" => 1,
+      "beginbfchar" => 1,
+      "endbfchar" => 1,
+      "beginbfrange" => 1,
+      "endbfrange" => 1,
+      "begin" => 1,
+      "begincmap" => 1,
+      "def" => 1
+    }
     attr_reader :map
@@ -40,24 +51,25 @@ class PDF::Reader
     end
     def process_data(data)
+      parser = build_parser(data)
       mode = nil
-      instructions = ""
+      instructions = []
-      data.each_line do |l|
-        if l.include?("beginbfchar")
+      while token = parser.parse_token(CMAP_KEYWORDS)
+        if token == "beginbfchar"
           mode = :char
-        elsif l.include?("endbfchar")
+        elsif token == "endbfchar"
           process_bfchar_instructions(instructions)
-          instructions = ""
+          instructions = []
           mode = nil
-        elsif l.include?("beginbfrange")
+        elsif token == "beginbfrange"
           mode = :range
-        elsif l.include?("endbfrange")
+        elsif token == "endbfrange"
           process_bfrange_instructions(instructions)
-          instructions = ""
+          instructions = []
           mode = nil
         elsif mode == :char || mode == :range
-          instructions << l
+          instructions << token
         end
       end
     end
@@ -105,22 +117,15 @@ class PDF::Reader
     end
     def process_bfchar_instructions(instructions)
-      parser  = build_parser(instructions)
-      find    = str_to_int(parser.parse_token)
-      replace = str_to_int(parser.parse_token)
-      while find && replace
-        @map[find[0]] = replace
-        find       = str_to_int(parser.parse_token)
-        replace    = str_to_int(parser.parse_token)
+      instructions.each_slice(2) do |one, two|
+        find    = str_to_int(one)
+        replace = str_to_int(two)
+        @map[find.first] = replace
       end
     end
     def process_bfrange_instructions(instructions)
-      parser  = build_parser(instructions)
-      start   = parser.parse_token
-      finish  = parser.parse_token
-      to      = parser.parse_token
-      while start && finish && to
+      instructions.each_slice(3) do |start, finish, to|
         if start.kind_of?(String) && finish.kind_of?(String) && to.kind_of?(String)
           bfrange_type_one(start, finish, to)
         elsif start.kind_of?(String) && finish.kind_of?(String) && to.kind_of?(Array)
@@ -128,9 +133,6 @@ class PDF::Reader
         else
           raise "invalid bfrange section"
         end
-        start   = parser.parse_token
-        finish  = parser.parse_token
-        to      = parser.parse_token
       end
     end
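The rewrite collects tokens into a flat array and then walks it in fixed-size groups with `each_slice`, instead of re-parsing accumulated text. The sketch below illustrates the pairing step for `bfchar` under simplified assumptions: the helper name `build_bfchar_map` is hypothetical, and plain hex strings stand in for the string tokens that the real parser yields and `str_to_int` converts.

```ruby
# Simplified stand-in for the bfchar step: tokens arrive as a flat array, and
# each consecutive pair is one (source code, replacement) mapping.
def build_bfchar_map(tokens)
  map = {}
  tokens.each_slice(2) do |find, replace|
    # Hex strings are an assumption of this sketch, not the gem's token format.
    map[find.hex] = [replace.hex]
  end
  map
end

build_bfchar_map(["0001", "0041", "0002", "0042"])
# => {1=>[65], 2=>[66]}
```

The same idea with `each_slice(3)` yields the `start`, `finish`, `to` triples used for `bfrange`. Note that `each_slice` yields a shorter final group when the token count is not a multiple of the slice size, so malformed input surfaces as `nil` values in the last iteration.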