RubyGems - pdf-reader - Versions diffs - 2.5.0 → 2.6.0 - Mend

pdf-reader 2.5.0 → 2.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

checksums.yaml +4 -4
data/CHANGELOG +17 -0
data/README.md +16 -1
data/Rakefile +1 -1
data/examples/extract_fonts.rb +12 -7
data/lib/pdf/reader/buffer.rb +62 -21
data/lib/pdf/reader/encoding.rb +1 -1
data/lib/pdf/reader/error.rb +3 -3
data/lib/pdf/reader/filter/ascii85.rb +5 -1
data/lib/pdf/reader/filter/depredict.rb +3 -3
data/lib/pdf/reader/glyph_hash.rb +15 -9
data/lib/pdf/reader/glyphlist-zapfdingbats.txt +245 -0
data/lib/pdf/reader/page_layout.rb +12 -9
data/lib/pdf/reader/page_text_receiver.rb +2 -2
data/lib/pdf/reader/parser.rb +8 -6
data/lib/pdf/reader/xref.rb +6 -1
data/lib/pdf/reader/zero_width_runs_filter.rb +11 -0
metadata +11 -9

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 652d05cf6a22fad5ecb4b92de1e27ba60cafc6525c5ca524e24c7f9796fe1b83
-  data.tar.gz: 2c7448e97890a9fcbd10ec2cd5bafb9025db2fb75dabaf71a4074c542b1065a1
+  metadata.gz: ccc4d14f5820ca798f6eafa1c0978207759ec1668c6f6307acb7cd43bcd0626e
+  data.tar.gz: 466bfe0a91f57463a56d9697ccd2529f981c6917e4ed578b4103f2bc87065522
 SHA512:
-  metadata.gz: ac82452924cf46af98ee15f2a20642b1d06d5b9c22104fe171b5b4612665e482f341e12473805016ccb9d921fc15324ba51675170b369adeace8b278cd1279fb
-  data.tar.gz: b1dc1c4422b0e6bf01092cf724630ba7424fdef1fdaf34f33aaa3a31397caf6ef5a73185a98e6e2828a9e082d87cbca311565397cb064cac20d86e72be27626f
+  metadata.gz: 45d6c16b3d9ed029e6eb5a45cc64aa95e7ada2950e052053cbe0b6f5aae632f824a86f0505a5cee660abd1cd896177a0637a2f2f5a3f3633e829e8d46fb59817
+  data.tar.gz: e3e566344bd5560387577597dea20b2f7da40aed2a7fa8b8d074c0742486db59d7e349f6c38c91c8dcd9b0a8cf2aa4c19a00d0ee097003449504b3f06f18ca3c

data/CHANGELOG CHANGED

@@ -1,3 +1,20 @@
+v2.6.0 (12th November 2021)
+- Text extraction improvements
+  - Improved text layout on pages with a variety of font sizes (http://github.com/yob/pdf-reader/pull/355)
+  - Fixed text positioning for some rotated pages (http://github.com/yob/pdf-reader/pull/356)
+  - Improved character width calculation for PDFs using built-in (non-embedded) ZapfDingbats (http://github.com/yob/pdf-reader/pull/373)
+  - Skip zero-width characters (http://github.com/yob/pdf-reader/pull/372)
+- Performance improvements
+  - Reduced memory pressure when decoding TIFF images (http://github.com/yob/pdf-reader/pull/360)
+  - Optional dependency on ascii85_native gem for faster processing of files using the ascii85 filter (http://github.com/yob/pdf-reader/pull/359)
+- Successfully parse more files
+  - Gracefully handle some non-spec compliant CR/LF issues (http://github.com/yob/pdf-reader/pull/364)
+  - Fix parsing of some escape sequences in content streams (http://github.com/yob/pdf-reader/pull/368)
+  - Increase the amount of junk bytes we detect and skip at the end of a file (382)
+  - Ignore "/Prev 0" in trailers (http://github.com/yob/pdf-reader/pull/383)
+  - Fix parsing of some inline images (BI ID EI tokens) (http://github.com/yob/pdf-reader/pull/389)
+  - Gracefully handle some xref tables that incorrectly start with 1 (http://github.com/yob/pdf-reader/pull/384)
 v2.5.0 (6th June 2021)
 - bump minimum ruby version to 2.0
 - Correctly handle transcoding to UTF-8 from some fonts that use a difference table [#344](https://github.com/yob/pdf-reader/pull/344/)

data/README.md CHANGED

@@ -166,6 +166,19 @@ http://groups.google.com/group/pdf-reader
 The easiest way to explain how this works in practice is to show some examples.
 Check out the examples/ directory for a few files.
+# Alternate Decoder
+For PDF files containing Ascii85 streams, the [ascii85_native](https://github.com/AnomalousBit/ascii85_native) gem can be used for increased performance. If the ascii85_native gem is detected, pdf-reader will automatically use the gem.
+First, run `gem install ascii85_native` and then require the gem alongside pdf-reader:
+```ruby
+require "pdf-reader"
+require "ascii85_native"
+```
+Another way of enabling native Ascii85 decoding is to place `gem 'ascii85_native'` in your project's `Gemfile`.
 # Known Limitations
 Occasionally some text cannot be extracted properly due to the way it has been
@@ -176,7 +189,9 @@ little UTF-8 friendly box to indicate an unrecognisable character.
 * PDF::Reader Code Repository: http://github.com/yob/pdf-reader
-* PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
+* PDF Specification: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
+* Adobe PDF Developer Resources: http://www.adobe.com/devnet/pdf/pdf_reference.html
 * PDF Tutorial Slide Presentations: https://web.archive.org/web/20150110042057/http://home.comcast.net/~jk05/presentations/PDFTutorials.html

data/Rakefile CHANGED

@@ -14,7 +14,7 @@ desc "Run cane to check quality metrics"
 Cane::RakeTask.new(:quality) do |cane|
   cane.abc_max = 20
   cane.style_measure = 100
-  cane.max_violations = 31
+  cane.max_violations = 32
   cane.use Morecane::EncodingCheck, :encoding_glob => "{app,lib,spec}/**/*.rb"
 end

data/examples/extract_fonts.rb CHANGED

@@ -17,8 +17,8 @@ module ExtractFonts
       return count if page.fonts.nil? || page.fonts.empty?
       page.fonts.each do |label, font|
-        next if complete_refs[font]
-        complete_refs[font] = true
+        next if complete_refs[label]
+        complete_refs[label] = true
         process_font(page, font)
@@ -39,7 +39,7 @@ module ExtractFonts
       when :TrueType, :CIDFontType2 then
         ExtractFonts::TTF.new(page.objects, font).save("#{font[:BaseFont]}.ttf")
       else
-        $stderr.puts "unsupported font type #{font[:Subtype]}"
+        $stderr.puts "unsupported font type #{font[:Subtype]} for #{font[:BaseFont]}"
       end
     end
@@ -68,10 +68,15 @@ module ExtractFonts
   end
 end
-filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"
+if ARGV.size == 0 # default file name
+  ARGV << File.expand_path(File.join(File.dirname(__dir__), "spec", "data", "cairo-unicode.pdf"))
+end
 extractor = ExtractFonts::Extractor.new
-PDF::Reader.open(filename) do |reader|
-  page = reader.page(1)
-  extractor.page(page)
+ARGV.each do |arg|
+  PDF::Reader.open(arg) do |reader|
+    page = reader.page(1)
+    extractor.page(page)
+  end
 end

data/lib/pdf/reader/buffer.rb CHANGED

@@ -48,6 +48,15 @@ class PDF::Reader
     ID = "ID"
     FWD_SLASH = "/"
     NULL_BYTE = "\x00"
+    CR = "\r"
+    LF = "\n"
+    CRLF = "\r\n"
+    WHITE_SPACE = [LF, CR, ' ']
+    # Quite a few PDFs have trailing junk.
+    # This can be several k of nuls in some cases
+    # Allow for this here
+    TRAILING_BYTECOUNT = 5000
     attr_reader :pos
@@ -86,9 +95,12 @@ class PDF::Reader
     #
     # options:
     #
-    #   :skip_eol - if true, the IO stream is advanced past a CRLF or LF that
-    #               is sitting under the io cursor.
-    #
+    #   :skip_eol - if true, the IO stream is advanced past a CRLF, CR or LF
+    #               that is sitting under the io cursor.
+    #   Note:
+    #   Skipping a bare CR is not spec-compliant.
+    #   This is because the data may start with LF.
+    #   However we check for CRLF first, so the ambiguity is avoided.
     def read(bytes, opts = {})
       reset_pos
@@ -97,9 +109,9 @@ class PDF::Reader
         str = @io.read(2)
         if str.nil?
           return nil
-        elsif str == "\r\n"
+        elsif str == CRLF # This MUST be done before checking for CR alone
           # do nothing
-        elsif str[0,1] == "\n"
+        elsif str[0, 1] == LF || str[0, 1] == CR # LF or CR alone
           @io.seek(-1, IO::SEEK_CUR)
         else
           @io.seek(-2, IO::SEEK_CUR)
@@ -127,8 +139,8 @@ class PDF::Reader
     #
     def find_first_xref_offset
       check_size_is_non_zero
-      @io.seek(-1024, IO::SEEK_END) rescue @io.seek(0)
-      data = @io.read(1024)
+      @io.seek(-TRAILING_BYTECOUNT, IO::SEEK_END) rescue @io.seek(0)
+      data = @io.read(TRAILING_BYTECOUNT)
       # the PDF 1.7 spec (section #3.4) says that EOL markers can be either \r, \n, or both.
       lines = data.split(/[\n\r]+/).reverse
@@ -217,7 +229,9 @@ class PDF::Reader
       return if @tokens.size < 3
       return if @tokens[2] != "R"
-      if @tokens[0].match(/\d+/) && @tokens[1].match(/\d+/)
+      # must match whole tokens
+      digits_only = %r{\A\d+\z}
+      if @tokens[0].match(digits_only) && @tokens[1].match(digits_only)
         @tokens[0] = PDF::Reader::Reference.new(@tokens[0].to_i, @tokens[1].to_i)
         @tokens[1] = nil
         @tokens[2] = nil
@@ -225,24 +239,51 @@ class PDF::Reader
       end
     end
+    # Extract data between ID and EI
+    # If the EI follows white-space the space is dropped from the data
+    # The EI must be followed by white-space or end of buffer
+    # This is to reduce the chance of accidentally matching an embedded EI
     def prepare_inline_token
-      str = "".dup
-      buffer = []
-      until buffer[0] =~ /\s|\0/ && buffer[1, 2] == ["E", "I"]
+      idstart = @io.pos
+      chr = prevchr = nil
+      eisize = 0 # how many chars in the end marker
+      seeking = 'E' # what are we looking for now?
+      loop do
         chr = @io.read(1)
-        buffer << chr
-        if buffer.length > 3
-          str << buffer.shift
+        break if chr.nil?
+        case seeking
+        when 'E'
+          if chr == 'E'
+            seeking = 'I'
+            if WHITE_SPACE.include? prevchr
+              eisize = 3 # include whitespace in delimiter, i.e. drop from data
+            else # assume the EI immediately follows the data
+              eisize = 2 # leave prevchr in data
+            end
+          end
+        when 'I'
+          if chr == 'I'
+            seeking = :END
+          else
+            seeking = 'E'
+          end
+        when :END
+          if WHITE_SPACE.include? chr
+            eisize += 1 # Drop trailer
+            break
+          else
+            seeking = 'E'
+          end
         end
+        prevchr = chr
       end
-      str << NULL_BYTE if buffer.first == NULL_BYTE
+      unless seeking == :END
+        raise MalformedPDFError, "EI terminator not found"
+      end
+      eiend = @io.pos
+      @io.seek(idstart, IO::SEEK_SET)
+      str = @io.read(eiend - eisize - idstart) # get the ID content
       @tokens << string_token(str)
-      @io.seek(-3, IO::SEEK_CUR) unless chr.nil?
     end
     # if we're currently inside a hex string, read hex nibbles until
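
The `:skip_eol` note in this file says CRLF is checked before a bare CR or LF so that a two-byte terminator is never split. A minimal sketch of that ordering, using `StringIO` instead of the gem's `Buffer` class. The helper names `skip_eol` and `rest_after_eol` are hypothetical, and the handling of a one-byte read at end of input is an addition that is not in the diff:

```ruby
require "stringio"

# Advance io past one end-of-line marker (CRLF, CR or LF), if present.
# CRLF is tested first so that "\r\n" is consumed as a single marker.
def skip_eol(io)
  str = io.read(2)
  return if str.nil?                       # already at end of input

  if str == "\r\n"
    # both bytes belong to the marker, nothing to rewind
  elsif str[0, 1] == "\n" || str[0, 1] == "\r"
    # only the first byte was a marker; give back whatever else was read
    io.seek(1 - str.length, IO::SEEK_CUR)
  else
    # no marker at the cursor; give back everything that was read
    io.seek(-str.length, IO::SEEK_CUR)
  end
end

# Illustration helper: return what is left after skipping one marker.
def rest_after_eol(data)
  io = StringIO.new(data)
  skip_eol(io)
  io.read.to_s
end
```

For example, `rest_after_eol("\r\nabc")` and `rest_after_eol("\rabc")` both leave `"abc"`, while `rest_after_eol("\n\rabc")` leaves `"\rabc"` because only one marker is skipped.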

data/lib/pdf/reader/encoding.rb CHANGED

@@ -208,7 +208,7 @@ class PDF::Reader
     def load_mapping(file)
       File.open(file, "r:BINARY") do |f|
         f.each do |l|
-          _m, single_byte, unicode = *l.match(/([0-9A-Za-z]+);([0-9A-F]{4})/)
+          _m, single_byte, unicode = *l.match(/\A([0-9A-Za-z]+);([0-9A-F]{4})/)
           @mapping["0x#{single_byte}".hex] = "0x#{unicode}".hex if single_byte
         end
       end

data/lib/pdf/reader/error.rb CHANGED

@@ -33,17 +33,17 @@ class PDF::Reader
     def self.str_assert(lvalue, rvalue, chars=nil)
       raise MalformedPDFError, "PDF malformed, expected string but found #{lvalue.class} instead" if chars and !lvalue.kind_of?(String)
       lvalue = lvalue[0,chars] if chars
-      raise MalformedPDFError, "PDF malformed, expected '#{rvalue}' but found #{lvalue} instead"  if lvalue != rvalue
+      raise MalformedPDFError, "PDF malformed, expected '#{rvalue}' but found '#{lvalue}' instead"  if lvalue != rvalue
     end
     ################################################################################
     def self.str_assert_not(lvalue, rvalue, chars=nil)
       raise MalformedPDFError, "PDF malformed, expected string but found #{lvalue.class} instead" if chars and !lvalue.kind_of?(String)
       lvalue = lvalue[0,chars] if chars
-      raise MalformedPDFError, "PDF malformed, expected '#{rvalue}' but found #{lvalue} instead"  if lvalue == rvalue
+      raise MalformedPDFError, "PDF malformed, expected '#{rvalue}' but found '#{lvalue}' instead"  if lvalue == rvalue
     end
     ################################################################################
     def self.assert_equal(lvalue, rvalue)
-      raise MalformedPDFError, "PDF malformed, expected #{rvalue} but found #{lvalue} instead" if lvalue != rvalue
+      raise MalformedPDFError, "PDF malformed, expected '#{rvalue}' but found '#{lvalue}' instead" if lvalue != rvalue
     end
     ################################################################################
   end

data/lib/pdf/reader/filter/ascii85.rb CHANGED

@@ -17,7 +17,11 @@ class PDF::Reader
       #
       def filter(data)
         data = "<~#{data}" unless data.to_s[0,2] == "<~"
-        ::Ascii85::decode(data)
+        if defined?(::Ascii85Native)
+          ::Ascii85Native::decode(data)
+        else
+          ::Ascii85::decode(data)
+        end
       rescue Exception => e
         # Oops, there was a problem decoding the stream
         raise MalformedPDFError,
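
The change above picks a decoder at call time with `defined?`, so the faster gem is used only when the caller has loaded it. A minimal sketch of the same pattern follows. `SlowCodec`, `FastCodec` and `pick_decoder` are hypothetical stand-ins, not part of pdf-reader or either Ascii85 gem:

```ruby
# Hypothetical fallback implementation that is always available.
module SlowCodec
  def self.decode(data)
    data.reverse
  end
end

# Choose the optional implementation when its constant exists,
# otherwise fall back. The check runs on every call, so loading
# the optional module later is picked up automatically.
def pick_decoder
  defined?(::FastCodec) ? ::FastCodec : ::SlowCodec
end

before = pick_decoder          # FastCodec has not been defined yet

# Hypothetical optional implementation, defined (or required) later.
module FastCodec
  def self.decode(data)
    data.reverse
  end
end

after = pick_decoder           # now resolves to FastCodec
```

Because the check is made per call rather than at load time, the order of the `require` statements in the README example should not matter.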

data/lib/pdf/reader/filter/depredict.rb CHANGED

@@ -34,7 +34,7 @@ class PDF::Reader
       ################################################################################
       def tiff_depredict(data)
         data        = data.unpack("C*")
-        unfiltered  = []
+        unfiltered  = ''
         bpc         = @options[:BitsPerComponent] || 8
         pixel_bits  = bpc * @options[:Colors]
         pixel_bytes = pixel_bits / 8
@@ -51,11 +51,11 @@ class PDF::Reader
             left = index < pixel_bytes ? 0 : row_data[index - pixel_bytes]
             row_data[index] = (byte + left) % 256
           end
-          unfiltered += row_data
+          unfiltered += row_data.pack("C*")
           pos += line_len
         end
-        unfiltered.pack("C*")
+        unfiltered
       end
       ################################################################################
       def png_depredict(data)

data/lib/pdf/reader/glyph_hash.rb CHANGED

@@ -103,19 +103,25 @@ class PDF::Reader
     # returns a hash that maps glyph names to unicode codepoints. The mapping is based on
     # a text file supplied by Adobe at:
-    # http://www.adobe.com/devnet/opentype/archives/glyphlist.txt
+    # https://github.com/adobe-type-tools/agl-aglfn
     def load_adobe_glyph_mapping
       keyed_by_name      = {}
       keyed_by_codepoint = {}
-      File.open(File.dirname(__FILE__) + "/glyphlist.txt", "r:BINARY") do |f|
-        f.each do |l|
-          _m, name, code = *l.match(/([0-9A-Za-z]+);([0-9A-F]{4})/)
-          if name && code
-            cp = "0x#{code}".hex
-            keyed_by_name[name.to_sym]   = cp
-            keyed_by_codepoint[cp]     ||= []
-            keyed_by_codepoint[cp]     << name.to_sym
+      paths = [
+        File.dirname(__FILE__) + "/glyphlist.txt",
+        File.dirname(__FILE__) + "/glyphlist-zapfdingbats.txt",
+      ]
+      paths.each do |path|
+        File.open(path, "r:BINARY") do |f|
+          f.each do |l|
+            _m, name, code = *l.match(/([0-9A-Za-z]+);([0-9A-F]{4})/)
+            if name && code
+              cp = "0x#{code}".hex
+              keyed_by_name[name.to_sym]   = cp
+              keyed_by_codepoint[cp]     ||= []
+              keyed_by_codepoint[cp]     << name.to_sym
+            end
           end
         end
       end
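
The loader above reads `name;CODE` lines and builds two lookups, one keyed by glyph name and one keyed by codepoint. A minimal sketch of that parsing on an in-memory sample, assuming the same two-field format. `parse_glyph_list` is a hypothetical helper, and the `\A` anchor is borrowed from the `encoding.rb` change rather than from this file:

```ruby
# Parse "name;HHHH" lines into [by_name, by_codepoint].
# Comment lines and anything else that does not match are skipped.
def parse_glyph_list(text)
  by_name      = {}
  by_codepoint = {}
  text.each_line do |l|
    _m, name, code = *l.match(/\A([0-9A-Za-z]+);([0-9A-F]{4})/)
    next unless name && code

    cp = code.hex
    by_name[name.to_sym] = cp
    (by_codepoint[cp] ||= []) << name.to_sym
  end
  [by_name, by_codepoint]
end

sample = <<~LIST
  # comment lines are ignored
  a100;275E
  a4;260E
  # END
LIST

by_name, by_codepoint = parse_glyph_list(sample)
```

Here `by_name[:a4]` is `0x260E` and `by_codepoint[0x275E]` is `[:a100]`.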

data/lib/pdf/reader/glyphlist-zapfdingbats.txt ADDED

@@ -0,0 +1,245 @@
+# -----------------------------------------------------------
+# Copyright 2002-2019 Adobe (http://www.adobe.com/).
+#
+# Redistribution and use in source and binary forms, with or
+# without modification, are permitted provided that the
+# following conditions are met:
+#
+# Redistributions of source code must retain the above
+# copyright notice, this list of conditions and the following
+# disclaimer.
+#
+# Redistributions in binary form must reproduce the above
+# copyright notice, this list of conditions and the following
+# disclaimer in the documentation and/or other materials
+# provided with the distribution.
+#
+# Neither the name of Adobe nor the names of its contributors
+# may be used to endorse or promote products derived from this
+# software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+# CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
+# INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+# MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+# NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
+# OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+# -----------------------------------------------------------
+# Name:          ITC Zapf Dingbats Glyph List
+# Table version: 2.0
+# Date:          September 20, 2002
+# URL:           https://github.com/adobe-type-tools/agl-aglfn
+#
+# Format: two semicolon-delimited fields:
+#   (1) glyph name--upper/lowercase letters and digits
+#   (2) Unicode scalar value--four uppercase hexadecimal digits
+#
+a100;275E
+a101;2761
+a102;2762
+a103;2763
+a104;2764
+a105;2710
+a106;2765
+a107;2766
+a108;2767
+a109;2660
+a10;2721
+a110;2665
+a111;2666
+a112;2663
+a117;2709
+a118;2708
+a119;2707
+a11;261B
+a120;2460
+a121;2461
+a122;2462
+a123;2463
+a124;2464
+a125;2465
+a126;2466
+a127;2467
+a128;2468
+a129;2469
+a12;261E
+a130;2776
+a131;2777
+a132;2778
+a133;2779
+a134;277A
+a135;277B
+a136;277C
+a137;277D
+a138;277E
+a139;277F
+a13;270C
+a140;2780
+a141;2781
+a142;2782
+a143;2783
+a144;2784
+a145;2785
+a146;2786
+a147;2787
+a148;2788
+a149;2789
+a14;270D
+a150;278A
+a151;278B
+a152;278C
+a153;278D
+a154;278E
+a155;278F
+a156;2790
+a157;2791
+a158;2792
+a159;2793
+a15;270E
+a160;2794
+a161;2192
+a162;27A3
+a163;2194
+a164;2195
+a165;2799
+a166;279B
+a167;279C
+a168;279D
+a169;279E
+a16;270F
+a170;279F
+a171;27A0
+a172;27A1
+a173;27A2
+a174;27A4
+a175;27A5
+a176;27A6
+a177;27A7
+a178;27A8
+a179;27A9
+a17;2711
+a180;27AB
+a181;27AD
+a182;27AF
+a183;27B2
+a184;27B3
+a185;27B5
+a186;27B8
+a187;27BA
+a188;27BB
+a189;27BC
+a18;2712
+a190;27BD
+a191;27BE
+a192;279A
+a193;27AA
+a194;27B6
+a195;27B9
+a196;2798
+a197;27B4
+a198;27B7
+a199;27AC
+a19;2713
+a1;2701
+a200;27AE
+a201;27B1
+a202;2703
+a203;2750
+a204;2752
+a205;276E
+a206;2770
+a20;2714
+a21;2715
+a22;2716
+a23;2717
+a24;2718
+a25;2719
+a26;271A
+a27;271B
+a28;271C
+a29;2722
+a2;2702
+a30;2723
+a31;2724
+a32;2725
+a33;2726
+a34;2727
+a35;2605
+a36;2729
+a37;272A
+a38;272B
+a39;272C
+a3;2704
+a40;272D
+a41;272E
+a42;272F
+a43;2730
+a44;2731
+a45;2732
+a46;2733
+a47;2734
+a48;2735
+a49;2736
+a4;260E
+a50;2737
+a51;2738
+a52;2739
+a53;273A
+a54;273B
+a55;273C
+a56;273D
+a57;273E
+a58;273F
+a59;2740
+a5;2706
+a60;2741
+a61;2742
+a62;2743
+a63;2744
+a64;2745
+a65;2746
+a66;2747
+a67;2748
+a68;2749
+a69;274A
+a6;271D
+a70;274B
+a71;25CF
+a72;274D
+a73;25A0
+a74;274F
+a75;2751
+a76;25B2
+a77;25BC
+a78;25C6
+a79;2756
+a7;271E
+a81;25D7
+a82;2758
+a83;2759
+a84;275A
+a85;276F
+a86;2771
+a87;2772
+a88;2773
+a89;2768
+a8;271F
+a90;2769
+a91;276C
+a92;276D
+a93;276A
+a94;276B
+a95;2774
+a96;2775
+a97;275B
+a98;275C
+a99;275D
+a9;2720
+# END

data/lib/pdf/reader/page_layout.rb CHANGED

@@ -2,6 +2,7 @@
 # frozen_string_literal: true
 require 'pdf/reader/overlapping_runs_filter'
+require 'pdf/reader/zero_width_runs_filter'
 class PDF::Reader
@@ -17,10 +18,12 @@ class PDF::Reader
     def initialize(runs, mediabox)
       raise ArgumentError, "a mediabox must be provided" if mediabox.nil?
-      @runs    = merge_runs(OverlappingRunsFilter.exclude_redundant_runs(runs))
+      runs = ZeroWidthRunsFilter.exclude_zero_width_runs(runs)
+      runs = OverlappingRunsFilter.exclude_redundant_runs(runs)
+      @runs = merge_runs(runs)
       @mean_font_size   = mean(@runs.map(&:font_size)) || DEFAULT_FONT_SIZE
       @mean_font_size = DEFAULT_FONT_SIZE if @mean_font_size == 0
-      @mean_glyph_width = mean(@runs.map(&:mean_character_width)) || 0
+      @median_glyph_width = median(@runs.map(&:mean_character_width)) || 0
       @page_width  = (mediabox[2] - mediabox[0]).abs
       @page_height = (mediabox[3] - mediabox[1]).abs
       @x_offset = @runs.map(&:x).sort.first || 0
@@ -67,7 +70,7 @@ class PDF::Reader
     end
     def col_count
-      @col_count ||= ((@page_width  / @mean_glyph_width) * 1.05).floor
+      @col_count ||= ((@page_width  / @median_glyph_width) * 1.05).floor
     end
     def row_multiplier
@@ -86,12 +89,12 @@ class PDF::Reader
       end
     end
-    def each_line(&block)
-      @runs.sort.group_by { |run|
-        run.y.to_i
-      }.map { |y, collection|
-        yield y, collection
-      }
+    def median(collection)
+      if collection.size == 0
+        0
+      else
+        collection.sort[(collection.size * 0.5).floor]
+      end
     end
     # take a collection of TextRun objects and merge any that are in close
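
Switching `col_count` from the mean to the median glyph width makes the column estimate less sensitive to a few very wide runs, which fits the changelog entry about pages with a variety of font sizes. A minimal sketch of the difference follows. The `median` body mirrors the diff, while `mean`, `col_count` as a free function, and the sample widths are assumptions made for illustration:

```ruby
# Arithmetic mean; an assumed definition, not shown in the diff.
def mean(collection)
  return 0 if collection.empty?
  collection.sum / collection.size.to_f
end

# Upper-middle element of the sorted collection, as in the diff.
def median(collection)
  return 0 if collection.empty?
  collection.sort[(collection.size * 0.5).floor]
end

# Same formula as col_count in the diff, with the width passed in.
def col_count(page_width, glyph_width)
  ((page_width / glyph_width) * 1.05).floor
end

# Hypothetical page: mostly 5pt-wide glyphs plus one large heading run.
widths     = [5.0, 5.0, 5.0, 5.0, 40.0]
page_width = 600.0

cols_from_mean   = col_count(page_width, mean(widths))
cols_from_median = col_count(page_width, median(widths))
```

With these sample values the mean is pulled up to 12.0 by the single wide run, giving far fewer columns than the median of 5.0 does.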

data/lib/pdf/reader/page_text_receiver.rb CHANGED

@@ -45,8 +45,8 @@ module PDF
         @content = []
         @characters = []
         @mediabox = page.objects.deref(page.attributes[:MediaBox])
-        device_bl = @state.ctm_transform(@mediabox[0], @mediabox[1])
-        device_tr = @state.ctm_transform(@mediabox[2], @mediabox[3])
+        device_bl = apply_rotation(*@state.ctm_transform(@mediabox[0], @mediabox[1]))
+        device_tr = apply_rotation(*@state.ctm_transform(@mediabox[2], @mediabox[3]))
         @device_mediabox = [ device_bl.first, device_bl.last, device_tr.first, device_tr.last]
       end

data/lib/pdf/reader/parser.rb CHANGED

@@ -175,15 +175,18 @@ class PDF::Reader
       return "".dup.force_encoding("binary") if str == ")"
       Error.assert_equal(parse_token, ")")
-      str.gsub!(/\\([nrtbf()\\\n]|\d{1,3})?|\r\n?|\n\r/m) do |match|
-        MAPPING[match] || "".dup
+      str.gsub!(/\\(\r\n|[nrtbf()\\\n\r]|([0-7]{1,3}))?|\r\n?/m) do |match|
+        if $2.nil? # not octal digits
+          MAPPING[match] || "".dup
+        else # must be octal digits
+          ($2.oct & 0xff).chr # ignore high level overflow
+        end
       end
       str.force_encoding("binary")
     end
     MAPPING = {
       "\r"   => "\n",
-      "\n\r" => "\n",
       "\r\n" => "\n",
       "\\n"  => "\n",
       "\\r"  => "\r",
@@ -194,10 +197,9 @@ class PDF::Reader
       "\\)"  => ")",
       "\\\\" => "\\",
       "\\\n" => "",
+      "\\\r" => "",
+      "\\\r\n" => "",
     }
-    0.upto(9)   { |n| MAPPING["\\00"+n.to_s] = ("00"+n.to_s).oct.chr }
-    0.upto(99)  { |n| MAPPING["\\0"+n.to_s]  = ("0"+n.to_s).oct.chr }
-    0.upto(377) { |n| MAPPING["\\"+n.to_s]   = n.to_s.oct.chr }
     ################################################################################
     # Decodes the contents of a PDF Stream and returns it as a Ruby String.
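
The parser change replaces a precomputed table of octal escapes with on-the-fly decoding of one to three octal digits, masking the result to a single byte. A minimal sketch of just that step, leaving out the other escape sequences handled by `MAPPING`. `decode_octal_escapes` is a hypothetical helper name:

```ruby
# Replace each backslash followed by 1-3 octal digits with the byte it
# encodes. Values above 0xff are masked to the low 8 bits, matching the
# "ignore high level overflow" comment in the diff.
def decode_octal_escapes(str)
  # Work on a binary copy so bytes >= 0x80 can be inserted safely.
  str.b.gsub(/\\([0-7]{1,3})/) { ($1.oct & 0xff).chr }
end
```

For example, `decode_octal_escapes("A\\101")` yields the two bytes `[65, 65]`, and a trailing `8` or `9` is left alone because it is not an octal digit.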

data/lib/pdf/reader/xref.rb CHANGED

@@ -131,6 +131,9 @@ class PDF::Reader
             generation = buf.token.to_i
             state = buf.token
+            # Some PDF writers start numbering at 1 instead of 0. Fix up the number.
+            # TODO should this fix be logged?
+            objid = 0 if objid == 1 and offset == 0 and generation == 65535 and state == 'f'
             store(objid, generation, offset + @junk_offset) if state == "n" && offset > 0
             objid += 1
             params.clear
@@ -146,7 +149,9 @@ class PDF::Reader
       end
       load_offsets(trailer[:XRefStm])   if trailer.has_key?(:XRefStm)
-      load_offsets(trailer[:Prev].to_i) if trailer.has_key?(:Prev)
+      # Some PDF creators seem to use '/Prev 0' in trailer if there is no previous xref
+      # It's not possible for an xref to appear at offset 0, so can safely skip the ref
+      load_offsets(trailer[:Prev].to_i) if trailer.has_key?(:Prev) and trailer[:Prev].to_i != 0
       trailer
     end
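
The `/Prev 0` guard above follows the previous xref section only when the trailer names a non-zero offset. A minimal sketch of that condition on plain hashes. `follow_prev?` is a hypothetical helper, not a method in pdf-reader:

```ruby
# True only when the trailer has a :Prev entry pointing at a real offset.
# An offset of 0 cannot hold an xref section, so it is treated as absent.
def follow_prev?(trailer)
  trailer.key?(:Prev) && trailer[:Prev].to_i != 0
end
```

`follow_prev?({ Prev: 1234 })` is true, while `follow_prev?({ Prev: 0 })` and `follow_prev?({})` are both false.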

data/lib/pdf/reader/zero_width_runs_filter.rb ADDED

@@ -0,0 +1,11 @@
+# coding: utf-8
+class PDF::Reader
+  # There's no point rendering zero-width characters
+  class ZeroWidthRunsFilter
+    def self.exclude_zero_width_runs(runs)
+      runs.reject { |run| run.width == 0 }
+    end
+  end
+end
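
The new filter drops any run whose width is zero before layout. A minimal sketch of the same `reject` call on stand-in objects. `Run` is a hypothetical `Struct`, not the gem's `TextRun` class:

```ruby
# Hypothetical stand-in for a text run; only text and width matter here.
Run = Struct.new(:text, :width)

# Same predicate as the filter in the diff, as a free function.
def exclude_zero_width_runs(runs)
  runs.reject { |run| run.width == 0 }
end

runs = [Run.new("a", 5.2), Run.new("\u200B", 0), Run.new("b", 0.0), Run.new("c", 4.8)]
kept = exclude_zero_width_runs(runs)
```

Both the integer `0` and the float `0.0` compare equal to `0` in Ruby, so `kept` contains only the runs for `"a"` and `"c"`.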

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version: 2.5.0
+  version: 2.6.0
 platform: ruby
 authors:
 - James Healy
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2021-06-06 00:00:00.000000000 Z
+date: 2021-11-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rake
@@ -239,6 +239,7 @@ files:
 - lib/pdf/reader/font_descriptor.rb
 - lib/pdf/reader/form_xobject.rb
 - lib/pdf/reader/glyph_hash.rb
+- lib/pdf/reader/glyphlist-zapfdingbats.txt
 - lib/pdf/reader/glyphlist.txt
 - lib/pdf/reader/lzw.rb
 - lib/pdf/reader/null_security_handler.rb
@@ -272,15 +273,16 @@ files:
 - lib/pdf/reader/width_calculator/type_one_or_three.rb
 - lib/pdf/reader/width_calculator/type_zero.rb
 - lib/pdf/reader/xref.rb
+- lib/pdf/reader/zero_width_runs_filter.rb
 homepage: https://github.com/yob/pdf-reader
 licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/yob/pdf-reader/issues
-  changelog_uri: https://github.com/yob/pdf-reader/blob/v2.5.0/CHANGELOG
-  documentation_uri: https://www.rubydoc.info/gems/pdf-reader/2.5.0
-  source_code_uri: https://github.com/yob/pdf-reader/tree/v2.5.0
-post_install_message:
+  changelog_uri: https://github.com/yob/pdf-reader/blob/v2.6.0/CHANGELOG
+  documentation_uri: https://www.rubydoc.info/gems/pdf-reader/2.6.0
+  source_code_uri: https://github.com/yob/pdf-reader/tree/v2.6.0
+post_install_message:
 rdoc_options:
 - "--title"
 - PDF::Reader Documentation
@@ -300,8 +302,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.2.3
-signing_key:
+rubygems_version: 3.1.4
+signing_key:
 specification_version: 4
 summary: A library for accessing the content of PDF files
 test_files: []