pdf-reader 0.6 → 0.6.1

Files changed (7)

data/CHANGELOG +9 -1
data/README +4 -0
data/Rakefile +1 -1
data/TODO +8 -1
data/lib/pdf/reader/encoding.rb +34 -5
data/lib/pdf/reader/register_receiver.rb +9 -0
metadata +2 -2

data/CHANGELOG CHANGED

@@ -1,10 +1,18 @@
-v0.6.0 (xxx)
+v0.6.1 (12th March 2008)
+- Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
+  just replace each character with a little box.
+- Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
+  NoMethodError.
+- Added a method to RegisterReceiver that returns all occurances of a callback
+v0.6.0 (27th February 2008)
 - all text is now transparently converted to UTF-8 before being passed to the callbacks.
   before this version, text was just passed as a byte level copy of what was in the PDF file, which
   was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
 - Fonts that use a difference table are now handled correctly
 - fixed some 1.9 incompatible syntax
 - expanded RegisterReceiver class to record extra info
+- expanded rspec coverage
 - tweaked a README example
 v0.5.1 (1st January 2008)

data/README CHANGED

@@ -206,6 +206,10 @@ layout of the file, not the order objects are displayed to the user. As a
 consequence of this it is highly unlikely that text will be completely in
 order.
+Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
+of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
+an unrecognisable character.
 = Resources
 - PDF::Reader Homepage: http://software.pmade.com/pdfreader

data/Rakefile CHANGED

@@ -6,7 +6,7 @@ require 'rake/testtask'
 require "rake/gempackagetask"
 require 'spec/rake/spectask'
-PKG_VERSION = "0.6"
+PKG_VERSION = "0.6.1"
 PKG_NAME = "pdf-reader"
 PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"

data/TODO CHANGED

@@ -3,13 +3,17 @@ v0.7
   interested in meta data or bookmarks, there's no point in walking the pages tree.
   - maybe a third option to Reader.parse?
     parse(io, receiver, {:pages => true, :fonts => false, :metadata => true, :bookmarks => false})
+- detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
+- When parsing a CMap into a ruby object, recognise ranged mappings defined by begincodespacerange (see spec, section 5.9.2)
+- Provide a way to get raw access to a particular object. Good for testing purposes
+v0.8
 - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
   poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
   from the Original encoding to Unicode.
 v0.9
-- Support for CJK text (convert to UTF-8 like all other encodings)
+- Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
 - Add a way to extract raster images
@@ -17,6 +21,9 @@ Sometime
 - Ship some extra receivers in the standard package, particuarly ones that are useful for running
   rspec over generated PDF files
+- When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
+  sensible way to convert them to unicode
 - Improve metadata support
 - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?

data/lib/pdf/reader/encoding.rb CHANGED

@@ -28,6 +28,8 @@ require 'enumerator'
 class PDF::Reader
   class Encoding
+    UNKNOWN_CHAR = 0x25AF # ▯
     attr_reader :differences
     # set the differences table for this encoding. should be an array in the following format:
@@ -103,17 +105,26 @@ class PDF::Reader
     class IdentityH < Encoding
       def to_utf8(str, map = nil)
-        raise ArgumentError, "a ToUnicode cmap is required to decode an IdentityH string" if map.nil?
         array_enc = []
         # iterate over string, reading it in 2 byte chunks and interpreting those
         # chunks as ints
         str.unpack("n*").each do |c|
-          # convert the int to a unicode codepoint
-          array_enc << map.decode(c)
+          # convert the int to a unicode codepoint if possible.
+          # without a ToUnicode CMap, it's impossible to reliably convert this text
+          # to unicode, so just replace each character with a little box. Big smacks
+          # the the PDF producing app.
+          if map
+            array_enc << map.decode(c)
+          else
+            array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
+          end
         end
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -300,6 +311,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -458,6 +472,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -532,6 +549,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -710,6 +730,9 @@ class PDF::Reader
           end
         end
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
@@ -770,6 +793,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -999,6 +1025,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
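
The encoding.rb hunks above all apply the same idea: any codepoint that could not be resolved is swapped for the box character before the array is packed into UTF-8. The following is a minimal stand-alone sketch of that fallback for the Identity-H case, not the gem's actual code. `HashMap` and `identity_h_to_utf8` are hypothetical names introduced for illustration; the only things taken from the diff are the `UNKNOWN_CHAR` value, the `unpack("n*")` / `pack("U*")` calls, and the assumption that the map object responds to `decode`.

```ruby
# Hypothetical stand-alone sketch; names other than UNKNOWN_CHAR are assumptions.
UNKNOWN_CHAR = 0x25AF # U+25AF, the "little box" used in the diff

# Stand-in for a ToUnicode CMap: anything that responds to decode(int)
# and returns a Unicode codepoint, or nil when the code is unmapped.
class HashMap
  def initialize(table)
    @table = table
  end

  def decode(code)
    @table[code]
  end
end

def identity_h_to_utf8(str, map = nil)
  # Read the string as big-endian 16-bit integers, two bytes per character.
  codepoints = str.unpack("n*").map do |code|
    # Without a map there is no reliable conversion, so use the box.
    map ? map.decode(code) : UNKNOWN_CHAR
  end

  # A map may still return nil for individual codes; replace those too.
  codepoints.map! { |c| c || UNKNOWN_CHAR }

  # Pack the codepoints into a UTF-8 string.
  codepoints.pack("U*")
end

sample = "\x00\x01\x00\x02"
with_map    = identity_h_to_utf8(sample, HashMap.new({ 1 => 0x41 }))
without_map = identity_h_to_utf8(sample)
```

In this sketch, `with_map` contains one decoded character followed by one box (code 2 has no mapping), while `without_map` contains two boxes, which mirrors the behaviour the CHANGELOG entry describes.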

data/lib/pdf/reader/register_receiver.rb CHANGED

@@ -22,6 +22,15 @@ class PDF::Reader
       return counter
     end
+    # return the details for every time the specified callback was fired
+    def all(methodname)
+      ret = []
+      callbacks.each do |cb|
+        ret << cb if cb[:name] == methodname
+      end
+      return ret
+    end
     # return the details for the first time the specified callback was fired
     def first_occurance_of(methodname)
       callbacks.each do |cb|
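
To illustrate what the new `all` method returns, here is a small sketch under stated assumptions. `SketchReceiver` and its `record` method are hypothetical and exist only for this example; the real RegisterReceiver populates its callback list differently. The diff confirms only that each recorded callback is a hash with a `:name` key; the `:args` key is an assumption.

```ruby
# Hypothetical receiver used only to demonstrate the filtering in `all`.
class SketchReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # Assumed recording helper: store the callback name and its arguments.
  def record(name, *args)
    @callbacks << { :name => name, :args => args }
  end

  # Same behaviour as the method added in the diff: return the details
  # for every time the named callback was fired, in firing order.
  def all(methodname)
    callbacks.select { |cb| cb[:name] == methodname }
  end
end

receiver = SketchReceiver.new
receiver.record(:begin_page)
receiver.record(:show_text, "Hello")
receiver.record(:show_text, "World")

text_calls = receiver.all(:show_text)
```

Here `text_calls` holds both `:show_text` entries, which is the kind of result that is convenient to assert against when running specs over generated PDF files. `select` is used in place of the explicit loop in the diff; the result is the same.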

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version: "0.6"
+  version: 0.6.1
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2008-02-26 00:00:00 +11:00
+date: 2008-03-12 00:00:00 +11:00
 default_executable:
 dependencies: []