pdf-reader 0.6 → 0.6.1

data/CHANGELOG CHANGED
@@ -1,10 +1,18 @@
- v0.6.0 (xxx)
+ v0.6.1 (12th March 2008)
+ - Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
+ just replace each character with a little box.
+ - Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
+ NoMethodError.
+ - Added a method to RegisterReceiver that returns all occurances of a callback
+
+ v0.6.0 (27th February 2008)
  - all text is now transparently converted to UTF-8 before being passed to the callbacks.
  before this version, text was just passed as a byte level copy of what was in the PDF file, which
  was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
  - Fonts that use a difference table are now handled correctly
  - fixed some 1.9 incompatible syntax
  - expanded RegisterReceiver class to record extra info
+ - expanded rspec coverage
  - tweaked a README example
  
  v0.5.1 (1st January 2008)
data/README CHANGED
@@ -206,6 +206,10 @@ layout of the file, not the order objects are displayed to the user. As a
  consequence of this it is highly unlikely that text will be completely in
  order.
  
+ Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
+ of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
+ an unrecognisable character.
+
  = Resources
  
  - PDF::Reader Homepage: http://software.pmade.com/pdfreader
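
The "little UTF-8 friendly box" in the README note is the UNKNOWN_CHAR codepoint (0x25AF) that this release adds to PDF::Reader::Encoding. A minimal standalone sketch of what it looks like once packed to UTF-8 follows. It does not load pdf-reader, and the `text` string is made-up sample output, not real extraction results.

```ruby
# Standalone sketch; nothing here requires pdf-reader itself.
# UNKNOWN_CHAR mirrors the constant added to PDF::Reader::Encoding in 0.6.1.
UNKNOWN_CHAR = 0x25AF # WHITE VERTICAL RECTANGLE

# pack("U*") turns an array of Unicode codepoints into a UTF-8 string
box = [UNKNOWN_CHAR].pack("U*")

# the box occupies three bytes in UTF-8
box_bytes = box.unpack("C*") # => [0xE2, 0x96, 0xAF]

# hypothetical extracted text with one unrecognisable character in it
text = [72, UNKNOWN_CHAR, 105].pack("U*")
has_unknown = text.include?(box)
```

A receiver that wants to flag lossy extraction could check each string it is handed for this box in a similar way.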
data/Rakefile CHANGED
@@ -6,7 +6,7 @@ require 'rake/testtask'
  require "rake/gempackagetask"
  require 'spec/rake/spectask'
  
- PKG_VERSION = "0.6"
+ PKG_VERSION = "0.6.1"
  PKG_NAME = "pdf-reader"
  PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
  
data/TODO CHANGED
@@ -3,13 +3,17 @@ v0.7
  interested in meta data or bookmarks, there's no point in walking the pages tree.
  - maybe a third option to Reader.parse?
  parse(io, receiver, {:pages => true, :fonts => false, :metadata => true, :bookmarks => false})
+ - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
+ - When parsing a CMap into a ruby object, recognise ranged mappings defined by begincodespacerange (see spec, section 5.9.2)
+ - Provide a way to get raw access to a particular object. Good for testing purposes
  
+ v0.8
  - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
  poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
  from the Original encoding to Unicode.
  
  v0.9
- - Support for CJK text (convert to UTF-8 like all other encodings)
+ - Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
  - Add a way to extract raster images
  
  
@@ -17,6 +21,9 @@ Sometime
  - Ship some extra receivers in the standard package, particuarly ones that are useful for running
  rspec over generated PDF files
  
+ - When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
+ sensible way to convert them to unicode
+
  - Improve metadata support
  
  - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
@@ -28,6 +28,8 @@ require 'enumerator'
  class PDF::Reader
  class Encoding
  
+ UNKNOWN_CHAR = 0x25AF # ▯
+
  attr_reader :differences
  
  # set the differences table for this encoding. should be an array in the following format:
@@ -103,17 +105,26 @@ class PDF::Reader
  
  class IdentityH < Encoding
  def to_utf8(str, map = nil)
- raise ArgumentError, "a ToUnicode cmap is required to decode an IdentityH string" if map.nil?
-
+
  array_enc = []
  
  # iterate over string, reading it in 2 byte chunks and interpreting those
  # chunks as ints
  str.unpack("n*").each do |c|
- # convert the int to a unicode codepoint
- array_enc << map.decode(c)
+ # convert the int to a unicode codepoint if possible.
+ # without a ToUnicode CMap, it's impossible to reliably convert this text
+ # to unicode, so just replace each character with a little box. Big smacks
+ # the the PDF producing app.
+ if map
+ array_enc << map.decode(c)
+ else
+ array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
+ end
  end
-
+
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
  
@@ -300,6 +311,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
  
@@ -458,6 +472,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
  
@@ -532,6 +549,9 @@ class PDF::Reader
  
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
+
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
  
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
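
The fallback introduced in these hunks can be sketched outside the library roughly as follows. This is a simplified illustration rather than the shipped implementation: `identity_h_to_utf8` and `FakeCMap` are hypothetical names, and the only assumption taken from the diff is that a ToUnicode map responds to `decode(int)` and may return nil for codes it does not know.

```ruby
# Standalone sketch of the 0.6.1 fallback; does not load pdf-reader.
UNKNOWN_CHAR = 0x25AF # same value as PDF::Reader::Encoding::UNKNOWN_CHAR

# Hypothetical stand-in for a ToUnicode CMap object. Only the decode(int)
# call is taken from the diff; the Hash-backed table is an assumption.
class FakeCMap
  def initialize(table)
    @table = table
  end

  # returns a Unicode codepoint, or nil when the code is unmapped
  def decode(code)
    @table[code]
  end
end

# Hypothetical helper combining the two changes: IdentityH text without a
# map becomes all boxes, and any nil left after decoding becomes a box too.
def identity_h_to_utf8(str, map = nil)
  # read the string as big-endian 16 bit integers, as IdentityH#to_utf8 does
  codepoints = str.unpack("n*").collect do |c|
    map ? map.decode(c) : UNKNOWN_CHAR
  end

  # replace anything that failed to convert with the box
  codepoints.collect! { |c| c ? c : UNKNOWN_CHAR }

  # pack the codepoints into a UTF-8 string
  codepoints.pack("U*")
end

raw = [0x0001, 0x0002].pack("n*")
without_map = identity_h_to_utf8(raw)
with_map    = identity_h_to_utf8(raw, FakeCMap.new(0x0001 => 0x48))
```

With no map, every two byte code becomes a box; with a partial map, only the unmapped code does, which is the behaviour that previously surfaced as an ArgumentError or NoMethodError.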
@@ -710,6 +730,9 @@ class PDF::Reader
  end
  end
  
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
@@ -770,6 +793,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
  
@@ -999,6 +1025,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace charcters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
  
@@ -22,6 +22,15 @@ class PDF::Reader
  return counter
  end
  
+ # return the details for every time the specified callback was fired
+ def all(methodname)
+ ret = []
+ callbacks.each do |cb|
+ ret << cb if cb[:name] == methodname
+ end
+ return ret
+ end
+
  # return the details for the first time the specified callback was fired
  def first_occurance_of(methodname)
  callbacks.each do |cb|
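
To show how the new `all` method might be used from a spec, here is a trimmed-down stand-in for the receiver. `SketchReceiver`, its `record` method and the `:args` key are assumptions made for illustration. The only details taken from the diff are the `callbacks` collection, the `:name` key and the filtering behaviour of `all`.

```ruby
# Hypothetical stand-in for PDF::Reader::RegisterReceiver; not the real class.
class SketchReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # assumed recording step: store each fired callback as a hash
  def record(name, *args)
    callbacks << { :name => name, :args => args }
  end

  # same behaviour as the method added in this release, written with select
  def all(methodname)
    callbacks.select { |cb| cb[:name] == methodname }
  end
end

receiver = SketchReceiver.new
receiver.record(:show_text, "first")
receiver.record(:begin_page)
receiver.record(:show_text, "second")

shown = receiver.all(:show_text).collect { |cb| cb[:args] }
```

Unlike `first_occurance_of`, which stops at the first match, `all` keeps every matching callback in the order it was fired.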
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
- version: "0.6"
+ version: 0.6.1
  platform: ruby
  authors:
  - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
  
- date: 2008-02-26 00:00:00 +11:00
+ date: 2008-03-12 00:00:00 +11:00
  default_executable:
  dependencies: []