pdf-reader 0.6 → 0.6.1

data/CHANGELOG CHANGED
@@ -1,10 +1,18 @@
- v0.6.0 (xxx)
+ v0.6.1 (12th March 2008)
+ - Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
+ just replace each character with a little box.
+ - Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
+ NoMethodError.
+ - Added a method to RegisterReceiver that returns all occurrences of a callback
+
+ v0.6.0 (27th February 2008)
  - all text is now transparently converted to UTF-8 before being passed to the callbacks.
  before this version, text was just passed as a byte level copy of what was in the PDF file, which
  was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
  - Fonts that use a difference table are now handled correctly
  - fixed some 1.9 incompatible syntax
  - expanded RegisterReceiver class to record extra info
+ - expanded rspec coverage
  - tweaked a README example
  
  v0.5.1 (1st January 2008)
data/README CHANGED
@@ -206,6 +206,10 @@ layout of the file, not the order objects are displayed to the user. As a
  consequence of this it is highly unlikely that text will be completely in
  order.
  
+ Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
+ of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
+ an unrecognisable character.
+
  = Resources
  
  - PDF::Reader Homepage: http://software.pmade.com/pdfreader
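
The README addition above means callers may now see placeholder boxes in extracted text instead of an exception. A minimal sketch of how a caller might check for them, assuming the box is U+25AF (the value this release assigns to `UNKNOWN_CHAR`). The constant `UNKNOWN_BOX` and the helper `count_unknown_chars` are hypothetical names used for illustration only; they are not part of PDF::Reader.

```ruby
# Hypothetical helper, not part of PDF::Reader.
# U+25AF packed as a UTF-8 string (three bytes: E2 96 AF).
UNKNOWN_BOX = [0x25AF].pack("U*")

# Count how many placeholder boxes appear in a piece of extracted text.
def count_unknown_chars(text)
  text.scan(UNKNOWN_BOX).length
end

# Sample text standing in for what a receiver callback might be handed.
sample = "Invoice " + UNKNOWN_BOX + UNKNOWN_BOX + " total"
puts count_unknown_chars(sample)
```

A non-zero count suggests the source PDF stored some text in a form that could not be converted to Unicode.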
data/Rakefile CHANGED
@@ -6,7 +6,7 @@ require 'rake/testtask'
  require "rake/gempackagetask"
  require 'spec/rake/spectask'
  
- PKG_VERSION = "0.6"
+ PKG_VERSION = "0.6.1"
  PKG_NAME = "pdf-reader"
  PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
 
data/TODO CHANGED
@@ -3,13 +3,17 @@ v0.7
  interested in meta data or bookmarks, there's no point in walking the pages tree.
  - maybe a third option to Reader.parse?
  parse(io, receiver, {:pages => true, :fonts => false, :metadata => true, :bookmarks => false})
+ - detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte Asian encodings), and display a user friendly error
+ - When parsing a CMap into a Ruby object, recognise ranged mappings defined by begincodespacerange (see spec, section 5.9.2)
+ - Provide a way to get raw access to a particular object. Good for testing purposes
  
+ v0.8
  - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
  poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
  from the Original encoding to Unicode.
  
  v0.9
- - Support for CJK text (convert to UTF-8 like all other encodings)
+ - Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
  - Add a way to extract raster images
  
 
@@ -17,6 +21,9 @@ Sometime
  - Ship some extra receivers in the standard package, particularly ones that are useful for running
  rspec over generated PDF files
  
+ - When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
+ sensible way to convert them to unicode
+
  - Improve metadata support
  
  - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
@@ -28,6 +28,8 @@ require 'enumerator'
  class PDF::Reader
    class Encoding
  
+     UNKNOWN_CHAR = 0x25AF # ▯
+
      attr_reader :differences
  
      # set the differences table for this encoding. should be an array in the following format:
@@ -103,17 +105,26 @@ class PDF::Reader
  
  class IdentityH < Encoding
    def to_utf8(str, map = nil)
-     raise ArgumentError, "a ToUnicode cmap is required to decode an IdentityH string" if map.nil?
-
+
      array_enc = []
  
      # iterate over string, reading it in 2 byte chunks and interpreting those
      # chunks as ints
      str.unpack("n*").each do |c|
-       # convert the int to a unicode codepoint
-       array_enc << map.decode(c)
+       # convert the int to a unicode codepoint if possible.
+       # without a ToUnicode CMap, it's impossible to reliably convert this text
+       # to unicode, so just replace each character with a little box. Big smacks
+       # to the PDF producing app.
+       if map
+         array_enc << map.decode(c)
+       else
+         array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
+       end
      end
-
+
+     # replace characters that didn't convert to unicode nicely with something valid
+     array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
      # pack all our Unicode codepoints into a UTF-8 string
      ret = array_enc.pack("U*")
 
@@ -300,6 +311,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
 
@@ -458,6 +472,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
 
@@ -532,6 +549,9 @@ class PDF::Reader
  
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
+
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
  
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
@@ -710,6 +730,9 @@ class PDF::Reader
    end
  end
  
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
 
@@ -770,6 +793,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
 
@@ -999,6 +1025,9 @@ class PDF::Reader
  # convert any glyph names to unicode codepoints
  array_enc = self.process_glyphnames(array_enc)
  
+ # replace characters that didn't convert to unicode nicely with something valid
+ array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
  # pack all our Unicode codepoints into a UTF-8 string
  ret = array_enc.pack("U*")
 
@@ -22,6 +22,15 @@ class PDF::Reader
      return counter
    end
  
+   # return the details for every time the specified callback was fired
+   def all(methodname)
+     ret = []
+     callbacks.each do |cb|
+       ret << cb if cb[:name] == methodname
+     end
+     return ret
+   end
+
    # return the details for the first time the specified callback was fired
    def first_occurance_of(methodname)
      callbacks.each do |cb|
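
The new `all` method is most useful in specs that check every firing of one callback. A rough usage sketch follows, built on a stand-in class rather than the real RegisterReceiver. `SketchReceiver`, its `record` method, and the `:args` key are assumptions made for illustration; only the `:name` key and the filtering logic come from this diff.

```ruby
# Hypothetical stand-in for RegisterReceiver. Each recorded callback is a
# hash with a :name key (as in the diff) and an :args key (an assumption).
class SketchReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # Record one callback firing (hypothetical helper for this sketch).
  def record(name, *args)
    callbacks << { :name => name, :args => args }
  end

  # Same filtering as the method added in this release, written with select.
  def all(methodname)
    callbacks.select { |cb| cb[:name] == methodname }
  end
end

receiver = SketchReceiver.new
receiver.record(:begin_page)
receiver.record(:show_text, "Hello")
receiver.record(:show_text, "World")

# Collect the first argument of every show_text firing, in order.
texts = receiver.all(:show_text).map { |cb| cb[:args].first }
puts texts.inspect
```

Unlike `first_occurance_of`, which returns only the first match, `all` returns every match in firing order, or an empty array when the callback never fired.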
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
-   version: "0.6"
+   version: 0.6.1
  platform: ruby
  authors:
  - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
  
- date: 2008-02-26 00:00:00 +11:00
+ date: 2008-03-12 00:00:00 +11:00
  default_executable:
  dependencies: []