pdf-reader 0.6 → 0.6.1
- data/CHANGELOG +9 -1
- data/README +4 -0
- data/Rakefile +1 -1
- data/TODO +8 -1
- data/lib/pdf/reader/encoding.rb +34 -5
- data/lib/pdf/reader/register_receiver.rb +9 -0
- metadata +2 -2
data/CHANGELOG
CHANGED
@@ -1,10 +1,18 @@
-v0.6.
+v0.6.1 (12th March 2008)
+- Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
+  just replace each character with a little box.
+- Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
+  NoMethodError.
+- Added a method to RegisterReceiver that returns all occurances of a callback
+
+v0.6.0 (27th February 2008)
 - all text is now transparently converted to UTF-8 before being passed to the callbacks.
   before this version, text was just passed as a byte level copy of what was in the PDF file, which
   was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
 - Fonts that use a difference table are now handled correctly
 - fixed some 1.9 incompatible syntax
 - expanded RegisterReceiver class to record extra info
+- expanded rspec coverage
 - tweaked a README example

 v0.5.1 (1st January 2008)
data/README
CHANGED
@@ -206,6 +206,10 @@ layout of the file, not the order objects are displayed to the user. As a
 consequence of this it is highly unlikely that text will be completely in
 order.

+Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
+of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
+an unrecognisable character.
+
 = Resources

 - PDF::Reader Homepage: http://software.pmade.com/pdfreader
data/Rakefile
CHANGED
data/TODO
CHANGED
@@ -3,13 +3,17 @@ v0.7
 interested in meta data or bookmarks, there's no point in walking the pages tree.
 - maybe a third option to Reader.parse?
 parse(io, receiver, {:pages => true, :fonts => false, :metadata => true, :bookmarks => false})
+- detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
+- When parsing a CMap into a ruby object, recognise ranged mappings defined by begincodespacerange (see spec, section 5.9.2)
+- Provide a way to get raw access to a particular object. Good for testing purposes

+v0.8
 - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
   poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
   from the Original encoding to Unicode.

 v0.9
-- Support for CJK text (convert to UTF-8 like all other encodings)
+- Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
 - Add a way to extract raster images


@@ -17,6 +21,9 @@ Sometime
 - Ship some extra receivers in the standard package, particuarly ones that are useful for running
   rspec over generated PDF files

+- When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
+  sensible way to convert them to unicode
+
 - Improve metadata support

 - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
data/lib/pdf/reader/encoding.rb
CHANGED
@@ -28,6 +28,8 @@ require 'enumerator'
 class PDF::Reader
   class Encoding

+    UNKNOWN_CHAR = 0x25AF # ▯
+
     attr_reader :differences

     # set the differences table for this encoding. should be an array in the following format:
@@ -103,17 +105,26 @@ class PDF::Reader

     class IdentityH < Encoding
       def to_utf8(str, map = nil)
-
-
+
         array_enc = []

         # iterate over string, reading it in 2 byte chunks and interpreting those
         # chunks as ints
         str.unpack("n*").each do |c|
-          # convert the int to a unicode codepoint
-
+          # convert the int to a unicode codepoint if possible.
+          # without a ToUnicode CMap, it's impossible to reliably convert this text
+          # to unicode, so just replace each character with a little box. Big smacks
+          # the the PDF producing app.
+          if map
+            array_enc << map.decode(c)
+          else
+            array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
+          end
         end
-
+
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")

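The IdentityH change above can be read as the following standalone sketch. It is not the pdf-reader source: `SketchCMap` and `identity_h_to_utf8` are hypothetical names, and the only thing assumed about a real ToUnicode map is what the diff shows, namely that it responds to `decode` and may return nil for a code it cannot map.

```ruby
# Illustrative sketch only; not taken from pdf-reader.
UNKNOWN_CHAR = 0x25AF # same codepoint as the constant added in the diff

# Hypothetical stand-in for a ToUnicode CMap: decode(int) returns a
# Unicode codepoint, or nil when the code is not in the table.
class SketchCMap
  def initialize(table)
    @table = table
  end

  def decode(code)
    @table[code]
  end
end

# Read the string as big-endian 16-bit ints, map each one to a codepoint
# where possible, fall back to the box otherwise, then pack as UTF-8.
def identity_h_to_utf8(str, map = nil)
  codepoints = str.unpack("n*").map do |c|
    (map && map.decode(c)) || UNKNOWN_CHAR
  end
  codepoints.pack("U*")
end

map = SketchCMap.new({ 0x0001 => 0x48, 0x0002 => 0x69 })
raw = [1, 2, 3].pack("n*")

with_map    = identity_h_to_utf8(raw, map) # code 3 is unmapped, so it becomes a box
without_map = identity_h_to_utf8(raw)      # no map at all, so every code becomes a box
```

The sketch folds the "no map" and "map returned nil" cases into one expression, whereas the diff handles them in two steps (the `if map` branch, then the `collect!` pass).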
@@ -300,6 +311,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)

+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")

@@ -458,6 +472,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)

+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")

@@ -532,6 +549,9 @@ class PDF::Reader

         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)
+
+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }

         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")
@@ -710,6 +730,9 @@ class PDF::Reader
           end
         end

+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)

@@ -770,6 +793,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)

+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")

@@ -999,6 +1025,9 @@ class PDF::Reader
         # convert any glyph names to unicode codepoints
         array_enc = self.process_glyphnames(array_enc)

+        # replace charcters that didn't convert to unicode nicely with something valid
+        array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
         # pack all our Unicode codepoints into a UTF-8 string
         ret = array_enc.pack("U*")

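The same three-line guard is added to each of the other encodings in this file. A minimal sketch of what it does, assuming a byte-to-codepoint table that returns nil for bytes it does not cover; `SKETCH_MAPPING` and `sketch_to_utf8` are made-up names for illustration, and the real per-encoding tables in pdf-reader are not reproduced here.

```ruby
# Illustrative sketch only; the table below is a deliberately tiny,
# hypothetical mapping, not one of pdf-reader's encodings.
UNKNOWN_CHAR = 0x25AF

SKETCH_MAPPING = {
  0x41 => 0x0041, # "A"
  0x80 => 0x20AC  # euro sign
}

def sketch_to_utf8(str)
  # look up each byte; bytes missing from the table come back as nil
  array_enc = str.unpack("C*").map { |byte| SKETCH_MAPPING[byte] }

  # the guard from the diff: swap every nil for the box codepoint so
  # that pack only ever sees integers
  array_enc.collect! { |c| c ? c : UNKNOWN_CHAR }

  # pack the codepoints into a UTF-8 string
  array_enc.pack("U*")
end

result = sketch_to_utf8([0x41, 0x80, 0x81].pack("C*"))
```

Here byte 0x81 has no entry in the table, so the third character of `result` is the box rather than an error at pack time.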
data/lib/pdf/reader/register_receiver.rb
CHANGED
@@ -22,6 +22,15 @@ class PDF::Reader
       return counter
     end

+    # return the details for every time the specified callback was fired
+    def all(methodname)
+      ret = []
+      callbacks.each do |cb|
+        ret << cb if cb[:name] == methodname
+      end
+      return ret
+    end
+
     # return the details for the first time the specified callback was fired
     def first_occurance_of(methodname)
       callbacks.each do |cb|
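A rough sketch of how the new `all` method might be used when checking generated PDFs. `SketchReceiver` and its `record` helper are hypothetical stand-ins for RegisterReceiver; the only detail taken from the diff is that each recorded callback is a hash with a `:name` key. The `:args` key is an assumption made for the example.

```ruby
# Illustrative sketch only; a stand-in for PDF::Reader::RegisterReceiver.
class SketchReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # hypothetical helper: remember that a callback fired, and with what
  def record(name, *args)
    @callbacks << { :name => name, :args => args }
  end

  # same behaviour as the method added in the diff, written with select
  def all(methodname)
    callbacks.select { |cb| cb[:name] == methodname }
  end
end

receiver = SketchReceiver.new
receiver.record(:begin_page)
receiver.record(:show_text, "Hello")
receiver.record(:show_text, "World")

# collect the first argument of every show_text callback, in firing order
shown = receiver.all(:show_text).map { |cb| cb[:args].first }
```

Using `select` instead of the explicit accumulator is only a stylistic choice in the sketch; both return the matching entries in the order they were recorded.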
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version:
+  version: 0.6.1
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []

-date: 2008-
+date: 2008-03-12 00:00:00 +11:00
 default_executable:
 dependencies: []
