pdf-reader 0.6 → 0.6.1
- data/CHANGELOG +9 -1
- data/README +4 -0
- data/Rakefile +1 -1
- data/TODO +8 -1
- data/lib/pdf/reader/encoding.rb +34 -5
- data/lib/pdf/reader/register_receiver.rb +9 -0
- metadata +2 -2
data/CHANGELOG
CHANGED
@@ -1,10 +1,18 @@
-v0.6.
+v0.6.1 (12th March 2008)
+- Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
+just replace each character with a little box.
+- Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
+NoMethodError.
+- Added a method to RegisterReceiver that returns all occurances of a callback
+
+v0.6.0 (27th February 2008)
 - all text is now transparently converted to UTF-8 before being passed to the callbacks.
 before this version, text was just passed as a byte level copy of what was in the PDF file, which
 was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
 - Fonts that use a difference table are now handled correctly
 - fixed some 1.9 incompatible syntax
 - expanded RegisterReceiver class to record extra info
+- expanded rspec coverage
 - tweaked a README example
 
 v0.5.1 (1st January 2008)
data/README
CHANGED
@@ -206,6 +206,10 @@ layout of the file, not the order objects are displayed to the user. As a
 consequence of this it is highly unlikely that text will be completely in
 order.
 
+Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
+of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
+an unrecognisable character.
+
 = Resources
 
 - PDF::Reader Homepage: http://software.pmade.com/pdfreader
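The README note above says unrecognisable characters come out as a box. A minimal sketch of how a caller might spot those boxes in extracted text, assuming the box is U+25AF (the value given to UNKNOWN_CHAR in the encoding.rb changes of this diff). The helper name `unknown_char_count` and the sample string are hypothetical and not part of pdf-reader.

```ruby
# U+25AF WHITE VERTICAL RECTANGLE, built with pack so it works on old and new Rubies
BOX = [0x25AF].pack("U*")

# Hypothetical helper: count placeholder boxes in a chunk of extracted text
def unknown_char_count(text)
  text.scan(BOX).length
end

# Made-up sample standing in for text handed to a receiver callback
sample = "Invoice " + BOX + BOX + " total"
puts unknown_char_count(sample) # prints 2
```

A non-zero count is a hint that the PDF stored some text in a way that could not be mapped to Unicode, so the extracted string should not be trusted verbatim.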
data/Rakefile
CHANGED
data/TODO
CHANGED
@@ -3,13 +3,17 @@ v0.7
 interested in meta data or bookmarks, there's no point in walking the pages tree.
 - maybe a third option to Reader.parse?
 parse(io, receiver, {:pages => true, :fonts => false, :metadata => true, :bookmarks => false})
+- detect when a font's encoding is a CMap (generally used for pre-Unicode, multibyte asian encodings), and display a user friendly error
+- When parsing a CMap into a ruby object, recognise ranged mappings defined by begincodespacerange (see spec, section 5.9.2)
+- Provide a way to get raw access to a particular object. Good for testing purposes
 
+v0.8
 - Tweak encoding mappings to differentiate between bytes that are invalid for an encoding, and bytes that are unchanged.
 poppler seems to do this in a quite reasonable way. Original Encoding -> Glyph Names -> Unicode. As of 0.6 we go straight
 from the Original encoding to Unicode.
 
 v0.9
-- Support for CJK text (convert to UTF-8 like all other encodings)
+- Support for CJK text (convert to UTF-8 like all other encodings. See Section 5.9 of the PDF spec)
 - Add a way to extract raster images
 
 
@@ -17,6 +21,9 @@ Sometime
 - Ship some extra receivers in the standard package, particuarly ones that are useful for running
 rspec over generated PDF files
 
+- When we encounter Identity-H encoded text with no ToUnicode CMap, render the glyphs and treat them as images, as there's no
+sensible way to convert them to unicode
+
 - Improve metadata support
 
 - Add support for additional filters: ASCIIHexDecode, ASCII85Decode, LZWDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode, Crypt?
data/lib/pdf/reader/encoding.rb
CHANGED
@@ -28,6 +28,8 @@ require 'enumerator'
 class PDF::Reader
 class Encoding
 
+UNKNOWN_CHAR = 0x25AF # ▯
+
 attr_reader :differences
 
 # set the differences table for this encoding. should be an array in the following format:
@@ -103,17 +105,26 @@ class PDF::Reader
 
 class IdentityH < Encoding
 def to_utf8(str, map = nil)
-
-
+
 array_enc = []
 
 # iterate over string, reading it in 2 byte chunks and interpreting those
 # chunks as ints
 str.unpack("n*").each do |c|
-# convert the int to a unicode codepoint
-
+# convert the int to a unicode codepoint if possible.
+# without a ToUnicode CMap, it's impossible to reliably convert this text
+# to unicode, so just replace each character with a little box. Big smacks
+# the the PDF producing app.
+if map
+array_enc << map.decode(c)
+else
+array_enc << PDF::Reader::Encoding::UNKNOWN_CHAR
+end
 end
-
+
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
 
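The IdentityH change above can be tried outside the gem. What follows is a rough standalone sketch of the same flow: read the string as 2-byte big-endian integers, decode each through a map when one is supplied, fall back to the box otherwise, then pack to UTF-8. `FakeMap` and `identity_h_to_utf8` are hypothetical names invented for this illustration. The only thing assumed about the real map object is that it responds to `decode` and may return nil, which is what the diff implies.

```ruby
UNKNOWN_CHAR = 0x25AF # same codepoint as the constant added in this diff

# Hypothetical stand-in for a ToUnicode CMap: a lookup table with a decode method
class FakeMap
  def initialize(table)
    @table = table
  end

  # returns nil when the code has no mapping
  def decode(code)
    @table[code]
  end
end

# Hypothetical free-standing version of the to_utf8 logic shown in the hunk
def identity_h_to_utf8(str, map = nil)
  # "n*" reads the bytes as a run of 16-bit big-endian unsigned ints
  codepoints = str.unpack("n*").map do |c|
    map ? map.decode(c) : UNKNOWN_CHAR
  end
  # anything the map could not decode becomes the box as well
  codepoints.map! { |c| c || UNKNOWN_CHAR }
  # "U*" packs the codepoints into a UTF-8 string
  codepoints.pack("U*")
end

raw = [0x0001, 0x0002].pack("n*")  # two made-up glyph ids
map = FakeMap.new(0x0001 => 0x48)  # only the first id maps, to "H"

puts identity_h_to_utf8(raw, map)  # "H" followed by one box
puts identity_h_to_utf8(raw)       # two boxes, no map available
```

Either way, the result has one output character per 2-byte input chunk, so the length of the text is preserved even when its content cannot be recovered.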
@@ -300,6 +311,9 @@ class PDF::Reader
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
 
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
 
@@ -458,6 +472,9 @@ class PDF::Reader
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
 
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
 
@@ -532,6 +549,9 @@ class PDF::Reader
 
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
+
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
 
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
@@ -710,6 +730,9 @@ class PDF::Reader
 end
 end
 
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
 
@@ -770,6 +793,9 @@ class PDF::Reader
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
 
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
 
@@ -999,6 +1025,9 @@ class PDF::Reader
 # convert any glyph names to unicode codepoints
 array_enc = self.process_glyphnames(array_enc)
 
+# replace charcters that didn't convert to unicode nicely with something valid
+array_enc.collect! { |c| c ? c : PDF::Reader::Encoding::UNKNOWN_CHAR }
+
 # pack all our Unicode codepoints into a UTF-8 string
 ret = array_enc.pack("U*")
 
data/lib/pdf/reader/register_receiver.rb
CHANGED
@@ -22,6 +22,15 @@ class PDF::Reader
 return counter
 end
 
+# return the details for every time the specified callback was fired
+def all(methodname)
+ret = []
+callbacks.each do |cb|
+ret << cb if cb[:name] == methodname
+end
+return ret
+end
+
 # return the details for the first time the specified callback was fired
 def first_occurance_of(methodname)
 callbacks.each do |cb|
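The new `all` method filters the recorded callbacks by name. A small sketch of that behaviour, using a cut-down stand-in class rather than the real RegisterReceiver. `MiniReceiver`, its `record` method and the `:args` key are assumptions made for this illustration. The diff itself only confirms that each recorded entry has a `:name` key and that `callbacks` is enumerable.

```ruby
# Hypothetical, cut-down stand-in for RegisterReceiver
class MiniReceiver
  attr_reader :callbacks

  def initialize
    @callbacks = []
  end

  # Hypothetical recorder; the real class is driven by the parser instead
  def record(name, *args)
    @callbacks << { :name => name, :args => args }
  end

  # Same filtering as the method added in the diff, written with select
  def all(methodname)
    callbacks.select { |cb| cb[:name] == methodname }
  end
end

receiver = MiniReceiver.new
receiver.record(:show_text, "Hello")
receiver.record(:begin_page)
receiver.record(:show_text, "World")

# Every occurrence comes back in the order it was recorded
receiver.all(:show_text).each { |cb| puts cb[:args].inspect }
```

In a spec this makes it possible to assert on every firing of a callback, not just the first one that `first_occurance_of` returns.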
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-version:
+version: 0.6.1
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2008-
+date: 2008-03-12 00:00:00 +11:00
 default_executable:
 dependencies: []
 