pdf-reader 0.7.3 → 0.7.4
- data/CHANGELOG +6 -1
- data/README.rdoc +47 -12
- data/Rakefile +1 -1
- data/TODO +2 -0
- data/lib/pdf/reader/parser.rb +11 -1
- metadata +3 -3
data/CHANGELOG
CHANGED
@@ -1,4 +1,9 @@
-v0.7.
+v0.7.4 (7th August 2008)
+- Raise a MalformedPDFError if a content stream contains an unterminated string
+- Fix an bug that was causing an endless loop on some OSX systems
+  - valid strings were incorrectly thought to be unterminated
+
+v0.7.3 (11th June 2008)
 - Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
 - Fix a hard loop bug caused by a content stream that is missing a final operator
 - Significantly simplified the internal code for encoding conversions
data/README.rdoc
CHANGED
@@ -18,7 +18,7 @@ The recommended installation method is via Rubygems.
 
 PDF::Reader is designed with a callback-style architecture. The basic concept
 is to build a receiver class and pass that into PDF::Reader along with the PDF
-to process.
+to process.
 
 As PDF::Reader walks the file and encounters various objects (pages, text,
 images, shapes, etc) it will call methods on the receiver class. What those
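The callback flow described in this README hunk can be illustrated without the gem. The following is a minimal sketch, not PDF::Reader's actual implementation: `ToyWalker` and `PageCounter` are hypothetical names, and the `respond_to?` check is an assumption about how unimplemented callbacks might be skipped.

```ruby
# Hypothetical stand-in for the component that walks a PDF.
# PDF::Reader's real internals are not shown in this diff and may differ.
class ToyWalker
  def initialize(receiver)
    @receiver = receiver
  end

  # Fire a callback only if the receiver implements it (an assumption).
  def emit(name, *args)
    @receiver.send(name, *args) if @receiver.respond_to?(name)
  end

  # Each element of +pages+ is the list of strings "drawn" on one page.
  def walk(pages)
    pages.each do |strings|
      emit(:begin_page)
      strings.each { |s| emit(:show_text, s) }
      emit(:end_page)
    end
  end
end

# A receiver that only cares about one callback.
class PageCounter
  attr_reader :page_count

  def initialize
    @page_count = 0
  end

  # Called when page parsing ends
  def end_page
    @page_count += 1
  end
end

counter = PageCounter.new
ToyWalker.new(counter).walk([["Hello"], ["World", "!"]])
puts counter.page_count # prints 2
```

The receiver stays small because it defines only the callbacks it needs; everything else the walker emits is ignored.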
@@ -37,22 +37,22 @@ text will be converted to UTF-8 before it is passed back from PDF::Reader.
 
 = Exceptions
 
-There are two key exceptions that you will need to watch out for when processing a
+There are two key exceptions that you will need to watch out for when processing a
 PDF file:
 
-MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
-file should be valid, or that a corrupt file didn't raise an exception, please
+MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
+file should be valid, or that a corrupt file didn't raise an exception, please
 forward a copy of the file to the maintainers and we can attempt improve the code.
 
-UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
-support. Again, we welcome submissions of PDF files that exhibit these features to help
+UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
+support. Again, we welcome submissions of PDF files that exhibit these features to help
 us with future code improvements.
 
 MalformedPDFError has some subclasses if you want to detect finer grained issues. If you
 don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 
 Any other exceptions should be considered bugs in either PDF::Reader (please
-report it!) your receiver (please don't report it!).
+report it!) or your receiver (please don't report it!).
 
 = Maintainers
 
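The README's point that rescuing MalformedPDFError also catches its subclasses is ordinary Ruby exception behaviour. A minimal sketch follows, using stand-in classes so it runs without the gem: the `ToyPDF` module, `ToySubclassError` and the `classify` helper are all hypothetical, and the superclass chosen for the stand-ins is an assumption.

```ruby
# Stand-in classes, defined here only so the sketch runs on its own.
# They mirror the hierarchy the README describes; names are hypothetical.
module ToyPDF
  class MalformedPDFError < RuntimeError; end
  class ToySubclassError < MalformedPDFError; end # a finer grained error
  class UnsupportedFeatureError < RuntimeError; end
end

# Run the block and report which of the two key exceptions, if any, it raised.
# Anything else propagates, matching the README's "treat it as a bug" advice.
def classify
  yield
  :ok
rescue ToyPDF::MalformedPDFError
  :malformed
rescue ToyPDF::UnsupportedFeatureError
  :unsupported
end

puts classify { raise ToyPDF::ToySubclassError, "corrupt object" }   # prints malformed
puts classify { raise ToyPDF::UnsupportedFeatureError, "encrypted" } # prints unsupported
puts classify { :parsed }                                            # prints ok
```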
@@ -80,9 +80,9 @@ A simple app to count the number of pages in a PDF File.
     attr_accessor :page_count
 
     def initialize
-      @page_count = 0
+      @page_count = 0
     end
-
+
     # Called when page parsing ends
     def end_page
       @page_count += 1
@@ -97,7 +97,7 @@ A simple app to count the number of pages in a PDF File.
 
 WARNING: this will generate a *lot* of output, so you probably want to pipe
 it through less or to a text file.
-
+
   require 'rubygems'
   require 'pdf/reader'
 
@@ -107,7 +107,42 @@ it through less or to a text file.
     puts cb
   end
 
-== Extract
+== Extract all text from a single PDF
+
+  class PageTextReceiver
+    attr_accessor :content
+
+    def initialize
+      @content = []
+    end
+
+    # Called when page parsing starts
+    def begin_page(arg = nil)
+      @content << ""
+    end
+
+    # record text that is drawn on the page
+    def show_text(string, *params)
+      @content.last << string.strip
+    end
+
+    # there's a few text callbacks, so make sure we process them all
+    alias :super_show_text :show_text
+    alias :move_to_next_line_and_show_text :show_text
+    alias :set_spacing_next_line_show_text :show_text
+
+    # this final text callback takes slightly different arguments
+    def show_text_with_positioning(*params)
+      params = params.first
+      params.each { |str| show_text(str) if str.kind_of?(String)}
+    end
+  end
+
+  receiver = PageTextReceiver.new
+  pdf = PDF::Reader.file("somefile.pdf", receiver)
+  puts receiver.content.inspect
+
+== Extract metadata only
 
   require 'rubygems'
   require 'pdf/reader'
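To see what the new PageTextReceiver example accumulates, the callbacks can be invoked by hand instead of through PDF::Reader. This sketch makes two assumptions: that `show_text_with_positioning` receives a single array mixing strings with numeric adjustments (which the `kind_of?(String)` filter suggests), and that swapping `""` for `String.new` is acceptable, which it does only to guarantee a mutable buffer on newer Rubies. The aliases are omitted for brevity.

```ruby
# Trimmed copy of the README's receiver, driven manually below.
class PageTextReceiver
  attr_accessor :content

  def initialize
    @content = []
  end

  # Start a fresh, mutable buffer for each page.
  def begin_page(arg = nil)
    @content << String.new
  end

  # Append the stripped text to the current page's buffer.
  def show_text(string, *params)
    @content.last << string.strip
  end

  # Keep only the String entries of the array argument.
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end

receiver = PageTextReceiver.new
receiver.begin_page
receiver.show_text("  Hello ")
receiver.show_text_with_positioning(["Wor", -120, "ld"]) # -120 is a made-up adjustment
receiver.begin_page
receiver.show_text("Page two")
puts receiver.content.inspect # prints ["HelloWorld", "Page two"]
```

Because every fragment is stripped before being appended, text from separate callbacks is joined with no separator, so a caller who needs word boundaries would have to add them.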
@@ -150,7 +185,7 @@ A simple app to display the number of pages in a PDF File.
   pdf = PDF::Reader.file("somefile.pdf", receiver, :pages => false)
   puts "#{receiver.pages} pages"
 
-== Basic RSpec of a generated PDF
+== Basic RSpec of a generated PDF
 
   require 'rubygems'
   require 'pdf/reader'
data/Rakefile
CHANGED
data/TODO
CHANGED
data/lib/pdf/reader/parser.rb
CHANGED
@@ -118,11 +118,21 @@ class PDF::Reader
 
       while count != 0
         @buffer.ready_token(false, false)
-
+
+        # find the first occurance of ( ) [ \ or ]
+        #
+        # we used to use the following line, but it fails sometimes
+        # under OSX.
+        # i = @buffer.raw.index(/[\\\(\)]/)
+        i = @buffer.raw.unpack("C*").index { |n| [40, 41, 91, 92, 93].include?(n) }
 
         if i.nil?
           str << @buffer.raw + "\n"
           @buffer.raw.replace("")
+          # if a content stream opens a string, but never closes it, we'll
+          # hit the end of the stream and still be appending stuff to the
+          # string. bad! This check prevents a hard loop.
+          raise MalformedPDFError, 'unterminated string in content stream' if @buffer.eof?
           next
         end
 
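The two parser changes in this hunk can be tried in isolation. The sketch below is a toy, not the library's parser: `first_delimiter_index`, `read_literal` and `UnterminatedStringError` are hypothetical names, backslash escapes are deliberately ignored, and the input is a plain array of lines rather than PDF::Reader's buffer. Note that the byte list (40, 41, 91, 92, 93) covers `(`, `)`, `[`, `\` and `]`, whereas the commented-out regex matched only `\`, `(` and `)`.

```ruby
DELIMITER_BYTES = [40, 41, 91, 92, 93].freeze # ( ) [ \ ]

# Byte-wise search, the same technique as the new parser line.
def first_delimiter_index(raw)
  raw.unpack("C*").index { |n| DELIMITER_BYTES.include?(n) }
end

# Stand-in for MalformedPDFError so the sketch needs no gem.
class UnterminatedStringError < StandardError; end

# Toy literal-string reader. +lines+ holds the rest of a content stream,
# positioned just after an opening "(". Escapes are not handled.
def read_literal(lines)
  out = String.new
  depth = 1 # the opening paren has already been consumed
  lines.each do |line|
    rest = line.b # binary copy, so byte offsets equal string offsets
    until rest.empty?
      i = first_delimiter_index(rest)
      if i.nil?
        out << rest
        break
      end
      out << rest[0, i]
      char = rest[i, 1]
      rest = rest[(i + 1)..-1]
      case char
      when "(" then depth += 1
      when ")"
        depth -= 1
        return out if depth.zero?
      end
      out << char
    end
    out << "\n"
  end
  # Ran out of input while the string was still open: fail instead of looping.
  raise UnterminatedStringError, "unterminated string in content stream"
end

puts first_delimiter_index("abc(def")                   # prints 3
puts read_literal(["abc (nested) def) Tj"])             # prints abc (nested) def
begin
  read_literal(["never closed", "still open"])
rescue UnterminatedStringError => e
  puts e.message # prints unterminated string in content stream
end
```

Raising once the input is exhausted gives callers a single error to rescue for this kind of corrupt stream, which is the behaviour the new CHANGELOG entry describes.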
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version: 0.7.
+  version: 0.7.4
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2008-
+date: 2008-08-07 00:00:00 +10:00
 default_executable:
 dependencies: []
 
@@ -84,7 +84,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []
 
 rubyforge_project: pdf-reader
-rubygems_version: 1.
+rubygems_version: 1.2.0
 signing_key:
 specification_version: 2
 summary: A library for accessing the content of PDF files