pdf-reader 0.7.3 → 0.7.4
- data/CHANGELOG +6 -1
- data/README.rdoc +47 -12
- data/Rakefile +1 -1
- data/TODO +2 -0
- data/lib/pdf/reader/parser.rb +11 -1
- metadata +3 -3
data/CHANGELOG
CHANGED
@@ -1,4 +1,9 @@
-v0.7.
+v0.7.4 (7th August 2008)
+- Raise a MalformedPDFError if a content stream contains an unterminated string
+- Fix an bug that was causing an endless loop on some OSX systems
+- valid strings were incorrectly thought to be unterminated
+
+v0.7.3 (11th June 2008)
 - Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
 - Fix a hard loop bug caused by a content stream that is missing a final operator
 - Significantly simplified the internal code for encoding conversions
data/README.rdoc
CHANGED
@@ -18,7 +18,7 @@ The recommended installation method is via Rubygems.
 
 PDF::Reader is designed with a callback-style architecture. The basic concept
 is to build a receiver class and pass that into PDF::Reader along with the PDF
-to process.
+to process.
 
 As PDF::Reader walks the file and encounters various objects (pages, text,
 images, shapes, etc) it will call methods on the receiver class. What those
@@ -37,22 +37,22 @@ text will be converted to UTF-8 before it is passed back from PDF::Reader.
 
 = Exceptions
 
-There are two key exceptions that you will need to watch out for when processing a
+There are two key exceptions that you will need to watch out for when processing a
 PDF file:
 
-MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
-file should be valid, or that a corrupt file didn't raise an exception, please
+MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
+file should be valid, or that a corrupt file didn't raise an exception, please
 forward a copy of the file to the maintainers and we can attempt improve the code.
 
-UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
-support. Again, we welcome submissions of PDF files that exhibit these features to help
+UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
+support. Again, we welcome submissions of PDF files that exhibit these features to help
 us with future code improvements.
 
 MalformedPDFError has some subclasses if you want to detect finer grained issues. If you
 don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 
 Any other exceptions should be considered bugs in either PDF::Reader (please
-report it!) your receiver (please don't report it!).
+report it!) or your receiver (please don't report it!).
 
 = Maintainers
 
@@ -80,9 +80,9 @@ A simple app to count the number of pages in a PDF File.
 attr_accessor :page_count
 
 def initialize
-  @page_count = 0
+  @page_count = 0
 end
-
+
 # Called when page parsing ends
 def end_page
   @page_count += 1
@@ -97,7 +97,7 @@ A simple app to count the number of pages in a PDF File.
 
 WARNING: this will generate a *lot* of output, so you probably want to pipe
 it through less or to a text file.
-
+
 require 'rubygems'
 require 'pdf/reader'
 
@@ -107,7 +107,42 @@ it through less or to a text file.
   puts cb
 end
 
-== Extract
+== Extract all text from a single PDF
+
+class PageTextReceiver
+  attr_accessor :content
+
+  def initialize
+    @content = []
+  end
+
+  # Called when page parsing starts
+  def begin_page(arg = nil)
+    @content << ""
+  end
+
+  # record text that is drawn on the page
+  def show_text(string, *params)
+    @content.last << string.strip
+  end
+
+  # there's a few text callbacks, so make sure we process them all
+  alias :super_show_text :show_text
+  alias :move_to_next_line_and_show_text :show_text
+  alias :set_spacing_next_line_show_text :show_text
+
+  # this final text callback takes slightly different arguments
+  def show_text_with_positioning(*params)
+    params = params.first
+    params.each { |str| show_text(str) if str.kind_of?(String)}
+  end
+end
+
+receiver = PageTextReceiver.new
+pdf = PDF::Reader.file("somefile.pdf", receiver)
+puts receiver.content.inspect
+
+== Extract metadata only
 
 require 'rubygems'
 require 'pdf/reader'
@@ -150,7 +185,7 @@ A simple app to display the number of pages in a PDF File.
 pdf = PDF::Reader.file("somefile.pdf", receiver, :pages => false)
 puts "#{receiver.pages} pages"
 
-== Basic RSpec of a generated PDF
+== Basic RSpec of a generated PDF
 
 require 'rubygems'
 require 'pdf/reader'
data/Rakefile
CHANGED
data/TODO
CHANGED
data/lib/pdf/reader/parser.rb
CHANGED
@@ -118,11 +118,21 @@ class PDF::Reader
 
 while count != 0
   @buffer.ready_token(false, false)
-
+
+  # find the first occurance of ( ) [ \ or ]
+  #
+  # we used to use the following line, but it fails sometimes
+  # under OSX.
+  # i = @buffer.raw.index(/[\\\(\)]/)
+  i = @buffer.raw.unpack("C*").index { |n| [40, 41, 91, 92, 93].include?(n) }
 
   if i.nil?
     str << @buffer.raw + "\n"
     @buffer.raw.replace("")
+    # if a content stream opens a string, but never closes it, we'll
+    # hit the end of the stream and still be appending stuff to the
+    # string. bad! This check prevents a hard loop.
+    raise MalformedPDFError, 'unterminated string in content stream' if @buffer.eof?
     next
   end
 
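The byte scan that replaces the regex lookup can be exercised in isolation. This is a minimal sketch, not the parser's code: the constant DELIMITERS and the helper first_delimiter_index are hypothetical names introduced here, and only the unpack("C*").index expression is taken from the diff.

```ruby
# Byte values of the characters the new code scans for: ( ) [ \ ]
DELIMITERS = [40, 41, 91, 92, 93].freeze

# Hypothetical helper: byte offset of the first delimiter in raw,
# or nil when the string contains none of them.
def first_delimiter_index(raw)
  raw.unpack("C*").index { |n| DELIMITERS.include?(n) }
end

p first_delimiter_index("abc(def")       # opening parenthesis
p first_delimiter_index("a\\b")          # backslash
p first_delimiter_index("no delimiters") # nothing to find
```

The result is an offset into the unpacked bytes, so it matches a character index only for single-byte data. The commented-out regex in the diff matches only backslash and the two parentheses, whereas the byte list also includes the square brackets.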
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader
 version: !ruby/object:Gem::Version
-  version: 0.7.
+  version: 0.7.4
 platform: ruby
 authors:
 - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2008-
+date: 2008-08-07 00:00:00 +10:00
 default_executable:
 dependencies: []
 
@@ -84,7 +84,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []
 
 rubyforge_project: pdf-reader
-rubygems_version: 1.
+rubygems_version: 1.2.0
 signing_key:
 specification_version: 2
 summary: A library for accessing the content of PDF files