pdf-reader 0.10.1 → 0.11.0.alpha

Sign up to get free protection for your applications and to get access to all the features.
data/CHANGELOG CHANGED
@@ -1,7 +1,4 @@
1
- v0.10.1 (20th October 2011)
2
- - simple license change to glyph data file, no code changes
3
-
4
- v0.10.0 (6th July 2011)
1
+ v0.9.4 (XXX)
5
2
  - support multiple receivers within a single pass over a source file
6
3
  - massive time saving when dealing with multiple receivers
7
4
 
@@ -6,7 +6,7 @@ degree of flexibility.
6
6
 
7
7
  The PDF 1.7 specification is a weighty document and not all aspects are
8
8
  currently supported. I welcome submission of PDF files that exhibit
9
- unsupported aspects of the spec to assist with improving out support.
9
+ unsupported aspects of the spec to assist with improving our support.
10
10
 
11
11
  = Installation
12
12
 
@@ -16,22 +16,37 @@ The recommended installation method is via Rubygems.
16
16
 
17
17
  = Usage
18
18
 
19
- PDF::Reader is designed with a callback-style architecture. The basic concept
20
- is to build a receiver class and pass that into PDF::Reader along with the PDF
21
- to process.
19
+ Begin by creating a PDF::Reader instance that points to a PDF file. Document
20
+ level information (metadata, page count, bookmarks, etc) is available via
21
+ this object.
22
22
 
23
- As PDF::Reader walks the file and encounters various objects (pages, text,
24
- images, shapes, etc) it will call methods on the receiver class. What those
25
- methods do is entirely up to you - save the text, extract images, count pages,
26
- read metadata, whatever.
23
+ reader = PDF::Reader.new("somefile.pdf")
27
24
 
28
- For a full list of the supported callback methods and a description of when they
29
- will be called, refer to PDF::Reader::PagesStrategy. See the examples directory for a
30
- way to print a list of all the callbacks generated by a file to STDOUT.
25
+ puts reader.pdf_version
26
+ puts reader.info
27
+ puts reader.metadata
28
+ puts reader.page_count
31
29
 
32
- There is also a class called PDF::Reader::ObjectHash. This provides direct
33
- access to the objects in a PDF file using a ruby hash-like API. Checkout the
34
- documentation for the class for further information.
30
+ PDF is a page based file format, so most visible information is available via
31
+ page-based iteration
32
+
33
+ reader = PDF::Reader.new("somefile.pdf")
34
+
35
+ reader.pages.each do |page|
36
+ puts page.fonts
37
+ puts page.images
38
+ puts page.text
39
+ end
40
+
41
+ For low level access to the objects in a PDF file, use the ObjectHash class. You can
42
+ build an ObjectHash instance directly:
43
+
44
+ puts PDF::Reader::ObjectHash.new("somefile.pdf")
45
+
46
+ or via a PDF::Reader instance:
47
+
48
+ reader = PDF::Reader.new("somefile.pdf")
49
+ puts reader.objects
35
50
 
36
51
  = Text Encoding
37
52
 
@@ -60,7 +75,7 @@ MalformedPDFError has some subclasses if you want to detect finer grained issues
60
75
  don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
61
76
 
62
77
  Any other exceptions should be considered bugs in either PDF::Reader (please
63
- report it!) or your receiver (please don't report it!).
78
+ report it!).
64
79
 
65
80
  = Maintainers
66
81
 
@@ -86,11 +101,6 @@ Check out the examples/ directory for a few files.
86
101
 
87
102
  = Known Limitations
88
103
 
89
- The order of the callbacks is unpredictable, and is dependent on the internal
90
- layout of the file, not the order objects are displayed to the user. As a
91
- consequence of this it is highly unlikely that text will be completely in
92
- order.
93
-
94
104
  Occasionally some text cannot be extracted properly due to the way it has been
95
105
  stored, or the use of invalid bytes. In these cases PDF::Reader will output a
96
106
  little UTF-8 friendly box to indicate an unrecognisable character.
@@ -98,6 +108,5 @@ little UTF-8 friendly box to indicate an unrecognisable character.
98
108
  = Resources
99
109
 
100
110
  - PDF::Reader Code Repository: http://github.com/yob/pdf-reader
101
- - PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
102
111
  - PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
103
112
  - PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html
@@ -5,41 +5,11 @@ $LOAD_PATH.unshift(File.dirname(__FILE__) + "/../lib")
5
5
 
6
6
  require 'pdf/reader'
7
7
 
8
- class PageTextReceiver
9
- attr_accessor :content
10
-
11
- # Called when page parsing starts
12
- def end_page(arg = nil)
13
- if @content
14
- puts @content
15
- @content = nil
16
- puts
17
- end
18
- end
19
-
20
- def show_text(*params)
21
- @content = "" if @content.nil?
22
- params.each do |str|
23
- @content << str.to_s
24
- end
25
- end
26
-
27
- # there's a few text callbacks, so make sure we process them all
28
- alias :super_show_text :show_text
29
- alias :move_to_next_line_and_show_text :show_text
30
- alias :set_spacing_next_line_show_text :show_text
31
-
32
- def show_text_with_positioning(*params)
33
- params = params.first
34
- params ||= []
35
- params.each { |str| show_text(str) if str.kind_of?(String)}
36
- end
37
- end
38
-
39
- receiver = PageTextReceiver.new
40
-
41
8
  if ARGV.empty?
42
- PDF::Reader.new.parse($stdin, receiver)
9
+ browser = PDF::Reader.new($stdin)
43
10
  else
44
- PDF::Reader.file(ARGV[0], receiver)
11
+ browser = PDF::Reader.new(ARGV[0])
12
+ end
13
+ browser.pages.each do |page|
14
+ puts page.text
45
15
  end
@@ -1,7 +1,7 @@
1
1
  #!/usr/bin/env ruby
2
2
  # coding: utf-8
3
3
 
4
- # List all callbacks generated by a single PDF
4
+ # List all callbacks generated by each page
5
5
  #
6
6
  # WARNING: this will generate a *lot* of output, so you probably want to pipe
7
7
  # it through less or to a text file.
@@ -10,7 +10,12 @@ require 'rubygems'
10
10
  require 'pdf/reader'
11
11
 
12
12
  receiver = PDF::Reader::RegisterReceiver.new
13
- pdf = PDF::Reader.file("somefile.pdf", receiver)
14
- receiver.callbacks.each do |cb|
15
- puts cb
13
+
14
+ PDF::Reader.open("somefile.pdf") do |reader|
15
+ reader.pages.each do |page|
16
+ page.walk(receiver)
17
+ receiver.callbacks.each do |cb|
18
+ puts cb
19
+ end
20
+ end
16
21
  end
@@ -10,26 +10,20 @@
10
10
  # the number is to look for words that match a pattern.
11
11
  #
12
12
  # This example attempts to extract numbers using the Acrobat 9 syntax.
13
- # As a fall back, you can provide a regular expression that will be
14
- # used to look for words that look like the numbers you expect in the
15
- # page content.
13
+ # As a fall back, you can use a regular expression to look for words
14
+ # that match the numbers you expect in the page content.
16
15
 
17
16
  require 'rubygems'
18
17
  require 'pdf/reader'
19
18
 
20
19
  class BatesReceiver
21
20
 
22
- def initialize(regexp = nil)
23
- @numbers = []
24
- @backup = []
25
- @regexp = regexp
26
- end
21
+ attr_reader :numbers
27
22
 
28
- def numbers
29
- @numbers.size > 0 ? @numbers : @backup
23
+ def initialize
24
+ @numbers = []
30
25
  end
31
26
 
32
- # Called when page parsing starts
33
27
  def begin_marked_content(*args)
34
28
  return unless args.size >= 2
35
29
  return unless args.first == :Artifact
@@ -39,25 +33,17 @@ class BatesReceiver
39
33
  end
40
34
  alias :begin_marked_content_with_pl :begin_marked_content
41
35
 
42
- # record text that is drawn on the page
43
- def show_text(string, *params)
44
- return if @regexp.nil?
45
-
46
- string.scan(@regexp).each { |m| @backup << m }
47
- end
36
+ end
48
37
 
49
- # there's a few text callbacks, so make sure we process them all
50
- alias :super_show_text :show_text
51
- alias :move_to_next_line_and_show_text :show_text
52
- alias :set_spacing_next_line_show_text :show_text
53
38
 
54
- # this final text callback takes slightly different arguments
55
- def show_text_with_positioning(*params)
56
- params = params.first
57
- params.each { |str| show_text(str) if str.kind_of?(String)}
39
+ PDF::Reader.open("bates.pdf") do |reader|
40
+ reader.pages.each do |page|
41
+ receiver = BatesReceiver.new
42
+ page.walk(receiver)
43
+ if receiver.numbers.empty?
44
+ puts page.scan(/CC.+/)
45
+ else
46
+ puts receiver.numbers.inspect
47
+ end
58
48
  end
59
49
  end
60
-
61
- receiver = BatesReceiver.new(/CC.+/)
62
- PDF::Reader.file("bates.pdf", receiver)
63
- puts receiver.numbers.inspect
@@ -1,6 +1,7 @@
1
1
  ################################################################################
2
2
  #
3
3
  # Copyright (C) 2006 Peter J Jones (pjones@pmade.com)
4
+ # Copyright (C) 2011 James Healy
4
5
  #
5
6
  # Permission is hereby granted, free of charge, to any person obtaining
6
7
  # a copy of this software and associated documentation files (the
@@ -30,62 +31,102 @@ require 'ascii85'
30
31
 
31
32
  module PDF
32
33
  ################################################################################
33
- # The Reader class serves as an entry point for parsing a PDF file. There are three
34
- # ways to kick off processing - which one you pick will be based on personal preference
35
- # and the situation.
34
+ # The Reader class serves as an entry point for parsing a PDF file.
36
35
  #
37
- # For all examples, assume the receiver variable contains an object that will respond
38
- # to various callbacks. Refer to the README and PDF::Reader::Content for more information
39
- # on receivers.
36
+ # PDF is a page based file format. There is some data associated with the
37
+ # document (metadata, bookmarks, etc) but all visible content is stored
38
+ # under a Page object.
40
39
  #
41
- # = Parsing a file
40
+ # In most use cases for extracting and examining the contents of a PDF it
41
+ # makes sense to traverse the information using page based iteration.
42
42
  #
43
- # PDF::Reader.file("somefile.pdf", receiver)
43
+ # In addition to the documentation here, check out the
44
+ # PDF::Reader::Page class.
44
45
  #
45
- # = Parsing a String
46
+ # == File Metadata
46
47
  #
47
- # This is useful for processing a PDF that is already in memory
48
+ # reader = PDF::Reader.new("somefile.pdf")
48
49
  #
49
- # PDF::Reader.string(pdf_string, receiver)
50
+ # puts reader.pdf_version
51
+ # puts reader.info
52
+ # puts reader.metadata
53
+ # puts reader.page_count
50
54
  #
51
- # = Parsing an IO object
55
+ # == Iterating over page content
52
56
  #
53
- # This can be a useful alternative to the first 2 options in some situations
57
+ # reader = PDF::Reader.new("somefile.pdf")
54
58
  #
55
- # pdf = PDF::Reader.new
56
- # pdf.parse(File.new("somefile.pdf"), receiver)
59
+ # reader.pages.each do |page|
60
+ # puts page.fonts
61
+ # puts page.images
62
+ # puts page.text
63
+ # end
57
64
  #
58
- # = Parsing parts of a file
65
+ # == Extracting all text
59
66
  #
60
- # Both PDF::Reader#file and PDF::Reader#string accept a third argument that
61
- # specifies which parts of the file to process. By default, all options are
62
- # enabled, so this can be useful to cut down processing time if you're only
63
- # interested in say, metadata.
67
+ # reader = PDF::Reader.new("somefile.pdf")
64
68
  #
65
- # As an example, the following call will disable parsing the contents of
66
- # pages in the file, but explicitly enables processing metadata.
69
+ # reader.pages.map(&:text)
67
70
  #
68
- # PDF::Reader.new("somefile.pdf", receiver, {:metadata => true, :pages => false})
71
+ # == Extracting content from a single page
69
72
  #
70
- # Available options are currently:
73
+ # reader = PDF::Reader.new("somefile.pdf")
71
74
  #
72
- # :metadata
73
- # :pages
74
- # :raw_text
75
+ # page = reader.page(1)
76
+ # puts page.fonts
77
+ # puts page.images
78
+ # puts page.text
75
79
  #
76
- # = Processing with multiple receivers
80
+ # == Low level callbacks (ala current version of PDF::Reader)
77
81
  #
78
- # If you wish to parse a PDF file with multiple simultaneous receivers, just
79
- # pass an array of receivers as the second argument:
82
+ # reader = PDF::Reader.new("somefile.pdf")
80
83
  #
81
- # pdf = PDF::Reader.new
82
- # pdf.parse(File.new("somefile.pdf"), [receiver_one, receiever_two])
83
- #
84
- # This saves a significant amount of time by limiting the work to a single pass
85
- # over the source file.
84
+ # page = reader.page(1)
85
+ # page.walk(receiver)
86
86
  #
87
87
  class Reader
88
88
 
89
+ # lowlevel hash-like access to all objects in the underlying PDF
90
+ attr_reader :objects
91
+
92
+ attr_reader :page_count, :pdf_version, :info, :metadata
93
+
94
+ # creates a new document reader for the provided PDF.
95
+ #
96
+ # input can be an IO-ish object (StringIO, File, etc) containing a PDF
97
+ # or a filename
98
+ #
99
+ # reader = PDF::Reader.new("somefile.pdf")
100
+ #
101
+ # File.open("somefile.pdf","rb") do |file|
102
+ # reader = PDF::Reader.new(file)
103
+ # end
104
+ #
105
+ def initialize(input = nil)
106
+ if input # support the deprecated Reader API
107
+ @objects = PDF::Reader::ObjectHash.new(input)
108
+ @page_count = get_page_count
109
+ @pdf_version = @objects.pdf_version
110
+ @info = @objects.deref(@objects.trailer[:Info])
111
+ @metadata = get_metadata
112
+ end
113
+ end
114
+
115
+ # syntactic sugar for opening a PDF file. Accepts the same arguments
116
+ # as new().
117
+ #
118
+ # PDF::Reader.open("somefile.pdf") do |reader|
119
+ # puts reader.pdf_version
120
+ # end
121
+ #
122
+ def self.open(input, &block)
123
+ yield PDF::Reader.new(input)
124
+ end
125
+
126
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
127
+ # eventually be removed
128
+ #
129
+ #
89
130
  # Parse the file with the given name, sending events to the given receiver.
90
131
  #
91
132
  def self.file(name, receivers, opts = {})
@@ -94,6 +135,9 @@ module PDF
94
135
  end
95
136
  end
96
137
 
138
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
139
+ # eventually be removed
140
+ #
97
141
  # Parse the given string, sending events to the given receiver.
98
142
  #
99
143
  def self.string(str, receivers, opts = {})
@@ -102,6 +146,9 @@ module PDF
102
146
  end
103
147
  end
104
148
 
149
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
150
+ # eventually be removed
151
+ #
105
152
  # Parse the file with the given name, returning an unmarshalled ruby version of
106
153
  # represents the requested pdf object
107
154
  #
@@ -111,6 +158,9 @@ module PDF
111
158
  }
112
159
  end
113
160
 
161
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
162
+ # eventually be removed
163
+ #
114
164
  # Parse the given string, returning an unmarshalled ruby version of represents
115
165
  # the requested pdf object
116
166
  #
@@ -120,6 +170,48 @@ module PDF
120
170
  }
121
171
  end
122
172
 
173
+ # returns an array of PDF::Reader::Page objects, one for each
174
+ # page in the source PDF.
175
+ #
176
+ # reader = PDF::Reader.new("somefile.pdf")
177
+ #
178
+ # reader.pages.each do |page|
179
+ # puts page.fonts
180
+ # puts page.images
181
+ # puts page.text
182
+ # end
183
+ #
184
+ # See the docs for PDF::Reader::Page to read more about the
185
+ # methods available on each page
186
+ #
187
+ def pages
188
+ (1..@page_count).map { |num|
189
+ PDF::Reader::Page.new(@objects, num)
190
+ }
191
+ end
192
+
193
+ # returns a single PDF::Reader::Page for the specified page.
194
+ # Use this instead of pages method when you need to access just a single
195
+ # page
196
+ #
197
+ # reader = PDF::Reader.new("somefile.pdf")
198
+ # page = reader.page(10)
199
+ #
200
+ # puts page.text
201
+ #
202
+ # See the docs for PDF::Reader::Page to read more about the
203
+ # methods available on each page
204
+ #
205
+ def page(num)
206
+ num = num.to_i
207
+ raise ArgumentError, "valid pages are 1 .. #{@page_count}" if num < 1 || num > @page_count
208
+ PDF::Reader::Page.new(@objects, num)
209
+ end
210
+
211
+
212
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
213
+ # eventually be removed
214
+ #
123
215
  # Given an IO object that contains PDF data, parse it.
124
216
  #
125
217
  def parse(io, receivers, opts = {})
@@ -139,12 +231,15 @@ module PDF
139
231
  self
140
232
  end
141
233
 
234
+ # DEPRECATED: this method was deprecated in version 0.11.0 and will
235
+ # eventually be removed
236
+ #
142
237
  # Given an IO object that contains PDF data, return the contents of a single object
143
238
  #
144
239
  def object (io, id, gen)
145
- @ohash = ObjectHash.new(io)
240
+ @objects = ObjectHash.new(io)
146
241
 
147
- @ohash.object(Reference.new(id, gen))
242
+ @objects.deref(Reference.new(id, gen))
148
243
  end
149
244
 
150
245
  private
@@ -155,6 +250,21 @@ module PDF
155
250
  ::PDF::Reader::PagesStrategy
156
251
  ]
157
252
  end
253
+
254
+ def root
255
+ root ||= @objects.deref(@objects.trailer[:Root])
256
+ end
257
+
258
+ def get_metadata
259
+ stream = @objects.deref(root[:Metadata])
260
+ stream ? stream.unfiltered_data : nil
261
+ end
262
+
263
+ def get_page_count
264
+ pages = @objects.deref(root[:Pages])
265
+ pages[:Count]
266
+ end
267
+
158
268
  end
159
269
  end
160
270
  ################################################################################
@@ -168,6 +278,7 @@ require 'pdf/reader/filter'
168
278
  require 'pdf/reader/font'
169
279
  require 'pdf/reader/lzw'
170
280
  require 'pdf/reader/metadata_strategy'
281
+ require 'pdf/reader/object_cache'
171
282
  require 'pdf/reader/object_hash'
172
283
  require 'pdf/reader/object_stream'
173
284
  require 'pdf/reader/pages_strategy'
@@ -177,6 +288,8 @@ require 'pdf/reader/reference'
177
288
  require 'pdf/reader/register_receiver'
178
289
  require 'pdf/reader/stream'
179
290
  require 'pdf/reader/text_receiver'
291
+ require 'pdf/reader/page_text_receiver'
180
292
  require 'pdf/reader/token'
181
293
  require 'pdf/reader/xref'
294
+ require 'pdf/reader/page'
182
295
  require 'pdf/hash'