pdf-reader 0.7.3 → 0.7.4

Files changed (6)
  1. data/CHANGELOG +6 -1
  2. data/README.rdoc +47 -12
  3. data/Rakefile +1 -1
  4. data/TODO +2 -0
  5. data/lib/pdf/reader/parser.rb +11 -1
  6. metadata +3 -3
data/CHANGELOG CHANGED
@@ -1,4 +1,9 @@
- v0.7.3 (UNRELESED)
+ v0.7.4 (7th August 2008)
+ - Raise a MalformedPDFError if a content stream contains an unterminated string
+ - Fix an bug that was causing an endless loop on some OSX systems
+ - valid strings were incorrectly thought to be unterminated
+
+ v0.7.3 (11th June 2008)
  - Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
  - Fix a hard loop bug caused by a content stream that is missing a final operator
  - Significantly simplified the internal code for encoding conversions
@@ -18,7 +18,7 @@ The recommended installation method is via Rubygems.
 
  PDF::Reader is designed with a callback-style architecture. The basic concept
  is to build a receiver class and pass that into PDF::Reader along with the PDF
- to process.
+ to process.
 
  As PDF::Reader walks the file and encounters various objects (pages, text,
  images, shapes, etc) it will call methods on the receiver class. What those
@@ -37,22 +37,22 @@ text will be converted to UTF-8 before it is passed back from PDF::Reader.
 
  = Exceptions
 
- There are two key exceptions that you will need to watch out for when processing a
+ There are two key exceptions that you will need to watch out for when processing a
  PDF file:
 
- MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
- file should be valid, or that a corrupt file didn't raise an exception, please
+ MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the
+ file should be valid, or that a corrupt file didn't raise an exception, please
  forward a copy of the file to the maintainers and we can attempt improve the code.
 
- UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
- support. Again, we welcome submissions of PDF files that exhibit these features to help
+ UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently
+ support. Again, we welcome submissions of PDF files that exhibit these features to help
  us with future code improvements.
 
  MalformedPDFError has some subclasses if you want to detect finer grained issues. If you
  don't, 'rescue MalformedPDFError' will catch all the subclassed errors as well.
 
  Any other exceptions should be considered bugs in either PDF::Reader (please
- report it!) your receiver (please don't report it!).
+ report it!) or your receiver (please don't report it!).
 
  = Maintainers
 
@@ -80,9 +80,9 @@ A simple app to count the number of pages in a PDF File.
  attr_accessor :page_count
 
  def initialize
- @page_count = 0
+ @page_count = 0
  end
-
+
  # Called when page parsing ends
  def end_page
  @page_count += 1
@@ -97,7 +97,7 @@ A simple app to count the number of pages in a PDF File.
 
  WARNING: this will generate a *lot* of output, so you probably want to pipe
  it through less or to a text file.
-
+
  require 'rubygems'
  require 'pdf/reader'
 
@@ -107,7 +107,42 @@ it through less or to a text file.
  puts cb
  end
 
- == Extract metadata only
+ == Extract all text from a single PDF
+
+ class PageTextReceiver
+ attr_accessor :content
+
+ def initialize
+ @content = []
+ end
+
+ # Called when page parsing starts
+ def begin_page(arg = nil)
+ @content << ""
+ end
+
+ # record text that is drawn on the page
+ def show_text(string, *params)
+ @content.last << string.strip
+ end
+
+ # there's a few text callbacks, so make sure we process them all
+ alias :super_show_text :show_text
+ alias :move_to_next_line_and_show_text :show_text
+ alias :set_spacing_next_line_show_text :show_text
+
+ # this final text callback takes slightly different arguments
+ def show_text_with_positioning(*params)
+ params = params.first
+ params.each { |str| show_text(str) if str.kind_of?(String)}
+ end
+ end
+
+ receiver = PageTextReceiver.new
+ pdf = PDF::Reader.file("somefile.pdf", receiver)
+ puts receiver.content.inspect
+
+ == Extract metadata only
 
  require 'rubygems'
  require 'pdf/reader'
@@ -150,7 +185,7 @@ A simple app to display the number of pages in a PDF File.
  pdf = PDF::Reader.file("somefile.pdf", receiver, :pages => false)
  puts "#{receiver.pages} pages"
 
- == Basic RSpec of a generated PDF
+ == Basic RSpec of a generated PDF
 
  require 'rubygems'
  require 'pdf/reader'
data/Rakefile CHANGED
@@ -6,7 +6,7 @@ require 'rake/testtask'
  require "rake/gempackagetask"
  require 'spec/rake/spectask'
 
- PKG_VERSION = "0.7.3"
+ PKG_VERSION = "0.7.4"
  PKG_NAME = "pdf-reader"
  PKG_FILE_NAME = "#{PKG_NAME}-#{PKG_VERSION}"
 
data/TODO CHANGED
@@ -1,4 +1,6 @@
  v0.8
+ - optimise PDF::Reader::Reference#from_buffer
+ - ruby-prof shows the match() call in this function is a real killer
  - add extra callbacks
  - list implemented features
  - encrypted? tagged? bookmarks? annotated? optimised?
data/lib/pdf/reader/parser.rb CHANGED
@@ -118,11 +118,21 @@ class PDF::Reader
 
  while count != 0
  @buffer.ready_token(false, false)
- i = @buffer.raw.index(/[\\\(\)]/)
+
+ # find the first occurance of ( ) [ \ or ]
+ #
+ # we used to use the following line, but it fails sometimes
+ # under OSX.
+ # i = @buffer.raw.index(/[\\\(\)]/)
+ i = @buffer.raw.unpack("C*").index { |n| [40, 41, 91, 92, 93].include?(n) }
 
  if i.nil?
  str << @buffer.raw + "\n"
  @buffer.raw.replace("")
+ # if a content stream opens a string, but never closes it, we'll
+ # hit the end of the stream and still be appending stuff to the
+ # string. bad! This check prevents a hard loop.
+ raise MalformedPDFError, 'unterminated string in content stream' if @buffer.eof?
  next
  end
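
Note on the parser.rb change: the sketch below reproduces the new byte scan in isolation, as a rough illustration rather than the library's actual code path. `DELIMITER_BYTES`, `OLD_PATTERN` and `first_delimiter_index` are hypothetical names introduced here and are not part of pdf-reader. The values 40, 41, 91, 92 and 93 are the ASCII codes for ( ) [ \ and ], so the new scan also stops at square brackets, which the character class in the removed regex did not contain.

```ruby
# ASCII codes for the delimiters named in the parser comment: ( ) [ \ ]
DELIMITER_BYTES = [40, 41, 91, 92, 93].freeze

# The regex from the removed line, kept here only for comparison.
# Its character class matches backslash, "(" and ")".
OLD_PATTERN = /[\\\(\)]/

# Hypothetical helper (not part of pdf-reader): returns the byte offset of
# the first delimiter in +raw+, or nil when no delimiter is present.
def first_delimiter_index(raw)
  raw.unpack("C*").index { |byte| DELIMITER_BYTES.include?(byte) }
end

puts first_delimiter_index("abc(def").inspect    # prints 3
puts first_delimiter_index("plain text").inspect # prints nil
puts first_delimiter_index("a[b").inspect        # prints 1
puts "a[b".index(OLD_PATTERN).inspect            # prints nil
```

Because the scan runs over unpacked bytes, the value it returns is a byte offset into the buffer. The last two lines show that the two approaches are not strictly equivalent: a square bracket stops the byte scan but is ignored by the old pattern.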
 
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: pdf-reader
  version: !ruby/object:Gem::Version
- version: 0.7.3
+ version: 0.7.4
  platform: ruby
  authors:
  - Peter Jones
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2008-06-11 00:00:00 +10:00
+ date: 2008-08-07 00:00:00 +10:00
  default_executable:
  dependencies: []
 
@@ -84,7 +84,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements: []
 
  rubyforge_project: pdf-reader
- rubygems_version: 1.1.1
+ rubygems_version: 1.2.0
  signing_key:
  specification_version: 2
  summary: A library for accessing the content of PDF files