pdf-reader-turtletext 0.1.0 → 0.2.0
- data/.travis.yml +9 -1
- data/CHANGELOG +11 -0
- data/README.rdoc +106 -15
- data/lib/pdf/reader/turtletext.rb +52 -24
- data/lib/pdf/reader/turtletext/textangle.rb +89 -13
- data/lib/pdf/reader/turtletext/version.rb +1 -1
- data/pdf-reader-turtletext.gemspec +5 -2
- data/spec/fixtures/pdf_samples/expectations.yml +95 -0
- data/spec/fixtures/pdf_samples/simple_table_text.pdf +139 -0
- data/spec/integration/pdf_samples_spec.rb +28 -0
- data/spec/support/pdf_samples_helper.rb +23 -0
- data/spec/unit/reader/turtletext/textangle_spec.rb +193 -0
- data/spec/unit/reader/turtletext/turtletext_spec.rb +42 -37
- metadata +21 -18
data/.travis.yml
CHANGED
@@ -1,3 +1,11 @@
|
|
1
1
|
# These are specific configuration settings required for travis-ci
|
2
2
|
# see http://travis-ci.org/tardate/pdf-reader-turtletext
|
3
|
-
|
3
|
+
language: ruby
|
4
|
+
rvm:
|
5
|
+
- 1.8.7
|
6
|
+
- 1.9.2
|
7
|
+
- 1.9.3
|
8
|
+
- rbx-18mode
|
9
|
+
- rbx-19mode
|
10
|
+
- jruby-18mode
|
11
|
+
- jruby-19mode
|
data/CHANGELOG
ADDED
@@ -0,0 +1,11 @@
|
|
1
|
+
Version 0.2.0 Release: n/a
|
2
|
+
==================================================
|
3
|
+
* add bounding_box / textangle semantics
|
4
|
+
* improve documentation
|
5
|
+
* MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
|
6
|
+
|
7
|
+
Version 0.1.0 Release: 22nd July 2012
|
8
|
+
==================================================
|
9
|
+
* Initial packaging and release of core functionality directly extracted
|
10
|
+
from https://github.com/tardate/sps_bill_scanner/
|
11
|
+
* MRI 1.9 only
|
data/README.rdoc
CHANGED
@@ -14,30 +14,121 @@ For an example of how this is works in practice, see the
|
|
14
14
|
|
15
15
|
== Requirements and Known Limitations
|
16
16
|
|
17
|
-
*
|
18
|
-
* fixed dependency on PDF::Reader
|
17
|
+
* Tested with MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
|
18
|
+
* Has a fixed dependency on PDF::Reader v1.1.1
|
19
19
|
|
20
|
-
==
|
20
|
+
== The PDF::Reader::Turtletext Cookbook
|
21
21
|
|
22
|
-
|
22
|
+
=== How do I install it for normal use?
|
23
23
|
|
24
|
-
|
24
|
+
It is distributed as a gem, so all normal gem installation procedures apply. To install the
|
25
|
+
gem directly from the command line:
|
25
26
|
|
26
|
-
|
27
|
+
$ gem install pdf-reader-turtletext
|
27
28
|
|
28
|
-
|
29
|
-
such as <tt>text_position</tt> and <tt>text_in_region</tt>.
|
29
|
+
If you are using bundler or Rails, add to your Gemfile:
|
30
30
|
|
31
|
-
|
31
|
+
gem 'pdf-reader-turtletext'
|
32
32
|
|
33
|
+
Then bundle install:
|
34
|
+
|
35
|
+
$ bundle
|
36
|
+
|
37
|
+
=== How do I install it for gem development?
|
38
|
+
|
39
|
+
If you want to work on enhancements or fix bugs in PDF::Reader::Turtletext, fork and clone the GitHub repository. See the section below on 'Contributing to PDF::Reader::Turtletext'.
|
40
|
+
|
41
|
+
=== How to instantiate Turtletext in code
|
42
|
+
|
43
|
+
All interaction is done using an instance of the PDF::Reader::Turtletext class. It is
|
44
|
+
initialised given a filename or IO-like object, and any required options.
|
45
|
+
|
46
|
+
Typical usage:
|
47
|
+
|
48
|
+
pdf_filename = '../some_path/some.pdf'
|
33
49
|
reader = PDF::Reader::Turtletext.new(pdf_filename)
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
50
|
+
options = { :y_precision => 5 }
|
51
|
+
reader_with_options = PDF::Reader::Turtletext.new(pdf_filename,options)
|
52
|
+
|
53
|
+
=== How to extract text within a region described in relation to other text
|
54
|
+
|
55
|
+
Problem: we don't know exactly where the required text will be on the page, and it is not encoded
|
56
|
+
within the PDF as a single object. But we do know that it will be relatively positioned (for example)
|
57
|
+
below a certain bit of text, to the left of another, and above some other text.
|
58
|
+
|
59
|
+
Solution: use the <tt>bounding_box</tt> method to describe the region and extract the matching text.
|
60
|
+
|
61
|
+
textangle = reader.bounding_box do
|
62
|
+
page 1
|
63
|
+
below /electricity/i
|
64
|
+
above 10
|
65
|
+
right_of 240.0
|
66
|
+
left_of "Total ($)"
|
67
|
+
end
|
68
|
+
textangle.text
|
69
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
70
|
+
|
71
|
+
The range of methods that can be used within the <tt>bounding_box</tt> block are all optional, and include:
|
72
|
+
* <tt>page</tt> - specifies the PDF page from which to extract text (default is 1).
|
73
|
+
* <tt>below</tt> - a string, regex or number that describes the upper limit of the text box
|
74
|
+
(default is top border of the page).
|
75
|
+
* <tt>above</tt> - a string, regex or number that describes the lower limit of the text box
|
76
|
+
(default is bottom border of the page).
|
77
|
+
* <tt>left_of</tt> - a string, regex or number that describes the right limit of the text box
|
78
|
+
(default is right border of the page).
|
79
|
+
* <tt>right_of</tt> - a string, regex or number that describes the left limit of the text box
|
80
|
+
(default is left border of the page).
|
81
|
+
|
82
|
+
Note that <tt>left_of</tt> and <tt>right_of</tt> constraints do *not* need to be within the vertical
|
83
|
+
range of the box being described.
|
84
|
+
For example, you could use an element in the page header to describe the <tt>left_of</tt> limit
|
85
|
+
for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.
|
86
|
+
|
87
|
+
Similarly, <tt>above</tt> and <tt>below</tt> constraints do *not* need to be within the horizontal
|
88
|
+
range of the box being described.
|
89
|
+
|
90
|
+
=== Using a block parameter with the <tt>bounding_box</tt> method
|
91
|
+
|
92
|
+
An explicit block parameter may be used with the <tt>bounding_box</tt> method:
|
93
|
+
|
94
|
+
textangle = reader.bounding_box do |r|
|
95
|
+
r.below /electricity/i
|
96
|
+
r.left_of "Total ($)"
|
97
|
+
end
|
98
|
+
textangle.text
|
99
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
100
|
+
|
101
|
+
=== Extract text for a region with known positional co-ordinates
|
102
|
+
|
103
|
+
If you know (or can calculate) the x,y positions of the required text region, you can extract the region's
|
104
|
+
text using the <tt>text_in_region</tt> method.
|
105
|
+
|
106
|
+
text = reader.text_in_region(
|
107
|
+
10, # minimum x (left-most) (inclusive)
|
108
|
+
900, # maximum x (right-most) (inclusive)
|
109
|
+
200, # minimum y (bottom-most) (inclusive)
|
110
|
+
400, # maximum y (top-most) (inclusive)
|
111
|
+
1 # page
|
40
112
|
)
|
113
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
114
|
+
|
115
|
+
Note that the x,y origin is at the bottom-left of the page.
|
116
|
+
|
117
|
+
=== How to find the x,y co-ordinate of a specific text element
|
118
|
+
|
119
|
+
Problem: if you are doing low-level text extraction with <tt>text_in_region</tt> for example,
|
120
|
+
it is usually necessary to locate specific text to provide a positional reference.
|
121
|
+
|
122
|
+
Solution: use the <tt>text_position</tt> method to locate text by exact or partial match.
|
123
|
+
It returns a Hash of x/y co-ordinates that is the bottom-left corner of the text.
|
124
|
+
|
125
|
+
page = 1
|
126
|
+
text_by_exact_match = reader.text_position("Transaction Table", page)
|
127
|
+
=> { :x => 10.0, :y => 600.0 }
|
128
|
+
text_by_regex_match = reader.text_position(/transaction summary/i, page)
|
129
|
+
=> { :x => 10.0, :y => 300.0 }
|
130
|
+
|
131
|
+
Note: in the case of multiple matches, only the first match is returned.
|
41
132
|
|
42
133
|
|
43
134
|
== Contributing to PDF::Reader::Turtletext
|
data/lib/pdf/reader/turtletext.rb
CHANGED
@@ -16,6 +16,8 @@ class PDF::Reader::Turtletext
|
|
16
16
|
attr_reader :options
|
17
17
|
|
18
18
|
# +source+ is a file name or stream-like object
|
19
|
+
# Supported +options+ include:
|
20
|
+
# * :y_precision
|
19
21
|
def initialize(source, options={})
|
20
22
|
@options = options
|
21
23
|
@reader = PDF::Reader.new(source)
|
@@ -31,7 +33,7 @@ class PDF::Reader::Turtletext
|
|
31
33
|
end
|
32
34
|
|
33
35
|
# Returns positional (with fuzzed y positioning) text content collection as an array of rows:
|
34
|
-
#
|
36
|
+
# [ fuzzed_y_position, [[x_position,content]] ]
|
35
37
|
def content(page=1)
|
36
38
|
@content ||= []
|
37
39
|
if @content[page]
|
@@ -41,18 +43,24 @@ class PDF::Reader::Turtletext
|
|
41
43
|
end
|
42
44
|
end
|
43
45
|
|
44
|
-
# Returns
|
45
|
-
#
|
46
|
+
# Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content order by x position.
|
47
|
+
# [ fuzzed_y_position, [[x_position,content]] ]
|
46
48
|
# Given +input+ as a hash:
|
47
49
|
# { y_position: { x_position: content}}
|
48
50
|
# Fuzz factors: +y_precision+
|
49
51
|
def fuzzed_y(input)
|
50
|
-
output =
|
51
|
-
input.keys.sort.each do |precise_y|
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
52
|
+
output = []
|
53
|
+
input.keys.sort.reverse.each do |precise_y|
|
54
|
+
matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
|
55
|
+
y_index = output.index{|y| y.first == matching_y }
|
56
|
+
new_row_content = input[precise_y].to_a
|
57
|
+
if y_index
|
58
|
+
row_content = output[y_index].last
|
59
|
+
row_content += new_row_content
|
60
|
+
output[y_index] = [matching_y,row_content]
|
61
|
+
else
|
62
|
+
output << [matching_y,new_row_content]
|
63
|
+
end
|
56
64
|
end
|
57
65
|
output
|
58
66
|
end
|
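The row-grouping step that `fuzzed_y` performs can be tried in isolation. The following is a minimal standalone sketch, not the gem's code: the `group_rows` name is hypothetical, and `y_precision` is passed as an argument here rather than read from the reader's options.

```ruby
# Hypothetical standalone sketch of the y-fuzzing step; not the gem's API.
# input:  { y_position => { x_position => content } }
# output: [ [fuzzed_y, [[x_position, content], ...]], ... ] by decreasing y
def group_rows(input, y_precision = 3)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    # Reuse an existing row if its y is within y_precision of this one
    row = output.find { |y, _| (y - precise_y).abs < y_precision }
    if row
      row[1].concat(input[precise_y].to_a)
    else
      output << [precise_y, input[precise_y].to_a]
    end
  end
  output
end

rows = group_rows({ 20.0 => { 10.0 => "a first bit of text" },
                    18.0 => { 12.0 => "a second bit of text" } })
p rows
# => [[20.0, [[10.0, "a first bit of text"], [12.0, "a second bit of text"]]]]
```

Because rows are visited from the highest y downwards, the first (topmost) y value in a cluster becomes the row's fuzzed position.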
@@ -69,21 +77,24 @@ class PDF::Reader::Turtletext
|
|
69
77
|
end
|
70
78
|
|
71
79
|
# Returns an array of text elements found within the x,y limits,
|
80
|
+
# x ranges from +xmin+ (left of page) to +xmax+ (right of page)
|
81
|
+
# y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
|
72
82
|
# Each line of text found is returned as an array element.
|
73
83
|
# Each line of text is an array of the separate text elements found on that line.
|
74
84
|
# [["first line first text", "first line last text"],["second line text"]]
|
75
85
|
def text_in_region(xmin,xmax,ymin,ymax,page=1)
|
76
86
|
text_map = content(page)
|
77
87
|
box = []
|
78
|
-
|
88
|
+
|
89
|
+
text_map.each do |y,text_row|
|
79
90
|
if y >= ymin && y<= ymax
|
80
91
|
row = []
|
81
|
-
|
92
|
+
text_row.each do |x,element|
|
82
93
|
if x >= xmin && x<= xmax
|
83
|
-
row <<
|
94
|
+
row << [x,element]
|
84
95
|
end
|
85
96
|
end
|
86
|
-
box << row unless row.empty?
|
97
|
+
box << row.sort{|a,b| a.first <=> b.first }.map(&:last) unless row.empty?
|
87
98
|
end
|
88
99
|
end
|
89
100
|
box
|
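The region filter in `text_in_region` can be sketched without a PDF. This is a minimal illustration under assumptions: `region_text` is a hypothetical name, and the rows are supplied directly in the `[y, [[x, text], ...]]` shape rather than loaded through `content(page)`.

```ruby
# Hypothetical standalone sketch; not the gem's implementation.
# rows use the [y, [[x, text], ...]] shape of the fuzzed content collection.
def region_text(rows, xmin, xmax, ymin, ymax)
  rows.each_with_object([]) do |(y, text_row), box|
    next unless y.between?(ymin, ymax)
    # Keep elements inside the x limits (inclusive), ordered left to right
    row = text_row.select { |x, _| x.between?(xmin, xmax) }.sort_by(&:first)
    box << row.map(&:last) unless row.empty?
  end
end

rows = [
  [70.0, [[10.0, "crunchy bacon"]]],
  [40.0, [[25.0, "heaven"], [15.0, "bacon on kimchi noodles"]]],
  [10.0, [[40.0, "smoked and streaky for me"]]]
]
p region_text(rows, 0, 30, 20, 100)
# => [["crunchy bacon"], ["bacon on kimchi noodles", "heaven"]]
```

Sorting each row by x before dropping the coordinates is what keeps the text elements in reading order, even if the source row is unordered.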
@@ -94,7 +105,11 @@ class PDF::Reader::Turtletext
|
|
94
105
|
# +text+ may be a string (exact match required) or a Regexp
|
95
106
|
def text_position(text,page=1)
|
96
107
|
item = if text.class <= Regexp
|
97
|
-
content(page).map
|
108
|
+
content(page).map do |k,v|
|
109
|
+
if x = v.reduce(nil){|memo,vv| memo = (vv[1] =~ text) ? vv[0] : memo }
|
110
|
+
[k,x]
|
111
|
+
end
|
112
|
+
end
|
98
113
|
else
|
99
114
|
content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
|
100
115
|
end
|
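The string and Regexp lookup that `text_position` provides can be illustrated with a simplified sketch. It assumes rows in the `[y, [[x, text], ...]]` shape and returns the first match in row order; `find_position` is a hypothetical name, and the gem's own traversal differs in detail.

```ruby
# Hypothetical standalone sketch; not the gem's implementation.
# Returns {:x => .., :y => ..} for the first matching element, or nil.
def find_position(rows, text)
  rows.each do |y, text_row|
    text_row.each do |x, element|
      # A Regexp matches partially; a String must match exactly
      matched = text.is_a?(Regexp) ? element =~ text : element == text
      return { :x => x, :y => y } if matched
    end
  end
  nil
end

rows = [
  [70.0, [[10.0, "crunchy bacon"]]],
  [30.0, [[30.0, "turkey bacon"], [35.0, "fraud"]]]
]
p find_position(rows, "fraud")
p find_position(rows, /bacon/i)
```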
@@ -104,17 +119,30 @@ class PDF::Reader::Turtletext
|
|
104
119
|
end
|
105
120
|
end
|
106
121
|
|
107
|
-
#
|
108
|
-
#
|
122
|
+
# Returns a text region definition using a descriptive block.
|
123
|
+
#
|
124
|
+
# Usage:
|
125
|
+
#
|
126
|
+
# textangle = reader.bounding_box do
|
127
|
+
# page 1
|
128
|
+
# below /electricity/i
|
129
|
+
# above 10
|
130
|
+
# right_of 240.0
|
131
|
+
# left_of "Total ($)"
|
132
|
+
# end
|
133
|
+
# textangle.text
|
134
|
+
#
|
135
|
+
# Alternatively, an explicit block parameter may be used:
|
109
136
|
#
|
110
|
-
#
|
111
|
-
#
|
112
|
-
#
|
113
|
-
#
|
114
|
-
#
|
115
|
-
#
|
116
|
-
#
|
117
|
-
#
|
137
|
+
# textangle = reader.bounding_box do |r|
|
138
|
+
# r.page 1
|
139
|
+
# r.below /electricity/i
|
140
|
+
# r.above 10
|
141
|
+
# r.right_of 240.0
|
142
|
+
# r.left_of "Total ($)"
|
143
|
+
# end
|
144
|
+
# textangle.text
|
145
|
+
# => [['string','string'],['string']] # array of rows, each row is an array of column text elements
|
118
146
|
#
|
119
147
|
def bounding_box(&block)
|
120
148
|
PDF::Reader::Turtletext::Textangle.new(self,&block)
|
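Both block styles accepted by `bounding_box` rest on a common Ruby idiom: inspect the block's arity, then either yield the object or `instance_eval` the block. The sketch below shows the idiom with a hypothetical `Region` class that has a single `below` limit; it is not the gem's Textangle.

```ruby
# Hypothetical minimal class illustrating the two block styles.
class Region
  attr_writer :below

  def initialize(&block)
    return unless block
    if block.arity == 1
      yield self            # explicit block parameter: r.below = ...
    else
      instance_eval(&block) # bare DSL style: below ...
    end
  end

  # Acts as a setter when given an argument, and as a getter otherwise
  def below(*args)
    @below = args.first unless args.empty?
    @below
  end
end

a = Region.new { below "Table Header" }
b = Region.new { |r| r.below = "Table Header" }
p a.below == b.below
```

One trade-off of the `instance_eval` form is that `self` inside the block is the region object, so methods and instance variables of the calling object are not directly reachable; the explicit block parameter avoids that.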
data/lib/pdf/reader/turtletext/textangle.rb
CHANGED
@@ -1,27 +1,103 @@
|
|
1
1
|
# A DSL syntax for text extraction.
|
2
|
-
# WIP - not using this yet
|
3
2
|
#
|
4
|
-
# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do
|
5
|
-
# page 1
|
6
|
-
# below "Electricity Services"
|
7
|
-
# above "Gas Services by City Gas Pte Ltd"
|
8
|
-
# right_of 240.0
|
9
|
-
# left_of "Total ($)"
|
3
|
+
# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do |r|
|
4
|
+
# r.page = 1
|
5
|
+
# r.below = "Electricity Services"
|
6
|
+
# r.above = "Gas Services by City Gas Pte Ltd"
|
7
|
+
# r.right_of = 240.0
|
8
|
+
# r.left_of = "Total ($)"
|
10
9
|
# end
|
11
10
|
# textangle.text
|
12
11
|
#
|
13
12
|
class PDF::Reader::Turtletext::Textangle
|
14
13
|
attr_reader :reader
|
15
|
-
|
14
|
+
attr_accessor :page
|
15
|
+
attr_writer :above,:below,:left_of,:right_of
|
16
16
|
|
17
|
-
# +
|
18
|
-
def initialize(
|
19
|
-
@reader =
|
20
|
-
|
17
|
+
# +turtletext_reader+ is a PDF::Reader::Turtletext
|
18
|
+
def initialize(turtletext_reader,&block)
|
19
|
+
@reader = turtletext_reader
|
20
|
+
@page = 1
|
21
|
+
if block_given?
|
22
|
+
if block.arity == 1
|
23
|
+
yield self
|
24
|
+
else
|
25
|
+
instance_eval &block
|
26
|
+
end
|
27
|
+
end
|
21
28
|
end
|
22
29
|
|
30
|
+
def above(*args)
|
31
|
+
if value = args.first
|
32
|
+
@above = value
|
33
|
+
end
|
34
|
+
@above
|
35
|
+
end
|
36
|
+
|
37
|
+
def below(*args)
|
38
|
+
if value = args.first
|
39
|
+
@below = value
|
40
|
+
end
|
41
|
+
@below
|
42
|
+
end
|
43
|
+
|
44
|
+
def left_of(*args)
|
45
|
+
if value = args.first
|
46
|
+
@left_of = value
|
47
|
+
end
|
48
|
+
@left_of
|
49
|
+
end
|
50
|
+
|
51
|
+
def right_of(*args)
|
52
|
+
if value = args.first
|
53
|
+
@right_of = value
|
54
|
+
end
|
55
|
+
@right_of
|
56
|
+
end
|
57
|
+
|
58
|
+
# Returns the text
|
23
59
|
def text
|
24
|
-
|
60
|
+
return unless reader
|
61
|
+
|
62
|
+
xmin = if right_of
|
63
|
+
if [Fixnum,Float].include?(right_of.class)
|
64
|
+
right_of
|
65
|
+
else
|
66
|
+
reader.text_position(right_of,page)[:x] + 1
|
67
|
+
end
|
68
|
+
else
|
69
|
+
0
|
70
|
+
end
|
71
|
+
xmax = if left_of
|
72
|
+
if [Fixnum,Float].include?(left_of.class)
|
73
|
+
left_of
|
74
|
+
else
|
75
|
+
reader.text_position(left_of,page)[:x] - 1
|
76
|
+
end
|
77
|
+
else
|
78
|
+
99999 # TODO actual limit
|
79
|
+
end
|
80
|
+
|
81
|
+
ymin = if above
|
82
|
+
if [Fixnum,Float].include?(above.class)
|
83
|
+
above
|
84
|
+
else
|
85
|
+
reader.text_position(above,page)[:y] + 1
|
86
|
+
end
|
87
|
+
else
|
88
|
+
0
|
89
|
+
end
|
90
|
+
ymax = if below
|
91
|
+
if [Fixnum,Float].include?(below.class)
|
92
|
+
below
|
93
|
+
else
|
94
|
+
reader.text_position(below,page)[:y] - 1
|
95
|
+
end
|
96
|
+
else
|
97
|
+
99999 # TODO actual limit
|
98
|
+
end
|
99
|
+
|
100
|
+
reader.text_in_region(xmin,xmax,ymin,ymax,page)
|
25
101
|
end
|
26
102
|
|
27
103
|
end
|
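The four limit calculations in `Textangle#text` follow one pattern: use the default when no limit is given, use a number as-is, and otherwise look up the text position and nudge it by one unit. A minimal sketch of that pattern follows; `resolve_limit` and the `locate` lambda are hypothetical stand-ins for illustration. The sketch tests against `Numeric` instead of `[Fixnum,Float]`, since the `Fixnum` constant was removed in Ruby 3.2.

```ruby
# Hypothetical helper, not part of the gem: resolve one bounding-box limit.
# +locate+ is any callable mapping a String/Regexp to {:x => .., :y => ..},
# standing in for reader.text_position.
def resolve_limit(limit, axis, offset, default, locate)
  return default if limit.nil?
  return limit if limit.is_a?(Numeric) # covers Integer and Float
  locate.call(limit)[axis] + offset
end

positions = { "row 1" => { :x => 46.0, :y => 277.384 } }
locate = lambda { |text| positions.fetch(text) }

xmin = resolve_limit("row 1", :x, 1, 0, locate)  # right_of "row 1"
ymax = resolve_limit(nil, :y, -1, 99999, locate) # no `below` given
p [xmin, ymax]
# => [47.0, 99999]
```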
data/pdf-reader-turtletext.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "pdf-reader-turtletext"
|
8
|
-
s.version = "0.
|
8
|
+
s.version = "0.2.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Paul Gallagher"]
|
12
|
-
s.date = "2012-07-
|
12
|
+
s.date = "2012-07-31"
|
13
13
|
s.description = "a library that can read structured and positional text from PDFs. Ideal for assembling structured data from invoices and the like."
|
14
14
|
s.email = "gallagher.paul@gmail.com"
|
15
15
|
s.extra_rdoc_files = [
|
@@ -20,6 +20,7 @@ Gem::Specification.new do |s|
|
|
20
20
|
".rspec",
|
21
21
|
".rvmrc",
|
22
22
|
".travis.yml",
|
23
|
+
"CHANGELOG",
|
23
24
|
"Gemfile",
|
24
25
|
"Gemfile.lock",
|
25
26
|
"Guardfile",
|
@@ -34,8 +35,10 @@ Gem::Specification.new do |s|
|
|
34
35
|
"lib/pdf/reader/turtletext/version.rb",
|
35
36
|
"pdf-reader-turtletext.gemspec",
|
36
37
|
"spec/fixtures/pdf_samples/.gitkeep",
|
38
|
+
"spec/fixtures/pdf_samples/expectations.yml",
|
37
39
|
"spec/fixtures/pdf_samples/hello_world.pdf",
|
38
40
|
"spec/fixtures/pdf_samples/junk_prefix.pdf",
|
41
|
+
"spec/fixtures/pdf_samples/simple_table_text.pdf",
|
39
42
|
"spec/integration/pdf_samples_spec.rb",
|
40
43
|
"spec/spec_helper.rb",
|
41
44
|
"spec/support/pdf_samples_helper.rb",
|
data/spec/fixtures/pdf_samples/expectations.yml
ADDED
@@ -0,0 +1,95 @@
|
|
1
|
+
# this file defines the test expectations for PDF samples in spec/fixtures/pdf_samples.
|
2
|
+
#
|
3
|
+
# This is a YAML-format file, so beware that indentation is significant
|
4
|
+
---
|
5
|
+
hello_world.pdf:
|
6
|
+
:test_above:
|
7
|
+
:above: 100
|
8
|
+
:expected_text:
|
9
|
+
-
|
10
|
+
- "Hello World"
|
11
|
+
:test_below:
|
12
|
+
:below: 900
|
13
|
+
:expected_text:
|
14
|
+
-
|
15
|
+
- "Hello World"
|
16
|
+
:test_below_na:
|
17
|
+
:below: 10
|
18
|
+
:expected_text: []
|
19
|
+
simple_table_text.pdf:
|
20
|
+
:test_above:
|
21
|
+
:above: Table Header
|
22
|
+
:expected_text:
|
23
|
+
-
|
24
|
+
- "Simple Table Text"
|
25
|
+
:test_below:
|
26
|
+
:below: row 2
|
27
|
+
:expected_text:
|
28
|
+
-
|
29
|
+
- "Table Footer"
|
30
|
+
:test_right_of:
|
31
|
+
:right_of: row 1
|
32
|
+
:expected_text:
|
33
|
+
-
|
34
|
+
- "val 1"
|
35
|
+
- "val 2"
|
36
|
+
- "val 3"
|
37
|
+
-
|
38
|
+
- "val 1"
|
39
|
+
- "val 2"
|
40
|
+
- "val 3"
|
41
|
+
:test_left_of:
|
42
|
+
:left_of: val 1
|
43
|
+
:expected_text:
|
44
|
+
-
|
45
|
+
- "Simple Table Text"
|
46
|
+
-
|
47
|
+
- "Table Header"
|
48
|
+
-
|
49
|
+
- "row 1"
|
50
|
+
-
|
51
|
+
- "row 2"
|
52
|
+
-
|
53
|
+
- "Table Footer"
|
54
|
+
:test_above_and_below:
|
55
|
+
:below: Table Header
|
56
|
+
:above: Table Footer
|
57
|
+
:expected_text:
|
58
|
+
-
|
59
|
+
- "row 1"
|
60
|
+
- "val 1"
|
61
|
+
- "val 2"
|
62
|
+
- "val 3"
|
63
|
+
-
|
64
|
+
- "row 2"
|
65
|
+
- "val 1"
|
66
|
+
- "val 2"
|
67
|
+
- "val 3"
|
68
|
+
:test_above_and_below_and_left_of:
|
69
|
+
:below: Table Header
|
70
|
+
:above: Table Footer
|
71
|
+
:left_of: val 2
|
72
|
+
:expected_text:
|
73
|
+
-
|
74
|
+
- "row 1"
|
75
|
+
- "val 1"
|
76
|
+
-
|
77
|
+
- "row 2"
|
78
|
+
- "val 1"
|
79
|
+
:test_above_and_below_and_left_of_and_right_of:
|
80
|
+
:below: Table Header
|
81
|
+
:above: Table Footer
|
82
|
+
:left_of: val 2
|
83
|
+
:right_of: row 1
|
84
|
+
:expected_text:
|
85
|
+
-
|
86
|
+
- "val 1"
|
87
|
+
-
|
88
|
+
- "val 1"
|
89
|
+
|
90
|
+
|
91
|
+
|
92
|
+
|
93
|
+
|
94
|
+
|
95
|
+
|
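The expectations file uses symbol-style keys such as `:above:`, which Ruby's YAML parser turns into Symbols. The sketch below loads a small fragment in the same shape. The indentation is an assumption, because the diff view above has flattened it, and `permitted_classes:` is needed only with `YAML.safe_load`.

```ruby
require 'yaml'

# Illustrative fragment in the same shape as expectations.yml;
# the nesting shown here is assumed, not copied from the gem.
yaml = <<-YAML
---
hello_world.pdf:
  :test_above:
    :above: 100
    :expected_text:
    -
      - "Hello World"
  :test_below_na:
    :below: 10
    :expected_text: []
YAML

# Symbol keys must be permitted explicitly when using safe_load
expectations = YAML.safe_load(yaml, permitted_classes: [Symbol])

expectations.each do |sample_name, tests|
  tests.each do |test_name, expectation|
    puts "#{sample_name} #{test_name}: #{expectation[:expected_text].inspect}"
  end
end
```

Each test entry pairs one or more limits (`:above`, `:below`, `:left_of`, `:right_of`) with the rows of text expected inside the resulting region.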
data/spec/fixtures/pdf_samples/simple_table_text.pdf
ADDED
@@ -0,0 +1,139 @@
|
|
1
|
+
%PDF-1.3
|
2
|
+
%����
|
3
|
+
1 0 obj
|
4
|
+
<< /Creator <feff0050007200610077006e>
|
5
|
+
/Producer <feff0050007200610077006e>
|
6
|
+
>>
|
7
|
+
endobj
|
8
|
+
2 0 obj
|
9
|
+
<< /Type /Catalog
|
10
|
+
/Pages 3 0 R
|
11
|
+
>>
|
12
|
+
endobj
|
13
|
+
3 0 obj
|
14
|
+
<< /Type /Pages
|
15
|
+
/Count 1
|
16
|
+
/Kids [5 0 R]
|
17
|
+
>>
|
18
|
+
endobj
|
19
|
+
4 0 obj
|
20
|
+
<< /Length 795
|
21
|
+
>>
|
22
|
+
stream
|
23
|
+
q
|
24
|
+
|
25
|
+
BT
|
26
|
+
36 747.384 Td
|
27
|
+
/F1.0 12 Tf
|
28
|
+
[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>] TJ
|
29
|
+
ET
|
30
|
+
|
31
|
+
|
32
|
+
BT
|
33
|
+
46 327.384 Td
|
34
|
+
/F1.0 12 Tf
|
35
|
+
[<54> 120 <6162> 20 <6c6520486561646572>] TJ
|
36
|
+
ET
|
37
|
+
|
38
|
+
|
39
|
+
BT
|
40
|
+
46 277.384 Td
|
41
|
+
/F1.0 12 Tf
|
42
|
+
[<726f> 15 <772031>] TJ
|
43
|
+
ET
|
44
|
+
|
45
|
+
|
46
|
+
BT
|
47
|
+
136 277.384 Td
|
48
|
+
/F1.0 12 Tf
|
49
|
+
[<76> 25 <616c2031>] TJ
|
50
|
+
ET
|
51
|
+
|
52
|
+
|
53
|
+
BT
|
54
|
+
186 277.384 Td
|
55
|
+
/F1.0 12 Tf
|
56
|
+
[<76> 25 <616c2032>] TJ
|
57
|
+
ET
|
58
|
+
|
59
|
+
|
60
|
+
BT
|
61
|
+
236 277.384 Td
|
62
|
+
/F1.0 12 Tf
|
63
|
+
[<76> 25 <616c2033>] TJ
|
64
|
+
ET
|
65
|
+
|
66
|
+
|
67
|
+
BT
|
68
|
+
46 227.38400000000001 Td
|
69
|
+
/F1.0 12 Tf
|
70
|
+
[<726f> 15 <772032>] TJ
|
71
|
+
ET
|
72
|
+
|
73
|
+
|
74
|
+
BT
|
75
|
+
136 227.38400000000001 Td
|
76
|
+
/F1.0 12 Tf
|
77
|
+
[<76> 25 <616c2031>] TJ
|
78
|
+
ET
|
79
|
+
|
80
|
+
|
81
|
+
BT
|
82
|
+
186 227.38400000000001 Td
|
83
|
+
/F1.0 12 Tf
|
84
|
+
[<76> 25 <616c2032>] TJ
|
85
|
+
ET
|
86
|
+
|
87
|
+
|
88
|
+
BT
|
89
|
+
236 227.38400000000001 Td
|
90
|
+
/F1.0 12 Tf
|
91
|
+
[<76> 25 <616c2033>] TJ
|
92
|
+
ET
|
93
|
+
|
94
|
+
|
95
|
+
BT
|
96
|
+
46 177.38400000000001 Td
|
97
|
+
/F1.0 12 Tf
|
98
|
+
[<54> 120 <6162> 20 <6c652046> 30 <6f6f746572>] TJ
|
99
|
+
ET
|
100
|
+
|
101
|
+
Q
|
102
|
+
|
103
|
+
endstream
|
104
|
+
endobj
|
105
|
+
5 0 obj
|
106
|
+
<< /Type /Page
|
107
|
+
/Parent 3 0 R
|
108
|
+
/MediaBox [0 0 612.0 792.0]
|
109
|
+
/Contents 4 0 R
|
110
|
+
/Resources << /ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
|
111
|
+
/Font << /F1.0 6 0 R
|
112
|
+
>>
|
113
|
+
>>
|
114
|
+
>>
|
115
|
+
endobj
|
116
|
+
6 0 obj
|
117
|
+
<< /Type /Font
|
118
|
+
/Subtype /Type1
|
119
|
+
/BaseFont /Helvetica
|
120
|
+
/Encoding /WinAnsiEncoding
|
121
|
+
>>
|
122
|
+
endobj
|
123
|
+
xref
|
124
|
+
0 7
|
125
|
+
0000000000 65535 f
|
126
|
+
0000000015 00000 n
|
127
|
+
0000000109 00000 n
|
128
|
+
0000000158 00000 n
|
129
|
+
0000000215 00000 n
|
130
|
+
0000001061 00000 n
|
131
|
+
0000001239 00000 n
|
132
|
+
trailer
|
133
|
+
<< /Size 7
|
134
|
+
/Root 2 0 R
|
135
|
+
/Info 1 0 R
|
136
|
+
>>
|
137
|
+
startxref
|
138
|
+
1336
|
139
|
+
%%EOF
|
data/spec/integration/pdf_samples_spec.rb
CHANGED
@@ -3,5 +3,33 @@ include PdfSamplesHelper
|
|
3
3
|
|
4
4
|
describe "PDF Samples" do
|
5
5
|
|
6
|
+
# This will scan all *.pdf files in spec/fixtures/personal_pdf_samples
|
7
|
+
# and do basic verification of the file structure without any effort from you.
|
8
|
+
pdf_sample_expectations.each do |sample_name,test_specifications|
|
9
|
+
describe "sample" do
|
10
|
+
let(:options) { test_specifications[:options] || {} }
|
11
|
+
let(:sample_file) { pdf_sample(sample_name) }
|
12
|
+
let(:turtletext_reader) { PDF::Reader::Turtletext.new(sample_file,options) }
|
13
|
+
|
14
|
+
(test_specifications||{}).each do |test_name,expectations|
|
15
|
+
context test_name do
|
16
|
+
let(:bounding_box) {
|
17
|
+
turtletext_reader.bounding_box do
|
18
|
+
above expectations[:above]
|
19
|
+
below expectations[:below]
|
20
|
+
left_of expectations[:left_of]
|
21
|
+
right_of expectations[:right_of]
|
22
|
+
end
|
23
|
+
}
|
24
|
+
# it {
|
25
|
+
# puts "bounding_box"
|
26
|
+
# puts bounding_box.inspect
|
27
|
+
# }
|
28
|
+
subject { bounding_box.text }
|
29
|
+
it { should eql(expectations[:expected_text])}
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
6
34
|
|
7
35
|
end
|
data/spec/support/pdf_samples_helper.rb
CHANGED
@@ -31,6 +31,7 @@ module PdfSamplesHelper
|
|
31
31
|
require 'prawn'
|
32
32
|
puts "Making PDF samples for tests.."
|
33
33
|
make_sample_hello_world
|
34
|
+
make_sample_simple_table_text
|
34
35
|
end
|
35
36
|
|
36
37
|
def make_sample_hello_world
|
@@ -40,4 +41,26 @@ module PdfSamplesHelper
|
|
40
41
|
end
|
41
42
|
puts "Created: #{filename}"
|
42
43
|
end
|
44
|
+
|
45
|
+
def make_sample_simple_table_text
|
46
|
+
filename = pdf_sample('simple_table_text.pdf')
|
47
|
+
Prawn::Document.generate filename do
|
48
|
+
text "Simple Table Text"
|
49
|
+
text_box "Table Header", :at => [10, 300], :width => 200
|
50
|
+
|
51
|
+
text_box "row 1", :at => [10, 250], :width => 90
|
52
|
+
text_box "val 1", :at => [100, 250], :width => 50
|
53
|
+
text_box "val 2", :at => [150, 250], :width => 50
|
54
|
+
text_box "val 3", :at => [200, 250], :width => 50
|
55
|
+
|
56
|
+
text_box "row 2", :at => [10, 200], :width => 90
|
57
|
+
text_box "val 1", :at => [100, 200], :width => 50
|
58
|
+
text_box "val 2", :at => [150, 200], :width => 50
|
59
|
+
text_box "val 3", :at => [200, 200], :width => 50
|
60
|
+
|
61
|
+
text_box "Table Footer", :at => [10, 150], :width => 200
|
62
|
+
end
|
63
|
+
puts "Created: #{filename}"
|
64
|
+
end
|
65
|
+
|
43
66
|
end
|
data/spec/unit/reader/turtletext/textangle_spec.rb
CHANGED
@@ -3,4 +3,197 @@ require 'spec_helper'
|
|
3
3
|
describe PDF::Reader::Turtletext::Textangle do
|
4
4
|
let(:resource_class) { PDF::Reader::Turtletext::Textangle }
|
5
5
|
|
6
|
+
let(:source) { nil } # we're just going to mock the PDF source here
|
7
|
+
let(:options) { {} }
|
8
|
+
let(:turtletext_reader) { PDF::Reader::Turtletext.new(source,options) }
|
9
|
+
|
10
|
+
|
11
|
+
describe "#reader" do
|
12
|
+
let(:textangle) { resource_class.new(turtletext_reader) }
|
13
|
+
subject { textangle.reader }
|
14
|
+
it { should be_a(PDF::Reader::Turtletext) }
|
15
|
+
end
|
16
|
+
|
17
|
+
describe "#text" do
|
18
|
+
let(:page) { 1 }
|
19
|
+
before do
|
20
|
+
turtletext_reader.stub(:load_content).and_return(given_page_content)
|
21
|
+
end
|
22
|
+
let(:given_page_content) { {
|
23
|
+
70.0=>{10.0=>"crunchy bacon"},
|
24
|
+
40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
|
25
|
+
30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
|
26
|
+
10.0=>{40.0=>"smoked and streaky for me"}
|
27
|
+
} }
|
28
|
+
|
29
|
+
context "with block param" do
|
30
|
+
[:above,:below,:left_of,:right_of].each do |positional_method|
|
31
|
+
context "with #{positional_method}" do
|
32
|
+
let(:term) { "canary" }
|
33
|
+
|
34
|
+
it "should work with block param" do
|
35
|
+
textangle = resource_class.new(turtletext_reader) do |r|
|
36
|
+
r.send("#{positional_method}=",term)
|
37
|
+
end
|
38
|
+
textangle.send(positional_method).should eql(term)
|
39
|
+
end
|
40
|
+
|
41
|
+
end
|
42
|
+
end
|
43
|
+
end
|
44
|
+
|
45
|
+
context "without block param" do
|
46
|
+
it "#above should work" do
|
47
|
+
textangle = resource_class.new(turtletext_reader) do
|
48
|
+
above "canary"
|
49
|
+
end
|
50
|
+
textangle.above.should eql("canary")
|
51
|
+
end
|
52
|
+
it "#below should work" do
|
53
|
+
textangle = resource_class.new(turtletext_reader) do
|
54
|
+
below "canary"
|
55
|
+
end
|
56
|
+
textangle.below.should eql("canary")
|
57
|
+
end
|
58
|
+
it "#left_of should work" do
|
59
|
+
textangle = resource_class.new(turtletext_reader) do
|
60
|
+
left_of "canary"
|
61
|
+
end
|
62
|
+
textangle.left_of.should eql("canary")
|
63
|
+
end
|
64
|
+
it "#right_of should work" do
|
65
|
+
textangle = resource_class.new(turtletext_reader) do
|
66
|
+
right_of "canary"
|
67
|
+
end
|
68
|
+
textangle.right_of.should eql("canary")
|
69
|
+
end
|
70
|
+
end
|
71
|
+
|
72
|
+
context "when only below specified" do
|
73
|
+
context "as a string" do
|
74
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
75
|
+
r.below = "fraud"
|
76
|
+
end }
|
77
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
78
|
+
subject { textangle.text }
|
79
|
+
it { should eql(expected) }
|
80
|
+
end
|
81
|
+
context "as a regex" do
|
82
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
83
|
+
r.below = /Fraud/i
|
84
|
+
end }
|
85
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
86
|
+
subject { textangle.text }
|
87
|
+
it { should eql(expected) }
|
88
|
+
end
|
89
|
+
context "as a number" do
|
90
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
91
|
+
r.below = 20
|
92
|
+
end }
|
93
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
94
|
+
subject { textangle.text }
|
95
|
+
it { should eql(expected) }
|
96
|
+
end
|
97
|
+
end
|
98
|
+
|
99
|
+
context "when only above specified" do
|
100
|
+
context "as a string" do
|
101
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
102
|
+
r.above = "heaven"
|
103
|
+
end }
|
104
|
+
let(:expected) { [["crunchy bacon"]]}
|
105
|
+
subject { textangle.text }
|
106
|
+
it { should eql(expected) }
|
107
|
+
end
|
108
|
+
context "as a regex" do
|
109
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
110
|
+
r.above = /heaVen/i
|
111
|
+
end }
|
112
|
+
let(:expected) { [["crunchy bacon"]]}
|
113
|
+
subject { textangle.text }
|
114
|
+
it { should eql(expected) }
|
115
|
+
end
|
116
|
+
context "as a number" do
|
117
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
118
|
+
r.above = 41
|
119
|
+
end }
|
120
|
+
let(:expected) { [["crunchy bacon"]]}
|
121
|
+
subject { textangle.text }
|
122
|
+
it { should eql(expected) }
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
context "when only left_of specified" do
|
127
|
+
context "as a string" do
|
128
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
129
|
+
r.left_of = "turkey bacon"
|
130
|
+
end }
|
131
|
+
let(:expected) { [
|
132
|
+
["crunchy bacon"],
|
133
|
+
["bacon on kimchi noodles", "heaven"]
|
134
|
+
] }
|
135
|
+
subject { textangle.text }
|
136
|
+
it { should eql(expected) }
|
137
|
+
end
|
138
|
+
context "as a regex" do
|
139
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
140
|
+
r.left_of = /turKey/i
|
141
|
+
end }
|
142
|
+
let(:expected) { [
|
143
|
+
["crunchy bacon"],
|
144
|
+
["bacon on kimchi noodles", "heaven"]
|
145
|
+
] }
|
146
|
+
subject { textangle.text }
|
147
|
+
it { should eql(expected) }
|
148
|
+
end
|
149
|
+
context "as a number" do
|
150
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
151
|
+
r.left_of = 29
|
152
|
+
end }
|
153
|
+
let(:expected) { [
|
154
|
+
["crunchy bacon"],
|
155
|
+
["bacon on kimchi noodles", "heaven"]
|
156
|
+
] }
|
157
|
+
subject { textangle.text }
|
158
|
+
it { should eql(expected) }
|
159
|
+
end
|
160
|
+
end
|
161
|
+
|
162
|
+
context "when only right_of specified" do
|
163
|
+
context "as a string" do
|
164
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
165
|
+
r.right_of = "heaven"
|
166
|
+
end }
|
167
|
+
let(:expected) { [
|
168
|
+
["turkey bacon","fraud"],
|
169
|
+
["smoked and streaky for me"]
|
170
|
+
] }
|
171
|
+
subject { textangle.text }
|
172
|
+
it { should eql(expected) }
|
173
|
+
end
|
174
|
+
context "as a regex" do
|
175
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
176
|
+
r.right_of = /Heaven/i
|
177
|
+
end }
|
178
|
+
let(:expected) { [
|
179
|
+
["turkey bacon","fraud"],
|
180
|
+
["smoked and streaky for me"]
|
181
|
+
] }
|
182
|
+
subject { textangle.text }
|
183
|
+
it { should eql(expected) }
|
184
|
+
end
|
185
|
+
context "as a number" do
|
186
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
187
|
+
r.right_of = 26
|
188
|
+
end }
|
189
|
+
let(:expected) { [
|
190
|
+
["turkey bacon","fraud"],
|
191
|
+
["smoked and streaky for me"]
|
192
|
+
] }
|
193
|
+
subject { textangle.text }
|
194
|
+
it { should eql(expected) }
|
195
|
+
end
|
196
|
+
end
|
197
|
+
|
198
|
+
end
|
6
199
|
end
|
data/spec/unit/reader/turtletext/turtletext_spec.rb
CHANGED
@@ -4,16 +4,16 @@ describe PDF::Reader::Turtletext do
|
|
4
4
|
let(:resource_class) { PDF::Reader::Turtletext }
|
5
5
|
|
6
6
|
let(:source) { nil } # we're just going to mock the PDF source here
|
7
|
-
let(:
|
7
|
+
let(:turtletext_reader) { resource_class.new(source,options) }
|
8
8
|
let(:options) { {} }
|
9
9
|
|
10
10
|
describe "#reader" do
|
11
|
-
subject {
|
11
|
+
subject { turtletext_reader.reader}
|
12
12
|
it { should be_a(PDF::Reader) }
|
13
13
|
end
|
14
14
|
|
15
15
|
describe "#y_precision" do
|
16
|
-
subject {
|
16
|
+
subject { turtletext_reader.y_precision}
|
17
17
|
context "default" do
|
18
18
|
it { should eql(3) }
|
19
19
|
end
|
@@ -27,35 +27,40 @@ describe PDF::Reader::Turtletext do
 context "with mocked source content" do
 let(:page) { 1 }
 before do
-
+turtletext_reader.should_receive(:load_content).with(page).and_return(given_page_content)
 end

 {
 :with_simple_text => {
 :source_page_content => {10.0=>{10.0=>"a first bit of text"}},
 :expected_precise_content => {10.0=>{10.0=>"a first bit of text"}},
-:expected_fuzzed_content =>
+:expected_fuzzed_content => [[10.0,[[10.0,"a first bit of text"]]]]
 },
 :with_widely_separated_text => {
-:source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-:expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-:expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}}
-},
-:with_unsorted_y_text => {
 :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
 :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
-:expected_fuzzed_content =>
+:expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"]]], [10.0, [[20.0, "a second bit of text"]]]]
+},
+:with_unsorted_y_text => {
+:source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+:expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+:expected_fuzzed_content => [[20.0, [[20.0, "a second bit of text"]]], [10.0, [[10.0, "a first bit of text"]]]]
 },
 :with_fuzzed_y_text => {
-:source_page_content => {
-:expected_precise_content => {
-:expected_fuzzed_content =>
+:source_page_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+:expected_precise_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+:expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [12.0, "a second bit of text"]]]]
 },
 :with_widely_separated_fuzzed_y_text => {
 :y_precision => 25,
-:source_page_content => {
-:expected_precise_content => {
-:expected_fuzzed_content =>
+:source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+:expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+:expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [20.0, "a second bit of text"]]]]
+},
+:with_multiple_row_text => {
+:source_page_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+:expected_precise_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+:expected_fuzzed_content => [[10.0, [[10.0, "first"], [20.0, "second"], [30.0, "third"]]]]
 }
 }.each do |test_name,test_expectations|
 context test_name do
@@ -69,12 +74,12 @@ describe PDF::Reader::Turtletext do
 }

 describe "#content" do
-subject {
+subject { turtletext_reader.content(page) }
 it { should eql(test_expectations[:expected_fuzzed_content]) }
 end

 describe "#precise_content" do
-subject {
+subject { turtletext_reader.precise_content(page) }
 it { should eql(test_expectations[:expected_precise_content]) }
 end

@@ -90,24 +95,24 @@ describe PDF::Reader::Turtletext do
 },
 :with_single_line_text => {
 :source_page_content => {
-
+70.0=>{10.0=>"first line ignored"},
 30.0=>{10.0=>"first part found", 20.0=>"last part found"},
-
+10.0=>{10.0=>"last line ignored"}
 },
 :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
 :expected_text => [["first part found", "last part found"]]
 },
 :with_multi_line_text => {
 :source_page_content => {
-
-
-
-
+70.0=>{10.0=>"first line ignored"},
+40.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
+30.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
+10.0=>{10.0=>"last line ignored"}
 },
 :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
 :expected_text => [
-["
-["
+["first line first part found", "first line last part found"],
+["last line first part found", "last line last part found"]
 ]
 }
 }.each do |test_name,test_expectations|
@@ -118,7 +123,7 @@ describe PDF::Reader::Turtletext do
 let(:ymin) { test_expectations[:ymin] }
 let(:ymax) { test_expectations[:ymax] }
 let(:expected_text) { test_expectations[:expected_text] }
-subject {
+subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
 it { should eql(expected_text) }
 end
 end
@@ -126,21 +131,21 @@ describe PDF::Reader::Turtletext do

 describe "#text_position" do
 let(:given_page_content) { {
-
-
-
-
+70.0=>{10.0=>"crunchy bacon"},
+40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+10.0=>{40.0=>"smoked and streaky da bomb"}
 } }
 {
-:with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>
-:with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>
-:with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>
-:with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>
+:with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>30.0} },
+:with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>40.0} },
+:with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>40.0} },
+:with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>70.0} }
 }.each do |test_name,test_expectations|
 context test_name do
 let(:match_term) { test_expectations[:match_term] }
 let(:expected_position) { test_expectations[:expected_position] }
-subject {
+subject { turtletext_reader.text_position(match_term,page) }
 it { should eql(expected_position) }
 end
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader-turtletext
 version: !ruby/object:Gem::Version
-version: 0.
+version: 0.2.0
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-07-
+date: 2012-07-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: pdf-reader
-requirement: &
+requirement: &70218189955060 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - =
@@ -21,10 +21,10 @@ dependencies:
 version: 1.1.1
 type: :runtime
 prerelease: false
-version_requirements: *
+version_requirements: *70218189955060
 - !ruby/object:Gem::Dependency
 name: bundler
-requirement: &
+requirement: &70218189954360 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -32,10 +32,10 @@ dependencies:
 version: 1.1.4
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189954360
 - !ruby/object:Gem::Dependency
 name: jeweler
-requirement: &
+requirement: &70218189953580 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -43,10 +43,10 @@ dependencies:
 version: 1.6.4
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189953580
 - !ruby/object:Gem::Dependency
 name: rake
-requirement: &
+requirement: &70218189953020 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -54,10 +54,10 @@ dependencies:
 version: 0.9.2.2
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189953020
 - !ruby/object:Gem::Dependency
 name: rspec
-requirement: &
+requirement: &70218189952200 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -65,10 +65,10 @@ dependencies:
 version: 2.8.0
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189952200
 - !ruby/object:Gem::Dependency
 name: rdoc
-requirement: &
+requirement: &70218189951400 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -76,10 +76,10 @@ dependencies:
 version: '3.11'
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189951400
 - !ruby/object:Gem::Dependency
 name: prawn
-requirement: &
+requirement: &70218189950700 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -87,10 +87,10 @@ dependencies:
 version: 0.12.0
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189950700
 - !ruby/object:Gem::Dependency
 name: guard-rspec
-requirement: &
+requirement: &70218189950100 !ruby/object:Gem::Requirement
 none: false
 requirements:
 - - ~>
@@ -98,7 +98,7 @@ dependencies:
 version: 1.2.0
 type: :development
 prerelease: false
-version_requirements: *
+version_requirements: *70218189950100
 description: a library that can read structured and positional text from PDFs. Ideal
 for asembling structured data from invoices and the like.
 email: gallagher.paul@gmail.com
@@ -111,6 +111,7 @@ files:
 - .rspec
 - .rvmrc
 - .travis.yml
+- CHANGELOG
 - Gemfile
 - Gemfile.lock
 - Guardfile
@@ -125,8 +126,10 @@ files:
 - lib/pdf/reader/turtletext/version.rb
 - pdf-reader-turtletext.gemspec
 - spec/fixtures/pdf_samples/.gitkeep
+- spec/fixtures/pdf_samples/expectations.yml
 - spec/fixtures/pdf_samples/hello_world.pdf
 - spec/fixtures/pdf_samples/junk_prefix.pdf
+- spec/fixtures/pdf_samples/simple_table_text.pdf
 - spec/integration/pdf_samples_spec.rb
 - spec/spec_helper.rb
 - spec/support/pdf_samples_helper.rb