pdf-reader-turtletext 0.1.0 → 0.2.0
- data/.travis.yml +9 -1
- data/CHANGELOG +11 -0
- data/README.rdoc +106 -15
- data/lib/pdf/reader/turtletext.rb +52 -24
- data/lib/pdf/reader/turtletext/textangle.rb +89 -13
- data/lib/pdf/reader/turtletext/version.rb +1 -1
- data/pdf-reader-turtletext.gemspec +5 -2
- data/spec/fixtures/pdf_samples/expectations.yml +95 -0
- data/spec/fixtures/pdf_samples/simple_table_text.pdf +139 -0
- data/spec/integration/pdf_samples_spec.rb +28 -0
- data/spec/support/pdf_samples_helper.rb +23 -0
- data/spec/unit/reader/turtletext/textangle_spec.rb +193 -0
- data/spec/unit/reader/turtletext/turtletext_spec.rb +42 -37
- metadata +21 -18
data/.travis.yml
CHANGED
@@ -1,3 +1,11 @@
|
|
1
1
|
# These are specific configuration settings required for travis-ci
|
2
2
|
# see http://travis-ci.org/tardate/pdf-reader-turtletext
|
3
|
-
|
3
|
+
language: ruby
|
4
|
+
rvm:
|
5
|
+
- 1.8.7
|
6
|
+
- 1.9.2
|
7
|
+
- 1.9.3
|
8
|
+
- rbx-18mode
|
9
|
+
- rbx-19mode
|
10
|
+
- jruby-18mode
|
11
|
+
- jruby-19mode
|
data/CHANGELOG
ADDED
@@ -0,0 +1,11 @@
|
|
1
|
+
Version 0.2.0 Release: n/a
|
2
|
+
==================================================
|
3
|
+
* add bounding_box / textangle semantics
|
4
|
+
* improve documentation
|
5
|
+
* MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
|
6
|
+
|
7
|
+
Version 0.1.0 Release: 22nd July 2012
|
8
|
+
==================================================
|
9
|
+
* Initial packaging and release of core functionality directly extracted
|
10
|
+
from https://github.com/tardate/sps_bill_scanner/
|
11
|
+
* MRI 1.9 only
|
data/README.rdoc
CHANGED
@@ -14,30 +14,121 @@ For an example of how this is works in practice, see the
|
|
14
14
|
|
15
15
|
== Requirements and Known Limitations
|
16
16
|
|
17
|
-
*
|
18
|
-
* fixed dependency on PDF::Reader
|
17
|
+
* Tested with MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
|
18
|
+
* Has a fixed dependency on PDF::Reader v1.1.1
|
19
19
|
|
20
|
-
==
|
20
|
+
== The PDF::Reader::Turtletext Cookbook
|
21
21
|
|
22
|
-
|
22
|
+
=== How do I install it for normal use?
|
23
23
|
|
24
|
-
|
24
|
+
It is distributed as a gem, so all normal gem installation procedures apply. To install the
|
25
|
+
gem directly from the command line:
|
25
26
|
|
26
|
-
|
27
|
+
$ gem install pdf-reader-turtletext
|
27
28
|
|
28
|
-
|
29
|
-
such as <tt>text_position</tt> and <tt>text_in_region</tt>.
|
29
|
+
If you are using bundler or Rails, add to your Gemfile:
|
30
30
|
|
31
|
-
|
31
|
+
gem 'pdf-reader-turtletext'
|
32
32
|
|
33
|
+
Then bundle install:
|
34
|
+
|
35
|
+
$ bundle
|
36
|
+
|
37
|
+
=== How do I install it for gem development?
|
38
|
+
|
39
|
+
If you want to work on enhancements or fix bugs in PDF::Reader::Turtletext, fork and clone the GitHub repository. See the section below on 'Contributing to PDF::Reader::Turtletext'.
|
40
|
+
|
41
|
+
=== How to instantiate Turtletext in code
|
42
|
+
|
43
|
+
All interaction is done using an instance of the PDF::Reader::Turtletext class. It is
|
44
|
+
initialised given a filename or IO-like object, and any required options.
|
45
|
+
|
46
|
+
Typical usage:
|
47
|
+
|
48
|
+
pdf_filename = '../some_path/some.pdf'
|
33
49
|
reader = PDF::Reader::Turtletext.new(pdf_filename)
|
34
|
-
|
35
|
-
|
36
|
-
|
37
|
-
|
38
|
-
|
39
|
-
|
50
|
+
options = { :y_precision => 5 }
|
51
|
+
reader_with_options = PDF::Reader::Turtletext.new(pdf_filename,options)
|
52
|
+
|
53
|
+
=== How to extract text within a region described in relation to other text
|
54
|
+
|
55
|
+
Problem: we don't know exactly where the required text will be on the page, and it is not encoded
|
56
|
+
within the PDF as a single object. But we do know that it will be relatively positioned (for example)
|
57
|
+
below a certain bit of text, to the left of another, and above some other text.
|
58
|
+
|
59
|
+
Solution: use the <tt>bounding_box</tt> method to describe the region and extract the matching text.
|
60
|
+
|
61
|
+
textangle = reader.bounding_box do
|
62
|
+
page 1
|
63
|
+
below /electricity/i
|
64
|
+
above 10
|
65
|
+
right_of 240.0
|
66
|
+
left_of "Total ($)"
|
67
|
+
end
|
68
|
+
textangle.text
|
69
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
70
|
+
|
71
|
+
The range of methods that can be used within the <tt>bounding_box</tt> block are all optional, and include:
|
72
|
+
* <tt>page</tt> - specifies the PDF page from which to extract text (default is 1).
|
73
|
+
* <tt>below</tt> - a string, regex or number that describes the upper limit of the text box
|
74
|
+
(default is top border of the page).
|
75
|
+
* <tt>above</tt> - a string, regex or number that describes the lower limit of the text box
|
76
|
+
(default is bottom border of the page).
|
77
|
+
* <tt>left_of</tt> - a string, regex or number that describes the right limit of the text box
|
78
|
+
(default is right border of the page).
|
79
|
+
* <tt>right_of</tt> - a string, regex or number that describes the left limit of the text box
|
80
|
+
(default is left border of the page).
|
81
|
+
|
82
|
+
Note that <tt>left_of</tt> and <tt>right_of</tt> constraints do *not* need to be within the vertical
|
83
|
+
range of the box being described.
|
84
|
+
For example, you could use an element in the page header to describe the <tt>left_of</tt> limit
|
85
|
+
for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.
|
86
|
+
|
87
|
+
Similarly, <tt>above</tt> and <tt>below</tt> constraints do *not* need to be within the horizontal
|
88
|
+
range of the box being described.
|
89
|
+
|
90
|
+
=== Using a block parameter with the <tt>bounding_box</tt> method
|
91
|
+
|
92
|
+
An explicit block parameter may be used with the <tt>bounding_box</tt> method:
|
93
|
+
|
94
|
+
textangle = reader.bounding_box do |r|
|
95
|
+
r.below /electricity/i
|
96
|
+
r.left_of "Total ($)"
|
97
|
+
end
|
98
|
+
textangle.text
|
99
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
100
|
+
|
101
|
+
=== Extract text for a region with known positional co-ordinates
|
102
|
+
|
103
|
+
If you know (or can calculate) the x,y positions of the required text region, you can extract the region's
|
104
|
+
text using the <tt>text_in_region</tt> method.
|
105
|
+
|
106
|
+
text = reader.text_in_region(
|
107
|
+
10, # minimum x (left-most) (inclusive)
|
108
|
+
900, # maximum x (right-most) (inclusive)
|
109
|
+
200, # minimum y (bottom-most) (inclusive)
|
110
|
+
400, # maximum y (top-most) (inclusive)
|
111
|
+
1 # page
|
40
112
|
)
|
113
|
+
=> [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
|
114
|
+
|
115
|
+
Note that the x,y origin is at the bottom-left of the page.
|
116
|
+
|
117
|
+
=== How to find the x,y co-ordinate of a specific text element
|
118
|
+
|
119
|
+
Problem: if you are doing low-level text extraction with <tt>text_in_region</tt> for example,
|
120
|
+
it is usually necessary to locate specific text to provide a positional reference.
|
121
|
+
|
122
|
+
Solution: use the <tt>text_position</tt> method to locate text by exact or partial match.
|
123
|
+
It returns a Hash of x/y co-ordinates that is the bottom-left corner of the text.
|
124
|
+
|
125
|
+
page = 1
|
126
|
+
text_by_exact_match = reader.text_position("Transaction Table", page)
|
127
|
+
=> { :x => 10.0, :y => 600.0 }
|
128
|
+
text_by_regex_match = reader.text_position(/transaction summary/i, page)
|
129
|
+
=> { :x => 10.0, :y => 300.0 }
|
130
|
+
|
131
|
+
Note: in the case of multiple matches, only the first match is returned.
|
41
132
|
|
42
133
|
|
43
134
|
== Contributing to PDF::Reader::Turtletext
|
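The README additions above describe region extraction with inclusive x/y limits and a bottom-left origin, plus text lookup by exact or partial match. The following is a minimal sketch of those two ideas on plain Ruby data, not the gem's implementation; `ROWS`, `region_text` and `position_of` are hypothetical names, and the sample rows are borrowed from the unit spec in this diff.

```ruby
# Hypothetical stand-in for one page of content: rows of [y, [[x, text], ...]],
# ordered top to bottom (the origin is assumed to be the bottom-left of the page).
ROWS = [
  [70.0, [[10.0, "crunchy bacon"]]],
  [40.0, [[15.0, "bacon on kimchi noodles"], [25.0, "heaven"]]],
  [30.0, [[30.0, "turkey bacon"], [35.0, "fraud"]]],
  [10.0, [[40.0, "smoked and streaky for me"]]]
]

# Collect the text whose position falls inside the inclusive x/y limits.
# Each returned row is ordered left to right.
def region_text(rows, xmin, xmax, ymin, ymax)
  rows.each_with_object([]) do |(y, elements), box|
    next unless y >= ymin && y <= ymax
    row = elements.select { |x, _text| x >= xmin && x <= xmax }
    box << row.sort_by(&:first).map(&:last) unless row.empty?
  end
end

# Return the position of the first element matching a String (exact match)
# or a Regexp (partial match), or nil when nothing matches.
def position_of(rows, pattern)
  rows.each do |y, elements|
    elements.each do |x, text|
      hit = pattern.is_a?(Regexp) ? (text =~ pattern) : (text == pattern)
      return { :x => x, :y => y } if hit
    end
  end
  nil
end

p region_text(ROWS, 0, 29, 0, 99_999)
p position_of(ROWS, /heaVen/i)
```

Under these assumptions, a constraint such as `left_of 29` reduces to an upper x limit, which is why only the first two rows survive in the first call.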
data/lib/pdf/reader/turtletext.rb
CHANGED
@@ -16,6 +16,8 @@ class PDF::Reader::Turtletext
|
|
16
16
|
attr_reader :options
|
17
17
|
|
18
18
|
# +source+ is a file name or stream-like object
|
19
|
+
# Supported +options+ include:
|
20
|
+
# * :y_precision
|
19
21
|
def initialize(source, options={})
|
20
22
|
@options = options
|
21
23
|
@reader = PDF::Reader.new(source)
|
@@ -31,7 +33,7 @@ class PDF::Reader::Turtletext
|
|
31
33
|
end
|
32
34
|
|
33
35
|
# Returns positional (with fuzzed y positioning) text content collection as a hash:
|
34
|
-
#
|
36
|
+
# [ fuzzed_y_position, [[x_position,content]] ]
|
35
37
|
def content(page=1)
|
36
38
|
@content ||= []
|
37
39
|
if @content[page]
|
@@ -41,18 +43,24 @@ class PDF::Reader::Turtletext
|
|
41
43
|
end
|
42
44
|
end
|
43
45
|
|
44
|
-
# Returns
|
45
|
-
#
|
46
|
+
# Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content order by x position.
|
47
|
+
# [ fuzzed_y_position, [[x_position,content]] ]
|
46
48
|
# Given +input+ as a hash:
|
47
49
|
# { y_position: { x_position: content}}
|
48
50
|
# Fuzz factors: +y_precision+
|
49
51
|
def fuzzed_y(input)
|
50
|
-
output =
|
51
|
-
input.keys.sort.each do |precise_y|
|
52
|
-
|
53
|
-
|
54
|
-
|
55
|
-
|
52
|
+
output = []
|
53
|
+
input.keys.sort.reverse.each do |precise_y|
|
54
|
+
matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
|
55
|
+
y_index = output.index{|y| y.first == matching_y }
|
56
|
+
new_row_content = input[precise_y].to_a
|
57
|
+
if y_index
|
58
|
+
row_content = output[y_index].last
|
59
|
+
row_content += new_row_content
|
60
|
+
output[y_index] = [matching_y,row_content]
|
61
|
+
else
|
62
|
+
output << [matching_y,new_row_content]
|
63
|
+
end
|
56
64
|
end
|
57
65
|
output
|
58
66
|
end
|
@@ -69,21 +77,24 @@ class PDF::Reader::Turtletext
|
|
69
77
|
end
|
70
78
|
|
71
79
|
# Returns an array of text elements found within the x,y limits,
|
80
|
+
# x ranges from +xmin+ (left of page) to +xmax+ (right of page)
|
81
|
+
# y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
|
72
82
|
# Each line of text found is returned as an array element.
|
73
83
|
# Each line of text is an array of the separate text elements found on that line.
|
74
84
|
# [["first line first text", "first line last text"],["second line text"]]
|
75
85
|
def text_in_region(xmin,xmax,ymin,ymax,page=1)
|
76
86
|
text_map = content(page)
|
77
87
|
box = []
|
78
|
-
|
88
|
+
|
89
|
+
text_map.each do |y,text_row|
|
79
90
|
if y >= ymin && y<= ymax
|
80
91
|
row = []
|
81
|
-
|
92
|
+
text_row.each do |x,element|
|
82
93
|
if x >= xmin && x<= xmax
|
83
|
-
row <<
|
94
|
+
row << [x,element]
|
84
95
|
end
|
85
96
|
end
|
86
|
-
box << row unless row.empty?
|
97
|
+
box << row.sort{|a,b| a.first <=> b.first }.map(&:last) unless row.empty?
|
87
98
|
end
|
88
99
|
end
|
89
100
|
box
|
@@ -94,7 +105,11 @@ class PDF::Reader::Turtletext
|
|
94
105
|
# +text+ may be a string (exact match required) or a Regexp
|
95
106
|
def text_position(text,page=1)
|
96
107
|
item = if text.class <= Regexp
|
97
|
-
content(page).map
|
108
|
+
content(page).map do |k,v|
|
109
|
+
if x = v.reduce(nil){|memo,vv| memo = (vv[1] =~ text) ? vv[0] : memo }
|
110
|
+
[k,x]
|
111
|
+
end
|
112
|
+
end
|
98
113
|
else
|
99
114
|
content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
|
100
115
|
end
|
@@ -104,17 +119,30 @@ class PDF::Reader::Turtletext
|
|
104
119
|
end
|
105
120
|
end
|
106
121
|
|
107
|
-
#
|
108
|
-
#
|
122
|
+
# Returns a text region definition using a descriptive block.
|
123
|
+
#
|
124
|
+
# Usage:
|
125
|
+
#
|
126
|
+
# textangle = reader.bounding_box do
|
127
|
+
# page 1
|
128
|
+
# below /electricity/i
|
129
|
+
# above 10
|
130
|
+
# right_of 240.0
|
131
|
+
# left_of "Total ($)"
|
132
|
+
# end
|
133
|
+
# textangle.text
|
134
|
+
#
|
135
|
+
# Alternatively, an explicit block parameter may be used:
|
109
136
|
#
|
110
|
-
#
|
111
|
-
#
|
112
|
-
#
|
113
|
-
#
|
114
|
-
#
|
115
|
-
#
|
116
|
-
#
|
117
|
-
#
|
137
|
+
# textangle = reader.bounding_box do |r|
|
138
|
+
# r.page 1
|
139
|
+
# r.below /electricity/i
|
140
|
+
# r.above 10
|
141
|
+
# r.right_of 240.0
|
142
|
+
# r.left_of "Total ($)"
|
143
|
+
# end
|
144
|
+
# textangle.text
|
145
|
+
# => [['string','string'],['string']] # array of rows, each row is an array of column text element
|
118
146
|
#
|
119
147
|
def bounding_box(&block)
|
120
148
|
PDF::Reader::Turtletext::Textangle.new(self,&block)
|
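The reworked `fuzzed_y` above merges rows whose y positions differ by less than `y_precision` and returns them ordered by decreasing y. As a rough illustration only, the same grouping can be sketched on a plain hash; `group_rows` is a hypothetical name, the default precision of 3 is taken from the unit spec, and the code is a simplified restatement, not the gem's method.

```ruby
# Group { y => { x => text } } into [[y, [[x, text], ...]], ...],
# ordered from the top of the page down. Rows closer together than
# y_precision are merged into the first (highest) row of the group.
def group_rows(input, y_precision = 3)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    existing = output.find { |y, _elements| (y - precise_y).abs < y_precision }
    if existing
      existing[1].concat(input[precise_y].to_a)
    else
      output << [precise_y, input[precise_y].to_a]
    end
  end
  output
end

p group_rows({ 20.0 => { 10.0 => "a first bit of text" },
               18.0 => { 12.0 => "a second bit of text" } })
```

With the inputs shown, the two rows are only 2 units apart, so they collapse into a single row at y = 20.0, matching the `:with_fuzzed_y_text` expectation in the spec later in this diff.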
data/lib/pdf/reader/turtletext/textangle.rb
CHANGED
@@ -1,27 +1,103 @@
|
|
1
1
|
# A DSL syntax for text extraction.
|
2
|
-
# WIP - not using this yet
|
3
2
|
#
|
4
|
-
# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do
|
5
|
-
# page 1
|
6
|
-
# below "Electricity Services"
|
7
|
-
# above "Gas Services by City Gas Pte Ltd"
|
8
|
-
# right_of 240.0
|
9
|
-
# left_of "Total ($)"
|
3
|
+
# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do |r|
|
4
|
+
# r.page = 1
|
5
|
+
# r.below = "Electricity Services"
|
6
|
+
# r.above = "Gas Services by City Gas Pte Ltd"
|
7
|
+
# r.right_of = 240.0
|
8
|
+
# r.left_of = "Total ($)"
|
10
9
|
# end
|
11
10
|
# textangle.text
|
12
11
|
#
|
13
12
|
class PDF::Reader::Turtletext::Textangle
|
14
13
|
attr_reader :reader
|
15
|
-
|
14
|
+
attr_accessor :page
|
15
|
+
attr_writer :above,:below,:left_of,:right_of
|
16
16
|
|
17
|
-
# +
|
18
|
-
def initialize(
|
19
|
-
@reader =
|
20
|
-
|
17
|
+
# +turtletext_reader+ is a PDF::Reader::Turtletext
|
18
|
+
def initialize(turtletext_reader,&block)
|
19
|
+
@reader = turtletext_reader
|
20
|
+
@page = 1
|
21
|
+
if block_given?
|
22
|
+
if block.arity == 1
|
23
|
+
yield self
|
24
|
+
else
|
25
|
+
instance_eval &block
|
26
|
+
end
|
27
|
+
end
|
21
28
|
end
|
22
29
|
|
30
|
+
def above(*args)
|
31
|
+
if value = args.first
|
32
|
+
@above = value
|
33
|
+
end
|
34
|
+
@above
|
35
|
+
end
|
36
|
+
|
37
|
+
def below(*args)
|
38
|
+
if value = args.first
|
39
|
+
@below = value
|
40
|
+
end
|
41
|
+
@below
|
42
|
+
end
|
43
|
+
|
44
|
+
def left_of(*args)
|
45
|
+
if value = args.first
|
46
|
+
@left_of = value
|
47
|
+
end
|
48
|
+
@left_of
|
49
|
+
end
|
50
|
+
|
51
|
+
def right_of(*args)
|
52
|
+
if value = args.first
|
53
|
+
@right_of = value
|
54
|
+
end
|
55
|
+
@right_of
|
56
|
+
end
|
57
|
+
|
58
|
+
# Returns the text
|
23
59
|
def text
|
24
|
-
|
60
|
+
return unless reader
|
61
|
+
|
62
|
+
xmin = if right_of
|
63
|
+
if [Fixnum,Float].include?(right_of.class)
|
64
|
+
right_of
|
65
|
+
else
|
66
|
+
reader.text_position(right_of,page)[:x] + 1
|
67
|
+
end
|
68
|
+
else
|
69
|
+
0
|
70
|
+
end
|
71
|
+
xmax = if left_of
|
72
|
+
if [Fixnum,Float].include?(left_of.class)
|
73
|
+
left_of
|
74
|
+
else
|
75
|
+
reader.text_position(left_of,page)[:x] - 1
|
76
|
+
end
|
77
|
+
else
|
78
|
+
99999 # TODO actual limit
|
79
|
+
end
|
80
|
+
|
81
|
+
ymin = if above
|
82
|
+
if [Fixnum,Float].include?(above.class)
|
83
|
+
above
|
84
|
+
else
|
85
|
+
reader.text_position(above,page)[:y] + 1
|
86
|
+
end
|
87
|
+
else
|
88
|
+
0
|
89
|
+
end
|
90
|
+
ymax = if below
|
91
|
+
if [Fixnum,Float].include?(below.class)
|
92
|
+
below
|
93
|
+
else
|
94
|
+
reader.text_position(below,page)[:y] - 1
|
95
|
+
end
|
96
|
+
else
|
97
|
+
99999 # TODO actual limit
|
98
|
+
end
|
99
|
+
|
100
|
+
reader.text_in_region(xmin,xmax,ymin,ymax,page)
|
25
101
|
end
|
26
102
|
|
27
103
|
end
|
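The Textangle constructor above accepts a block either with an explicit parameter or without one, switching on `block.arity`. Below is a small self-contained sketch of that pattern; `RegionSketch` and `coordinate?` are hypothetical names used only for illustration. The sketch checks for numbers with `Numeric` instead of `[Fixnum,Float]`, because the `Fixnum` constant was deprecated in Ruby 2.4 and removed in Ruby 3.2, so the check as written in the diff would need a similar change on current Rubies.

```ruby
class RegionSketch
  attr_writer :below, :left_of

  def initialize(&block)
    return unless block
    if block.arity == 1
      block.call(self)        # explicit form: do |r| r.below = ... end
    else
      instance_eval(&block)   # implicit form: do below ... end
    end
  end

  # Reader and setter in one: `below "x"` sets the value, `below` reads it.
  def below(*args)
    @below = args.first unless args.empty?
    @below
  end

  def left_of(*args)
    @left_of = args.first unless args.empty?
    @left_of
  end

  # True when the limit is already a coordinate and needs no text lookup.
  def coordinate?(value)
    value.is_a?(Numeric)
  end
end

implicit = RegionSketch.new do
  below "Table Header"
  left_of 240.0
end
explicit = RegionSketch.new { |r| r.below = 20 }
p implicit.below, implicit.left_of, explicit.below
```

One trade-off of the `instance_eval` form is that the block runs with the region object as `self`, so instance variables and private methods of the caller are not reachable inside it; the explicit-parameter form avoids that.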
data/pdf-reader-turtletext.gemspec
CHANGED
@@ -5,11 +5,11 @@
|
|
5
5
|
|
6
6
|
Gem::Specification.new do |s|
|
7
7
|
s.name = "pdf-reader-turtletext"
|
8
|
-
s.version = "0.
|
8
|
+
s.version = "0.2.0"
|
9
9
|
|
10
10
|
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
|
11
11
|
s.authors = ["Paul Gallagher"]
|
12
|
-
s.date = "2012-07-
|
12
|
+
s.date = "2012-07-31"
|
13
13
|
s.description = "a library that can read structured and positional text from PDFs. Ideal for assembling structured data from invoices and the like."
|
14
14
|
s.email = "gallagher.paul@gmail.com"
|
15
15
|
s.extra_rdoc_files = [
|
@@ -20,6 +20,7 @@ Gem::Specification.new do |s|
|
|
20
20
|
".rspec",
|
21
21
|
".rvmrc",
|
22
22
|
".travis.yml",
|
23
|
+
"CHANGELOG",
|
23
24
|
"Gemfile",
|
24
25
|
"Gemfile.lock",
|
25
26
|
"Guardfile",
|
@@ -34,8 +35,10 @@ Gem::Specification.new do |s|
|
|
34
35
|
"lib/pdf/reader/turtletext/version.rb",
|
35
36
|
"pdf-reader-turtletext.gemspec",
|
36
37
|
"spec/fixtures/pdf_samples/.gitkeep",
|
38
|
+
"spec/fixtures/pdf_samples/expectations.yml",
|
37
39
|
"spec/fixtures/pdf_samples/hello_world.pdf",
|
38
40
|
"spec/fixtures/pdf_samples/junk_prefix.pdf",
|
41
|
+
"spec/fixtures/pdf_samples/simple_table_text.pdf",
|
39
42
|
"spec/integration/pdf_samples_spec.rb",
|
40
43
|
"spec/spec_helper.rb",
|
41
44
|
"spec/support/pdf_samples_helper.rb",
|
data/spec/fixtures/pdf_samples/expectations.yml
ADDED
@@ -0,0 +1,95 @@
|
|
1
|
+
# this file defines the test expectations for PDF samples in spec/fixtures/pdf_samples.
|
2
|
+
#
|
3
|
+
# This is a YAML-format file, so beware that indentation is significant
|
4
|
+
---
|
5
|
+
hello_world.pdf:
|
6
|
+
:test_above:
|
7
|
+
:above: 100
|
8
|
+
:expected_text:
|
9
|
+
-
|
10
|
+
- "Hello World"
|
11
|
+
:test_below:
|
12
|
+
:below: 900
|
13
|
+
:expected_text:
|
14
|
+
-
|
15
|
+
- "Hello World"
|
16
|
+
:test_below_na:
|
17
|
+
:below: 10
|
18
|
+
:expected_text: []
|
19
|
+
simple_table_text.pdf:
|
20
|
+
:test_above:
|
21
|
+
:above: Table Header
|
22
|
+
:expected_text:
|
23
|
+
-
|
24
|
+
- "Simple Table Text"
|
25
|
+
:test_below:
|
26
|
+
:below: row 2
|
27
|
+
:expected_text:
|
28
|
+
-
|
29
|
+
- "Table Footer"
|
30
|
+
:test_right_of:
|
31
|
+
:right_of: row 1
|
32
|
+
:expected_text:
|
33
|
+
-
|
34
|
+
- "val 1"
|
35
|
+
- "val 2"
|
36
|
+
- "val 3"
|
37
|
+
-
|
38
|
+
- "val 1"
|
39
|
+
- "val 2"
|
40
|
+
- "val 3"
|
41
|
+
:test_left_of:
|
42
|
+
:left_of: val 1
|
43
|
+
:expected_text:
|
44
|
+
-
|
45
|
+
- "Simple Table Text"
|
46
|
+
-
|
47
|
+
- "Table Header"
|
48
|
+
-
|
49
|
+
- "row 1"
|
50
|
+
-
|
51
|
+
- "row 2"
|
52
|
+
-
|
53
|
+
- "Table Footer"
|
54
|
+
:test_above_and_below:
|
55
|
+
:below: Table Header
|
56
|
+
:above: Table Footer
|
57
|
+
:expected_text:
|
58
|
+
-
|
59
|
+
- "row 1"
|
60
|
+
- "val 1"
|
61
|
+
- "val 2"
|
62
|
+
- "val 3"
|
63
|
+
-
|
64
|
+
- "row 2"
|
65
|
+
- "val 1"
|
66
|
+
- "val 2"
|
67
|
+
- "val 3"
|
68
|
+
:test_above_and_below_and_left_of:
|
69
|
+
:below: Table Header
|
70
|
+
:above: Table Footer
|
71
|
+
:left_of: val 2
|
72
|
+
:expected_text:
|
73
|
+
-
|
74
|
+
- "row 1"
|
75
|
+
- "val 1"
|
76
|
+
-
|
77
|
+
- "row 2"
|
78
|
+
- "val 1"
|
79
|
+
:test_above_and_below_and_left_of_and_right_of:
|
80
|
+
:below: Table Header
|
81
|
+
:above: Table Footer
|
82
|
+
:left_of: val 2
|
83
|
+
:right_of: row 1
|
84
|
+
:expected_text:
|
85
|
+
-
|
86
|
+
- "val 1"
|
87
|
+
-
|
88
|
+
- "val 1"
|
89
|
+
|
90
|
+
|
91
|
+
|
92
|
+
|
93
|
+
|
94
|
+
|
95
|
+
|
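The expectations file above relies on YAML indentation, which this rendering of the diff does not preserve. The sketch below shows one plausible nesting for two of the entries and how they load in Ruby; the indentation is an assumption inferred from how the integration spec reads the data (a string file name, then symbol keys), and `YAML.safe_load` with `permitted_classes` assumes a reasonably recent Psych.

```ruby
require 'yaml'

# Assumed layout: file name => test name => constraints and expected rows.
source = <<~EXPECTATIONS
  hello_world.pdf:
    :test_above:
      :above: 100
      :expected_text:
        -
          - "Hello World"
    :test_below_na:
      :below: 10
      :expected_text: []
EXPECTATIONS

# Keys written as ":name" load as Ruby symbols, so Symbol must be permitted.
expectations = YAML.safe_load(source, permitted_classes: [Symbol])

expectations.each do |sample_name, tests|
  tests.each do |test_name, spec|
    puts "#{sample_name} #{test_name}: #{spec[:expected_text].inspect}"
  end
end
```

Each `:expected_text` value is an array of rows, and each row is an array of strings, which is the same shape that `bounding_box.text` is compared against in the integration spec.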
data/spec/fixtures/pdf_samples/simple_table_text.pdf
ADDED
@@ -0,0 +1,139 @@
|
|
1
|
+
%PDF-1.3
|
2
|
+
%����
|
3
|
+
1 0 obj
|
4
|
+
<< /Creator <feff0050007200610077006e>
|
5
|
+
/Producer <feff0050007200610077006e>
|
6
|
+
>>
|
7
|
+
endobj
|
8
|
+
2 0 obj
|
9
|
+
<< /Type /Catalog
|
10
|
+
/Pages 3 0 R
|
11
|
+
>>
|
12
|
+
endobj
|
13
|
+
3 0 obj
|
14
|
+
<< /Type /Pages
|
15
|
+
/Count 1
|
16
|
+
/Kids [5 0 R]
|
17
|
+
>>
|
18
|
+
endobj
|
19
|
+
4 0 obj
|
20
|
+
<< /Length 795
|
21
|
+
>>
|
22
|
+
stream
|
23
|
+
q
|
24
|
+
|
25
|
+
BT
|
26
|
+
36 747.384 Td
|
27
|
+
/F1.0 12 Tf
|
28
|
+
[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>] TJ
|
29
|
+
ET
|
30
|
+
|
31
|
+
|
32
|
+
BT
|
33
|
+
46 327.384 Td
|
34
|
+
/F1.0 12 Tf
|
35
|
+
[<54> 120 <6162> 20 <6c6520486561646572>] TJ
|
36
|
+
ET
|
37
|
+
|
38
|
+
|
39
|
+
BT
|
40
|
+
46 277.384 Td
|
41
|
+
/F1.0 12 Tf
|
42
|
+
[<726f> 15 <772031>] TJ
|
43
|
+
ET
|
44
|
+
|
45
|
+
|
46
|
+
BT
|
47
|
+
136 277.384 Td
|
48
|
+
/F1.0 12 Tf
|
49
|
+
[<76> 25 <616c2031>] TJ
|
50
|
+
ET
|
51
|
+
|
52
|
+
|
53
|
+
BT
|
54
|
+
186 277.384 Td
|
55
|
+
/F1.0 12 Tf
|
56
|
+
[<76> 25 <616c2032>] TJ
|
57
|
+
ET
|
58
|
+
|
59
|
+
|
60
|
+
BT
|
61
|
+
236 277.384 Td
|
62
|
+
/F1.0 12 Tf
|
63
|
+
[<76> 25 <616c2033>] TJ
|
64
|
+
ET
|
65
|
+
|
66
|
+
|
67
|
+
BT
|
68
|
+
46 227.38400000000001 Td
|
69
|
+
/F1.0 12 Tf
|
70
|
+
[<726f> 15 <772032>] TJ
|
71
|
+
ET
|
72
|
+
|
73
|
+
|
74
|
+
BT
|
75
|
+
136 227.38400000000001 Td
|
76
|
+
/F1.0 12 Tf
|
77
|
+
[<76> 25 <616c2031>] TJ
|
78
|
+
ET
|
79
|
+
|
80
|
+
|
81
|
+
BT
|
82
|
+
186 227.38400000000001 Td
|
83
|
+
/F1.0 12 Tf
|
84
|
+
[<76> 25 <616c2032>] TJ
|
85
|
+
ET
|
86
|
+
|
87
|
+
|
88
|
+
BT
|
89
|
+
236 227.38400000000001 Td
|
90
|
+
/F1.0 12 Tf
|
91
|
+
[<76> 25 <616c2033>] TJ
|
92
|
+
ET
|
93
|
+
|
94
|
+
|
95
|
+
BT
|
96
|
+
46 177.38400000000001 Td
|
97
|
+
/F1.0 12 Tf
|
98
|
+
[<54> 120 <6162> 20 <6c652046> 30 <6f6f746572>] TJ
|
99
|
+
ET
|
100
|
+
|
101
|
+
Q
|
102
|
+
|
103
|
+
endstream
|
104
|
+
endobj
|
105
|
+
5 0 obj
|
106
|
+
<< /Type /Page
|
107
|
+
/Parent 3 0 R
|
108
|
+
/MediaBox [0 0 612.0 792.0]
|
109
|
+
/Contents 4 0 R
|
110
|
+
/Resources << /ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
|
111
|
+
/Font << /F1.0 6 0 R
|
112
|
+
>>
|
113
|
+
>>
|
114
|
+
>>
|
115
|
+
endobj
|
116
|
+
6 0 obj
|
117
|
+
<< /Type /Font
|
118
|
+
/Subtype /Type1
|
119
|
+
/BaseFont /Helvetica
|
120
|
+
/Encoding /WinAnsiEncoding
|
121
|
+
>>
|
122
|
+
endobj
|
123
|
+
xref
|
124
|
+
0 7
|
125
|
+
0000000000 65535 f
|
126
|
+
0000000015 00000 n
|
127
|
+
0000000109 00000 n
|
128
|
+
0000000158 00000 n
|
129
|
+
0000000215 00000 n
|
130
|
+
0000001061 00000 n
|
131
|
+
0000001239 00000 n
|
132
|
+
trailer
|
133
|
+
<< /Size 7
|
134
|
+
/Root 2 0 R
|
135
|
+
/Info 1 0 R
|
136
|
+
>>
|
137
|
+
startxref
|
138
|
+
1336
|
139
|
+
%%EOF
|
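The text in the fixture above is stored as hex strings inside `TJ` arrays, with the numbers between the strings acting as kerning adjustments. As a quick way to read the fixture by eye, the hex strings can be decoded with a few lines of Ruby; `decode_tj` is a hypothetical helper for illustration and is not how PDF::Reader itself parses content streams.

```ruby
# Decode the hex strings of a TJ array and join them,
# ignoring the numeric kerning adjustments between them.
def decode_tj(array_source)
  array_source.scan(/<([0-9a-fA-F]+)>/).flatten.map { |hex| [hex].pack("H*") }.join
end

puts decode_tj("[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>] TJ")
puts decode_tj("[<54> 120 <6162> 20 <6c6520486561646572>] TJ")
puts decode_tj("[<726f> 15 <772031>] TJ")
```

The three calls print "Simple Table Text", "Table Header" and "row 1", which are the strings the sample generator writes and the expectations file refers to.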
data/spec/integration/pdf_samples_spec.rb
CHANGED
@@ -3,5 +3,33 @@ include PdfSamplesHelper
|
|
3
3
|
|
4
4
|
describe "PDF Samples" do
|
5
5
|
|
6
|
+
# This will scan all *.pdf files in spec/fixtures/personal_pdf_samples
|
7
|
+
# and do basic verification of the file structure without any effort from you.
|
8
|
+
pdf_sample_expectations.each do |sample_name,test_specifications|
|
9
|
+
describe "sample" do
|
10
|
+
let(:options) { test_specifications[:options] || {} }
|
11
|
+
let(:sample_file) { pdf_sample(sample_name) }
|
12
|
+
let(:turtletext_reader) { PDF::Reader::Turtletext.new(sample_file,options) }
|
13
|
+
|
14
|
+
(test_specifications||{}).each do |test_name,expectations|
|
15
|
+
context test_name do
|
16
|
+
let(:bounding_box) {
|
17
|
+
turtletext_reader.bounding_box do
|
18
|
+
above expectations[:above]
|
19
|
+
below expectations[:below]
|
20
|
+
left_of expectations[:left_of]
|
21
|
+
right_of expectations[:right_of]
|
22
|
+
end
|
23
|
+
}
|
24
|
+
# it {
|
25
|
+
# puts "bounding_box"
|
26
|
+
# puts bounding_box.inspect
|
27
|
+
# }
|
28
|
+
subject { bounding_box.text }
|
29
|
+
it { should eql(expectations[:expected_text])}
|
30
|
+
end
|
31
|
+
end
|
32
|
+
end
|
33
|
+
end
|
6
34
|
|
7
35
|
end
|
data/spec/support/pdf_samples_helper.rb
CHANGED
@@ -31,6 +31,7 @@ module PdfSamplesHelper
|
|
31
31
|
require 'prawn'
|
32
32
|
puts "Making PDF samples for tests.."
|
33
33
|
make_sample_hello_world
|
34
|
+
make_sample_simple_table_text
|
34
35
|
end
|
35
36
|
|
36
37
|
def make_sample_hello_world
|
@@ -40,4 +41,26 @@ module PdfSamplesHelper
|
|
40
41
|
end
|
41
42
|
puts "Created: #{filename}"
|
42
43
|
end
|
44
|
+
|
45
|
+
def make_sample_simple_table_text
|
46
|
+
filename = pdf_sample('simple_table_text.pdf')
|
47
|
+
Prawn::Document.generate filename do
|
48
|
+
text "Simple Table Text"
|
49
|
+
text_box "Table Header", :at => [10, 300], :width => 200
|
50
|
+
|
51
|
+
text_box "row 1", :at => [10, 250], :width => 90
|
52
|
+
text_box "val 1", :at => [100, 250], :width => 50
|
53
|
+
text_box "val 2", :at => [150, 250], :width => 50
|
54
|
+
text_box "val 3", :at => [200, 250], :width => 50
|
55
|
+
|
56
|
+
text_box "row 2", :at => [10, 200], :width => 90
|
57
|
+
text_box "val 1", :at => [100, 200], :width => 50
|
58
|
+
text_box "val 2", :at => [150, 200], :width => 50
|
59
|
+
text_box "val 3", :at => [200, 200], :width => 50
|
60
|
+
|
61
|
+
text_box "Table Footer", :at => [10, 150], :width => 200
|
62
|
+
end
|
63
|
+
puts "Created: #{filename}"
|
64
|
+
end
|
65
|
+
|
43
66
|
end
|
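The `:at` positions passed to `text_box` above differ from the `Td` coordinates in the generated fixture (for example `[10, 300]` versus `46 327.384`). The arithmetic below is a sketch of one explanation that reproduces the fixture's numbers; it assumes a 36 point page margin and a Helvetica ascender of 718/1000 em at 12 points, both of which are assumptions rather than values stated in this diff, and `pdf_position` is a hypothetical helper.

```ruby
MARGIN    = 36.0               # assumed default page margin, in points
FONT_SIZE = 12                 # from "/F1.0 12 Tf" in the fixture
ASCENDER  = 718 / 1000.0 * FONT_SIZE  # assumed Helvetica ascender, 8.616 pt

# Map a text_box :at => [x, top] to the absolute baseline position
# that would appear in the PDF content stream.
def pdf_position(at)
  x, top = at
  [MARGIN + x, MARGIN + top - ASCENDER]
end

[[10, 300], [100, 250], [10, 150]].each do |at|
  x, y = pdf_position(at)
  puts format("%-10s -> %.3f %.3f", at.inspect, x, y)
end
```

Under these assumptions the y values read back by Turtletext are text baselines, not the tops of the boxes, which is worth keeping in mind when choosing numeric `above` and `below` limits.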
data/spec/unit/reader/turtletext/textangle_spec.rb
CHANGED
@@ -3,4 +3,197 @@ require 'spec_helper'
|
|
3
3
|
describe PDF::Reader::Turtletext::Textangle do
|
4
4
|
let(:resource_class) { PDF::Reader::Turtletext::Textangle }
|
5
5
|
|
6
|
+
let(:source) { nil } # we're just going to mock the PDF source here
|
7
|
+
let(:options) { {} }
|
8
|
+
let(:turtletext_reader) { PDF::Reader::Turtletext.new(source,options) }
|
9
|
+
|
10
|
+
|
11
|
+
describe "#reader" do
|
12
|
+
let(:textangle) { resource_class.new(turtletext_reader) }
|
13
|
+
subject { textangle.reader }
|
14
|
+
it { should be_a(PDF::Reader::Turtletext) }
|
15
|
+
end
|
16
|
+
|
17
|
+
describe "#text" do
|
18
|
+
let(:page) { 1 }
|
19
|
+
before do
|
20
|
+
turtletext_reader.stub(:load_content).and_return(given_page_content)
|
21
|
+
end
|
22
|
+
let(:given_page_content) { {
|
23
|
+
70.0=>{10.0=>"crunchy bacon"},
|
24
|
+
40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
|
25
|
+
30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
|
26
|
+
10.0=>{40.0=>"smoked and streaky for me"}
|
27
|
+
} }
|
28
|
+
|
29
|
+
context "with block param" do
|
30
|
+
[:above,:below,:left_of,:right_of].each do |positional_method|
|
31
|
+
context "with #{positional_method}" do
|
32
|
+
let(:term) { "canary" }
|
33
|
+
|
34
|
+
it "should work with block param" do
|
35
|
+
textangle = resource_class.new(turtletext_reader) do |r|
|
36
|
+
r.send("#{positional_method}=",term)
|
37
|
+
end
|
38
|
+
textangle.send(positional_method).should eql(term)
|
39
|
+
end
|
40
|
+
|
41
|
+
end
|
42
|
+
end
|
43
|
+
end
|
44
|
+
|
45
|
+
context "without block param" do
|
46
|
+
it "#above should work" do
|
47
|
+
textangle = resource_class.new(turtletext_reader) do
|
48
|
+
above "canary"
|
49
|
+
end
|
50
|
+
textangle.above.should eql("canary")
|
51
|
+
end
|
52
|
+
it "#below should work" do
|
53
|
+
textangle = resource_class.new(turtletext_reader) do
|
54
|
+
below "canary"
|
55
|
+
end
|
56
|
+
textangle.below.should eql("canary")
|
57
|
+
end
|
58
|
+
it "#left_of should work" do
|
59
|
+
textangle = resource_class.new(turtletext_reader) do
|
60
|
+
left_of "canary"
|
61
|
+
end
|
62
|
+
textangle.left_of.should eql("canary")
|
63
|
+
end
|
64
|
+
it "#right_of should work" do
|
65
|
+
textangle = resource_class.new(turtletext_reader) do
|
66
|
+
right_of "canary"
|
67
|
+
end
|
68
|
+
textangle.right_of.should eql("canary")
|
69
|
+
end
|
70
|
+
end
|
71
|
+
|
72
|
+
context "when only below specified" do
|
73
|
+
context "as a string" do
|
74
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
75
|
+
r.below = "fraud"
|
76
|
+
end }
|
77
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
78
|
+
subject { textangle.text }
|
79
|
+
it { should eql(expected) }
|
80
|
+
end
|
81
|
+
context "as a regex" do
|
82
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
83
|
+
r.below = /Fraud/i
|
84
|
+
end }
|
85
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
86
|
+
subject { textangle.text }
|
87
|
+
it { should eql(expected) }
|
88
|
+
end
|
89
|
+
context "as a number" do
|
90
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
91
|
+
r.below = 20
|
92
|
+
end }
|
93
|
+
let(:expected) { [["smoked and streaky for me"]]}
|
94
|
+
subject { textangle.text }
|
95
|
+
it { should eql(expected) }
|
96
|
+
end
|
97
|
+
end
|
98
|
+
|
99
|
+
context "when only above specified" do
|
100
|
+
context "as a string" do
|
101
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
102
|
+
r.above = "heaven"
|
103
|
+
end }
|
104
|
+
let(:expected) { [["crunchy bacon"]]}
|
105
|
+
subject { textangle.text }
|
106
|
+
it { should eql(expected) }
|
107
|
+
end
|
108
|
+
context "as a regex" do
|
109
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
110
|
+
r.above = /heaVen/i
|
111
|
+
end }
|
112
|
+
let(:expected) { [["crunchy bacon"]]}
|
113
|
+
subject { textangle.text }
|
114
|
+
it { should eql(expected) }
|
115
|
+
end
|
116
|
+
context "as a number" do
|
117
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
118
|
+
r.above = 41
|
119
|
+
end }
|
120
|
+
let(:expected) { [["crunchy bacon"]]}
|
121
|
+
subject { textangle.text }
|
122
|
+
it { should eql(expected) }
|
123
|
+
end
|
124
|
+
end
|
125
|
+
|
126
|
+
context "when only left_of specified" do
|
127
|
+
context "as a string" do
|
128
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
129
|
+
r.left_of = "turkey bacon"
|
130
|
+
end }
|
131
|
+
let(:expected) { [
|
132
|
+
["crunchy bacon"],
|
133
|
+
["bacon on kimchi noodles", "heaven"]
|
134
|
+
] }
|
135
|
+
subject { textangle.text }
|
136
|
+
it { should eql(expected) }
|
137
|
+
end
|
138
|
+
context "as a regex" do
|
139
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
140
|
+
r.left_of = /turKey/i
|
141
|
+
end }
|
142
|
+
let(:expected) { [
|
143
|
+
["crunchy bacon"],
|
144
|
+
["bacon on kimchi noodles", "heaven"]
|
145
|
+
] }
|
146
|
+
subject { textangle.text }
|
147
|
+
it { should eql(expected) }
|
148
|
+
end
|
149
|
+
context "as a number" do
|
150
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
151
|
+
r.left_of = 29
|
152
|
+
end }
|
153
|
+
let(:expected) { [
|
154
|
+
["crunchy bacon"],
|
155
|
+
["bacon on kimchi noodles", "heaven"]
|
156
|
+
] }
|
157
|
+
subject { textangle.text }
|
158
|
+
it { should eql(expected) }
|
159
|
+
end
|
160
|
+
end
|
161
|
+
|
162
|
+
context "when only right_of specified" do
|
163
|
+
context "as a string" do
|
164
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
165
|
+
r.right_of = "heaven"
|
166
|
+
end }
|
167
|
+
let(:expected) { [
|
168
|
+
["turkey bacon","fraud"],
|
169
|
+
["smoked and streaky for me"]
|
170
|
+
] }
|
171
|
+
subject { textangle.text }
|
172
|
+
it { should eql(expected) }
|
173
|
+
end
|
174
|
+
context "as a regex" do
|
175
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
176
|
+
r.right_of = /Heaven/i
|
177
|
+
end }
|
178
|
+
let(:expected) { [
|
179
|
+
["turkey bacon","fraud"],
|
180
|
+
["smoked and streaky for me"]
|
181
|
+
] }
|
182
|
+
subject { textangle.text }
|
183
|
+
it { should eql(expected) }
|
184
|
+
end
|
185
|
+
context "as a number" do
|
186
|
+
let(:textangle) { resource_class.new(turtletext_reader) do |r|
|
187
|
+
r.right_of = 26
|
188
|
+
end }
|
189
|
+
let(:expected) { [
|
190
|
+
["turkey bacon","fraud"],
|
191
|
+
["smoked and streaky for me"]
|
192
|
+
] }
|
193
|
+
subject { textangle.text }
|
194
|
+
it { should eql(expected) }
|
195
|
+
end
|
196
|
+
end
|
197
|
+
|
198
|
+
end
|
6
199
|
end
|
data/spec/unit/reader/turtletext/turtletext_spec.rb
CHANGED
@@ -4,16 +4,16 @@ describe PDF::Reader::Turtletext do
|
|
4
4
|
let(:resource_class) { PDF::Reader::Turtletext }
|
5
5
|
|
6
6
|
let(:source) { nil } # we're just going to mock the PDF source here
|
7
|
-
let(:
|
7
|
+
let(:turtletext_reader) { resource_class.new(source,options) }
|
8
8
|
let(:options) { {} }
|
9
9
|
|
10
10
|
describe "#reader" do
|
11
|
-
subject {
|
11
|
+
subject { turtletext_reader.reader}
|
12
12
|
it { should be_a(PDF::Reader) }
|
13
13
|
end
|
14
14
|
|
15
15
|
describe "#y_precision" do
|
16
|
-
subject {
|
16
|
+
subject { turtletext_reader.y_precision}
|
17
17
|
context "default" do
|
18
18
|
it { should eql(3) }
|
19
19
|
end
|
@@ -27,35 +27,40 @@ describe PDF::Reader::Turtletext do
   context "with mocked source content" do
     let(:page) { 1 }
     before do
-
+      turtletext_reader.should_receive(:load_content).with(page).and_return(given_page_content)
     end
 
     {
       :with_simple_text => {
         :source_page_content => {10.0=>{10.0=>"a first bit of text"}},
         :expected_precise_content => {10.0=>{10.0=>"a first bit of text"}},
-        :expected_fuzzed_content =>
+        :expected_fuzzed_content => [[10.0,[[10.0,"a first bit of text"]]]]
       },
       :with_widely_separated_text => {
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}}
-      },
-      :with_unsorted_y_text => {
         :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
         :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content =>
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"]]], [10.0, [[20.0, "a second bit of text"]]]]
+      },
+      :with_unsorted_y_text => {
+        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[20.0, "a second bit of text"]]], [10.0, [[10.0, "a first bit of text"]]]]
       },
       :with_fuzzed_y_text => {
-        :source_page_content => {
-        :expected_precise_content => {
-        :expected_fuzzed_content =>
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [12.0, "a second bit of text"]]]]
       },
       :with_widely_separated_fuzzed_y_text => {
         :y_precision => 25,
-        :source_page_content => {
-        :expected_precise_content => {
-        :expected_fuzzed_content =>
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [20.0, "a second bit of text"]]]]
+      },
+      :with_multiple_row_text => {
+        :source_page_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_precise_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_fuzzed_content => [[10.0, [[10.0, "first"], [20.0, "second"], [30.0, "third"]]]]
       }
     }.each do |test_name,test_expectations|
       context test_name do
@@ -69,12 +74,12 @@ describe PDF::Reader::Turtletext do
         }
 
         describe "#content" do
-          subject {
+          subject { turtletext_reader.content(page) }
           it { should eql(test_expectations[:expected_fuzzed_content]) }
         end
 
         describe "#precise_content" do
-          subject {
+          subject { turtletext_reader.precise_content(page) }
           it { should eql(test_expectations[:expected_precise_content]) }
         end
 
@@ -90,24 +95,24 @@ describe PDF::Reader::Turtletext do
       },
       :with_single_line_text => {
         :source_page_content => {
-
+          70.0=>{10.0=>"first line ignored"},
           30.0=>{10.0=>"first part found", 20.0=>"last part found"},
-
+          10.0=>{10.0=>"last line ignored"}
         },
         :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
         :expected_text => [["first part found", "last part found"]]
       },
       :with_multi_line_text => {
         :source_page_content => {
-
-
-
-
+          70.0=>{10.0=>"first line ignored"},
+          40.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
+          30.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
+          10.0=>{10.0=>"last line ignored"}
         },
         :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
         :expected_text => [
-          ["
-          ["
+          ["first line first part found", "first line last part found"],
+          ["last line first part found", "last line last part found"]
         ]
       }
     }.each do |test_name,test_expectations|
@@ -118,7 +123,7 @@ describe PDF::Reader::Turtletext do
         let(:ymin) { test_expectations[:ymin] }
         let(:ymax) { test_expectations[:ymax] }
         let(:expected_text) { test_expectations[:expected_text] }
-        subject {
+        subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
         it { should eql(expected_text) }
       end
     end
@@ -126,21 +131,21 @@ describe PDF::Reader::Turtletext do
 
     describe "#text_position" do
       let(:given_page_content) { {
-
-
-
-
+        70.0=>{10.0=>"crunchy bacon"},
+        40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+        30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+        10.0=>{40.0=>"smoked and streaky da bomb"}
       } }
       {
-        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>
-        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>
-        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>
-        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>
+        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>30.0} },
+        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>40.0} },
+        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>40.0} },
+        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>70.0} }
       }.each do |test_name,test_expectations|
         context test_name do
           let(:match_term) { test_expectations[:match_term] }
           let(:expected_position) { test_expectations[:expected_position] }
-          subject {
+          subject { turtletext_reader.text_position(match_term,page) }
           it { should eql(expected_position) }
         end
       end
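Note on the `#content` expectations in the spec diff above: the new `:expected_fuzzed_content` values are arrays of `[y, [[x, text], ...]]` rows, ordered from the largest y down, with rows merged when their y values differ by less than `y_precision` (default 3 per the `#y_precision` spec). A minimal sketch of a grouping that reproduces those expectations follows. `fuzz_rows` is a hypothetical helper written for illustration, not the gem's implementation, and the strict `<` comparison against the first y in each group is an assumption.

```ruby
# Hypothetical helper, NOT part of pdf-reader-turtletext.
# Turns precise content {y => {x => text}} into fuzzed rows
# [[y, [[x, text], ...]], ...], ordered from the largest y down.
def fuzz_rows(precise_content, y_precision = 3)
  rows = []
  # Visit rows from the largest y to the smallest.
  precise_content.sort_by { |y, _cells| -y }.each do |y, cells|
    anchor = rows.last
    # Assumption: merge when y is within y_precision of the group's first y.
    if anchor && (anchor[0] - y) < y_precision
      anchor[1].concat(cells.to_a)
    else
      rows << [y, cells.to_a]
    end
  end
  # Order the cells of each row by x, left to right.
  rows.map { |y, cells| [y, cells.sort_by(&:first)] }
end

p fuzz_rows({20.0=>{10.0=>"a first bit of text"}, 18.0=>{12.0=>"a second bit of text"}})
```

With the `:with_fuzzed_y_text` input shown in the usage line, this yields a single row anchored at y 20.0 containing both text fragments, which is the shape the spec expects.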
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader-turtletext
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-07-
+date: 2012-07-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pdf-reader
-  requirement: &
+  requirement: &70218189955060 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - =
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.1
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189955060
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &
+  requirement: &70218189954360 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.1.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189954360
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &
+  requirement: &70218189953580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189953580
 - !ruby/object:Gem::Dependency
   name: rake
-  requirement: &
+  requirement: &70218189953020 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.9.2.2
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189953020
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &70218189952200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.8.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189952200
 - !ruby/object:Gem::Dependency
   name: rdoc
-  requirement: &
+  requirement: &70218189951400 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
         version: '3.11'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189951400
 - !ruby/object:Gem::Dependency
   name: prawn
-  requirement: &
+  requirement: &70218189950700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,10 @@ dependencies:
         version: 0.12.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189950700
 - !ruby/object:Gem::Dependency
   name: guard-rspec
-  requirement: &
+  requirement: &70218189950100 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -98,7 +98,7 @@ dependencies:
         version: 1.2.0
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *70218189950100
 description: a library that can read structured and positional text from PDFs. Ideal
   for asembling structured data from invoices and the like.
 email: gallagher.paul@gmail.com
@@ -111,6 +111,7 @@ files:
 - .rspec
 - .rvmrc
 - .travis.yml
+- CHANGELOG
 - Gemfile
 - Gemfile.lock
 - Guardfile
@@ -125,8 +126,10 @@ files:
 - lib/pdf/reader/turtletext/version.rb
 - pdf-reader-turtletext.gemspec
 - spec/fixtures/pdf_samples/.gitkeep
+- spec/fixtures/pdf_samples/expectations.yml
 - spec/fixtures/pdf_samples/hello_world.pdf
 - spec/fixtures/pdf_samples/junk_prefix.pdf
+- spec/fixtures/pdf_samples/simple_table_text.pdf
 - spec/integration/pdf_samples_spec.rb
 - spec/spec_helper.rb
 - spec/support/pdf_samples_helper.rb