pdf-reader-turtletext 0.2.1 → 0.2.2

Sign up to get free protection for your applications and to get access to all the features.
data/CHANGELOG CHANGED
@@ -1,4 +1,9 @@
1
- Version 0.2.1 Release: n/a
1
+ Version 0.2.2 Release: 1st Aug 2012
2
+ ==================================================
3
+ * provide better control of inclusive/exclusive behaviour
4
+ for region selection \#4
5
+
6
+ Version 0.2.1 Release: 31st July 2012
2
7
  ==================================================
3
8
  * fix row sorting for Rubinius 1.8 mode
4
9
 
@@ -36,7 +36,9 @@ Then bundle install:
36
36
 
37
37
  === How do I install it for gem development?
38
38
 
39
- If you want to work on enhancements of fix bugs in PDF::Reader::Turtletext, fork and clone the github repository. See the section below on 'Contributing to PDF::Reader::Turtletext'
39
+ If you want to work on enhancements of fix bugs in PDF::Reader::Turtletext, fork and clone the github repository. If you are using bundler (recommended), run <tt>bundle</tt> to install development dependencies.
40
+
41
+ See the section below on 'Contributing to PDF::Reader::Turtletext' for more information.
40
42
 
41
43
  === How to instantiate Turtletext in code
42
44
 
@@ -70,6 +72,8 @@ Solution: use the <tt>bounding_box</tt> method to describe the region and extrac
70
72
 
71
73
  The range of methods that can be used within the <tt>bounding_box</tt> block are all optional, and include:
72
74
  * <tt>page</tt> - specifies the PDF page from which to extract text (default is 1).
75
+ * <tt>inclusive</tt> - whether region selection should be inclusive or exclusive of the specified positions
76
+ (default is false).
73
77
  * <tt>below</tt> - a string, regex or number that describes the upper limit of the text box
74
78
  (default is top border of the page).
75
79
  * <tt>above</tt> - a string, regex or number that describes the lower limit of the text box
@@ -98,17 +102,47 @@ An explicit block parameter may be used with the <tt>bounding_box</tt> method:
98
102
  textangle.text
99
103
  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
100
104
 
105
+ === How to describe an inclusive <tt>bounding_box</tt> region
106
+
107
+ By default, the <tt>bounding_box</tt> method makes exclusive selection (i.e. not including the
108
+ region limits).
109
+
110
+ To specifiy an inclusive region, use the <tt>inclusive!</tt> command:
111
+
112
+ textangle = reader.bounding_box do
113
+ inclusive!
114
+ below /electricity/i
115
+ left_of "Total ($)"
116
+ end
117
+
118
+ Alternatively, set <tt>inclusive</tt> to true:
119
+
120
+ textangle = reader.bounding_box do
121
+ inclusive true
122
+ below /electricity/i
123
+ left_of "Total ($)"
124
+ end
125
+
126
+ Or with a block parameter, you may also assign <tt>inclusive</tt> to true:
127
+
128
+ textangle = reader.bounding_box do |r|
129
+ r.inclusive = true
130
+ r.below /electricity/i
131
+ r.left_of "Total ($)"
132
+ end
133
+
101
134
  === Extract text for a region with known positional co-ordinates
102
135
 
103
136
  If you know (or can calculate) the x,y positions of the required text region, you can extract the region's
104
137
  text using the <tt>text_in_region</tt> method.
105
138
 
106
139
  text = reader.text_in_region(
107
- 10, # minimum x (left-most) (inclusive)
108
- 900, # maximum x (right-most) (inclusive)
109
- 200, # minimum y (bottom-most) (inclusive)
110
- 400, # maximum y (top-most) (inclusive)
111
- 1 # page
140
+ 10, # minimum x (left-most)
141
+ 900, # maximum x (right-most)
142
+ 200, # minimum y (bottom-most)
143
+ 400, # maximum y (top-most)
144
+ 1, # page (default 1)
145
+ false # inclusive of x/y position if true (default false)
112
146
  )
113
147
  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
114
148
 
@@ -76,21 +76,22 @@ class PDF::Reader::Turtletext
76
76
  end
77
77
  end
78
78
 
79
- # Returns an array of text elements found within the x,y limits,
80
- # x ranges from +xmin+ (left of page) to +xmax+ (right of page)
81
- # y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
82
- # Each line of text found is returned as an array element.
79
+ # Returns an array of text elements found within the x,y limits on +page+:
80
+ # * x ranges from +xmin+ (left of page) to +xmax+ (right of page)
81
+ # * y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
82
+ # When +inclusive+ is false (default) the x/y limits do not include the actual x/y value.
83
83
  # Each line of text is an array of the seperate text elements found on that line.
84
84
  # [["first line first text", "first line last text"],["second line text"]]
85
- def text_in_region(xmin,xmax,ymin,ymax,page=1)
85
+ def text_in_region(xmin,xmax,ymin,ymax,page=1,inclusive=false)
86
+ return [] unless xmin && xmax && ymin && ymax
86
87
  text_map = content(page)
87
88
  box = []
88
89
 
89
90
  text_map.each do |y,text_row|
90
- if y >= ymin && y<= ymax
91
+ if inclusive ? (y >= ymin && y <= ymax) : (y > ymin && y < ymax)
91
92
  row = []
92
93
  text_row.each do |x,element|
93
- if x >= xmin && x<= xmax
94
+ if inclusive ? (x >= xmin && x <= xmax) : (x > xmin && x < xmax)
94
95
  row << element
95
96
  end
96
97
  end
@@ -102,7 +103,8 @@ class PDF::Reader::Turtletext
102
103
 
103
104
  # Returns the position of +text+ on +page+
104
105
  # {x: val, y: val }
105
- # +text+ may be a string (exact match required) or a Regexp
106
+ # +text+ may be a string (exact match required) or a Regexp.
107
+ # Returns nil if the text cannot be found.
106
108
  def text_position(text,page=1)
107
109
  item = if text.class <= Regexp
108
110
  content(page).map do |k,v|
@@ -10,14 +10,15 @@
10
10
  # textangle.text
11
11
  #
12
12
  class PDF::Reader::Turtletext::Textangle
13
+
14
+ #
13
15
  attr_reader :reader
14
- attr_accessor :page
15
- attr_writer :above,:below,:left_of,:right_of
16
16
 
17
17
  # +turtletext_reader+ is a PDF::Reader::Turtletext
18
18
  def initialize(turtletext_reader,&block)
19
19
  @reader = turtletext_reader
20
20
  @page = 1
21
+ @inclusive = false
21
22
  if block_given?
22
23
  if block.arity == 1
23
24
  yield self
@@ -27,6 +28,34 @@ class PDF::Reader::Turtletext::Textangle
27
28
  end
28
29
  end
29
30
 
31
+ attr_writer :inclusive
32
+
33
+ def inclusive(*args)
34
+ if value = args.first
35
+ @inclusive = value
36
+ end
37
+ @inclusive
38
+ end
39
+
40
+ # Command: sets +inclusive true
41
+ def inclusive!
42
+ @inclusive = true
43
+ end
44
+
45
+ # Command: sets +inclusive false
46
+ def exclusive!
47
+ @inclusive = false
48
+ end
49
+
50
+ attr_writer :page
51
+ def page(*args)
52
+ if value = args.first
53
+ @page = value
54
+ end
55
+ @page
56
+ end
57
+
58
+ attr_writer :above
30
59
  def above(*args)
31
60
  if value = args.first
32
61
  @above = value
@@ -34,6 +63,7 @@ class PDF::Reader::Turtletext::Textangle
34
63
  @above
35
64
  end
36
65
 
66
+ attr_writer :below
37
67
  def below(*args)
38
68
  if value = args.first
39
69
  @below = value
@@ -41,6 +71,7 @@ class PDF::Reader::Turtletext::Textangle
41
71
  @below
42
72
  end
43
73
 
74
+ attr_writer :left_of
44
75
  def left_of(*args)
45
76
  if value = args.first
46
77
  @left_of = value
@@ -48,6 +79,7 @@ class PDF::Reader::Turtletext::Textangle
48
79
  @left_of
49
80
  end
50
81
 
82
+ attr_writer :right_of
51
83
  def right_of(*args)
52
84
  if value = args.first
53
85
  @right_of = value
@@ -55,15 +87,17 @@ class PDF::Reader::Turtletext::Textangle
55
87
  @right_of
56
88
  end
57
89
 
58
- # Returns the text
90
+ # Returns the text array found within the defined region.
91
+ # Each line of text is an array of the seperate text elements found on that line.
92
+ # [["first line first text", "first line last text"],["second line text"]]
59
93
  def text
60
94
  return unless reader
61
95
 
62
96
  xmin = if right_of
63
97
  if [Fixnum,Float].include?(right_of.class)
64
98
  right_of
65
- else
66
- reader.text_position(right_of,page)[:x] + 1
99
+ elsif xy = reader.text_position(right_of,page)
100
+ xy[:x]
67
101
  end
68
102
  else
69
103
  0
@@ -71,18 +105,18 @@ class PDF::Reader::Turtletext::Textangle
71
105
  xmax = if left_of
72
106
  if [Fixnum,Float].include?(left_of.class)
73
107
  left_of
74
- else
75
- reader.text_position(left_of,page)[:x] - 1
108
+ elsif xy = reader.text_position(left_of,page)
109
+ xy[:x]
76
110
  end
77
111
  else
78
- 99999 # TODO actual limit
112
+ 99999 # TODO: figure out the actual limit?
79
113
  end
80
114
 
81
115
  ymin = if above
82
116
  if [Fixnum,Float].include?(above.class)
83
117
  above
84
- else
85
- reader.text_position(above,page)[:y] + 1
118
+ elsif xy = reader.text_position(above,page)
119
+ xy[:y]
86
120
  end
87
121
  else
88
122
  0
@@ -90,14 +124,14 @@ class PDF::Reader::Turtletext::Textangle
90
124
  ymax = if below
91
125
  if [Fixnum,Float].include?(below.class)
92
126
  below
93
- else
94
- reader.text_position(below,page)[:y] - 1
127
+ elsif xy = reader.text_position(below,page)
128
+ xy[:y]
95
129
  end
96
130
  else
97
- 99999 # TODO actual limit
131
+ 99999 # TODO: figure out the actual limit?
98
132
  end
99
133
 
100
- reader.text_in_region(xmin,xmax,ymin,ymax,page)
134
+ reader.text_in_region(xmin,xmax,ymin,ymax,page,inclusive)
101
135
  end
102
136
 
103
137
  end
@@ -4,7 +4,7 @@ module PDF
4
4
  class Version
5
5
  MAJOR = 0
6
6
  MINOR = 2
7
- PATCH = 1
7
+ PATCH = 2
8
8
 
9
9
  STRING = [MAJOR, MINOR, PATCH].compact.join('.')
10
10
  end
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = "pdf-reader-turtletext"
8
- s.version = "0.2.1"
8
+ s.version = "0.2.2"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Paul Gallagher"]
12
- s.date = "2012-07-31"
12
+ s.date = "2012-08-01"
13
13
  s.description = "a library that can read structured and positional text from PDFs. Ideal for asembling structured data from invoices and the like."
14
14
  s.email = "gallagher.paul@gmail.com"
15
15
  s.extra_rdoc_files = [
@@ -3,19 +3,22 @@
3
3
  # This is a YAML-format file, so beware that indentation is significant
4
4
  ---
5
5
  hello_world.pdf:
6
- :test_above:
6
+ :test_numeric_above:
7
7
  :above: 100
8
8
  :expected_text:
9
9
  -
10
10
  - "Hello World"
11
- :test_below:
11
+ :test_numeric_below:
12
12
  :below: 900
13
13
  :expected_text:
14
14
  -
15
15
  - "Hello World"
16
- :test_below_na:
16
+ :test_numeric_below_na:
17
17
  :below: 10
18
18
  :expected_text: []
19
+ :test_below_na:
20
+ :below: "Bertie"
21
+ :expected_text: []
19
22
  simple_table_text.pdf:
20
23
  :test_above:
21
24
  :above: Table Header
@@ -14,7 +14,7 @@ describe PDF::Reader::Turtletext::Textangle do
14
14
  it { should be_a(PDF::Reader::Turtletext) }
15
15
  end
16
16
 
17
- describe "#text" do
17
+ context "with mock content" do
18
18
  let(:page) { 1 }
19
19
  before do
20
20
  turtletext_reader.stub(:load_content).and_return(given_page_content)
@@ -28,10 +28,10 @@ describe PDF::Reader::Turtletext::Textangle do
28
28
 
29
29
  context "with block param" do
30
30
  [:above,:below,:left_of,:right_of].each do |positional_method|
31
- context "with #{positional_method}" do
31
+ describe "##{positional_method}" do
32
32
  let(:term) { "canary" }
33
33
 
34
- it "should work with block param" do
34
+ it "should assign correctly" do
35
35
  textangle = resource_class.new(turtletext_reader) do |r|
36
36
  r.send("#{positional_method}=",term)
37
37
  end
@@ -40,159 +40,430 @@ describe PDF::Reader::Turtletext::Textangle do
40
40
 
41
41
  end
42
42
  end
43
+
44
+ describe "#page" do
45
+ let(:value) { 2 }
46
+ describe "default" do
47
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
48
+ end }
49
+ subject { textangle.page }
50
+ it { should eql(1) }
51
+ end
52
+ it "should assign correctly" do
53
+ textangle = resource_class.new(turtletext_reader) do |r|
54
+ r.page = value
55
+ end
56
+ textangle.page.should eql(value)
57
+ end
58
+ end
59
+
60
+ describe "#inclusive" do
61
+ describe "default" do
62
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
63
+ end }
64
+ subject { textangle.inclusive }
65
+ it { should be_false }
66
+ end
67
+ it "should assign true correctly" do
68
+ textangle = resource_class.new(turtletext_reader) do |r|
69
+ r.inclusive = true
70
+ end
71
+ textangle.inclusive.should be_true
72
+ end
73
+ it "should assign false correctly" do
74
+ textangle = resource_class.new(turtletext_reader) do |r|
75
+ r.inclusive = false
76
+ end
77
+ textangle.inclusive.should be_false
78
+ end
79
+ end
80
+ describe "#inclusive!" do
81
+ it "should assign correctly" do
82
+ textangle = resource_class.new(turtletext_reader) do |r|
83
+ r.inclusive!
84
+ end
85
+ textangle.inclusive.should be_true
86
+ end
87
+ end
88
+ describe "#exclusive!" do
89
+ it "should assign correctly" do
90
+ textangle = resource_class.new(turtletext_reader) do |r|
91
+ r.exclusive!
92
+ end
93
+ textangle.inclusive.should be_false
94
+ end
95
+ end
43
96
  end
44
97
 
45
98
  context "without block param" do
46
- it "#above should work" do
99
+ it "#above should assign correctly" do
47
100
  textangle = resource_class.new(turtletext_reader) do
48
101
  above "canary"
49
102
  end
50
103
  textangle.above.should eql("canary")
51
104
  end
52
- it "#below should work" do
105
+ it "#below should assign correctly" do
53
106
  textangle = resource_class.new(turtletext_reader) do
54
107
  below "canary"
55
108
  end
56
109
  textangle.below.should eql("canary")
57
110
  end
58
- it "#left_of should work" do
111
+ it "#left_of should assign correctly" do
59
112
  textangle = resource_class.new(turtletext_reader) do
60
113
  left_of "canary"
61
114
  end
62
115
  textangle.left_of.should eql("canary")
63
116
  end
64
- it "#below should work" do
117
+ it "#below should assign correctly" do
65
118
  textangle = resource_class.new(turtletext_reader) do
66
119
  right_of "canary"
67
120
  end
68
121
  textangle.right_of.should eql("canary")
69
122
  end
70
- end
71
123
 
72
- context "when only below specified" do
73
- context "as a string" do
74
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
75
- r.below = "fraud"
76
- end }
77
- let(:expected) { [["smoked and streaky for me"]]}
78
- subject { textangle.text }
79
- it { should eql(expected) }
80
- end
81
- context "as a regex" do
82
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
83
- r.below = /Fraud/i
84
- end }
85
- let(:expected) { [["smoked and streaky for me"]]}
86
- subject { textangle.text }
87
- it { should eql(expected) }
88
- end
89
- context "as a number" do
90
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
91
- r.below = 20
92
- end }
93
- let(:expected) { [["smoked and streaky for me"]]}
94
- subject { textangle.text }
95
- it { should eql(expected) }
124
+ describe "#page" do
125
+ it "should assign correctly" do
126
+ textangle = resource_class.new(turtletext_reader) do
127
+ page 2
128
+ end
129
+ textangle.page.should eql(2)
130
+ end
96
131
  end
97
- end
98
132
 
99
- context "when only above specified" do
100
- context "as a string" do
101
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
102
- r.above = "heaven"
103
- end }
104
- let(:expected) { [["crunchy bacon"]]}
105
- subject { textangle.text }
106
- it { should eql(expected) }
107
- end
108
- context "as a regex" do
109
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
110
- r.above = /heaVen/i
111
- end }
112
- let(:expected) { [["crunchy bacon"]]}
113
- subject { textangle.text }
114
- it { should eql(expected) }
115
- end
116
- context "as a number" do
117
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
118
- r.above = 41
119
- end }
120
- let(:expected) { [["crunchy bacon"]]}
121
- subject { textangle.text }
122
- it { should eql(expected) }
133
+ describe "#inclusive" do
134
+ it "should assign true correctly" do
135
+ textangle = resource_class.new(turtletext_reader) do
136
+ inclusive true
137
+ end
138
+ textangle.inclusive.should be_true
139
+ end
140
+ it "should assign false correctly" do
141
+ textangle = resource_class.new(turtletext_reader) do
142
+ inclusive false
143
+ end
144
+ textangle.inclusive.should be_false
145
+ end
146
+ end
147
+ describe "#inclusive!" do
148
+ it "should assign correctly" do
149
+ textangle = resource_class.new(turtletext_reader) do
150
+ inclusive!
151
+ end
152
+ textangle.inclusive.should be_true
153
+ end
154
+ end
155
+ describe "#exclusive!" do
156
+ it "should assign correctly" do
157
+ textangle = resource_class.new(turtletext_reader) do
158
+ exclusive!
159
+ end
160
+ textangle.inclusive.should be_false
161
+ end
123
162
  end
124
163
  end
125
164
 
126
- context "when only left_of specified" do
127
- context "as a string" do
128
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
129
- r.left_of = "turkey bacon"
130
- end }
131
- let(:expected) { [
132
- ["crunchy bacon"],
133
- ["bacon on kimchi noodles", "heaven"]
134
- ] }
135
- subject { textangle.text }
136
- it { should eql(expected) }
137
- end
138
- context "as a regex" do
139
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
140
- r.left_of = /turKey/i
141
- end }
142
- let(:expected) { [
143
- ["crunchy bacon"],
144
- ["bacon on kimchi noodles", "heaven"]
145
- ] }
146
- subject { textangle.text }
147
- it { should eql(expected) }
148
- end
149
- context "as a number" do
150
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
151
- r.left_of = 29
152
- end }
153
- let(:expected) { [
154
- ["crunchy bacon"],
155
- ["bacon on kimchi noodles", "heaven"]
156
- ] }
157
- subject { textangle.text }
158
- it { should eql(expected) }
165
+ describe "#text" do
166
+
167
+ context "when only below specified" do
168
+ context "when exclusive (default)" do
169
+ context "as a string" do
170
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
171
+ r.below = "turkey bacon"
172
+ end }
173
+ let(:expected) { [["smoked and streaky for me"]]}
174
+ subject { textangle.text }
175
+ it { should eql(expected) }
176
+ end
177
+ context "as a regex" do
178
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
179
+ r.below = /turkey/i
180
+ end }
181
+ let(:expected) { [["smoked and streaky for me"]]}
182
+ subject { textangle.text }
183
+ it { should eql(expected) }
184
+ end
185
+ context "as a number" do
186
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
187
+ r.below = 30
188
+ end }
189
+ let(:expected) { [["smoked and streaky for me"]]}
190
+ subject { textangle.text }
191
+ it { should eql(expected) }
192
+ end
193
+ end
194
+ context "when inclusive" do
195
+ context "as a string" do
196
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
197
+ r.inclusive = true
198
+ r.below = "smoked and streaky for me"
199
+ end }
200
+ let(:expected) { [["smoked and streaky for me"]]}
201
+ subject { textangle.text }
202
+ it { should eql(expected) }
203
+ end
204
+ context "as a regex" do
205
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
206
+ r.inclusive = true
207
+ r.below = /Streaky/i
208
+ end }
209
+ let(:expected) { [["smoked and streaky for me"]]}
210
+ subject { textangle.text }
211
+ it { should eql(expected) }
212
+ end
213
+ context "as a number" do
214
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
215
+ r.inclusive = true
216
+ r.below = 10
217
+ end }
218
+ let(:expected) { [["smoked and streaky for me"]]}
219
+ subject { textangle.text }
220
+ it { should eql(expected) }
221
+ end
222
+ end
223
+ context "when no match" do
224
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
225
+ r.below = "fake"
226
+ end }
227
+ let(:expected) { [] }
228
+ subject { textangle.text }
229
+ it { should eql(expected) }
230
+ end
159
231
  end
160
- end
161
232
 
162
- context "when only right_of specified" do
163
- context "as a string" do
164
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
165
- r.right_of = "heaven"
166
- end }
167
- let(:expected) { [
168
- ["turkey bacon","fraud"],
169
- ["smoked and streaky for me"]
170
- ] }
171
- subject { textangle.text }
172
- it { should eql(expected) }
173
- end
174
- context "as a regex" do
175
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
176
- r.right_of = /Heaven/i
177
- end }
178
- let(:expected) { [
179
- ["turkey bacon","fraud"],
180
- ["smoked and streaky for me"]
181
- ] }
182
- subject { textangle.text }
183
- it { should eql(expected) }
184
- end
185
- context "as a number" do
186
- let(:textangle) { resource_class.new(turtletext_reader) do |r|
187
- r.right_of = 26
188
- end }
189
- let(:expected) { [
190
- ["turkey bacon","fraud"],
191
- ["smoked and streaky for me"]
192
- ] }
193
- subject { textangle.text }
194
- it { should eql(expected) }
233
+ context "when only above specified" do
234
+ context "when exclusive (default)" do
235
+ context "as a string" do
236
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
237
+ r.above = "bacon on kimchi noodles"
238
+ end }
239
+ let(:expected) { [["crunchy bacon"]] }
240
+ subject { textangle.text }
241
+ it { should eql(expected) }
242
+ end
243
+ context "as a regex" do
244
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
245
+ r.above = /kimchi/i
246
+ end }
247
+ let(:expected) { [["crunchy bacon"]] }
248
+ subject { textangle.text }
249
+ it { should eql(expected) }
250
+ end
251
+ context "as a number" do
252
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
253
+ r.above = 40
254
+ end }
255
+ let(:expected) { [["crunchy bacon"]] }
256
+ subject { textangle.text }
257
+ it { should eql(expected) }
258
+ end
259
+ end
260
+ context "when inclusive" do
261
+ context "as a string" do
262
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
263
+ r.inclusive = true
264
+ r.above = "crunchy bacon"
265
+ end }
266
+ let(:expected) { [["crunchy bacon"]] }
267
+ subject { textangle.text }
268
+ it { should eql(expected) }
269
+ end
270
+ context "as a regex" do
271
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
272
+ r.inclusive = true
273
+ r.above = /crunChy/i
274
+ end }
275
+ let(:expected) { [["crunchy bacon"]] }
276
+ subject { textangle.text }
277
+ it { should eql(expected) }
278
+ end
279
+ context "as a number" do
280
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
281
+ r.inclusive = true
282
+ r.above = 70
283
+ end }
284
+ let(:expected) { [["crunchy bacon"]] }
285
+ subject { textangle.text }
286
+ it { should eql(expected) }
287
+ end
288
+ end
289
+ context "when no match" do
290
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
291
+ r.above = "fake"
292
+ end }
293
+ let(:expected) { [] }
294
+ subject { textangle.text }
295
+ it { should eql(expected) }
296
+ end
195
297
  end
298
+
299
+ context "when only left_of specified" do
300
+ context "when exclusive (default)" do
301
+ context "as a string" do
302
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
303
+ r.left_of = "turkey bacon"
304
+ end }
305
+ let(:expected) { [
306
+ ["crunchy bacon"],
307
+ ["bacon on kimchi noodles", "heaven"]
308
+ ] }
309
+ subject { textangle.text }
310
+ it { should eql(expected) }
311
+ end
312
+ context "as a regex" do
313
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
314
+ r.left_of = /turKey/i
315
+ end }
316
+ let(:expected) { [
317
+ ["crunchy bacon"],
318
+ ["bacon on kimchi noodles", "heaven"]
319
+ ] }
320
+ subject { textangle.text }
321
+ it { should eql(expected) }
322
+ end
323
+ context "as a number" do
324
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
325
+ r.left_of = 30
326
+ end }
327
+ let(:expected) { [
328
+ ["crunchy bacon"],
329
+ ["bacon on kimchi noodles", "heaven"]
330
+ ] }
331
+ subject { textangle.text }
332
+ it { should eql(expected) }
333
+ end
334
+ end
335
+ context "when inclusive" do
336
+ context "as a string" do
337
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
338
+ r.inclusive = true
339
+ r.left_of = "heaven"
340
+ end }
341
+ let(:expected) { [
342
+ ["crunchy bacon"],
343
+ ["bacon on kimchi noodles", "heaven"]
344
+ ] }
345
+ subject { textangle.text }
346
+ it { should eql(expected) }
347
+ end
348
+ context "as a regex" do
349
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
350
+ r.inclusive = true
351
+ r.left_of = /heaVen/i
352
+ end }
353
+ let(:expected) { [
354
+ ["crunchy bacon"],
355
+ ["bacon on kimchi noodles", "heaven"]
356
+ ] }
357
+ subject { textangle.text }
358
+ it { should eql(expected) }
359
+ end
360
+ context "as a number" do
361
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
362
+ r.inclusive = true
363
+ r.left_of = 25
364
+ end }
365
+ let(:expected) { [
366
+ ["crunchy bacon"],
367
+ ["bacon on kimchi noodles", "heaven"]
368
+ ] }
369
+ subject { textangle.text }
370
+ it { should eql(expected) }
371
+ end
372
+ end
373
+ context "when no match" do
374
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
375
+ r.left_of = "fake"
376
+ end }
377
+ let(:expected) { [] }
378
+ subject { textangle.text }
379
+ it { should eql(expected) }
380
+ end
381
+ end
382
+
383
+ context "when only right_of specified" do
384
+ context "when exclusive (default)" do
385
+ context "as a string" do
386
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
387
+ r.right_of = "heaven"
388
+ end }
389
+ let(:expected) { [
390
+ ["turkey bacon","fraud"],
391
+ ["smoked and streaky for me"]
392
+ ] }
393
+ subject { textangle.text }
394
+ it { should eql(expected) }
395
+ end
396
+ context "as a regex" do
397
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
398
+ r.right_of = /Heaven/i
399
+ end }
400
+ let(:expected) { [
401
+ ["turkey bacon","fraud"],
402
+ ["smoked and streaky for me"]
403
+ ] }
404
+ subject { textangle.text }
405
+ it { should eql(expected) }
406
+ end
407
+ context "as a number" do
408
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
409
+ r.right_of = 25
410
+ end }
411
+ let(:expected) { [
412
+ ["turkey bacon","fraud"],
413
+ ["smoked and streaky for me"]
414
+ ] }
415
+ subject { textangle.text }
416
+ it { should eql(expected) }
417
+ end
418
+ end
419
+ context "when inclusive" do
420
+ context "as a string" do
421
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
422
+ r.inclusive = true
423
+ r.right_of = "turkey bacon"
424
+ end }
425
+ let(:expected) { [
426
+ ["turkey bacon","fraud"],
427
+ ["smoked and streaky for me"]
428
+ ] }
429
+ subject { textangle.text }
430
+ it { should eql(expected) }
431
+ end
432
+ context "as a regex" do
433
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
434
+ r.inclusive = true
435
+ r.right_of = /turkey/i
436
+ end }
437
+ let(:expected) { [
438
+ ["turkey bacon","fraud"],
439
+ ["smoked and streaky for me"]
440
+ ] }
441
+ subject { textangle.text }
442
+ it { should eql(expected) }
443
+ end
444
+ context "as a number" do
445
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
446
+ r.inclusive = true
447
+ r.right_of = 30
448
+ end }
449
+ let(:expected) { [
450
+ ["turkey bacon","fraud"],
451
+ ["smoked and streaky for me"]
452
+ ] }
453
+ subject { textangle.text }
454
+ it { should eql(expected) }
455
+ end
456
+ end
457
+ context "when no match" do
458
+ let(:textangle) { resource_class.new(turtletext_reader) do |r|
459
+ r.right_of = "fake"
460
+ end }
461
+ let(:expected) { [] }
462
+ subject { textangle.text }
463
+ it { should eql(expected) }
464
+ end
465
+ end
466
+
196
467
  end
197
468
 
198
469
  end
@@ -90,7 +90,7 @@ describe PDF::Reader::Turtletext do
90
90
  {
91
91
  :with_single_text => {
92
92
  :source_page_content => {10.0=>{10.0=>"a first bit of text"}},
93
- :xmin => 0, :xmax => 100, :ymin => 0, :ymax => 100,
93
+ :xmin => 0, :xmax => 100, :ymin => 0, :ymax => 100, :inclusive => false,
94
94
  :expected_text => [["a first bit of text"]]
95
95
  },
96
96
  :with_single_line_text => {
@@ -99,7 +99,7 @@ describe PDF::Reader::Turtletext do
99
99
  30.0=>{10.0=>"first part found", 20.0=>"last part found"},
100
100
  10.0=>{10.0=>"last line ignored"}
101
101
  },
102
- :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
102
+ :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50, :inclusive => false,
103
103
  :expected_text => [["first part found", "last part found"]]
104
104
  },
105
105
  :with_multi_line_text => {
@@ -109,11 +109,20 @@ describe PDF::Reader::Turtletext do
109
109
  30.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
110
110
  10.0=>{10.0=>"last line ignored"}
111
111
  },
112
- :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
112
+ :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50, :inclusive => false,
113
113
  :expected_text => [
114
114
  ["first line first part found", "first line last part found"],
115
115
  ["last line first part found", "last line last part found"]
116
116
  ]
117
+ },
118
+ :with_inclusive_text => {
119
+ :source_page_content => {
120
+ 70.0=>{10.0=>"first line ignored"},
121
+ 30.0=>{10.0=>"first part found", 20.0=>"last part found"},
122
+ 10.0=>{10.0=>"last line ignored"}
123
+ },
124
+ :xmin => 10, :xmax => 100, :ymin => 30, :ymax => 30, :inclusive => true,
125
+ :expected_text => [["first part found", "last part found"]]
117
126
  }
118
127
  }.each do |test_name,test_expectations|
119
128
  context test_name do
@@ -122,8 +131,9 @@ describe PDF::Reader::Turtletext do
122
131
  let(:xmax) { test_expectations[:xmax] }
123
132
  let(:ymin) { test_expectations[:ymin] }
124
133
  let(:ymax) { test_expectations[:ymax] }
134
+ let(:inclusive) { test_expectations[:inclusive] }
125
135
  let(:expected_text) { test_expectations[:expected_text] }
126
- subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
136
+ subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page,inclusive) }
127
137
  it { should eql(expected_text) }
128
138
  end
129
139
  end
@@ -137,6 +147,7 @@ describe PDF::Reader::Turtletext do
137
147
  10.0=>{40.0=>"smoked and streaky da bomb"}
138
148
  } }
139
149
  {
150
+ :with_no_match => { :match_term => 'bertie beetle', :expected_position => nil },
140
151
  :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>30.0} },
141
152
  :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>40.0} },
142
153
  :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>40.0} },
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: pdf-reader-turtletext
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.1
4
+ version: 0.2.2
5
5
  prerelease:
6
6
  platform: ruby
7
7
  authors:
@@ -9,11 +9,11 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2012-07-31 00:00:00.000000000 Z
12
+ date: 2012-08-01 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: pdf-reader
16
- requirement: &70339095317400 !ruby/object:Gem::Requirement
16
+ requirement: &70159058920880 !ruby/object:Gem::Requirement
17
17
  none: false
18
18
  requirements:
19
19
  - - =
@@ -21,10 +21,10 @@ dependencies:
21
21
  version: 1.1.1
22
22
  type: :runtime
23
23
  prerelease: false
24
- version_requirements: *70339095317400
24
+ version_requirements: *70159058920880
25
25
  - !ruby/object:Gem::Dependency
26
26
  name: bundler
27
- requirement: &70339095316760 !ruby/object:Gem::Requirement
27
+ requirement: &70159058920240 !ruby/object:Gem::Requirement
28
28
  none: false
29
29
  requirements:
30
30
  - - ~>
@@ -32,10 +32,10 @@ dependencies:
32
32
  version: 1.1.4
33
33
  type: :development
34
34
  prerelease: false
35
- version_requirements: *70339095316760
35
+ version_requirements: *70159058920240
36
36
  - !ruby/object:Gem::Dependency
37
37
  name: jeweler
38
- requirement: &70339095315980 !ruby/object:Gem::Requirement
38
+ requirement: &70159058919460 !ruby/object:Gem::Requirement
39
39
  none: false
40
40
  requirements:
41
41
  - - ~>
@@ -43,10 +43,10 @@ dependencies:
43
43
  version: 1.6.4
44
44
  type: :development
45
45
  prerelease: false
46
- version_requirements: *70339095315980
46
+ version_requirements: *70159058919460
47
47
  - !ruby/object:Gem::Dependency
48
48
  name: rake
49
- requirement: &70339095315380 !ruby/object:Gem::Requirement
49
+ requirement: &70159058918900 !ruby/object:Gem::Requirement
50
50
  none: false
51
51
  requirements:
52
52
  - - ~>
@@ -54,10 +54,10 @@ dependencies:
54
54
  version: 0.9.2.2
55
55
  type: :development
56
56
  prerelease: false
57
- version_requirements: *70339095315380
57
+ version_requirements: *70159058918900
58
58
  - !ruby/object:Gem::Dependency
59
59
  name: rspec
60
- requirement: &70339095314600 !ruby/object:Gem::Requirement
60
+ requirement: &70159058918080 !ruby/object:Gem::Requirement
61
61
  none: false
62
62
  requirements:
63
63
  - - ~>
@@ -65,10 +65,10 @@ dependencies:
65
65
  version: 2.8.0
66
66
  type: :development
67
67
  prerelease: false
68
- version_requirements: *70339095314600
68
+ version_requirements: *70159058918080
69
69
  - !ruby/object:Gem::Dependency
70
70
  name: rdoc
71
- requirement: &70339095313760 !ruby/object:Gem::Requirement
71
+ requirement: &70159058917260 !ruby/object:Gem::Requirement
72
72
  none: false
73
73
  requirements:
74
74
  - - ~>
@@ -76,10 +76,10 @@ dependencies:
76
76
  version: '3.11'
77
77
  type: :development
78
78
  prerelease: false
79
- version_requirements: *70339095313760
79
+ version_requirements: *70159058917260
80
80
  - !ruby/object:Gem::Dependency
81
81
  name: prawn
82
- requirement: &70339095313100 !ruby/object:Gem::Requirement
82
+ requirement: &70159058916580 !ruby/object:Gem::Requirement
83
83
  none: false
84
84
  requirements:
85
85
  - - ~>
@@ -87,10 +87,10 @@ dependencies:
87
87
  version: 0.12.0
88
88
  type: :development
89
89
  prerelease: false
90
- version_requirements: *70339095313100
90
+ version_requirements: *70159058916580
91
91
  - !ruby/object:Gem::Dependency
92
92
  name: guard-rspec
93
- requirement: &70339095312500 !ruby/object:Gem::Requirement
93
+ requirement: &70159058915980 !ruby/object:Gem::Requirement
94
94
  none: false
95
95
  requirements:
96
96
  - - ~>
@@ -98,7 +98,7 @@ dependencies:
98
98
  version: 1.2.0
99
99
  type: :development
100
100
  prerelease: false
101
- version_requirements: *70339095312500
101
+ version_requirements: *70159058915980
102
102
  description: a library that can read structured and positional text from PDFs. Ideal
103
103
  for asembling structured data from invoices and the like.
104
104
  email: gallagher.paul@gmail.com