RubyGems - pdf-reader-turtletext - Versions diffs - 0.1.0 → 0.2.0 - Mend

pdf-reader-turtletext 0.1.0 → 0.2.0

Files changed (14)

data/.travis.yml +9 -1
data/CHANGELOG +11 -0
data/README.rdoc +106 -15
data/lib/pdf/reader/turtletext.rb +52 -24
data/lib/pdf/reader/turtletext/textangle.rb +89 -13
data/lib/pdf/reader/turtletext/version.rb +1 -1
data/pdf-reader-turtletext.gemspec +5 -2
data/spec/fixtures/pdf_samples/expectations.yml +95 -0
data/spec/fixtures/pdf_samples/simple_table_text.pdf +139 -0
data/spec/integration/pdf_samples_spec.rb +28 -0
data/spec/support/pdf_samples_helper.rb +23 -0
data/spec/unit/reader/turtletext/textangle_spec.rb +193 -0
data/spec/unit/reader/turtletext/turtletext_spec.rb +42 -37
metadata +21 -18

data/.travis.yml CHANGED

@@ -1,3 +1,11 @@
 # These are specific configuration settings required for travis-ci
 # see http://travis-ci.org/tardate/pdf-reader-turtletext
-rvm: 1.9.3
+language: ruby
+rvm:
+  - 1.8.7
+  - 1.9.2
+  - 1.9.3
+  - rbx-18mode
+  - rbx-19mode
+  - jruby-18mode
+  - jruby-19mode

data/CHANGELOG ADDED

@@ -0,0 +1,11 @@
+Version 0.2.0              Release: n/a
+==================================================
+* add bounding_box / textangle semantics
+* improve documentation
+* MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
+Version 0.1.0              Release: 22nd July 2012
+==================================================
+* Initial packaging and release of core functionality directly extracted
+  from https://github.com/tardate/sps_bill_scanner/
+* MRI 1.9 only

data/README.rdoc CHANGED

@@ -14,30 +14,121 @@ For an example of how this is works in practice, see the
 == Requirements and Known Limitations
-* currently only tested with Ruby 1.9
-* fixed dependency on PDF::Reader v 1.1.1
+* Tested with MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
+* Has a fixed dependency on PDF::Reader v1.1.1
-== Installation
+== The PDF::Reader::Turtletext Cookbook
-  gem install pdf-reader-turtletext
+=== How do I install it for normal use?
-== Usage
+It is distributed as a gem, so all normal gem installation procedures apply. To install the
+gem directly from the command line:
-=== PDF::Reader::Turtletext
+  $ gem install pdf-reader-turtletext
-Provides a range of methods to extract structured text from a PDF file,
-such as <tt>text_position</tt> and <tt>text_in_region</tt>.
+If you are using bundler or Rails, add to your Gemfile:
-A typical usage:
+  gem 'pdf-reader-turtletext'
+Then bundle install:
+  $ bundle
+=== How do I install it for gem development?
+If you want to work on enhancements or fix bugs in PDF::Reader::Turtletext, fork and clone the github repository. See the section below on 'Contributing to PDF::Reader::Turtletext'
+=== How to instantiate Turtletext in code
+All interaction is done using an instance of the PDF::Reader::Turtletext class. It is
+initialised given a filename or IO-like object, and any required options.
+Typical usage:
+  pdf_filename = '../some_path/some.pdf'
   reader = PDF::Reader::Turtletext.new(pdf_filename)
-  page = 1
-  heading_position = reader.text_position(/transaction table/i)
-  next_section = reader.text_position(/transaction summary/i)
-  transaction_rows = reader.text_in_region(
-    heading_position[x], 900,
-    heading_position[y] + 1,next_section[:y] -1
+  options = { :y_precision => 5 }
+  reader_with_options = PDF::Reader::Turtletext.new(pdf_filename,options)
+=== How to extract text within a region described in relation to other text
+Problem: we don't know exactly where the required text will be on the page, and it is not encoded
+within the PDF as a single object. But we do know that it will be relatively positioned (for example)
+below a certain bit of text, to the left of another, and above some other text.
+Solution: use the <tt>bounding_box</tt> method to describe the region and extract the matching text.
+  textangle = reader.bounding_box do
+    page 1
+    below /electricity/i
+    above 10
+    right_of 240.0
+    left_of "Total ($)"
+  end
+  textangle.text
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+The range of methods that can be used within the <tt>bounding_box</tt> block are all optional, and include:
+* <tt>page</tt> - specifies the PDF page from which to extract text (default is 1).
+* <tt>below</tt> - a string, regex or number that describes the upper limit of the text box
+  (default is top border of the page).
+* <tt>above</tt> - a string, regex or number that describes the lower limit of the text box
+  (default is bottom border of the page).
+* <tt>left_of</tt> - a string, regex or number that describes the right limit of the text box
+  (default is right border of the page).
+* <tt>right_of</tt> - a string, regex or number that describes the left limit of the text box
+  (default is left border of the page).
+Note that <tt>left_of</tt> and <tt>right_of</tt> constraints do *not* need to be within the vertical
+range of the box being described.
+For example, you could use an element in the page header to describe the <tt>left_of</tt> limit
+for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.
+Similarly, <tt>above</tt> and <tt>below</tt> constraints do *not* need to be within the horizontal
+range of the box being described.
+=== Using a block parameter with the <tt>bounding_box</tt> method
+An explicit block parameter may be used with the <tt>bounding_box</tt> method:
+  textangle = reader.bounding_box do |r|
+    r.below /electricity/i
+    r.left_of "Total ($)"
+  end
+  textangle.text
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+=== Extract text for a region with known positional co-ordinates
+If you know (or can calculate) the x,y positions of the required text region, you can extract the region's
+text using the <tt>text_in_region</tt> method.
+  text = reader.text_in_region(
+    10,  # minimum x (left-most) (inclusive)
+    900, # maximum x (right-most) (inclusive)
+    200, # minimum y (bottom-most) (inclusive)
+    400, # maximum y (top-most) (inclusive)
+    1    # page
   )
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+Note that the x,y origin is at the bottom-left of the page.
+=== How to find the x,y co-ordinate of a specific text element
+Problem: if you are doing low-level text extraction with <tt>text_in_region</tt> for example,
+it is usually necessary to locate specific text to provide a positional reference.
+Solution: use the <tt>text_position</tt> method to locate text by exact or partial match.
+It returns a Hash of x/y co-ordinates that is the bottom-left corner of the text.
+  page = 1
+  text_by_exact_match = reader.text_position("Transaction Table", page)
+  => { :x => 10.0, :y => 600.0 }
+  text_by_regex_match = reader.text_position(/transaction summary/i, page)
+  => { :x => 10.0, :y => 300.0 }
+Note: in the case of multiple matches, only the first match is returned.
 == Contributing to PDF::Reader::Turtletext
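
The co-ordinate conventions in the README hunk above (origin at the bottom-left, all four limits inclusive) can be sketched without a PDF. This is a standalone illustration, not the gem's own method: the name `region_text` is hypothetical, and the sample rows are borrowed from the unit specs later in this diff, arranged in the array-of-rows shape that 0.2.0 uses internally.

```ruby
# Rows are [y, [[x, text], ...]] pairs, ordered from the top of the page down.
SAMPLE_ROWS = [
  [70.0, [[10.0, "crunchy bacon"]]],
  [40.0, [[15.0, "bacon on kimchi noodles"], [25.0, "heaven"]]],
  [30.0, [[30.0, "turkey bacon"], [35.0, "fraud"]]],
  [10.0, [[40.0, "smoked and streaky for me"]]]
]

# Hypothetical helper: keep the text whose x and y fall inside the inclusive limits.
def region_text(rows, xmin, xmax, ymin, ymax)
  rows.each_with_object([]) do |(y, cells), box|
    next unless y >= ymin && y <= ymax
    row = cells.select { |x, _text| x >= xmin && x <= xmax }
    # Order each row left to right and drop the x co-ordinates.
    box << row.sort_by(&:first).map(&:last) unless row.empty?
  end
end

# Everything at or below y = 20:
region_text(SAMPLE_ROWS, 0, 99999, 0, 20)
# Everything at or left of x = 29:
region_text(SAMPLE_ROWS, 0, 29, 0, 99999)
```

Because y grows upwards, a "below" constraint becomes a maximum y and an "above" constraint becomes a minimum y.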

data/lib/pdf/reader/turtletext.rb CHANGED

@@ -16,6 +16,8 @@ class PDF::Reader::Turtletext
   attr_reader :options
   # +source+ is a file name or stream-like object
+  # Supported +options+ include:
+  # * :y_precision
   def initialize(source, options={})
     @options = options
     @reader = PDF::Reader.new(source)
@@ -31,7 +33,7 @@ class PDF::Reader::Turtletext
   end
   # Returns positional (with fuzzed y positioning) text content collection as a hash:
-  #   { y_position: { x_position: content}}
+  #   [ fuzzed_y_position, [[x_position,content]] ]
   def content(page=1)
     @content ||= []
     if @content[page]
@@ -41,18 +43,24 @@ class PDF::Reader::Turtletext
     end
   end
-  # Returns a hash with fuzzed positioning:
-  #   { fuzzed_y_position: { x_position: content}}
+  # Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content order by x position.
+  #   [ fuzzed_y_position, [[x_position,content]] ]
   # Given +input+ as a hash:
   #   { y_position: { x_position: content}}
   # Fuzz factors: +y_precision+
   def fuzzed_y(input)
-    output = {}
-    input.keys.sort.each do |precise_y|
-      # matching_y = (precise_y / 5.0).truncate * 5.0
-      matching_y = output.keys.select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
-      output[matching_y] ||= {}
-      output[matching_y].merge!(input[precise_y])
+    output = []
+    input.keys.sort.reverse.each do |precise_y|
+      matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
+      y_index = output.index{|y| y.first == matching_y }
+      new_row_content = input[precise_y].to_a
+      if y_index
+        row_content = output[y_index].last
+        row_content += new_row_content
+        output[y_index] = [matching_y,row_content]
+      else
+        output << [matching_y,new_row_content]
+      end
     end
     output
   end
@@ -69,21 +77,24 @@ class PDF::Reader::Turtletext
   end
   # Returns an array of text elements found within the x,y limits,
+  # x ranges from +xmin+ (left of page) to +xmax+ (right of page)
+  # y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
   # Each line of text found is returned as an array element.
   # Each line of text is an array of the seperate text elements found on that line.
   #   [["first line first text", "first line last text"],["second line text"]]
   def text_in_region(xmin,xmax,ymin,ymax,page=1)
     text_map = content(page)
     box = []
-    text_map.keys.sort.reverse.each do |y|
+    text_map.each do |y,text_row|
       if y >= ymin && y<= ymax
         row = []
-        text_map[y].keys.sort.each do |x|
+        text_row.each do |x,element|
           if x >= xmin && x<= xmax
-            row << text_map[y][x]
+            row << [x,element]
           end
         end
-        box << row unless row.empty?
+        box << row.sort{|a,b| a.first <=> b.first }.map(&:last) unless row.empty?
       end
     end
     box
@@ -94,7 +105,11 @@ class PDF::Reader::Turtletext
   # +text+ may be a string (exact match required) or a Regexp
   def text_position(text,page=1)
     item = if text.class <= Regexp
-      content(page).map {|k,v| if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  } ; [k,x] ; end }
+      content(page).map do |k,v|
+        if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  }
+          [k,x]
+        end
+      end
     else
       content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
     end
@@ -104,17 +119,30 @@ class PDF::Reader::Turtletext
     end
   end
-  # WIP - not using Textangle yet for text extraction.
-  # Ideal usage is something like this:
+  # Returns a text region definition using a descriptive block.
+  #
+  # Usage:
+  #
+  #   textangle = reader.bounding_box do
+  #     page 1
+  #     below /electricity/i
+  #     above 10
+  #     right_of 240.0
+  #     left_of "Total ($)"
+  #   end
+  #   textangle.text
+  #
+  # Alternatively, an explicit block parameter may be used:
   #
-  # textangle = reader.bounding_box do
-  #   page 1
-  #   below "Electricity Services"
-  #   above "Gas Services by City Gas Pte Ltd"
-  #   right_of 240.0
-  #   left_of "Total ($)"
-  # end
-  # textangle.text
+  #   textangle = reader.bounding_box do |r|
+  #     r.page 1
+  #     r.below /electricity/i
+  #     r.above 10
+  #     r.right_of 240.0
+  #     r.left_of "Total ($)"
+  #   end
+  #   textangle.text
+  #   => [['string','string'],['string']] # array of rows, each row is an array of column text element
   #
   def bounding_box(&block)
     PDF::Reader::Turtletext::Textangle.new(self,&block)
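
The change to `fuzzed_y` in this file swaps a Hash result for an Array ordered from the top of the page down. A minimal standalone sketch of that grouping step follows, assuming the default `y_precision` of 3 that the unit specs check. The function name `fuzz_rows` is hypothetical, and the sample input is the `:with_multiple_row_text` case from the spec diff.

```ruby
# Group text whose y positions differ by less than y_precision into one row.
# Input:  { y => { x => text } }
# Output: [[row_y, [[x, text], ...]], ...] with the highest y first.
def fuzz_rows(input, y_precision = 3)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    cells = input[precise_y].to_a
    # Reuse the first existing row that is close enough, otherwise start a new one.
    row = output.find { |row_y, _cells| (row_y - precise_y).abs < y_precision }
    if row
      row[1] = row[1] + cells
    else
      output << [precise_y, cells]
    end
  end
  output
end

MULTI_ROW_INPUT = { 10.0 => { 10.0 => "first" }, 8.0 => { 20.0 => "second", 30.0 => "third" } }
fuzz_rows(MULTI_ROW_INPUT)
```

Processing from the highest y downwards means each merged row is keyed by its topmost y value, which is why the spec expectations changed from 10.0 to 20.0 in several cases.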

data/lib/pdf/reader/turtletext/textangle.rb CHANGED

@@ -1,27 +1,103 @@
 # A DSL syntax for text extraction.
-# WIP - not using this yet
 #
-# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do
-#   page 1
-#   below "Electricity Services"
-#   above "Gas Services by City Gas Pte Ltd"
-#   right_of 240.0
-#   left_of "Total ($)"
+# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do |r|
+#   r.page = 1
+#   r.below = "Electricity Services"
+#   r.above = "Gas Services by City Gas Pte Ltd"
+#   r.right_of = 240.0
+#   r.left_of = "Total ($)"
 # end
 # textangle.text
 #
 class PDF::Reader::Turtletext::Textangle
   attr_reader :reader
-  attr_writer :page,:above,:below,:left_of,:right_of
+  attr_accessor :page
+  attr_writer :above,:below,:left_of,:right_of
-  # +structured_reader+ is a PDF::StructuredReader
-  def initialize(structured_reader,&block)
-    @reader = structured_reader
-    instance_eval( &block ) if block
+  # +turtletext_reader+ is a PDF::Reader::Turtletext
+  def initialize(turtletext_reader,&block)
+    @reader = turtletext_reader
+    @page = 1
+    if block_given?
+      if block.arity == 1
+        yield self
+      else
+        instance_eval &block
+      end
+    end
   end
+  def above(*args)
+    if value = args.first
+      @above = value
+    end
+    @above
+  end
+  def below(*args)
+    if value = args.first
+      @below = value
+    end
+    @below
+  end
+  def left_of(*args)
+    if value = args.first
+      @left_of = value
+    end
+    @left_of
+  end
+  def right_of(*args)
+    if value = args.first
+      @right_of = value
+    end
+    @right_of
+  end
+  # Returns the text
   def text
-    # TODO
+    return unless reader
+    xmin = if right_of
+      if [Fixnum,Float].include?(right_of.class)
+        right_of
+      else
+        reader.text_position(right_of,page)[:x] + 1
+      end
+    else
+      0
+    end
+    xmax = if left_of
+      if [Fixnum,Float].include?(left_of.class)
+        left_of
+      else
+        reader.text_position(left_of,page)[:x] - 1
+      end
+    else
+      99999 # TODO actual limit
+    end
+    ymin = if above
+      if [Fixnum,Float].include?(above.class)
+        above
+      else
+        reader.text_position(above,page)[:y] + 1
+      end
+    else
+      0
+    end
+    ymax = if below
+      if [Fixnum,Float].include?(below.class)
+        below
+      else
+        reader.text_position(below,page)[:y] - 1
+      end
+    else
+      99999 # TODO actual limit
+    end
+    reader.text_in_region(xmin,xmax,ymin,ymax,page)
   end
 end
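
Two ideas in the Textangle changes above can be sketched in isolation: choosing between `yield self` and `instance_eval` based on block arity, and turning a bound into a numeric limit. The class `Region` below is a hypothetical stand-in that models only `below`. The `positions` table is an assumed substitute for `reader.text_position`, and `is_a?(Numeric)` is swapped in for the `[Fixnum,Float]` check because `Fixnum` is no longer defined in current Ruby releases.

```ruby
# Hypothetical stand-in for Textangle; only the `below` bound is modelled.
class Region
  attr_writer :below

  def initialize(&block)
    return unless block
    # A block that declares a parameter receives the region explicitly;
    # a bare block is evaluated with the region as self.
    block.arity == 1 ? block.call(self) : instance_eval(&block)
  end

  # Getter and DSL-style setter in one method:
  # `below "x"` stores the value, `below` with no argument reads it.
  def below(*args)
    value = args.first
    @below = value if value
    @below
  end

  # Upper y limit: numbers are used as-is, text is looked up and nudged down by 1.
  # With no bound, fall back to the same large placeholder the diff uses.
  def ymax(positions)
    return 99999 unless below
    return below if below.is_a?(Numeric)
    positions.fetch(below)[:y] - 1
  end
end

POSITIONS = { "fraud" => { :x => 35.0, :y => 30.0 } }

Region.new { below "fraud" }.ymax(POSITIONS)      # bare block, text bound
Region.new { |r| r.below = 20 }.ymax(POSITIONS)   # explicit block parameter, numeric bound
```

Since the combined getter/setter ignores a nil argument, a call such as `below nil` leaves the bound unset, which is what lets the integration spec pass every constraint unconditionally.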

data/lib/pdf/reader/turtletext/version.rb CHANGED

@@ -3,7 +3,7 @@ module PDF
     class Turtletext
       class Version
         MAJOR = 0
-        MINOR = 1
+        MINOR = 2
         PATCH = 0
         STRING = [MAJOR, MINOR, PATCH].compact.join('.')

data/pdf-reader-turtletext.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "pdf-reader-turtletext"
-  s.version = "0.1.0"
+  s.version = "0.2.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Paul Gallagher"]
-  s.date = "2012-07-22"
+  s.date = "2012-07-31"
   s.description = "a library that can read structured and positional text from PDFs. Ideal for asembling structured data from invoices and the like."
   s.email = "gallagher.paul@gmail.com"
   s.extra_rdoc_files = [
@@ -20,6 +20,7 @@ Gem::Specification.new do |s|
     ".rspec",
     ".rvmrc",
     ".travis.yml",
+    "CHANGELOG",
     "Gemfile",
     "Gemfile.lock",
     "Guardfile",
@@ -34,8 +35,10 @@ Gem::Specification.new do |s|
     "lib/pdf/reader/turtletext/version.rb",
     "pdf-reader-turtletext.gemspec",
     "spec/fixtures/pdf_samples/.gitkeep",
+    "spec/fixtures/pdf_samples/expectations.yml",
     "spec/fixtures/pdf_samples/hello_world.pdf",
     "spec/fixtures/pdf_samples/junk_prefix.pdf",
+    "spec/fixtures/pdf_samples/simple_table_text.pdf",
     "spec/integration/pdf_samples_spec.rb",
     "spec/spec_helper.rb",
     "spec/support/pdf_samples_helper.rb",

data/spec/fixtures/pdf_samples/expectations.yml ADDED

@@ -0,0 +1,95 @@
+# this file defines the test expectations for PDF samples in spec/fixtures/pdf_samples.
+#
+# This is a YAML-format file, so beware that indentation is significant
+---
+hello_world.pdf:
+  :test_above:
+    :above: 100
+    :expected_text:
+    -
+      - "Hello World"
+  :test_below:
+    :below: 900
+    :expected_text:
+    -
+      - "Hello World"
+  :test_below_na:
+    :below: 10
+    :expected_text: []
+simple_table_text.pdf:
+  :test_above:
+    :above: Table Header
+    :expected_text:
+    -
+      - "Simple Table Text"
+  :test_below:
+    :below: row 2
+    :expected_text:
+    -
+      - "Table Footer"
+  :test_right_of:
+    :right_of: row 1
+    :expected_text:
+    -
+      - "val 1"
+      - "val 2"
+      - "val 3"
+    -
+      - "val 1"
+      - "val 2"
+      - "val 3"
+  :test_left_of:
+    :left_of: val 1
+    :expected_text:
+    -
+      - "Simple Table Text"
+    -
+      - "Table Header"
+    -
+      - "row 1"
+    -
+      - "row 2"
+    -
+      - "Table Footer"
+  :test_above_and_below:
+    :below: Table Header
+    :above: Table Footer
+    :expected_text:
+    -
+      - "row 1"
+      - "val 1"
+      - "val 2"
+      - "val 3"
+    -
+      - "row 2"
+      - "val 1"
+      - "val 2"
+      - "val 3"
+  :test_above_and_below_and_left_of:
+    :below: Table Header
+    :above: Table Footer
+    :left_of: val 2
+    :expected_text:
+    -
+      - "row 1"
+      - "val 1"
+    -
+      - "row 2"
+      - "val 1"
+  :test_above_and_below_and_left_of_and_right_of:
+    :below: Table Header
+    :above: Table Footer
+    :left_of: val 2
+    :right_of: row 1
+    :expected_text:
+    -
+      - "val 1"
+    -
+      - "val 1"
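
The keys in this fixture are written with a leading colon, which Ruby's YAML loader reads as Symbols. The sketch below loads a trimmed copy of the file to show the resulting structure. The diff does not show how `pdf_sample_expectations` actually reads the file, so the use of `YAML.load` here is an assumption.

```ruby
require "yaml"

# A trimmed copy of the hello_world.pdf section of expectations.yml.
EXPECTATIONS_SOURCE = <<~YAML
  ---
  hello_world.pdf:
    :test_above:
      :above: 100
      :expected_text:
      -
        - "Hello World"
    :test_below_na:
      :below: 10
      :expected_text: []
YAML

# File names stay Strings; test names and constraint names become Symbols.
EXPECTATIONS = YAML.load(EXPECTATIONS_SOURCE)

EXPECTATIONS["hello_world.pdf"].each do |test_name, spec|
  puts "#{test_name}: #{spec[:expected_text].inspect}"
end
```

Stricter loaders such as `YAML.safe_load` reject Symbols unless they are explicitly permitted, so a fixture in this style would need `permitted_classes: [Symbol]` there.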

data/spec/fixtures/pdf_samples/simple_table_text.pdf ADDED

@@ -0,0 +1,139 @@
+%PDF-1.3
+%����
+1 0 obj
+<< /Creator <feff0050007200610077006e>
+/Producer <feff0050007200610077006e>
+>>
+endobj
+2 0 obj
+<< /Type /Catalog
+/Pages 3 0 R
+>>
+endobj
+3 0 obj
+<< /Type /Pages
+/Count 1
+/Kids [5 0 R]
+>>
+endobj
+4 0 obj
+<< /Length 795
+>>
+stream
+q
+BT
+36 747.384 Td
+/F1.0 12 Tf
+[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>] TJ
+ET
+BT
+46 327.384 Td
+/F1.0 12 Tf
+[<54> 120 <6162> 20 <6c6520486561646572>] TJ
+ET
+BT
+46 277.384 Td
+/F1.0 12 Tf
+[<726f> 15 <772031>] TJ
+ET
+BT
+136 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2031>] TJ
+ET
+BT
+186 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2032>] TJ
+ET
+BT
+236 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2033>] TJ
+ET
+BT
+46 227.38400000000001 Td
+/F1.0 12 Tf
+[<726f> 15 <772032>] TJ
+ET
+BT
+136 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2031>] TJ
+ET
+BT
+186 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2032>] TJ
+ET
+BT
+236 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2033>] TJ
+ET
+BT
+46 177.38400000000001 Td
+/F1.0 12 Tf
+[<54> 120 <6162> 20 <6c652046> 30 <6f6f746572>] TJ
+ET
+Q
+endstream
+endobj
+5 0 obj
+<< /Type /Page
+/Parent 3 0 R
+/MediaBox [0 0 612.0 792.0]
+/Contents 4 0 R
+/Resources << /ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
+/Font << /F1.0 6 0 R
+>>
+>>
+>>
+endobj
+6 0 obj
+<< /Type /Font
+/Subtype /Type1
+/BaseFont /Helvetica
+/Encoding /WinAnsiEncoding
+>>
+endobj
+xref
+0 7
+0000000000 65535 f
+0000000015 00000 n
+0000000109 00000 n
+0000000158 00000 n
+0000000215 00000 n
+0000001061 00000 n
+0000001239 00000 n
+trailer
+<< /Size 7
+/Root 2 0 R
+/Info 1 0 R
+>>
+startxref
+1336
+%%EOF
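
The text in this fixture is stored as hex strings inside `TJ` arrays, with the numbers between the strings acting as kerning adjustments rather than content. A small sketch, assuming single-byte characters in the ASCII range (true for this sample), shows how one such array maps back to the text the specs expect. The helper name `decode_tj` is hypothetical and is not part of the gem or of PDF::Reader.

```ruby
# Decode the hex strings of a TJ array and ignore the kerning numbers.
def decode_tj(array_source)
  array_source.scan(/<([0-9a-fA-F]+)>/).map { |(hex)| [hex].pack("H*") }.join
end

TITLE_TJ  = "[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>]"
FOOTER_TJ = "[<54> 120 <6162> 20 <6c652046> 30 <6f6f746572>]"

puts decode_tj(TITLE_TJ)
puts decode_tj(FOOTER_TJ)
```

The `Td` operands that precede each array give the x,y position of the text, which is the positional information the `content` method groups into rows.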

data/spec/integration/pdf_samples_spec.rb CHANGED

@@ -3,5 +3,33 @@ include PdfSamplesHelper
 describe "PDF Samples" do
+  # This will scan all *.pdf files in spec/fixtures/personal_pdf_samples
+  # and do basic verification of the file structure without any effort from you.
+  pdf_sample_expectations.each do |sample_name,test_specifications|
+    describe "sample" do
+      let(:options) { test_specifications[:options] || {} }
+      let(:sample_file) { pdf_sample(sample_name) }
+      let(:turtletext_reader) { PDF::Reader::Turtletext.new(sample_file,options) }
+      (test_specifications||{}).each do |test_name,expectations|
+        context test_name do
+          let(:bounding_box) {
+            turtletext_reader.bounding_box do
+              above expectations[:above]
+              below expectations[:below]
+              left_of expectations[:left_of]
+              right_of expectations[:right_of]
+            end
+          }
+          # it {
+          #   puts "bounding_box"
+          #   puts bounding_box.inspect
+          # }
+          subject { bounding_box.text }
+          it { should eql(expectations[:expected_text])}
+        end
+      end
+    end
+  end
 end

data/spec/support/pdf_samples_helper.rb CHANGED

@@ -31,6 +31,7 @@ module PdfSamplesHelper
     require 'prawn'
     puts "Making PDF samples for tests.."
     make_sample_hello_world
+    make_sample_simple_table_text
   end
   def make_sample_hello_world
@@ -40,4 +41,26 @@ module PdfSamplesHelper
     end
     puts "Created: #{filename}"
   end
+  def make_sample_simple_table_text
+    filename = pdf_sample('simple_table_text.pdf')
+    Prawn::Document.generate filename do
+      text "Simple Table Text"
+      text_box "Table Header", :at => [10, 300], :width => 200
+      text_box "row 1", :at => [10, 250], :width => 90
+      text_box "val 1", :at => [100, 250], :width => 50
+      text_box "val 2", :at => [150, 250], :width => 50
+      text_box "val 3", :at => [200, 250], :width => 50
+      text_box "row 2", :at => [10, 200], :width => 90
+      text_box "val 1", :at => [100, 200], :width => 50
+      text_box "val 2", :at => [150, 200], :width => 50
+      text_box "val 3", :at => [200, 200], :width => 50
+      text_box "Table Footer", :at => [10, 150], :width => 200
+    end
+    puts "Created: #{filename}"
+  end
 end

data/spec/unit/reader/turtletext/textangle_spec.rb CHANGED

@@ -3,4 +3,197 @@ require 'spec_helper'
 describe PDF::Reader::Turtletext::Textangle do
   let(:resource_class) { PDF::Reader::Turtletext::Textangle }
+  let(:source) { nil } # we're just going to mock the PDF source here
+  let(:options) { {} }
+  let(:turtletext_reader) { PDF::Reader::Turtletext.new(source,options) }
+  describe "#reader" do
+    let(:textangle) { resource_class.new(turtletext_reader) }
+    subject { textangle.reader }
+    it { should be_a(PDF::Reader::Turtletext) }
+  end
+  describe "#text" do
+    let(:page) { 1 }
+    before do
+      turtletext_reader.stub(:load_content).and_return(given_page_content)
+    end
+    let(:given_page_content) { {
+      70.0=>{10.0=>"crunchy bacon"},
+      40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+      30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+      10.0=>{40.0=>"smoked and streaky for me"}
+    } }
+    context "with block param" do
+      [:above,:below,:left_of,:right_of].each do |positional_method|
+        context "with #{positional_method}" do
+          let(:term) { "canary" }
+          it "should work with block param" do
+            textangle = resource_class.new(turtletext_reader) do |r|
+              r.send("#{positional_method}=",term)
+            end
+            textangle.send(positional_method).should eql(term)
+          end
+        end
+      end
+    end
+    context "without block param" do
+      it "#above should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          above "canary"
+        end
+        textangle.above.should eql("canary")
+      end
+      it "#below should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          below "canary"
+        end
+        textangle.below.should eql("canary")
+      end
+      it "#left_of should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          left_of "canary"
+        end
+        textangle.left_of.should eql("canary")
+      end
+      it "#right_of should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          right_of "canary"
+        end
+        textangle.right_of.should eql("canary")
+      end
+    end
+    context "when only below specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = "fraud"
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = /Fraud/i
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = 20
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only above specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = "heaven"
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = /heaVen/i
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = 41
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only left_of specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = "turkey bacon"
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = /turKey/i
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = 29
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only right_of specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = "heaven"
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = /Heaven/i
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = 26
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+  end
 end

data/spec/unit/reader/turtletext/turtletext_spec.rb CHANGED

@@ -4,16 +4,16 @@ describe PDF::Reader::Turtletext do
   let(:resource_class) { PDF::Reader::Turtletext }
   let(:source) { nil } # we're just going to mock the PDF source here
-  let(:structured_reader) { resource_class.new(source,options) }
+  let(:turtletext_reader) { resource_class.new(source,options) }
   let(:options) { {} }
   describe "#reader" do
-    subject { structured_reader.reader}
+    subject { turtletext_reader.reader}
     it { should be_a(PDF::Reader) }
   end
   describe "#y_precision" do
-    subject { structured_reader.y_precision}
+    subject { turtletext_reader.y_precision}
     context "default" do
       it { should eql(3) }
     end
@@ -27,35 +27,40 @@ describe PDF::Reader::Turtletext do
   context "with mocked source content" do
     let(:page) { 1 }
     before do
-      structured_reader.should_receive(:load_content).with(page).and_return(given_page_content)
+      turtletext_reader.should_receive(:load_content).with(page).and_return(given_page_content)
     end
     {
       :with_simple_text => {
         :source_page_content => {10.0=>{10.0=>"a first bit of text"}},
         :expected_precise_content => {10.0=>{10.0=>"a first bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"}}
+        :expected_fuzzed_content => [[10.0,[[10.0,"a first bit of text"]]]]
       },
       :with_widely_separated_text => {
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}}
-      },
-      :with_unsorted_y_text => {
         :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
         :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{20.0=>"a second bit of text"},20.0=>{10.0=>"a first bit of text"}}
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"]]], [10.0, [[20.0, "a second bit of text"]]]]
+      },
+      :with_unsorted_y_text => {
+        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[20.0, "a second bit of text"]]], [10.0, [[10.0, "a first bit of text"]]]]
       },
       :with_fuzzed_y_text => {
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},12.0=>{12.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},12.0=>{12.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text",12.0=>"a second bit of text"}}
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [12.0, "a second bit of text"]]]]
       },
       :with_widely_separated_fuzzed_y_text => {
         :y_precision => 25,
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text",20.0=>"a second bit of text"}}
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [20.0, "a second bit of text"]]]]
+      },
+      :with_multiple_row_text => {
+        :source_page_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_precise_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_fuzzed_content => [[10.0, [[10.0, "first"], [20.0, "second"], [30.0, "third"]]]]
       }
     }.each do |test_name,test_expectations|
       context test_name do
@@ -69,12 +74,12 @@ describe PDF::Reader::Turtletext do
         }
         describe "#content" do
-          subject { structured_reader.content(page) }
+          subject { turtletext_reader.content(page) }
           it { should eql(test_expectations[:expected_fuzzed_content]) }
         end
         describe "#precise_content" do
-          subject { structured_reader.precise_content(page) }
+          subject { turtletext_reader.precise_content(page) }
           it { should eql(test_expectations[:expected_precise_content]) }
         end
@@ -90,24 +95,24 @@ describe PDF::Reader::Turtletext do
         },
         :with_single_line_text => {
           :source_page_content => {
-            10.0=>{10.0=>"first line ignored"},
+            70.0=>{10.0=>"first line ignored"},
             30.0=>{10.0=>"first part found", 20.0=>"last part found"},
-            70.0=>{10.0=>"last line ignored"}
+            10.0=>{10.0=>"last line ignored"}
           },
           :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
           :expected_text => [["first part found", "last part found"]]
         },
         :with_multi_line_text => {
           :source_page_content => {
-            10.0=>{10.0=>"first line ignored"},
-            30.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
-            40.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
-            70.0=>{10.0=>"last line ignored"}
+            70.0=>{10.0=>"first line ignored"},
+            40.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
+            30.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
+            10.0=>{10.0=>"last line ignored"}
           },
           :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
           :expected_text => [
-            ["last line first part found", "last line last part found"],
-            ["first line first part found", "first line last part found"]
+            ["first line first part found", "first line last part found"],
+            ["last line first part found", "last line last part found"]
           ]
         }
       }.each do |test_name,test_expectations|
@@ -118,7 +123,7 @@ describe PDF::Reader::Turtletext do
           let(:ymin) { test_expectations[:ymin] }
           let(:ymax) { test_expectations[:ymax] }
           let(:expected_text) { test_expectations[:expected_text] }
-          subject { structured_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
+          subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
           it { should eql(expected_text) }
         end
       end
@@ -126,21 +131,21 @@ describe PDF::Reader::Turtletext do
     describe "#text_position" do
       let(:given_page_content) { {
-        10.0=>{10.0=>"crunchy bacon"},
-        30.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
-        40.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
-        70.0=>{40.0=>"smoked and streaky da bomb"}
+        70.0=>{10.0=>"crunchy bacon"},
+        40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+        30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+        10.0=>{40.0=>"smoked and streaky da bomb"}
       } }
       {
-        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>40.0} },
-        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>30.0} },
-        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>30.0} },
-        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>10.0} }
+        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>30.0} },
+        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>40.0} },
+        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>40.0} },
+        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>70.0} }
       }.each do |test_name,test_expectations|
         context test_name do
           let(:match_term) { test_expectations[:match_term] }
           let(:expected_position) { test_expectations[:expected_position] }
-          subject { structured_reader.text_position(match_term,page) }
+          subject { turtletext_reader.text_position(match_term,page) }
           it { should eql(expected_position) }
         end
       end
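
Note on the changed `#content` expectations above: in 0.2.0 the fuzzed content is expected as nested arrays rather than a hash. Rows are ordered from the highest y to the lowest, and each row's text is ordered by ascending x. The following is a minimal sketch of a grouping that satisfies those spec expectations. It is not the gem's implementation: `fuzz_rows` is a hypothetical helper, and the default `y_precision` of 3 is an assumption chosen so that rows 2 units apart merge while rows 10 units apart do not.

```ruby
# Hypothetical helper (not part of pdf-reader-turtletext) illustrating the
# row grouping implied by the spec expectations in this diff.
#
# precise_content: { y => { x => text } }
# y_precision:     rows whose y values differ by less than this are merged
#                  (the default of 3 is an assumption)
# Returns:         [[y, [[x, text], ...]], ...] with y descending, x ascending
def fuzz_rows(precise_content, y_precision = 3)
  rows = []
  # Visit rows from the top of the page (largest y) downwards.
  precise_content.keys.sort.reverse.each do |y|
    # Reuse an existing row if one lies within y_precision of this y.
    row = rows.find { |row_y, _| (row_y - y).abs < y_precision }
    unless row
      row = [y, []]
      rows << row
    end
    # Append this row's [x, text] pairs and keep them ordered by x.
    row[1].concat(precise_content[y].to_a)
    row[1].sort_by! { |x, _| x }
  end
  rows
end

rows = fuzz_rows({ 10.0 => { 10.0 => "first" },
                   8.0  => { 20.0 => "second", 30.0 => "third" } })
p rows
```

Called with the `:with_multiple_row_text` source shown above, the sketch merges the rows at y = 10.0 and y = 8.0 into a single row keyed by the higher y value, which is the shape the spec expects. Passing 25 as the second argument reproduces the `:with_widely_separated_fuzzed_y_text` case in the same way.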

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader-turtletext
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-07-22 00:00:00.000000000 Z
+date: 2012-07-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pdf-reader
-  requirement: &70193556628420 !ruby/object:Gem::Requirement
+  requirement: &70218189955060 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - =
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.1
   type: :runtime
   prerelease: false
-  version_requirements: *70193556628420
+  version_requirements: *70218189955060
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &70193556627700 !ruby/object:Gem::Requirement
+  requirement: &70218189954360 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.1.4
   type: :development
   prerelease: false
-  version_requirements: *70193556627700
+  version_requirements: *70218189954360
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &70193556626800 !ruby/object:Gem::Requirement
+  requirement: &70218189953580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *70193556626800
+  version_requirements: *70218189953580
 - !ruby/object:Gem::Dependency
   name: rake
-  requirement: &70193556626300 !ruby/object:Gem::Requirement
+  requirement: &70218189953020 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.9.2.2
   type: :development
   prerelease: false
-  version_requirements: *70193556626300
+  version_requirements: *70218189953020
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &70193556625680 !ruby/object:Gem::Requirement
+  requirement: &70218189952200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.8.0
   type: :development
   prerelease: false
-  version_requirements: *70193556625680
+  version_requirements: *70218189952200
 - !ruby/object:Gem::Dependency
   name: rdoc
-  requirement: &70193556624820 !ruby/object:Gem::Requirement
+  requirement: &70218189951400 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
         version: '3.11'
   type: :development
   prerelease: false
-  version_requirements: *70193556624820
+  version_requirements: *70218189951400
 - !ruby/object:Gem::Dependency
   name: prawn
-  requirement: &70193556623960 !ruby/object:Gem::Requirement
+  requirement: &70218189950700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,10 @@ dependencies:
         version: 0.12.0
   type: :development
   prerelease: false
-  version_requirements: *70193556623960
+  version_requirements: *70218189950700
 - !ruby/object:Gem::Dependency
   name: guard-rspec
-  requirement: &70193556623440 !ruby/object:Gem::Requirement
+  requirement: &70218189950100 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -98,7 +98,7 @@ dependencies:
         version: 1.2.0
   type: :development
   prerelease: false
-  version_requirements: *70193556623440
+  version_requirements: *70218189950100
 description: a library that can read structured and positional text from PDFs. Ideal
   for asembling structured data from invoices and the like.
 email: gallagher.paul@gmail.com
@@ -111,6 +111,7 @@ files:
 - .rspec
 - .rvmrc
 - .travis.yml
+- CHANGELOG
 - Gemfile
 - Gemfile.lock
 - Guardfile
@@ -125,8 +126,10 @@ files:
 - lib/pdf/reader/turtletext/version.rb
 - pdf-reader-turtletext.gemspec
 - spec/fixtures/pdf_samples/.gitkeep
+- spec/fixtures/pdf_samples/expectations.yml
 - spec/fixtures/pdf_samples/hello_world.pdf
 - spec/fixtures/pdf_samples/junk_prefix.pdf
+- spec/fixtures/pdf_samples/simple_table_text.pdf
 - spec/integration/pdf_samples_spec.rb
 - spec/spec_helper.rb
 - spec/support/pdf_samples_helper.rb