RubyGems - pdf-reader-turtletext - Versions diffs - 0.1.0 → 0.2.0 - Mend

pdf-reader-turtletext 0.1.0 → 0.2.0

Files changed (14)

data/.travis.yml +9 -1
data/CHANGELOG +11 -0
data/README.rdoc +106 -15
data/lib/pdf/reader/turtletext.rb +52 -24
data/lib/pdf/reader/turtletext/textangle.rb +89 -13
data/lib/pdf/reader/turtletext/version.rb +1 -1
data/pdf-reader-turtletext.gemspec +5 -2
data/spec/fixtures/pdf_samples/expectations.yml +95 -0
data/spec/fixtures/pdf_samples/simple_table_text.pdf +139 -0
data/spec/integration/pdf_samples_spec.rb +28 -0
data/spec/support/pdf_samples_helper.rb +23 -0
data/spec/unit/reader/turtletext/textangle_spec.rb +193 -0
data/spec/unit/reader/turtletext/turtletext_spec.rb +42 -37
metadata +21 -18

data/.travis.yml CHANGED

@@ -1,3 +1,11 @@
 # These are specific configuration settings required for travis-ci
 # see http://travis-ci.org/tardate/pdf-reader-turtletext
-rvm: 1.9.3
+language: ruby
+rvm:
+  - 1.8.7
+  - 1.9.2
+  - 1.9.3
+  - rbx-18mode
+  - rbx-19mode
+  - jruby-18mode
+  - jruby-19mode

data/CHANGELOG ADDED

@@ -0,0 +1,11 @@
+Version 0.2.0              Release: n/a
+==================================================
+* add bounding_box / textangle semantics
+* improve documentation
+* MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
+Version 0.1.0              Release: 22nd July 2012
+==================================================
+* Initial packaging and release of core functionality directly extracted
+  from https://github.com/tardate/sps_bill_scanner/
+* MRI 1.9 only

data/README.rdoc CHANGED

@@ -14,30 +14,121 @@ For an example of how this is works in practice, see the
 == Requirements and Known Limitations
-* currently only tested with Ruby 1.9
-* fixed dependency on PDF::Reader v 1.1.1
+* Tested with MRI 1.8.7, 1.9.2, 1.9.3, Rubinius (1.8 and 1.9 mode), JRuby (1.8 and 1.9 mode)
+* Has a fixed dependency on PDF::Reader v1.1.1
-== Installation
+== The PDF::Reader::Turtletext Cookbook
-  gem install pdf-reader-turtletext
+=== How do I install it for normal use?
-== Usage
+It is distributed as a gem, so all normal gem installation procedures apply. To install the
+gem directly from the command line:
-=== PDF::Reader::Turtletext
+  $ gem install pdf-reader-turtletext
-Provides a range of methods to extract structured text from a PDF file,
-such as <tt>text_position</tt> and <tt>text_in_region</tt>.
+If you are using bundler or Rails, add to your Gemfile:
-A typical usage:
+  gem 'pdf-reader-turtletext'
+Then bundle install:
+  $ bundle
+=== How do I install it for gem development?
+If you want to work on enhancements or fix bugs in PDF::Reader::Turtletext, fork and clone the GitHub repository. See the section below on 'Contributing to PDF::Reader::Turtletext'.
+=== How to instantiate Turtletext in code
+All interaction is done using an instance of the PDF::Reader::Turtletext class. It is
+initialised given a filename or IO-like object, and any required options.
+Typical usage:
+  pdf_filename = '../some_path/some.pdf'
   reader = PDF::Reader::Turtletext.new(pdf_filename)
-  page = 1
-  heading_position = reader.text_position(/transaction table/i)
-  next_section = reader.text_position(/transaction summary/i)
-  transaction_rows = reader.text_in_region(
-    heading_position[x], 900,
-    heading_position[y] + 1,next_section[:y] -1
+  options = { :y_precision => 5 }
+  reader_with_options = PDF::Reader::Turtletext.new(pdf_filename,options)
+=== How to extract text within a region described in relation to other text
+Problem: we don't know exactly where the required text will be on the page, and it is not encoded
+within the PDF as a single object. But we do know that it will be relatively positioned (for example)
+below a certain bit of text, to the left of another, and above some other text.
+Solution: use the <tt>bounding_box</tt> method to describe the region and extract the matching text.
+  textangle = reader.bounding_box do
+    page 1
+    below /electricity/i
+    above 10
+    right_of 240.0
+    left_of "Total ($)"
+  end
+  textangle.text
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+The range of methods that can be used within the <tt>bounding_box</tt> block are all optional, and include:
+* <tt>page</tt> - specifies the PDF page from which to extract text (default is 1).
+* <tt>below</tt> - a string, regex or number that describes the upper limit of the text box
+  (default is top border of the page).
+* <tt>above</tt> - a string, regex or number that describes the lower limit of the text box
+  (default is bottom border of the page).
+* <tt>left_of</tt> - a string, regex or number that describes the right limit of the text box
+  (default is right border of the page).
+* <tt>right_of</tt> - a string, regex or number that describes the left limit of the text box
+  (default is left border of the page).
+Note that <tt>left_of</tt> and <tt>right_of</tt> constraints do *not* need to be within the vertical
+range of the box being described.
+For example, you could use an element in the page header to describe the <tt>left_of</tt> limit
+for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.
+Similarly, <tt>above</tt> and <tt>below</tt> constraints do *not* need to be within the horizontal
+range of the box being described.
+=== Using a block parameter with the <tt>bounding_box</tt> method
+An explicit block parameter may be used with the <tt>bounding_box</tt> method:
+  textangle = reader.bounding_box do |r|
+    r.below /electricity/i
+    r.left_of "Total ($)"
+  end
+  textangle.text
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+=== Extract text for a region with known positional co-ordinates
+If you know (or can calculate) the x,y positions of the required text region, you can extract the region's
+text using the <tt>text_in_region</tt> method.
+  text = reader.text_in_region(
+    10,  # minimum x (left-most) (inclusive)
+    900, # maximum x (right-most) (inclusive)
+    200, # minimum y (bottom-most) (inclusive)
+    400, # maximum y (top-most) (inclusive)
+    1    # page
   )
+  => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
+Note that the x,y origin is at the bottom-left of the page.
+=== How to find the x,y co-ordinate of a specific text element
+Problem: if you are doing low-level text extraction with <tt>text_in_region</tt> for example,
+it is usually necessary to locate specific text to provide a positional reference.
+Solution: use the <tt>text_position</tt> method to locate text by exact or partial match.
+It returns a Hash of x/y co-ordinates that is the bottom-left corner of the text.
+  page = 1
+  text_by_exact_match = reader.text_position("Transaction Table", page)
+  => { :x => 10.0, :y => 600.0 }
+  text_by_regex_match = reader.text_position(/transaction summary/i, page)
+  => { :x => 10.0, :y => 300.0 }
+Note: in the case of multiple matches, only the first match is returned.
 == Contributing to PDF::Reader::Turtletext

data/lib/pdf/reader/turtletext.rb CHANGED

@@ -16,6 +16,8 @@ class PDF::Reader::Turtletext
   attr_reader :options
   # +source+ is a file name or stream-like object
+  # Supported +options+ include:
+  # * :y_precision
   def initialize(source, options={})
     @options = options
     @reader = PDF::Reader.new(source)
@@ -31,7 +33,7 @@ class PDF::Reader::Turtletext
   end
   # Returns positional (with fuzzed y positioning) text content collection as an array of rows:
-  #   { y_position: { x_position: content}}
+  #   [ fuzzed_y_position, [[x_position,content]] ]
   def content(page=1)
     @content ||= []
     if @content[page]
@@ -41,18 +43,24 @@ class PDF::Reader::Turtletext
     end
   end
-  # Returns a hash with fuzzed positioning:
-  #   { fuzzed_y_position: { x_position: content}}
+  # Returns an Array with fuzzed positioning, ordered by decreasing y position. Row content order by x position.
+  #   [ fuzzed_y_position, [[x_position,content]] ]
   # Given +input+ as a hash:
   #   { y_position: { x_position: content}}
   # Fuzz factors: +y_precision+
   def fuzzed_y(input)
-    output = {}
-    input.keys.sort.each do |precise_y|
-      # matching_y = (precise_y / 5.0).truncate * 5.0
-      matching_y = output.keys.select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
-      output[matching_y] ||= {}
-      output[matching_y].merge!(input[precise_y])
+    output = []
+    input.keys.sort.reverse.each do |precise_y|
+      matching_y = output.map(&:first).select{|new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
+      y_index = output.index{|y| y.first == matching_y }
+      new_row_content = input[precise_y].to_a
+      if y_index
+        row_content = output[y_index].last
+        row_content += new_row_content
+        output[y_index] = [matching_y,row_content]
+      else
+        output << [matching_y,new_row_content]
+      end
     end
     output
   end
@@ -69,21 +77,24 @@ class PDF::Reader::Turtletext
   end
   # Returns an array of text elements found within the x,y limits,
+  # x ranges from +xmin+ (left of page) to +xmax+ (right of page)
+  # y ranges from +ymin+ (bottom of page) to +ymax+ (top of page)
   # Each line of text found is returned as an array element.
   # Each line of text is an array of the separate text elements found on that line.
   #   [["first line first text", "first line last text"],["second line text"]]
   def text_in_region(xmin,xmax,ymin,ymax,page=1)
     text_map = content(page)
     box = []
-    text_map.keys.sort.reverse.each do |y|
+    text_map.each do |y,text_row|
       if y >= ymin && y<= ymax
         row = []
-        text_map[y].keys.sort.each do |x|
+        text_row.each do |x,element|
           if x >= xmin && x<= xmax
-            row << text_map[y][x]
+            row << [x,element]
           end
         end
-        box << row unless row.empty?
+        box << row.sort{|a,b| a.first <=> b.first }.map(&:last) unless row.empty?
       end
     end
     box
@@ -94,7 +105,11 @@ class PDF::Reader::Turtletext
   # +text+ may be a string (exact match required) or a Regexp
   def text_position(text,page=1)
     item = if text.class <= Regexp
-      content(page).map {|k,v| if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  } ; [k,x] ; end }
+      content(page).map do |k,v|
+        if x = v.reduce(nil){|memo,vv|  memo = (vv[1] =~ text) ? vv[0] : memo  }
+          [k,x]
+        end
+      end
     else
       content(page).map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
     end
@@ -104,17 +119,30 @@ class PDF::Reader::Turtletext
     end
   end
-  # WIP - not using Textangle yet for text extraction.
-  # Ideal usage is something like this:
+  # Returns a text region definition using a descriptive block.
+  #
+  # Usage:
+  #
+  #   textangle = reader.bounding_box do
+  #     page 1
+  #     below /electricity/i
+  #     above 10
+  #     right_of 240.0
+  #     left_of "Total ($)"
+  #   end
+  #   textangle.text
+  #
+  # Alternatively, an explicit block parameter may be used:
   #
-  # textangle = reader.bounding_box do
-  #   page 1
-  #   below "Electricity Services"
-  #   above "Gas Services by City Gas Pte Ltd"
-  #   right_of 240.0
-  #   left_of "Total ($)"
-  # end
-  # textangle.text
+  #   textangle = reader.bounding_box do |r|
+  #     r.page 1
+  #     r.below /electricity/i
+  #     r.above 10
+  #     r.right_of 240.0
+  #     r.left_of "Total ($)"
+  #   end
+  #   textangle.text
+  #   => [['string','string'],['string']] # array of rows, each row is an array of column text element
   #
   def bounding_box(&block)
     PDF::Reader::Turtletext::Textangle.new(self,&block)
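
The change above moves row storage from a nested hash to an ordered array of [y, [[x, text], ...]] pairs. The following is a minimal standalone sketch of that idea, not the gem's code: fuzz_rows and rows_in_region are hypothetical names, and the default precision of 3 is taken from the spec in this diff. The sample input is the :with_multiple_row_text case from turtletext_spec.rb.

```ruby
# Illustrative sketch only; fuzz_rows and rows_in_region are hypothetical
# names, not methods of PDF::Reader::Turtletext.

# Groups { y => { x => text } } into [[y, [[x, text], ...]], ...], top row
# first. A row is merged into an earlier one when their y values differ by
# less than y_precision.
def fuzz_rows(input, y_precision = 3)
  output = []
  input.keys.sort.reverse.each do |precise_y|
    row = output.find { |y, _| (y - precise_y).abs < y_precision }
    if row
      row[1].concat(input[precise_y].to_a)
    else
      output << [precise_y, input[precise_y].to_a]
    end
  end
  output
end

# Keeps rows whose y lies in ymin..ymax and elements whose x lies in
# xmin..xmax (both inclusive), then orders each row's text by x.
def rows_in_region(rows, xmin, xmax, ymin, ymax)
  rows.each_with_object([]) do |(y, elements), box|
    next unless y >= ymin && y <= ymax
    hits = elements.select { |x, _| x >= xmin && x <= xmax }
    box << hits.sort_by(&:first).map(&:last) unless hits.empty?
  end
end

rows = fuzz_rows({ 10.0 => { 10.0 => "first" }, 8.0 => { 20.0 => "second", 30.0 => "third" } })
p rows                                  # one merged row keyed by y = 10.0
p rows_in_region(rows, 15, 100, 0, 50)  # [["second", "third"]]
```

Sorting by x at extraction time matters because a merged row can hold elements appended from a lower y value whose x positions fall between those already present.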

data/lib/pdf/reader/turtletext/textangle.rb CHANGED

@@ -1,27 +1,103 @@
 # A DSL syntax for text extraction.
-# WIP - not using this yet
 #
-# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do
-#   page 1
-#   below "Electricity Services"
-#   above "Gas Services by City Gas Pte Ltd"
-#   right_of 240.0
-#   left_of "Total ($)"
+# textangle = PDF::Reader::Turtletext::Textangle.new(reader) do |r|
+#   r.page = 1
+#   r.below = "Electricity Services"
+#   r.above = "Gas Services by City Gas Pte Ltd"
+#   r.right_of = 240.0
+#   r.left_of = "Total ($)"
 # end
 # textangle.text
 #
 class PDF::Reader::Turtletext::Textangle
   attr_reader :reader
-  attr_writer :page,:above,:below,:left_of,:right_of
+  attr_accessor :page
+  attr_writer :above,:below,:left_of,:right_of
-  # +structured_reader+ is a PDF::StructuredReader
-  def initialize(structured_reader,&block)
-    @reader = structured_reader
-    instance_eval( &block ) if block
+  # +turtletext_reader+ is a PDF::Reader::Turtletext
+  def initialize(turtletext_reader,&block)
+    @reader = turtletext_reader
+    @page = 1
+    if block_given?
+      if block.arity == 1
+        yield self
+      else
+        instance_eval &block
+      end
+    end
   end
+  def above(*args)
+    if value = args.first
+      @above = value
+    end
+    @above
+  end
+  def below(*args)
+    if value = args.first
+      @below = value
+    end
+    @below
+  end
+  def left_of(*args)
+    if value = args.first
+      @left_of = value
+    end
+    @left_of
+  end
+  def right_of(*args)
+    if value = args.first
+      @right_of = value
+    end
+    @right_of
+  end
+  # Returns the text
   def text
-    # TODO
+    return unless reader
+    xmin = if right_of
+      if [Fixnum,Float].include?(right_of.class)
+        right_of
+      else
+        reader.text_position(right_of,page)[:x] + 1
+      end
+    else
+      0
+    end
+    xmax = if left_of
+      if [Fixnum,Float].include?(left_of.class)
+        left_of
+      else
+        reader.text_position(left_of,page)[:x] - 1
+      end
+    else
+      99999 # TODO actual limit
+    end
+    ymin = if above
+      if [Fixnum,Float].include?(above.class)
+        above
+      else
+        reader.text_position(above,page)[:y] + 1
+      end
+    else
+      0
+    end
+    ymax = if below
+      if [Fixnum,Float].include?(below.class)
+        below
+      else
+        reader.text_position(below,page)[:y] - 1
+      end
+    else
+      99999 # TODO actual limit
+    end
+    reader.text_in_region(xmin,xmax,ymin,ymax,page)
   end
 end
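
The constructor above supports two block styles by checking the block's arity. A minimal sketch of that dispatch follows, using a hypothetical RegionSpec class with only two constraints; it mirrors the combined getter/setter pattern from the diff but is not the gem's Textangle class.

```ruby
# Illustrative sketch only; RegionSpec is a hypothetical stand-in for the
# Textangle class shown in the diff.
class RegionSpec
  attr_writer :above, :below

  def initialize(&block)
    return unless block
    if block.arity == 1
      block.call(self)       # explicit receiver style: |r| r.below = 20
    else
      instance_eval(&block)  # bare DSL style: below 20
    end
  end

  # With an argument this acts as a setter, without one as a getter.
  # As in the diff, a nil or false argument leaves the current value alone.
  def above(*args)
    value = args.first
    @above = value if value
    @above
  end

  def below(*args)
    value = args.first
    @below = value if value
    @below
  end
end

dsl = RegionSpec.new { below "fraud" }
explicit = RegionSpec.new { |r| r.below = 20 }
p dsl.below       # "fraud"
p explicit.below  # 20
```

The explicit-parameter form keeps the caller's own self inside the block, which is useful when the block needs to call methods or read instance variables of the surrounding object. Separately, the diff's `[Fixnum,Float].include?` check relies on Fixnum, which later Ruby versions removed; a check such as `value.is_a?(Numeric)` is one possible substitute, offered here as a suggestion rather than something this release does.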

data/lib/pdf/reader/turtletext/version.rb CHANGED

@@ -3,7 +3,7 @@ module PDF
     class Turtletext
       class Version
         MAJOR = 0
-        MINOR = 1
+        MINOR = 2
         PATCH = 0
         STRING = [MAJOR, MINOR, PATCH].compact.join('.')

data/pdf-reader-turtletext.gemspec CHANGED

@@ -5,11 +5,11 @@
 Gem::Specification.new do |s|
   s.name = "pdf-reader-turtletext"
-  s.version = "0.1.0"
+  s.version = "0.2.0"
   s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
   s.authors = ["Paul Gallagher"]
-  s.date = "2012-07-22"
+  s.date = "2012-07-31"
   s.description = "a library that can read structured and positional text from PDFs. Ideal for asembling structured data from invoices and the like."
   s.email = "gallagher.paul@gmail.com"
   s.extra_rdoc_files = [
@@ -20,6 +20,7 @@ Gem::Specification.new do |s|
     ".rspec",
     ".rvmrc",
     ".travis.yml",
+    "CHANGELOG",
     "Gemfile",
     "Gemfile.lock",
     "Guardfile",
@@ -34,8 +35,10 @@ Gem::Specification.new do |s|
     "lib/pdf/reader/turtletext/version.rb",
     "pdf-reader-turtletext.gemspec",
     "spec/fixtures/pdf_samples/.gitkeep",
+    "spec/fixtures/pdf_samples/expectations.yml",
     "spec/fixtures/pdf_samples/hello_world.pdf",
     "spec/fixtures/pdf_samples/junk_prefix.pdf",
+    "spec/fixtures/pdf_samples/simple_table_text.pdf",
     "spec/integration/pdf_samples_spec.rb",
     "spec/spec_helper.rb",
     "spec/support/pdf_samples_helper.rb",

data/spec/fixtures/pdf_samples/expectations.yml ADDED

@@ -0,0 +1,95 @@
+# this file defines the test expectations for PDF samples in spec/fixtures/pdf_samples.
+#
+# This is a YAML-format file, so beware that indentation is significant
+---
+hello_world.pdf:
+  :test_above:
+    :above: 100
+    :expected_text:
+    -
+      - "Hello World"
+  :test_below:
+    :below: 900
+    :expected_text:
+    -
+      - "Hello World"
+  :test_below_na:
+    :below: 10
+    :expected_text: []
+simple_table_text.pdf:
+  :test_above:
+    :above: Table Header
+    :expected_text:
+    -
+      - "Simple Table Text"
+  :test_below:
+    :below: row 2
+    :expected_text:
+    -
+      - "Table Footer"
+  :test_right_of:
+    :right_of: row 1
+    :expected_text:
+    -
+      - "val 1"
+      - "val 2"
+      - "val 3"
+    -
+      - "val 1"
+      - "val 2"
+      - "val 3"
+  :test_left_of:
+    :left_of: val 1
+    :expected_text:
+    -
+      - "Simple Table Text"
+    -
+      - "Table Header"
+    -
+      - "row 1"
+    -
+      - "row 2"
+    -
+      - "Table Footer"
+  :test_above_and_below:
+    :below: Table Header
+    :above: Table Footer
+    :expected_text:
+    -
+      - "row 1"
+      - "val 1"
+      - "val 2"
+      - "val 3"
+    -
+      - "row 2"
+      - "val 1"
+      - "val 2"
+      - "val 3"
+  :test_above_and_below_and_left_of:
+    :below: Table Header
+    :above: Table Footer
+    :left_of: val 2
+    :expected_text:
+    -
+      - "row 1"
+      - "val 1"
+    -
+      - "row 2"
+      - "val 1"
+  :test_above_and_below_and_left_of_and_right_of:
+    :below: Table Header
+    :above: Table Footer
+    :left_of: val 2
+    :right_of: row 1
+    :expected_text:
+    -
+      - "val 1"
+    -
+      - "val 1"
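
To see why these expectations hold, the sketch below replays two of them against hand-entered coordinates. It makes several assumptions: the ROWS table copies the Td operands from the simple_table_text.pdf fixture in this diff and assumes the reader reports those positions unchanged; position_of, limit and text_for are hypothetical helpers; and `is_a?(Numeric)` is substituted for the diff's Fixnum/Float check. The 1-point offsets and the 0 and 99999 defaults come from Textangle#text.

```ruby
# Illustrative sketch only; coordinates are assumed to equal the Td operands
# in the fixture, and all helper names are hypothetical.
ROWS = [
  [747.384, [[36.0, "Simple Table Text"]]],
  [327.384, [[46.0, "Table Header"]]],
  [277.384, [[46.0, "row 1"], [136.0, "val 1"], [186.0, "val 2"], [236.0, "val 3"]]],
  [227.384, [[46.0, "row 2"], [136.0, "val 1"], [186.0, "val 2"], [236.0, "val 3"]]],
  [177.384, [[46.0, "Table Footer"]]]
].freeze

# Returns the position of the first element whose text equals +text+.
def position_of(text)
  ROWS.each do |y, elements|
    x, _ = elements.rassoc(text)
    return { :x => x, :y => y } if x
  end
  nil
end

# Numbers are used as-is; text is located and nudged by +offset+ so the
# reference text itself falls outside the box.
def limit(value, axis, offset, default)
  return default if value.nil?
  return value if value.is_a?(Numeric)
  position_of(value)[axis] + offset
end

def text_for(constraints)
  xmin = limit(constraints[:right_of], :x, 1, 0)
  xmax = limit(constraints[:left_of], :x, -1, 99999)
  ymin = limit(constraints[:above], :y, 1, 0)
  ymax = limit(constraints[:below], :y, -1, 99999)
  ROWS.each_with_object([]) do |(y, elements), box|
    next unless y.between?(ymin, ymax)
    row = elements.select { |x, _| x.between?(xmin, xmax) }.map(&:last)
    box << row unless row.empty?
  end
end

p text_for({ :below => "Table Header", :above => "Table Footer",
             :left_of => "val 2", :right_of => "row 1" })
# [["val 1"], ["val 1"]]
```

Note the naming inversion: `below` sets the upper y limit and `above` the lower one, because each constraint describes where the wanted text sits relative to the reference.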

data/spec/fixtures/pdf_samples/simple_table_text.pdf ADDED

@@ -0,0 +1,139 @@
+%PDF-1.3
+%����
+1 0 obj
+<< /Creator <feff0050007200610077006e>
+/Producer <feff0050007200610077006e>
+>>
+endobj
+2 0 obj
+<< /Type /Catalog
+/Pages 3 0 R
+>>
+endobj
+3 0 obj
+<< /Type /Pages
+/Count 1
+/Kids [5 0 R]
+>>
+endobj
+4 0 obj
+<< /Length 795
+>>
+stream
+q
+BT
+36 747.384 Td
+/F1.0 12 Tf
+[<53696d706c652054> 120 <6162> 20 <6c652054> 120 <65> 30 <7874>] TJ
+ET
+BT
+46 327.384 Td
+/F1.0 12 Tf
+[<54> 120 <6162> 20 <6c6520486561646572>] TJ
+ET
+BT
+46 277.384 Td
+/F1.0 12 Tf
+[<726f> 15 <772031>] TJ
+ET
+BT
+136 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2031>] TJ
+ET
+BT
+186 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2032>] TJ
+ET
+BT
+236 277.384 Td
+/F1.0 12 Tf
+[<76> 25 <616c2033>] TJ
+ET
+BT
+46 227.38400000000001 Td
+/F1.0 12 Tf
+[<726f> 15 <772032>] TJ
+ET
+BT
+136 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2031>] TJ
+ET
+BT
+186 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2032>] TJ
+ET
+BT
+236 227.38400000000001 Td
+/F1.0 12 Tf
+[<76> 25 <616c2033>] TJ
+ET
+BT
+46 177.38400000000001 Td
+/F1.0 12 Tf
+[<54> 120 <6162> 20 <6c652046> 30 <6f6f746572>] TJ
+ET
+Q
+endstream
+endobj
+5 0 obj
+<< /Type /Page
+/Parent 3 0 R
+/MediaBox [0 0 612.0 792.0]
+/Contents 4 0 R
+/Resources << /ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
+/Font << /F1.0 6 0 R
+>>
+>>
+>>
+endobj
+6 0 obj
+<< /Type /Font
+/Subtype /Type1
+/BaseFont /Helvetica
+/Encoding /WinAnsiEncoding
+>>
+endobj
+xref
+0 7
+0000000000 65535 f
+0000000015 00000 n
+0000000109 00000 n
+0000000158 00000 n
+0000000215 00000 n
+0000001061 00000 n
+0000001239 00000 n
+trailer
+<< /Size 7
+/Root 2 0 R
+/Info 1 0 R
+>>
+startxref
+1336
+%%EOF

data/spec/integration/pdf_samples_spec.rb CHANGED

@@ -3,5 +3,33 @@ include PdfSamplesHelper
 describe "PDF Samples" do
+  # This will scan all *.pdf files in spec/fixtures/personal_pdf_samples
+  # and do basic verification of the file structure without any effort from you.
+  pdf_sample_expectations.each do |sample_name,test_specifications|
+    describe "sample" do
+      let(:options) { test_specifications[:options] || {} }
+      let(:sample_file) { pdf_sample(sample_name) }
+      let(:turtletext_reader) { PDF::Reader::Turtletext.new(sample_file,options) }
+      (test_specifications||{}).each do |test_name,expectations|
+        context test_name do
+          let(:bounding_box) {
+            turtletext_reader.bounding_box do
+              above expectations[:above]
+              below expectations[:below]
+              left_of expectations[:left_of]
+              right_of expectations[:right_of]
+            end
+          }
+          # it {
+          #   puts "bounding_box"
+          #   puts bounding_box.inspect
+          # }
+          subject { bounding_box.text }
+          it { should eql(expectations[:expected_text])}
+        end
+      end
+    end
+  end
 end

data/spec/support/pdf_samples_helper.rb CHANGED

@@ -31,6 +31,7 @@ module PdfSamplesHelper
     require 'prawn'
     puts "Making PDF samples for tests.."
     make_sample_hello_world
+    make_sample_simple_table_text
   end
   def make_sample_hello_world
@@ -40,4 +41,26 @@ module PdfSamplesHelper
     end
     puts "Created: #{filename}"
   end
+  def make_sample_simple_table_text
+    filename = pdf_sample('simple_table_text.pdf')
+    Prawn::Document.generate filename do
+      text "Simple Table Text"
+      text_box "Table Header", :at => [10, 300], :width => 200
+      text_box "row 1", :at => [10, 250], :width => 90
+      text_box "val 1", :at => [100, 250], :width => 50
+      text_box "val 2", :at => [150, 250], :width => 50
+      text_box "val 3", :at => [200, 250], :width => 50
+      text_box "row 2", :at => [10, 200], :width => 90
+      text_box "val 1", :at => [100, 200], :width => 50
+      text_box "val 2", :at => [150, 200], :width => 50
+      text_box "val 3", :at => [200, 200], :width => 50
+      text_box "Table Footer", :at => [10, 150], :width => 200
+    end
+    puts "Created: #{filename}"
+  end
 end
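
The :at values passed to text_box above differ from the Td operands in the generated simple_table_text.pdf fixture. The arithmetic below reproduces the fixture's numbers under two assumptions that the diff does not state: Prawn's default page margin of 36 points, and a Helvetica ascender of 718/1000 em, so the baseline sits one ascender below the top of the box. baseline_for is a hypothetical helper, not part of Prawn or this gem.

```ruby
# Illustrative arithmetic only; MARGIN and the ascender value are assumptions.
PAGE_HEIGHT = 792.0                 # from the fixture's MediaBox
MARGIN      = 36.0                  # assumed Prawn default margin
FONT_SIZE   = 12                    # from "/F1.0 12 Tf" in the fixture
ASCENDER    = 718 / 1000.0 * FONT_SIZE

# Converts a text_box :at point (top-left corner, relative to the margin box)
# into the absolute baseline origin that would appear as a Td operand.
def baseline_for(at_x, at_y)
  [at_x + MARGIN, (at_y + MARGIN - ASCENDER).round(3)]
end

p baseline_for(10, 300)                      # "Table Header": [46.0, 327.384]
p baseline_for(100, 250)                     # first "val 1":  [136.0, 277.384]
p baseline_for(0, PAGE_HEIGHT - 2 * MARGIN)  # top of page:    [36.0, 747.384]
```

This is why the y values returned by text_position are text baselines in absolute page coordinates rather than the values supplied when the sample was generated.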

data/spec/unit/reader/turtletext/textangle_spec.rb CHANGED

@@ -3,4 +3,197 @@ require 'spec_helper'
 describe PDF::Reader::Turtletext::Textangle do
   let(:resource_class) { PDF::Reader::Turtletext::Textangle }
+  let(:source) { nil } # we're just going to mock the PDF source here
+  let(:options) { {} }
+  let(:turtletext_reader) { PDF::Reader::Turtletext.new(source,options) }
+  describe "#reader" do
+    let(:textangle) { resource_class.new(turtletext_reader) }
+    subject { textangle.reader }
+    it { should be_a(PDF::Reader::Turtletext) }
+  end
+  describe "#text" do
+    let(:page) { 1 }
+    before do
+      turtletext_reader.stub(:load_content).and_return(given_page_content)
+    end
+    let(:given_page_content) { {
+      70.0=>{10.0=>"crunchy bacon"},
+      40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+      30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+      10.0=>{40.0=>"smoked and streaky for me"}
+    } }
+    context "with block param" do
+      [:above,:below,:left_of,:right_of].each do |positional_method|
+        context "with #{positional_method}" do
+          let(:term) { "canary" }
+          it "should work with block param" do
+            textangle = resource_class.new(turtletext_reader) do |r|
+              r.send("#{positional_method}=",term)
+            end
+            textangle.send(positional_method).should eql(term)
+          end
+        end
+      end
+    end
+    context "without block param" do
+      it "#above should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          above "canary"
+        end
+        textangle.above.should eql("canary")
+      end
+      it "#below should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          below "canary"
+        end
+        textangle.below.should eql("canary")
+      end
+      it "#left_of should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          left_of "canary"
+        end
+        textangle.left_of.should eql("canary")
+      end
+      it "#right_of should work" do
+        textangle = resource_class.new(turtletext_reader) do
+          right_of "canary"
+        end
+        textangle.right_of.should eql("canary")
+      end
+    end
+    context "when only below specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = "fraud"
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = /Fraud/i
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.below = 20
+        end }
+        let(:expected) { [["smoked and streaky for me"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only above specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = "heaven"
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = /heaVen/i
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.above = 41
+        end }
+        let(:expected) { [["crunchy bacon"]]}
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only left_of specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = "turkey bacon"
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = /turKey/i
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.left_of = 29
+        end }
+        let(:expected) { [
+          ["crunchy bacon"],
+          ["bacon on kimchi noodles", "heaven"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+    context "when only right_of specified" do
+      context "as a string" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = "heaven"
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a regex" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = /Heaven/i
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+      context "as a number" do
+        let(:textangle) { resource_class.new(turtletext_reader) do |r|
+          r.right_of = 26
+        end }
+        let(:expected) { [
+          ["turkey bacon","fraud"],
+          ["smoked and streaky for me"]
+        ] }
+        subject { textangle.text }
+        it { should eql(expected) }
+      end
+    end
+  end
 end

data/spec/unit/reader/turtletext/turtletext_spec.rb CHANGED

@@ -4,16 +4,16 @@ describe PDF::Reader::Turtletext do
   let(:resource_class) { PDF::Reader::Turtletext }
   let(:source) { nil } # we're just going to mock the PDF source here
-  let(:structured_reader) { resource_class.new(source,options) }
+  let(:turtletext_reader) { resource_class.new(source,options) }
   let(:options) { {} }
   describe "#reader" do
-    subject { structured_reader.reader}
+    subject { turtletext_reader.reader}
     it { should be_a(PDF::Reader) }
   end
   describe "#y_precision" do
-    subject { structured_reader.y_precision}
+    subject { turtletext_reader.y_precision}
     context "default" do
       it { should eql(3) }
     end
@@ -27,35 +27,40 @@ describe PDF::Reader::Turtletext do
   context "with mocked source content" do
     let(:page) { 1 }
     before do
-      structured_reader.should_receive(:load_content).with(page).and_return(given_page_content)
+      turtletext_reader.should_receive(:load_content).with(page).and_return(given_page_content)
     end
     {
       :with_simple_text => {
         :source_page_content => {10.0=>{10.0=>"a first bit of text"}},
         :expected_precise_content => {10.0=>{10.0=>"a first bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"}}
+        :expected_fuzzed_content => [[10.0,[[10.0,"a first bit of text"]]]]
       },
       :with_widely_separated_text => {
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}}
-      },
-      :with_unsorted_y_text => {
         :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
         :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{20.0=>"a second bit of text"},20.0=>{10.0=>"a first bit of text"}}
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"]]], [10.0, [[20.0, "a second bit of text"]]]]
+      },
+      :with_unsorted_y_text => {
+        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[20.0, "a second bit of text"]]], [10.0, [[10.0, "a first bit of text"]]]]
       },
       :with_fuzzed_y_text => {
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},12.0=>{12.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},12.0=>{12.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text",12.0=>"a second bit of text"}}
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},18.0=>{12.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [12.0, "a second bit of text"]]]]
       },
       :with_widely_separated_fuzzed_y_text => {
         :y_precision => 25,
-        :source_page_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_precise_content => {10.0=>{10.0=>"a first bit of text"},20.0=>{20.0=>"a second bit of text"}},
-        :expected_fuzzed_content => {10.0=>{10.0=>"a first bit of text",20.0=>"a second bit of text"}}
+        :source_page_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_precise_content => {20.0=>{10.0=>"a first bit of text"},10.0=>{20.0=>"a second bit of text"}},
+        :expected_fuzzed_content => [[20.0, [[10.0, "a first bit of text"], [20.0, "a second bit of text"]]]]
+      },
+      :with_multiple_row_text => {
+        :source_page_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_precise_content => {10.0=>{10.0=>"first"},8.0=>{20.0=>"second",30.0=>"third"}},
+        :expected_fuzzed_content => [[10.0, [[10.0, "first"], [20.0, "second"], [30.0, "third"]]]]
       }
     }.each do |test_name,test_expectations|
       context test_name do
@@ -69,12 +74,12 @@ describe PDF::Reader::Turtletext do
         }
         describe "#content" do
-          subject { structured_reader.content(page) }
+          subject { turtletext_reader.content(page) }
           it { should eql(test_expectations[:expected_fuzzed_content]) }
         end
         describe "#precise_content" do
-          subject { structured_reader.precise_content(page) }
+          subject { turtletext_reader.precise_content(page) }
           it { should eql(test_expectations[:expected_precise_content]) }
         end
@@ -90,24 +95,24 @@ describe PDF::Reader::Turtletext do
         },
         :with_single_line_text => {
           :source_page_content => {
-            10.0=>{10.0=>"first line ignored"},
+            70.0=>{10.0=>"first line ignored"},
             30.0=>{10.0=>"first part found", 20.0=>"last part found"},
-            70.0=>{10.0=>"last line ignored"}
+            10.0=>{10.0=>"last line ignored"}
           },
           :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
           :expected_text => [["first part found", "last part found"]]
         },
         :with_multi_line_text => {
           :source_page_content => {
-            10.0=>{10.0=>"first line ignored"},
-            30.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
-            40.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
-            70.0=>{10.0=>"last line ignored"}
+            70.0=>{10.0=>"first line ignored"},
+            40.0=>{10.0=>"first line first part found", 20.0=>"first line last part found"},
+            30.0=>{10.0=>"last line first part found", 20.0=>"last line last part found"},
+            10.0=>{10.0=>"last line ignored"}
           },
           :xmin => 0, :xmax => 100, :ymin => 20, :ymax => 50,
           :expected_text => [
-            ["last line first part found", "last line last part found"],
-            ["first line first part found", "first line last part found"]
+            ["first line first part found", "first line last part found"],
+            ["last line first part found", "last line last part found"]
           ]
         }
       }.each do |test_name,test_expectations|
@@ -118,7 +123,7 @@ describe PDF::Reader::Turtletext do
           let(:ymin) { test_expectations[:ymin] }
           let(:ymax) { test_expectations[:ymax] }
           let(:expected_text) { test_expectations[:expected_text] }
-          subject { structured_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
+          subject { turtletext_reader.text_in_region(xmin,xmax,ymin,ymax,page) }
           it { should eql(expected_text) }
         end
       end
@@ -126,21 +131,21 @@ describe PDF::Reader::Turtletext do
     describe "#text_position" do
       let(:given_page_content) { {
-        10.0=>{10.0=>"crunchy bacon"},
-        30.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
-        40.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
-        70.0=>{40.0=>"smoked and streaky da bomb"}
+        70.0=>{10.0=>"crunchy bacon"},
+        40.0=>{15.0=>"bacon on kimchi noodles", 25.0=>"heaven"},
+        30.0=>{30.0=>"turkey bacon", 35.0=>"fraud"},
+        10.0=>{40.0=>"smoked and streaky da bomb"}
       } }
       {
-        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>40.0} },
-        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>30.0} },
-        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>30.0} },
-        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>10.0} }
+        :with_simple_match => { :match_term => 'turkey bacon', :expected_position => {:x=>30.0, :y=>30.0} },
+        :with_match_along_line => { :match_term => 'heaven', :expected_position => {:x=>25.0, :y=>40.0} },
+        :with_regex_match => { :match_term => /kimchi/, :expected_position => {:x=>15.0, :y=>40.0} },
+        :with_regex_multi_matches_first => { :match_term => /turkey|crunchy/, :expected_position => {:x=>10.0, :y=>70.0} }
       }.each do |test_name,test_expectations|
         context test_name do
           let(:match_term) { test_expectations[:match_term] }
           let(:expected_position) { test_expectations[:expected_position] }
-          subject { structured_reader.text_position(match_term,page) }
+          subject { turtletext_reader.text_position(match_term,page) }
           it { should eql(expected_position) }
         end
       end
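
Note on the changed #content expectations above: the fuzzed content is now an array of [y, [[x, text], ...]] pairs, ordered from the top of the page down (largest y first), with rows whose y values lie within y_precision of each other merged into one row. The sketch below is only an illustration of that behaviour as inferred from the spec data. It is not the gem's implementation; the helper name fuzz_rows and the default tolerance of 3 are assumptions made for this example.

```ruby
# Hypothetical helper (not part of pdf-reader-turtletext): groups precise
# content of the form {y => {x => text}} into fuzzed rows.
# The default y_precision of 3 is an assumption for illustration.
def fuzz_rows(precise_content, y_precision = 3)
  rows = []
  # Walk rows from the top of the page down (largest y first).
  precise_content.sort_by { |y, _cells| -y }.each do |y, cells|
    last = rows.last
    if last && (last[0] - y).abs <= y_precision
      # Close enough to the current row: merge the cells into it.
      last[1].concat(cells.to_a)
    else
      # Otherwise start a new row anchored at this y.
      rows << [y, cells.to_a]
    end
  end
  # Within each row, order cells left to right (ascending x).
  rows.map { |y, cells| [y, cells.sort_by { |x, _text| x }] }
end

# Usage, mirroring the :with_multiple_row_text fixture:
p fuzz_rows({ 10.0 => { 10.0 => "first" },
              8.0  => { 20.0 => "second", 30.0 => "third" } })
```

Sorting before grouping keeps the result independent of hash ordering, which differs between Ruby 1.8 and 1.9, both of which the new Travis matrix targets.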

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: pdf-reader-turtletext
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
   prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-07-22 00:00:00.000000000 Z
+date: 2012-07-31 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pdf-reader
-  requirement: &70193556628420 !ruby/object:Gem::Requirement
+  requirement: &70218189955060 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - =
@@ -21,10 +21,10 @@ dependencies:
         version: 1.1.1
   type: :runtime
   prerelease: false
-  version_requirements: *70193556628420
+  version_requirements: *70218189955060
 - !ruby/object:Gem::Dependency
   name: bundler
-  requirement: &70193556627700 !ruby/object:Gem::Requirement
+  requirement: &70218189954360 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -32,10 +32,10 @@ dependencies:
         version: 1.1.4
   type: :development
   prerelease: false
-  version_requirements: *70193556627700
+  version_requirements: *70218189954360
 - !ruby/object:Gem::Dependency
   name: jeweler
-  requirement: &70193556626800 !ruby/object:Gem::Requirement
+  requirement: &70218189953580 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -43,10 +43,10 @@ dependencies:
         version: 1.6.4
   type: :development
   prerelease: false
-  version_requirements: *70193556626800
+  version_requirements: *70218189953580
 - !ruby/object:Gem::Dependency
   name: rake
-  requirement: &70193556626300 !ruby/object:Gem::Requirement
+  requirement: &70218189953020 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -54,10 +54,10 @@ dependencies:
         version: 0.9.2.2
   type: :development
   prerelease: false
-  version_requirements: *70193556626300
+  version_requirements: *70218189953020
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &70193556625680 !ruby/object:Gem::Requirement
+  requirement: &70218189952200 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -65,10 +65,10 @@ dependencies:
         version: 2.8.0
   type: :development
   prerelease: false
-  version_requirements: *70193556625680
+  version_requirements: *70218189952200
 - !ruby/object:Gem::Dependency
   name: rdoc
-  requirement: &70193556624820 !ruby/object:Gem::Requirement
+  requirement: &70218189951400 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -76,10 +76,10 @@ dependencies:
         version: '3.11'
   type: :development
   prerelease: false
-  version_requirements: *70193556624820
+  version_requirements: *70218189951400
 - !ruby/object:Gem::Dependency
   name: prawn
-  requirement: &70193556623960 !ruby/object:Gem::Requirement
+  requirement: &70218189950700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -87,10 +87,10 @@ dependencies:
         version: 0.12.0
   type: :development
   prerelease: false
-  version_requirements: *70193556623960
+  version_requirements: *70218189950700
 - !ruby/object:Gem::Dependency
   name: guard-rspec
-  requirement: &70193556623440 !ruby/object:Gem::Requirement
+  requirement: &70218189950100 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ~>
@@ -98,7 +98,7 @@ dependencies:
         version: 1.2.0
   type: :development
   prerelease: false
-  version_requirements: *70193556623440
+  version_requirements: *70218189950100
 description: a library that can read structured and positional text from PDFs. Ideal
   for assembling structured data from invoices and the like.
 email: gallagher.paul@gmail.com
@@ -111,6 +111,7 @@ files:
 - .rspec
 - .rvmrc
 - .travis.yml
+- CHANGELOG
 - Gemfile
 - Gemfile.lock
 - Guardfile
@@ -125,8 +126,10 @@ files:
 - lib/pdf/reader/turtletext/version.rb
 - pdf-reader-turtletext.gemspec
 - spec/fixtures/pdf_samples/.gitkeep
+- spec/fixtures/pdf_samples/expectations.yml
 - spec/fixtures/pdf_samples/hello_world.pdf
 - spec/fixtures/pdf_samples/junk_prefix.pdf
+- spec/fixtures/pdf_samples/simple_table_text.pdf
 - spec/integration/pdf_samples_spec.rb
 - spec/spec_helper.rb
 - spec/support/pdf_samples_helper.rb