RubyGems - tabula-extractor - Versions diffs - 0.7.4-java → 0.7.5-java - Mend

tabula-extractor 0.7.4-java → 0.7.5-java

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (16)

checksums.yaml +4 -4
data/Gemfile.lock +39 -0
data/README.md +111 -3
data/bin/tabula +4 -2
data/lib/tabula.rb +1 -17
data/lib/tabula/core_ext.rb +10 -0
data/lib/tabula/entities/page.rb +51 -12
data/lib/tabula/entities/page_area.rb +1 -0
data/lib/tabula/entities/ruling.rb +6 -1
data/lib/tabula/entities/spreadsheet.rb +7 -8
data/lib/tabula/entities/table.rb +2 -0
data/lib/tabula/version.rb +1 -1
metadata +4 -6
data/lib/tabula/line_segment_detector.rb +0 -125
data/lib/tabula/pdf_line_extractor.rb +0 -319
data/lib/tabula/pdf_render.rb +0 -64

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: 935de0f0dc43fa388a86cc091dc540b74b6ce31f
-  data.tar.gz: 67fa5fda6450c3b1659af3c61c8027843be5c082
+  metadata.gz: 4391bc1af8143d2f60ed2ad11a66cba9f96c955e
+  data.tar.gz: d0d905fbd5b2bae105a11a9fd01470921b7dd4f3
 SHA512:
-  metadata.gz: 191054f79148535bf359c81c72d35b717f71f97ee3c3bedd4c2af66e4332afb98f3071afe4c9ed9e894586e3a20722769742f17fc02b9a5d5d954a4fae50803d
-  data.tar.gz: 711f993194c402d1bca016f0fe13ccaeb8e4eafc6b67c2de0fa8b3cef1e7e3ae5b4cdefc2b251b64467747e7af26f80bb54bf57d4424ea50bb2dd26db7e27570
+  metadata.gz: 4ef1e681e511dc074381696689b8d86915f262d4538cf107be6b8844fcbf7102f20cedcdc8964f326178095e7cd9b47d912386801e57a0e4645ee27faacff4a5
+  data.tar.gz: 336c19a84cd2cf430e24ce0728a5f7753ca78e3b08e2f12251274da577dd2f0cfe2d6a5cf1b9df44dd09e85d9c4e5a80d2672e53632785bc70461f8ca10c3e17

data/Gemfile.lock ADDED

@@ -0,0 +1,39 @@
+PATH
+  remote: .
+  specs:
+    tabula-extractor (0.7.5-java)
+      trollop (~> 2.0)
+GEM
+  remote: http://rubygems.org/
+  specs:
+    coderay (1.1.0)
+    columnize (0.8.9)
+    ffi (1.9.5-java)
+    method_source (0.8.2)
+    minitest (5.4.2)
+    pry (0.10.1-java)
+      coderay (~> 1.1.0)
+      method_source (~> 0.8.1)
+      slop (~> 3.4)
+      spoon (~> 0.0)
+    rake (10.3.2)
+    ruby-debug (0.10.4)
+      columnize (>= 0.1)
+      ruby-debug-base (~> 0.10.4.0)
+    ruby-debug-base (0.10.4-java)
+    slop (3.6.0)
+    spoon (0.0.4)
+      ffi
+    trollop (2.0)
+PLATFORMS
+  java
+DEPENDENCIES
+  bundler (>= 1.3.4)
+  minitest
+  pry
+  rake
+  ruby-debug
+  tabula-extractor!

data/README.md CHANGED

@@ -3,7 +3,9 @@ tabula-extractor
 [![Build Status](https://travis-ci.org/jazzido/tabula-extractor.png)](https://travis-ci.org/jazzido/tabula-extractor)
-Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.nerdpower.org), now available as a library and command line program.
+Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.technology), now available as a library and command line program.
+Versions 0.9.6 and greater of [Tabula](http://tabula.technology) can export shell scripts using `tabula-extractor` for bulk extraction.
 ## Installation
@@ -49,11 +51,45 @@ Tabula helps you extract tables from PDFs
             --help, -h:   Show this message
 ```
+## Command Line Examples
+These examples use documents contained with `tabula-extractor`'s [`test`](https://github.com/tabulapdf/tabula-extractor/tree/master/test) folder. If you want to follow along, download the document and give it a shot. There's more extensive explanation [here](https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool).
+Extract all the tables from a document into a spreadsheet called `output.csv`:
+````bash
+tabula test/heuristic-test-set/spreadsheet/tabla_subsidios.pdf -o output.csv
+````
+Extract only the tables on page 1 into a spreadsheet called `output.csv`:
+````bash
+tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.csv
+````
+Extract only the tables on page 1 into a CSV spreadsheet onto STDOUT (that is, print it out in your terminal window):
+````bash
+tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf
+````
+Extract the data from the table contained within a certain area on page 1 into a spreadsheet called `output.csv`:
+````bash
+tabula test/data/vertical_rulings_bug.pdf --area 250,0,325,1700  --pages 1 -o output.csv
+````
+Extract all the tables from a document into a tab-separated spreadsheet called `output.tsv`:
+````bash
+tabula test/heuristic-test-set/spreadsheet/strongschools.pdf output.tsv --format TSV #should exclude guff
+````
+Extract the table from page 1, using specified locations for column boundaries, into a spreadsheet called `output.csv`:
+````bash
+tabula test/data/campaign_donors.pdf -o output.csv --columns 47,147,256,310,375,431,504
+````
 ## Scripting examples
-`tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
+`tabula-extractor` is also a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
-Here's a very basic example:
+Here's a very basic example, using the "spreadsheet" extraction method:
 ````ruby
 require 'tabula'
@@ -73,3 +109,75 @@ end
 out.close
 ````
+Here's another example using the "original" extraction method, which is useful for tables that don't have ruling lines separating the rows and cells. This example extracts data from only pages 1 and 2.
+````ruby
+require 'tabula'
+pdf_file_path = "whatever.pdf"
+outfilename = "whatever.csv"
+out = open(outfilename, 'w')
+extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
+extractor.extract.each_with_index do |pdf_page, page_index|
+  page_areas = [[250, 0, 325, 1700]]
+  page_areas.each do |page_area|
+    out << pdf_page.get_area(page_area).make_table.to_csv
+    out << "\n\n"
+  end
+end
+extractor.close!
+out.close
+````
+This similar example using the "original" extraction method, but specifies the location of columns. This is a useful tactic when crappy PDF creation software let one column's text flow into the next column. Unless you specify column locations manually, Tabula would combine the two columns. You can find the column locations using a screen ruler; I find it works well to measure the width of the entire PDF and scale the locations based on the width of the page as PDFBox renders it, as shown in the example below.
+````ruby
+require 'tabula'
+pdf_file_path = "whatever.pdf"
+outfilename = "whatever.csv"
+out = open(outfilename, 'w')
+extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
+extractor.extract.each_with_index do |pdf_page, page_index|
+  page_areas = [[250, 0, 325, 1700]]
+  scale_factor = pdf_page.width / 1700
+  # where 1700 is the width of the page as you measured it.
+  vertical_ruling_locations = [0, 360, 506, 617, 906, 1034, 1160, 1290, 1418, 1548] #column locations
+  vertical_rulings = vertical_ruling_locations.map{|n| Tabula::Ruling.new(0, n * scale_factor, 0, 1000)}
+  page_areas.each do |page_area|
+    out << pdf_page.get_area(page_area).make_table(:vertical_rulings => vertical_rulings).to_csv
+    out << "\n\n"
+  end
+end
+extractor.close!
+out.close
+````
+## How Does This Work? Like, Theoretically?
+PDFs are a terrible format for transmitting tabular data. Tabula uses two algorithms to try to reconstruct the underlying structure of the data table. This section describes how PDFs represent your data and how Tabula extracts it so you can use `tabula-extractor` productively.
+PDFs were designed to represent a paper document's layout across various computers and on paper, so they focus on precise positioning. They include primitives for text strings, geometric shapes, images and videos (and more), but no data tables. Tabula includes a Java library called PDFBox to access those embedded text strings and geometric shapes and uses them to reconstruct your table.
+<em style="margin-left: 5px;"> Why Can't Tabula Process Scanned Pages? Scanned PDF pages usually contain only one primitive: the image of the scanned page. Since those PDFs don't contain text strings or geometric shapes, Tabula won't be able to reconstruct your data -- unless you run the PDF through an OCR (Optical Character Recognition) program, which re-inserts those text strings into their original position, though the results can be error prone.</em>
+Tabula has two distinct algorithms to use for different kinds of tables. It uses a heuristic to try to guess which algorithm to use for each table, but this heuristic is wrong fairly often, so you may need to specify which algorithm to use, using the Extraction Method selector buttons in the GUI or the `spreadsheet` or `no-spreadsheet` flags on the command line.
+- The `spreadsheet` algorithm uses geometric lines to reconstruct the table structure. After discarding oblique lines, the algorithm finds all of the lines' crossing points. Using those crossing points, it creates a large list of minimal rectangular areas (that is, rectangles that contain no other rectangles) that are spreadsheet cells. The minimum bounding box of groups of adjacent cells is a table (called a Spreadsheet object). After spreadsheet objects are created, empty "placeholder" cells are created when a cell in one row (or, likewise, column) spans over a space in which multiple cells are contained in another row. Once we have the dimensions of all the cells on the page, the PDFBox library can get the text contained within each cell.
+- The `original` or `no-spreadsheet` algorithm uses only the position of text element on the page. (Because OCR software doesn't reconstruct lines, this algorithm is the only algorithm available for OCRed PDFs.) The algorithm collects all the text on the page (or within the area of the page that contains a table, specified with the Tabula GUI or the `--area` flag) and finds "rivers" -- vertical spaces that don't contain any text for the entire height of the table. These are considered column boundaries. (If text from one column flows into another column because the PDF was created with crappy software, you can specify it manually with the `--columns` flag ) Each line of text on the page (by unique y locations) is considered a separate line in the table. (If cells contain multiple rows, you may have to write a script to "roll them up" -- Tabula can't provide this functionality.)
+These two algorithms are inspired by some academic work, including Anssi Nurminen's "[Algorithmic Extraction of Data in Tables in Pdf Documents](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3)" (2013) for the spreadsheet algorithm.
+## Documentation
+You're welcome to try to integrate the `tabula-extractor` gem into your project. We don't really have documentation yet, though the tests may be a good source. If you're going to, please feel free to drop us a note and we may be able to give you some pointers.
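
Note on the README's description of the `original` / `no-spreadsheet` algorithm: the idea of finding "rivers" (vertical strips that no text crosses) can be sketched in plain Ruby. This is only an illustration under simplifying assumptions, not the gem's implementation: `TextSpan` and `find_rivers` are hypothetical names, only horizontal extents are considered, and every span is assumed to lie inside the table area.

```ruby
# Hypothetical text element: only its horizontal extent matters here.
TextSpan = Struct.new(:left, :right)

# Returns the horizontal gaps ("rivers") not covered by any span,
# as [gap_start, gap_end] pairs between the leftmost and rightmost text.
def find_rivers(spans)
  sorted = spans.sort_by(&:left)
  return [] if sorted.empty?

  rivers = []
  covered_until = sorted.first.right
  sorted.drop(1).each do |span|
    # A span starting past everything seen so far leaves an empty strip.
    rivers << [covered_until, span.left] if span.left > covered_until
    covered_until = [covered_until, span.right].max
  end
  rivers
end

spans = [
  TextSpan.new(10, 60),  TextSpan.new(12, 55),   # column 1, two rows
  TextSpan.new(100, 140), TextSpan.new(105, 150), # column 2, two rows
  TextSpan.new(200, 230)                          # column 3, one row
]
rivers = find_rivers(spans)
```

If a single span stretched from one column into the next, the gap between them would disappear from the result. That is the situation in which the README suggests supplying column boundaries by hand with `--columns`.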

data/bin/tabula CHANGED

@@ -120,7 +120,6 @@ def main
       end
       tables = pdf_page.spreadsheets(:use_line_returns=> use_line_returns).map(&:rows)
     else
-      STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
       if opts[:guess]
         page_areas = pdf_page.spreadsheets.map{|rect| pdf_page.get_area(rect.dims(:top, :left, :bottom, :right))}
       elsif area_input
@@ -128,7 +127,10 @@ def main
       else
         page_areas = [pdf_page]
       end
-      tables = page_areas.map{|page_area| page_area.make_table(vertical_rulings.nil? ? {} : { :vertical_rulings => rulings_from_columns(pdf_page, page_area, vertical_rulings) })}
+      tables = page_areas.map { |page_area|
+        STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
+        page_area.make_table(vertical_rulings.nil? ? {} : { :vertical_rulings => rulings_from_columns(pdf_page, page_area, vertical_rulings) })
+      }
     end
     tables.each do |table|
       Tabula::Writers.send(opts[:format].to_sym,

data/lib/tabula.rb CHANGED

@@ -9,19 +9,8 @@ require File.join(File.dirname(__FILE__), '../target/', 'slf4j-api-1.6.3.jar')
 require File.join(File.dirname(__FILE__), '../target/', 'trove4j-3.0.3.jar')
 require File.join(File.dirname(__FILE__), '../target/', 'jsi-1.1.0-SNAPSHOT.jar')
-import 'java.util.logging.LogManager'
-import 'java.util.logging.Level'
+java.util.logging.Logger.getLogger('org.apache.pdfbox').setLevel(java.util.logging.Level::OFF)
-lm = LogManager.log_manager
-lm.logger_names.each do |name|
-  if name == "" #rootlogger is apparently the logger PDFBox is talking to.
-    l = lm.get_logger(name)
-    l.level = Level::OFF
-    l.handlers.each do |h|
-      h.level = Level::OFF
-    end
-  end
-end
 require_relative './tabula/version'
 require_relative './tabula/core_ext'
@@ -30,9 +19,4 @@ require_relative './tabula/extraction'
 require_relative './tabula/table_extractor'
 require_relative './tabula/writers'
-module Tabula
-  autoload :LSD               , File.expand_path('tabula/line_segment_detector.rb', File.dirname(__FILE__))
-  autoload :Render            , File.expand_path('tabula/pdf_render.rb', File.dirname(__FILE__))
-end
 require_relative './tabula/table_extractor'

data/lib/tabula/core_ext.rb CHANGED

@@ -189,6 +189,16 @@ class Rectangle2D
     (other.bottom - self.bottom).abs
   end
+  # decomposes a rectangle into its 4 constitutent lines
+  def to_lines
+    #      top left width height
+    top = Line2D::Float.new self.left, self.top, self.right, self.top
+    bottom = Line2D::Float.new self.left, self.bottom, self.right, self.bottom
+    left = Line2D::Float.new self.left, self.top, self.left, self.bottom
+    right = Line2D::Float.new self.right, self.top, self.right, self.bottom
+    [top, bottom, left, right]
+  end
   # Various ways that rectangles can overlap one another
   #------------------------------

data/lib/tabula/entities/page.rb CHANGED

@@ -22,6 +22,8 @@ module Tabula
       self.texts = texts
+      @ruling_lines += minimal_bounding_box_of_ruling_lines.to_lines.map{|l| Ruling.new(l.getY1, l.getX1, l.getX2 - l.getX1, l.getY2 - l.getY1)}.select &:finite?
       if spatial_index.nil?
         @spatial_index = TextElementIndex.new
         self.texts.each { |te| @spatial_index << te }
@@ -31,11 +33,44 @@ module Tabula
     end
-    def min_char_width
+    def minimal_bounding_box_of_ruling_lines
+      max_x = 0
+      max_y = 0
+      min_x = ::Float::INFINITY
+      min_y = ::Float::INFINITY
+      horizontal_ruling_lines.each do |t|
+        min_x = t.left if t.left < min_x
+        max_x = t.right if t.right > max_x
+      end
+      vertical_ruling_lines.each do |t|
+        min_y = t.top if t.top < min_y
+        max_y = t.bottom if t.bottom > max_y
+      end
+      java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
+    end
+    # is there a scenario under which we'd prefer to use this over `minimal_bounding_box_of_ruling_lines`?
+    # if so, what is it? If there are no ruling lines on the page _at all_, then adding this bounding box is
+    # useless.
+    def minimal_bounding_box_of_text_elements
+      max_x = 0
+      max_y = 0
+      min_x = ::Float::INFINITY
+      min_y = ::Float::INFINITY
+      @texts.each do |t|
+        min_x = t.x if t.x < min_x
+        min_y = t.y if t.y < min_y
+        max_x = t.x if t.x > max_x
+        max_y = t.y if t.y > max_y
+      end
+      java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
+    end
+    def get_min_char_width
       @min_char_width ||= texts.map(&:width).min
     end
-    def min_char_height
+    def get_min_char_height
       @min_char_height ||= texts.map(&:height).min
     end
@@ -107,16 +142,8 @@ module Tabula
       unless @spreadsheets.nil?
         return @spreadsheets
       end
-      get_ruling_lines!(options)
-      self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
-      spreadsheet_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
-      #transform each spreadsheet area into a rectangle
-      # and get the cells contained within it.
-      spreadsheet_rectangle_areas = spreadsheet_areas.map{|a| a.getBounds } #getBounds2D is theoretically better, but returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
-      @spreadsheets = spreadsheet_rectangle_areas.map do |rect|
+      @spreadsheets = spreadsheet_areas(options).map do |rect|
         spr = Spreadsheet.new(rect.y, rect.x,
                         rect.width, rect.height,
                         self,
@@ -135,6 +162,18 @@ module Tabula
       spreadsheets
     end
+    def spreadsheet_areas (options={})
+      get_ruling_lines!(options)
+      self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
+      spreadsheet_java_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
+      #transform each spreadsheet area into a rectangle
+      # and get the cells contained within it.
+      # getBounds2D is theoretically better than getBounds, but it returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
+      spreadsheet_java_areas.map{|a| a.getBounds }
+    end
     def fill_in_cells!(options={})
       spreadsheets(options).each do |spreadsheet|
         spreadsheet.cells.each do |cell|
@@ -244,7 +283,7 @@ module Tabula
       # ah, but perhaps I can stick the points in a hash AND in an array
       # and then modify the lines by means of the points in the hash.
-      [[:x, :x=, self.min_char_width], [:y, :y=, self.min_char_height]].each do |getter, setter, cell_size|
+      [[:x, :x=, self.get_min_char_width], [:y, :y=, self.get_min_char_height]].each do |getter, setter, cell_size|
         sorted_points = points.sort_by(&getter)
         first_point = sorted_points.shift
         grouped_points = sorted_points.inject([[first_point]] ) do |memo, next_point|
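
Note on `minimal_bounding_box_of_ruling_lines` in this file: the minimums start at `Float::INFINITY`, so a page with no ruling lines yields a box with non-finite coordinates. That appears to be why the line added to the constructor filters the resulting rulings with `finite?`. The effect can be reproduced in plain Ruby without JRuby or `java.awt`. In this sketch `Segment`, `bounding_box`, and `all_finite?` are hypothetical names, and the box is a plain array rather than a `Rectangle2D`.

```ruby
# Hypothetical stand-in for a ruling line.
Segment = Struct.new(:left, :top, :right, :bottom)

# Mirrors the diff: x extents come from horizontal lines,
# y extents from vertical lines. Returns [x, y, width, height].
def bounding_box(horizontals, verticals)
  min_x = min_y = Float::INFINITY
  max_x = max_y = 0
  horizontals.each do |h|
    min_x = h.left if h.left < min_x
    max_x = h.right if h.right > max_x
  end
  verticals.each do |v|
    min_y = v.top if v.top < min_y
    max_y = v.bottom if v.bottom > max_y
  end
  [min_x, min_y, max_x - min_x, max_y - min_y]
end

# Float#finite? is false for Infinity, -Infinity and NaN.
def all_finite?(box)
  box.all? { |n| n.to_f.finite? }
end

horizontals = [Segment.new(10, 20, 110, 20), Segment.new(10, 80, 110, 80)]
verticals   = [Segment.new(10, 20, 10, 80),  Segment.new(110, 20, 110, 80)]

with_rulings    = bounding_box(horizontals, verticals)
without_rulings = bounding_box([], [])
```

With no rulings the minimums never leave their sentinel value, so the width and height become negative infinity. Ruby's built-in `Float#finite?` catches that case as well as NaN, whereas the `finite?` added to `ruling.rb` in this release compares only against positive infinity.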

data/lib/tabula/entities/page_area.rb CHANGED

@@ -1,6 +1,7 @@
 module Tabula
   class PageArea < Page
   end

data/lib/tabula/entities/ruling.rb CHANGED

@@ -194,7 +194,8 @@ module Tabula
     # log(n) implementation of find_intersections
     # based on http://people.csail.mit.edu/indyk/6.838-old/handouts/lec2.pdf
     def self.find_intersections(horizontals, verticals)
-      tree = java.util.TreeMap.new(HSegmentComparator.new)
+      construct_treemap_t_comparator = java.util.TreeMap.java_class.constructor(java.util.Comparator)
+      tree = construct_treemap_t_comparator.new_instance(HSegmentComparator.new).to_java
       sort_obj = Struct.new(:type, :pos, :obj)
       (horizontals + verticals)
@@ -237,6 +238,10 @@ module Tabula
         }
     end
+    def finite?
+      top != ::Float::INFINITY && left != ::Float::INFINITY && bottom != ::Float::INFINITY && right != ::Float::INFINITY
+    end
     ##
     # crop an enumerable of +Ruling+ to an +area+
     def self.crop_rulings_to_area(rulings, area)

data/lib/tabula/entities/spreadsheet.rb CHANGED

@@ -37,10 +37,9 @@ module Tabula
       if evaluate_cells
         fill_in_cells!
       end
-      tops = cells.map(&:top).uniq.sort
-      array_of_rows = tops.map do |top|
-        cells.select{|c| c.top == top }.sort_by(&:left)
-      end
+      array_of_rows = cells.group_by{|cell| cell.top.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:left) }
       #here, insert another kind of placeholder for empty corners
       # like in 01001523B_China.pdf
       #TODO: support placeholders for "empty" cells in rows other than row 1, and in #cols
@@ -66,10 +65,8 @@ module Tabula
       if evaluate_cells
         fill_in_cells!
       end
-      lefts = cells.map(&:left).uniq.sort
-      lefts.map do |left|
-        cells.select{|c| c.left == left }.sort_by(&:top)
-      end
+      cells.group_by{|cell| cell.left.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:top) }
     end
     #######################################################
@@ -137,12 +134,14 @@ module Tabula
     def to_csv
       out = StringIO.new
+      out.set_encoding("utf-8")
       Tabula::Writers.CSV(rows, out)
       out.string
     end
     def to_tsv
       out = StringIO.new
+      out.set_encoding("utf-8")
       Tabula::Writers.TSV(rows, out)
       out.string
     end
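
Note on the `rows` / `cols` change in this file: the old code grouped cells by exact equality of a float coordinate, while the new code groups by the coordinate rounded to 5 decimal places. The difference shows up when two cells that belong to the same row have coordinates that differ only by floating-point noise. A plain-Ruby sketch, with `Cell` as a hypothetical stand-in for the gem's cell class:

```ruby
# Hypothetical cell with just the two coordinates used for grouping.
Cell = Struct.new(:top, :left)

cells = [
  Cell.new(0.1 + 0.2, 50.0), # top is 0.30000000000000004
  Cell.new(0.3, 10.0),
  Cell.new(20.0, 10.0)
]

# 0.7.4 approach: one row per exactly distinct top value.
exact_rows = cells.map(&:top).uniq.sort.map do |top|
  cells.select { |c| c.top == top }.sort_by(&:left)
end

# 0.7.5 approach: bucket by top rounded to 5 decimal places.
rounded_rows = cells.group_by { |c| c.top.round(5) }
                    .sort_by(&:first)
                    .map { |_top, row| row.sort_by(&:left) }
```

Here `exact_rows` splits the first two cells into separate rows, while `rounded_rows` keeps them together and orders them left to right. The `group_by` version also makes a single pass over the cells instead of one `select` per distinct coordinate.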

data/lib/tabula/entities/table.rb CHANGED

@@ -79,12 +79,14 @@ module Tabula
     def to_csv
       out = StringIO.new
+      out.set_encoding("utf-8")
       Tabula::Writers.CSV(rows, out)
       out.string
     end
     def to_tsv
       out = StringIO.new
+      out.set_encoding("utf-8")
       Tabula::Writers.TSV(rows, out)
       out.string
     end

data/lib/tabula/version.rb CHANGED

@@ -1,3 +1,3 @@
 module Tabula
-  VERSION = '0.7.4'
+  VERSION = '0.7.5'
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: tabula-extractor
 version: !ruby/object:Gem::Version
-  version: 0.7.4
+  version: 0.7.5
 platform: java
 authors:
 - Manuel Aristarán
@@ -10,7 +10,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-05-09 00:00:00.000000000 Z
+date: 2014-09-29 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -94,6 +94,7 @@ files:
 - .travis.yml
 - AUTHORS.md
 - Gemfile
+- Gemfile.lock
 - LICENSE.md
 - NOTICE.txt
 - README.md
@@ -116,9 +117,6 @@ files:
 - lib/tabula/entities/text_element_index.rb
 - lib/tabula/entities/zone_entity.rb
 - lib/tabula/extraction.rb
-- lib/tabula/line_segment_detector.rb
-- lib/tabula/pdf_line_extractor.rb
-- lib/tabula/pdf_render.rb
 - lib/tabula/spreadsheet_extractor.rb
 - lib/tabula/table_extractor.rb
 - lib/tabula/table_guesser.rb
@@ -149,7 +147,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 2.2.2
+rubygems_version: 2.4.1
 signing_key:
 specification_version: 4
 summary: extract tables from PDF files

data/lib/tabula/line_segment_detector.rb DELETED

@@ -1,125 +0,0 @@
-require 'rbconfig'
-require 'ffi'
-java_import javax.imageio.ImageIO
-java_import java.awt.image.BufferedImage
-java_import org.apache.pdfbox.pdmodel.PDDocument
-module Tabula
-  module LSD
-    extend FFI::Library
-    ffi_lib File.expand_path('../../ext/' + case RbConfig::CONFIG['host_os']
-                                            when /mswin|msys|mingw|cygwin|bccwin|wince|emc/
-                                              if RbConfig::CONFIG['host_cpu'] == 'x86_64'
-                                                'liblsd64.dll'
-                                              else
-                                                'liblsd.dll'
-                                              end
-                                            when /darwin|mac os/
-                                              'liblsd.dylib'
-                                            when /linux/
-                                              if RbConfig::CONFIG['target_cpu'] == 'x86_64'
-                                                'liblsd-linux64.so'
-                                              else
-                                                'liblsd-linux32.so'
-                                              end
-                                            else
-                                              raise "unknown os: #{RbConfig::CONFIG['host_os']}"
-                                            end,
-                             File.dirname(__FILE__))
-    attach_function :lsd, [ :pointer, :buffer_in, :int, :int ], :pointer
-    attach_function :free_values, [ :pointer ], :void
-    DETECT_LINES_DEFAULTS = {
-      :scale_factor => nil,
-      :image_size => 2048
-    }
-    def LSD.detect_lines_in_pdf(pdf_path, options={})
-      options = DETECT_LINES_DEFAULTS.merge(options)
-      pdf_file = PDDocument.loadNonSeq(java.io.File.new(pdf_path), nil)
-      lines = pdf_file.getDocumentCatalog.getAllPages.to_a.map do |page|
-        bi = Tabula::Render.pageToBufferedImage(page, options[:image_size])
-        detect_lines(bi, options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
-      end
-      pdf_file.close
-      lines
-    end
-    #zero-indexed page_number
-    def LSD.detect_lines_in_pdf_page(pdf_path, page_number, options={})
-      options = DETECT_LINES_DEFAULTS.merge(options)
-      pdf_file = Extraction.openPDF(pdf_path)
-      page = pdf_file.getDocumentCatalog.getAllPages[page_number]
-      bi = Tabula::Render.pageToBufferedImage(page,
-                                              options[:image_size])
-      pdf_file.close
-      detect_lines(bi,
-                   options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
-    end
-    # image can be either a string (path to image) or a Java::JavaAwtImage::BufferedImage
-    # image to pixels: http://stackoverflow.com/questions/6524196/java-get-pixel-array-from-image
-    def LSD.detect_lines(image, scale_factor=1)
-      bimage = if image.class == Java::JavaAwtImage::BufferedImage
-                 image
-               elsif image.class == String
-                 ImageIO.read(java.io.File.new(image))
-               else
-                 raise ArgumentError, 'image must be a string or a BufferedImage'
-               end
-      image = LSD.image_to_image_float(bimage)
-      lines_found_ptr = FFI::MemoryPointer.new(:int, 1)
-      out = lsd(lines_found_ptr, image, bimage.getWidth, bimage.getHeight)
-      lines_found = lines_found_ptr.get_int
-      rv = []
-      lines_found.times do |i|
-        a = out[7*4*i].read_array_of_type(:float, 7)
-        a_round = a[0..3].map(&:round)
-        p1, p2 = [[a_round[0], a_round[1]], [a_round[2], a_round[3]]]
-        rv << Tabula::Ruling.new(p1[1] * scale_factor,
-                                 p1[0] * scale_factor,
-                                 (p2[0] - p1[0]) * scale_factor,
-                                 (p2[1] - p1[1]) * scale_factor)
-      end
-      free_values(out)
-      bimage.flush
-      bimage.getGraphics.dispose
-      image = nil
-      return rv
-    end
-    private
-    def LSD.image_to_image_float(buffered_image)
-      width = buffered_image.getWidth; height = buffered_image.getHeight
-      raster_size = width * height
-      image_float = FFI::MemoryPointer.new(:float, raster_size)
-      pixels = Java::int[width * height].new
-      buffered_image.getRGB(0, 0, width, height, pixels, 0, width)
-      image_float.put_array_of_float 0, pixels.to_a
-    end
-  end
-end
-if __FILE__ == $0
-  puts Tabula::LSD.detect_lines_in_pdf_page ARGV[0], ARGV[1].to_i
-end

data/lib/tabula/pdf_line_extractor.rb DELETED

@@ -1,319 +0,0 @@
-java_import org.apache.pdfbox.util.operator.OperatorProcessor
-java_import org.apache.pdfbox.pdfparser.PDFParser
-java_import org.apache.pdfbox.util.PDFStreamEngine
-java_import org.apache.pdfbox.util.ResourceLoader
-java_import java.awt.geom.PathIterator
-java_import java.awt.geom.Point2D
-java_import java.awt.geom.GeneralPath
-java_import java.awt.geom.AffineTransform
-java_import java.awt.Color
-warn 'Tabula::Extraction::LineExtractor is DEPRECATED and will be removed'
-class Tabula::Extraction::LineExtractor < org.apache.pdfbox.util.PDFStreamEngine
-  attr_accessor :currentX, :currentY
-  attr_accessor :currentPath
-  attr_accessor :rulings
-  attr_accessor :options
-  field_accessor :page
-  DETECT_LINES_DEFAULTS = {
-    :snapping_grid_cell_size => 2
-  }
-  def self.collapse_vertical_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
-    lines.sort!{|a, b| a.left != b.left ? a.left <=> b.left : a.top <=> b.top }
-    lines.inject([]) do |memo, next_line|
-      if memo.last && next_line.left == memo.last.left && memo.last.nearlyIntersects?(next_line)
-        memo.last.top = [next_line.top, memo.last.top].min
-        memo.last.bottom = [next_line.bottom, memo.last.bottom].max
-        memo
-      else
-        memo << next_line
-      end
-    end
-  end
-  def self.collapse_horizontal_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
-    lines.sort!{|a, b| a.top != b.top ? a.top <=> b.top : a.left <=> b.left }
-    lines.inject([]) do |memo, next_line|
-      if memo.last && next_line.top == memo.last.top && memo.last.nearlyIntersects?(next_line)
-        memo.last.left = [next_line.left, memo.last.left].min
-        memo.last.right = [next_line.right, memo.last.right].max
-        memo
-      else
-        memo << next_line
-      end
-    end
-  end
-  #N.B. for merge `spreadsheets` into `text-extractor-refactor` --
-  # only substantive change here is calling Tabula::Ruling::clean_rulings on LSD output in this method
-  # the rest is readability changes.
-  #page_number here is zero-indexed
-  def self.lines_in_pdf_page(pdf_path, page_number, options={})
-    options = options.merge!(DETECT_LINES_DEFAULTS)
-    if options[:render_pdf]
-      # only LSD rulings need to be "cleaned" with clean_rulings; might as well do this here
-      # since there's no good reason want unclean lines
-      Tabula::Ruling::clean_rulings(Tabula::LSD::detect_lines_in_pdf_page(pdf_path, page_number, options))
-    else
-      pdf_file = ::Tabula::Extraction.openPDF(pdf_path)
-      page = pdf_file.getDocumentCatalog.getAllPages[page_number]
-      le = self.new(options)
-      le.processStream(page, page.findResources, page.getContents.getStream)
-      pdf_file.close
-      rulings = le.rulings.map do |l, color|
-        ::Tabula::Ruling.new(l.getP1.getY,
-                             l.getP1.getX,
-                             l.getP2.getX - l.getP1.getX,
-                             l.getP2.getY - l.getP1.getY,
-                             color)
-      end
-      rulings.reject! { |l| (l.left == l.right && l.top == l.bottom) || [l.top, l.left, l.bottom, l.right].any? { |p| p < 0 } }
-      collapse_vertical_rulings(rulings.select(&:vertical?)) + collapse_horizontal_rulings(rulings.select(&:horizontal?))
-    end
-  end
-  class LineToOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      x, y = arguments[0], arguments[1]
-      ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
-      l = java.awt.geom.Line2D::Float.new(drawer.currentX, drawer.currentY, ppos.getX, ppos.getY)
-      drawer.currentPath << l if l.horizontal? or l.vertical?
-      drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
-    end
-  end
-  class MoveToOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      x, y = arguments[0], arguments[1]
-      ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
-      drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
-    end
-  end
-  class AppendRectangleToPathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      finalX, finalY, finalW, finalH = arguments.to_array.map(&:floatValue)
-      ppos = drawer.TransformedPoint(finalX, finalY)
-      psize = drawer.ScaledPoint(finalW, finalH)
-      finalY = ppos.getY - psize.getY
-      if finalY < 0
-        finalY = 0
-      end
-      width = psize.getX.abs
-      height = psize.getY.abs
-      lines = if width > height && height < 2 # horizontal line, "thin" rectangle.
-                [java.awt.geom.Line2D::Float.new(ppos.getX, finalY + psize.getY/2, ppos.getX + psize.getX, finalY + psize.getY/2)]
-              elsif width < height && width < 2 # vertical line, "thin" rectangle
-                [java.awt.geom.Line2D::Float.new(ppos.getX + psize.getX/2, finalY, ppos.getX + psize.getX/2, finalY + psize.getY)]
-              else
-                # add every edge of the rectangle to drawer.rulings
-                [java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX + psize.getX, finalY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX, finalY + psize.getY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX+psize.getX, finalY, ppos.getX + psize.getX, finalY + psize.getY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX, finalY+psize.getY, ppos.getX + psize.getX, finalY + psize.getY)]
-              end
-      drawer.currentPath += lines.select { |l| l.horizontal? or l.vertical? }
-    end
-  end
-  class StrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      strokeColorComps = drawer.getGraphicsState.getStrokingColor.getJavaColor.getRGBColorComponents(nil)
-      color_filter = drawer.options[:line_color_filter] || lambda{|c| true } #by default, use all lines, regardless of color
-      if color_filter.call(strokeColorComps)
-        drawer.currentPath.each { |segment| drawer.addRuling(segment, strokeColorComps.to_a) }
-      end
-      drawer.currentPath = []
-    end
-  end
-  class CloseFillNonZeroAndStrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      fillColorComps = drawer.getGraphicsState.getNonStrokingColor.getJavaColor.getRGBColorComponents(nil)
-      color_filter = drawer.options[:line_color_filter] || lambda{|c| true } #by default, use all lines, regardless of color
-      if color_filter.call(fillColorComps)
-        drawer.currentPath.each { |segment| drawer.addRuling(segment, fillColorComps.to_a) }
-      end
-      drawer.currentPath = []
-    end
-  end
-  class CloseAndStrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      drawer.currentPath.each { |segment| drawer.addRuling(segment) }
-      drawer.currentPath = []
-    end
-  end
-  class EndPathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      # end without stroke, we don't care about it. discard it
-      drawer.currentPath = []
-    end
-  end
-  class FillNonZeroRuleOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      # end without stroke, we don't care about it. discard it
-      drawer.currentPath = []
-    end
-  end
-  OPERATOR_PROCESSORS = {
-    'm' => MoveToOperator.new,
-    're' => AppendRectangleToPathOperator.new,
-    'l' => LineToOperator.new,
-    'S' => StrokePathOperator.new,
-    's' => StrokePathOperator.new,
-    'n' => EndPathOperator.new,
-    'b' => CloseFillNonZeroAndStrokePathOperator.new,
-    'b*' => CloseFillNonZeroAndStrokePathOperator.new,
-    'f' => CloseFillNonZeroAndStrokePathOperator.new,
-    'f*' => CloseFillNonZeroAndStrokePathOperator.new,
-    'BT' => org.apache.pdfbox.util.operator.BeginText.new,
-    'cm' => org.apache.pdfbox.util.operator.Concatenate.new,
-    'CS' => org.apache.pdfbox.util.operator.SetStrokingColorSpace.new,
-    'cs' => org.apache.pdfbox.util.operator.SetNonStrokingColorSpace.new,
-    'ET' => org.apache.pdfbox.util.operator.EndText.new,
-    'G' => org.apache.pdfbox.util.operator.SetStrokingGrayColor.new,
-    'g' => org.apache.pdfbox.util.operator.SetNonStrokingGrayColor.new,
-    'gs' => org.apache.pdfbox.util.operator.SetGraphicsStateParameters.new,
-    'K' => org.apache.pdfbox.util.operator.SetStrokingCMYKColor.new,
-    'k' => org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor.new,
-    'q' => org.apache.pdfbox.util.operator.GSave.new,
-    'Q' => org.apache.pdfbox.util.operator.GRestore.new,
-    'RG' => org.apache.pdfbox.util.operator.SetStrokingRGBColor.new,
-    'rg' => org.apache.pdfbox.util.operator.SetNonStrokingRGBColor.new,
-    'SC' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
-    'sc' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
-    'SCN' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
-    'scn' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
-    'T*' => org.apache.pdfbox.util.operator.NextLine.new,
-    'Tc' => org.apache.pdfbox.util.operator.SetCharSpacing.new,
-    'Td' => org.apache.pdfbox.util.operator.MoveText.new,
-    'TD' => org.apache.pdfbox.util.operator.MoveTextSetLeading.new,
-    'Tf' => org.apache.pdfbox.util.operator.SetTextFont.new,
-    'Tj' => org.apache.pdfbox.util.operator.ShowText.new,
-    'TJ' => org.apache.pdfbox.util.operator.ShowTextGlyph.new,
-    'TL' => org.apache.pdfbox.util.operator.SetTextLeading.new,
-    'Tm' => org.apache.pdfbox.util.operator.SetMatrix.new,
-    'Tr' => org.apache.pdfbox.util.operator.SetTextRenderingMode.new,
-    'Ts' => org.apache.pdfbox.util.operator.SetTextRise.new,
-    'Tw' => org.apache.pdfbox.util.operator.SetWordSpacing.new,
-    'Tz' => org.apache.pdfbox.util.operator.SetHorizontalTextScaling.new,
-    "\'" => org.apache.pdfbox.util.operator.MoveAndShow.new,
-    '\"' => org.apache.pdfbox.util.operator.SetMoveAndShow.new,
-  }
-  def initialize(options={})
-    super()
-    @options = options.merge!(DETECT_LINES_DEFAULTS)
-    self.clear!
-    OPERATOR_PROCESSORS.each { |k,v| registerOperatorProcessor(k, v) }
-  end
-  def clear!
-    self.rulings = []
-    self.currentX = -1
-    self.currentY = -1
-    self.currentPath = []
-    @pageSize = nil
-  end
-  def addRuling(ruling, color=nil)
-    color = color.nil? ? [0,0,0] : color
-    if !page.getRotation.nil? && [90, -270, -90, 270].include?(page.getRotation)
-      mb = page.findMediaBox
-      ruling.rotate!(mb.getLowerLeftX, mb.getLowerLeftY, page.getRotation)
-      trans = if page.getRotation == 90 || page.getRotation == -270
-                AffineTransform.getTranslateInstance(mb.getHeight, 0)
-              else
-                AffineTransform.getTranslateInstance(0, mb.getWidth)
-              end
-      ruling.transform!(trans)
-    end
-    # snapping to grid and joining lines that are close together
-    ruling.snap!(options[:snapping_grid_cell_size])
-    self.rulings << [ruling, color]
-  end
-  ##
-  # get current page size
-  def pageSize
-    @pageSize ||= self.page.findMediaBox.createDimension
-  end
-  ##
-  # fix the Y coordinate based on page rotation
-  def fixY(y)
-    pageSize.getHeight - y
-  end
-  def ScaledPoint(*args)
-    x, y = args[0], args[1]
-    # if scale factor not provided, get it from current transformation matrix
-    if args.size == 2
-      ctm = getGraphicsState.getCurrentTransformationMatrix
-      at = ctm.createAffineTransform
-      scaleX = at.getScaleX; scaleY = at.getScaleY
-    else
-      scaleX = args[2]; scaleY = args[3]
-    end
-    finalX = 0.0;
-    finalY = 0.0;
-    if scaleX > 0
-      finalX = x * scaleX;
-    end
-    if scaleY > 0
-      finalY = y * scaleY;
-    end
-    return java.awt.geom.Point2D::Float.new(finalX, finalY);
-  end
-  def TransformedPoint(x, y)
-    position = [x,y].to_java(:float)
-    at = self.getGraphicsState.getCurrentTransformationMatrix.createAffineTransform
-    at.transform(position, 0, position, 0, 1)
-    position[1] = fixY(position[1])
-    java.awt.geom.Point2D::Float.new(position[0], position[1])
-  end
-end
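
Note on the removed `AppendRectangleToPathOperator` above: it treats a "thin" rectangle (shorter side under 2 units) as a single ruling through its midline, and otherwise emits all four edges. The following is a minimal plain-Ruby sketch of that classification only, not the removed implementation. `Segment`, `THIN_LIMIT` and `rectangle_to_segments` are hypothetical names introduced here; `Segment` stands in for `java.awt.geom.Line2D::Float`, and the inputs are assumed to be already transformed to page coordinates.

```ruby
# Hypothetical stand-in for java.awt.geom.Line2D::Float (assumption, not in the gem).
Segment = Struct.new(:x1, :y1, :x2, :y2)

# Threshold taken from the removed code: a side shorter than 2 counts as "thin".
THIN_LIMIT = 2

# Turn a rectangle (origin x, y and signed size w, h) into ruling segments.
def rectangle_to_segments(x, y, w, h)
  width  = w.abs
  height = h.abs
  if width > height && height < THIN_LIMIT
    # Thin horizontal rectangle: one segment through the vertical middle.
    mid_y = y + h / 2.0
    [Segment.new(x, mid_y, x + w, mid_y)]
  elsif width < height && width < THIN_LIMIT
    # Thin vertical rectangle: one segment through the horizontal middle.
    mid_x = x + w / 2.0
    [Segment.new(mid_x, y, mid_x, y + h)]
  else
    # Regular rectangle: top, left, right and bottom edges.
    [Segment.new(x, y, x + w, y),
     Segment.new(x, y, x, y + h),
     Segment.new(x + w, y, x + w, y + h),
     Segment.new(x, y + h, x + w, y + h)]
  end
end
```

For example, `rectangle_to_segments(10, 20, 100, 1)` yields a single horizontal segment, while a 50 by 50 rectangle yields four edges.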

data/lib/tabula/pdf_render.rb DELETED

@@ -1,64 +0,0 @@
-require 'java'
-java_import org.apache.pdfbox.pdmodel.PDDocument
-java_import org.apache.pdfbox.pdfviewer.PageDrawer
-java_import java.awt.image.BufferedImage
-java_import javax.imageio.ImageIO
-java_import java.awt.Dimension
-java_import java.awt.Color
-module Tabula
-  module Render
-    # render a PDF page to a graphics context, but skip rendering the text
-    # This is done to reduce 'noise' introduced by the text, we only
-    # care about lines.
-    class PageDrawerNoText < PageDrawer
-      def processTextPosition(text)
-      end
-    end
-    #ugh jruby; suppresses "ambiguous method" warning that arises due to Java's overloaded constructor.
-    TRANSPARENT_WHITE =  java.awt.Color.java_class.constructor(Java::int, Java::int, Java::int, Java::int).new_instance(255, 255, 255, 0)
-    # 2048 width is important, if this is too small, thin lines won't be drawn.
-    def self.pageToBufferedImage(page, width=2048, pageDrawerClass=PageDrawerNoText)
-      cropbox = page.findCropBox
-      widthPt, heightPt = cropbox.getWidth, cropbox.getHeight
-      pageDimension = Dimension.new(widthPt, heightPt)
-      rotation = java.lang.Math.toRadians(page.findRotation)
-      scaling = width / (rotation == 0 ? widthPt : heightPt)
-      widthPx, heightPx = (java.lang.Math.java_send :round, [Java::float], widthPt * scaling ), (java.lang.Math.java_send :round, [Java::float], heightPt * scaling)
-      retval = if rotation != 0
-                 BufferedImage.new(heightPx, widthPx, BufferedImage::TYPE_BYTE_GRAY)
-               else
-                 BufferedImage.new(widthPx, heightPx, BufferedImage::TYPE_BYTE_GRAY)
-               end
-      graphics = retval.getGraphics()
-      graphics.setBackground(TRANSPARENT_WHITE)
-      graphics.clearRect(0, 0, retval.getWidth, retval.getHeight)
-      if rotation != 0
-        graphics.java_send :translate, [Java::int, Java::int], retval.getWidth, 0.0
-        graphics.rotate(rotation)
-      end
-      graphics.scale(scaling, scaling)
-      drawer = pageDrawerClass.new()
-      drawer.drawPage(graphics,  page, pageDimension)
-      graphics.dispose
-      return retval
-    end
-  end
-end
-# testing
-if __FILE__ == $0
-  pdf_file = PDDocument.loadNonSeq(java.io.File.new(ARGV[0]), nil)
-  bi = Tabula::Render.pageToBufferedImage(pdf_file.getDocumentCatalog.getAllPages[ARGV[1].to_i - 1])
-  puts bi.class
-  ImageIO.write(bi, 'png',
-                java.io.File.new('notext.png'))
-end
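
Note on the removed `pageToBufferedImage` above: the pixel size of the rendered image is derived from the crop box size in points, a target width (2048 by default), and whether the page is rotated. The following is a minimal plain-Ruby sketch of that arithmetic, with no PDFBox or AWT dependency. `raster_size` is a hypothetical helper name, and, as in the removed code, any non-zero rotation is treated as a rotated page whose image dimensions are swapped.

```ruby
# Compute [image_width_px, image_height_px] for a page of width_pt x height_pt
# points, scaled so the on-screen width matches target_width.
# Hypothetical helper for illustration; not part of the gem.
def raster_size(width_pt, height_pt, rotation_deg, target_width = 2048)
  rotated = rotation_deg != 0
  # A rotated page shows its height along the horizontal axis,
  # so the scale factor is based on the height in that case.
  scaling = target_width.to_f / (rotated ? height_pt : width_pt)
  width_px  = (width_pt * scaling).round
  height_px = (height_pt * scaling).round
  # Swap the dimensions for rotated pages, as the removed code does.
  rotated ? [height_px, width_px] : [width_px, height_px]
end
```

With a US Letter page (612 by 792 points), `raster_size(612, 792, 0)` gives a 2048-pixel-wide portrait image, and `raster_size(612, 792, 90)` gives a 2048-pixel-wide landscape image.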