tabula-extractor 0.7.4-java → 0.7.5-java

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 935de0f0dc43fa388a86cc091dc540b74b6ce31f
4
- data.tar.gz: 67fa5fda6450c3b1659af3c61c8027843be5c082
3
+ metadata.gz: 4391bc1af8143d2f60ed2ad11a66cba9f96c955e
4
+ data.tar.gz: d0d905fbd5b2bae105a11a9fd01470921b7dd4f3
5
5
  SHA512:
6
- metadata.gz: 191054f79148535bf359c81c72d35b717f71f97ee3c3bedd4c2af66e4332afb98f3071afe4c9ed9e894586e3a20722769742f17fc02b9a5d5d954a4fae50803d
7
- data.tar.gz: 711f993194c402d1bca016f0fe13ccaeb8e4eafc6b67c2de0fa8b3cef1e7e3ae5b4cdefc2b251b64467747e7af26f80bb54bf57d4424ea50bb2dd26db7e27570
6
+ metadata.gz: 4ef1e681e511dc074381696689b8d86915f262d4538cf107be6b8844fcbf7102f20cedcdc8964f326178095e7cd9b47d912386801e57a0e4645ee27faacff4a5
7
+ data.tar.gz: 336c19a84cd2cf430e24ce0728a5f7753ca78e3b08e2f12251274da577dd2f0cfe2d6a5cf1b9df44dd09e85d9c4e5a80d2672e53632785bc70461f8ca10c3e17
data/Gemfile.lock ADDED
@@ -0,0 +1,39 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ tabula-extractor (0.7.5-java)
5
+ trollop (~> 2.0)
6
+
7
+ GEM
8
+ remote: http://rubygems.org/
9
+ specs:
10
+ coderay (1.1.0)
11
+ columnize (0.8.9)
12
+ ffi (1.9.5-java)
13
+ method_source (0.8.2)
14
+ minitest (5.4.2)
15
+ pry (0.10.1-java)
16
+ coderay (~> 1.1.0)
17
+ method_source (~> 0.8.1)
18
+ slop (~> 3.4)
19
+ spoon (~> 0.0)
20
+ rake (10.3.2)
21
+ ruby-debug (0.10.4)
22
+ columnize (>= 0.1)
23
+ ruby-debug-base (~> 0.10.4.0)
24
+ ruby-debug-base (0.10.4-java)
25
+ slop (3.6.0)
26
+ spoon (0.0.4)
27
+ ffi
28
+ trollop (2.0)
29
+
30
+ PLATFORMS
31
+ java
32
+
33
+ DEPENDENCIES
34
+ bundler (>= 1.3.4)
35
+ minitest
36
+ pry
37
+ rake
38
+ ruby-debug
39
+ tabula-extractor!
data/README.md CHANGED
@@ -3,7 +3,9 @@ tabula-extractor
3
3
 
4
4
  [![Build Status](https://travis-ci.org/jazzido/tabula-extractor.png)](https://travis-ci.org/jazzido/tabula-extractor)
5
5
 
6
- Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.nerdpower.org), now available as a library and command line program.
6
+ Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.technology), now available as a library and command line program.
7
+
8
+ Versions 0.9.6 and greater of [Tabula](http://tabula.technology) can export shell scripts using `tabula-extractor` for bulk extraction.
7
9
 
8
10
  ## Installation
9
11
 
@@ -49,11 +51,45 @@ Tabula helps you extract tables from PDFs
49
51
  --help, -h: Show this message
50
52
  ```
51
53
 
54
+ ## Command Line Examples
55
+
56
+ These examples use documents contained in `tabula-extractor`'s [`test`](https://github.com/tabulapdf/tabula-extractor/tree/master/test) folder. If you want to follow along, download the document and give it a shot. There's a more extensive explanation [here](https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool).
57
+
58
+ Extract all the tables from a document into a spreadsheet called `output.csv`:
59
+ ````bash
60
+ tabula test/heuristic-test-set/spreadsheet/tabla_subsidios.pdf -o output.csv
61
+ ````
62
+
63
+ Extract only the tables on page 1 into a spreadsheet called `output.csv`:
64
+ ````bash
65
+ tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.csv
66
+ ````
67
+
68
+ Extract only the tables on page 1 into a CSV spreadsheet onto STDOUT (that is, print it out in your terminal window):
69
+ ````bash
70
+ tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf
71
+ ````
72
+
73
+ Extract the data from the table contained within a certain area on page 1 into a spreadsheet called `output.csv`:
74
+ ````bash
75
+ tabula test/data/vertical_rulings_bug.pdf --area 250,0,325,1700 --pages 1 -o output.csv
76
+ ````
77
+
78
+ Extract all the tables from a document into a tab-separated spreadsheet called `output.tsv`:
79
+ ````bash
80
+ tabula test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.tsv --format TSV
81
+ ````
82
+
83
+ Extract the table from page 1, using specified locations for column boundaries, into a spreadsheet called `output.csv`:
84
+ ````bash
85
+ tabula test/data/campaign_donors.pdf -o output.csv --columns 47,147,256,310,375,431,504
86
+ ````
87
+
52
88
  ## Scripting examples
53
89
 
54
- `tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
90
+ `tabula-extractor` is also a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
55
91
 
56
- Here's a very basic example:
92
+ Here's a very basic example, using the "spreadsheet" extraction method:
57
93
 
58
94
  ````ruby
59
95
  require 'tabula'
@@ -73,3 +109,75 @@ end
73
109
  out.close
74
110
 
75
111
  ````
112
+
113
+ Here's another example using the "original" extraction method, which is useful for tables that don't have ruling lines separating the rows and cells. This example extracts data from only pages 1 and 2.
114
+
115
+ ````ruby
116
+ require 'tabula'
117
+
118
+ pdf_file_path = "whatever.pdf"
119
+ outfilename = "whatever.csv"
120
+
121
+ out = open(outfilename, 'w')
122
+
123
+ extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
124
+ extractor.extract.each_with_index do |pdf_page, page_index|
125
+ page_areas = [[250, 0, 325, 1700]]
126
+
127
+ page_areas.each do |page_area|
128
+ out << pdf_page.get_area(page_area).make_table.to_csv
129
+ out << "\n\n"
130
+ end
131
+
132
+ end
133
+ extractor.close!
134
+ out.close
135
+ ````
136
+
137
+ This similar example uses the "original" extraction method, but specifies the location of columns. This is a useful tactic when crappy PDF creation software lets one column's text flow into the next column. Unless you specify column locations manually, Tabula will combine the two columns. You can find the column locations using a screen ruler; I find it works well to measure the width of the entire PDF and scale the locations based on the width of the page as PDFBox renders it, as shown in the example below.
138
+
139
+ ````ruby
140
+ require 'tabula'
141
+
142
+ pdf_file_path = "whatever.pdf"
143
+ outfilename = "whatever.csv"
144
+
145
+ out = open(outfilename, 'w')
146
+
147
+ extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
148
+ extractor.extract.each_with_index do |pdf_page, page_index|
149
+ page_areas = [[250, 0, 325, 1700]]
150
+
151
+ scale_factor = pdf_page.width / 1700
152
+ # where 1700 is the width of the page as you measured it.
153
+
154
+ vertical_ruling_locations = [0, 360, 506, 617, 906, 1034, 1160, 1290, 1418, 1548] #column locations
155
+ vertical_rulings = vertical_ruling_locations.map{|n| Tabula::Ruling.new(0, n * scale_factor, 0, 1000)}
156
+
157
+ page_areas.each do |page_area|
158
+ out << pdf_page.get_area(page_area).make_table(:vertical_rulings => vertical_rulings).to_csv
159
+ out << "\n\n"
160
+ end
161
+ end
162
+ extractor.close!
163
+ out.close
164
+ ````
165
+
166
+ ## How Does This Work? Like, Theoretically?
167
+
168
+ PDFs are a terrible format for transmitting tabular data. Tabula uses two algorithms to try to reconstruct the underlying structure of the data table. This section describes how PDFs represent your data and how Tabula extracts it so you can use `tabula-extractor` productively.
169
+
170
+ PDFs were designed to represent a paper document's layout across various computers and on paper, so they focus on precise positioning. They include primitives for text strings, geometric shapes, images and videos (and more), but no data tables. Tabula includes a Java library called PDFBox to access those embedded text strings and geometric shapes and uses them to reconstruct your table.
171
+
172
+ <em style="margin-left: 5px;"> Why Can't Tabula Process Scanned Pages? Scanned PDF pages usually contain only one primitive: the image of the scanned page. Since those PDFs don't contain text strings or geometric shapes, Tabula won't be able to reconstruct your data -- unless you run the PDF through an OCR (Optical Character Recognition) program, which re-inserts those text strings into their original position, though the results can be error prone.</em>
173
+
174
+ Tabula has two distinct algorithms to use for different kinds of tables. It uses a heuristic to try to guess which algorithm to use for each table, but this heuristic is wrong fairly often, so you may need to specify which algorithm to use, using the Extraction Method selector buttons in the GUI or the `spreadsheet` or `no-spreadsheet` flags on the command line.
175
+
176
+ - The `spreadsheet` algorithm uses geometric lines to reconstruct the table structure. After discarding oblique lines, the algorithm finds all of the lines' crossing points. Using those crossing points, it creates a large list of minimal rectangular areas (that is, rectangles that contain no other rectangles) that are spreadsheet cells. The minimum bounding box of groups of adjacent cells is a table (called a Spreadsheet object). After spreadsheet objects are created, empty "placeholder" cells are created when a cell in one row (or, likewise, column) spans over a space in which multiple cells are contained in another row. Once we have the dimensions of all the cells on the page, the PDFBox library can get the text contained within each cell.
177
+ - The `original` or `no-spreadsheet` algorithm uses only the position of text elements on the page. (Because OCR software doesn't reconstruct lines, this algorithm is the only algorithm available for OCRed PDFs.) The algorithm collects all the text on the page (or within the area of the page that contains a table, specified with the Tabula GUI or the `--area` flag) and finds "rivers" -- vertical spaces that don't contain any text for the entire height of the table. These are considered column boundaries. (If text from one column flows into another column because the PDF was created with crappy software, you can specify the column boundaries manually with the `--columns` flag.) Each line of text on the page (by unique y locations) is considered a separate line in the table. (If cells contain multiple rows, you may have to write a script to "roll them up" -- Tabula can't provide this functionality.)
178
+
179
+ These two algorithms are inspired by some academic work, including Anssi Nurminen's "[Algorithmic Extraction of Data in Tables in Pdf Documents](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3)" (2013) for the spreadsheet algorithm.
180
+
181
+ ## Documentation
182
+
183
+ You're welcome to try to integrate the `tabula-extractor` gem into your project. We don't really have documentation yet, though the tests may be a good source. If you're going to, please feel free to drop us a note and we may be able to give you some pointers.
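Reviewer note: the "rivers" idea that the README describes for the `original` algorithm can be sketched in a few lines of plain Ruby. This is only an illustration under simplifying assumptions, not the code tabula-extractor actually runs. `TextBox`, `find_rivers` and the coordinates are hypothetical names and values made up for this note, and only the horizontal extent of each text element is considered.

```ruby
# Hypothetical stand-in for a text element; only left/right edges matter here.
TextBox = Struct.new(:left, :right)

# Returns [gap_start, gap_end] pairs: horizontal ranges that no text box
# covers. In the README's terms, each gap is a candidate "river".
def find_rivers(boxes)
  return [] if boxes.empty?

  sorted = boxes.sort_by(&:left)
  rivers = []
  covered_until = sorted.first.right

  sorted.drop(1).each do |box|
    # A box starting past everything seen so far leaves an empty gap.
    rivers << [covered_until, box.left] if box.left > covered_until
    covered_until = [covered_until, box.right].max
  end

  rivers
end

boxes = [
  TextBox.new(0, 40),   TextBox.new(5, 38),   # column 1, two rows
  TextBox.new(60, 100), TextBox.new(62, 95),  # column 2, two rows
  TextBox.new(120, 150)                       # column 3, one row
]

p find_rivers(boxes)
```

If one cell's text overflowed from column 1 into column 2, the first gap would disappear and the two columns would merge. That is the situation in which the README suggests passing explicit boundaries with `--columns`.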
data/bin/tabula CHANGED
@@ -120,7 +120,6 @@ def main
120
120
  end
121
121
  tables = pdf_page.spreadsheets(:use_line_returns=> use_line_returns).map(&:rows)
122
122
  else
123
- STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
124
123
  if opts[:guess]
125
124
  page_areas = pdf_page.spreadsheets.map{|rect| pdf_page.get_area(rect.dims(:top, :left, :bottom, :right))}
126
125
  elsif area_input
@@ -128,7 +127,10 @@ def main
128
127
  else
129
128
  page_areas = [pdf_page]
130
129
  end
131
- tables = page_areas.map{|page_area| page_area.make_table(vertical_rulings.nil? ? {} : { :vertical_rulings => rulings_from_columns(pdf_page, page_area, vertical_rulings) })}
130
+ tables = page_areas.map { |page_area|
131
+ STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
132
+ page_area.make_table(vertical_rulings.nil? ? {} : { :vertical_rulings => rulings_from_columns(pdf_page, page_area, vertical_rulings) })
133
+ }
132
134
  end
133
135
  tables.each do |table|
134
136
  Tabula::Writers.send(opts[:format].to_sym,
data/lib/tabula.rb CHANGED
@@ -9,19 +9,8 @@ require File.join(File.dirname(__FILE__), '../target/', 'slf4j-api-1.6.3.jar')
9
9
  require File.join(File.dirname(__FILE__), '../target/', 'trove4j-3.0.3.jar')
10
10
  require File.join(File.dirname(__FILE__), '../target/', 'jsi-1.1.0-SNAPSHOT.jar')
11
11
 
12
- import 'java.util.logging.LogManager'
13
- import 'java.util.logging.Level'
12
+ java.util.logging.Logger.getLogger('org.apache.pdfbox').setLevel(java.util.logging.Level::OFF)
14
13
 
15
- lm = LogManager.log_manager
16
- lm.logger_names.each do |name|
17
- if name == "" #rootlogger is apparently the logger PDFBox is talking to.
18
- l = lm.get_logger(name)
19
- l.level = Level::OFF
20
- l.handlers.each do |h|
21
- h.level = Level::OFF
22
- end
23
- end
24
- end
25
14
  require_relative './tabula/version'
26
15
  require_relative './tabula/core_ext'
27
16
 
@@ -30,9 +19,4 @@ require_relative './tabula/extraction'
30
19
  require_relative './tabula/table_extractor'
31
20
  require_relative './tabula/writers'
32
21
 
33
- module Tabula
34
- autoload :LSD , File.expand_path('tabula/line_segment_detector.rb', File.dirname(__FILE__))
35
- autoload :Render , File.expand_path('tabula/pdf_render.rb', File.dirname(__FILE__))
36
- end
37
-
38
22
  require_relative './tabula/table_extractor'
@@ -189,6 +189,16 @@ class Rectangle2D
189
189
  (other.bottom - self.bottom).abs
190
190
  end
191
191
 
192
+ # decomposes a rectangle into its 4 constituent lines
193
+ def to_lines
194
+ # top left width height
195
+ top = Line2D::Float.new self.left, self.top, self.right, self.top
196
+ bottom = Line2D::Float.new self.left, self.bottom, self.right, self.bottom
197
+ left = Line2D::Float.new self.left, self.top, self.left, self.bottom
198
+ right = Line2D::Float.new self.right, self.top, self.right, self.bottom
199
+ [top, bottom, left, right]
200
+ end
201
+
192
202
 
193
203
  # Various ways that rectangles can overlap one another
194
204
  #------------------------------
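Reviewer note: the `to_lines` method added in this hunk splits a rectangle into its four edges. The sketch below mirrors that decomposition in plain Ruby so it can be tried outside JRuby. `Rect` and `Segment` are hypothetical stand-ins for the `java.awt.geom` classes and are not part of the gem.

```ruby
# Hypothetical plain-Ruby stand-ins for Line2D::Float and Rectangle2D.
Segment = Struct.new(:x1, :y1, :x2, :y2)

Rect = Struct.new(:left, :top, :right, :bottom) do
  # Decompose into edges, in the same order as the hunk above:
  # top, bottom, left, right.
  def to_lines
    [
      Segment.new(left,  top,    right, top),
      Segment.new(left,  bottom, right, bottom),
      Segment.new(left,  top,    left,  bottom),
      Segment.new(right, top,    right, bottom)
    ]
  end
end

rect  = Rect.new(10, 20, 110, 70)
edges = rect.to_lines

# Print each edge as [x1, y1, x2, y2].
edges.each { |segment| p segment.to_a }
```

Every edge produced this way is either horizontal or vertical, so each one can be treated as a ruling line.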
@@ -22,6 +22,8 @@ module Tabula
22
22
 
23
23
  self.texts = texts
24
24
 
25
+ @ruling_lines += minimal_bounding_box_of_ruling_lines.to_lines.map{|l| Ruling.new(l.getY1, l.getX1, l.getX2 - l.getX1, l.getY2 - l.getY1)}.select &:finite?
26
+
25
27
  if spatial_index.nil?
26
28
  @spatial_index = TextElementIndex.new
27
29
  self.texts.each { |te| @spatial_index << te }
@@ -31,11 +33,44 @@ module Tabula
31
33
 
32
34
  end
33
35
 
34
- def min_char_width
36
+ def minimal_bounding_box_of_ruling_lines
37
+ max_x = 0
38
+ max_y = 0
39
+ min_x = ::Float::INFINITY
40
+ min_y = ::Float::INFINITY
41
+ horizontal_ruling_lines.each do |t|
42
+ min_x = t.left if t.left < min_x
43
+ max_x = t.right if t.right > max_x
44
+ end
45
+ vertical_ruling_lines.each do |t|
46
+ min_y = t.top if t.top < min_y
47
+ max_y = t.bottom if t.bottom > max_y
48
+ end
49
+ java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
50
+ end
51
+
52
+ # is there a scenario under which we'd prefer to use this over `minimal_bounding_box_of_ruling_lines`?
53
+ # if so, what is it? If there are no ruling lines on the page _at all_, then adding this bounding box is
54
+ # useless.
55
+ def minimal_bounding_box_of_text_elements
56
+ max_x = 0
57
+ max_y = 0
58
+ min_x = ::Float::INFINITY
59
+ min_y = ::Float::INFINITY
60
+ @texts.each do |t|
61
+ min_x = t.x if t.x < min_x
62
+ min_y = t.y if t.y < min_y
63
+ max_x = t.x if t.x > max_x
64
+ max_y = t.y if t.y > max_y
65
+ end
66
+ java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
67
+ end
68
+
69
+ def get_min_char_width
35
70
  @min_char_width ||= texts.map(&:width).min
36
71
  end
37
72
 
38
- def min_char_height
73
+ def get_min_char_height
39
74
  @min_char_height ||= texts.map(&:height).min
40
75
  end
41
76
 
@@ -107,16 +142,8 @@ module Tabula
107
142
  unless @spreadsheets.nil?
108
143
  return @spreadsheets
109
144
  end
110
- get_ruling_lines!(options)
111
- self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
112
-
113
- spreadsheet_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
114
145
 
115
- #transform each spreadsheet area into a rectangle
116
- # and get the cells contained within it.
117
- spreadsheet_rectangle_areas = spreadsheet_areas.map{|a| a.getBounds } #getBounds2D is theoretically better, but returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
118
-
119
- @spreadsheets = spreadsheet_rectangle_areas.map do |rect|
146
+ @spreadsheets = spreadsheet_areas(options).map do |rect|
120
147
  spr = Spreadsheet.new(rect.y, rect.x,
121
148
  rect.width, rect.height,
122
149
  self,
@@ -135,6 +162,18 @@ module Tabula
135
162
  spreadsheets
136
163
  end
137
164
 
165
+ def spreadsheet_areas (options={})
166
+ get_ruling_lines!(options)
167
+ self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
168
+
169
+ spreadsheet_java_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
170
+
171
+ #transform each spreadsheet area into a rectangle
172
+ # and get the cells contained within it.
173
+ # getBounds2D is theoretically better than getBounds, but it returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
174
+ spreadsheet_java_areas.map{|a| a.getBounds }
175
+ end
176
+
138
177
  def fill_in_cells!(options={})
139
178
  spreadsheets(options).each do |spreadsheet|
140
179
  spreadsheet.cells.each do |cell|
@@ -244,7 +283,7 @@ module Tabula
244
283
  # ah, but perhaps I can stick the points in a hash AND in an array
245
284
  # and then modify the lines by means of the points in the hash.
246
285
 
247
- [[:x, :x=, self.min_char_width], [:y, :y=, self.min_char_height]].each do |getter, setter, cell_size|
286
+ [[:x, :x=, self.get_min_char_width], [:y, :y=, self.get_min_char_height]].each do |getter, setter, cell_size|
248
287
  sorted_points = points.sort_by(&getter)
249
288
  first_point = sorted_points.shift
250
289
  grouped_points = sorted_points.inject([[first_point]] ) do |memo, next_point|
@@ -1,6 +1,7 @@
1
1
  module Tabula
2
2
  class PageArea < Page
3
3
 
4
+
4
5
 
5
6
  end
6
7
 
@@ -194,7 +194,8 @@ module Tabula
194
194
  # log(n) implementation of find_intersections
195
195
  # based on http://people.csail.mit.edu/indyk/6.838-old/handouts/lec2.pdf
196
196
  def self.find_intersections(horizontals, verticals)
197
- tree = java.util.TreeMap.new(HSegmentComparator.new)
197
+ construct_treemap_t_comparator = java.util.TreeMap.java_class.constructor(java.util.Comparator)
198
+ tree = construct_treemap_t_comparator.new_instance(HSegmentComparator.new).to_java
198
199
  sort_obj = Struct.new(:type, :pos, :obj)
199
200
 
200
201
  (horizontals + verticals)
@@ -237,6 +238,10 @@ module Tabula
237
238
  }
238
239
  end
239
240
 
241
+ def finite?
242
+ top != ::Float::INFINITY && left != ::Float::INFINITY && bottom != ::Float::INFINITY && right != ::Float::INFINITY
243
+ end
244
+
240
245
  ##
241
246
  # crop an enumerable of +Ruling+ to an +area+
242
247
  def self.crop_rulings_to_area(rulings, area)
@@ -37,10 +37,9 @@ module Tabula
37
37
  if evaluate_cells
38
38
  fill_in_cells!
39
39
  end
40
- tops = cells.map(&:top).uniq.sort
41
- array_of_rows = tops.map do |top|
42
- cells.select{|c| c.top == top }.sort_by(&:left)
43
- end
40
+
41
+ array_of_rows = cells.group_by{|cell| cell.top.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:left) }
42
+
44
43
  #here, insert another kind of placeholder for empty corners
45
44
  # like in 01001523B_China.pdf
46
45
  #TODO: support placeholders for "empty" cells in rows other than row 1, and in #cols
@@ -66,10 +65,8 @@ module Tabula
66
65
  if evaluate_cells
67
66
  fill_in_cells!
68
67
  end
69
- lefts = cells.map(&:left).uniq.sort
70
- lefts.map do |left|
71
- cells.select{|c| c.left == left }.sort_by(&:top)
72
- end
68
+
69
+ cells.group_by{|cell| cell.left.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:top) }
73
70
  end
74
71
 
75
72
  #######################################################
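Reviewer note: the two hunks above replace exact comparison of `top` and `left` with `group_by` on a value rounded to 5 decimal places. The sketch below shows the likely motivation, assuming that cell coordinates can differ by tiny floating-point amounts. `Cell`, `rows_exact` and `rows_rounded` are hypothetical names for this illustration and the coordinates are made up.

```ruby
# Hypothetical stand-in for a spreadsheet cell.
Cell = Struct.new(:top, :left, :text)

# Old approach: cells share a row only if their tops are exactly equal.
def rows_exact(cells)
  cells.map(&:top).uniq.sort.map do |top|
    cells.select { |c| c.top == top }.sort_by(&:left)
  end
end

# New approach: group by the rounded top, then order rows and cells.
def rows_rounded(cells, digits = 5)
  cells.group_by { |c| c.top.round(digits) }
       .sort_by(&:first)
       .map { |_top, row| row.sort_by(&:left) }
end

cells = [
  Cell.new(100.0,       50.0, "b"),
  Cell.new(100.0000001, 10.0, "a"),  # same visual row, slightly different float
  Cell.new(120.0,       10.0, "c")
]

p rows_exact(cells).map { |row| row.map(&:text) }
p rows_rounded(cells).map { |row| row.map(&:text) }
```

Rounding is a tolerance of convenience rather than a true clustering. Two values that sit on opposite sides of a rounding boundary would still land in different groups, which may or may not matter for real PDFs.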
@@ -137,12 +134,14 @@ module Tabula
137
134
 
138
135
  def to_csv
139
136
  out = StringIO.new
137
+ out.set_encoding("utf-8")
140
138
  Tabula::Writers.CSV(rows, out)
141
139
  out.string
142
140
  end
143
141
 
144
142
  def to_tsv
145
143
  out = StringIO.new
144
+ out.set_encoding("utf-8")
146
145
  Tabula::Writers.TSV(rows, out)
147
146
  out.string
148
147
  end
@@ -79,12 +79,14 @@ module Tabula
79
79
 
80
80
  def to_csv
81
81
  out = StringIO.new
82
+ out.set_encoding("utf-8")
82
83
  Tabula::Writers.CSV(rows, out)
83
84
  out.string
84
85
  end
85
86
 
86
87
  def to_tsv
87
88
  out = StringIO.new
89
+ out.set_encoding("utf-8")
88
90
  Tabula::Writers.TSV(rows, out)
89
91
  out.string
90
92
  end
@@ -1,3 +1,3 @@
1
1
  module Tabula
2
- VERSION = '0.7.4'
2
+ VERSION = '0.7.5'
3
3
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tabula-extractor
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.7.4
4
+ version: 0.7.5
5
5
  platform: java
6
6
  authors:
7
7
  - Manuel Aristarán
@@ -10,7 +10,7 @@ authors:
10
10
  autorequire:
11
11
  bindir: bin
12
12
  cert_chain: []
13
- date: 2014-05-09 00:00:00.000000000 Z
13
+ date: 2014-09-29 00:00:00.000000000 Z
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
16
  name: bundler
@@ -94,6 +94,7 @@ files:
94
94
  - .travis.yml
95
95
  - AUTHORS.md
96
96
  - Gemfile
97
+ - Gemfile.lock
97
98
  - LICENSE.md
98
99
  - NOTICE.txt
99
100
  - README.md
@@ -116,9 +117,6 @@ files:
116
117
  - lib/tabula/entities/text_element_index.rb
117
118
  - lib/tabula/entities/zone_entity.rb
118
119
  - lib/tabula/extraction.rb
119
- - lib/tabula/line_segment_detector.rb
120
- - lib/tabula/pdf_line_extractor.rb
121
- - lib/tabula/pdf_render.rb
122
120
  - lib/tabula/spreadsheet_extractor.rb
123
121
  - lib/tabula/table_extractor.rb
124
122
  - lib/tabula/table_guesser.rb
@@ -149,7 +147,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
149
147
  version: '0'
150
148
  requirements: []
151
149
  rubyforge_project:
152
- rubygems_version: 2.2.2
150
+ rubygems_version: 2.4.1
153
151
  signing_key:
154
152
  specification_version: 4
155
153
  summary: extract tables from PDF files
@@ -1,125 +0,0 @@
1
- require 'rbconfig'
2
- require 'ffi'
3
-
4
-
5
- java_import javax.imageio.ImageIO
6
- java_import java.awt.image.BufferedImage
7
- java_import org.apache.pdfbox.pdmodel.PDDocument
8
-
9
- module Tabula
10
- module LSD
11
- extend FFI::Library
12
- ffi_lib File.expand_path('../../ext/' + case RbConfig::CONFIG['host_os']
13
- when /mswin|msys|mingw|cygwin|bccwin|wince|emc/
14
- if RbConfig::CONFIG['host_cpu'] == 'x86_64'
15
- 'liblsd64.dll'
16
- else
17
- 'liblsd.dll'
18
- end
19
- when /darwin|mac os/
20
- 'liblsd.dylib'
21
- when /linux/
22
- if RbConfig::CONFIG['target_cpu'] == 'x86_64'
23
- 'liblsd-linux64.so'
24
- else
25
- 'liblsd-linux32.so'
26
- end
27
- else
28
- raise "unknown os: #{RbConfig::CONFIG['host_os']}"
29
- end,
30
- File.dirname(__FILE__))
31
-
32
- attach_function :lsd, [ :pointer, :buffer_in, :int, :int ], :pointer
33
- attach_function :free_values, [ :pointer ], :void
34
-
35
- DETECT_LINES_DEFAULTS = {
36
- :scale_factor => nil,
37
- :image_size => 2048
38
- }
39
-
40
- def LSD.detect_lines_in_pdf(pdf_path, options={})
41
- options = DETECT_LINES_DEFAULTS.merge(options)
42
-
43
- pdf_file = PDDocument.loadNonSeq(java.io.File.new(pdf_path), nil)
44
- lines = pdf_file.getDocumentCatalog.getAllPages.to_a.map do |page|
45
- bi = Tabula::Render.pageToBufferedImage(page, options[:image_size])
46
- detect_lines(bi, options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
47
- end
48
- pdf_file.close
49
- lines
50
- end
51
-
52
- #zero-indexed page_number
53
- def LSD.detect_lines_in_pdf_page(pdf_path, page_number, options={})
54
- options = DETECT_LINES_DEFAULTS.merge(options)
55
-
56
- pdf_file = Extraction.openPDF(pdf_path)
57
- page = pdf_file.getDocumentCatalog.getAllPages[page_number]
58
- bi = Tabula::Render.pageToBufferedImage(page,
59
- options[:image_size])
60
- pdf_file.close
61
- detect_lines(bi,
62
- options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
63
- end
64
-
65
- # image can be either a string (path to image) or a Java::JavaAwtImage::BufferedImage
66
- # image to pixels: http://stackoverflow.com/questions/6524196/java-get-pixel-array-from-image
67
- def LSD.detect_lines(image, scale_factor=1)
68
-
69
- bimage = if image.class == Java::JavaAwtImage::BufferedImage
70
- image
71
- elsif image.class == String
72
- ImageIO.read(java.io.File.new(image))
73
- else
74
- raise ArgumentError, 'image must be a string or a BufferedImage'
75
- end
76
-
77
- image = LSD.image_to_image_float(bimage)
78
-
79
- lines_found_ptr = FFI::MemoryPointer.new(:int, 1)
80
-
81
- out = lsd(lines_found_ptr, image, bimage.getWidth, bimage.getHeight)
82
-
83
- lines_found = lines_found_ptr.get_int
84
-
85
- rv = []
86
- lines_found.times do |i|
87
- a = out[7*4*i].read_array_of_type(:float, 7)
88
-
89
- a_round = a[0..3].map(&:round)
90
- p1, p2 = [[a_round[0], a_round[1]], [a_round[2], a_round[3]]]
91
-
92
- rv << Tabula::Ruling.new(p1[1] * scale_factor,
93
- p1[0] * scale_factor,
94
- (p2[0] - p1[0]) * scale_factor,
95
- (p2[1] - p1[1]) * scale_factor)
96
- end
97
-
98
- free_values(out)
99
- bimage.flush
100
- bimage.getGraphics.dispose
101
- image = nil
102
-
103
- return rv
104
- end
105
-
106
- private
107
-
108
- def LSD.image_to_image_float(buffered_image)
109
- width = buffered_image.getWidth; height = buffered_image.getHeight
110
- raster_size = width * height
111
-
112
- image_float = FFI::MemoryPointer.new(:float, raster_size)
113
- pixels = Java::int[width * height].new
114
- buffered_image.getRGB(0, 0, width, height, pixels, 0, width)
115
-
116
- image_float.put_array_of_float 0, pixels.to_a
117
- end
118
-
119
-
120
- end
121
- end
122
-
123
- if __FILE__ == $0
124
- puts Tabula::LSD.detect_lines_in_pdf_page ARGV[0], ARGV[1].to_i
125
- end
@@ -1,319 +0,0 @@
1
- java_import org.apache.pdfbox.util.operator.OperatorProcessor
2
- java_import org.apache.pdfbox.pdfparser.PDFParser
3
- java_import org.apache.pdfbox.util.PDFStreamEngine
4
- java_import org.apache.pdfbox.util.ResourceLoader
5
-
6
- java_import java.awt.geom.PathIterator
7
- java_import java.awt.geom.Point2D
8
- java_import java.awt.geom.GeneralPath
9
- java_import java.awt.geom.AffineTransform
10
- java_import java.awt.Color
11
-
12
- warn 'Tabula::Extraction::LineExtractor is DEPRECATED and will be removed'
13
-
14
- class Tabula::Extraction::LineExtractor < org.apache.pdfbox.util.PDFStreamEngine
15
-
16
- attr_accessor :currentX, :currentY
17
- attr_accessor :currentPath
18
- attr_accessor :rulings
19
- attr_accessor :options
20
- field_accessor :page
21
-
22
- DETECT_LINES_DEFAULTS = {
23
- :snapping_grid_cell_size => 2
24
- }
25
-
26
- def self.collapse_vertical_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
27
- lines.sort!{|a, b| a.left != b.left ? a.left <=> b.left : a.top <=> b.top }
28
- lines.inject([]) do |memo, next_line|
29
- if memo.last && next_line.left == memo.last.left && memo.last.nearlyIntersects?(next_line)
30
- memo.last.top = [next_line.top, memo.last.top].min
31
- memo.last.bottom = [next_line.bottom, memo.last.bottom].max
32
- memo
33
- else
34
- memo << next_line
35
- end
36
- end
37
- end
38
-
39
- def self.collapse_horizontal_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
40
- lines.sort!{|a, b| a.top != b.top ? a.top <=> b.top : a.left <=> b.left }
41
- lines.inject([]) do |memo, next_line|
42
- if memo.last && next_line.top == memo.last.top && memo.last.nearlyIntersects?(next_line)
43
- memo.last.left = [next_line.left, memo.last.left].min
44
- memo.last.right = [next_line.right, memo.last.right].max
45
- memo
46
- else
47
- memo << next_line
48
- end
49
- end
50
- end
51
-
52
- #N.B. for merge `spreadsheets` into `text-extractor-refactor` --
53
- # only substantive change here is calling Tabula::Ruling::clean_rulings on LSD output in this method
54
- # the rest is readability changes.
55
- #page_number here is zero-indexed
56
- def self.lines_in_pdf_page(pdf_path, page_number, options={})
57
- options = options.merge!(DETECT_LINES_DEFAULTS)
58
- if options[:render_pdf]
59
- # only LSD rulings need to be "cleaned" with clean_rulings; might as well do this here
60
- # since there's no good reason want unclean lines
61
- Tabula::Ruling::clean_rulings(Tabula::LSD::detect_lines_in_pdf_page(pdf_path, page_number, options))
62
- else
63
- pdf_file = ::Tabula::Extraction.openPDF(pdf_path)
64
- page = pdf_file.getDocumentCatalog.getAllPages[page_number]
65
- le = self.new(options)
66
- le.processStream(page, page.findResources, page.getContents.getStream)
67
- pdf_file.close
68
- rulings = le.rulings.map do |l, color|
69
- ::Tabula::Ruling.new(l.getP1.getY,
70
- l.getP1.getX,
71
- l.getP2.getX - l.getP1.getX,
72
- l.getP2.getY - l.getP1.getY,
73
- color)
74
- end
75
- rulings.reject! { |l| (l.left == l.right && l.top == l.bottom) || [l.top, l.left, l.bottom, l.right].any? { |p| p < 0 } }
76
- collapse_vertical_rulings(rulings.select(&:vertical?)) + collapse_horizontal_rulings(rulings.select(&:horizontal?))
77
- end
78
- end
79
-
80
- class LineToOperator < OperatorProcessor
81
- def process(operator, arguments)
82
- drawer = self.context
83
- x, y = arguments[0], arguments[1]
84
- ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
85
-
86
- l = java.awt.geom.Line2D::Float.new(drawer.currentX, drawer.currentY, ppos.getX, ppos.getY)
87
-
88
- drawer.currentPath << l if l.horizontal? or l.vertical?
89
-
90
- drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
91
- end
92
- end
93
-
94
- class MoveToOperator < OperatorProcessor
95
- def process(operator, arguments)
96
- drawer = self.context
97
- x, y = arguments[0], arguments[1]
98
-
99
- ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
100
-
101
- drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
102
- end
103
- end
104
-
105
- class AppendRectangleToPathOperator < OperatorProcessor
106
- def process(operator, arguments)
107
-
108
- drawer = self.context
109
- finalX, finalY, finalW, finalH = arguments.to_array.map(&:floatValue)
110
-
111
- ppos = drawer.TransformedPoint(finalX, finalY)
112
- psize = drawer.ScaledPoint(finalW, finalH)
113
-
114
- finalY = ppos.getY - psize.getY
115
- if finalY < 0
116
- finalY = 0
117
- end
118
-
119
- width = psize.getX.abs
120
- height = psize.getY.abs
121
-
122
- lines = if width > height && height < 2 # horizontal line, "thin" rectangle.
123
- [java.awt.geom.Line2D::Float.new(ppos.getX, finalY + psize.getY/2, ppos.getX + psize.getX, finalY + psize.getY/2)]
124
- elsif width < height && width < 2 # vertical line, "thin" rectangle
125
- [java.awt.geom.Line2D::Float.new(ppos.getX + psize.getX/2, finalY, ppos.getX + psize.getX/2, finalY + psize.getY)]
126
- else
127
- # add every edge of the rectangle to drawer.rulings
128
- [java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX + psize.getX, finalY),
129
- java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX, finalY + psize.getY),
130
- java.awt.geom.Line2D::Float.new(ppos.getX+psize.getX, finalY, ppos.getX + psize.getX, finalY + psize.getY),
131
- java.awt.geom.Line2D::Float.new(ppos.getX, finalY+psize.getY, ppos.getX + psize.getX, finalY + psize.getY)]
132
- end
133
-
134
- drawer.currentPath += lines.select { |l| l.horizontal? or l.vertical? }
135
-
136
- end
137
- end
138
-
139
- class StrokePathOperator < OperatorProcessor
140
- def process(operator, arguments)
141
- drawer = self.context
142
- strokeColorComps = drawer.getGraphicsState.getStrokingColor.getJavaColor.getRGBColorComponents(nil)
143
- color_filter = drawer.options[:line_color_filter] || lambda { |c| true } # by default, use all lines, regardless of color
144
- if color_filter.call(strokeColorComps)
145
- drawer.currentPath.each { |segment| drawer.addRuling(segment, strokeColorComps.to_a) }
146
- end
147
-
148
- drawer.currentPath = []
149
- end
150
- end
151
-
152
- class CloseFillNonZeroAndStrokePathOperator < OperatorProcessor
153
- def process(operator, arguments)
154
- drawer = self.context
155
-
156
- fillColorComps = drawer.getGraphicsState.getNonStrokingColor.getJavaColor.getRGBColorComponents(nil)
157
- color_filter = drawer.options[:line_color_filter] || lambda { |c| true } # by default, use all lines, regardless of color
158
- if color_filter.call(fillColorComps)
159
- drawer.currentPath.each { |segment| drawer.addRuling(segment, fillColorComps.to_a) }
160
- end
161
-
162
- drawer.currentPath = []
163
- end
164
- end
165
-
166
- class CloseAndStrokePathOperator < OperatorProcessor
167
- def process(operator, arguments)
168
- drawer = self.context
169
- drawer.currentPath.each { |segment| drawer.addRuling(segment) }
170
- drawer.currentPath = []
171
- end
172
- end
173
-
174
- class EndPathOperator < OperatorProcessor
175
- def process(operator, arguments)
176
- drawer = self.context
177
- # path ended without a stroke, so it draws no ruling; discard it
178
- drawer.currentPath = []
179
- end
180
- end
181
-
182
- class FillNonZeroRuleOperator < OperatorProcessor
183
- def process(operator, arguments)
184
- drawer = self.context
185
- # fill without stroke, we don't care about it; discard it
186
- drawer.currentPath = []
187
- end
188
- end
189
-
190
- OPERATOR_PROCESSORS = {
191
- 'm' => MoveToOperator.new,
192
- 're' => AppendRectangleToPathOperator.new,
193
- 'l' => LineToOperator.new,
194
- 'S' => StrokePathOperator.new,
195
- 's' => StrokePathOperator.new,
196
- 'n' => EndPathOperator.new,
197
- 'b' => CloseFillNonZeroAndStrokePathOperator.new,
198
- 'b*' => CloseFillNonZeroAndStrokePathOperator.new,
199
- 'f' => CloseFillNonZeroAndStrokePathOperator.new,
200
- 'f*' => CloseFillNonZeroAndStrokePathOperator.new,
201
- 'BT' => org.apache.pdfbox.util.operator.BeginText.new,
202
- 'cm' => org.apache.pdfbox.util.operator.Concatenate.new,
203
- 'CS' => org.apache.pdfbox.util.operator.SetStrokingColorSpace.new,
204
- 'cs' => org.apache.pdfbox.util.operator.SetNonStrokingColorSpace.new,
205
- 'ET' => org.apache.pdfbox.util.operator.EndText.new,
206
- 'G' => org.apache.pdfbox.util.operator.SetStrokingGrayColor.new,
207
- 'g' => org.apache.pdfbox.util.operator.SetNonStrokingGrayColor.new,
208
- 'gs' => org.apache.pdfbox.util.operator.SetGraphicsStateParameters.new,
209
- 'K' => org.apache.pdfbox.util.operator.SetStrokingCMYKColor.new,
210
- 'k' => org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor.new,
211
- 'q' => org.apache.pdfbox.util.operator.GSave.new,
212
- 'Q' => org.apache.pdfbox.util.operator.GRestore.new,
213
- 'RG' => org.apache.pdfbox.util.operator.SetStrokingRGBColor.new,
214
- 'rg' => org.apache.pdfbox.util.operator.SetNonStrokingRGBColor.new,
215
- 'SC' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
216
- 'sc' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
217
- 'SCN' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
218
- 'scn' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
219
- 'T*' => org.apache.pdfbox.util.operator.NextLine.new,
220
- 'Tc' => org.apache.pdfbox.util.operator.SetCharSpacing.new,
221
- 'Td' => org.apache.pdfbox.util.operator.MoveText.new,
222
- 'TD' => org.apache.pdfbox.util.operator.MoveTextSetLeading.new,
223
- 'Tf' => org.apache.pdfbox.util.operator.SetTextFont.new,
224
- 'Tj' => org.apache.pdfbox.util.operator.ShowText.new,
225
- 'TJ' => org.apache.pdfbox.util.operator.ShowTextGlyph.new,
226
- 'TL' => org.apache.pdfbox.util.operator.SetTextLeading.new,
227
- 'Tm' => org.apache.pdfbox.util.operator.SetMatrix.new,
228
- 'Tr' => org.apache.pdfbox.util.operator.SetTextRenderingMode.new,
229
- 'Ts' => org.apache.pdfbox.util.operator.SetTextRise.new,
230
- 'Tw' => org.apache.pdfbox.util.operator.SetWordSpacing.new,
231
- 'Tz' => org.apache.pdfbox.util.operator.SetHorizontalTextScaling.new,
232
- "\'" => org.apache.pdfbox.util.operator.MoveAndShow.new,
233
- '"' => org.apache.pdfbox.util.operator.SetMoveAndShow.new,
234
- }
235
-
236
- def initialize(options={})
237
- super()
238
- @options = DETECT_LINES_DEFAULTS.merge(options) # caller-supplied options override the defaults; the caller's hash is not mutated
239
- self.clear!
240
- OPERATOR_PROCESSORS.each { |k,v| registerOperatorProcessor(k, v) }
241
- end
242
-
243
- def clear!
244
- self.rulings = []
245
- self.currentX = -1
246
- self.currentY = -1
247
- self.currentPath = []
248
- @pageSize = nil
249
- end
250
-
251
- def addRuling(ruling, color=nil)
252
- color = color.nil? ? [0,0,0] : color
253
- if !page.getRotation.nil? && [90, -270, -90, 270].include?(page.getRotation)
254
-
255
- mb = page.findMediaBox
256
-
257
- ruling.rotate!(mb.getLowerLeftX, mb.getLowerLeftY, page.getRotation)
258
-
259
- trans = if page.getRotation == 90 || page.getRotation == -270
260
- AffineTransform.getTranslateInstance(mb.getHeight, 0)
261
- else
262
- AffineTransform.getTranslateInstance(0, mb.getWidth)
263
- end
264
- ruling.transform!(trans)
265
- end
266
-
267
- # snapping to grid and joining lines that are close together
268
- ruling.snap!(options[:snapping_grid_cell_size])
269
-
270
- self.rulings << [ruling, color]
271
- end
272
-
273
- ##
274
- # get current page size
275
- def pageSize
276
- @pageSize ||= self.page.findMediaBox.createDimension
277
- end
278
-
279
- ##
280
- # flip the Y coordinate against the page height (top-left origin instead of bottom-left)
281
- def fixY(y)
282
- pageSize.getHeight - y
283
- end
284
-
285
- def ScaledPoint(*args)
286
- x, y = args[0], args[1]
287
-
288
- # if scale factor not provided, get it from current transformation matrix
289
- if args.size == 2
290
- ctm = getGraphicsState.getCurrentTransformationMatrix
291
- at = ctm.createAffineTransform
292
- scaleX = at.getScaleX; scaleY = at.getScaleY
293
- else
294
- scaleX = args[2]; scaleY = args[3]
295
- end
296
-
297
- finalX = 0.0;
298
- finalY = 0.0;
299
-
300
- if scaleX > 0
301
- finalX = x * scaleX;
302
- end
303
- if scaleY > 0
304
- finalY = y * scaleY;
305
- end
306
-
307
- return java.awt.geom.Point2D::Float.new(finalX, finalY);
308
-
309
- end
310
-
311
- def TransformedPoint(x, y)
312
- position = [x,y].to_java(:float)
313
- at = self.getGraphicsState.getCurrentTransformationMatrix.createAffineTransform
314
- at.transform(position, 0, position, 0, 1)
315
- position[1] = fixY(position[1])
316
- java.awt.geom.Point2D::Float.new(position[0], position[1])
317
- end
318
-
319
- end
@@ -1,64 +0,0 @@
1
- require 'java'
2
-
3
- java_import org.apache.pdfbox.pdmodel.PDDocument
4
- java_import org.apache.pdfbox.pdfviewer.PageDrawer
5
- java_import java.awt.image.BufferedImage
6
- java_import javax.imageio.ImageIO
7
- java_import java.awt.Dimension
8
- java_import java.awt.Color
9
-
10
- module Tabula
11
- module Render
12
-
13
- # Render a PDF page to a graphics context, but skip rendering the text.
14
- # This reduces the 'noise' introduced by the text, since we only
15
- # care about lines.
16
- class PageDrawerNoText < PageDrawer
17
- def processTextPosition(text)
18
- end
19
- end
20
-
21
- # JRuby workaround: pick the (int, int, int, int) constructor explicitly to suppress the "ambiguous method" warning caused by java.awt.Color's overloaded constructors.
22
- TRANSPARENT_WHITE = java.awt.Color.java_class.constructor(Java::int, Java::int, Java::int, Java::int).new_instance(255, 255, 255, 0)
23
-
24
- # The default width of 2048 is important: if the image is too small, thin lines won't be drawn.
25
- def self.pageToBufferedImage(page, width=2048, pageDrawerClass=PageDrawerNoText)
26
- cropbox = page.findCropBox
27
- widthPt, heightPt = cropbox.getWidth, cropbox.getHeight
28
- pageDimension = Dimension.new(widthPt, heightPt)
29
- rotation = java.lang.Math.toRadians(page.findRotation)
30
-
31
- scaling = width / (rotation == 0 ? widthPt : heightPt)
32
- widthPx, heightPx = (java.lang.Math.java_send :round, [Java::float], widthPt * scaling ), (java.lang.Math.java_send :round, [Java::float], heightPt * scaling)
33
-
34
-
35
- retval = if rotation != 0
36
- BufferedImage.new(heightPx, widthPx, BufferedImage::TYPE_BYTE_GRAY)
37
- else
38
- BufferedImage.new(widthPx, heightPx, BufferedImage::TYPE_BYTE_GRAY)
39
- end
40
- graphics = retval.getGraphics()
41
- graphics.setBackground(TRANSPARENT_WHITE)
42
- graphics.clearRect(0, 0, retval.getWidth, retval.getHeight)
43
- if rotation != 0
44
- graphics.java_send :translate, [Java::int, Java::int], retval.getWidth, 0
45
- graphics.rotate(rotation)
46
- end
47
- graphics.scale(scaling, scaling)
48
- drawer = pageDrawerClass.new()
49
- drawer.drawPage(graphics, page, pageDimension)
50
- graphics.dispose
51
-
52
- return retval
53
- end
54
- end
55
- end
56
-
57
- # testing
58
- if __FILE__ == $0
59
- pdf_file = PDDocument.loadNonSeq(java.io.File.new(ARGV[0]), nil)
60
- bi = Tabula::Render.pageToBufferedImage(pdf_file.getDocumentCatalog.getAllPages[ARGV[1].to_i - 1])
61
- puts bi.class
62
- ImageIO.write(bi, 'png',
63
- java.io.File.new('notext.png'))
64
- end