tabula-extractor 0.7.4-java → 0.7.5-java
- checksums.yaml +4 -4
- data/Gemfile.lock +39 -0
- data/README.md +111 -3
- data/bin/tabula +4 -2
- data/lib/tabula.rb +1 -17
- data/lib/tabula/core_ext.rb +10 -0
- data/lib/tabula/entities/page.rb +51 -12
- data/lib/tabula/entities/page_area.rb +1 -0
- data/lib/tabula/entities/ruling.rb +6 -1
- data/lib/tabula/entities/spreadsheet.rb +7 -8
- data/lib/tabula/entities/table.rb +2 -0
- data/lib/tabula/version.rb +1 -1
- metadata +4 -6
- data/lib/tabula/line_segment_detector.rb +0 -125
- data/lib/tabula/pdf_line_extractor.rb +0 -319
- data/lib/tabula/pdf_render.rb +0 -64
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 4391bc1af8143d2f60ed2ad11a66cba9f96c955e
|
4
|
+
data.tar.gz: d0d905fbd5b2bae105a11a9fd01470921b7dd4f3
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4ef1e681e511dc074381696689b8d86915f262d4538cf107be6b8844fcbf7102f20cedcdc8964f326178095e7cd9b47d912386801e57a0e4645ee27faacff4a5
|
7
|
+
data.tar.gz: 336c19a84cd2cf430e24ce0728a5f7753ca78e3b08e2f12251274da577dd2f0cfe2d6a5cf1b9df44dd09e85d9c4e5a80d2672e53632785bc70461f8ca10c3e17
|
data/Gemfile.lock
ADDED
@@ -0,0 +1,39 @@
|
|
1
|
+
PATH
|
2
|
+
remote: .
|
3
|
+
specs:
|
4
|
+
tabula-extractor (0.7.5-java)
|
5
|
+
trollop (~> 2.0)
|
6
|
+
|
7
|
+
GEM
|
8
|
+
remote: http://rubygems.org/
|
9
|
+
specs:
|
10
|
+
coderay (1.1.0)
|
11
|
+
columnize (0.8.9)
|
12
|
+
ffi (1.9.5-java)
|
13
|
+
method_source (0.8.2)
|
14
|
+
minitest (5.4.2)
|
15
|
+
pry (0.10.1-java)
|
16
|
+
coderay (~> 1.1.0)
|
17
|
+
method_source (~> 0.8.1)
|
18
|
+
slop (~> 3.4)
|
19
|
+
spoon (~> 0.0)
|
20
|
+
rake (10.3.2)
|
21
|
+
ruby-debug (0.10.4)
|
22
|
+
columnize (>= 0.1)
|
23
|
+
ruby-debug-base (~> 0.10.4.0)
|
24
|
+
ruby-debug-base (0.10.4-java)
|
25
|
+
slop (3.6.0)
|
26
|
+
spoon (0.0.4)
|
27
|
+
ffi
|
28
|
+
trollop (2.0)
|
29
|
+
|
30
|
+
PLATFORMS
|
31
|
+
java
|
32
|
+
|
33
|
+
DEPENDENCIES
|
34
|
+
bundler (>= 1.3.4)
|
35
|
+
minitest
|
36
|
+
pry
|
37
|
+
rake
|
38
|
+
ruby-debug
|
39
|
+
tabula-extractor!
|
data/README.md
CHANGED
@@ -3,7 +3,9 @@ tabula-extractor
|
|
3
3
|
|
4
4
|
[![Build Status](https://travis-ci.org/jazzido/tabula-extractor.png)](https://travis-ci.org/jazzido/tabula-extractor)
|
5
5
|
|
6
|
-
Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.
|
6
|
+
Extract tables from PDF files. `tabula-extractor` is the table extraction engine that powers [Tabula](http://tabula.technology), now available as a library and command line program.
|
7
|
+
|
8
|
+
Versions 0.9.6 and greater of [Tabula](http://tabula.technology) can export shell scripts using `tabula-extractor` for bulk extraction.
|
7
9
|
|
8
10
|
## Installation
|
9
11
|
|
@@ -49,11 +51,45 @@ Tabula helps you extract tables from PDFs
|
|
49
51
|
--help, -h: Show this message
|
50
52
|
```
|
51
53
|
|
54
|
+
## Command Line Examples
|
55
|
+
|
56
|
+
These examples use documents contained in `tabula-extractor`'s [`test`](https://github.com/tabulapdf/tabula-extractor/tree/master/test) folder. If you want to follow along, download the documents and give it a shot. There's a more extensive explanation [here](https://github.com/tabulapdf/tabula-extractor/wiki/Using-the-command-line-tabula-extractor-tool).
|
57
|
+
|
58
|
+
Extract all the tables from a document into a spreadsheet called `output.csv`:
|
59
|
+
````bash
|
60
|
+
tabula test/heuristic-test-set/spreadsheet/tabla_subsidios.pdf -o output.csv
|
61
|
+
````
|
62
|
+
|
63
|
+
Extract only the tables on page 1 into a spreadsheet called `output.csv`:
|
64
|
+
````bash
|
65
|
+
tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.csv
|
66
|
+
````
|
67
|
+
|
68
|
+
Extract only the tables on page 1 into a CSV spreadsheet onto STDOUT (that is, print it out in your terminal window):
|
69
|
+
````bash
|
70
|
+
tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf
|
71
|
+
````
|
72
|
+
|
73
|
+
Extract the data from the table contained within a certain area on page 1 into a spreadsheet called `output.csv`:
|
74
|
+
````bash
|
75
|
+
tabula test/data/vertical_rulings_bug.pdf --area 250,0,325,1700 --pages 1 -o output.csv
|
76
|
+
````
|
77
|
+
|
78
|
+
Extract all the tables from a document into a tab-separated spreadsheet called `output.tsv`:
|
79
|
+
````bash
|
80
|
+
tabula test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.tsv --format TSV #should exclude guff
|
81
|
+
````
|
82
|
+
|
83
|
+
Extract the table from page 1, using specified locations for column boundaries, into a spreadsheet called `output.csv`:
|
84
|
+
````bash
|
85
|
+
tabula test/data/campaign_donors.pdf -o output.csv --columns 47,147,256,310,375,431,504
|
86
|
+
````
|
87
|
+
|
52
88
|
## Scripting examples
|
53
89
|
|
54
|
-
`tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
|
90
|
+
`tabula-extractor` is also a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
|
55
91
|
|
56
|
-
Here's a very basic example:
|
92
|
+
Here's a very basic example, using the "spreadsheet" extraction method:
|
57
93
|
|
58
94
|
````ruby
|
59
95
|
require 'tabula'
|
@@ -73,3 +109,75 @@ end
|
|
73
109
|
out.close
|
74
110
|
|
75
111
|
````
|
112
|
+
|
113
|
+
Here's another example using the "original" extraction method, which is useful for tables that don't have ruling lines separating the rows and cells. This example extracts data from only pages 1 and 2.
|
114
|
+
|
115
|
+
````ruby
|
116
|
+
require 'tabula'
|
117
|
+
|
118
|
+
pdf_file_path = "whatever.pdf"
|
119
|
+
outfilename = "whatever.csv"
|
120
|
+
|
121
|
+
out = open(outfilename, 'w')
|
122
|
+
|
123
|
+
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
|
124
|
+
extractor.extract.each_with_index do |pdf_page, page_index|
|
125
|
+
page_areas = [[250, 0, 325, 1700]]
|
126
|
+
|
127
|
+
page_areas.each do |page_area|
|
128
|
+
out << pdf_page.get_area(page_area).make_table.to_csv
|
129
|
+
out << "\n\n"
|
130
|
+
end
|
131
|
+
|
132
|
+
end
|
133
|
+
extractor.close!
|
134
|
+
out.close
|
135
|
+
````
|
136
|
+
|
137
|
+
This similar example also uses the "original" extraction method, but specifies the location of columns. This is a useful tactic when crappy PDF creation software lets one column's text flow into the next column. Unless you specify column locations manually, Tabula would combine the two columns. You can find the column locations using a screen ruler; I find it works well to measure the width of the entire PDF and scale the locations based on the width of the page as PDFBox renders it, as shown in the example below.
|
138
|
+
|
139
|
+
````ruby
|
140
|
+
require 'tabula'
|
141
|
+
|
142
|
+
pdf_file_path = "whatever.pdf"
|
143
|
+
outfilename = "whatever.csv"
|
144
|
+
|
145
|
+
out = open(outfilename, 'w')
|
146
|
+
|
147
|
+
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
|
148
|
+
extractor.extract.each_with_index do |pdf_page, page_index|
|
149
|
+
page_areas = [[250, 0, 325, 1700]]
|
150
|
+
|
151
|
+
scale_factor = pdf_page.width / 1700
|
152
|
+
# where 1700 is the width of the page as you measured it.
|
153
|
+
|
154
|
+
vertical_ruling_locations = [0, 360, 506, 617, 906, 1034, 1160, 1290, 1418, 1548] #column locations
|
155
|
+
vertical_rulings = vertical_ruling_locations.map{|n| Tabula::Ruling.new(0, n * scale_factor, 0, 1000)}
|
156
|
+
|
157
|
+
page_areas.each do |page_area|
|
158
|
+
out << pdf_page.get_area(page_area).make_table(:vertical_rulings => vertical_rulings).to_csv
|
159
|
+
out << "\n\n"
|
160
|
+
end
|
161
|
+
end
|
162
|
+
extractor.close!
|
163
|
+
out.close
|
164
|
+
````
|
165
|
+
|
166
|
+
## How Does This Work? Like, Theoretically?
|
167
|
+
|
168
|
+
PDFs are a terrible format for transmitting tabular data. Tabula uses two algorithms to try to reconstruct the underlying structure of the data table. This section describes how PDFs represent your data and how Tabula extracts it so you can use `tabula-extractor` productively.
|
169
|
+
|
170
|
+
PDFs were designed to represent a paper document's layout across various computers and on paper, so they focus on precise positioning. They include primitives for text strings, geometric shapes, images and videos (and more), but no data tables. Tabula includes a Java library called PDFBox to access those embedded text strings and geometric shapes and uses them to reconstruct your table.
|
171
|
+
|
172
|
+
<em style="margin-left: 5px;"> Why Can't Tabula Process Scanned Pages? Scanned PDF pages usually contain only one primitive: the image of the scanned page. Since those PDFs don't contain text strings or geometric shapes, Tabula won't be able to reconstruct your data -- unless you run the PDF through an OCR (Optical Character Recognition) program, which re-inserts those text strings into their original position, though the results can be error prone.</em>
|
173
|
+
|
174
|
+
Tabula has two distinct algorithms to use for different kinds of tables. It uses a heuristic to try to guess which algorithm to use for each table, but this heuristic is wrong fairly often, so you may need to specify which algorithm to use, using the Extraction Method selector buttons in the GUI or the `spreadsheet` or `no-spreadsheet` flags on the command line.
|
175
|
+
|
176
|
+
- The `spreadsheet` algorithm uses geometric lines to reconstruct the table structure. After discarding oblique lines, the algorithm finds all of the lines' crossing points. Using those crossing points, it creates a large list of minimal rectangular areas (that is, rectangles that contain no other rectangles) that are spreadsheet cells. The minimum bounding box of groups of adjacent cells is a table (called a Spreadsheet object). After spreadsheet objects are created, empty "placeholder" cells are created when a cell in one row (or, likewise, column) spans over a space in which multiple cells are contained in another row. Once we have the dimensions of all the cells on the page, the PDFBox library can get the text contained within each cell.
|
177
|
+
- The `original` or `no-spreadsheet` algorithm uses only the positions of text elements on the page. (Because OCR software doesn't reconstruct lines, this algorithm is the only algorithm available for OCRed PDFs.) The algorithm collects all the text on the page (or within the area of the page that contains a table, specified with the Tabula GUI or the `--area` flag) and finds "rivers" -- vertical spaces that don't contain any text for the entire height of the table. These are considered column boundaries. (If text from one column flows into another column because the PDF was created with crappy software, you can specify the column boundaries manually with the `--columns` flag.) Each line of text on the page (by unique y locations) is considered a separate row in the table. (If cells contain multiple lines of text, you may have to write a script to "roll them up" -- Tabula can't provide this functionality.)
|
178
|
+
|
179
|
+
These two algorithms are inspired by some academic work, including Anssi Nurminen's "[Algorithmic Extraction of Data in Tables in Pdf Documents](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3)" (2013) for the spreadsheet algorithm.
|
180
|
+
|
181
|
+
## Documentation
|
182
|
+
|
183
|
+
You're welcome to try to integrate the `tabula-extractor` gem into your project. We don't really have documentation yet, though the tests may be a good source. If you're going to, please feel free to drop us a note and we may be able to give you some pointers.
|
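The "rivers" idea that the README describes for the `original` algorithm can be sketched in plain Ruby. This is only a minimal illustration under stated assumptions: text elements are reduced to `[left, right]` horizontal extents, and `find_rivers` and `min_gap` are hypothetical names that are not part of `tabula-extractor`'s API.

```ruby
# Hypothetical helper, not part of tabula-extractor: given the horizontal
# extents [left, right] of every text element in a table area, return the
# vertical gaps ("rivers") that no text element crosses.
def find_rivers(extents, min_gap = 1.0)
  return [] if extents.empty?

  # Merge overlapping or touching extents into solid "banks" of text.
  sorted = extents.sort_by(&:first)
  banks = [sorted.first.dup]
  sorted.drop(1).each do |left, right|
    if left <= banks.last[1]
      banks.last[1] = [banks.last[1], right].max
    else
      banks << [left, right]
    end
  end

  # Each gap between consecutive banks is a candidate column boundary.
  banks.each_cons(2)
       .map { |a, b| [a[1], b[0]] }
       .select { |gap_left, gap_right| gap_right - gap_left >= min_gap }
end

# Three columns of text, two rows each; the x coordinates are made up.
extents = [[10, 40], [60, 95], [120, 150],
           [12, 35], [58, 90], [118, 155]]
find_rivers(extents).each do |gap_left, gap_right|
  puts "river between x=#{gap_left} and x=#{gap_right}"
end
```

If text from one column overflows into the next, the two banks merge and the river disappears, which is the situation the `--columns` flag is meant to handle.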
data/bin/tabula
CHANGED
@@ -120,7 +120,6 @@ def main
|
|
120
120
|
end
|
121
121
|
tables = pdf_page.spreadsheets(:use_line_returns=> use_line_returns).map(&:rows)
|
122
122
|
else
|
123
|
-
STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
|
124
123
|
if opts[:guess]
|
125
124
|
page_areas = pdf_page.spreadsheets.map{|rect| pdf_page.get_area(rect.dims(:top, :left, :bottom, :right))}
|
126
125
|
elsif area_input
|
@@ -128,7 +127,10 @@ def main
|
|
128
127
|
else
|
129
128
|
page_areas = [pdf_page]
|
130
129
|
end
|
131
|
-
tables = page_areas.map{|page_area|
|
130
|
+
tables = page_areas.map { |page_area|
|
131
|
+
STDERR.puts "Page #{pdf_page.number(:one_indexed)}: #{page_area.to_s}" if opts[:debug]
|
132
|
+
page_area.make_table(vertical_rulings.nil? ? {} : { :vertical_rulings => rulings_from_columns(pdf_page, page_area, vertical_rulings) })
|
133
|
+
}
|
132
134
|
end
|
133
135
|
tables.each do |table|
|
134
136
|
Tabula::Writers.send(opts[:format].to_sym,
|
data/lib/tabula.rb
CHANGED
@@ -9,19 +9,8 @@ require File.join(File.dirname(__FILE__), '../target/', 'slf4j-api-1.6.3.jar')
|
|
9
9
|
require File.join(File.dirname(__FILE__), '../target/', 'trove4j-3.0.3.jar')
|
10
10
|
require File.join(File.dirname(__FILE__), '../target/', 'jsi-1.1.0-SNAPSHOT.jar')
|
11
11
|
|
12
|
-
|
13
|
-
import 'java.util.logging.Level'
|
12
|
+
java.util.logging.Logger.getLogger('org.apache.pdfbox').setLevel(java.util.logging.Level::OFF)
|
14
13
|
|
15
|
-
lm = LogManager.log_manager
|
16
|
-
lm.logger_names.each do |name|
|
17
|
-
if name == "" #rootlogger is apparently the logger PDFBox is talking to.
|
18
|
-
l = lm.get_logger(name)
|
19
|
-
l.level = Level::OFF
|
20
|
-
l.handlers.each do |h|
|
21
|
-
h.level = Level::OFF
|
22
|
-
end
|
23
|
-
end
|
24
|
-
end
|
25
14
|
require_relative './tabula/version'
|
26
15
|
require_relative './tabula/core_ext'
|
27
16
|
|
@@ -30,9 +19,4 @@ require_relative './tabula/extraction'
|
|
30
19
|
require_relative './tabula/table_extractor'
|
31
20
|
require_relative './tabula/writers'
|
32
21
|
|
33
|
-
module Tabula
|
34
|
-
autoload :LSD , File.expand_path('tabula/line_segment_detector.rb', File.dirname(__FILE__))
|
35
|
-
autoload :Render , File.expand_path('tabula/pdf_render.rb', File.dirname(__FILE__))
|
36
|
-
end
|
37
|
-
|
38
22
|
require_relative './tabula/table_extractor'
|
data/lib/tabula/core_ext.rb
CHANGED
@@ -189,6 +189,16 @@ class Rectangle2D
|
|
189
189
|
(other.bottom - self.bottom).abs
|
190
190
|
end
|
191
191
|
|
192
|
+
# decomposes a rectangle into its 4 constituent lines
|
193
|
+
def to_lines
|
194
|
+
# top left width height
|
195
|
+
top = Line2D::Float.new self.left, self.top, self.right, self.top
|
196
|
+
bottom = Line2D::Float.new self.left, self.bottom, self.right, self.bottom
|
197
|
+
left = Line2D::Float.new self.left, self.top, self.left, self.bottom
|
198
|
+
right = Line2D::Float.new self.right, self.top, self.right, self.bottom
|
199
|
+
[top, bottom, left, right]
|
200
|
+
end
|
201
|
+
|
192
202
|
|
193
203
|
# Various ways that rectangles can overlap one another
|
194
204
|
#------------------------------
|
data/lib/tabula/entities/page.rb
CHANGED
@@ -22,6 +22,8 @@ module Tabula
|
|
22
22
|
|
23
23
|
self.texts = texts
|
24
24
|
|
25
|
+
@ruling_lines += minimal_bounding_box_of_ruling_lines.to_lines.map{|l| Ruling.new(l.getY1, l.getX1, l.getX2 - l.getX1, l.getY2 - l.getY1)}.select &:finite?
|
26
|
+
|
25
27
|
if spatial_index.nil?
|
26
28
|
@spatial_index = TextElementIndex.new
|
27
29
|
self.texts.each { |te| @spatial_index << te }
|
@@ -31,11 +33,44 @@ module Tabula
|
|
31
33
|
|
32
34
|
end
|
33
35
|
|
34
|
-
def
|
36
|
+
def minimal_bounding_box_of_ruling_lines
|
37
|
+
max_x = 0
|
38
|
+
max_y = 0
|
39
|
+
min_x = ::Float::INFINITY
|
40
|
+
min_y = ::Float::INFINITY
|
41
|
+
horizontal_ruling_lines.each do |t|
|
42
|
+
min_x = t.left if t.left < min_x
|
43
|
+
max_x = t.right if t.right > max_x
|
44
|
+
end
|
45
|
+
vertical_ruling_lines.each do |t|
|
46
|
+
min_y = t.top if t.top < min_y
|
47
|
+
max_y = t.bottom if t.bottom > max_y
|
48
|
+
end
|
49
|
+
java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
|
50
|
+
end
|
51
|
+
|
52
|
+
# is there a scenario under which we'd prefer to use this over `minimal_bounding_box_of_ruling_lines`?
|
53
|
+
# if so, what is it? If there are no ruling lines on the page _at all_, then adding this bounding box is
|
54
|
+
# useless.
|
55
|
+
def minimal_bounding_box_of_text_elements
|
56
|
+
max_x = 0
|
57
|
+
max_y = 0
|
58
|
+
min_x = ::Float::INFINITY
|
59
|
+
min_y = ::Float::INFINITY
|
60
|
+
@texts.each do |t|
|
61
|
+
min_x = t.x if t.x < min_x
|
62
|
+
min_y = t.y if t.y < min_y
|
63
|
+
max_x = t.x if t.x > max_x
|
64
|
+
max_y = t.y if t.y > max_y
|
65
|
+
end
|
66
|
+
java.awt.geom.Rectangle2D::Float.new(min_x, min_y, max_x - min_x, max_y - min_y)
|
67
|
+
end
|
68
|
+
|
69
|
+
def get_min_char_width
|
35
70
|
@min_char_width ||= texts.map(&:width).min
|
36
71
|
end
|
37
72
|
|
38
|
-
def
|
73
|
+
def get_min_char_height
|
39
74
|
@min_char_height ||= texts.map(&:height).min
|
40
75
|
end
|
41
76
|
|
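To see why the new bounding-box rulings are filtered with `finite?`, here is a small plain-Ruby sketch of the same min/max scan. It is a sketch under assumptions, not the gem's code: `Line`, `Box` and `bounding_box` are hypothetical stand-ins for `Tabula::Ruling` and `java.awt.geom.Rectangle2D::Float`, and the finiteness check is applied to the box rather than to each ruling.

```ruby
# Plain-Ruby stand-ins (hypothetical, for illustration only).
Line = Struct.new(:left, :top, :right, :bottom)
Box  = Struct.new(:left, :top, :width, :height) do
  # True only when every coordinate is a usable, finite number.
  def finite?
    [left, top, width, height].all? { |v| v.to_f.finite? }
  end

  # Decompose the box into its four edges: top, bottom, left, right.
  def to_lines
    right  = left + width
    bottom = top + height
    [Line.new(left, top, right, top),
     Line.new(left, bottom, right, bottom),
     Line.new(left, top, left, bottom),
     Line.new(right, top, right, bottom)]
  end
end

# Same scan shape as the hunk above: x bounds come from the horizontal
# lines, y bounds from the vertical ones.
def bounding_box(horizontals, verticals)
  min_x = min_y = Float::INFINITY
  max_x = max_y = 0
  horizontals.each do |l|
    min_x = l.left  if l.left  < min_x
    max_x = l.right if l.right > max_x
  end
  verticals.each do |l|
    min_y = l.top    if l.top    < min_y
    max_y = l.bottom if l.bottom > max_y
  end
  Box.new(min_x, min_y, max_x - min_x, max_y - min_y)
end

horizontals = [Line.new(10, 20, 200, 20), Line.new(10, 80, 200, 80)]
verticals   = [Line.new(10, 20, 10, 80), Line.new(200, 20, 200, 80)]

box = bounding_box(horizontals, verticals)
puts box.finite?                   # a page with rulings yields a usable box
puts bounding_box([], []).finite?  # no rulings: the minima stay at infinity
```

With no ruling lines on the page the minima never leave `Float::INFINITY`, so the resulting edges carry no information and are better dropped.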
@@ -107,16 +142,8 @@ module Tabula
|
|
107
142
|
unless @spreadsheets.nil?
|
108
143
|
return @spreadsheets
|
109
144
|
end
|
110
|
-
get_ruling_lines!(options)
|
111
|
-
self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
|
112
|
-
|
113
|
-
spreadsheet_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
|
114
145
|
|
115
|
-
|
116
|
-
# and get the cells contained within it.
|
117
|
-
spreadsheet_rectangle_areas = spreadsheet_areas.map{|a| a.getBounds } #getBounds2D is theoretically better, but returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
|
118
|
-
|
119
|
-
@spreadsheets = spreadsheet_rectangle_areas.map do |rect|
|
146
|
+
@spreadsheets = spreadsheet_areas(options).map do |rect|
|
120
147
|
spr = Spreadsheet.new(rect.y, rect.x,
|
121
148
|
rect.width, rect.height,
|
122
149
|
self,
|
@@ -135,6 +162,18 @@ module Tabula
|
|
135
162
|
spreadsheets
|
136
163
|
end
|
137
164
|
|
165
|
+
def spreadsheet_areas (options={})
|
166
|
+
get_ruling_lines!(options)
|
167
|
+
self.find_cells!(self.horizontal_ruling_lines, self.vertical_ruling_lines, options)
|
168
|
+
|
169
|
+
spreadsheet_java_areas = find_spreadsheets_from_cells #literally, java.awt.geom.Area objects. lol sorry. polygons.
|
170
|
+
|
171
|
+
#transform each spreadsheet area into a rectangle
|
172
|
+
# and get the cells contained within it.
|
173
|
+
# getBounds2D is theoretically better than getBounds, but it returns a Rectangle2D.Double, which doesn't have our Ruby sugar on it.
|
174
|
+
spreadsheet_java_areas.map{|a| a.getBounds }
|
175
|
+
end
|
176
|
+
|
138
177
|
def fill_in_cells!(options={})
|
139
178
|
spreadsheets(options).each do |spreadsheet|
|
140
179
|
spreadsheet.cells.each do |cell|
|
@@ -244,7 +283,7 @@ module Tabula
|
|
244
283
|
# ah, but perhaps I can stick the points in a hash AND in an array
|
245
284
|
# and then modify the lines by means of the points in the hash.
|
246
285
|
|
247
|
-
[[:x, :x=, self.
|
286
|
+
[[:x, :x=, self.get_min_char_width], [:y, :y=, self.get_min_char_height]].each do |getter, setter, cell_size|
|
248
287
|
sorted_points = points.sort_by(&getter)
|
249
288
|
first_point = sorted_points.shift
|
250
289
|
grouped_points = sorted_points.inject([[first_point]] ) do |memo, next_point|
|
data/lib/tabula/entities/ruling.rb
CHANGED
@@ -194,7 +194,8 @@ module Tabula
|
|
194
194
|
# log(n) implementation of find_intersections
|
195
195
|
# based on http://people.csail.mit.edu/indyk/6.838-old/handouts/lec2.pdf
|
196
196
|
def self.find_intersections(horizontals, verticals)
|
197
|
-
|
197
|
+
construct_treemap_t_comparator = java.util.TreeMap.java_class.constructor(java.util.Comparator)
|
198
|
+
tree = construct_treemap_t_comparator.new_instance(HSegmentComparator.new).to_java
|
198
199
|
sort_obj = Struct.new(:type, :pos, :obj)
|
199
200
|
|
200
201
|
(horizontals + verticals)
|
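For orientation, the kind of result a crossing-point search over rulings produces can be shown with a brute-force version. This is a sketch under assumptions and not the gem's implementation, which, judging by its comments and the `TreeMap` above, uses a faster sorted-sweep approach. `HSeg`, `VSeg` and `naive_intersections` are hypothetical names.

```ruby
HSeg = Struct.new(:y, :left, :right)  # horizontal segment (hypothetical)
VSeg = Struct.new(:x, :top, :bottom)  # vertical segment (hypothetical)

# Brute-force crossing finder: checks every horizontal/vertical pair.
# Returns a hash mapping each crossing point [x, y] to the two segments
# that meet there.
def naive_intersections(horizontals, verticals)
  crossings = {}
  horizontals.each do |h|
    verticals.each do |v|
      next unless v.x.between?(h.left, h.right) && h.y.between?(v.top, v.bottom)
      crossings[[v.x, h.y]] = [h, v]
    end
  end
  crossings
end

# A 2x1 grid: two horizontal rulings crossed by three vertical ones.
horizontals = [HSeg.new(0, 0, 20), HSeg.new(10, 0, 20)]
verticals   = [VSeg.new(0, 0, 10), VSeg.new(10, 0, 10), VSeg.new(20, 0, 10)]
naive_intersections(horizontals, verticals).each_key do |x, y|
  puts "crossing at (#{x}, #{y})"
end
```

The pairwise loop costs one comparison per horizontal/vertical pair, which is the cost a sweep over sorted segments is meant to avoid on pages with many rulings.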
@@ -237,6 +238,10 @@ module Tabula
|
|
237
238
|
}
|
238
239
|
end
|
239
240
|
|
241
|
+
def finite?
|
242
|
+
top != ::Float::INFINITY && left != ::Float::INFINITY && bottom != ::Float::INFINITY && right != ::Float::INFINITY
|
243
|
+
end
|
244
|
+
|
240
245
|
##
|
241
246
|
# crop an enumerable of +Ruling+ to an +area+
|
242
247
|
def self.crop_rulings_to_area(rulings, area)
|
data/lib/tabula/entities/spreadsheet.rb
CHANGED
@@ -37,10 +37,9 @@ module Tabula
|
|
37
37
|
if evaluate_cells
|
38
38
|
fill_in_cells!
|
39
39
|
end
|
40
|
-
|
41
|
-
array_of_rows =
|
42
|
-
|
43
|
-
end
|
40
|
+
|
41
|
+
array_of_rows = cells.group_by{|cell| cell.top.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:left) }
|
42
|
+
|
44
43
|
#here, insert another kind of placeholder for empty corners
|
45
44
|
# like in 01001523B_China.pdf
|
46
45
|
#TODO: support placeholders for "empty" cells in rows other than row 1, and in #cols
|
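The new one-line body for `array_of_rows` is easier to follow on toy data. The sketch below reuses the same `group_by` / `sort_by` / `map` chain on a hypothetical `Cell` struct (the real cells are Tabula entities); rounding `top` to five places lets cells whose coordinates differ only by floating-point noise land in the same row.

```ruby
Cell = Struct.new(:top, :left, :text)  # hypothetical stand-in for a cell

# Bucket cells by rounded top coordinate, order the buckets top to bottom,
# then order the cells inside each bucket left to right.
def rows_from(cells)
  cells.group_by { |cell| cell.top.round(5) }
       .sort_by(&:first)
       .map { |_top, row| row.sort_by(&:left) }
end

cells = [
  Cell.new(20.0000001, 100.0, "b1"),
  Cell.new(10.0,        50.0, "a0"),
  Cell.new(20.0,        50.0, "b0"),
  Cell.new(10.0000002, 100.0, "a1")
]

rows_from(cells).each { |row| puts row.map(&:text).join(",") }
# prints "a0,a1" and then "b0,b1"
```

The column version in the next hunk is the mirror image: group by rounded `left`, then sort each group by `top`.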
@@ -66,10 +65,8 @@ module Tabula
|
|
66
65
|
if evaluate_cells
|
67
66
|
fill_in_cells!
|
68
67
|
end
|
69
|
-
|
70
|
-
|
71
|
-
cells.select{|c| c.left == left }.sort_by(&:top)
|
72
|
-
end
|
68
|
+
|
69
|
+
cells.group_by{|cell| cell.left.round(5) }.sort_by(&:first).map{|x| x.last.sort_by(&:top) }
|
73
70
|
end
|
74
71
|
|
75
72
|
#######################################################
|
@@ -137,12 +134,14 @@ module Tabula
|
|
137
134
|
|
138
135
|
def to_csv
|
139
136
|
out = StringIO.new
|
137
|
+
out.set_encoding("utf-8")
|
140
138
|
Tabula::Writers.CSV(rows, out)
|
141
139
|
out.string
|
142
140
|
end
|
143
141
|
|
144
142
|
def to_tsv
|
145
143
|
out = StringIO.new
|
144
|
+
out.set_encoding("utf-8")
|
146
145
|
Tabula::Writers.TSV(rows, out)
|
147
146
|
out.string
|
148
147
|
end
|
data/lib/tabula/entities/table.rb
CHANGED
@@ -79,12 +79,14 @@ module Tabula
|
|
79
79
|
|
80
80
|
def to_csv
|
81
81
|
out = StringIO.new
|
82
|
+
out.set_encoding("utf-8")
|
82
83
|
Tabula::Writers.CSV(rows, out)
|
83
84
|
out.string
|
84
85
|
end
|
85
86
|
|
86
87
|
def to_tsv
|
87
88
|
out = StringIO.new
|
89
|
+
out.set_encoding("utf-8")
|
88
90
|
Tabula::Writers.TSV(rows, out)
|
89
91
|
out.string
|
90
92
|
end
|
data/lib/tabula/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: tabula-extractor
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.7.
|
4
|
+
version: 0.7.5
|
5
5
|
platform: java
|
6
6
|
authors:
|
7
7
|
- Manuel Aristarán
|
@@ -10,7 +10,7 @@ authors:
|
|
10
10
|
autorequire:
|
11
11
|
bindir: bin
|
12
12
|
cert_chain: []
|
13
|
-
date: 2014-
|
13
|
+
date: 2014-09-29 00:00:00.000000000 Z
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
16
|
name: bundler
|
@@ -94,6 +94,7 @@ files:
|
|
94
94
|
- .travis.yml
|
95
95
|
- AUTHORS.md
|
96
96
|
- Gemfile
|
97
|
+
- Gemfile.lock
|
97
98
|
- LICENSE.md
|
98
99
|
- NOTICE.txt
|
99
100
|
- README.md
|
@@ -116,9 +117,6 @@ files:
|
|
116
117
|
- lib/tabula/entities/text_element_index.rb
|
117
118
|
- lib/tabula/entities/zone_entity.rb
|
118
119
|
- lib/tabula/extraction.rb
|
119
|
-
- lib/tabula/line_segment_detector.rb
|
120
|
-
- lib/tabula/pdf_line_extractor.rb
|
121
|
-
- lib/tabula/pdf_render.rb
|
122
120
|
- lib/tabula/spreadsheet_extractor.rb
|
123
121
|
- lib/tabula/table_extractor.rb
|
124
122
|
- lib/tabula/table_guesser.rb
|
@@ -149,7 +147,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
149
147
|
version: '0'
|
150
148
|
requirements: []
|
151
149
|
rubyforge_project:
|
152
|
-
rubygems_version: 2.
|
150
|
+
rubygems_version: 2.4.1
|
153
151
|
signing_key:
|
154
152
|
specification_version: 4
|
155
153
|
summary: extract tables from PDF files
|
@@ -1,125 +0,0 @@
|
|
1
|
-
require 'rbconfig'
|
2
|
-
require 'ffi'
|
3
|
-
|
4
|
-
|
5
|
-
java_import javax.imageio.ImageIO
|
6
|
-
java_import java.awt.image.BufferedImage
|
7
|
-
java_import org.apache.pdfbox.pdmodel.PDDocument
|
8
|
-
|
9
|
-
module Tabula
|
10
|
-
module LSD
|
11
|
-
extend FFI::Library
|
12
|
-
ffi_lib File.expand_path('../../ext/' + case RbConfig::CONFIG['host_os']
|
13
|
-
when /mswin|msys|mingw|cygwin|bccwin|wince|emc/
|
14
|
-
if RbConfig::CONFIG['host_cpu'] == 'x86_64'
|
15
|
-
'liblsd64.dll'
|
16
|
-
else
|
17
|
-
'liblsd.dll'
|
18
|
-
end
|
19
|
-
when /darwin|mac os/
|
20
|
-
'liblsd.dylib'
|
21
|
-
when /linux/
|
22
|
-
if RbConfig::CONFIG['target_cpu'] == 'x86_64'
|
23
|
-
'liblsd-linux64.so'
|
24
|
-
else
|
25
|
-
'liblsd-linux32.so'
|
26
|
-
end
|
27
|
-
else
|
28
|
-
raise "unknown os: #{RbConfig::CONFIG['host_os']}"
|
29
|
-
end,
|
30
|
-
File.dirname(__FILE__))
|
31
|
-
|
32
|
-
attach_function :lsd, [ :pointer, :buffer_in, :int, :int ], :pointer
|
33
|
-
attach_function :free_values, [ :pointer ], :void
|
34
|
-
|
35
|
-
DETECT_LINES_DEFAULTS = {
|
36
|
-
:scale_factor => nil,
|
37
|
-
:image_size => 2048
|
38
|
-
}
|
39
|
-
|
40
|
-
def LSD.detect_lines_in_pdf(pdf_path, options={})
|
41
|
-
options = DETECT_LINES_DEFAULTS.merge(options)
|
42
|
-
|
43
|
-
pdf_file = PDDocument.loadNonSeq(java.io.File.new(pdf_path), nil)
|
44
|
-
lines = pdf_file.getDocumentCatalog.getAllPages.to_a.map do |page|
|
45
|
-
bi = Tabula::Render.pageToBufferedImage(page, options[:image_size])
|
46
|
-
detect_lines(bi, options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
|
47
|
-
end
|
48
|
-
pdf_file.close
|
49
|
-
lines
|
50
|
-
end
|
51
|
-
|
52
|
-
#zero-indexed page_number
|
53
|
-
def LSD.detect_lines_in_pdf_page(pdf_path, page_number, options={})
|
54
|
-
options = DETECT_LINES_DEFAULTS.merge(options)
|
55
|
-
|
56
|
-
pdf_file = Extraction.openPDF(pdf_path)
|
57
|
-
page = pdf_file.getDocumentCatalog.getAllPages[page_number]
|
58
|
-
bi = Tabula::Render.pageToBufferedImage(page,
|
59
|
-
options[:image_size])
|
60
|
-
pdf_file.close
|
61
|
-
detect_lines(bi,
|
62
|
-
options[:scale_factor] || (page.findCropBox.width / options[:image_size]))
|
63
|
-
end
|
64
|
-
|
65
|
-
# image can be either a string (path to image) or a Java::JavaAwtImage::BufferedImage
|
66
|
-
# image to pixels: http://stackoverflow.com/questions/6524196/java-get-pixel-array-from-image
|
67
|
-
def LSD.detect_lines(image, scale_factor=1)
|
68
|
-
|
69
|
-
bimage = if image.class == Java::JavaAwtImage::BufferedImage
|
70
|
-
image
|
71
|
-
elsif image.class == String
|
72
|
-
ImageIO.read(java.io.File.new(image))
|
73
|
-
else
|
74
|
-
raise ArgumentError, 'image must be a string or a BufferedImage'
|
75
|
-
end
|
76
|
-
|
77
|
-
image = LSD.image_to_image_float(bimage)
|
78
|
-
|
79
|
-
lines_found_ptr = FFI::MemoryPointer.new(:int, 1)
|
80
|
-
|
81
|
-
out = lsd(lines_found_ptr, image, bimage.getWidth, bimage.getHeight)
|
82
|
-
|
83
|
-
lines_found = lines_found_ptr.get_int
|
84
|
-
|
85
|
-
rv = []
|
86
|
-
lines_found.times do |i|
|
87
|
-
a = out[7*4*i].read_array_of_type(:float, 7)
|
88
|
-
|
89
|
-
a_round = a[0..3].map(&:round)
|
90
|
-
p1, p2 = [[a_round[0], a_round[1]], [a_round[2], a_round[3]]]
|
91
|
-
|
92
|
-
rv << Tabula::Ruling.new(p1[1] * scale_factor,
|
93
|
-
p1[0] * scale_factor,
|
94
|
-
(p2[0] - p1[0]) * scale_factor,
|
95
|
-
(p2[1] - p1[1]) * scale_factor)
|
96
|
-
end
|
97
|
-
|
98
|
-
free_values(out)
|
99
|
-
bimage.flush
|
100
|
-
bimage.getGraphics.dispose
|
101
|
-
image = nil
|
102
|
-
|
103
|
-
return rv
|
104
|
-
end
|
105
|
-
|
106
|
-
private
|
107
|
-
|
108
|
-
def LSD.image_to_image_float(buffered_image)
|
109
|
-
width = buffered_image.getWidth; height = buffered_image.getHeight
|
110
|
-
raster_size = width * height
|
111
|
-
|
112
|
-
image_float = FFI::MemoryPointer.new(:float, raster_size)
|
113
|
-
pixels = Java::int[width * height].new
|
114
|
-
buffered_image.getRGB(0, 0, width, height, pixels, 0, width)
|
115
|
-
|
116
|
-
image_float.put_array_of_float 0, pixels.to_a
|
117
|
-
end
|
118
|
-
|
119
|
-
|
120
|
-
end
|
121
|
-
end
|
122
|
-
|
123
|
-
if __FILE__ == $0
|
124
|
-
puts Tabula::LSD.detect_lines_in_pdf_page ARGV[0], ARGV[1].to_i
|
125
|
-
end
|
@@ -1,319 +0,0 @@
|
|
1
|
-
java_import org.apache.pdfbox.util.operator.OperatorProcessor
|
2
|
-
java_import org.apache.pdfbox.pdfparser.PDFParser
|
3
|
-
java_import org.apache.pdfbox.util.PDFStreamEngine
|
4
|
-
java_import org.apache.pdfbox.util.ResourceLoader
|
5
|
-
|
6
|
-
java_import java.awt.geom.PathIterator
|
7
|
-
java_import java.awt.geom.Point2D
|
8
|
-
java_import java.awt.geom.GeneralPath
|
9
|
-
java_import java.awt.geom.AffineTransform
|
10
|
-
java_import java.awt.Color
|
11
|
-
|
12
|
-
warn 'Tabula::Extraction::LineExtractor is DEPRECATED and will be removed'
|
13
|
-
|
14
|
-
class Tabula::Extraction::LineExtractor < org.apache.pdfbox.util.PDFStreamEngine
|
15
|
-
|
16
|
-
attr_accessor :currentX, :currentY
|
17
|
-
attr_accessor :currentPath
|
18
|
-
attr_accessor :rulings
|
19
|
-
attr_accessor :options
|
20
|
-
field_accessor :page
|
21
|
-
|
22
|
-
DETECT_LINES_DEFAULTS = {
|
23
|
-
:snapping_grid_cell_size => 2
|
24
|
-
}
|
25
|
-
|
26
|
-
def self.collapse_vertical_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
|
27
|
-
lines.sort!{|a, b| a.left != b.left ? a.left <=> b.left : a.top <=> b.top }
|
28
|
-
lines.inject([]) do |memo, next_line|
|
29
|
-
if memo.last && next_line.left == memo.last.left && memo.last.nearlyIntersects?(next_line)
|
30
|
-
memo.last.top = [next_line.top, memo.last.top].min
|
31
|
-
memo.last.bottom = [next_line.bottom, memo.last.bottom].max
|
32
|
-
memo
|
33
|
-
else
|
34
|
-
memo << next_line
|
35
|
-
end
|
36
|
-
end
|
37
|
-
end
|
38
|
-
|
39
|
-
def self.collapse_horizontal_rulings(lines) #lines should all be of one orientation (i.e. horizontal, vertical)
|
40
|
-
lines.sort!{|a, b| a.top != b.top ? a.top <=> b.top : a.left <=> b.left }
|
41
|
-
lines.inject([]) do |memo, next_line|
|
42
|
-
if memo.last && next_line.top == memo.last.top && memo.last.nearlyIntersects?(next_line)
|
43
|
-
memo.last.left = [next_line.left, memo.last.left].min
|
44
|
-
memo.last.right = [next_line.right, memo.last.right].max
|
45
|
-
memo
|
46
|
-
else
|
47
|
-
memo << next_line
|
48
|
-
end
|
49
|
-
end
|
50
|
-
end
|
51
|
-
|
52
|
-
#N.B. for merge `spreadsheets` into `text-extractor-refactor` --
|
53
|
-
# only substantive change here is calling Tabula::Ruling::clean_rulings on LSD output in this method
|
54
|
-
# the rest is readability changes.
|
55
|
-
#page_number here is zero-indexed
|
56
|
-
def self.lines_in_pdf_page(pdf_path, page_number, options={})
|
57
|
-
options = options.merge!(DETECT_LINES_DEFAULTS)
|
58
|
-
if options[:render_pdf]
|
59
|
-
# only LSD rulings need to be "cleaned" with clean_rulings; might as well do this here
|
60
|
-
# since there's no good reason want unclean lines
|
61
|
-
Tabula::Ruling::clean_rulings(Tabula::LSD::detect_lines_in_pdf_page(pdf_path, page_number, options))
|
62
|
-
else
|
63
|
-
pdf_file = ::Tabula::Extraction.openPDF(pdf_path)
|
64
|
-
page = pdf_file.getDocumentCatalog.getAllPages[page_number]
|
65
|
-
le = self.new(options)
|
66
|
-
le.processStream(page, page.findResources, page.getContents.getStream)
|
67
|
-
pdf_file.close
|
68
|
-
rulings = le.rulings.map do |l, color|
|
69
|
-
::Tabula::Ruling.new(l.getP1.getY,
|
70
|
-
l.getP1.getX,
|
71
|
-
l.getP2.getX - l.getP1.getX,
|
72
|
-
l.getP2.getY - l.getP1.getY,
|
73
|
-
color)
|
74
|
-
end
|
75
|
-
rulings.reject! { |l| (l.left == l.right && l.top == l.bottom) || [l.top, l.left, l.bottom, l.right].any? { |p| p < 0 } }
|
76
|
-
collapse_vertical_rulings(rulings.select(&:vertical?)) + collapse_horizontal_rulings(rulings.select(&:horizontal?))
|
77
|
-
end
|
78
|
-
end
|
79
|
-
-  class LineToOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      x, y = arguments[0], arguments[1]
-      ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
-
-      l = java.awt.geom.Line2D::Float.new(drawer.currentX, drawer.currentY, ppos.getX, ppos.getY)
-
-      drawer.currentPath << l if l.horizontal? or l.vertical?
-
-      drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
-    end
-  end
-
-  class MoveToOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      x, y = arguments[0], arguments[1]
-
-      ppos = drawer.TransformedPoint(x.floatValue, y.floatValue)
-
-      drawer.currentX, drawer.currentY = ppos.getX, ppos.getY
-    end
-  end
-
-  class AppendRectangleToPathOperator < OperatorProcessor
-    def process(operator, arguments)
-
-      drawer = self.context
-      finalX, finalY, finalW, finalH = arguments.to_array.map(&:floatValue)
-
-      ppos = drawer.TransformedPoint(finalX, finalY)
-      psize = drawer.ScaledPoint(finalW, finalH)
-
-      finalY = ppos.getY - psize.getY
-      if finalY < 0
-        finalY = 0
-      end
-
-      width = psize.getX.abs
-      height = psize.getY.abs
-
-      lines = if width > height && height < 2 # horizontal line, "thin" rectangle.
-                [java.awt.geom.Line2D::Float.new(ppos.getX, finalY + psize.getY/2, ppos.getX + psize.getX, finalY + psize.getY/2)]
-              elsif width < height && width < 2 # vertical line, "thin" rectangle
-                [java.awt.geom.Line2D::Float.new(ppos.getX + psize.getX/2, finalY, ppos.getX + psize.getX/2, finalY + psize.getY)]
-              else
-                # add every edge of the rectangle to drawer.rulings
-                [java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX + psize.getX, finalY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX, finalY, ppos.getX, finalY + psize.getY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX+psize.getX, finalY, ppos.getX + psize.getX, finalY + psize.getY),
-                 java.awt.geom.Line2D::Float.new(ppos.getX, finalY+psize.getY, ppos.getX + psize.getX, finalY + psize.getY)]
-              end
-
-      drawer.currentPath += lines.select { |l| l.horizontal? or l.vertical? }
-
-    end
-  end
-
-  class StrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      strokeColorComps = drawer.getGraphicsState.getStrokingColor.getJavaColor.getRGBColorComponents(nil)
-      color_filter = drawer.options[:line_color_filter] || lambda{|c| true } #by default, use all lines, regardless of color
-      if color_filter.call(strokeColorComps)
-        drawer.currentPath.each { |segment| drawer.addRuling(segment, strokeColorComps.to_a) }
-      end
-
-      drawer.currentPath = []
-    end
-  end
-
-  class CloseFillNonZeroAndStrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-
-      fillColorComps = drawer.getGraphicsState.getNonStrokingColor.getJavaColor.getRGBColorComponents(nil)
-      color_filter = drawer.options[:line_color_filter] || lambda{|c| true } #by default, use all lines, regardless of color
-      if color_filter.call(fillColorComps)
-        drawer.currentPath.each { |segment| drawer.addRuling(segment, fillColorComps.to_a) }
-      end
-
-      drawer.currentPath = []
-    end
-  end
-
-  class CloseAndStrokePathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      drawer.currentPath.each { |segment| drawer.addRuling(segment) }
-      drawer.currentPath = []
-    end
-  end
-
-  class EndPathOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      # end without stroke, we don't care about it. discard it
-      drawer.currentPath = []
-    end
-  end
-
-  class FillNonZeroRuleOperator < OperatorProcessor
-    def process(operator, arguments)
-      drawer = self.context
-      # end without stroke, we don't care about it. discard it
-      drawer.currentPath = []
-    end
-  end
-
-  OPERATOR_PROCESSORS = {
-    'm' => MoveToOperator.new,
-    're' => AppendRectangleToPathOperator.new,
-    'l' => LineToOperator.new,
-    'S' => StrokePathOperator.new,
-    's' => StrokePathOperator.new,
-    'n' => EndPathOperator.new,
-    'b' => CloseFillNonZeroAndStrokePathOperator.new,
-    'b*' => CloseFillNonZeroAndStrokePathOperator.new,
-    'f' => CloseFillNonZeroAndStrokePathOperator.new,
-    'f*' => CloseFillNonZeroAndStrokePathOperator.new,
-    'BT' => org.apache.pdfbox.util.operator.BeginText.new,
-    'cm' => org.apache.pdfbox.util.operator.Concatenate.new,
-    'CS' => org.apache.pdfbox.util.operator.SetStrokingColorSpace.new,
-    'cs' => org.apache.pdfbox.util.operator.SetNonStrokingColorSpace.new,
-    'ET' => org.apache.pdfbox.util.operator.EndText.new,
-    'G' => org.apache.pdfbox.util.operator.SetStrokingGrayColor.new,
-    'g' => org.apache.pdfbox.util.operator.SetNonStrokingGrayColor.new,
-    'gs' => org.apache.pdfbox.util.operator.SetGraphicsStateParameters.new,
-    'K' => org.apache.pdfbox.util.operator.SetStrokingCMYKColor.new,
-    'k' => org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor.new,
-    'q' => org.apache.pdfbox.util.operator.GSave.new,
-    'Q' => org.apache.pdfbox.util.operator.GRestore.new,
-    'RG' => org.apache.pdfbox.util.operator.SetStrokingRGBColor.new,
-    'rg' => org.apache.pdfbox.util.operator.SetNonStrokingRGBColor.new,
-    'SC' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
-    'sc' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
-    'SCN' => org.apache.pdfbox.util.operator.SetStrokingColor.new,
-    'scn' => org.apache.pdfbox.util.operator.SetNonStrokingColor.new,
-    'T*' => org.apache.pdfbox.util.operator.NextLine.new,
-    'Tc' => org.apache.pdfbox.util.operator.SetCharSpacing.new,
-    'Td' => org.apache.pdfbox.util.operator.MoveText.new,
-    'TD' => org.apache.pdfbox.util.operator.MoveTextSetLeading.new,
-    'Tf' => org.apache.pdfbox.util.operator.SetTextFont.new,
-    'Tj' => org.apache.pdfbox.util.operator.ShowText.new,
-    'TJ' => org.apache.pdfbox.util.operator.ShowTextGlyph.new,
-    'TL' => org.apache.pdfbox.util.operator.SetTextLeading.new,
-    'Tm' => org.apache.pdfbox.util.operator.SetMatrix.new,
-    'Tr' => org.apache.pdfbox.util.operator.SetTextRenderingMode.new,
-    'Ts' => org.apache.pdfbox.util.operator.SetTextRise.new,
-    'Tw' => org.apache.pdfbox.util.operator.SetWordSpacing.new,
-    'Tz' => org.apache.pdfbox.util.operator.SetHorizontalTextScaling.new,
-    "\'" => org.apache.pdfbox.util.operator.MoveAndShow.new,
-    '\"' => org.apache.pdfbox.util.operator.SetMoveAndShow.new,
-  }
-
-  def initialize(options={})
-    super()
-    @options = options.merge!(DETECT_LINES_DEFAULTS)
-    self.clear!
-    OPERATOR_PROCESSORS.each { |k,v| registerOperatorProcessor(k, v) }
-  end
-
-  def clear!
-    self.rulings = []
-    self.currentX = -1
-    self.currentY = -1
-    self.currentPath = []
-    @pageSize = nil
-  end
-
-  def addRuling(ruling, color=nil)
-    color = color.nil? ? [0,0,0] : color
-    if !page.getRotation.nil? && [90, -270, -90, 270].include?(page.getRotation)
-
-      mb = page.findMediaBox
-
-      ruling.rotate!(mb.getLowerLeftX, mb.getLowerLeftY, page.getRotation)
-
-      trans = if page.getRotation == 90 || page.getRotation == -270
-                AffineTransform.getTranslateInstance(mb.getHeight, 0)
-              else
-                AffineTransform.getTranslateInstance(0, mb.getWidth)
-              end
-      ruling.transform!(trans)
-    end
-
-    # snapping to grid and joining lines that are close together
-    ruling.snap!(options[:snapping_grid_cell_size])
-
-    self.rulings << [ruling, color]
-  end
-
-  ##
-  # get current page size
-  def pageSize
-    @pageSize ||= self.page.findMediaBox.createDimension
-  end
-
-  ##
-  # fix the Y coordinate based on page rotation
-  def fixY(y)
-    pageSize.getHeight - y
-  end
-
-  def ScaledPoint(*args)
-    x, y = args[0], args[1]
-
-    # if scale factor not provided, get it from current transformation matrix
-    if args.size == 2
-      ctm = getGraphicsState.getCurrentTransformationMatrix
-      at = ctm.createAffineTransform
-      scaleX = at.getScaleX; scaleY = at.getScaleY
-    else
-      scaleX = args[2]; scaleY = args[3]
-    end
-
-    finalX = 0.0;
-    finalY = 0.0;
-
-    if scaleX > 0
-      finalX = x * scaleX;
-    end
-    if scaleY > 0
-      finalY = y * scaleY;
-    end
-
-    return java.awt.geom.Point2D::Float.new(finalX, finalY);
-
-  end
-
-  def TransformedPoint(x, y)
-    position = [x,y].to_java(:float)
-    at = self.getGraphicsState.getCurrentTransformationMatrix.createAffineTransform
-    at.transform(position, 0, position, 0, 1)
-    position[1] = fixY(position[1])
-    java.awt.geom.Point2D::Float.new(position[0], position[1])
-  end
-
-end
data/lib/tabula/pdf_render.rb
DELETED
@@ -1,64 +0,0 @@
-require 'java'
-
-java_import org.apache.pdfbox.pdmodel.PDDocument
-java_import org.apache.pdfbox.pdfviewer.PageDrawer
-java_import java.awt.image.BufferedImage
-java_import javax.imageio.ImageIO
-java_import java.awt.Dimension
-java_import java.awt.Color
-
-module Tabula
-  module Render
-
-    # render a PDF page to a graphics context, but skip rendering the text
-    # This is done to reduce 'noise' introduced by the text, we only
-    # care about lines.
-    class PageDrawerNoText < PageDrawer
-      def processTextPosition(text)
-      end
-    end
-
-    #ugh jruby; suppresses "ambiguous method" warning that arises due to Java's overloaded constructor.
-    TRANSPARENT_WHITE = java.awt.Color.java_class.constructor(Java::int, Java::int, Java::int, Java::int).new_instance(255, 255, 255, 0)
-
-    # 2048 width is important, if this is too small, thin lines won't be drawn.
-    def self.pageToBufferedImage(page, width=2048, pageDrawerClass=PageDrawerNoText)
-      cropbox = page.findCropBox
-      widthPt, heightPt = cropbox.getWidth, cropbox.getHeight
-      pageDimension = Dimension.new(widthPt, heightPt)
-      rotation = java.lang.Math.toRadians(page.findRotation)
-
-      scaling = width / (rotation == 0 ? widthPt : heightPt)
-      widthPx, heightPx = (java.lang.Math.java_send :round, [Java::float], widthPt * scaling ), (java.lang.Math.java_send :round, [Java::float], heightPt * scaling)
-
-
-      retval = if rotation != 0
-                 BufferedImage.new(heightPx, widthPx, BufferedImage::TYPE_BYTE_GRAY)
-               else
-                 BufferedImage.new(widthPx, heightPx, BufferedImage::TYPE_BYTE_GRAY)
-               end
-      graphics = retval.getGraphics()
-      graphics.setBackground(TRANSPARENT_WHITE)
-      graphics.clearRect(0, 0, retval.getWidth, retval.getHeight)
-      if rotation != 0
-        graphics.java_send :translate, [Java::int, Java::int], retval.getWidth, 0.0
-        graphics.rotate(rotation)
-      end
-      graphics.scale(scaling, scaling)
-      drawer = pageDrawerClass.new()
-      drawer.drawPage(graphics, page, pageDimension)
-      graphics.dispose
-
-      return retval
-    end
-  end
-end
-
-# testing
-if __FILE__ == $0
-  pdf_file = PDDocument.loadNonSeq(java.io.File.new(ARGV[0]), nil)
-  bi = Tabula::Render.pageToBufferedImage(pdf_file.getDocumentCatalog.getAllPages[ARGV[1].to_i - 1])
-  puts bi.class
-  ImageIO.write(bi, 'png',
-                java.io.File.new('notext.png'))
-end