pdf-extract 0.0.10 → 0.1.0

@@ -22,8 +22,7 @@ resolvers = {
 
   outputs = {
   :xml => proc { :stdout },
- :pdf => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.pdf" },
- :png => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.png" }
+ :pdf => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.pdf" }
   }
 
   commands = [
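
The hunk above drops the `:png` output and keeps only the `:pdf` mask writer. A minimal sketch of what that proc computes (just the filename mapping, reproduced from the hunk; the surrounding `outputs` table is not needed):

```ruby
# The :pdf output proc from the hunk above: strip the final file extension,
# take the basename, and append ".mask.pdf".
pdf_name = proc { |f| File::basename(f.sub(/\.[a-zA-Z0-9]+\Z/, "")) + ".mask.pdf" }

puts pdf_name.call("docs/paper.pdf")  # => paper.mask.pdf
puts pdf_name.call("scan.tiff")       # => scan.mask.pdf
```

Note that the directory part is discarded by `File::basename`, so the mask file lands in the current working directory regardless of where the input lives.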
Binary file
Binary file
Binary file
Binary file
Binary file
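
The new XML fixture below carries extracted paper text that defines an area-overlap measure between ground-truth and detected table bounding boxes, A(G_i, D_j) = 2|G_i ∩ D_j| / (|G_i| + |D_j|). A minimal sketch of that measure over axis-aligned boxes (the `Box` struct and helper names are illustrative assumptions, not part of this package):

```ruby
# Hypothetical illustration of the overlap measure described in the fixture text:
# A = 2 * |G ∩ D| / (|G| + |D|), ranging from 0 (disjoint) to 1 (identical).
Box = Struct.new(:x, :y, :w, :h) do
  def area = w * h
end

# Area of intersection of two axis-aligned boxes, 0 if they do not overlap.
def intersection_area(a, b)
  ix = [a.x + a.w, b.x + b.w].min - [a.x, b.x].max
  iy = [a.y + a.h, b.y + b.h].min - [a.y, b.y].max
  ix.positive? && iy.positive? ? ix * iy : 0
end

def overlap(g, d)
  2.0 * intersection_area(g, d) / (g.area + d.area)
end

g = Box.new(0, 0, 10, 10)
puts overlap(g, g)                      # identical boxes  => 1.0
puts overlap(g, Box.new(20, 20, 5, 5))  # disjoint boxes   => 0.0
```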
@@ -0,0 +1,368 @@
+ <?xml version="1.0"?>
+ <pdf>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.06" year_ratio="0.0"
+ cap_ratio="0.2" name_ratio="0.172" word_count="250" lateness="0.125"
+ reference_score="3.94">ABSTRACT Detecting tables in document images is important since not only
+ do tables contain important information, but also most of the layout analysis methods fail in
+ the presence of tables in the document image. Existing approaches for table detection mainly
+ focus on detecting tables in single columns of text and do not work reliably on documents with
+ varying layouts. This paper presents a practical algorithm for table detection that works with a
+ high accuracy on documents with varying layouts (company reports, newspaper articles, magazine
+ pages, . . . ). An open source implementation of the algorithm is provided as part of the
+ Tesseract OCR engine. Evaluation of the algorithm on document images from the publicly available
+ UNLV dataset shows competitive performance in comparison to the table detection module of a
+ commercial OCR system. Categories and Subject Descriptors I.7.5 [Document and Text Processing]:
+ Document Capture - Document Analysis Keywords page segmentation, table detection, document
+ analysis 1. INTRODUCTION Automatic conversion of paper documents into an editable electronic
+ representation relies on optical character recognition (OCR) technology. A typical OCR system
+ consists of three major steps. First, layout analysis is performed to locate text-lines in the
+ document image and to identify their reading order. Then, a character recognition engine
+ processes the text-line images and generates a text string by recognizing individual characters
+ in the text-line image. Finally, a language modeling module makes corrections in the text string
+ using a dictionary or a language model. (The author gratefully acknowledges funding from Google
+ Inc. for supporting this work.)<component x="53.8" y="466.55" width="239.12" height="167.4"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="419.88"
+ width="239.11" height="31.41" page="1" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="383.67" width="219.51" height="20.95"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="225.03"
+ width="239.12" height="143.37" page="1" page_width="595.28"
+ page_height="841.89"></component></section>
+ <section line_height="7.13" font="OFVLTP+NimbusRomNo9L-Regu" letter_ratio="0.14"
+ year_ratio="0.0" cap_ratio="0.15" name_ratio="0.2261904761904762" word_count="84"
+ lateness="0.125" reference_score="4.77">Permission to make digital or hard copies of all or part
+ of this work for personal or classroom use is granted without fee provided that copies are not
+ made or distributed for profit or commercial advantage and that copies bear this notice and the
+ full citation on the first page. To copy otherwise, to republish, to post on servers or to
+ redistribute to lists, requires prior specific permission and/or a fee. DAS '10, June 9-11,
+ 2010, Boston, MA, USA Copyright 2010 ACM 978-1-60558-773-8/10/06 ...$10.00<component x="53.8"
+ y="120.67" width="239.1" height="69.89" page="1" page_width="595.28"
+ page_height="841.89"></component></section>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.05" year_ratio="0.0"
+ cap_ratio="0.08" name_ratio="0.20614035087719298" word_count="1140" lateness="0.375"
+ reference_score="2.03">Since layout analysis is the first step in such a process, all subsequent
+ stages rely on layout analysis to work correctly. One of the major challenges faced by layout
+ analysis is detecting table regions. Table detection is a hard problem since tables have a large
+ variation in their layouts. Existing open-source OCR systems lack the capability of table
+ detection and their layout analysis modules break down in the presence of table regions. A
+ distinction should be made at this stage between table detection and table recognition [8].
+ Table detection deals with the problem of finding boundaries of tables in a page image. Table
+ recognition, on the other hand, focuses on analyzing a detected table by finding its rows and
+ columns and tries to extract the structure of the table. Our focus in this paper is on the table
+ detection problem. Wang et al. [20] take a statistical learning approach to the table detection
+ problem. Given a set of candidate text-lines, candidate table lines are identified based on gaps
+ between consecutive words. Then, vertically adjacent lines with large gaps and horizontally
+ adjacent words are grouped together to make table entity candidates. Finally, a statistical
+ learning algorithm is used to refine the table candidates and reduce false alarms. They
+ make the assumption that the maximum number of columns is two and design three templates of page
+ layout (single column, double column, mixed column). They apply a column style classification
+ algorithm to find out the column layout of the page and use this information as a priori
+ knowledge for spotting table regions. This approach can handle only those layouts on which it
+ has been trained. Besides, training the algorithm requires a large amount of labeled data. Hu et
+ al. [6] presented a system for table detection from scanned page images or from plain text
+ documents. Their system assumes a single-column input page that can be easily segmented into
+ individual text-lines (for instance by horizontal projection). The table detection problem is
+ then posed as an optimization problem where start and end text-lines belonging to a table are
+ identified by optimizing some quality function. Like previous approaches, this technique cannot
+ be applied to multi-column documents. Cesarini et al. [2] present a system for locating table
+ regions by detecting parallel lines. The table hypotheses formed in this way are then verified by
+ locating perpendicular lines or white spaces in the region included between the parallel lines.
+ However, relying only on horizontal or vertical lines for table detection limits the scope of
+ the system since not all tables have such lines. More recent work in table detection is reported
+ by Gatos et al. [4] and Costa e Silva [3]. Gatos et al. [4] focus on locating tables that have
+ both horizontal and vertical rulings and find their intersection points. Then, table
+ reconstruction is achieved by drawing the corresponding horizontal and vertical lines that
+ connect all line intersection pairs. The system works well for their target documents but
+ cannot be used when the table's rows/columns are not separated by ruling lines. The work of
+ Costa e Silva [3] focuses on extracting table regions from PDF documents using Hidden Markov
+ Models (HMMs). They extract text from the PDF using the pdftotext Linux utility. The spaces in
+ the extracted text are used for computing the feature vector. Clearly, this approach would not
+ work for document images. Summarizing the state of the art in table detection, we can see a
+ clear limitation of existing methods: they do not work well on multi-column document images.
+ This is probably due to the fact that most of the existing approaches focus on table recognition
+ to extract the structure (rows, columns, cells) of the tables and hence make some simplifying
+ assumptions on the table detection part. This approach works well when one has to deal with some
+ specific classes of document images having simple layouts. However, more robust table detection
+ algorithms are needed when dealing with a heterogeneous collection of documents. In this paper,
+ we try to bridge this gap. Our goal is to accurately spot table regions in complex heterogeneous
+ documents (company reports, journal articles, newspapers, magazines, . . . ). Once table regions
+ are spotted, one of the existing table recognition techniques (e.g. [10]) could be used to
+ extract the structure of the tables. The rest of this paper is organized as follows. First, we
+ describe in Section 2 the layout analysis module of Tesseract [18, 19] that is used as the
+ basis of our table detection algorithm. Then, our table detection algorithm is illustrated in
+ Section 3. Different performance measures used to evaluate our system are presented in Section 4.
+ Experimental results and discussion are given in Section 5, followed by a conclusion in Section
+ 6. 2. LAYOUT ANALYSIS VIA TAB-STOP DETECTION The layout analysis of Tesseract is a recent
+ addition to the open source OCR system [19]. It is based on the idea of detecting tab-stops in a
+ document image. When type-setting a document, tab-stops are the locations where text aligns
+ (left, right, center, decimal, . . . ). Therefore, tab-stops can be used as a reliable indication
+ of where a text block starts or ends. Finding the layout of the page via tab-stop detection
+ proceeds as follows (see Figure 1 for illustration): • First, a document image pre-processing
+ step is performed to identify horizontal and vertical ruling lines or separators and to locate
+ half-tone or image regions in the document. Then, a connected component analysis is performed to
+ identify candidate text components based on their size and stroke width. • The filtered text
+ components are evaluated as candidates for lying on a tab-stop position. These candidates are
+ grouped into vertical lines to find tab-stop positions that are vertically aligned. As a final
+ step, pairs of connected tab lines are adjusted such that they end at the same y-coordinate (see
+ Figure 1(a)). At this stage, vertical tab lines mark the start and end of text regions. • Based
+ on the tab-lines, the column layout of the page is inferred and connected components are grouped
+ into Column Partitions. A column partition is a sequence of connected components that do not
+ cross any tab line and are of the same type (text, image, . . . ). Text column partitions can be
+ regarded as initial candidates for text-lines (see Figure 1(b)). • The last step creates flows
+ of column partitions such that neighboring column partitions of the same type are grouped into
+ the same block (Figure 1(c)). Text column partitions having different font size and line spacing
+ are grouped into different blocks. Then, the reading order of these blocks is identified. The
+ boundary of the blocks is represented as an isothetic polygon (a polygon that has all edges
+ parallel to the axes). 3. TABLE SPOTTING Our table detection algorithm is built upon two
+ components of the layout analysis module:<component x="316.81" y="477.51" width="239.11"
+ height="154.41" page="1" page_width="595.28" page_height="841.89"></component><component
+ x="316.81" y="174.15" width="239.12" height="164.87" page="1" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="132.3" width="239.11" height="28.88"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="480.09"
+ width="239.11" height="60.27" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="155.81" width="239.12" height="206.72"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="53.8" y="124.43"
+ width="239.11" height="18.42" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="385.95" width="239.12" height="154.41"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="316.81" y="291.8"
+ width="239.12" height="81.19" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="170.93" width="239.64" height="108.12"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="330.14" y="122.84"
+ width="225.79" height="29.39" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="76.21" y="476.58" width="216.69" height="28.88"
+ page="3" page_width="595.28" page_height="841.89"></component><component x="67.12" y="385.62"
+ width="225.79" height="81.69" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="67.12" y="305.12" width="225.79" height="71.23"
+ page="3" page_width="595.28" page_height="841.89"></component><component x="67.12" y="214.16"
+ width="225.79" height="81.69" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="168.72" width="239.11" height="31.41"
+ page="3" page_width="595.28" page_height="841.89"></component></section>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.1" year_ratio="0.0"
+ cap_ratio="0.16" name_ratio="0.20500894454382826" word_count="2795" lateness="1.0"
+ reference_score="9.47">3.1 Identifying Table Partitions The first step in our algorithm
+ identifies text column partitions that could belong to a table region, referred to as table
+ partitions. Based on the observations mentioned in the previous paragraph, three types of
+ partitions are marked as table partitions: (1) partitions that have at least one large gap
+ between their connected components, (2) partitions that consist of only one word (no significant
+ gap between components), (3) partitions that overlap along the y-axis with other partitions
+ within the same column. The first case identifies table partitions that result from merging
+ cells from different columns of a table into one partition. The second case detects table
+ partitions that consist of a single data cell. The third case identifies table partitions that
+ lie in one column but were not joined together due to the presence of a strong tab-line. This
+ stage tries to find table partition candidates quite aggressively. This has the advantage that
+ even small evidence of the presence of a table is not missed, since any tables that are missed
+ at this stage will not be recoverable at later stages. The disadvantage of the aggressive
+ approach is that several false alarms may originate, for instance from single word section
+ headings, page headers and footers, numbered equations, small parts of text words in the
+ marginal noise, and line drawing regions. A smoothing filter is applied that detects isolated
+ table partitions that have no other table partition neighbor above or below them. These
+ partitions are removed from the candidate table partition list. The candidate table partitions
+ for our example image are shown in Figure 3(a). 3.2 Detecting Page Column Split The next step is
+ to detect a split in the column layout of the page due to the presence of a table. Such a split
+ occurs when the cells of the table are very well aligned. To detect this case, we divide the
+ page into columns and find the ratio of table partitions in each column. Table columns that were
+ erroneously reported as page columns are easily detected since they have a high ratio of table
+ partitions as compared to normal text partitions. However, extra care needs to be taken at this
+ stage to undo a column split (i.e. to merge two columns) since a wrong decision would result in
+ merging two text columns, leading to a large number of errors in page layout analysis itself.
+ Therefore, we undo a page column split only if a sufficient number of text partitions spanning
+ the two columns are present and the split in the columns starts with table partitions. This
+ extra care prevents merging table columns in full-page tables when there is no flowing text in
+ the page. Since the cost of a wrong decision here is very high in terms of layout analysis
+ errors, we chose to perform this step defensively. 3.3 Locating Table Columns The goal of this
+ step is to group table partitions into table columns. For this purpose, runs of vertically
+ neighboring table partitions are assigned to a single table column. If a column partition of
+ type "horizontal ruling" is encountered, the run continues. When a partition of any other type
+ is found, the table column obtained so far is finalized. If a table column consists of only one
+ table partition, it is removed as a false alarm. The identified table columns for the example
+ image are shown in Figure 3(b). 3.4 Marking Table Regions Table columns obtained in the previous
+ steps give a strong hint about the presence of a table in that region. We make a simple
+ assumption here: within a single page column, flowing text does not share space with a table
+ along the y-axis. This assumption holds true for most of the layouts that we encounter in
+ practice since, if a table shares space vertically with flowing text, it is hard to see whether
+ the text belongs to the table or not. Based on this assumption, we horizontally expand the
+ boundaries of table columns to the page columns that contain them. Hence we obtain within-column
+ table regions for each page column. At this stage, tables that are laid out within one column
+ are correctly identified. However, tables spanning multiple page columns are over-segmented.
+ Although two table regions in neighboring page columns could be merged if their start and end
+ positions align, this might wrongly merge different tables in the two columns. Therefore a merge
+ is carried out only if at least one column partition of any type (text, table, horizontal
+ ruling) is found that overlaps with both tables. Table partitions and horizontal ruling
+ partitions that are not included in any table and are directly above or below a table region
+ with a large overlap along the x-axis are also included in the neighboring table. The table
+ regions thus obtained for the example image are shown in Figure 3(c). 3.5 Removing False Alarms
+ Although most of the false alarms originating from normal text regions are removed in previous
+ stages, other sources of false alarms like marginal noise [17] and figures still remain.
+ Therefore the identified table regions are passed through a simple validity test: a valid table
+ should have at least two columns. False alarms consisting of a single column are removed by
+ analyzing their projection on the x-axis. The projection of a valid table on the x-axis should
+ have at least one zero-valley larger than the global median x-height of the page. Therefore,
+ table candidates that do not have such a zero-valley in their vertical projection are removed.
+ 4. PERFORMANCE MEASURES Different performance measures have been reported in the literature for
+ evaluating table detection algorithms. These range from simple precision and recall based
+ measures [6, 13] to more sophisticated measures for benchmarking complete table structure
+ extraction algorithms [8]. In this paper, since we are only focusing on table spotting, we use
+ standard measures for document image segmentation focusing on the table regions. Hence, in
+ accordance with [13, 14, 16, 20], we use several measures for quantitatively evaluating
+ different aspects of our table spotting algorithm. Both ground-truth tables and tables detected
+ by our algorithm are represented by their bounding boxes. Let G_i represent the bounding box of
+ the i-th ground-truth table and D_j the bounding box of the j-th detected table in a document
+ image. The amount of overlap between the two is defined as: A(G_i, D_j) = 2|G_i ∩ D_j| /
+ (|G_i| + |D_j|) (1), where |G_i ∩ D_j| represents the area of intersection of the two zones,
+ and |G_i|, |D_j| represent the individual areas of the ground-truth and the detected tables.
+ The amount of area overlap A_ij will vary between zero and one depending on the overlap between
+ ground-truth table G_i and detected table D_j. If the two tables do not overlap at all,
+ A_ij = 0, and if the two tables match perfectly, i.e. |G_i ∩ D_j| = |G_i| = |D_j|, then
+ A_ij = 1. • Partial Detections: These are the number of ground-truth tables that have a
+ one-to-one correspondence with a detected table; however, the amount of overlap is not large
+ enough (0.1 &lt; A &lt; 0.9) to be classified as a correct detection (see Figure 4(a)). •
+ Over-Segmented Tables: These are the number of ground-truth tables that have a major overlap
+ (0.1 &lt; A &lt; 0.9) with more than one detected table. This indicates that different parts of
+ the ground-truth table were detected as separate tables (see Figure 4(b)). • Missed Tables:
+ These are the number of ground-truth tables that do not have a major overlap with any of the
+ detected tables (A &lt; 0.1). These tables are regarded as missed by the detection algorithm. •
+ False Positive Detections: These are the number of detected tables that do not have a major
+ overlap with any of the ground-truth tables (A &lt; 0.1). These tables are regarded as false
+ positive detections since the system mistook some non-table region as a table (see Figure
+ 4(d)). • Area Precision: While the measures defined above help in understanding which types of
+ errors were made by the table detection algorithm, the goal of this measure is to summarize the
+ performance of the algorithm by measuring what percentage of the detected table regions
+ actually belongs to a table region in the ground-truth image. A high precision is achieved when
+ the decision about the presence of a table region is made very conservatively. • Area Recall:
+ This measure evaluates the percentage of the ground-truth table regions that was marked as
+ belonging to a table by the algorithm. The concept of the precision and recall measures is
+ similar to their use in the information retrieval community [13]. 5. EXPERIMENTS AND RESULTS To
+ evaluate the performance of our table detection algorithm, we chose the UNLV dataset [1]. The
+ UNLV dataset contains a large variety of documents ranging from technical reports and business
+ letters to newspapers and magazines. The dataset was specifically created to analyze the
+ performance of leading commercial OCR systems in the UNLV annual tests of OCR accuracy [15]. It
+ contains more than 10,000 scanned pages at different resolutions and 1000 fax documents. The
+ scanned pages are categorized into bi-tonal and greyscale documents. The bi-tonal documents are
+ again grouped into different scan resolutions (200, 300, and 400 dpi). For each page,
+ manually-keyed ground-truth text is provided, along with manually-determined zone information.
+ The zones are further labeled according to their contents (text, table, half-tone, . . . ). We
+ picked bi-tonal documents in the 300 dpi class for our experiments since this represents the
+ most common setting for scanning documents. Among these images, 427 pages containing table
+ zones were selected. These page images were further split into a training set of 213 images and
+ a test set of 214 images. The training images were used in the development of the algorithm,
+ and different steps of the algorithm were extensively evaluated on these images. The test
+ images were used in the end to evaluate the complete system. Results of our table detection
+ algorithm on some sample images from the UNLV dataset are shown in Figure 5. Detailed
+ evaluation of the algorithm and its comparison with a state-of-the-art commercial OCR system is
+ given in Table 1 and Figure 6. It should be noted that the ground-truth table zones provided
+ with the UNLV dataset also include the table caption inside the zone. Since a table caption is
+ not a tabular structure, it is left out of the table by all OCR systems. Therefore, we edited
+ the ground-truth information by manually marking the table caption regions in all documents.
+ Then this region was excluded from the ground-truth table zones provided with the dataset. This
+ was achieved by shrinking the ground-truth table zones to tightly enclose all foreground pixels
+ that were not part of the table caption. The experimental results show that our system was able
+ to spot table regions with a precision of 86% on the test data. The recall was also quite high
+ (79%), showing a good compromise between precision and recall. The commercial OCR system, on
+ the other hand, had a lower recall (37%) but higher precision (96%). Figure 6: A bar chart
+ comparing the accuracy of the proposed table detection system with that of a commercial OCR
+ system on the UNLV test set (214 pages containing 268 tables). Some of the errors made by our
+ algorithm are shown in Figure 4. An analysis of the results shows that the major source of
+ errors is full-page tables. In these cases, the column finding algorithm reports several
+ columns of text. Since newspapers also have several text columns, without using a priori
+ knowledge about the type of document (report, newspaper, . . . ) it is hard to detect that the
+ large number of columns is due to a full-page table. One typical example is a page containing a
+ "table of contents". Such pages are marked as table regions in the ground-truth information
+ provided with the UNLV dataset. However, our algorithm regards them as regular text pages,
+ hence either missing these "tables" completely or partially detecting them. The false positive
+ detections made by our algorithm were also analyzed. We noticed an interesting side-effect of
+ our algorithm. Since many graphics regions have text inside them that is spaced apart, such
+ regions were also spotted as tables. Although such cases were reported as false alarms, in some
+ cases it might be beneficial to additionally spot graphics regions as well. Other cases of
+ false alarms originated from tabulated equations. False alarms in pure text regions were quite
+ rare. 6. CONCLUSION This paper presented a table detection algorithm as part of the Tesseract
+ open source OCR system. The presented algorithm uses components of the layout analysis module
+ of Tesseract to locate tables in documents having a large variety of layouts. Experimental
+ results on different classes of documents (company reports, journal articles, newspaper
+ articles, magazine pages) from the UNLV dataset showed that our table detection algorithm
+ competes well with that of a commercial OCR system, with a much higher recall and slightly
+ lower precision. We plan to extend this work in the direction of table structure extraction in
+ the future. Figure 5: Some sample images from the UNLV dataset showing the table spotting
+ results of our algorithm. 7. REFERENCES [1] http://www.isri.unlv.edu/ISRI/OCRtk. [2] F.
+ Cesarini, S. Marinai, L. Sarti, and G. Soda. Trainable table location in document images. In
+ Proc. Int. Conf. on Pattern Recognition, pages 236-240, Quebec, Canada, Aug. 2002. [3] A. C. e
+ Silva. Learning rich hidden Markov models in document analysis: Table location. In Proc. Int.
+ Conf. on Document Analysis and Recognition, pages 843-847, Barcelona, Spain, July 2009. [4] B.
+ Gatos, D. Danatsas, I. Pratikakis, and S. J. Perantonis. Automatic table detection in document
+ images. In Proc. Int. Conf. on Advances in Pattern Recognition, pages 612-621, Bath, UK, Aug.
+ 2005. [5] I. Guyon, R. M. Haralick, J. J. Hull, and I. T. Phillips. Data sets for OCR and
+ document image understanding research. In H. Bunke and P. Wang, editors, Handbook of character
+ recognition and document image analysis, pages 779-799. World Scientific, Singapore, 1997. [6]
+ J. Hu, R. Kashi, D. Lopresti, and G. Wilfong. Medium-independent table detection. In Proc. SPIE
+ Document Recognition and Retrieval VII, pages 291-302, San Jose, CA, USA, Jan. 2000. [7] J. Hu,
+ R. S. Kashi, D. Lopresti, and G. Wilfong. Experiments in table recognition. In Proc. Int.
+ Workshop on Document Layout Interpretation and Applications, Seattle, WA, USA, Sep. 2001. [8]
+ J. Hu, R. S. Kashi, D. Lopresti, and G. Wilfong. Evaluating the performance of table processing
+ algorithms. Int. Jour. on Document Analysis and Recognition, 4(3):140-153, 2002. [9] D.
+ Keysers, F. Shafait, and T. M. Breuel. Document image zone classification - a simple
+ high-performance approach. In 2nd Int. Conf. on Computer Vision Theory and Applications, pages
+ 44-51, Barcelona, Spain, Mar. 2007. [10] T. Kieninger and A. Dengel. A paper-to-HTML table
+ converting system. In Proc. Document Analysis Systems, pages 356-365, Nagano, Japan, Nov. 1998.
+ [11] T. Kieninger and A. Dengel. Table recognition and labeling using intrinsic layout
+ features. In Proc. Int. Conf. on Advances in Pattern Recognition, Plymouth, UK, Nov. 1998. [12]
+ T. Kieninger and A. Dengel. Applying the T-RECS table recognition system to the business letter
+ domain. In Proc. Int. Conf. on Document Analysis and Recognition, pages 518-522, Seattle, WA,
+ USA, Sep. 2001. [13] T. Kieninger and A. Dengel. An approach towards benchmarking of table
+ structure recognition results. In Proc. 8th Int. Conf. on Document Analysis and Recognition,
+ pages 1232-1236, Seoul, Korea, Aug. 2005. [14] S. Mandal, S. Chowdhury, A. Das, and B. Chanda.
+ A simple and effective table detection system from document images. Int. Jour. on Document
+ Analysis and Recognition, 8(2-3):172-182, 2006. [15] S. V. Rice, F. R. Jenkins, and T. A.
+ Nartker. The fourth annual test of OCR accuracy. Technical report, Information Science
+ Research Institute, University of Nevada, Las Vegas, 1995. [16] F. Shafait, D. Keysers, and T.
+ M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE
+ Trans. on Pattern Analysis and Machine Intelligence, 30(6):941-954, 2008. [17] F. Shafait, J.
+ van Beusekom, D. Keysers, and T. M. Breuel. Document cleanup using page frame detection. Int.
+ Jour. on Document Analysis and Recognition, 11(2):81-96, 2008. [18] R. Smith. An overview of
+ the Tesseract OCR engine. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages
+ 629-633, Curitiba, Brazil, Sep. 2007. [19] R. Smith. Hybrid page layout analysis via tab-stop
+ detection. In Proc. Int. Conf. on Document Analysis and Recognition, pages 241-245, Barcelona,
+ Spain, July 2009. [20] Y. Wang, R. Haralick, and I. T. Phillips. Automatic table ground truth
+ generation and a background-analysis-based table structure extraction method. In Proc. Int.
+ Conf. on Document Analysis and Recognition, pages 528-532, Seattle, WA, USA, Sep. 2001. [21] Y.
+ Wang, I. Phillips, and R. Haralick. Document zone content classification and its performance
+ evaluation. Pattern Recognition, 39(1):57-73, 2006.<component x="316.81" y="122.84"
+ width="239.11" height="83.71" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="455.66" width="239.12" height="81.19"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="53.8" y="298.75"
+ width="239.12" height="143.95" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="143.76" width="239.12" height="136.02"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="53.8" y="122.84"
+ width="239.11" height="7.96" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="466.12" width="239.11" height="70.73"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="316.81" y="346.79"
+ width="239.12" height="104.63" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="206.53" width="239.11" height="125.55"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="316.81" y="122.84"
+ width="239.11" height="70.73" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="506.1" width="239.12" height="60.27"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="53.8" y="366.62"
+ width="239.11" height="125.55" page="5" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="237.59" width="239.11" height="115.09"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="53.8" y="122.42"
+ width="239.11" height="102.21" page="5" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="495.64" width="239.11" height="71.23"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="330.14" y="388.41"
345
+ width="226.7" height="50.31" page="5" page_width="595.28"
346
+ page_height="841.89"></component><component x="330.14" y="329.86" width="225.78" height="50.31"
347
+ page="5" page_width="595.28" page_height="841.89"></component><component x="330.14" y="191.85"
348
+ width="225.79" height="39.85" page="5" page_width="595.28"
349
+ page_height="841.89"></component><component x="330.14" y="122.84" width="225.79" height="60.77"
350
+ page="5" page_width="595.28" page_height="841.89"></component><component x="67.12" y="692.7"
351
+ width="225.78" height="92.15" page="6" page_width="595.28"
352
+ page_height="841.89"></component><component x="67.12" y="625.6" width="225.78" height="50.31"
353
+ page="6" page_width="595.28" page_height="841.89"></component><component x="53.8" y="342.52"
354
+ width="239.32" height="261.54" page="6" page_width="595.28"
355
+ page_height="841.89"></component><component x="53.8" y="122.84" width="239.11" height="206.72"
356
+ page="6" page_width="595.28" page_height="841.89"></component><component x="316.81" y="536.37"
357
+ width="239.11" height="39.34" page="6" page_width="595.28"
358
+ page_height="841.89"></component><component x="316.81" y="376.11" width="239.11" height="133.49"
359
+ page="6" page_width="595.28" page_height="841.89"></component><component x="316.81" y="271.5"
360
+ width="239.11" height="91.65" page="6" page_width="595.28"
361
+ page_height="841.89"></component><component x="316.81" y="122.84" width="239.11" height="125.56"
362
+ page="6" page_width="595.28" page_height="841.89"></component><component x="55.39" y="134.11"
363
+ width="498.95" height="7.96" page="7" page_width="595.28"
364
+ page_height="841.89"></component><component x="53.8" y="124.03" width="237.98" height="481.22"
365
+ page="8" page_width="595.28" page_height="841.89"></component><component x="316.81" y="157.4"
366
+ width="239.11" height="445.82" page="8" page_width="595.28"
367
+ page_height="841.89"></component></section>
368
+ </pdf>
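The fixture above records each detected section's bounding boxes as `<component>` elements with absolute page coordinates. A minimal sketch of reading those attributes back with Ruby's stdlib REXML (the abbreviated XML snippet is illustrative, cut down from the fixture above):

```ruby
require 'rexml/document'

# Parse component bounding boxes from fixture-style XML.
# This snippet is an abbreviated stand-in for the full fixture.
xml = <<-XML
<pdf>
  <section name_ratio="0.172" reference_score="3.94">
    <component x="316.81" y="122.84" width="239.11" height="83.71" page="3"/>
  </section>
</pdf>
XML

doc = REXML::Document.new(xml)
components = REXML::XPath.match(doc, "//component").map do |c|
  { :page  => c.attributes["page"].to_i,
    :x     => c.attributes["x"].to_f,
    :width => c.attributes["width"].to_f }
end
puts components.inspect
```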
@@ -28,14 +28,14 @@ module PdfExtract
   def self.include_in pdf
     deps = [:regions, :bodies]
     pdf.spatials :columns, :paged => true, :depends_on => deps do |parser|
-
+
       body = nil
      body_regions = []

       parser.before do
         body_regions = []
       end
-
+
       parser.objects :bodies do |b|
         body = b
       end
@@ -48,7 +48,7 @@ module PdfExtract

       parser.after do
         column_sample_count = pdf.settings[:column_sample_count]
-
+
         step = 1.0 / (column_sample_count + 1)
         column_ranges = []

@@ -59,10 +59,14 @@ module PdfExtract

         # Discard those with a coverage of 0.
         column_ranges.reject! { |r| r.covered.zero? }
-
+
         # Discard those with more than x columns. They've probably hit a table.
         column_ranges.reject! { |r| r.count > pdf.settings[:max_column_count] }

+        # Discard ranges that consist only of very narrow columns.
+        # Likely tables or columns picking up on false tab stops.
+        column_ranges.reject! { |r| r.widest < (0.25 * body[:width]) }
+
         if column_ranges.count.zero?
           []
         else
@@ -79,7 +83,7 @@ module PdfExtract
           end
         end
       end
-
+
     end
   end

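The new filter added to columns.rb can be shown in isolation. This is a minimal sketch under assumptions: the `ColumnRange` struct and sample numbers are illustrative stand-ins, not the gem's real objects; only the reject! line is taken from the change above.

```ruby
# Sketch of the narrow-column filter: a candidate range survives only
# if its widest column is at least a quarter of the body width.
# Narrower ranges are likely tables or false tab stops.
ColumnRange = Struct.new(:widest, :count, :covered)

body_width = 400.0
column_ranges = [
  ColumnRange.new(180.0, 2, 0.9), # plausible pair of text columns
  ColumnRange.new(60.0, 5, 0.8)   # five narrow strips, likely a table
]

column_ranges.reject! { |r| r.widest < (0.25 * body_width) }
puts column_ranges.map(&:widest).inspect
```

With a 400pt body the threshold is 100pt, so only the range whose widest column is 180pt survives.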
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 require_relative '../language'
 require_relative '../spatial'
 require_relative '../kmeans'
@@ -10,16 +11,19 @@ module PdfExtract
     :module => self.name,
     :description => "Minimum ratio of text region width to containing column width for a text region to be considered as part of an article section."
   }
-
+
   def self.match? a, b
-    lh = a[:line_height].round(2) == b[:line_height].round(2)
-    f = a[:font] == b[:font]
-    lh && f
+    # A must have a width around the width of B and have the same
+    # font size.
+    avg_width = (a[:width] + b[:width]) / 2.0
+    matched_width = (a[:width] - b[:width]).abs <= avg_width * 0.1
+    matched_font_size = a[:line_height].round(2) == b[:line_height].round(2)
+    matched_width && matched_font_size
   end

   def self.candidate? pdf, region, column
     # Regions that make up sections or headers must be
-    # both less width than their column width and,
+    # both less wide than their column width and,
     # unless they are a single line, must be within the
     # width_ratio.
     width_ratio = pdf.settings[:width_ratio]
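The revised `match?` above replaces the exact font-name comparison with a width tolerance: two regions belong to the same section when their widths differ by at most 10% of their average width and their line heights agree to two decimal places. A self-contained sketch (the region hashes are illustrative, not the gem's real objects):

```ruby
# Stand-alone version of the revised matching rule from sections.rb.
def match?(a, b)
  avg_width = (a[:width] + b[:width]) / 2.0
  matched_width = (a[:width] - b[:width]).abs <= avg_width * 0.1
  matched_font_size = a[:line_height].round(2) == b[:line_height].round(2)
  matched_width && matched_font_size
end

a = { :width => 239.1, :line_height => 7.96 }
b = { :width => 238.0, :line_height => 7.96 } # ~0.5% narrower: matches
c = { :width => 120.0, :line_height => 7.96 } # half the width: rejected

puts match?(a, b)
puts match?(a, c)
```

Tolerating small width differences lets ragged last lines of justified paragraphs still merge, while a half-width region (a different layout element) is kept separate.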
@@ -27,13 +31,23 @@ module PdfExtract
     within_column && (region[:width].to_f / column[:width]) >= width_ratio
   end

+  def self.possible_header? pdf, region, column
+    # Possible headers are narrower than the column width_ratio
+    # but still within the column bounds. They must also be at least
+    # as wide as they are tall (otherwise we may have a table
+    # column, which should be ignored for purposes of determining
+    # page flow).
+    within_column = region[:width] <= column[:width]
+    within_column && (region[:width] >= region[:height])
+  end
+
   def self.reference_cluster clusters
     # Find the cluster with name_ratio closest to 0.1
     # Those are our reference sections.
     ideal = 0.1
     ref_cluster = nil
     smallest_diff = 1
-
+
     clusters.each do |cluster|
       diff = (cluster[:centre][:name_ratio] - ideal).abs
       if diff < smallest_diff
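The geometry behind the new `possible_header?` test can be exercised on its own. A minimal sketch, assuming hash-based stand-ins for the gem's region and column objects (the `pdf` argument is omitted here because the body above never uses it):

```ruby
# A region qualifies as a possible header when it fits inside the
# column and is at least as wide as it is tall; tall narrow boxes
# are more likely table columns and should not split page flow.
def possible_header?(region, column)
  within_column = region[:width] <= column[:width]
  within_column && (region[:width] >= region[:height])
end

column    = { :width => 239.0 }
header    = { :width => 120.0, :height => 12.0 } # short and wide
table_col = { :width => 20.0,  :height => 90.0 } # tall and narrow

puts possible_header?(header, column)
puts possible_header?(table_col, column)
```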
@@ -63,29 +77,29 @@ module PdfExtract
         :letter_ratio => Language.letter_ratio(content),
         :year_ratio => Language.year_ratio(content),
         :cap_ratio => Language.cap_ratio(content),
-        :name_ratio => Language.name_ratio(content),
+        :name_ratio => Language.name_ratio(content),
         :word_count => Language.word_count(content),
-        :lateness => (last_page / page_count.to_f)
+        :lateness => (last_page / page_count.to_f)
       })
     end
   end
-
+
   def self.include_in pdf
     pdf.spatials :sections, :depends_on => [:regions, :columns] do |parser|

       columns = []
-
+
       parser.objects :columns do |column|
-        columns << {:column => column, :regions => []}
+        columns << {:column => column, :regions => []}
       end

       parser.objects :regions do |region|
         containers = columns.reject do |c|
           column = c[:column]
-          not (column[:page] == region[:page] && Spatial.contains?(column, region))
+          not (column[:page] == region[:page] && Spatial.contains?(column, region, 1))
         end

-        containers.first[:regions] << region unless containers.count.zero?
+        containers.first[:regions] << region unless containers.empty?
       end

       parser.after do
@@ -107,36 +121,40 @@ module PdfExtract
         end

         sections = []
-        found = []
-
+        merging_region = nil
+
         pages.each_pair do |page, columns|
-          columns.each do |c|
-            column = c[:column]
-
-            c[:regions].each do |region|
-
+          columns.each do |container|
+            column = container[:column]
+
+            container[:regions].each do |region|
               if candidate? pdf, region, column
-                if !found.last.nil? && match?(found.last, region)
-                  content = Spatial.merge_lines(found.last, region, {})
-                  found.last.merge!(content)
+                if !merging_region.nil? && match?(merging_region, region)
+                  content = Spatial.merge_lines(merging_region, region, {})
+
+                  merging_region.merge!(content)

-                  found.last[:components] << Spatial.get_dimensions(region)
-
+                  merging_region[:components] << Spatial.get_dimensions(region)
+                elsif !merging_region.nil?
+                  sections << merging_region
+                  merging_region = region.merge({
+                    :components => [Spatial.get_dimensions(region)]
+                  })
                 else
-                  found << region.merge({
+                  merging_region = region.merge({
                     :components => [Spatial.get_dimensions(region)]
                   })
                 end
-              else
-                sections = sections + found
-                found = []
+              elsif possible_header? pdf, region, column
+                # Split sections, ignore the header
+                sections << merging_region if !merging_region.nil?
+                merging_region = nil
               end
-
             end
           end
         end

-        sections = sections + found
+        sections << merging_region if not merging_region.nil?

         # We now have sections. Add information to them.
         # add_content_types sections
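The hunk above replaces the `found` array with a single `merging_region` accumulator: matching regions are folded into the current section, and a non-matching candidate flushes the section and starts a new one. The control flow can be sketched without the gem's objects (the trivial `match?` and text hashes here are illustrative stand-ins, and string concatenation stands in for `Spatial.merge_lines`):

```ruby
# Single-pass section merging, mirroring the new accumulator logic.
def match?(a, b)
  a[:line_height] == b[:line_height]
end

regions = [
  { :text => "A1", :line_height => 8 },
  { :text => "A2", :line_height => 8 },  # continues the first section
  { :text => "B1", :line_height => 10 }  # starts a new section
]

sections = []
merging_region = nil

regions.each do |region|
  if !merging_region.nil? && match?(merging_region, region)
    # Fold the region into the section being built.
    merging_region[:text] += " " + region[:text]
  else
    # Flush the finished section and start a new one.
    sections << merging_region unless merging_region.nil?
    merging_region = region.dup
  end
end
sections << merging_region unless merging_region.nil?

puts sections.map { |s| s[:text] }.inspect
```

Keeping one open accumulator instead of a growing `found` array also makes the `possible_header?` branch simple: flushing is just `sections << merging_region; merging_region = nil`.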
@@ -155,7 +173,7 @@ module PdfExtract

         sections
       end
-
+
     end
   end