pdf-extract 0.0.10 → 0.1.0

@@ -22,8 +22,7 @@ resolvers = {
 
   outputs = {
   :xml => proc { :stdout },
- :pdf => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.pdf" },
- :png => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.png" }
+ :pdf => proc { |f| File::basename(f.sub /\.[a-zA-Z0-9]+\Z/, "") + ".mask.pdf" }
   }
 
   commands = [
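
The hunk above drops the `:png` output and keeps only the `:pdf` mask writer. A minimal sketch of what that proc computes (just the filename mapping, reproduced from the hunk; the surrounding `outputs` table is not needed):

```ruby
# The :pdf output proc from the hunk above: strip the final file extension,
# take the basename, and append ".mask.pdf".
pdf_name = proc { |f| File::basename(f.sub(/\.[a-zA-Z0-9]+\Z/, "")) + ".mask.pdf" }

puts pdf_name.call("docs/paper.pdf")  # => paper.mask.pdf
puts pdf_name.call("scan.tiff")       # => scan.mask.pdf
```

Note that the directory part is discarded by `File::basename`, so the mask file lands in the current working directory regardless of where the input lives.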
Binary file
Binary file
Binary file
Binary file
Binary file
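
The new XML fixture below carries extracted paper text that defines an area-overlap measure between ground-truth and detected table bounding boxes, A(G_i, D_j) = 2|G_i ∩ D_j| / (|G_i| + |D_j|). A minimal sketch of that measure over axis-aligned boxes (the `Box` struct and helper names are illustrative assumptions, not part of this package):

```ruby
# Hypothetical illustration of the overlap measure described in the fixture text:
# A = 2 * |G ∩ D| / (|G| + |D|), ranging from 0 (disjoint) to 1 (identical).
Box = Struct.new(:x, :y, :w, :h) do
  def area = w * h
end

# Area of intersection of two axis-aligned boxes, 0 if they do not overlap.
def intersection_area(a, b)
  ix = [a.x + a.w, b.x + b.w].min - [a.x, b.x].max
  iy = [a.y + a.h, b.y + b.h].min - [a.y, b.y].max
  ix.positive? && iy.positive? ? ix * iy : 0
end

def overlap(g, d)
  2.0 * intersection_area(g, d) / (g.area + d.area)
end

g = Box.new(0, 0, 10, 10)
puts overlap(g, g)                      # identical boxes  => 1.0
puts overlap(g, Box.new(20, 20, 5, 5))  # disjoint boxes   => 0.0
```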
@@ -0,0 +1,368 @@
+ <?xml version="1.0"?>
+ <pdf>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.06" year_ratio="0.0"
+ cap_ratio="0.2" name_ratio="0.172" word_count="250" lateness="0.125"
+ reference_score="3.94">ABSTRACT Detecting tables in document images is important since not only
+ do tables contain important information, but also most of the layout analysis methods fail in
+ the presence of tables in the document image. Existing approaches for table detection mainly
+ focus on detecting tables in single columns of text and do not work reliably on documents with
+ varying layouts. This paper presents a practical algorithm for table detection that works with a
+ high accuracy on documents with varying layouts (company reports, newspaper articles, magazine
+ pages, . . . ). An open source implementation of the algorithm is provided as part of the
+ Tesseract OCR engine. Evaluation of the algorithm on document images from the publicly available
+ UNLV dataset shows competitive performance in comparison to the table detection module of a
+ commercial OCR system. Categories and Subject Descriptors I.7.5 [Document and Text Processing]:
+ Document Capture - Document Analysis Keywords page segmentation, table detection, document
+ analysis 1. INTRODUCTION Automatic conversion of paper documents into an editable electronic
+ representation relies on optical character recognition (OCR) technology. A typical OCR system
+ consists of three major steps. First, layout analysis is performed to locate text-lines in the
+ document image and to identify their reading order. Then, a character recognition engine
+ processes the text-line images and generates a text string by recognizing individual characters
+ in the text-line image. Finally, a language modeling module makes corrections in the text string
+ using a dictionary or a language model. (The author gratefully acknowledges funding from Google
+ Inc. for supporting this work.)<component x="53.8" y="466.55" width="239.12" height="167.4"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="419.88"
+ width="239.11" height="31.41" page="1" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="383.67" width="219.51" height="20.95"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="225.03"
+ width="239.12" height="143.37" page="1" page_width="595.28"
+ page_height="841.89"></component></section>
+ <section line_height="7.13" font="OFVLTP+NimbusRomNo9L-Regu" letter_ratio="0.14"
+ year_ratio="0.0" cap_ratio="0.15" name_ratio="0.2261904761904762" word_count="84"
+ lateness="0.125" reference_score="4.77">Permission to make digital or hard copies of all or part
+ of this work for personal or classroom use is granted without fee provided that copies are not
+ made or distributed for profit or commercial advantage and that copies bear this notice and the
+ full citation on the first page. To copy otherwise, to republish, to post on servers or to
+ redistribute to lists, requires prior specific permission and/or a fee. DAS '10, June 9-11,
+ 2010, Boston, MA, USA Copyright 2010 ACM 978-1-60558-773-8/10/06 ...$10.00<component x="53.8"
+ y="120.67" width="239.1" height="69.89" page="1" page_width="595.28"
+ page_height="841.89"></component></section>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.05" year_ratio="0.0"
+ cap_ratio="0.08" name_ratio="0.20614035087719298" word_count="1140" lateness="0.375"
+ reference_score="2.03">Since layout analysis is the first step in such a process, all subsequent
+ stages rely on layout analysis to work correctly. One of the major challenges faced by layout
+ analysis is detecting table regions. Table detection is a hard problem since tables have a large
+ variation in their layouts. Existing open-source OCR systems lack the capability of table
+ detection and their layout analysis modules break down in the presence of table regions. A
+ distinction should be made at this stage between table detection and table recognition [8].
+ Table detection deals with the problem of finding boundaries of tables in a page image. Table
+ recognition, on the other hand, focuses on analyzing a detected table by finding its rows and
+ columns and tries to extract the structure of the table. Our focus in this paper is on the table
+ detection problem. Wang et al. [20] take a statistical learning approach to the table detection
+ problem. Given a set of candidate text-lines, candidate table lines are identified based on gaps
+ between consecutive words. Then, vertically adjacent lines with large gaps and horizontally
+ adjacent words are grouped together to make table entity candidates. Finally, a statistical
+ learning algorithm is used to refine the table candidates and reduce false alarms. They
+ make the assumption that the maximum number of columns is two and design three templates of page
+ layout (single column, double column, mixed column). They apply a column style classification
+ algorithm to find out the column layout of the page and use this information as a priori
+ knowledge for spotting table regions. This approach can handle only those layouts on which it
+ has been trained. Besides, training the algorithm requires a large amount of labeled data. Hu et
+ al. [6] presented a system for table detection from scanned page images or from plain text
+ documents. Their system assumes a single-column input page that can be easily segmented into
+ individual text-lines (for instance by horizontal projection). The table detection problem is
+ then posed as an optimization problem where start and end text-lines belonging to a table are
+ identified by optimizing some quality function. Like previous approaches, this technique cannot
+ be applied to multi-column documents. Cesarini et al. [2] present a system for locating table
+ regions by detecting parallel lines. The table hypotheses formed in this way are then verified by
+ locating perpendicular lines or white spaces in the region included between the parallel lines.
+ However, relying only on horizontal or vertical lines for table detection limits the scope of
+ the system since not all tables have such lines. More recent work in table detection is reported
+ by Gatos et al. [4] and Costa e Silva [3]. Gatos et al. [4] focus on locating tables that have
+ both horizontal and vertical rulings and find their intersection points. Then, table
+ reconstruction is achieved by drawing the corresponding horizontal and vertical lines that
+ connect all line intersection pairs. The system works well for their target documents but
+ cannot be used when the table's rows/columns are not separated by ruling lines. The work of
+ Costa e Silva [3] focuses on extracting table regions from PDF documents using Hidden Markov
+ Models (HMMs). They extract text from the PDF using the pdftotext Linux utility. The spaces in
+ the extracted text are used for computing the feature vector. Clearly, this approach would not
+ work for document images. Summarizing the state of the art in table detection, we can see a
+ clear limitation of existing methods: they do not work well on multi-column document images.
+ This is probably due to the fact that most of the existing approaches focus on table recognition
+ to extract the structure (rows, columns, cells) of the tables and hence make some simplifying
+ assumptions on the table detection part. This approach works well when one has to deal with some
+ specific classes of document images having simple layouts. However, more robust table detection
+ algorithms are needed when dealing with a heterogeneous collection of documents. In this paper,
+ we try to bridge this gap. Our goal is to accurately spot table regions in complex heterogeneous
+ documents (company reports, journal articles, newspapers, magazines, . . . ). Once table regions
+ are spotted, one of the existing table recognition techniques (e.g. [10]) could be used to
+ extract the structure of the tables. The rest of this paper is organized as follows. First, we
+ describe in Section 2 the layout analysis module of Tesseract [18, 19] that is used as the
+ basis of our table detection algorithm. Then, our table detection algorithm is illustrated in
+ Section 3. Different performance measures used to evaluate our system are presented in Section 4.
+ Experimental results and discussion are given in Section 5, followed by a conclusion in Section
+ 6. 2. LAYOUT ANALYSIS VIA TAB-STOP DETECTION The layout analysis of Tesseract is a recent
+ addition to the open source OCR system [19]. It is based on the idea of detecting tab-stops in a
+ document image. When type-setting a document, tab-stops are the locations where text aligns
+ (left, right, center, decimal, . . . ). Therefore, tab-stops can be used as a reliable indication
+ of where a text block starts or ends. Finding the layout of the page via tab-stop detection
+ proceeds as follows (see Figure 1 for illustration): • First, a document image pre-processing
+ step is performed to identify horizontal and vertical ruling lines or separators and to locate
+ half-tone or image regions in the document. Then, a connected component analysis is performed to
+ identify candidate text components based on their size and stroke width. • The filtered text
+ components are evaluated as candidates for lying on a tab-stop position. These candidates are
+ grouped into vertical lines to find tab-stop positions that are vertically aligned. As a final
+ step, pairs of connected tab lines are adjusted such that they end at the same y-coordinate (see
+ Figure 1(a)). At this stage, vertical tab lines mark the start and end of text regions. • Based
+ on the tab-lines, the column layout of the page is inferred and connected components are grouped
+ into Column Partitions. A column partition is a sequence of connected components that do not
+ cross any tab line and are of the same type (text, image, . . . ). Text column partitions can be
+ regarded as initial candidates for text-lines (see Figure 1(b)). • The last step creates flows
+ of column partitions such that neighboring column partitions of the same type are grouped into
+ the same block (Figure 1(c)). Text column partitions having different font size and line spacing
+ are grouped into different blocks. Then, the reading order of these blocks is identified. The
+ boundary of the blocks is represented as an isothetic polygon (a polygon that has all edges
+ parallel to the axes). 3. TABLE SPOTTING Our table detection algorithm is built upon two
+ components of the layout analysis module:<component x="316.81" y="477.51" width="239.11"
+ height="154.41" page="1" page_width="595.28" page_height="841.89"></component><component
+ x="316.81" y="174.15" width="239.12" height="164.87" page="1" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="132.3" width="239.11" height="28.88"
+ page="1" page_width="595.28" page_height="841.89"></component><component x="53.8" y="480.09"
+ width="239.11" height="60.27" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="155.81" width="239.12" height="206.72"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="53.8" y="124.43"
+ width="239.11" height="18.42" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="385.95" width="239.12" height="154.41"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="316.81" y="291.8"
+ width="239.12" height="81.19" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="170.93" width="239.64" height="108.12"
+ page="2" page_width="595.28" page_height="841.89"></component><component x="330.14" y="122.84"
+ width="225.79" height="29.39" page="2" page_width="595.28"
+ page_height="841.89"></component><component x="76.21" y="476.58" width="216.69" height="28.88"
+ page="3" page_width="595.28" page_height="841.89"></component><component x="67.12" y="385.62"
+ width="225.79" height="81.69" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="67.12" y="305.12" width="225.79" height="71.23"
+ page="3" page_width="595.28" page_height="841.89"></component><component x="67.12" y="214.16"
+ width="225.79" height="81.69" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="168.72" width="239.11" height="31.41"
+ page="3" page_width="595.28" page_height="841.89"></component></section>
+ <section line_height="7.96" font="TRUPSF+CMR9" letter_ratio="0.1" year_ratio="0.0"
+ cap_ratio="0.16" name_ratio="0.20500894454382826" word_count="2795" lateness="1.0"
+ reference_score="9.47">3.1 Identifying Table Partitions The first step in our algorithm
+ identifies text column partitions that could belong to a table region, referred to as table
+ partitions. Based on the observations mentioned in the previous paragraph, three types of
+ partitions are marked as table partitions: (1) partitions that have at least one large gap
+ between their connected components, (2) partitions that consist of only one word (no significant
+ gap between components), (3) partitions that overlap along the y-axis with other partitions
+ within the same column. The first case identifies table partitions that result from merging
+ cells from different columns of a table into one partition. The second case detects table
+ partitions that consist of a single data cell. The third case identifies table partitions that
+ lie in one column but were not joined together due to the presence of a strong tab-line. This
+ stage tries to find table partition candidates quite aggressively. This has the advantage that
+ even small evidence of the presence of a table is not missed, since any tables that are missed
+ at this stage will not be recoverable at later stages. The disadvantage of the aggressive
+ approach is that several false alarms may originate, for instance from single word section
+ headings, page headers and footers, numbered equations, small parts of text words in the
+ marginal noise, and line drawing regions. A smoothing filter is applied that detects isolated
+ table partitions that have no other table partition neighbor above or below them. These
+ partitions are removed from the candidate table partition list. The candidate table partitions
+ for our example image are shown in Figure 3(a). 3.2 Detecting Page Column Split The next step is
+ to detect a split in the column layout of the page due to the presence of a table. Such a split
+ occurs when the cells of the table are very well aligned. To detect this case, we divide the
+ page into columns and find the ratio of table partitions in each column. Table columns that were
+ erroneously reported as page columns are easily detected since they have a high ratio of table
+ partitions as compared to normal text partitions. However, extra care needs to be taken at this
+ stage to undo a column split (i.e. to merge two columns) since a wrong decision would result in
+ merging two text columns, leading to a large number of errors in page layout analysis itself.
+ Therefore, we undo a page column split only if a sufficient number of text partitions spanning
+ the two columns are present and the split in the columns starts with table partitions. This
+ extra care prevents merging table columns in full-page tables when there is no flowing text in
+ the page. Since the cost of a wrong decision here is very high in terms of layout analysis
+ errors, we chose to perform this step defensively. 3.3 Locating Table Columns The goal of this
+ step is to group table partitions into table columns. For this purpose, runs of vertically
+ neighboring table partitions are assigned to a single table column. If a column partition of
+ type "horizontal ruling" is encountered, the run continues. When a partition of any other type
+ is found, the table column obtained so far is finalized. If a table column consists of only one
+ table partition, it is removed as a false alarm. The identified table columns for the example
+ image are shown in Figure 3(b). 3.4 Marking Table Regions Table columns obtained in the previous
+ steps give a strong hint about the presence of a table in that region. We make a simple
+ assumption here: within a single page column, flowing text does not share space with a table
+ along the y-axis. This assumption holds true for most of the layouts that we encounter in
+ practice since, if a table shares space vertically with flowing text, it is hard to see whether
+ the text belongs to the table or not. Based on this assumption, we horizontally expand the
+ boundaries of table columns to the page columns that contain them. Hence we obtain within-column
+ table regions for each page column. At this stage, tables that are laid out within one column
+ are correctly identified. However, tables spanning multiple page columns are over-segmented.
+ Although two table regions in neighboring page columns could be merged if their start and end
+ positions align, this might wrongly merge different tables in the two columns. Therefore a merge
+ is carried out only if at least one column partition of any type (text, table, horizontal
+ ruling) is found that overlaps with both tables. Table partitions and horizontal ruling
+ partitions that are not included in any table and are directly above or below a table region
+ with a large overlap along the x-axis are also included in the neighboring table. The table
+ regions thus obtained for the example image are shown in Figure 3(c). 3.5 Removing False Alarms
+ Although most of the false alarms originating from normal text regions are removed in previous
+ stages, other sources of false alarms like marginal noise [17] and figures still remain.
+ Therefore the identified table regions are passed through a simple validity test: a valid table
+ should have at least two columns. False alarms consisting of a single column are removed by
+ analyzing their projection on the x-axis. The projection of a valid table on the x-axis should
+ have at least one zero-valley larger than the global median x-height of the page. Therefore,
+ table candidates that do not have such a zero-valley in their vertical projection are removed.
+ 4. PERFORMANCE MEASURES Different performance measures have been reported in the literature for
+ evaluating table detection algorithms. These range from simple precision and recall based
+ measures [6, 13] to more sophisticated measures for benchmarking complete table structure
+ extraction algorithms [8]. In this paper, since we are only focusing on table spotting, we use
+ standard measures for document image segmentation focusing on the table regions. Hence, in
+ accordance with [13, 14, 16, 20], we use several measures for quantitatively evaluating
+ different aspects of our table spotting algorithm. Both ground-truth tables and tables detected
+ by our algorithm are represented by their bounding boxes. Let G_i represent the bounding box of
+ the i-th ground-truth table and D_j the bounding box of the j-th detected table in a document
+ image. The amount of overlap between the two is defined as: A(G_i, D_j) = 2|G_i ∩ D_j| /
+ (|G_i| + |D_j|) (1), where |G_i ∩ D_j| represents the area of intersection of the two zones,
+ and |G_i|, |D_j| represent the individual areas of the ground-truth and the detected tables.
+ The amount of area overlap A_ij will vary between zero and one depending on the overlap between
+ ground-truth table G_i and detected table D_j. If the two tables do not overlap at all,
+ A_ij = 0, and if the two tables match perfectly, i.e. |G_i ∩ D_j| = |G_i| = |D_j|, then
+ A_ij = 1. • Partial Detections: These are the number of ground-truth tables that have a
+ one-to-one correspondence with a detected table; however, the amount of overlap is not large
+ enough (0.1 &lt; A &lt; 0.9) to be classified as a correct detection (see Figure 4(a)). •
+ Over-Segmented Tables: These are the number of ground-truth tables that have a major overlap
+ (0.1 &lt; A &lt; 0.9) with more than one detected table. This indicates that different parts of
+ the ground-truth table were detected as separate tables (see Figure 4(b)). • Missed Tables:
+ These are the number of ground-truth tables that do not have a major overlap with any of the
+ detected tables (A &lt; 0.1). These tables are regarded as missed by the detection algorithm. •
+ False Positive Detections: These are the number of detected tables that do not have a major
+ overlap with any of the ground-truth tables (A &lt; 0.1). These tables are regarded as false
+ positive detections since the system mistook some non-table region as a table (see Figure
+ 4(d)). • Area Precision: While the measures defined above help in understanding which types of
+ errors were made by the table detection algorithm, the goal of this measure is to summarize the
+ performance of the algorithm by measuring what percentage of the detected table regions
+ actually belongs to a table region in the ground-truth image. A high precision is achieved when
+ the decision about the presence of a table region is made very conservatively. • Area Recall:
+ This measure evaluates the percentage of the ground-truth table regions that was marked as
+ belonging to a table by the algorithm. The concept of the precision and recall measures is
+ similar to their use in the information retrieval community [13]. 5. EXPERIMENTS AND RESULTS To
+ evaluate the performance of our table detection algorithm, we chose the UNLV dataset [1]. The
+ UNLV dataset contains a large variety of documents ranging from technical reports and business
+ letters to newspapers and magazines. The dataset was specifically created to analyze the
+ performance of leading commercial OCR systems in the UNLV annual tests of OCR accuracy [15]. It
+ contains more than 10,000 scanned pages at different resolutions and 1000 fax documents. The
+ scanned pages are categorized into bi-tonal and greyscale documents. The bi-tonal documents are
+ again grouped into different scan resolutions (200, 300, and 400 dpi). For each page,
+ manually-keyed ground-truth text is provided, along with manually-determined zone information.
+ The zones are further labeled according to their contents (text, table, half-tone, . . . ). We
+ picked bi-tonal documents in the 300 dpi class for our experiments since this represents the
+ most common setting for scanning documents. Among these images, 427 pages containing table
+ zones were selected. These page images were further split into a training set of 213 images and
+ a test set of 214 images. The training images were used in the development of the algorithm,
+ and different steps of the algorithm were extensively evaluated on these images. The test
+ images were used in the end to evaluate the complete system. Results of our table detection
+ algorithm on some sample images from the UNLV dataset are shown in Figure 5. Detailed
+ evaluation of the algorithm and its comparison with a state-of-the-art commercial OCR system is
+ given in Table 1 and Figure 6. It should be noted that the ground-truth table zones provided
+ with the UNLV dataset also include the table caption inside the zone. Since a table caption is
+ not a tabular structure, it is left out of the table by all OCR systems. Therefore, we edited
+ the ground-truth information by manually marking the table caption regions in all documents.
+ Then this region was excluded from the ground-truth table zones provided with the dataset. This
+ was achieved by shrinking the ground-truth table zones to tightly enclose all foreground pixels
+ that were not part of the table caption. The experimental results show that our system was able
+ to spot table regions with a precision of 86% on the test data. The recall was also quite high
+ (79%), showing a good compromise between precision and recall. The commercial OCR system, on
+ the other hand, had a lower recall (37%) but higher precision (96%). Figure 6: A bar chart
+ comparing the accuracy of the proposed table detection system with that of a commercial OCR
+ system on the UNLV test set (214 pages containing 268 tables). Some of the errors made by our
+ algorithm are shown in Figure 4. An analysis of the results shows that the major source of
+ errors is full-page tables. In these cases, the column finding algorithm reports several
+ columns of text. Since newspapers also have several text columns, without using a priori
+ knowledge about the type of document (report, newspaper, . . . ) it is hard to detect that the
+ large number of columns is due to a full-page table. One typical example is a page containing a
+ "table of contents". Such pages are marked as table regions in the ground-truth information
+ provided with the UNLV dataset. However, our algorithm regards them as regular text pages,
+ hence either missing these "tables" completely or partially detecting them. The false positive
+ detections made by our algorithm were also analyzed. We noticed an interesting side-effect of
+ our algorithm. Since many graphics regions have text inside them that is spaced apart, such
+ regions were also spotted as tables. Although such cases were reported as false alarms, in some
+ cases it might be beneficial to additionally spot graphics regions as well. Other cases of
+ false alarms originated from tabulated equations. False alarms in pure text regions were quite
+ rare. 6. CONCLUSION This paper presented a table detection algorithm as part of the Tesseract
+ open source OCR system. The presented algorithm uses components of the layout analysis module
+ of Tesseract to locate tables in documents having a large variety of layouts. Experimental
+ results on different classes of documents (company reports, journal articles, newspaper
+ articles, magazine pages) from the UNLV dataset showed that our table detection algorithm
+ competes well with that of a commercial OCR system, with a much higher recall and slightly
+ lower precision. We plan to extend this work in the direction of table structure extraction in
+ the future. Figure 5: Some sample images from the UNLV dataset showing the table spotting
+ results of our algorithm. 7. REFERENCES [1] http://www.isri.unlv.edu/ISRI/OCRtk. [2] F.
+ Cesarini, S. Marinai, L. Sarti, and G. Soda. Trainable table location in document images. In
+ Proc. Int. Conf. on Pattern Recognition, pages 236-240, Quebec, Canada, Aug. 2002. [3] A. C. e
+ Silva. Learning rich hidden Markov models in document analysis: Table location. In Proc. Int.
+ Conf. on Document Analysis and Recognition, pages 843-847, Barcelona, Spain, July 2009. [4] B.
+ Gatos, D. Danatsas, I. Pratikakis, and S. J. Perantonis. Automatic table detection in document
+ images. In Proc. Int. Conf. on Advances in Pattern Recognition, pages 612-621, Bath, UK, Aug.
+ 2005. [5] I. Guyon, R. M. Haralick, J. J. Hull, and I. T. Phillips. Data sets for OCR and
+ document image understanding research. In H. Bunke and P. Wang, editors, Handbook of character
+ recognition and document image analysis, pages 779-799. World Scientific, Singapore, 1997. [6]
+ J. Hu, R. Kashi, D. Lopresti, and G. Wilfong. Medium-independent table detection. In Proc. SPIE
+ Document Recognition and Retrieval VII, pages 291-302, San Jose, CA, USA, Jan. 2000. [7] J. Hu,
+ R. S. Kashi, D. Lopresti, and G. Wilfong. Experiments in table recognition. In Proc. Int.
+ Workshop on Document Layout Interpretation and Applications, Seattle, WA, USA, Sep. 2001. [8]
+ J. Hu, R. S. Kashi, D. Lopresti, and G. Wilfong. Evaluating the performance of table processing
+ algorithms. Int. Jour. on Document Analysis and Recognition, 4(3):140-153, 2002. [9] D.
+ Keysers, F. Shafait, and T. M. Breuel. Document image zone classification - a simple
+ high-performance approach. In 2nd Int. Conf. on Computer Vision Theory and Applications, pages
+ 44-51, Barcelona, Spain, Mar. 2007. [10] T. Kieninger and A. Dengel. A paper-to-HTML table
+ converting system. In Proc. Document Analysis Systems, pages 356-365, Nagano, Japan, Nov. 1998.
+ [11] T. Kieninger and A. Dengel. Table recognition and labeling using intrinsic layout
+ features. In Proc. Int. Conf. on Advances in Pattern Recognition, Plymouth, UK, Nov. 1998. [12]
+ T. Kieninger and A. Dengel. Applying the T-RECS table recognition system to the business letter
+ domain. In Proc. Int. Conf. on Document Analysis and Recognition, pages 518-522, Seattle, WA,
+ USA, Sep. 2001. [13] T. Kieninger and A. Dengel. An approach towards benchmarking of table
+ structure recognition results. In Proc. 8th Int. Conf. on Document Analysis and Recognition,
+ pages 1232-1236, Seoul, Korea, Aug. 2005. [14] S. Mandal, S. Chowdhury, A. Das, and B. Chanda.
+ A simple and effective table detection system from document images. Int. Jour. on Document
+ Analysis and Recognition, 8(2-3):172-182, 2006. [15] S. V. Rice, F. R. Jenkins, and T. A.
+ Nartker. The fourth annual test of OCR accuracy. Technical report, Information Science
+ Research Institute, University of Nevada, Las Vegas, 1995. [16] F. Shafait, D. Keysers, and T.
+ M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE
+ Trans. on Pattern Analysis and Machine Intelligence, 30(6):941-954, 2008. [17] F. Shafait, J.
+ van Beusekom, D. Keysers, and T. M. Breuel. Document cleanup using page frame detection. Int.
+ Jour. on Document Analysis and Recognition, 11(2):81-96, 2008. [18] R. Smith. An overview of
+ the Tesseract OCR engine. In Proc. 9th Int. Conf. on Document Analysis and Recognition, pages
+ 629-633, Curitiba, Brazil, Sep. 2007. [19] R. Smith. Hybrid page layout analysis via tab-stop
+ detection. In Proc. Int. Conf. on Document Analysis and Recognition, pages 241-245, Barcelona,
+ Spain, July 2009. [20] Y. Wang, R. Haralick, and I. T. Phillips. Automatic table ground truth
+ generation and a background-analysis-based table structure extraction method. In Proc. Int.
+ Conf. on Document Analysis and Recognition, pages 528-532, Seattle, WA, USA, Sep. 2001. [21] Y.
+ Wang, I. Phillips, and R. Haralick. Document zone content classification and its performance
+ evaluation. Pattern Recognition, 39(1):57-73, 2006.<component x="316.81" y="122.84"
+ width="239.11" height="83.71" page="3" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="455.66" width="239.12" height="81.19"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="53.8" y="298.75"
+ width="239.12" height="143.95" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="143.76" width="239.12" height="136.02"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="53.8" y="122.84"
+ width="239.11" height="7.96" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="466.12" width="239.11" height="70.73"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="316.81" y="346.79"
+ width="239.12" height="104.63" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="206.53" width="239.11" height="125.55"
+ page="4" page_width="595.28" page_height="841.89"></component><component x="316.81" y="122.84"
+ width="239.11" height="70.73" page="4" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="506.1" width="239.12" height="60.27"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="53.8" y="366.62"
+ width="239.11" height="125.55" page="5" page_width="595.28"
+ page_height="841.89"></component><component x="53.8" y="237.59" width="239.11" height="115.09"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="53.8" y="122.42"
+ width="239.11" height="102.21" page="5" page_width="595.28"
+ page_height="841.89"></component><component x="316.81" y="495.64" width="239.11" height="71.23"
+ page="5" page_width="595.28" page_height="841.89"></component><component x="330.14" y="388.41"
345
+ width="226.7" height="50.31" page="5" page_width="595.28"
346
+ page_height="841.89"></component><component x="330.14" y="329.86" width="225.78" height="50.31"
347
+ page="5" page_width="595.28" page_height="841.89"></component><component x="330.14" y="191.85"
348
+ width="225.79" height="39.85" page="5" page_width="595.28"
349
+ page_height="841.89"></component><component x="330.14" y="122.84" width="225.79" height="60.77"
350
+ page="5" page_width="595.28" page_height="841.89"></component><component x="67.12" y="692.7"
351
+ width="225.78" height="92.15" page="6" page_width="595.28"
352
+ page_height="841.89"></component><component x="67.12" y="625.6" width="225.78" height="50.31"
353
+ page="6" page_width="595.28" page_height="841.89"></component><component x="53.8" y="342.52"
354
+ width="239.32" height="261.54" page="6" page_width="595.28"
355
+ page_height="841.89"></component><component x="53.8" y="122.84" width="239.11" height="206.72"
356
+ page="6" page_width="595.28" page_height="841.89"></component><component x="316.81" y="536.37"
357
+ width="239.11" height="39.34" page="6" page_width="595.28"
358
+ page_height="841.89"></component><component x="316.81" y="376.11" width="239.11" height="133.49"
359
+ page="6" page_width="595.28" page_height="841.89"></component><component x="316.81" y="271.5"
360
+ width="239.11" height="91.65" page="6" page_width="595.28"
361
+ page_height="841.89"></component><component x="316.81" y="122.84" width="239.11" height="125.56"
362
+ page="6" page_width="595.28" page_height="841.89"></component><component x="55.39" y="134.11"
363
+ width="498.95" height="7.96" page="7" page_width="595.28"
364
+ page_height="841.89"></component><component x="53.8" y="124.03" width="237.98" height="481.22"
365
+ page="8" page_width="595.28" page_height="841.89"></component><component x="316.81" y="157.4"
366
+ width="239.11" height="445.82" page="8" page_width="595.28"
367
+ page_height="841.89"></component></section>
368
+ </pdf>
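The fixture above records each detected section's bounding boxes as `<component>` elements with absolute page coordinates. A minimal sketch of reading those attributes back with Ruby's stdlib REXML (the abbreviated XML snippet is illustrative, cut down from the fixture above):

```ruby
require 'rexml/document'

# Parse component bounding boxes from fixture-style XML.
# This snippet is an abbreviated stand-in for the full fixture.
xml = <<-XML
<pdf>
  <section name_ratio="0.172" reference_score="3.94">
    <component x="316.81" y="122.84" width="239.11" height="83.71" page="3"/>
  </section>
</pdf>
XML

doc = REXML::Document.new(xml)
components = REXML::XPath.match(doc, "//component").map do |c|
  { :page  => c.attributes["page"].to_i,
    :x     => c.attributes["x"].to_f,
    :width => c.attributes["width"].to_f }
end
puts components.inspect
```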
@@ -28,14 +28,14 @@ module PdfExtract
   def self.include_in pdf
     deps = [:regions, :bodies]
     pdf.spatials :columns, :paged => true, :depends_on => deps do |parser|
-
+
       body = nil
      body_regions = []

       parser.before do
         body_regions = []
       end
-
+
       parser.objects :bodies do |b|
         body = b
       end
@@ -48,7 +48,7 @@ module PdfExtract

       parser.after do
         column_sample_count = pdf.settings[:column_sample_count]
-
+
         step = 1.0 / (column_sample_count + 1)
         column_ranges = []

@@ -59,10 +59,14 @@ module PdfExtract

         # Discard those with a coverage of 0.
         column_ranges.reject! { |r| r.covered.zero? }
-
+
         # Discard those with more than x columns. They've probably hit a table.
         column_ranges.reject! { |r| r.count > pdf.settings[:max_column_count] }

+        # Discard ranges that consist only of very narrow columns.
+        # Likely tables or columns picking up on false tab stops.
+        column_ranges.reject! { |r| r.widest < (0.25 * body[:width]) }
+
         if column_ranges.count.zero?
           []
         else
@@ -79,7 +83,7 @@ module PdfExtract
           end
         end
       end
-
+
     end
   end

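The new filter added to columns.rb can be shown in isolation. This is a minimal sketch under assumptions: the `ColumnRange` struct and sample numbers are illustrative stand-ins, not the gem's real objects; only the reject! line is taken from the change above.

```ruby
# Sketch of the narrow-column filter: a candidate range survives only
# if its widest column is at least a quarter of the body width.
# Narrower ranges are likely tables or false tab stops.
ColumnRange = Struct.new(:widest, :count, :covered)

body_width = 400.0
column_ranges = [
  ColumnRange.new(180.0, 2, 0.9), # plausible pair of text columns
  ColumnRange.new(60.0, 5, 0.8)   # five narrow strips, likely a table
]

column_ranges.reject! { |r| r.widest < (0.25 * body_width) }
puts column_ranges.map(&:widest).inspect
```

With a 400pt body the threshold is 100pt, so only the range whose widest column is 180pt survives.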
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 require_relative '../language'
 require_relative '../spatial'
 require_relative '../kmeans'
@@ -10,16 +11,19 @@ module PdfExtract
     :module => self.name,
     :description => "Minimum ratio of text region width to containing column width for a text region to be considered as part of an article section."
   }
-
+
   def self.match? a, b
-    lh = a[:line_height].round(2) == b[:line_height].round(2)
-    f = a[:font] == b[:font]
-    lh && f
+    # A must have a width around the width of B and have the same
+    # font size.
+    avg_width = (a[:width] + b[:width]) / 2.0
+    matched_width = (a[:width] - b[:width]).abs <= avg_width * 0.1
+    matched_font_size = a[:line_height].round(2) == b[:line_height].round(2)
+    matched_width && matched_font_size
   end

   def self.candidate? pdf, region, column
     # Regions that make up sections or headers must be
-    # both less width than their column width and,
+    # both less wide than their column width and,
     # unless they are a single line, must be within the
     # width_ratio.
     width_ratio = pdf.settings[:width_ratio]
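The revised `match?` above replaces the exact font-name comparison with a width tolerance: two regions belong to the same section when their widths differ by at most 10% of their average width and their line heights agree to two decimal places. A self-contained sketch (the region hashes are illustrative, not the gem's real objects):

```ruby
# Stand-alone version of the revised matching rule from sections.rb.
def match?(a, b)
  avg_width = (a[:width] + b[:width]) / 2.0
  matched_width = (a[:width] - b[:width]).abs <= avg_width * 0.1
  matched_font_size = a[:line_height].round(2) == b[:line_height].round(2)
  matched_width && matched_font_size
end

a = { :width => 239.1, :line_height => 7.96 }
b = { :width => 238.0, :line_height => 7.96 } # ~0.5% narrower: matches
c = { :width => 120.0, :line_height => 7.96 } # half the width: rejected

puts match?(a, b)
puts match?(a, c)
```

Tolerating small width differences lets ragged last lines of justified paragraphs still merge, while a half-width region (a different layout element) is kept separate.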
@@ -27,13 +31,23 @@ module PdfExtract
     within_column && (region[:width].to_f / column[:width]) >= width_ratio
   end

+  def self.possible_header? pdf, region, column
+    # Possible headers are narrower than the column width_ratio
+    # but still within the column bounds. They must also be at least
+    # as wide as they are tall (otherwise we may have a table
+    # column, which should be ignored for purposes of determining
+    # page flow).
+    within_column = region[:width] <= column[:width]
+    within_column && (region[:width] >= region[:height])
+  end
+
   def self.reference_cluster clusters
     # Find the cluster with name_ratio closest to 0.1
     # Those are our reference sections.
     ideal = 0.1
     ref_cluster = nil
     smallest_diff = 1
-
+
     clusters.each do |cluster|
       diff = (cluster[:centre][:name_ratio] - ideal).abs
       if diff < smallest_diff
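The geometry behind the new `possible_header?` test can be exercised on its own. A minimal sketch, assuming hash-based stand-ins for the gem's region and column objects (the `pdf` argument is omitted here because the body above never uses it):

```ruby
# A region qualifies as a possible header when it fits inside the
# column and is at least as wide as it is tall; tall narrow boxes
# are more likely table columns and should not split page flow.
def possible_header?(region, column)
  within_column = region[:width] <= column[:width]
  within_column && (region[:width] >= region[:height])
end

column    = { :width => 239.0 }
header    = { :width => 120.0, :height => 12.0 } # short and wide
table_col = { :width => 20.0,  :height => 90.0 } # tall and narrow

puts possible_header?(header, column)
puts possible_header?(table_col, column)
```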
@@ -63,29 +77,29 @@ module PdfExtract
         :letter_ratio => Language.letter_ratio(content),
         :year_ratio => Language.year_ratio(content),
         :cap_ratio => Language.cap_ratio(content),
-        :name_ratio => Language.name_ratio(content),
+        :name_ratio => Language.name_ratio(content),
         :word_count => Language.word_count(content),
-        :lateness => (last_page / page_count.to_f)
+        :lateness => (last_page / page_count.to_f)
       })
     end
   end
-
+
   def self.include_in pdf
     pdf.spatials :sections, :depends_on => [:regions, :columns] do |parser|

       columns = []
-
+
       parser.objects :columns do |column|
-        columns << {:column => column, :regions => []}
+        columns << {:column => column, :regions => []}
       end

       parser.objects :regions do |region|
         containers = columns.reject do |c|
           column = c[:column]
-          not (column[:page] == region[:page] && Spatial.contains?(column, region))
+          not (column[:page] == region[:page] && Spatial.contains?(column, region, 1))
         end

-        containers.first[:regions] << region unless containers.count.zero?
+        containers.first[:regions] << region unless containers.empty?
       end

       parser.after do
@@ -107,36 +121,40 @@ module PdfExtract
         end

         sections = []
-        found = []
-
+        merging_region = nil
+
         pages.each_pair do |page, columns|
-          columns.each do |c|
-            column = c[:column]
-
-            c[:regions].each do |region|
-
+          columns.each do |container|
+            column = container[:column]
+
+            container[:regions].each do |region|
               if candidate? pdf, region, column
-                if !found.last.nil? && match?(found.last, region)
-                  content = Spatial.merge_lines(found.last, region, {})
-                  found.last.merge!(content)
+                if !merging_region.nil? && match?(merging_region, region)
+                  content = Spatial.merge_lines(merging_region, region, {})
+
+                  merging_region.merge!(content)

-                  found.last[:components] << Spatial.get_dimensions(region)
-
+                  merging_region[:components] << Spatial.get_dimensions(region)
+                elsif !merging_region.nil?
+                  sections << merging_region
+                  merging_region = region.merge({
+                    :components => [Spatial.get_dimensions(region)]
+                  })
                 else
-                  found << region.merge({
+                  merging_region = region.merge({
                     :components => [Spatial.get_dimensions(region)]
                   })
                 end
-              else
-                sections = sections + found
-                found = []
+              elsif possible_header? pdf, region, column
+                # Split sections, ignore the header
+                sections << merging_region if !merging_region.nil?
+                merging_region = nil
               end
-
             end
           end
         end

-        sections = sections + found
+        sections << merging_region if not merging_region.nil?

         # We now have sections. Add information to them.
         # add_content_types sections
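The hunk above replaces the `found` array with a single `merging_region` accumulator: matching regions are folded into the current section, and a non-matching candidate flushes the section and starts a new one. The control flow can be sketched without the gem's objects (the trivial `match?` and text hashes here are illustrative stand-ins, and string concatenation stands in for `Spatial.merge_lines`):

```ruby
# Single-pass section merging, mirroring the new accumulator logic.
def match?(a, b)
  a[:line_height] == b[:line_height]
end

regions = [
  { :text => "A1", :line_height => 8 },
  { :text => "A2", :line_height => 8 },  # continues the first section
  { :text => "B1", :line_height => 10 }  # starts a new section
]

sections = []
merging_region = nil

regions.each do |region|
  if !merging_region.nil? && match?(merging_region, region)
    # Fold the region into the section being built.
    merging_region[:text] += " " + region[:text]
  else
    # Flush the finished section and start a new one.
    sections << merging_region unless merging_region.nil?
    merging_region = region.dup
  end
end
sections << merging_region unless merging_region.nil?

puts sections.map { |s| s[:text] }.inspect
```

Keeping one open accumulator instead of a growing `found` array also makes the `possible_header?` branch simple: flushing is just `sections << merging_region; merging_region = nil`.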
@@ -155,7 +173,7 @@ module PdfExtract

         sections
       end
-
+
     end
   end