RubyGems - pdf_table_extractor - Versions diffs - 0.1.0 - Mend

pdf_table_extractor 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +7 -0
data/LICENSE +22 -0
data/README.md +137 -0
data/lib/pdf_table_extractor/pdf_table_extractor.rb +201 -0
data/lib/pdf_table_extractor/pdf_table_extractor_row.rb +106 -0
data/lib/pdf_table_extractor/version.rb +5 -0
data/lib/pdf_table_extractor.rb +6 -0
metadata +133 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: b35bb6b824eb69e6824b289b55794f355c5ae2e2ce342e8bb0130bfd269c1280
+  data.tar.gz: a42dba7aecb0f034a20738865c3ccf669bcb7c71c58ff787de543bf7bef7650e
+SHA512:
+  metadata.gz: 15c8eb4460907aab704bb81cfcc869f4847be223f6709c0203f575a51b2dad4a22a74266eccc7b00d7f29747a88a3405eab13bf275f2313d62c1b71faeddea30
+  data.tar.gz: 06611570f328ff938b0183f23bc93546294421a2a7f85f83e3878b21aa1412fbbb931ffe4183570f4c9ffd6a3bafce14f711ec2ce6ed52a8868c55619a5aa141
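
The checksums above can be compared against a downloaded copy of the package. The following is a minimal sketch using Ruby's standard `digest` library; `sha256_matches?` is a hypothetical helper name, not part of RubyGems, and the usage comment assumes the `.gem` archive has already been unpacked with `tar` so that `data.tar.gz` is on disk.

```ruby
require "digest"

# Hypothetical helper (not part of RubyGems): returns true when the SHA256
# hex digest of `bytes` equals `expected_hex`, ignoring letter case.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Usage sketch (assumes data.tar.gz was extracted from the .gem archive):
#   expected = "a42dba7aecb0f034a20738865c3ccf669bcb7c71c58ff787de543bf7bef7650e"
#   sha256_matches?(File.binread("data.tar.gz"), expected)
```

Reading the file in binary mode matters here, since the digest is computed over raw bytes.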

data/LICENSE ADDED

@@ -0,0 +1,22 @@
+MIT License
+Copyright (c) 2025 Marko Boskovic
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,137 @@
+# PDF Table Extractor
+A Ruby gem for extracting tables from PDF files by analyzing text spacing and positions. It parses PDF pages, removes headers/footers and pagination if configured, splits lines into cells based on multiple-space runs, and merges rows into table-like structures.
+- Source: https://github.com/jomb-ch/pdf_table_extractor
+- License: MIT
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'pdf_table_extractor'
+```
+Then install:
+```bash
+bundle install
+```
+Or install it yourself:
+```bash
+gem install pdf_table_extractor
+```
+## Usage
+### Basic usage
+```ruby
+require 'pdf_table_extractor'
+# Initialize with a PDF file path
+extractor = PdfTableExtractor.new('path/to/file.pdf')
+# Extract tables
+extractor.extract_tables
+# Get results: array of rows, each row is an array of cells
+rows = extractor.result
+# rows => [[{ text: 'Cell 1', position: 0 }, { text: 'Cell 2', position: 20 }], ...]
+```
+### Advanced usage with options
+```ruby
+extractor = PdfTableExtractor.new('path/to/file.pdf', options: {
+  remove_page_headers: true,          # Remove common leading lines across pages (default: true)
+  remove_page_footers: true,          # Remove common trailing lines across pages (default: true)
+  remove_pagination_from_header: false, # true or Integer (line number from top) to remove pagination; if true, tests first 5 lines (default: false)
+  remove_pagination_from_footer: false, # true or Integer (line number from bottom) to remove pagination; if true, tests last 5 lines (default: false)
+  remove_empty_lines: true,           # Filter out empty lines (default: true)
+  position_tolerance: 2               # Tolerance for matching column positions, allowing indentation within columns (default: 2)
+})
+extractor.extract_tables
+rows = extractor.result
+```
+### Using with PDF::Reader
+```ruby
+require 'pdf-reader'
+require 'pdf_table_extractor'
+reader = PDF::Reader.new('path/to/file.pdf')
+extractor = PdfTableExtractor.new(reader: reader)
+extractor.extract_tables
+rows = extractor.result
+```
+## How it works
+The gem uses several heuristics to identify table-like structures:
+- Text Positioning: splits lines into cells using multiple spaces as separators and tracks each cell's starting position
+- Row Congruence: considers rows to belong to the same table if their cell positions match (or are a subset) within a given tolerance
+- Header/Footer Removal: optionally removes common leading/trailing lines across pages
+- Pagination Handling: optionally removes page numbers from headers/footers
+### Constraints
+- Single-row tables are joined into a single cell before further processing
+- The first row of a table cannot have empty cells
+- Multi-cell rows followed by rows with fewer cells (but matching positions) are considered part of the same table
+- Trailing rows of multi-cell tables with content only in the first cell are treated as new single-cell tables if the content length is larger than the position of the second column minus 2
+## Development
+Set up the project:
+```bash
+bundle install
+```
+Run tests:
+```bash
+bundle exec rspec
+```
+Run linters:
+```bash
+bundle exec standardrb
+bundle exec rubocop --parallel
+```
+Generate API docs (YARD):
+```bash
+bundle exec yard doc
+open doc/index.html
+```
+Release a new version:
+1. Update the version number in `lib/pdf_table_extractor/version.rb`
+2. Build and release:
+```bash
+bundle exec rake release
+```
+## CI
+A GitHub Actions workflow runs StandardRB, RuboCop, and RSpec and generates YARD docs on pushes and pull requests. The docs are uploaded as an artifact.
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/jomb-ch/pdf_table_extractor
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
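
The "Text Positioning" heuristic in the README (split on runs of multiple spaces, record each cell's starting column) can be illustrated without a PDF. This is a standalone sketch under that reading; `split_into_cells` is a hypothetical name and is not the gem's API, although it follows the same idea as the private `parse_line_to_cells` method in this release.

```ruby
# Hypothetical standalone helper, not part of the gem's public API.
# Splits a line on runs of two or more whitespace characters and records
# the column at which each cell starts.
def split_into_cells(line)
  cells = []
  position = 0
  # The capture group keeps the separators in the result, so their
  # lengths can be counted towards the running column position.
  line.split(/(\s{2,})/).each do |chunk|
    separator = chunk.match?(/\A\s{2,}\z/)
    cells << { text: chunk, position: position } unless chunk.empty? || separator
    position += chunk.length
  end
  cells
end

line = "Name" + " " * 6 + "Qty" + " " * 3 + "Price"
p split_into_cells(line)
```

A single space does not split a cell, so a value such as `Unit price` stays in one cell; only gaps of two or more whitespace characters act as column separators.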

data/lib/pdf_table_extractor/pdf_table_extractor.rb ADDED

@@ -0,0 +1,201 @@
+# frozen_string_literal: true
+require 'pdf-reader'
+# PdfTableExtractor extracts tables from PDF text using spacing and position heuristics.
+#
+# @!attribute [r] rows
+#   @return [Array<PdfTableExtractorRow>] raw rows parsed from text
+# @!attribute [r] merged_rows
+#   @return [Array<PdfTableExtractorRow>] rows merged into tables
+# @!attribute [r] options
+#   @return [Hash] configuration options
+class PdfTableExtractor
+  attr_reader :rows, :merged_rows, :options
+  # @param pdf_path [String, nil] Path to the PDF file (optional when reader is provided)
+  # @param reader [PDF::Reader, nil] Pre-initialized PDF::Reader instance
+  # @param options [Hash] Configuration options
+  # @option options [Boolean] :remove_page_headers (true) Remove common leading lines across pages
+  # @option options [Boolean] :remove_page_footers (true) Remove common trailing lines across pages
+  # @option options [Boolean, Integer] :remove_pagination_from_header (false) true or line number from top
+  # @option options [Boolean, Integer] :remove_pagination_from_footer (false) true or line number from bottom
+  # @option options [Boolean] :remove_empty_lines (true) Remove empty lines from the extracted text
+  # @option options [Integer] :position_tolerance (2) Tolerance for matching column positions
+  def initialize(pdf_path = nil, reader: nil, options: {})
+    @reader = reader || PDF::Reader.new(pdf_path)
+    @options = options
+    @options[:remove_page_headers] = true unless @options.key?(:remove_page_headers)
+    @options[:remove_page_footers] = true unless @options.key?(:remove_page_footers)
+    @options[:remove_pagination_from_header] = false unless @options.key?(:remove_pagination_from_header)
+    @options[:remove_pagination_from_footer] = false unless @options.key?(:remove_pagination_from_footer)
+    @options[:remove_empty_lines] = true unless @options.key?(:remove_empty_lines)
+    @options[:position_tolerance] = 2 unless @options.key?(:position_tolerance)
+    @merged_rows = []
+  end
+  # Extracts tables from the PDF and stores them in @merged_rows
+  # @return [void]
+  def extract_tables
+    pages = all_pages
+    pages = remove_pagination(pages) if @options[:remove_pagination_from_header] || @options[:remove_pagination_from_footer]
+    pages = remove_common_leading_lines(pages) if @options[:remove_page_headers]
+    pages = remove_common_trailing_lines(pages) if @options[:remove_page_footers]
+    lines = pages&.flatten
+    lines = lines&.reject { |l| l.strip.empty? } if @options[:remove_empty_lines]
+    @rows = lines&.map&.with_index do |line, index|
+      cells, positions = parse_line_to_cells(line)
+      PdfTableExtractorRow.new(self, cells, positions, index)
+    end
+    process_rows
+  end
+  # Returns the extracted cells as arrays per merged row.
+  # @return [Array<Array<Hash>>] Array of rows, each row is an array of cells with :text and :position
+  def result
+    @merged_rows.map(&:cells)
+  end
+  private
+  # @return [Array<Array<String>>] Array of pages, each page is an array of lines
+  def all_pages
+    all_pages_texts.map { |page_text| page_text.lines.map(&:chomp) }
+  end
+  # @return [Array<String>] Array of page texts
+  def all_pages_texts
+    @reader.pages.map { |page| page.text }
+  end
+  # Remove pagination lines from headers and footers.
+  # @param pages [Array<Array<String>>]
+  # @return [Array<Array<String>>]
+  def remove_pagination(pages)
+    return pages if pages.empty? || pages.length == 1
+    if @options[:remove_pagination_from_header]
+      if @options[:remove_pagination_from_header].is_a?(Integer)
+        index = @options[:remove_pagination_from_header] - 1
+        pages.each do |lines|
+          if lines.length > index && is_pagination?(lines[index])
+            lines.delete_at(index)
+          end
+        end
+      else
+        (1..5).each do |index|
+          if pages.all? { |lines| lines.length > index && is_pagination?(lines[index - 1]) }
+            pages.each { |lines| lines.delete_at(index - 1) }
+            break
+          end
+        end
+      end
+    end
+    if @options[:remove_pagination_from_footer]
+      if @options[:remove_pagination_from_footer].is_a?(Integer)
+        index = @options[:remove_pagination_from_footer]
+        pages.each do |lines|
+          if lines.length > index && is_pagination?(lines[-index])
+            lines.delete_at(-index)
+          end
+        end
+      else
+        (1..5).each do |index|
+          if pages.all? { |lines| lines.length > index && is_pagination?(lines[-index]) }
+            pages.each { |lines| lines.delete_at(-index) }
+            break
+          end
+        end
+      end
+    end
+    pages
+  end
+  # Remove common leading lines across pages.
+  # @param pages [Array<Array<String>>]
+  # @return [Array<Array<String>>]
+  def remove_common_leading_lines(pages)
+    return pages if pages.empty? || pages.length == 1
+    pages.each(&:shift) while same_leading_line?(pages)
+    pages
+  end
+  # Remove common trailing lines across pages.
+  # @param pages [Array<Array<String>>]
+  # @return [Array<Array<String>>]
+  def remove_common_trailing_lines(pages)
+    return pages if pages.empty? || pages.length == 1
+    pages.each(&:pop) while same_trailing_line?(pages)
+    pages
+  end
+  # Parse a line into cells using runs of multiple spaces as separators.
+  # @param line [String]
+  # @return [Array<(Array<Hash>, Array<Integer>)>] cells and positions
+  def parse_line_to_cells(line)
+    if has_consecutive_spaces?(line)
+      cells = []
+      position = 0
+      positions = []
+      line.split(/(\s{2,})/).each do |text|
+        if has_consecutive_spaces?(text)
+          position += text.length
+        elsif !text.empty?
+          cells << {text:, position:}
+          positions << position
+          position += text.length
+        end
+      end
+      [cells, positions]
+    else
+      [[{text: line.strip, position: 0}], [0]]
+    end
+  end
+  # Process parsed rows into merged table rows.
+  # @return [void]
+  def process_rows
+    @merged_rows = []
+    @rows.each do |row|
+      row.transform_to_single_cell! if row.incongruent_with_neighbours?
+      if @merged_rows.empty? || !row.congruent_with_last_merged?
+        @merged_rows << PdfTableExtractorRow.new(self, row.cells, row.positions, nil, true)
+      else
+        @merged_rows.last.merge!(row)
+      end
+    end
+  end
+  # @param pages [Array<Array<String>>]
+  # @return [Boolean]
+  def same_leading_line?(pages)
+    pages.all? { |lines| lines.any? } && pages.all? { |lines| lines[0] == pages[0][0] }
+  end
+  # @param pages [Array<Array<String>>]
+  # @return [Boolean]
+  def same_trailing_line?(pages)
+    pages.all? { |lines| lines.any? } && pages.all? { |lines| lines[-1] == pages[0][-1] }
+  end
+  # @param text [String]
+  # @return [Boolean]
+  def has_consecutive_spaces?(text)
+    text.match?(/\s{2,}/)
+  end
+  # @param line [String, nil]
+  # @return [Boolean]
+  def is_pagination?(line)
+    line&.strip&.match?(/^.*\d+$/)
+  end
+end
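
The header removal performed by `remove_common_leading_lines` can be tried in isolation on plain arrays of strings. The sketch below is an illustration, not the gem's code: `strip_common_leading_lines` is a hypothetical name, and unlike the method in the diff it copies each page first so the caller's arrays are left untouched.

```ruby
# Hypothetical standalone helper, not part of the gem.
# Drops leading lines for as long as every page starts with the same line.
def strip_common_leading_lines(pages)
  pages = pages.map(&:dup)       # work on copies, leave the input untouched
  return pages if pages.length < 2

  # Stop as soon as any page is empty or the first lines differ.
  while pages.all?(&:any?) && pages.all? { |lines| lines.first == pages.first.first }
    pages.each(&:shift)
  end
  pages
end

pages = [
  ["ACME Report", "2025", "Item A"],
  ["ACME Report", "2025", "Item B"]
]
p strip_common_leading_lines(pages)
```

With a single page there is nothing to compare against, so the input is returned unchanged, which mirrors the early return in the diff.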

data/lib/pdf_table_extractor/pdf_table_extractor_row.rb ADDED

@@ -0,0 +1,106 @@
+# frozen_string_literal: true
+# Represents a parsed row of cells with positions and supports merging/grouping.
+#
+# @!attribute [r] positions
+#   @return [Array<Integer>] positions of the cells in this row
+# @!attribute [r] index
+#   @return [Integer, nil] original index when parsed (nil for merged rows)
+# @!attribute [r] cells
+#   @return [Array<Hash>] cells with :text and :position
+# @!attribute [r] merged
+#   @return [Boolean] whether this row is a merged row (not original)
+class PdfTableExtractorRow
+  attr_reader :positions, :index, :cells, :merged
+  def initialize(extractor, cells, positions, index, merged = false)
+    @extractor = extractor
+    @cells = cells
+    @index = index
+    @merged = merged
+    @positions = positions
+  end
+  # Whether this row is a single-cell row at position 0.
+  # @param against [Symbol] :previous or :last_merged
+  # @return [Boolean]
+  def single_cell?(against = :previous)
+    other_row = (against == :previous) ? prev : last_merged unless @merged
+    second_pos = other_row&.positions&.[](1).to_i
+    @positions == [0] && (
+      @merged || @index == 0 || other_row.single_cell? || @cells.first[:text].length > second_pos - 2
+    )
+  end
+  # Previous row in extractor.
+  # @return [PdfTableExtractorRow, nil]
+  def prev
+    return nil if @index.zero?
+    @extractor.rows[@index - 1]
+  end
+  # Next row in extractor.
+  # @return [PdfTableExtractorRow, nil]
+  def nxt
+    return nil if @index == @extractor.rows.length - 1
+    @extractor.rows[@index + 1]
+  end
+  # @return [Boolean] whether positions are congruent with previous row
+  def congruent_with_previous?
+    single_cell? == prev&.single_cell? && positions_match_with?(prev)
+  end
+  # @return [Boolean] whether positions are congruent with last merged row
+  def congruent_with_last_merged?
+    single_cell?(:last_merged) == last_merged&.single_cell? && positions_match_with?(last_merged)
+  end
+  # @return [Boolean] whether this row is incongruent relative to neighbours
+  def incongruent_with_neighbours?
+    !single_cell? && !(prev && congruent_with_previous?) && !nxt&.congruent_with_previous?
+  end
+  # Check if positions match with another row (within tolerance).
+  # @param other_row [PdfTableExtractorRow, nil]
+  # @return [Boolean]
+  def positions_match_with?(other_row)
+    @positions.each do |pos|
+      if @extractor.options[:position_tolerance].zero?
+        return false unless other_row&.positions.to_a.include?(pos)
+      elsif ((pos..pos + @extractor.options[:position_tolerance]).to_a - other_row&.positions.to_a).length == @extractor.options[:position_tolerance] + 1
+        return false
+      end
+    end
+    true
+  end
+  # Transform this row into a single-cell row by merging text.
+  # @return [void]
+  def transform_to_single_cell!
+    @cells = [{
+      text: @cells.sort_by { |c| c[:position] }.map { |c| c[:text] }.join(" ").gsub(/\s+/, " ").strip,
+      position: 0
+    }]
+    @positions = [0]
+  end
+  # Merge text into matching cell positions from another row.
+  # @param row [PdfTableExtractorRow]
+  # @return [void]
+  def merge!(row)
+    @cells.each do |cell|
+      r_cell = row.cells.find { |c| c[:position] == cell[:position] }
+      cell[:text] += " #{r_cell[:text]}" if r_cell
+    end
+  end
+  private
+  # @return [PdfTableExtractorRow, nil]
+  def last_merged
+    @extractor.merged_rows.last
+  end
+end
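
The README describes row congruence as cell positions matching "(or are a subset) within a given tolerance". One possible reading of that rule is sketched below: a position counts as matched when the other row has a cell starting anywhere from that column up to `tolerance` columns to the right. The function name and the keyword argument are hypothetical, and the direction of the tolerance window is an assumption taken from the `pos..pos + tolerance` range in the diff.

```ruby
# Hypothetical standalone helper, not part of the gem.
# True when every position in `positions` has a counterpart in
# `other_positions` within the window pos..(pos + tolerance).
def positions_match?(positions, other_positions, tolerance: 2)
  positions.all? do |pos|
    (pos..pos + tolerance).any? { |candidate| other_positions.include?(candidate) }
  end
end

p positions_match?([0, 10], [0, 11, 20])                # 11 lies within 10..12
p positions_match?([0, 10], [0, 14])                    # nothing within 10..12
p positions_match?([0, 10], [0, 10, 20], tolerance: 0)  # exact subset
```

Because only the positions of the first argument are checked, a row with fewer cells can still match a wider row, which corresponds to the "subset" wording in the README.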

data/lib/pdf_table_extractor/version.rb ADDED

@@ -0,0 +1,5 @@
+# frozen_string_literal: true
+class PdfTableExtractor
+  VERSION = '0.1.0'.freeze
+end

data/lib/pdf_table_extractor.rb ADDED

@@ -0,0 +1,6 @@
+# frozen_string_literal: true
+require_relative 'pdf_table_extractor/version'
+require_relative 'pdf_table_extractor/pdf_table_extractor'
+require_relative 'pdf_table_extractor/pdf_table_extractor_row'

metadata ADDED

@@ -0,0 +1,133 @@
+--- !ruby/object:Gem::Specification
+name: pdf_table_extractor
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Marko Boskovic
+bindir: bin
+cert_chain: []
+date: 1980-01-02 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: pdf-reader
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.8'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.8'
+- !ruby/object:Gem::Dependency
+  name: rake
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '13.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.0'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.64'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.64'
+- !ruby/object:Gem::Dependency
+  name: standard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.36'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.36'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+description: Extracts tables from PDF text using spacing and position heuristics.
+email:
+- marko@jomb.ch
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- LICENSE
+- README.md
+- lib/pdf_table_extractor.rb
+- lib/pdf_table_extractor/pdf_table_extractor.rb
+- lib/pdf_table_extractor/pdf_table_extractor_row.rb
+- lib/pdf_table_extractor/version.rb
+homepage: https://github.com/jomb-ch/pdf_table_extractor
+licenses:
+- MIT
+metadata:
+  rubygems_mfa_required: 'true'
+  source_code_uri: https://github.com/jomb-ch/pdf_table_extractor
+  changelog_uri: https://github.com/jomb-ch/pdf_table_extractor/releases
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.1'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.7.1
+specification_version: 4
+summary: PDF table extractor
+test_files: []
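
The runtime dependency in the metadata above uses the pessimistic constraint `~> 2.8` for `pdf-reader`. As a small sketch of what that operator admits, the snippet below uses `Gem::Requirement` and `Gem::Version` from RubyGems; the helper name and the version strings passed to it are illustrative assumptions, not a list of actual `pdf-reader` releases.

```ruby
require "rubygems"

# The constraint as declared in the gem's metadata.
PDF_READER_REQUIREMENT = Gem::Requirement.new("~> 2.8")

# Hypothetical helper: true when `version` satisfies the constraint.
def satisfies_pdf_reader_requirement?(version)
  PDF_READER_REQUIREMENT.satisfied_by?(Gem::Version.new(version))
end

# Illustrative version strings only.
%w[2.7.9 2.8.0 2.12.0 3.0.0].each do |version|
  puts "#{version}: #{satisfies_pdf_reader_requirement?(version)}"
end
```

With two segments, `~> 2.8` behaves like `>= 2.8` combined with `< 3.0`, so later minor releases are accepted and the next major version is not.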