RubyGems - pdf_ocr - Versions diffs - 0.1.0 - Mend

pdf_ocr 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 32727eeb24656d1fce7cb43f2f5192f29cfda53192ef161cfae047f2871f6bff
+  data.tar.gz: 558586ded2489faf79ce7f36ee1ab6df267d9dc30d67e6ba554be61bde959e19
+SHA512:
+  metadata.gz: c02b99bb1e652fe8c26ad80ed8dc4652c8eab5cc9a8bb4699b656080066772f811ee66ced0faa584f9a526322620c9e628f3a47c194f54b900706f968274c4dc
+  data.tar.gz: 9d7fea0ffe63fb2c10825d906831fb70dce2f1ab3d3d0c02c814dbd499c81fa906f23bc27e794d5ed3670b381c86b9da1b6f1abc091392684f5c17f97be000b4
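
The digests above cover the two archives packed inside the `.gem` file. As a minimal sketch (not part of the package), an unpacked archive could be checked against the recorded SHA256 value as follows; `digest_matches?` and the constant name are hypothetical and exist only for this illustration.

```ruby
require "digest"

# Hypothetical helper: true when the file's SHA256 hex digest equals the
# expected value. The expected value is lowercased before comparing.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Expected value copied from the data.tar.gz entry in checksums.yaml above.
EXPECTED_DATA_SHA256 =
  "558586ded2489faf79ce7f36ee1ab6df267d9dc30d67e6ba554be61bde959e19"

# Example call, assuming data.tar.gz has been unpacked next to this script:
#   digest_matches?("data.tar.gz", EXPECTED_DATA_SHA256)
```

The same pattern applies to `metadata.gz` and to the SHA512 entries by swapping in `Digest::SHA512`.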

data/.rspec ADDED

@@ -0,0 +1 @@
+--require spec_helper

data/Gemfile ADDED

@@ -0,0 +1,13 @@
+# frozen_string_literal: true
+source "https://rubygems.org"
+# Specify your gem's dependencies in ocr.gemspec
+gemspec
+gem "rake", "~> 13.0"
+gem "rspec"
+gem "pdf-reader"
+gem "mini_magick"
+gem "byebug"
+gem "rtesseract"

data/Gemfile.lock ADDED

@@ -0,0 +1,63 @@
+PATH
+  remote: .
+  specs:
+    pdf_ocr (0.1.0)
+      mini_magick
+      pdf-reader
+      rtesseract
+GEM
+  remote: https://rubygems.org/
+  specs:
+    Ascii85 (2.0.1)
+    afm (0.2.2)
+    bigdecimal (3.3.1)
+    byebug (12.0.0)
+    diff-lcs (1.6.2)
+    hashery (2.1.2)
+    mini_magick (4.13.2)
+    mini_portile2 (2.8.9)
+    nokogiri (1.18.10)
+      mini_portile2 (~> 2.8.2)
+      racc (~> 1.4)
+    pdf-reader (2.15.0)
+      Ascii85 (>= 1.0, < 3.0, != 2.0.0)
+      afm (>= 0.2.1, < 2)
+      hashery (~> 2.0)
+      ruby-rc4
+      ttfunk
+    racc (1.8.1)
+    rake (13.3.0)
+    rspec (3.13.1)
+      rspec-core (~> 3.13.0)
+      rspec-expectations (~> 3.13.0)
+      rspec-mocks (~> 3.13.0)
+    rspec-core (3.13.5)
+      rspec-support (~> 3.13.0)
+    rspec-expectations (3.13.5)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-mocks (3.13.6)
+      diff-lcs (>= 1.2.0, < 2.0)
+      rspec-support (~> 3.13.0)
+    rspec-support (3.13.6)
+    rtesseract (2.2.0)
+      nokogiri
+    ruby-rc4 (0.1.5)
+    ttfunk (1.8.0)
+      bigdecimal (~> 3.1)
+PLATFORMS
+  x86_64-linux
+DEPENDENCIES
+  byebug
+  mini_magick
+  pdf-reader
+  pdf_ocr!
+  rake (~> 13.0)
+  rspec
+  rtesseract
+BUNDLED WITH
+   2.4.12

data/README.md ADDED

@@ -0,0 +1,138 @@
+# OCR
+A lightweight Ruby gem for extracting text from PDFs, including scanned PDFs using OCR.
+This gem supports:
+- PDFs with readable text
+- Scanned PDFs using Tesseract OCR
+- File objects, file paths, StringIO, and Rails/ActiveStorage uploads
+- Fully Rails-independent
+---
+## 🚀 Features
+- Detect if PDF is scanned or text-based
+- Extract text from normal PDFs using `PDF::Reader`
+- Extract text from scanned PDFs using `RTesseract` and `MiniMagick`
+- Automatic cleanup of temporary images
+---
+## 💻 Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'pdf_ocr', git: 'https://github.com/your_username/ocr.git'
+```
+Or install directly:
+```
+gem install pdf_ocr
+```
+## Dependencies
+- PDF::Reader
+- RTesseract
+- MiniMagick
+- Tesseract OCR (system-level executable)
+- pdftoppm from Poppler utils (for converting PDF pages to images)
+## ⚙️ Usage
+```ruby
+require 'ocr'
+require 'stringio'
+# From a File object
+file = File.open("path/to/document.pdf")
+result = Ocr::DataExtractor.new(file).call
+puts result["raw_text"] if result["success"]
+# From a file path string
+result = Ocr::DataExtractor.new("path/to/document.pdf").call
+# From a StringIO object (in-memory PDF)
+pdf_data = StringIO.new(File.read("path/to/document.pdf"))
+result = Ocr::DataExtractor.new(pdf_data).call
+```
+## Example Result
+```ruby
+{
+  "success" => true,
+  "raw_text" => "Extracted text content from PDF ..."
+}
+```
+- If OCR fails for a scanned PDF:
+```ruby
+{
+  "success" => false,
+  "message" => "Unable to extract text using OCR"
+}
+```
+## 🔧 Notes
+1. Ensure Tesseract OCR is installed on your system:
+```
+# Ubuntu/Debian
+sudo apt install tesseract-ocr
+# MacOS (with Homebrew)
+brew install tesseract
+```
+2. Ensure pdftoppm is installed (for PDF-to-image conversion):
+```
+# Ubuntu/Debian
+sudo apt install poppler-utils
+# MacOS (with Homebrew)
+brew install poppler
+```
+3. This gem does not require Rails, but it will work with Rails ActiveStorage objects that respond to .open.
+## 🧪 Running Tests
+```
+bundle install
+bundle exec rspec
+```
+The test suite is intended to cover:
+- PDFs with selectable text
+- Scanned PDFs
+- Malformed PDFs (fallback to OCR)
+## 📝 Contributing
+- Fork the repository
+- Create your feature branch (git checkout -b your-feature)
+- Commit your changes (git commit -am 'Add new feature')
+- Push to the branch (git push origin your-feature)
+- Open a Pull Request
+## 📝 License
+MIT License © RaviShankarSinghal
+---
+This version includes:
+- Version and build badges (replace with your repo info)
+- Clear installation instructions
+- Usage examples for File, path, and StringIO
+- System dependencies
+- Test instructions
+- Contributing guidelines
+---
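
The README lists StringIO among the supported inputs, while the OCR path in `lib/ocr/data_extractor.rb` hands its input to `pdftoppm`, which reads from a file path. One way to bridge the two, sketched here as an assumption rather than as the gem's behavior, is to spool a path-less IO into a temporary file first. `with_pdf_path` is a hypothetical helper name.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: yields a filesystem path for the given source.
# Objects that already expose #path are used directly; any other readable
# IO is copied into a temp file that is removed once the block returns.
def with_pdf_path(source)
  return yield(source.path) if source.respond_to?(:path)

  Tempfile.create(["ocr_input", ".pdf"]) do |tmp|
    tmp.binmode
    source.rewind if source.respond_to?(:rewind)
    IO.copy_stream(source, tmp)
    tmp.flush
    yield tmp.path
  end
end

# Usage with an in-memory document. The bytes are a placeholder, not a valid PDF.
with_pdf_path(StringIO.new("%PDF-1.4 placeholder")) do |path|
  puts File.size(path)
end
```

Scoping the temp file to a block keeps cleanup in one place, in the same spirit as the `ensure` block the extractor uses for page images.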

data/Rakefile ADDED

@@ -0,0 +1,4 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+task default: %i[]

data/lib/ocr/data_extractor.rb ADDED

@@ -0,0 +1,122 @@
+require "mini_magick"
+require "pdf/reader"
+require "rtesseract"
+require "securerandom"
+require "shellwords"
+require "tmpdir"
+module Ocr
+  class DataExtractor
+    def initialize(document)
+      @document = document
+    end
+    def call
+      ocr_data(@document)
+    end
+    private
+    def ocr_data(document)
+      extracted_text = ""
+      is_scanned = false
+      file = get_file_from(document)
+      reader = if file.respond_to?(:path)
+                PDF::Reader.new(file.path)
+              else
+                PDF::Reader.new(file)
+              end
+      reader.pages.each do |page|
+        page_text = safe_page_text(page)
+        extracted_text << " " << page_text
+        if page_text.strip.empty? || mostly_junk?(page_text)
+          is_scanned = true
+          break
+        end
+      end
+      if is_scanned || scanned_pdf?(extracted_text)
+        scanned_pdf_ocr(file)
+      else
+        { "success" => true, "raw_text" => extracted_text.strip }
+      end
+    rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError => e
+      log_warning "PDF parsing failed: #{e.message}"
+      scanned_pdf_ocr(file)
+    end
+    def get_file_from(document)
+      return document.tap(&:open) if document.respond_to?(:open)
+      return document if document.is_a?(File)
+      return document if document.respond_to?(:read)
+      return File.open(document) if document.is_a?(String)
+      raise ArgumentError, "Unsupported document type: #{document.class}"
+    end
+    def safe_page_text(page)
+      page.text.to_s.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
+    rescue
+      ""
+    end
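+    def scanned_pdf?(text)
+      return true if text.empty?
+      # "\s" in a double-quoted string is a literal space, so every character
+      # other than ASCII letters, digits, and spaces (newlines included) counts as junk.
+      junk_ratio = text.count("^A-Za-z0-9\s").to_f / text.size
+      junk_ratio > 0.5 || text.size < 100
+    end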
+    def mostly_junk?(text)
+      return true if text.empty?
+      text.scan(/[A-Za-z]/).count < (text.size * 0.2)
+    end
+    def scanned_pdf_ocr(file)
+      # pdftoppm reads from disk, so prefer file.path when it is available
+      source = file.respond_to?(:path) ? file.path : file
+      images = convert_pdf_to_images(source)
+      full_text = images.map { |img| extract_text(img) }.join(" ").strip
+      if full_text.empty?
+        { "success" => false, "message" => "Unable to extract text using OCR" }
+      else
+        { "success" => true, "raw_text" => full_text }
+      end
+    ensure
+      cleanup(images)
+    end
+    def convert_pdf_to_images(pdf_path)
+      output_prefix = File.join(Dir.tmpdir, "ocr_page_#{SecureRandom.hex(4)}")
+      # Argument-list form of system bypasses the shell, so no escaping is needed
+      system("pdftoppm", "-png", "-r", "300", pdf_path.to_s, output_prefix)
+      Dir["#{output_prefix}-*.png"]
+    end
+    def extract_text(image_path)
+      RTesseract.new(image_path, lang: "eng", processor: "mini_magick").to_s
+    rescue => e
+      log_warning "OCR failed on #{image_path}: #{e.message}"
+      ""
+    end
+    def cleanup(images)
+      images&.each { |img| File.delete(img) if File.exist?(img) }
+    end
+    def log_warning(message)
+      if defined?(Rails)
+        Rails.logger.warn(message)
+      else
+        warn(message)
+      end
+    end
+  end
+end
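
To see how the two text-quality checks in `DataExtractor` behave, the sketch below copies them into standalone methods (the originals are private) and runs them on a few strings. The sample inputs are made up for illustration.

```ruby
# Standalone copies of the private heuristics from lib/ocr/data_extractor.rb.

# True when more than half of the characters are neither ASCII alphanumerics
# nor spaces, or when the text is shorter than 100 characters.
def scanned_pdf?(text)
  return true if text.empty?
  junk_ratio = text.count("^A-Za-z0-9\s").to_f / text.size
  junk_ratio > 0.5 || text.size < 100
end

# True when ASCII letters make up less than 20% of the text.
def mostly_junk?(text)
  return true if text.empty?
  text.scan(/[A-Za-z]/).count < (text.size * 0.2)
end

prose = "The quick brown fox jumps over the lazy dog. " * 3 # 135 characters
noise = "12-34/56:78;" * 10                                  # 120 characters, no letters

p scanned_pdf?(prose)        # => false
p mostly_junk?(prose)        # => false
p mostly_junk?(noise)        # => true
p scanned_pdf?("Invoice 42") # => true, because the text is under 100 characters
```

Note that the length check makes any short document look scanned, so a one-line text PDF would be routed to OCR under these rules.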

data/lib/ocr/version.rb ADDED

@@ -0,0 +1,5 @@
+# frozen_string_literal: true
+module Ocr
+  VERSION = "0.1.0"
+end

data/lib/ocr.rb ADDED

@@ -0,0 +1,9 @@
+# frozen_string_literal: true
+require_relative "ocr/version"
+require_relative "ocr/data_extractor"
+module Ocr
+  class Error < StandardError; end
+  # Your code goes here...
+end

data/ocr.gemspec ADDED

@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+require_relative "lib/ocr/version"
+Gem::Specification.new do |spec|
+  spec.name          = "pdf_ocr"
+  spec.version       = Ocr::VERSION
+  spec.authors       = ["Ravi Shankar Singhal"]
+  spec.email         = ["ravi.singhal2308@gmail.com"]
+  spec.summary       = "A lightweight Ruby gem for extracting text from images using OCR."
+  spec.description   = "OCR is a Ruby gem that allows you to easily extract text from image files (JPG, PNG, PDF) using Tesseract OCR engine. It provides a simple, intuitive interface for integrating OCR capabilities into your Ruby or Rails applications."
+  spec.homepage      = "https://github.com/RaviShankarSinghal/ocr_gem"
+  spec.license       = "MIT"
+  spec.required_ruby_version = ">= 2.6.0"
+  spec.metadata = {
+    "homepage_uri"   => spec.homepage,
+    "source_code_uri" => "https://github.com/RaviShankarSinghal/ocr_gem",
+    "changelog_uri"   => "https://github.com/RaviShankarSinghal/ocr_gem/blob/main/CHANGELOG.md",
+    "documentation_uri" => "https://rubydoc.info/gems/ocr"
+  }
+  spec.files = Dir.chdir(__dir__) do
+    `git ls-files -z`.split("\x0").reject do |f|
+      (File.expand_path(f) == __FILE__) ||
+        f.start_with?(*%w[bin/ test/ spec/ features/ .git .circleci appveyor])
+    end
+  end
+  spec.bindir        = "exe"
+  spec.executables   = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
+  spec.require_paths = ["lib"]
+  # Common dependencies for OCR-based Ruby gems
+  # Runtime dependencies
+  spec.add_runtime_dependency "pdf-reader"
+  spec.add_runtime_dependency "mini_magick"
+  spec.add_runtime_dependency "rtesseract"
+  # Development dependencies
+  spec.add_development_dependency "rspec"
+  spec.add_development_dependency "byebug"
+end
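
The runtime dependencies above are declared without version constraints, which is why they appear as `>= 0` in the packaged metadata. As an illustration only, the sketch below uses RubyGems' `Gem::Requirement` to show what pessimistic constraints derived from the versions in Gemfile.lock would accept. The constraint strings and the helper name `pessimistic_requirement` are assumptions, not something the gemspec declares.

```ruby
require "rubygems"

# Versions resolved in Gemfile.lock for the three runtime dependencies.
LOCKED = {
  "pdf-reader"  => "2.15.0",
  "mini_magick" => "4.13.2",
  "rtesseract"  => "2.2.0"
}.freeze

# Hypothetical helper: builds a "~> MAJOR.MINOR" requirement from a version.
def pessimistic_requirement(version)
  major, minor = Gem::Version.new(version).segments
  Gem::Requirement.new("~> #{major}.#{minor}")
end

LOCKED.each do |name, version|
  req = pessimistic_requirement(version)
  ok = req.satisfied_by?(Gem::Version.new(version))
  puts "#{name}: #{req} accepts #{version}? #{ok}"
end
```

A constraint such as `~> 2.15` admits later 2.x releases but excludes 3.0, which limits exposure to breaking changes in a dependency.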

data/sig/ocr.rbs ADDED

@@ -0,0 +1,4 @@
+module Ocr
+  VERSION: String
+  # See the writing guide of rbs: https://github.com/ruby/rbs#guides
+end

metadata ADDED

@@ -0,0 +1,129 @@
+--- !ruby/object:Gem::Specification
+name: pdf_ocr
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- Ravi Shankar Singhal
+autorequire:
+bindir: exe
+cert_chain: []
+date: 2025-10-24 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: pdf-reader
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: mini_magick
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rtesseract
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: rspec
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: byebug
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+description: OCR is a Ruby gem that allows you to easily extract text from image files
+  (JPG, PNG, PDF) using Tesseract OCR engine. It provides a simple, intuitive interface
+  for integrating OCR capabilities into your Ruby or Rails applications.
+email:
+- ravi.singhal2308@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- ".rspec"
+- Gemfile
+- Gemfile.lock
+- README.md
+- Rakefile
+- lib/ocr.rb
+- lib/ocr/data_extractor.rb
+- lib/ocr/version.rb
+- ocr.gemspec
+- sig/ocr.rbs
+homepage: https://github.com/RaviShankarSinghal/ocr_gem
+licenses:
+- MIT
+metadata:
+  homepage_uri: https://github.com/RaviShankarSinghal/ocr_gem
+  source_code_uri: https://github.com/RaviShankarSinghal/ocr_gem
+  changelog_uri: https://github.com/RaviShankarSinghal/ocr_gem/blob/main/CHANGELOG.md
+  documentation_uri: https://rubydoc.info/gems/ocr
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 2.6.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.3.7
+signing_key:
+specification_version: 4
+summary: A lightweight Ruby gem for extracting text from images using OCR.
+test_files: []