RubyGems - ocr-file - Versions diffs - 0.0.1 - Mend

ocr-file 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (23)

checksums.yaml +7 -0
data/.gitignore +56 -0
data/CODE_OF_CONDUCT.md +74 -0
data/Gemfile +5 -0
data/Gemfile.lock +83 -0
data/LICENSE +21 -0
data/README.md +121 -0
data/Rakefile +10 -0
data/bin/console +11 -0
data/bin/ocr-file +5 -0
data/bin/setup +8 -0
data/lib/ocr-file/cli.rb +5 -0
data/lib/ocr-file/document.rb +195 -0
data/lib/ocr-file/file_helpers.rb +40 -0
data/lib/ocr-file/image_engines/image_magick.rb +14 -0
data/lib/ocr-file/image_engines/pdf_engine.rb +75 -0
data/lib/ocr-file/image_engines/pdftoppm.rb +27 -0
data/lib/ocr-file/ocr_engines/cloud_vision.rb +59 -0
data/lib/ocr-file/ocr_engines/tesseract.rb +22 -0
data/lib/ocr-file/version.rb +3 -0
data/lib/ocr-file.rb +19 -0
data/ocr-file.gemspec +38 -0
metadata +151 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 6b558ce36c35e74b410f42928eae1a987485d1bbd64da77750574062bc05b91e
+  data.tar.gz: d906c620a02c5a2d139b3d89d05e9b3872ee6c929b4aa661b20f9033d8f3605f
+SHA512:
+  metadata.gz: 81049908609ba3d622be2b6f99dabeca2960a455fa3d56ee1fca4c177c2ee4365281421c1128ec3fa5476d068daa53b3d7f7600c5fd1c31fcb5834ca688f9747
+  data.tar.gz: 1a7dcd56a7196694371abf70633635545138bdc7bc0af2873fc5e7c22bdbfc97e9986ba1a18afc288b24e488caf999b644ec1f9d8889ce6e5efa6fcfe776c204
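A `.gem` file is a tar archive holding `metadata.gz`, `data.tar.gz` and `checksums.yaml`, so the digests listed above can be compared against the unpacked members. The following is a minimal sketch of that comparison using Ruby's standard `digest` library; `verify_checksum` is a hypothetical helper name, not part of ocr-file or RubyGems, and the file paths in the usage comment are placeholders.

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA256 digest of the file at
# `path` matches `expected_sha256` (a hex string such as the ones listed
# under "SHA256:" in checksums.yaml).
def verify_checksum(path, expected_sha256)
  actual = Digest::SHA256.file(path).hexdigest
  actual == expected_sha256.strip.downcase
end

# Usage (placeholder path, assuming the .gem archive was unpacked first):
#   verify_checksum('unpacked/data.tar.gz', 'd906c620a02c...')
```

The same pattern would apply to the SHA512 entries by swapping in `Digest::SHA512`.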

data/.gitignore ADDED

@@ -0,0 +1,56 @@
+*.gem
+*.rbc
+/.config
+/coverage/
+/InstalledFiles
+/pkg/
+/spec/reports/
+/spec/examples.txt
+/test/tmp/
+/test/version_tmp/
+/tmp/
+# Used by dotenv library to load environment variables.
+# .env
+# Ignore Byebug command history file.
+.byebug_history
+## Specific to RubyMotion:
+.dat*
+.repl_history
+build/
+*.bridgesupport
+build-iPhoneOS/
+build-iPhoneSimulator/
+## Specific to RubyMotion (use of CocoaPods):
+#
+# We recommend against adding the Pods directory to your .gitignore. However
+# you should judge for yourself, the pros and cons are mentioned at:
+# https://guides.cocoapods.org/using/using-cocoapods.html#should-i-check-the-pods-directory-into-source-control
+#
+# vendor/Pods/
+## Documentation cache and generated files:
+/.yardoc/
+/_yardoc/
+/doc/
+/rdoc/
+## Environment normalization:
+/.bundle/
+/vendor/bundle
+/lib/bundler/man/
+# for a library or gem, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# Gemfile.lock
+# .ruby-version
+# .ruby-gemset
+# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
+.rvmrc
+# Used by RuboCop. Remote config files pulled in from inherit_from directive.
+# .rubocop-https?--*

data/CODE_OF_CONDUCT.md ADDED

@@ -0,0 +1,74 @@
+# Contributor Covenant Code of Conduct
+## Our Pledge
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, gender identity and expression, level of experience,
+nationality, personal appearance, race, religion, or sexual identity and
+orientation.
+## Our Standards
+Examples of behaviour that contributes to creating a positive environment
+include:
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+Examples of unacceptable behaviour by participants include:
+* The use of sexualised language or imagery and unwelcome sexual attention or
+advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+  address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+## Our Responsibilities
+Project maintainers are responsible for clarifying the standards of acceptable
+behaviour and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behaviour.
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviours that they deem inappropriate,
+threatening, offensive, or harmful.
+## Scope
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+## Enforcement
+Instances of abusive, harassing, or otherwise unacceptable behaviour may be
+reported by contacting the project team at contact@jasonchalom.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+## Attribution
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at [http://contributor-covenant.org/version/1/4][version]
+[homepage]: http://contributor-covenant.org
+[version]: http://contributor-covenant.org/version/1/4/

data/Gemfile ADDED

@@ -0,0 +1,5 @@
+source "https://rubygems.org"
+git_source(:github) {|repo_name| "https://github.com/TRex22/ocr-file" }
+gemspec

data/Gemfile.lock ADDED

@@ -0,0 +1,83 @@
+PATH
+  remote: .
+  specs:
+    ocr-file (0.0.1)
+      active_attr (~> 0.15.4)
+      console-style (~> 0.0.1)
+      hexapdf (~> 0.23.0)
+      mini_magick (~> 4.11.0)
+      rtesseract (~> 3.1.2)
+GEM
+  remote: https://rubygems.org/
+  specs:
+    actionpack (7.0.3)
+      actionview (= 7.0.3)
+      activesupport (= 7.0.3)
+      rack (~> 2.0, >= 2.2.0)
+      rack-test (>= 0.6.3)
+      rails-dom-testing (~> 2.0)
+      rails-html-sanitizer (~> 1.0, >= 1.2.0)
+    actionview (7.0.3)
+      activesupport (= 7.0.3)
+      builder (~> 3.1)
+      erubi (~> 1.4)
+      rails-dom-testing (~> 2.0)
+      rails-html-sanitizer (~> 1.1, >= 1.2.0)
+    active_attr (0.15.4)
+      actionpack (>= 3.0.2, < 7.1)
+      activemodel (>= 3.0.2, < 7.1)
+      activesupport (>= 3.0.2, < 7.1)
+    activemodel (7.0.3)
+      activesupport (= 7.0.3)
+    activesupport (7.0.3)
+      concurrent-ruby (~> 1.0, >= 1.0.2)
+      i18n (>= 1.6, < 2)
+      minitest (>= 5.1)
+      tzinfo (~> 2.0)
+    builder (3.2.4)
+    cmdparse (3.0.7)
+    coderay (1.1.3)
+    concurrent-ruby (1.1.10)
+    console-style (0.0.1)
+    crass (1.0.6)
+    erubi (1.10.0)
+    geom2d (0.3.1)
+    hexapdf (0.23.0)
+      cmdparse (~> 3.0, >= 3.0.3)
+      geom2d (~> 0.3)
+    i18n (1.10.0)
+      concurrent-ruby (~> 1.0)
+    loofah (2.18.0)
+      crass (~> 1.0.2)
+      nokogiri (>= 1.5.9)
+    method_source (1.0.0)
+    mini_magick (4.11.0)
+    minitest (5.16.0)
+    nokogiri (1.13.6-arm64-darwin)
+      racc (~> 1.4)
+    pry (0.14.1)
+      coderay (~> 1.1)
+      method_source (~> 1.0)
+    racc (1.6.0)
+    rack (2.2.3.1)
+    rack-test (1.1.0)
+      rack (>= 1.0, < 3)
+    rails-dom-testing (2.0.3)
+      activesupport (>= 4.2.0)
+      nokogiri (>= 1.6)
+    rails-html-sanitizer (1.4.3)
+      loofah (~> 2.3)
+    rtesseract (3.1.2)
+    tzinfo (2.0.4)
+      concurrent-ruby (~> 1.0)
+PLATFORMS
+  arm64-darwin-20
+DEPENDENCIES
+  ocr-file!
+  pry (~> 0.14.1)
+BUNDLED WITH
+   2.3.5

data/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2022 Jason Chalom
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,121 @@
+# OCR-File
+A tool to combine PDF tools, OCR tools and image processing into a
+single interface as both a CLI and a library.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'ocr-file'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install ocr-file
+### Other required dependencies
+You will need to install `tesseract` with your desired language on your system.
+`pdftoppm` and ImageMagick also need to be available.
+## Usage
+```ruby
+  require 'ocr-file'
+  config = {
+    # Images from PDF
+    filetype: 'png',
+    quality: 100,
+    dpi: 300,
+    # Text to PDF
+    font: 'Helvetica',
+    font_size: 5, #8 # 12
+    text_x: 20,
+    text_y: 800,
+    minimum_word: 5,
+    # Cloud-Vision OCR
+    image_annotator: nil, # Needed for Cloud-Vision
+    type_of_ocr: OcrFile::OcrEngines::CloudVision::DOCUMENT_TEXT_DETECTION,
+    ocr_engine: 'tesseract', # 'cloud-vision'
+    # Image Pre-Processing
+    image_pre_preprocess: true,
+    effects: ['bw', 'norm'],
+    threshold: 0.25,
+    # PDF to Image Processing
+    optimise_pdf: true,
+    extract_pdf_images: true, # if false will screenshot each PDF page
+    temp_filename_prefix: 'image',
+    # Console Output
+    verbose: true,
+  }
+  doc = OcrFile::Document.new(
+    original_file_path: '/path-to-original-file/', # supports PDFs and images
+    save_file_path: '/folder-to-save-to/',
+    config: config # Not needed as defaults are used when not provided
+  )
+  doc.to_s # Returns the text, removes temp files and does not save anything
+  doc.to_pdf # Saves a PDF (either searchable over the images or dumped text)
+  doc.to_text # Saves a text file with OCR text
+  # How to generate PDFs of images or text files:
+  original_file_path = 'file.txt' # or 'file.png'
+  doc = OcrFile::Document.new(
+    original_file_path: original_file_path, # supports PDFs and images
+    save_file_path: '/folder-to-save-to/',
+    config: config # Not needed as defaults are used when not provided
+  )
+  doc.to_pdf
+  # How to merge files into a single PDF:
+  file_paths = [] # paths of the PDFs to merge
+  documents = file_paths.map { |path| OcrFile::ImageEngines::PdfEngine.open_pdf(path, password: '') }
+  merged_document = OcrFile::ImageEngines::PdfEngine.merge(documents)
+  OcrFile::ImageEngines::PdfEngine.save_pdf(merged_document, save_file_path, optimise: true)
+```
+### Notes / Tips
+Set `extract_pdf_images` to `false` for higher quality OCR. However, this renders each PDF page as an image, so it will consume more temporary space per page and will be considerably slower.
+Image pre-processing is not yet implemented.
+## Development
+After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+### TODOs
+- input validation
+- CLI
+- image processing
+- password
+- Base64 encoding
+- requirements checking (installed dependencies etc ...)
+- Tests
+- Configurable temp folder cleanup
+- Improve console output
+### Tests
+To run tests execute:
+    $ rake test
+## Contributing
+Bug reports and pull requests are welcome on GitHub at https://github.com/trex22/ocr-file. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [Contributor Covenant](http://contributor-covenant.org) code of conduct.
+## License
+The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
+## Code of Conduct
+Everyone interacting in the OCR-File project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/trex22/ocr-file/blob/master/CODE_OF_CONDUCT.md).

data/Rakefile ADDED

@@ -0,0 +1,10 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new(:test) do |t|
+  t.libs << "test"
+  t.libs << "lib"
+  t.test_files = FileList["test/**/*_test.rb"]
+end
+task :default => :test

data/bin/console ADDED

@@ -0,0 +1,11 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "ocr-file"
+# You can add fixtures and/or initialization code here to make experimenting
+# with your gem easier. You can also use a different console, if you like.
+# (If you use this, don't forget to add pry to your Gemfile!)
+require "pry"
+Pry.start

data/bin/ocr-file ADDED

@@ -0,0 +1,5 @@
+#!/usr/bin/env ruby -wU
+require 'ocr-file'
+puts "Hello, world!"

data/bin/setup ADDED

@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
+set -vx
+bundle install
+# Do any other automated setup that you need to do here

data/lib/ocr-file/cli.rb ADDED

@@ -0,0 +1,5 @@
+module OcrFile
+  module Cli
+  end
+end

data/lib/ocr-file/document.rb ADDED

@@ -0,0 +1,195 @@
+module OcrFile
+  class Document
+    ACCEPTED_IMAGE_TYPES = ['png', 'jpeg', 'jpg', 'tiff', 'bmp']
+    PAGE_BREAK = "\n\r\n" # TODO: Make configurable
+    DEFAULT_CONFIG = {
+      # Images from PDF
+      filetype: 'png',
+      quality: 100,
+      dpi: 300,
+      # Text to PDF
+      font: 'Helvetica',
+      font_size: 5, #8 # 12
+      text_x: 20,
+      text_y: 800,
+      minimum_word: 5,
+      # Cloud-Vision OCR
+      image_annotator: nil, # Needed for Cloud-Vision
+      type_of_ocr: OcrFile::OcrEngines::CloudVision::DOCUMENT_TEXT_DETECTION,
+      ocr_engine: 'tesseract', # 'cloud-vision'
+      # Image Pre-Processing
+      image_pre_preprocess: true,
+      effects: ['bw', 'norm'],
+      threshold: 0.25,
+      # PDF to Image Processing
+      optimise_pdf: true,
+      extract_pdf_images: true, # if false will screenshot each PDF page
+      temp_filename_prefix: 'image',
+      # Console Output
+      verbose: true,
+    }
+    attr_reader :original_file_path,
+      :filename,
+      :save_file_path,
+      :final_save_file,
+      :config,
+      :ocr_engine
+    # save_file_path is expected to be a folder path; a temp subfolder for intermediate files is created inside it
+    # TODO: Add in more input validation
+    def initialize(original_file_path:, save_file_path:, config: DEFAULT_CONFIG)
+      @original_file_path = original_file_path
+      @filename = original_file_path.split('/').last.split('.').first
+      date = Time.now.to_s.split(' ').first
+      @save_file_path = save_file_path
+      @final_save_file = "#{@save_file_path}/#{@filename}-#{date}-#{Time.now.to_i}"
+      @config = config
+      @ocr_engine = find_ocr_engine(config[:ocr_engine])
+    end
+    def pdf?
+      @original_file_path.include?('.pdf')
+    end
+    def image?
+      return false if pdf?
+      ACCEPTED_IMAGE_TYPES.any? { |type| @original_file_path.include?(".#{type}")}
+    end
+    # Treat anything which isn't a PDF or image as text
+    def text?
+      !pdf? && !image?
+    end
+    def to_pdf
+      if pdf?
+        create_temp_folder
+        image_paths = extract_image_paths_from_pdf(@original_file_path)
+        pdfs_to_merge = []
+        image_paths.each do |image_path|
+          pdfs_to_merge << @ocr_engine.ocr_to_pdf(image_path, options: @config)
+        end
+        merged_pdf = OcrFile::ImageEngines::PdfEngine.merge(pdfs_to_merge)
+        OcrFile::ImageEngines::PdfEngine
+          .save_pdf(merged_pdf, "#{@final_save_file}.pdf", optimise: @config[:optimise_pdf])
+        close
+      elsif text?
+        text = ::OcrFile::FileHelpers.open_text_file(@original_file_path)
+        pdf_file = OcrFile::ImageEngines::PdfEngine.pdf_from_text(text, @config)
+        OcrFile::ImageEngines::PdfEngine
+          .save_pdf(pdf_file, "#{@final_save_file}.pdf", optimise: @config[:optimise_pdf])
+      else # is an image
+        ocr_image_to_pdf
+      end
+    end
+    def to_text
+      if pdf?
+        create_temp_folder
+        image_paths = extract_image_paths_from_pdf(@original_file_path)
+        image_paths.each do |image_path|
+          text = @ocr_engine.ocr_to_text(image_path, options: @config)
+          ::OcrFile::FileHelpers.append_file("#{@final_save_file}.txt", "#{text}#{PAGE_BREAK}")
+        end
+        close
+      elsif text?
+        ::OcrFile::FileHelpers.open_text_file(@original_file_path)
+      else # is an image
+        ocr_image_to_text(save: true)
+      end
+    end
+    def to_s
+      if pdf?
+        create_temp_folder
+        image_paths = extract_image_paths_from_pdf(@original_file_path)
+        text = ''
+        image_paths.each do |image_path|
+          text = "#{text}#{PAGE_BREAK}#{@ocr_engine.ocr_to_text(image_path, options: @config)}"
+        end
+        close
+        text
+      elsif text?
+        ::OcrFile::FileHelpers.open_text_file(@original_file_path)
+      else # is an image
+        ocr_image_to_text(save: false)
+      end
+    end
+    def close
+      ::OcrFile::FileHelpers.clear_folder(@temp_folder_path)
+    end
+    private
+    def extract_image_paths_from_pdf(file_path)
+      document = OcrFile::ImageEngines::PdfEngine.open_pdf(file_path, password: '')
+      if @config[:extract_pdf_images]
+        OcrFile::ImageEngines::PdfEngine
+          .extract_images(document, @temp_folder_path, verbose: @config[:verbose])
+      else # Generate screenshots of each page
+        OcrFile::ImageEngines::Pdftoppm.images_from_pdf(
+          file_path,
+          @temp_folder_path,
+          filename: @config[:temp_filename_prefix],
+          filetype: @config[:filetype],
+          quality: @config[:quality],
+          dpi: @config[:dpi],
+          verbose: @config[:verbose]
+        )
+      end
+    end
+    def create_temp_folder
+      # TODO: Make this a bit more robust
+      @temp_folder_path = "#{save_file_path}/temp/".gsub(' ', '\ ')
+      ::OcrFile::FileHelpers.make_directory(@temp_folder_path)
+    end
+    def ocr_image_to_pdf
+      pdf_document = @ocr_engine.ocr_to_pdf(@original_file_path, options: @config)
+      OcrFile::ImageEngines::PdfEngine
+        .save_pdf(pdf_document, "#{@final_save_file}.pdf", optimise: @config[:optimise_pdf])
+    end
+    def ocr_image_to_text(save: true)
+      text = @ocr_engine.ocr_to_text(@original_file_path, options: @config)
+      if save
+        ::OcrFile::FileHelpers.append_file("#{@final_save_file}.txt", text)
+      else
+        text
+      end
+    end
+    def find_ocr_engine(engine_id)
+      ocr_engine_constants
+        .map { |c| ocr_module(c) }
+        .find { |selected_module| selected_module.id == engine_id }
+    end
+    def ocr_module(constant)
+      OcrFile::OcrEngines.const_get(constant)
+    end
+    def ocr_engine_constants
+      OcrFile::OcrEngines.constants
+    end
+  end
+end
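Review note on `pdf?` and `image?` above: both use `String#include?`, so a path such as `/my.pdfs/page.png` or `report.pdf.notes.txt` would be classified as a PDF. One possible alternative is to look only at the final extension. The sketch below is an illustration under that assumption, not the gem's implementation; `file_kind` is a hypothetical helper, and the image-type list is copied from `ACCEPTED_IMAGE_TYPES`.

```ruby
# Copied from OcrFile::Document::ACCEPTED_IMAGE_TYPES in the diff above.
ACCEPTED_IMAGE_TYPES = %w[png jpeg jpg tiff bmp].freeze

# Hypothetical helper: classify a path by its final extension only,
# ignoring case. Anything that is not a PDF or a known image type is
# treated as text, mirroring Document#text?.
def file_kind(path)
  extension = File.extname(path).delete_prefix('.').downcase
  return :pdf if extension == 'pdf'
  return :image if ACCEPTED_IMAGE_TYPES.include?(extension)

  :text
end
```

With this approach a directory name containing `.pdf` no longer affects the result, and upper-case extensions such as `.PNG` are still recognised.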

data/lib/ocr-file/file_helpers.rb ADDED

@@ -0,0 +1,40 @@
+module OcrFile
+  module FileHelpers
+    extend self
+    def merge_pdfs(file_paths, save_file_path)
+      documents = file_paths.map { |path| OcrFile::ImageEngines::PdfEngine.open_pdf(path) }
+      merged_document = OcrFile::ImageEngines::PdfEngine.merge(documents)
+      OcrFile::ImageEngines::PdfEngine.save_pdf(merged_document, save_file_path, optimise: true)
+    end
+    # Beware this is dangerous!
+    def clear_folder(path)
+      return unless path.include?('/temp') # Small hacky safeguard
+      `rm -rf #{path}` # Cleanup
+    end
+    def make_directory(path)
+      `mkdir -p #{path}`
+    end
+    def open_json(path)
+      JSON.parse(File.read(path))
+    end
+    def append_file(path, text)
+      File.open(path, 'a') { |file| file.write(text) }
+    end
+    def open_text_file(path)
+      File.read(path)
+    end
+    def fetch_temp_image_paths(save_path, temp_filename, filetype)
+      filenames = `ls #{save_path} | grep .#{filetype}`.split("\n")
+      filenames.map do |filename|
+        "#{save_path}/#{filename}"
+      end
+    end
+  end
+end
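Review note on the helpers above: `clear_folder`, `make_directory` and `fetch_temp_image_paths` interpolate paths into backtick shell commands, which is fragile for paths containing spaces or shell metacharacters. A minimal sketch of the same three operations using only Ruby's standard library follows. `SafeFileHelpers` is a hypothetical module name, and the stricter guard in `clear_folder` (the last path component must be exactly `temp`) is an assumption, not the gem's behaviour.

```ruby
require 'fileutils'

# Hypothetical stdlib-only variant of the shell-based helpers.
module SafeFileHelpers
  extend self

  # Equivalent of `mkdir -p`, without going through a shell.
  def make_directory(path)
    FileUtils.mkdir_p(path)
  end

  # Deletes the folder only when its last path component is "temp".
  # (Assumption: stricter than the original '/temp' substring check.)
  def clear_folder(path)
    return unless File.basename(path) == 'temp'

    FileUtils.rm_rf(path)
  end

  # Lists files in save_path with the given extension, sorted by name.
  def fetch_temp_image_paths(save_path, filetype)
    Dir.glob("*.#{filetype}", base: save_path)
       .sort
       .map { |name| File.join(save_path, name) }
  end
end
```

Because no shell is involved, the `gsub(' ', '\ ')` escaping done in `Document#create_temp_folder` would not be needed with helpers like these.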

data/lib/ocr-file/image_engines/image_magick.rb ADDED

@@ -0,0 +1,14 @@
+module OcrFile
+  module ImageEngines
+    module ImageMagick
+      extend self
+      # TODO:
+      # B/W
+      # Contrast
+      # Image Norm
+      # Threshold
+      # Conversion of image types
+    end
+  end
+end

data/lib/ocr-file/image_engines/pdf_engine.rb ADDED

@@ -0,0 +1,75 @@
+module OcrFile
+  module ImageEngines
+    module PdfEngine
+      extend self
+      PAGE_BREAK = "\n\r\n"
+      DEFAULT_PAGE_OPTIONS = {
+        font: 'Helvetica',
+        font_size: 5, #8 # 12
+        text_x: 20,
+        text_y: 800,
+        minimum_word: 5,
+      }
+      def pdf_from_text(text, options = DEFAULT_PAGE_OPTIONS)
+        document = ::HexaPDF::Document.new
+        text
+          .split(PAGE_BREAK)
+          .reject { |line| line.size < options[:minimum_word] }
+          .each { |page_text| document = add_page(document, page_text, options) }
+        document
+      end
+      def add_page(document, text, options)
+        canvas = document.pages.add.canvas
+        canvas.font(options[:font], size: options[:font_size])
+        canvas.text(text, at: [options[:text_x], options[:text_y]])
+        document
+      end
+      def save_pdf(document, save_file_path, optimise: true)
+        document.write(save_file_path, optimize: optimise)
+      end
+      def open_pdf(file, password: '')
+        ::HexaPDF::Document.open(file, decryption_opts: { password: password })
+      end
+      def extract_images(document, save_path, verbose: false)
+        image_paths = []
+        ::HexaPDF::CLI::Images.new.send(:each_image, document) do |image, index, pindex, (_x_ppi, _y_ppi)|
+          puts "Processing page: #{pindex} ..." if verbose
+          info = image.info
+          if info.writable
+            image_filename = "#{index}.#{image.info.extension}"
+            image_path = "#{save_path}/#{image_filename}"
+            image.write(image_path)
+            image_paths << image_path
+          elsif verbose
+            puts "Warning (image #{index}, page #{pindex}): PDF image format not supported for writing"
+          end
+        end
+        image_paths
+      end
+      def merge(documents)
+        target = ::HexaPDF::Document.new
+        documents.each do |document|
+          document.pages.each { |page| target.pages << target.import(page) }
+        end
+        target
+      end
+    end
+  end
+end
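Review note on `pdf_from_text` above: the text is split on `PAGE_BREAK` and any chunk shorter than `minimum_word` characters is dropped before a page is added (the block variable is called `line`, but each element is a whole page of text). The sketch below isolates that splitting step in plain Ruby so it can be tried without HexaPDF; `pages_from_text` is a hypothetical name, and the constant is copied from the diff.

```ruby
# Copied from OcrFile::ImageEngines::PdfEngine::PAGE_BREAK in the diff above.
PAGE_BREAK = "\n\r\n"

# Hypothetical helper: returns the page texts that pdf_from_text would turn
# into PDF pages, dropping chunks shorter than minimum_word characters.
def pages_from_text(text, minimum_word: 5)
  text
    .split(PAGE_BREAK)
    .reject { |page_text| page_text.size < minimum_word }
end
```

One consequence worth noting: `Document#to_s` prefixes every page with `PAGE_BREAK`, so the first chunk after splitting is an empty string, and the length filter is what removes it.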

data/lib/ocr-file/image_engines/pdftoppm.rb ADDED

@@ -0,0 +1,27 @@
+module OcrFile
+  module ImageEngines
+    module Pdftoppm
+      extend self
+      # TODO: other options
+      # https://www.xpdfreader.com/pdftoppm-man.html
+      # password
+      # −mono Generate a monochrome PBM file (instead of an RGB PPM file).
+      # −gray Generate a grayscale PGM file (instead of an RGB PPM file).
+      # −cmyk Generate a CMYK PAM file (instead of an RGB PPM file).
+      def images_from_pdf(pdf_path, save_path, filename: 'image', filetype: 'png', quality: 100, dpi: 300, verbose: true)
+        print 'Generating screenshots of each PDF page ... '
+        if filetype == 'jpg'
+          `pdftoppm -jpeg -jpegopt quality=#{quality} -r #{dpi} #{pdf_path} #{save_path}/#{filename}`
+        else
+          `pdftoppm -#{filetype} -r #{dpi} #{pdf_path} #{save_path}/#{filename}`
+        end
+        puts 'Complete!'
+        OcrFile::FileHelpers.fetch_temp_image_paths(save_path, filename, filetype)
+      end
+    end
+  end
+end
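Review note on `images_from_pdf` above: the `pdftoppm` command is built as a single interpolated string, so a PDF path with spaces would be split by the shell. A possible alternative is to build the command as an argument array and hand it to `Kernel#system`, which bypasses the shell when given multiple arguments. The sketch below only constructs the array; `pdftoppm_command` is a hypothetical helper, and treating `'jpeg'` the same as `'jpg'` is an assumption that differs slightly from the original branch.

```ruby
# Hypothetical helper: builds the pdftoppm invocation as an argv array,
# using the same flags as the diff above (-jpeg, -jpegopt, -r, -<filetype>).
def pdftoppm_command(pdf_path, save_path, filename: 'image', filetype: 'png', quality: 100, dpi: 300)
  format_flags =
    if %w[jpg jpeg].include?(filetype) # assumption: 'jpeg' also gets the quality option
      ['-jpeg', '-jpegopt', "quality=#{quality}"]
    else
      ["-#{filetype}"]
    end

  ['pdftoppm', *format_flags, '-r', dpi.to_s, pdf_path, File.join(save_path, filename)]
end

# Usage (runs the external tool, so pdftoppm must be installed):
#   system(*pdftoppm_command('/scans/my file.pdf', '/scans/temp'))
```

Passing the array with a splat keeps each path as one argument regardless of spaces, which is the main design reason for this shape.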

data/lib/ocr-file/ocr_engines/cloud_vision.rb ADDED

@@ -0,0 +1,59 @@
+module OcrFile
+  module OcrEngines
+    module CloudVision
+      extend self
+      DEFAULT_LANGUAGE = 'en'
+      # Available Types: https://github.com/googleapis/google-cloud-ruby/blob/master/google-cloud-vision/lib/google/cloud/vision/v1/image_annotator_pb.rb
+      TEXT_DETECTION = 'TEXT_DETECTION' # Used for low-quality images
+      DOCUMENT_TEXT_DETECTION = 'DOCUMENT_TEXT_DETECTION' # Used for dense text documents
+      def id
+        'cloud-vision'
+      end
+      def ocr_to_text(file_path, options: { type_of_ocr: '', image_annotator: nil })
+        type_of_ocr = options[:type_of_ocr]
+        image_annotator = options[:image_annotator]
+        response = detect_text(type_of_ocr, file_path, image_annotator)
+        extract_text(response)
+      end
+      def ocr_to_pdf(file_path, options: { type_of_ocr: '', image_annotator: nil })
+        text = ocr_to_text(file_path, options: options)
+        OcrFile::ImageEngines::PdfEngine.pdf_from_text(text, options)
+      end
+      private
+      def detect_text(type_of_ocr, image_path, image_annotator)
+        if type_of_ocr == 'DOCUMENT_TEXT_DETECTION'
+          image_annotator.document_text_detection(image: image_path)
+        else
+          image_annotator.text_detection(image: image_path)
+        end
+      end
+      def extract_text(response)
+        raw_text = ''
+        foreign_text = ''
+        response.responses.each do |section|
+          section.text_annotations.each do |annotation|
+            raw_text << annotation.description
+            if annotation.locale && annotation.locale != DEFAULT_LANGUAGE
+              foreign_text << annotation.description
+            end
+          end
+        end
+        raw_text = raw_text.split("\n")
+        raw_text.pop # Remove the last line
+        raw_text.join("\n")
+      end
+    end
+  end
+end
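Review note on `detect_text` above: the method chooses between two annotator calls based on `type_of_ocr`, and the annotator itself is injected through the options hash. That makes the dispatch easy to exercise without Google Cloud credentials. The sketch below is an illustration only: `FakeAnnotator` is a hypothetical stand-in that mimics just the two method names used in the diff and returns symbols instead of real API responses.

```ruby
# Hypothetical stand-in for a Cloud Vision image annotator client.
# It records each call and performs no network requests.
class FakeAnnotator
  attr_reader :calls

  def initialize
    @calls = []
  end

  def document_text_detection(image:)
    @calls << [:document_text_detection, image]
    :document_response
  end

  def text_detection(image:)
    @calls << [:text_detection, image]
    :text_response
  end
end

# Same dispatch logic as CloudVision#detect_text in the diff above.
def detect_text(type_of_ocr, image_path, image_annotator)
  if type_of_ocr == 'DOCUMENT_TEXT_DETECTION'
    image_annotator.document_text_detection(image: image_path)
  else
    image_annotator.text_detection(image: image_path)
  end
end
```

Any value other than `'DOCUMENT_TEXT_DETECTION'`, including an empty string, falls through to plain `text_detection`.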

data/lib/ocr-file/ocr_engines/tesseract.rb ADDED

@@ -0,0 +1,22 @@
+module OcrFile
+  module OcrEngines
+    module Tesseract
+      extend self
+      def id
+        'tesseract'
+      end
+      def ocr_to_text(file_path, options: {})
+        image = ::RTesseract.new(file_path)
+        image.to_s # Getting the value
+      end
+      def ocr_to_pdf(file_path, options: {})
+        image = ::RTesseract.new(file_path)
+        raw_output = image.to_pdf  # Getting open file of pdf
+        OcrFile::ImageEngines::PdfEngine.open_pdf(raw_output, password: '')
+      end
+    end
+  end
+end
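Review note on engine selection: each engine module exposes an `id`, and `Document#find_ocr_engine` scans the constants of `OcrFile::OcrEngines` for the module whose `id` matches `config[:ocr_engine]`. The sketch below reproduces that lookup with stub modules so it runs without the gem; `DemoEngines` and `find_engine` are hypothetical names, and the `respond_to?` check is an added assumption to skip constants that are not engines.

```ruby
# Hypothetical namespace standing in for OcrFile::OcrEngines.
module DemoEngines
  module Tesseract
    def self.id
      'tesseract'
    end
  end

  module CloudVision
    def self.id
      'cloud-vision'
    end
  end
end

# Returns the module under `namespace` whose id equals engine_id,
# or nil when no engine matches.
def find_engine(namespace, engine_id)
  namespace.constants
           .map { |name| namespace.const_get(name) }
           .find { |candidate| candidate.respond_to?(:id) && candidate.id == engine_id }
end
```

Since an unknown id yields `nil`, a caller that stores the result (as `Document#initialize` does) would only fail later, on the first OCR call.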

data/lib/ocr-file/version.rb ADDED

@@ -0,0 +1,3 @@
+module OcrFile
+  VERSION = "0.0.1"
+end

data/lib/ocr-file.rb ADDED

@@ -0,0 +1,19 @@
+require 'hexapdf'
+require 'hexapdf/cli/images'
+require 'rtesseract'
+require 'mini_magick'
+require 'ocr-file/version'
+require 'ocr-file/image_engines/pdf_engine'
+require 'ocr-file/image_engines/image_magick'
+require 'ocr-file/image_engines/pdftoppm'
+require 'ocr-file/ocr_engines/tesseract'
+require 'ocr-file/ocr_engines/cloud_vision'
+require 'ocr-file/file_helpers'
+require 'ocr-file/document'
+require 'ocr-file/cli'
+module OcrFile
+  class Error < StandardError; end
+end

data/ocr-file.gemspec ADDED

@@ -0,0 +1,38 @@
+lib = File.expand_path("../lib", __FILE__)
+$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+require "ocr-file/version"
+Gem::Specification.new do |spec|
+  spec.name          = "ocr-file"
+  spec.version       = OcrFile::VERSION
+  spec.authors       = ["trex22"]
+  spec.email         = ["contact@jasonchalom.com"]
+  spec.summary       = "A tool to combine PDF tools, OCR tools and image processing into a single interface as both a CLI and a library."
+  spec.description   = "A tool to combine PDF tools, OCR tools and image processing into a single interface as both a CLI and a library."
+  spec.homepage      = "https://github.com/TRex22/ocr-file"
+  spec.license       = "MIT"
+  # Specify which files should be added to the gem when it is released.
+  # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
+  spec.files         = Dir.chdir(File.expand_path('..', __FILE__)) do
+    `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
+  end
+  spec.bindir        = "bin"
+  spec.executables   = ["ocr-file"] #spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+  spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+  spec.require_paths = ["lib"]
+  # Dependencies
+  spec.add_dependency "console-style", "~> 0.0.1"
+  spec.add_dependency "active_attr", "~> 0.15.4"
+  spec.add_dependency "hexapdf", "~> 0.23.0"
+  spec.add_dependency "rtesseract", "~> 3.1.2"
+  spec.add_dependency "mini_magick", "~> 4.11.0"
+  # Development Dependencies
+  spec.add_development_dependency "pry", "~> 0.14.1"
+end
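Review note on the dependency constraints above: every runtime dependency uses the pessimistic operator with three version segments (for example `"~> 0.23.0"`), which permits patch releases only. The sketch below checks candidate versions against such a constraint using RubyGems' own `Gem::Requirement` and `Gem::Version` classes; `allowed?` is a hypothetical helper name.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies `constraint`.
def allowed?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 0.23.0" is shorthand for ">= 0.23.0, < 0.24".
allowed?('~> 0.23.0', '0.23.4') # patch release: accepted
allowed?('~> 0.23.0', '0.24.0') # next minor release: rejected
```

This tight pinning means consumers of the gem cannot pick up a newer minor release of `hexapdf`, `mini_magick` or `rtesseract` until the gemspec is updated.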

metadata ADDED

@@ -0,0 +1,151 @@
+--- !ruby/object:Gem::Specification
+name: ocr-file
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- trex22
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2022-06-19 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: console-style
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.1
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.0.1
+- !ruby/object:Gem::Dependency
+  name: active_attr
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.15.4
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.15.4
+- !ruby/object:Gem::Dependency
+  name: hexapdf
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.23.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.23.0
+- !ruby/object:Gem::Dependency
+  name: rtesseract
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 3.1.2
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 3.1.2
+- !ruby/object:Gem::Dependency
+  name: mini_magick
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 4.11.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 4.11.0
+- !ruby/object:Gem::Dependency
+  name: pry
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.14.1
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 0.14.1
+description: A tool to combine PDF tools, OCR tools and image processing into a single
+  interface as both a CLI and a library.
+email:
+- contact@jasonchalom.com
+executables:
+- ocr-file
+extensions: []
+extra_rdoc_files: []
+files:
+- ".gitignore"
+- CODE_OF_CONDUCT.md
+- Gemfile
+- Gemfile.lock
+- LICENSE
+- README.md
+- Rakefile
+- bin/console
+- bin/ocr-file
+- bin/setup
+- lib/ocr-file.rb
+- lib/ocr-file/cli.rb
+- lib/ocr-file/document.rb
+- lib/ocr-file/file_helpers.rb
+- lib/ocr-file/image_engines/image_magick.rb
+- lib/ocr-file/image_engines/pdf_engine.rb
+- lib/ocr-file/image_engines/pdftoppm.rb
+- lib/ocr-file/ocr_engines/cloud_vision.rb
+- lib/ocr-file/ocr_engines/tesseract.rb
+- lib/ocr-file/version.rb
+- ocr-file.gemspec
+homepage: https://github.com/TRex22/ocr-file
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.3.4
+signing_key:
+specification_version: 4
+summary: A tool to combine PDF tools, OCR tools and image processing into a single
+  interface as both a CLI and a library.
+test_files: []