RubyGems - pdf2markdownOCR - Versions diffs - 0.0.1 - Mend

pdf2markdownOCR 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +7 -0
data/README.md +130 -0
data/bin/console +29 -0
data/bin/pdf2markdownocr +6 -0
data/lib/pdf2markdownOCR/cli.rb +76 -0
data/lib/pdf2markdownOCR/configuration.rb +31 -0
data/lib/pdf2markdownOCR/llm_api.rb +78 -0
data/lib/pdf2markdownOCR/pdf2image.rb +53 -0
data/lib/pdf2markdownOCR/version.rb +9 -0
data/lib/pdf2markdownOCR.rb +64 -0
metadata +53 -0

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: f125dfcb354343c2aa778d36b422f338db5bf24734e48acbfe53bd57ad343569
+  data.tar.gz: 620f9cc746b7da2a38b8b75fa8b9ca87ff44c8d13215aa1cea54e6d0c14aa36d
+SHA512:
+  metadata.gz: 4dd91a17be2519f2e6f8fb41455f7d27ec7a123f36d55fff0fa891a2cadfcbee36d7bc39d9848b32262abecf4055f2fe21631b69f9e5204a32b60fc695b48198
+  data.tar.gz: bd2aad9b2417f69c9d1240ca8a06e20bfaf2b4d2ed5f7a47609b84acab1d73dd67983a18446467628d8279cac31f046fca013a69187f30cab3c726236a36a90c

data/README.md ADDED

@@ -0,0 +1,130 @@
+# pdf2markdownOCR
+A Ruby gem for converting PDF documents to Markdown using a locally-hosted vision LLM (OCR via AI).
+Pages are rendered as high-resolution PNG images and then sent to an OpenAI-compatible API endpoint for text extraction.
+## Requirements
+- Ruby >= 3.1
+- poppler-utils (`pdftoppm`, `pdfinfo`)
+- OpenAI-compatible vision LLM server (e.g. [vLLM](https://github.com/vllm-project/vllm), [Ollama](https://ollama.com), [llama.cpp](https://github.com/ggml-org/llama.cpp))
+Install poppler on Debian/Ubuntu:
+```bash
+sudo apt install poppler-utils
+```
+You can get [DeepSeek's OCR-2 model on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2).
+## Installation
+Add to your `Gemfile`:
+```ruby
+gem 'pdf2markdownOCR'
+```
+Then run:
+```bash
+bundle install
+```
+Or install directly:
+```bash
+gem install pdf2markdownOCR
+```
+## Configuration
+Configuration can be set via a block or via environment variables (`LLM_API_URL`, `LLM_MODEL`, `PNG_DPI`). The block takes priority.
+### Via configure block
+These are the available options and their default values:
+```ruby
+require 'pdf2markdownOCR'
+Pdf2MarkdownOCR.configure do |config|
+  # URL of your OpenAI-compatible LLM server
+  config.llm_api_url = "http://localhost:8000"
+  # Model name to request from the server
+  config.llm_model = "deepseek-ai/DeepSeek-OCR-2"
+  # PNG resolution used when rasterising PDF pages (higher = better OCR, slower)
+  config.png_dpi_resolution = 300
+  # Conversion mode: :single_thread or :multi_thread
+  # :multi_thread converts all pages to PNGs in parallel threads
+  config.mode = :multi_thread
+  # The gem uses Ruby's stdlib `Logger` writing to `$stdout`. You can provide your own instance.
+  # To silence it completely, pass Logger.new("/dev/null")
+  config.logger = Logger.new($stdout).tap do |log|
+    log.progname = "Pdf2MarkdownOCR"
+  end
+end
+```
+## Usage as a library
+### Convert a PDF and get Markdown as a string
+```ruby
+require 'pdf2markdownOCR'
+markdown = Pdf2MarkdownOCR.convert_pdf("document.pdf")
+puts markdown
+```
+### Convert a PDF and write directly to a file
+```ruby
+Pdf2MarkdownOCR.convert_pdf("document.pdf", "output.md")
+# => nil  (content written to output.md)
+```
+## Usage as a CLI
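+After installation the `pdf2markdownocr` executable is available on your `PATH`. The options mirror the configuration block:
+```
+Usage: pdf2markdownocr [options] <pdf_path>
+Converts a PDF file to Markdown using OCR. If no output file is specified, the Markdown content is printed to STDOUT.
+Options:
+  -o, --output FILE      Output Markdown file
+      --llm-api-url URL  OpenAI compatible server URL (default: http://localhost:8000)
+      --llm-model MODEL  LLM model to use (default: deepseek-ai/DeepSeek-OCR-2)
+      --mode MODE        Processing mode: single_thread or multi_thread (default: multi_thread)
+      --png-dpi DPI      DPI resolution for PNG conversion (default: 300)
+  -v, --version          Print version
+  -h, --help             Show this help message
+```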
+### Examples
+```bash
+# Basic conversion (Markdown printed to STDOUT)
+pdf2markdownocr document.pdf
+# Custom output file
+pdf2markdownocr document.pdf -o result.md
+# Custom LLM server and model
+pdf2markdownocr document.pdf -o result.md --llm-api-url http://localhost:9800 --llm-model deepseek-ai/DeepSeek-OCR
+# Print version
+pdf2markdownocr --version
+```
+## License
+MIT
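
Before running a conversion it can help to confirm that the poppler tools listed under Requirements are actually on the `PATH`. The sketch below is not part of the gem: `tool_available?` and `missing_tools` are hypothetical helper names, and the check only looks for an executable file with the given name.

```ruby
# Hypothetical helpers (not part of pdf2markdownOCR): look for executables on PATH.
def tool_available?(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the subset of names that could not be found on PATH.
def missing_tools(names = %w[pdftoppm pdfinfo], path_env = ENV.fetch("PATH", ""))
  names.reject { |name| tool_available?(name, path_env) }
end

missing = missing_tools
warn "Missing poppler tools: #{missing.join(', ')}" unless missing.empty?
```

Passing the search path as an argument keeps the helpers easy to exercise against a temporary directory.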

data/bin/console ADDED

@@ -0,0 +1,29 @@
+#!/usr/bin/env ruby
+require "bundler/setup"
+require "pdf2markdownOCR"
+# Add reload method for manual reloading
+def reload!
+  # Remove existing constants
+  Object.send(:remove_const, :Pdf2MarkdownOCR) if defined?(Pdf2MarkdownOCR)
+  # Clear loaded features
+  $LOADED_FEATURES.select { |f| f.include?('pdf2markdownOCR') }.each do |f|
+    $LOADED_FEATURES.delete(f)
+  end
+  # Require the main gem file again
+  require "pdf2markdownOCR"
+  puts "🔄 Reloaded Pdf2MarkdownOCR gem"
+rescue => e
+  puts "❌ Error reloading gem: #{e.message}"
+end
+puts "🚀 Pdf2MarkdownOCR console started"
+puts "Use 'reload!' to reload the gem after making changes"
+require "pry"
+Pry.start(__FILE__)

data/bin/pdf2markdownocr ADDED

@@ -0,0 +1,6 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+require_relative '../lib/pdf2markdownOCR/cli'
+Pdf2MarkdownOCR::CLI.run

data/lib/pdf2markdownOCR/cli.rb ADDED

@@ -0,0 +1,76 @@
+# frozen_string_literal: true
+require 'optparse'
+require_relative '../pdf2markdownOCR'
+require_relative 'version'
+module Pdf2MarkdownOCR
+  class CLI
+    def self.run(argv = ARGV)
+      options = {}
+      parser = OptionParser.new do |opts|
+        opts.banner = "Usage: pdf2markdownocr [options] <pdf_path>"
+        opts.separator ""
+        opts.separator "Converts a PDF file to Markdown using OCR. If no output file is specified, the Markdown content will be printed to STDOUT."
+        opts.separator ""
+        opts.separator "Options:"
+        opts.on("-o", "--output FILE", "Output Markdown file") do |file|
+          options[:output] = file
+        end
+        opts.on("--llm-api-url URL", "OpenAI compatible server URL (default: http://localhost:8000)") do |url|
+          options[:llm_api_url] = url
+        end
+        opts.on("--llm-model MODEL", "LLM model to use (default: deepseek-ai/DeepSeek-OCR-2)") do |model|
+          options[:llm_model] = model
+        end
+        opts.on("--mode MODE", "Processing mode: single_thread or multi_thread (default: multi_thread)") do |mode|
+          options[:mode] = mode
+        end
+        opts.on("--png-dpi DPI", Integer, "DPI resolution for PNG conversion (default: 300)") do |dpi|
+          options[:png_dpi] = dpi
+        end
+        opts.on("-v", "--version", "Print version") do
+          puts Pdf2MarkdownOCR.gem_version
+          exit
+        end
+        opts.on("-h", "--help", "Show this help message") do
+          puts opts
+          exit
+        end
+      end
+      begin
+        parser.parse!(argv)
+      rescue OptionParser::ParseError => e
+        # ParseError also covers InvalidArgument (e.g. a non-numeric --png-dpi) and MissingArgument
+        abort "Error: #{e.message}\n\n#{parser}"
+      end
+      pdf_path = argv.shift
+      if pdf_path.nil? || pdf_path.empty?
+        abort "Error: no PDF file specified.\n\n#{parser}"
+      end
+      Pdf2MarkdownOCR.configure do |config|
+        config.llm_api_url = options[:llm_api_url] if options[:llm_api_url]
+        config.llm_model = options[:llm_model] if options[:llm_model]
+        config.mode = options[:mode] if options[:mode]
+        config.png_dpi_resolution = options[:png_dpi] if options[:png_dpi]
+      end
+      markdown_content = Pdf2MarkdownOCR.convert_pdf(pdf_path, options[:output])
+      if markdown_content && !markdown_content.empty? && !options[:output]
+        puts markdown_content
+      end
+    end
+  end
+end
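
`OptionParser` raises different `OptionParser::ParseError` subclasses depending on what went wrong: an unknown flag raises `InvalidOption`, while a non-numeric value for an `Integer` option raises `InvalidArgument`. A minimal sketch, assuming a hypothetical `parse_cli` helper that mirrors two of the options above and reports any parse failure as a message instead of aborting:

```ruby
require "optparse"

# Hypothetical helper (not in the gem): returns [options, error_message].
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on("-o", "--output FILE") { |file| options[:output] = file }
    opts.on("--png-dpi DPI", Integer) { |dpi| options[:png_dpi] = dpi }
  end
  parser.parse!(argv)
  [options, nil]
rescue OptionParser::ParseError => e
  # Covers InvalidOption, InvalidArgument and MissingArgument alike.
  [{}, e.message]
end

p parse_cli(%w[--png-dpi 150 doc.pdf])
p parse_cli(%w[--png-dpi abc doc.pdf])
```

Returning the error rather than calling `abort` is only a convenience for trying the parser in isolation; a real executable would still print the message and exit.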

data/lib/pdf2markdownOCR/configuration.rb ADDED

@@ -0,0 +1,31 @@
+require 'logger'
+module Pdf2MarkdownOCR
+  class Configuration
+    VALID_MODES = %i[single_thread multi_thread].freeze
+    attr_accessor :llm_api_url, :llm_model, :logger, :png_dpi_resolution
+    attr_reader :mode
+    def mode=(value)
+      unless VALID_MODES.include?(value.to_sym)
+        raise ArgumentError, "Invalid mode #{value.inspect}. Must be one of: #{VALID_MODES.join(', ')}"
+      end
+      @mode = value.to_sym
+    end
+    def initialize
+      @llm_api_url = ENV['LLM_API_URL'] || "http://localhost:8000"
+      @llm_model = ENV['LLM_MODEL'] || "deepseek-ai/DeepSeek-OCR-2"
+      # ENV values are strings; coerce so the DPI is always an Integer
+      @png_dpi_resolution = Integer(ENV.fetch('PNG_DPI', 300))
+      @mode = :multi_thread
+      @logger = Logger.new($stdout).tap do |log|
+        log.progname = self.class.name.split('::').first
+      end
+    end
+  end
+end
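
Values read from `ENV` are always strings, so a DPI taken from `PNG_DPI` arrives as `"300"` rather than `300`. A minimal sketch of one way to normalise such a setting; `env_integer` is a hypothetical helper, not part of the gem:

```ruby
# Hypothetical helper: read an integer setting from the environment,
# falling back to a default when the variable is unset or blank.
def env_integer(name, default, env = ENV)
  raw = env[name]
  return default if raw.nil? || raw.strip.empty?

  Integer(raw, 10) # raises ArgumentError on values such as "high"
end

dpi = env_integer("PNG_DPI", 300)
puts "Rasterising at #{dpi} DPI"
```

Failing loudly on a malformed value is a design choice here; silently falling back to the default would hide a typo in the environment.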

data/lib/pdf2markdownOCR/llm_api.rb ADDED

@@ -0,0 +1,78 @@
+module Pdf2MarkdownOCR
+  module LlmApi
+    # Builds the chat-completions request body for a single page image.
+    def self.payload(image_path)
+      image_base64 = Base64.strict_encode64(File.binread(image_path))
+      {
+        model: Pdf2MarkdownOCR.configuration.llm_model,
+        messages: [
+          {
+            role: "user",
+            content: [
+              {
+                type: "image_url",
+                image_url: {
+                  url: "data:image/png;base64,#{image_base64}"
+                }
+              },
+              {
+                type: "text",
+                text: "<image_url>\n Free OCR."
+              }
+            ]
+          }
+        ]
+      }
+    end
+    def self.ocr_images(images)
+      markdown_pages = []
+      Pdf2MarkdownOCR.configuration.logger.info "OCR #{images.size} images"
+      t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      hydra = Typhoeus::Hydra.new
+      images.each_with_index do |image_path, index|
+        payload = Pdf2MarkdownOCR::LlmApi.payload(image_path)
+        request = Typhoeus::Request.new(
+          "#{Pdf2MarkdownOCR.configuration.llm_api_url}/v1/chat/completions",
+          method: :post,
+          body: payload.to_json,
+          headers: { "Content-Type" => "application/json" },
+          timeout: 600 # Default OpenAI timeout is 600 seconds
+        )
+        request.on_complete do |response|
+          if response.success?
+            parsed_response = JSON.parse(response.body)
+            markdown_page = parsed_response.dig("choices", 0, "message", "content") || ""
+            if markdown_page && !markdown_page.empty?
+              markdown_pages << { index: index, content: markdown_page }
+            else
+              Pdf2MarkdownOCR.configuration.logger.warn "Warning: No Markdown content generated for #{image_path}"
+            end
+          else
+            Pdf2MarkdownOCR.configuration.logger.error "Error processing #{image_path}: #{response.return_message} (#{response.code})"
+          end
+        end
+        hydra.queue(request)
+      end
+      hydra.run
+      t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      Pdf2MarkdownOCR.configuration.logger.info "Total Image processing time: #{(t2 - t1).round(2)} seconds"
+      # Restore page order (responses may complete out of order) and join pages with a horizontal rule
+      markdown_pages.sort_by { |page| page[:index] }.map { |page| page[:content] }.join("\n\n---\n\n")
+    end
+  end
+end
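
The request and response handling above follows the OpenAI chat-completions shape: the page image travels as a base64 `data:` URL, and the Markdown comes back under `choices[0].message.content`. A self-contained sketch with no HTTP involved; `build_payload`, `extract_markdown`, the simplified prompt text and the sample response are illustrative assumptions, not gem API or real server output:

```ruby
require "json"

# Illustrative only: build a chat-completions body from raw PNG bytes.
def build_payload(png_bytes, model)
  # pack("m0") produces the same output as Base64.strict_encode64
  data_url = "data:image/png;base64,#{[png_bytes].pack('m0')}"
  {
    model: model,
    messages: [
      { role: "user",
        content: [
          { type: "image_url", image_url: { url: data_url } },
          { type: "text", text: "Free OCR." }
        ] }
    ]
  }
end

# Illustrative only: pull the Markdown out of a response body, or "" if absent.
def extract_markdown(response_body)
  JSON.parse(response_body).dig("choices", 0, "message", "content") || ""
rescue JSON::ParserError
  ""
end

request_body = build_payload("fake-png-bytes", "deepseek-ai/DeepSeek-OCR-2").to_json
sample_response = { choices: [{ message: { content: "# Page 1" } }] }.to_json
puts request_body.bytesize
puts extract_markdown(sample_response)
```

Treating an unparsable body as an empty page is an assumption of this sketch; logging it as an error would be equally reasonable.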

data/lib/pdf2markdownOCR/pdf2image.rb ADDED

@@ -0,0 +1,53 @@
+require 'tempfile'
+require 'terrapin'
+module Pdf2MarkdownOCR
+  module Pdf2Image
+    def self.single_thread_conversion(pdf_path, output_prefix)
+      Pdf2MarkdownOCR.configuration.logger.info "Converting #{pdf_path} into images. DPI: #{Pdf2MarkdownOCR.configuration.png_dpi_resolution}"
+      t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
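+      # Terrapin shell-quotes interpolated values, so paths with spaces are safe
+      line = Terrapin::CommandLine.new("pdftoppm", "-png -r :dpi :pdf :prefix")
+      line.run(
+        dpi: Pdf2MarkdownOCR.configuration.png_dpi_resolution.to_s,
+        pdf: pdf_path,
+        prefix: "#{output_prefix}/pdf2ocr"
+      )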
+      t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      Pdf2MarkdownOCR.configuration.logger.info "PDF to image conversion time: #{(t2 - t1).round(2)} seconds"
+      Dir.glob("#{output_prefix}/pdf2ocr*.png").sort
+    end
+    def self.multi_thread_conversion(pdf_path, output_prefix)
+      Pdf2MarkdownOCR.configuration.logger.info "Converting #{pdf_path} into images. DPI: #{Pdf2MarkdownOCR.configuration.png_dpi_resolution}"
+      # Get total page count
+      info = Terrapin::CommandLine.new("pdfinfo", ":pdf").run(pdf: pdf_path)
+      total_pages = info.match(/^Pages:\s+(\d+)/)[1].to_i
+      Pdf2MarkdownOCR.configuration.logger.info "Total pages: #{total_pages}"
+      t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      threads = (1..total_pages).map do |page|
+        Thread.new do
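+          # Terrapin shell-quotes interpolated values, so paths with spaces are safe
+          Terrapin::CommandLine.new("pdftoppm", "-png -r :dpi -f :page -l :page :pdf :prefix").run(
+            dpi: Pdf2MarkdownOCR.configuration.png_dpi_resolution.to_s, page: page.to_s,
+            pdf: pdf_path, prefix: "#{output_prefix}/pdf2ocr"
+          )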
+          Pdf2MarkdownOCR.configuration.logger.info "Converted page #{page}/#{total_pages}"
+        end
+      end
+      threads.each(&:join)
+      t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+      Pdf2MarkdownOCR.configuration.logger.info "PDF to image conversion time: #{(t2 - t1).round(2)} seconds"
+      Dir.glob("#{output_prefix}/pdf2ocr*.png").sort
+    end
+  end
+end
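
Both conversion methods order the page images with a plain string sort, which is only correct when every page number is zero-padded to the same width. A hedged alternative, assuming file names end in `-<page>.png`; `sort_by_page` is a hypothetical helper, not part of the gem:

```ruby
# Hypothetical helper: order page images by the number just before ".png".
# Names without a trailing number sort first (nil.to_i is 0).
def sort_by_page(paths)
  paths.sort_by { |path| path[/(\d+)\.png\z/, 1].to_i }
end

files = %w[pdf2ocr-10.png pdf2ocr-2.png pdf2ocr-1.png]
p files.sort
p sort_by_page(files)
```

With these sample names the plain sort places page 10 before page 2, while the numeric sort keeps reading order.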

data/lib/pdf2markdownOCR/version.rb ADDED

@@ -0,0 +1,9 @@
+# frozen_string_literal: true
+module Pdf2MarkdownOCR
+  VERSION = "0.0.1"
+  def self.gem_version
+    Gem::Version.new(VERSION)
+  end
+end

data/lib/pdf2markdownOCR.rb ADDED

@@ -0,0 +1,64 @@
+# frozen_string_literal: true
+require "base64"
+require "fileutils" # FileUtils.remove_entry
+require "httparty"
+require "json"
+require "tmpdir" # Dir.mktmpdir
+require "typhoeus"
+require_relative "pdf2markdownOCR/version"
+require_relative "pdf2markdownOCR/configuration"
+require_relative "pdf2markdownOCR/pdf2image"
+require_relative "pdf2markdownOCR/llm_api"
+module Pdf2MarkdownOCR
+  class FileNotFoundError < StandardError; end
+  class << self
+    attr_writer :configuration
+    def configuration
+      @configuration ||= Configuration.new
+    end
+    def configure
+      yield(configuration)
+    end
+  end
+  def self.convert_pdf(pdf_path, output_file = nil)
+    Pdf2MarkdownOCR.configuration.logger.info "Parsing PDF file: #{pdf_path}"
+    unless File.exist?(pdf_path)
+      Pdf2MarkdownOCR.configuration.logger.error "File not found: #{pdf_path}"
+      raise FileNotFoundError, "File not found: #{pdf_path}"
+    end
+    markdown_content = ""
+    begin
+      tempdir = Dir.mktmpdir
+      images = []
+      if Pdf2MarkdownOCR.configuration.mode == :multi_thread
+        images = Pdf2MarkdownOCR::Pdf2Image.multi_thread_conversion(pdf_path, tempdir)
+      else
+        images = Pdf2MarkdownOCR::Pdf2Image.single_thread_conversion(pdf_path, tempdir)
+      end
+      markdown_content = Pdf2MarkdownOCR::LlmApi.ocr_images(images)
+    ensure
+      # Clean up temporary directory after processing
+      FileUtils.remove_entry(tempdir) if tempdir && Dir.exist?(tempdir)
+    end
+    # If output file is configured, write the markdown content to the file, otherwise return it as a string
+    if output_file && !output_file.empty?
+      File.write(output_file, markdown_content)
+      return nil
+    end
+    markdown_content
+  end
+end
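
`Dir.mktmpdir` also accepts a block, in which case the directory is removed when the block returns, even if it raises. A minimal sketch of that form; `with_page_images` is a hypothetical wrapper, and the block body stands in for the rasterise-and-OCR steps:

```ruby
require "tmpdir"

# Hypothetical wrapper: yields a scratch directory and removes it afterwards.
def with_page_images
  Dir.mktmpdir("pdf2ocr") do |dir|
    yield dir
  end
end

page_count = with_page_images do |dir|
  File.write(File.join(dir, "page-1.png"), "placeholder")
  Dir.children(dir).size
end
puts page_count
```

The block form returns the block's value, so the result of the work can be passed straight back to the caller without an explicit `ensure`.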

metadata ADDED

@@ -0,0 +1,53 @@
+--- !ruby/object:Gem::Specification
+name: pdf2markdownOCR
+version: !ruby/object:Gem::Version
+  version: 0.0.1
+platform: ruby
+authors:
+- Guillermo Molini
+autorequire:
+bindir: bin
+cert_chain: []
+date: 2026-06-04 00:00:00.000000000 Z
+dependencies: []
+description: A Ruby library for converting PDF documents to Markdown using OCR.
+email:
+- guillermo.molini@gmail.com
+executables:
+- pdf2markdownocr
+extensions: []
+extra_rdoc_files: []
+files:
+- README.md
+- bin/console
+- bin/pdf2markdownocr
+- lib/pdf2markdownOCR.rb
+- lib/pdf2markdownOCR/cli.rb
+- lib/pdf2markdownOCR/configuration.rb
+- lib/pdf2markdownOCR/llm_api.rb
+- lib/pdf2markdownOCR/pdf2image.rb
+- lib/pdf2markdownOCR/version.rb
+homepage: https://github.com/GMolini/pdf2markdownOCR
+licenses:
+- MIT
+metadata: {}
+post_install_message:
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '3.1'
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.4.10
+signing_key:
+specification_version: 4
+summary: PDF to Markdown OCR
+test_files: []