pdf2markdownOCR 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: f125dfcb354343c2aa778d36b422f338db5bf24734e48acbfe53bd57ad343569
+   data.tar.gz: 620f9cc746b7da2a38b8b75fa8b9ca87ff44c8d13215aa1cea54e6d0c14aa36d
+ SHA512:
+   metadata.gz: 4dd91a17be2519f2e6f8fb41455f7d27ec7a123f36d55fff0fa891a2cadfcbee36d7bc39d9848b32262abecf4055f2fe21631b69f9e5204a32b60fc695b48198
+   data.tar.gz: bd2aad9b2417f69c9d1240ca8a06e20bfaf2b4d2ed5f7a47609b84acab1d73dd67983a18446467628d8279cac31f046fca013a69187f30cab3c726236a36a90c
data/README.md ADDED
@@ -0,0 +1,130 @@
+ # pdf2markdownOCR
+
+ A Ruby gem for converting PDF documents to Markdown using a locally hosted vision LLM (OCR via AI).
+ Pages are rendered as high-resolution PNG images and then sent to an OpenAI-compatible API endpoint for text extraction.
+
+ ## Requirements
+
+ - Ruby >= 3.1
+ - poppler-utils (`pdftoppm`, `pdfinfo`)
+ - OpenAI-compatible vision LLM server (e.g. [vLLM](https://github.com/vllm-project/vllm), [Ollama](https://ollama.com), [llama.cpp](https://github.com/ggml-org/llama.cpp))
+
+ Install poppler on Debian/Ubuntu:
+
+ ```bash
+ sudo apt install poppler-utils
+ ```
+
+ You can get [DeepSeek's OCR-2 model on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2).
+
+ ## Installation
+
+ Add to your `Gemfile`:
+
+ ```ruby
+ gem 'pdf2markdownOCR'
+ ```
+
+ Then run:
+
+ ```bash
+ bundle install
+ ```
+
+ Or install directly:
+
+ ```bash
+ gem install pdf2markdownOCR
+ ```
+
+ ## Configuration
+
+ Configuration can be set via a block or via environment variables. The block takes priority. The environment variables `LLM_API_URL`, `LLM_MODEL` and `PNG_DPI` supply the defaults for `llm_api_url`, `llm_model` and `png_dpi_resolution`.
+
+ ### Via configure block
+
+ You can configure the gem using a configuration block. These are the options and their default values.
+
+ ```ruby
+ require 'pdf2markdownOCR'
+
+ Pdf2MarkdownOCR.configure do |config|
+   # URL of your OpenAI-compatible LLM server
+   config.llm_api_url = "http://localhost:8000"
+
+   # Model name to request from the server
+   config.llm_model = "deepseek-ai/DeepSeek-OCR-2"
+
+   # PNG resolution used when rasterising PDF pages (higher = better OCR, slower)
+   config.png_dpi_resolution = 300
+
+   # Conversion mode: :single_thread or :multi_thread
+   # :multi_thread converts all pages to PNGs in parallel threads
+   config.mode = :multi_thread
+
+   # The gem uses Ruby's stdlib `Logger` writing to `$stdout`. You can provide your own instance.
+   # To silence it completely, pass Logger.new("/dev/null").
+   config.logger = Logger.new($stdout).tap do |log|
+     log.progname = "Pdf2MarkdownOCR"
+   end
+ end
+ ```
+
+ ## Usage as a library
+
+ ### Convert a PDF and get Markdown as a string
+
+ ```ruby
+ require 'pdf2markdownOCR'
+
+ markdown = Pdf2MarkdownOCR.convert_pdf("document.pdf")
+ puts markdown
+ ```
+
+ ### Convert a PDF and write directly to a file
+
+ ```ruby
+ require 'pdf2markdownOCR'
+ Pdf2MarkdownOCR.convert_pdf("document.pdf", "output.md")
+ # => nil (content written to output.md)
+ ```
+
+ ## Usage as a CLI
+
+ After installation the `pdf2markdownocr` executable is available on your `PATH`. The options mirror those of the configuration block.
+
+ ```
+ Usage: pdf2markdownocr [options] <pdf_path>
+
+ Converts a PDF file to Markdown using OCR.
+
+ Options:
+     -o, --output FILE        Output Markdown file
+         --llm-api-url URL    OpenAI compatible server URL
+         --llm-model MODEL    LLM model to use
+         --mode MODE          Processing mode: single_thread or multi_thread
+         --png-dpi DPI        DPI resolution for PNG conversion
+     -v, --version            Print version
+     -h, --help               Show help message
+
+ ```
+
+ ### Examples
+
+ ```bash
+ # Basic conversion (Markdown printed to STDOUT)
+ pdf2markdownocr document.pdf
+
+ # Custom output file
+ pdf2markdownocr document.pdf -o result.md
+
+ # Custom LLM server and model
+ pdf2markdownocr document.pdf -o result.md --llm-api-url http://localhost:9800 --llm-model deepseek-ai/DeepSeek-OCR
+
+ # Print version
+ pdf2markdownocr --version
+ ```
+
+ ## License
+
+ MIT
data/bin/console ADDED
@@ -0,0 +1,29 @@
+ #!/usr/bin/env ruby
+
+ require "bundler/setup"
+ require "pdf2markdownOCR"
+
+ # Add reload method for manual reloading
+ def reload!
+   # Remove existing constants
+   Object.send(:remove_const, :Pdf2MarkdownOCR) if defined?(Pdf2MarkdownOCR)
+
+   # Clear loaded features
+   $LOADED_FEATURES.select { |f| f.include?('pdf2markdownOCR') }.each do |f|
+     $LOADED_FEATURES.delete(f)
+   end
+
+   # Require the main gem file again
+   require "pdf2markdownOCR"
+
+   puts "🔄 Reloaded Pdf2MarkdownOCR gem"
+ rescue => e
+   puts "❌ Error reloading gem: #{e.message}"
+ end
+
+ puts "🚀 Pdf2MarkdownOCR console started"
+ puts "Use 'reload!' to reload the gem after making changes"
+
+
+ require "pry"
+ Pry.start(__FILE__)
data/bin/pdf2markdownocr ADDED
@@ -0,0 +1,6 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ require_relative '../lib/pdf2markdownOCR/cli'
+
+ Pdf2MarkdownOCR::CLI.run
data/lib/pdf2markdownOCR/cli.rb ADDED
@@ -0,0 +1,76 @@
+ # frozen_string_literal: true
+
+ require 'optparse'
+ require_relative '../pdf2markdownOCR'
+ require_relative 'version'
+
+ module Pdf2MarkdownOCR
+   class CLI
+     def self.run(argv = ARGV)
+       options = {
+       }
+
+       parser = OptionParser.new do |opts|
+         opts.banner = "Usage: pdf2markdownocr [options] <pdf_path>"
+         opts.separator ""
+         opts.separator "Converts a PDF file to Markdown using OCR. If no output file is specified, the Markdown content will be printed to STDOUT."
+         opts.separator ""
+         opts.separator "Options:"
+
+         opts.on("-o", "--output FILE", "Output Markdown file") do |file|
+           options[:output] = file
+         end
+
+         opts.on("--llm-api-url URL", "OpenAI compatible server URL (default: http://localhost:8000)") do |url|
+           options[:llm_api_url] = url
+         end
+
+         opts.on("--llm-model MODEL", "LLM model to use (default: deepseek-ai/DeepSeek-OCR-2)") do |model|
+           options[:llm_model] = model
+         end
+
+         opts.on("--mode MODE", "Processing mode: single_thread or multi_thread (default: multi_thread)") do |mode|
+           options[:mode] = mode
+         end
+
+         opts.on("--png-dpi DPI", Integer, "DPI resolution for PNG conversion (default: 300)") do |dpi|
+           options[:png_dpi] = dpi
+         end
+
+         opts.on("-v", "--version", "Print version") do
+           puts Pdf2MarkdownOCR.gem_version
+           exit
+         end
+
+         opts.on("-h", "--help", "Show this help message") do
+           puts opts
+           exit
+         end
+       end
+
+       begin
+         parser.parse!(argv)
+       rescue OptionParser::ParseError => e # also covers missing or invalid option arguments
+         abort "Error: #{e.message}\n\n#{parser}"
+       end
+
+       pdf_path = argv.shift
+
+       if pdf_path.nil? || pdf_path.empty?
+         abort "Error: no PDF file specified.\n\n#{parser}"
+       end
+
+       Pdf2MarkdownOCR.configure do |config|
+         config.llm_api_url = options[:llm_api_url] if options[:llm_api_url]
+         config.llm_model = options[:llm_model] if options[:llm_model]
+         config.mode = options[:mode] if options[:mode]
+         config.png_dpi_resolution = options[:png_dpi] if options[:png_dpi]
+       end
+
+       markdown_content = Pdf2MarkdownOCR.convert_pdf(pdf_path, options[:output])
+       if markdown_content && !markdown_content.empty? && !options[:output]
+         puts markdown_content
+       end
+     end
+   end
+ end
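The option handling in `CLI.run` can be exercised without running a conversion. The following is a minimal sketch, assuming a hypothetical helper `parse_cli` (not part of the gem) that mirrors the flags defined above and returns the parsed options together with the positional PDF path:

```ruby
require 'optparse'

# Hypothetical helper for illustration only; the gem parses inside CLI.run.
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.on("-o", "--output FILE") { |file| options[:output] = file }
    opts.on("--llm-api-url URL") { |url| options[:llm_api_url] = url }
    opts.on("--llm-model MODEL") { |model| options[:llm_model] = model }
    opts.on("--mode MODE") { |mode| options[:mode] = mode }
    # Passing Integer makes OptionParser coerce the argument for us
    opts.on("--png-dpi DPI", Integer) { |dpi| options[:png_dpi] = dpi }
  end
  # parse (without the bang) returns the non-option arguments
  # and leaves the caller's array untouched
  rest = parser.parse(argv)
  [options, rest.first]
end

options, pdf_path = parse_cli(%w[document.pdf -o result.md --png-dpi 150])
```

Returning the options instead of configuring the gem directly keeps the parsing step free of side effects, which is one way to make it testable in isolation.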
data/lib/pdf2markdownOCR/configuration.rb ADDED
@@ -0,0 +1,31 @@
+ require 'logger'
+
+ module Pdf2MarkdownOCR
+   class Configuration
+     VALID_MODES = %i[single_thread multi_thread].freeze
+
+     attr_accessor :llm_api_url, :llm_model, :logger,
+                   :png_dpi_resolution
+
+     attr_reader :mode
+
+     def mode=(value)
+       unless VALID_MODES.include?(value.to_sym)
+         raise ArgumentError, "Invalid mode #{value.inspect}. Must be one of: #{VALID_MODES.join(', ')}"
+       end
+       @mode = value.to_sym
+     end
+
+     def initialize
+       @llm_api_url = ENV['LLM_API_URL'] || "http://localhost:8000"
+       @llm_model = ENV['LLM_MODEL'] || "deepseek-ai/DeepSeek-OCR-2"
+       @png_dpi_resolution = ENV['PNG_DPI'] || 300
+       @mode = :multi_thread
+
+       @logger = Logger.new($stdout).tap do |log|
+         log.progname = self.class.name.split('::').first
+       end
+     end
+   end
+
+ end
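The precedence described in the README (environment variables supply defaults, explicit assignment wins) can be sketched in isolation. This is a minimal sketch, assuming a hypothetical `DemoConfig` class that stands in for `Configuration`. Unlike the gem, it takes the environment as an argument and coerces `PNG_DPI` to an Integer, so both differences are assumptions of the sketch rather than the gem's behavior:

```ruby
# Hypothetical stand-in for the gem's Configuration class, for illustration only.
class DemoConfig
  VALID_MODES = %i[single_thread multi_thread].freeze

  attr_accessor :llm_api_url, :png_dpi_resolution
  attr_reader :mode

  # env defaults to the real environment; a plain Hash works for testing
  def initialize(env = ENV)
    @llm_api_url = env.fetch('LLM_API_URL', 'http://localhost:8000')
    # Integer() accepts both the String from the environment and the numeric default
    @png_dpi_resolution = Integer(env.fetch('PNG_DPI', 300))
    @mode = :multi_thread
  end

  # Accepts strings or symbols and rejects anything outside VALID_MODES
  def mode=(value)
    sym = value.to_sym
    raise ArgumentError, "Invalid mode #{value.inspect}" unless VALID_MODES.include?(sym)

    @mode = sym
  end
end

config = DemoConfig.new('PNG_DPI' => '150') # environment-style input
config.png_dpi_resolution = 200             # explicit assignment overrides it
config.mode = 'single_thread'               # strings are converted to symbols
```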
data/lib/pdf2markdownOCR/llm_api.rb ADDED
@@ -0,0 +1,78 @@
+ module Pdf2MarkdownOCR
+   module LlmApi
+
+     def self.payload(image_path)
+
+       image_url = Base64.strict_encode64(File.binread(image_path))
+
+       payload = {
+         model: Pdf2MarkdownOCR.configuration.llm_model,
+         messages: [
+           {
+             role: "user",
+             content: [
+               {
+                 type: "image_url",
+                 image_url: {
+                   url: "data:image/png;base64,#{image_url}"
+                 }
+               },
+               {
+                 type: "text",
+                 text: "<image_url>\n Free OCR."
+               }
+             ]
+           }
+         ],
+       }
+
+       payload
+     end
+
+     def self.ocr_images(images)
+       markdown_pages = []
+
+       Pdf2MarkdownOCR.configuration.logger.info "OCR #{images.size} images"
+       t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       hydra = Typhoeus::Hydra.new
+       images.each_with_index do |image_path, index|
+
+         payload = Pdf2MarkdownOCR::LlmApi.payload(image_path)
+         request = Typhoeus::Request.new(
+           "#{Pdf2MarkdownOCR.configuration.llm_api_url}/v1/chat/completions",
+           method: :post,
+           body: payload.to_json,
+           headers: { "Content-Type" => "application/json" },
+           timeout: 600 # Default OpenAI timeout is 600 seconds
+         )
+
+         request.on_complete do |response|
+           if response.success?
+             parsed_response = JSON.parse(response.body)
+             markdown_page = parsed_response.dig("choices", 0, "message", "content") || ""
+
+             if markdown_page && !markdown_page.empty?
+               markdown_pages << { index: index, content: markdown_page }
+             else
+               Pdf2MarkdownOCR.configuration.logger.warn "Warning: No Markdown content generated for #{image_path}"
+             end
+           else
+             Pdf2MarkdownOCR.configuration.logger.error "Error processing #{image_path}: #{response.return_message} (#{response.code})"
+           end
+         end
+
+
+
+         hydra.queue(request)
+       end
+       hydra.run
+
+       t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       Pdf2MarkdownOCR.configuration.logger.info "Total Image processing time: #{(t2 - t1).round(2)} seconds"
+
+       markdown_content = markdown_pages.sort_by { |page| page[:index] }.map { |page| page[:content] }.join("\n\n---\n\n")
+       markdown_content
+     end
+   end
+ end
+
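The two data transformations in `LlmApi` (building the request body and reassembling out-of-order page results) can be tried without a server. This is a minimal sketch, assuming two hypothetical helpers, `build_payload` and `join_pages`, that take their inputs as arguments instead of reading the global configuration or the filesystem:

```ruby
require 'base64'
require 'json'

# Hypothetical helper: same request shape as LlmApi.payload, but fed raw
# PNG bytes and a model name instead of a file path and global config.
def build_payload(png_bytes, model)
  data_url = "data:image/png;base64,#{Base64.strict_encode64(png_bytes)}"
  {
    model: model,
    messages: [
      {
        role: "user",
        content: [
          { type: "image_url", image_url: { url: data_url } },
          { type: "text", text: "<image_url>\n Free OCR." }
        ]
      }
    ]
  }
end

# Hypothetical helper: responses may complete in any order, so sort by the
# page index before joining with the same separator the gem uses.
def join_pages(pages)
  pages.sort_by { |page| page[:index] }
       .map { |page| page[:content] }
       .join("\n\n---\n\n")
end

json = build_payload("\x89PNG".b, "deepseek-ai/DeepSeek-OCR-2").to_json
merged = join_pages([{ index: 1, content: "B" }, { index: 0, content: "A" }])
```

Because each result carries its own index, a page whose request failed simply leaves no entry, and the remaining pages still come out in document order.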
data/lib/pdf2markdownOCR/pdf2image.rb ADDED
@@ -0,0 +1,53 @@
+ require 'tempfile'
+ require 'terrapin'
+
+ module Pdf2MarkdownOCR
+   module Pdf2Image
+     def self.single_thread_conversion(pdf_path, output_prefix)
+       Pdf2MarkdownOCR.configuration.logger.info "Converting #{pdf_path} into images. DPI: #{Pdf2MarkdownOCR.configuration.png_dpi_resolution}"
+
+       t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       line = Terrapin::CommandLine.new(
+         "pdftoppm",
+         "-png -r :dpi :pdf :prefix" # placeholders are shell-escaped by Terrapin, so paths with spaces work
+       )
+       line.run(dpi: Pdf2MarkdownOCR.configuration.png_dpi_resolution.to_s, pdf: pdf_path, prefix: "#{output_prefix}/pdf2ocr")
+
+       t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       Pdf2MarkdownOCR.configuration.logger.info "PDF to image conversion time: #{(t2 - t1).round(2)} seconds"
+       Dir.glob("#{output_prefix}/pdf2ocr*.png").sort
+     end
+
+     def self.multi_thread_conversion(pdf_path, output_prefix)
+
+       Pdf2MarkdownOCR.configuration.logger.info "Converting #{pdf_path} into images. DPI: #{Pdf2MarkdownOCR.configuration.png_dpi_resolution}"
+
+       # Get total page count
+       info = Terrapin::CommandLine.new("pdfinfo", ":pdf").run(pdf: pdf_path)
+       total_pages = info.match(/^Pages:\s+(\d+)/)[1].to_i
+
+       Pdf2MarkdownOCR.configuration.logger.info "Total pages: #{total_pages}"
+       t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       threads = (1..total_pages).map do |page|
+         Thread.new do
+           Terrapin::CommandLine.new(
+             "pdftoppm",
+             "-png -r :dpi -f :page -l :page :pdf :prefix" # one pdftoppm process per page
+           ).run(dpi: Pdf2MarkdownOCR.configuration.png_dpi_resolution.to_s, page: page.to_s, pdf: pdf_path, prefix: "#{output_prefix}/pdf2ocr")
+
+           Pdf2MarkdownOCR.configuration.logger.info "Converted page #{page}/#{total_pages}"
+         end
+       end
+
+       threads.each(&:join)
+       t2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+       Pdf2MarkdownOCR.configuration.logger.info "PDF to image conversion time: #{(t2 - t1).round(2)} seconds"
+
+       Dir.glob("#{output_prefix}/pdf2ocr*.png").sort
+     end
+
+   end
+
+
+
+ end
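`multi_thread_conversion` reads the page count from `pdfinfo` output with `match(...)[1]`, which raises a `NoMethodError` on `nil` when no `Pages:` line is present. The following is a minimal sketch, assuming a hypothetical helper `parse_page_count` (not part of the gem), of how that lookup could fail with a clearer message instead:

```ruby
# Hypothetical helper: extracts the page count from `pdfinfo` text output
# and raises a descriptive error when the expected line is missing.
def parse_page_count(pdfinfo_output)
  match = pdfinfo_output.match(/^Pages:\s+(\d+)/)
  raise ArgumentError, "no 'Pages:' line found in pdfinfo output" unless match

  match[1].to_i
end

# Sample text shaped like pdfinfo output (values are made up for the example)
sample = <<~INFO
  Title:          Example
  Pages:          12
  Page size:      595 x 842 pts (A4)
INFO

pages = parse_page_count(sample)
```

The anchored pattern only matches a line starting with `Pages:`, so the `Page size:` line in the sample is ignored.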
data/lib/pdf2markdownOCR/version.rb ADDED
@@ -0,0 +1,9 @@
+ # frozen_string_literal: true
+
+ module Pdf2MarkdownOCR
+   VERSION = "0.0.1"
+
+   def self.gem_version
+     Gem::Version.new(VERSION)
+   end
+ end
data/lib/pdf2markdownOCR.rb ADDED
@@ -0,0 +1,64 @@
+ # frozen_string_literal: true
+
+ require "base64"
+ require "fileutils"
+ require "httparty"
+ require "json"
+ require "tmpdir"
+ require "typhoeus"
+
+ require_relative "pdf2markdownOCR/version"
+ require_relative "pdf2markdownOCR/configuration"
+ require_relative "pdf2markdownOCR/pdf2image"
+ require_relative "pdf2markdownOCR/llm_api"
+
+ module Pdf2MarkdownOCR
+   class FileNotFoundError < StandardError; end
+
+   class << self
+     attr_writer :configuration
+
+     def configuration
+       @configuration ||= Configuration.new
+     end
+
+     def configure
+       yield(configuration)
+     end
+
+   end
+
+   def self.convert_pdf(pdf_path, output_file = nil)
+
+     Pdf2MarkdownOCR.configuration.logger.info "Parsing PDF file: #{pdf_path}"
+     unless File.exist?(pdf_path)
+       Pdf2MarkdownOCR.configuration.logger.error "File not found: #{pdf_path}"
+       raise FileNotFoundError, "File not found: #{pdf_path}"
+     end
+
+     markdown_content = ""
+     begin
+       tempdir = Dir.mktmpdir
+
+       images = []
+       if Pdf2MarkdownOCR.configuration.mode == :multi_thread
+         images = Pdf2MarkdownOCR::Pdf2Image.multi_thread_conversion(pdf_path, tempdir)
+       else
+         images = Pdf2MarkdownOCR::Pdf2Image.single_thread_conversion(pdf_path, tempdir)
+       end
+
+       markdown_content = Pdf2MarkdownOCR::LlmApi.ocr_images(images)
+     ensure
+       # Clean up temporary directory after processing
+       FileUtils.remove_entry(tempdir) if tempdir && Dir.exist?(tempdir)
+     end
+
+     # If output file is configured, write the markdown content to the file, otherwise return it as a string
+     if output_file && !output_file.empty?
+       File.write(output_file, markdown_content)
+       return nil
+     end
+     markdown_content
+   end
+
+ end
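`convert_pdf` pairs `Dir.mktmpdir` with an `ensure` block to remove the scratch directory. The block form of `Dir.mktmpdir` from Ruby's standard library gives the same guarantee with less bookkeeping. This is a minimal sketch, assuming a hypothetical wrapper `with_scratch_dir` (not part of the gem) and a stub file in place of rendered pages:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical wrapper: yields a scratch directory and returns the block's
# value. Dir.mktmpdir removes the directory and its contents when the block
# exits, including when it exits via an exception.
def with_scratch_dir
  Dir.mktmpdir("pdf2ocr") do |dir|
    yield dir
  end
end

scratch_path = nil
listing = with_scratch_dir do |dir|
  scratch_path = dir
  File.write(File.join(dir, "page-1.png"), "stub") # stands in for a rendered page
  Dir.children(dir)
end
```

After the call, `listing` holds the file names seen inside the block, while the directory at `scratch_path` no longer exists.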
metadata ADDED
@@ -0,0 +1,53 @@
+ --- !ruby/object:Gem::Specification
+ name: pdf2markdownOCR
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - Guillermo Molini
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2026-06-04 00:00:00.000000000 Z
+ dependencies: []
+ description: A Ruby library for converting PDF documents to Markdown using OCR.
+ email:
+ - guillermo.molini@gmail.com
+ executables:
+ - pdf2markdownocr
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README.md
+ - bin/console
+ - bin/pdf2markdownocr
+ - lib/pdf2markdownOCR.rb
+ - lib/pdf2markdownOCR/cli.rb
+ - lib/pdf2markdownOCR/configuration.rb
+ - lib/pdf2markdownOCR/llm_api.rb
+ - lib/pdf2markdownOCR/pdf2image.rb
+ - lib/pdf2markdownOCR/version.rb
+ homepage: https://github.com/GMolini/pdf2markdownOCR
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '3.1'
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.4.10
+ signing_key:
+ specification_version: 4
+ summary: PDF to Markdown OCR
+ test_files: []