tesseract_ffi 0.2.0

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: b299b8e91905a54a81999730de02ae2dbf5ed58cf2bbf826854c178fd5f33296
4
+ data.tar.gz: 8906f6a6dd21011c076c8d785f6ac99e2297112831f829a924a32f3763b09713
5
+ SHA512:
6
+ metadata.gz: 9d11f40349117f99080ba4d3eb7f75c4c3d44941c6a3e4aaeec71f22b0ac8e6bc0a361e89c6c657a3fd864853300112883c42162ee5af89f20b841a3cb91f544
7
+ data.tar.gz: cb2e721f373141bd6f42ef97ebd39257a6735c1b34e241739cdc4cfcdf5f0d7c6747ad95df3b6aa13b12e8a49115081dda447c708c5a6b64718d9dc57c739d2c
@@ -0,0 +1,2 @@
1
+ forms.sublime*
2
+ coverage
@@ -0,0 +1,74 @@
1
+ # Contributor Covenant Code of Conduct
2
+
3
+ ## Our Pledge
4
+
5
+ In the interest of fostering an open and welcoming environment, we as
6
+ contributors and maintainers pledge to making participation in our project and
7
+ our community a harassment-free experience for everyone, regardless of age, body
8
+ size, disability, ethnicity, gender identity and expression, level of experience,
9
+ nationality, personal appearance, race, religion, or sexual identity and
10
+ orientation.
11
+
12
+ ## Our Standards
13
+
14
+ Examples of behavior that contributes to creating a positive environment
15
+ include:
16
+
17
+ * Using welcoming and inclusive language
18
+ * Being respectful of differing viewpoints and experiences
19
+ * Gracefully accepting constructive criticism
20
+ * Focusing on what is best for the community
21
+ * Showing empathy towards other community members
22
+
23
+ Examples of unacceptable behavior by participants include:
24
+
25
+ * The use of sexualized language or imagery and unwelcome sexual attention or
26
+ advances
27
+ * Trolling, insulting/derogatory comments, and personal or political attacks
28
+ * Public or private harassment
29
+ * Publishing others' private information, such as a physical or electronic
30
+ address, without explicit permission
31
+ * Other conduct which could reasonably be considered inappropriate in a
32
+ professional setting
33
+
34
+ ## Our Responsibilities
35
+
36
+ Project maintainers are responsible for clarifying the standards of acceptable
37
+ behavior and are expected to take appropriate and fair corrective action in
38
+ response to any instances of unacceptable behavior.
39
+
40
+ Project maintainers have the right and responsibility to remove, edit, or
41
+ reject comments, commits, code, wiki edits, issues, and other contributions
42
+ that are not aligned to this Code of Conduct, or to ban temporarily or
43
+ permanently any contributor for other behaviors that they deem inappropriate,
44
+ threatening, offensive, or harmful.
45
+
46
+ ## Scope
47
+
48
+ This Code of Conduct applies both within project spaces and in public spaces
49
+ when an individual is representing the project or its community. Examples of
50
+ representing a project or community include using an official project e-mail
51
+ address, posting via an official social media account, or acting as an appointed
52
+ representative at an online or offline event. Representation of a project may be
53
+ further defined and clarified by project maintainers.
54
+
55
+ ## Enforcement
56
+
57
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
58
+ reported by contacting the project team at dverrier@gmail.com. All
59
+ complaints will be reviewed and investigated and will result in a response that
60
+ is deemed necessary and appropriate to the circumstances. The project team is
61
+ obligated to maintain confidentiality with regard to the reporter of an incident.
62
+ Further details of specific enforcement policies may be posted separately.
63
+
64
+ Project maintainers who do not follow or enforce the Code of Conduct in good
65
+ faith may face temporary or permanent repercussions as determined by other
66
+ members of the project's leadership.
67
+
68
+ ## Attribution
69
+
70
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71
+ available at [https://contributor-covenant.org/version/1/4][version]
72
+
73
+ [homepage]: https://contributor-covenant.org
74
+ [version]: https://contributor-covenant.org/version/1/4/
data/Gemfile ADDED
@@ -0,0 +1,18 @@
1
+ source 'https://rubygems.org'
2
+
3
+ gem 'rake'
4
+ gem 'ffi'
5
+
6
+ group :dev, :test do
7
+ gem 'awesome_print'
8
+ gem 'rdoc'
9
+ gem 'hocr_reader'
10
+
11
+ end
12
+
13
+ group :test do
14
+ gem 'minitest'
15
+ gem 'mocha'
16
+ gem 'nokogiri'
17
+ gem 'simplecov', require: false
18
+ end
@@ -0,0 +1,36 @@
1
+ GEM
2
+ remote: https://rubygems.org/
3
+ specs:
4
+ awesome_print (1.8.0)
5
+ docile (1.3.2)
6
+ ffi (1.13.1)
7
+ hocr_reader (0.1.0)
8
+ nokogiri (~> 1.10.10)
9
+ mini_portile2 (2.4.0)
10
+ minitest (5.14.1)
11
+ mocha (1.11.2)
12
+ nokogiri (1.10.10)
13
+ mini_portile2 (~> 2.4.0)
14
+ rake (13.0.1)
15
+ rdoc (6.2.1)
16
+ simplecov (0.18.5)
17
+ docile (~> 1.1)
18
+ simplecov-html (~> 0.11)
19
+ simplecov-html (0.12.2)
20
+
21
+ PLATFORMS
22
+ ruby
23
+
24
+ DEPENDENCIES
25
+ awesome_print
26
+ ffi
27
+ hocr_reader
28
+ minitest
29
+ mocha
30
+ nokogiri
31
+ rake
32
+ rdoc
33
+ simplecov
34
+
35
+ BUNDLED WITH
36
+ 2.1.4
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 David Verrier
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
@@ -0,0 +1,181 @@
1
+ # Tesseract_FFI
2
+
3
+ Welcome to Tesseract_FFI!
4
+ This is a Ruby wrapper for the Tesseract OCR library. _Before installing this gem, make sure that Tesseract runs. For example, run the command_
5
+
6
+ ```bash
7
+ $ tesseract --version
8
+ ```
9
+
10
+ and on Linux and similar systems you should see something like
11
+ ```bash
12
+ tesseract 4.1.1-rc2-25-g9707
13
+ leptonica-1.78.0
14
+ libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
15
+ Found AVX2
16
+ Found AVX
17
+ Found FMA
18
+ Found SSE
19
+ Found libarchive 3.1.2
20
+ ```
21
+ Native Windows is untested, but the Windows Subsystem for Linux works really well!
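Note that `lib/tesseract_ffi.rb` loads the shared library from the fixed path `/usr/lib/x86_64-linux-gnu/libtesseract.so`, so the gem as shipped expects that layout. A minimal pre-flight sketch for checking where the library lives on a given machine; the helper name and every candidate path after the first are assumptions for illustration, not part of the gem:

```ruby
# Hypothetical helper (not part of tesseract_ffi): report which of a list of
# candidate shared-library paths exists on this machine.
CANDIDATE_LIBS = [
  '/usr/lib/x86_64-linux-gnu/libtesseract.so', # path hard-coded by the gem
  '/usr/lib/libtesseract.so',                  # assumed alternative location
  '/usr/local/lib/libtesseract.so'             # assumed alternative location
].freeze

# Returns the first path that exists, or nil when none does.
def first_existing_library(candidates = CANDIDATE_LIBS)
  candidates.find { |path| File.exist?(path) }
end

found = first_existing_library
puts(found ? "libtesseract found at #{found}" : 'libtesseract not found in the candidate paths')
```

If the check reports a path other than the first one, the `ffi_lib` line in `lib/tesseract_ffi.rb` would probably need adjusting before the gem can load.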
22
+
23
+
24
+
25
+ ## Installation
26
+
27
+ Add this line to your application's Gemfile:
28
+
29
+ ```ruby
30
+ gem 'tesseract_ffi'
31
+ ```
32
+
33
+ And then execute:
34
+ ```bash
35
+ $ bundle install
36
+ ```
37
+ Or install it yourself as:
38
+ ```bash
39
+ $ gem install tesseract_ffi
40
+ ```
41
+
42
+ ## Usage
43
+ The fastest way to get going is to use the high-level functions, which will probably suit most people most of the time.
44
+
45
+ #### To convert an image to a string
46
+ ```ruby
47
+ require 'tesseract_ffi'
48
+
49
+ TesseractFFI.to_text('my_image.png')
50
+ ```
51
+ #### To convert an image to a searchable PDF file
52
+ ```ruby
53
+ require 'tesseract_ffi'
54
+ TesseractFFI.to_pdf('my_image.png', 'output_file')
55
+ ```
56
+
57
+ ## Languages
58
+ When the default recognition language, English, is not suitable, you can change it. The abbreviations used for some common European languages are
59
+
60
+ * deu - German
61
+ * eng - English
62
+ * fra - French
63
+ * ita - Italian
64
+ * nld - Dutch
65
+ * por - Portuguese
66
+ * spa - Spanish
67
+
68
+ but Tesseract itself supports many more languages, including but not limited to chi_sim (Chinese simplified), chi_tra (Chinese traditional), chr (Cherokee), cym (Welsh), frk (Frankish), frm (French, Middle, ca.1400-1600). Just ensure that you have the corresponding Tesseract language data files installed. The best way to confirm this is directly from the command line. For example, to check that the French recognition files are available to Tesseract, run this command to recognise a test image in French:
69
+ ```bash
70
+ tesseract imagename.png mytext -l fra
71
+ ```
72
+ To call from within a Ruby file, the following snippet should work for an image in German:
73
+ ```ruby
74
+ require 'tesseract_ffi'
75
+
76
+ TesseractFFI.to_text('my_image.png', 'deu')
77
+ ```
78
+ or by creating Ruby objects:
79
+
80
+ ```ruby
81
+ require 'tesseract_ffi'
82
+ tess = TesseractFFI::Tesseract.new(
83
+ language:'fra',
84
+ file_name: 'test/images/bonjour.png')
85
+ tess.recognize
86
+ text = tess.utf8_text
87
+ ```
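The Tesseract command line accepts several languages joined with `+` (for example `-l eng+fra`). Since the gem hands its `language` value straight to `TessBaseAPIInit2`, the same combined string will presumably work here too, though that is an assumption and not something the README states. A minimal sketch of building such a string; the helper name and the code-format check are illustrative only:

```ruby
# Hypothetical helper (not part of tesseract_ffi): join language codes with '+'.
# The pattern accepts codes such as 'eng' or 'chi_sim'; it is a rough sanity
# check, not a list of what Tesseract actually has installed.
LANGUAGE_CODE = /\A[a-z]{3}(_[a-z]+)*\z/

def language_string(*codes)
  bad = codes.reject { |code| code.match?(LANGUAGE_CODE) }
  raise ArgumentError, "unexpected language code(s): #{bad.join(', ')}" unless bad.empty?

  codes.join('+')
end

puts language_string('eng', 'fra')
puts language_string('chi_sim')
```

The result could then be passed as the `language:` argument of `TesseractFFI::Tesseract.new`, provided the matching language data files are installed.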
88
+
89
+ ## To Generate HOCR
90
+ Wikipedia says HOCR 'is an open standard of data representation for formatted text
91
+ obtained from optical character recognition (OCR)'. Tesseract can produce it, including the bounding boxes of the words, lines, and paragraphs on a page.
92
+
93
+ ```ruby
94
+ require 'tesseract_ffi'
95
+ tess = TesseractFFI::Tesseract.new(
96
+ file_name: 'test/images/4words.png',
97
+ source_resolution:96)
98
+ tess.recognize
99
+ text = tess.hocr_text
100
+ ```
101
+
102
+ ```xml
103
+ <div class='ocr_page' id='page_18' title='image ""; bbox 0 0 341 17; ppageno 17'>
104
+ <div class='ocr_carea' id='block_18_1' title="bbox 0 0 341 17">
105
+ <p class='ocr_par' id='par_18_1' lang='eng' title="bbox 0 0 341 17">
106
+ <span class='ocr_line' id='line_18_1' title="bbox 0 0 341 17; baseline -0.012 -1; x_size 16; x_descenders 4; x_ascenders 4">
107
+ <span class='ocrx_word' id='word_18_1' title='bbox 0 4 49 17; x_wconf 92'>Name</span>
108
+ <span class='ocrx_word' id='word_18_2' title='bbox 54 4 94 17; x_wconf 90'>Arial</span>
109
+ <span class='ocrx_word' id='word_18_3' title='bbox 237 0 296 15; x_wconf 90'>Century</span>
110
+ <span class='ocrx_word' id='word_18_4' title='bbox 302 0 341 12; x_wconf 90'>Peter</span>
111
+ </span>
112
+ </p>
113
+ </div>
114
+ </div>
115
+ ```
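Each `title` attribute in the output above carries the bounding box as `bbox x0 y0 x1 y1`. A minimal sketch of pulling the word boxes out of hOCR text with a regular expression and converting them to the `x, y, w, h` form that `recognize_rectangle` takes; the helper name is hypothetical, the pattern assumes the attribute layout shown above, and a real project might prefer an XML parser such as nokogiri:

```ruby
# Hypothetical helper (not part of tesseract_ffi): extract word boxes from hOCR.
WORD_PATTERN = %r{<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'>([^<]*)</span>}

def word_boxes(hocr)
  hocr.scan(WORD_PATTERN).map do |x0, y0, x1, y1, conf, text|
    x0, y0, x1, y1 = [x0, y0, x1, y1].map(&:to_i)
    # hOCR gives two corners; convert to origin plus width and height.
    { text: text, confidence: conf.to_i, rect: [x0, y0, x1 - x0, y1 - y0] }
  end
end

sample = <<~HOCR
  <span class='ocrx_word' id='word_18_1' title='bbox 0 4 49 17; x_wconf 92'>Name</span>
  <span class='ocrx_word' id='word_18_4' title='bbox 302 0 341 12; x_wconf 90'>Peter</span>
HOCR

word_boxes(sample).each { |word| puts "#{word[:text]}: #{word[:rect].inspect}" }
# Name: [0, 4, 49, 13]
# Peter: [302, 0, 39, 12]
```

Each `rect` array has the same element order as the arguments of `recognize_rectangle(x, y, w, h)`.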
116
+
117
+ ## Recognise Part of an Image
118
+ ```ruby
119
+ require 'tesseract_ffi'
120
+ tess = TesseractFFI::Tesseract.new(
121
+ file_name: 'test/images/4words.png')
122
+
123
+ # tess.recognize_rectangle(x,y,w,h)
124
+ tess.recognize_rectangle(300, 0, 41, 15)
125
+ text = tess.utf8_text
126
+ # => "Peter"
127
+
128
+ ```
129
+
130
+ ## General Structure
131
+
132
+ Create a TesseractFFI::Tesseract object specifying the image file, the language(s) and, optionally, the source resolution (dpi of the image) and the OCR Engine Mode (OEM). The default is to use the latest mode, which uses a neural network for the recognition. For some purposes, such as typeface/font recognition, it can be desirable to use the legacy mode even though the recognition is not usually as good.
133
+
134
+ ```ruby
135
+ require 'tesseract_ffi'
136
+ tess = TesseractFFI::Tesseract.new(
137
+ file_name: 'test/images/4words.png',
138
+ source_resolution:96)
139
+
140
+ ```
141
+ Then call `setup` on the Tesseract object, with the desired method calls in a block.
142
+
143
+ ```ruby
144
+
145
+ tess.setup do
146
+ tess.set_rectangle(300, 0, 40, 20)
147
+ tess.ocr
148
+ puts tess.utf8_text
149
+ # => Peter
150
+ tess.set_rectangle(0, 0, 340, 17)
151
+ tess.ocr
152
+ puts tess.utf8_text
153
+ # => Name Arial Century Peter
154
+ end
155
+
156
+ ```
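The OCR Engine Mode is chosen with the `oem:` keyword of the constructor, using the integer constants defined in `lib/tesseract_ffi.rb` (`LEGACY`, `LTSM`, `LEGACY_LTSM`, `DEFAULT`). The sketch below mirrors those values locally so that it runs without the gem; the description strings and the helper are illustrative assumptions, not part of the gem:

```ruby
# Local mirror of the OEM constants from lib/tesseract_ffi.rb, so this runs
# without the gem installed. The descriptions are illustrative, not from the gem.
OEM_MODES = {
  0 => 'LEGACY (legacy engine only)',
  1 => 'LTSM (neural-network engine only; spelled LTSM in the gem)',
  2 => 'LEGACY_LTSM (both engines combined)',
  3 => 'DEFAULT (the default, as described above)'
}.freeze

# Hypothetical helper: describe an OEM value, rejecting unknown ones early.
def describe_oem(mode)
  OEM_MODES.fetch(mode) { raise ArgumentError, "unknown OEM value #{mode.inspect}" }
end

puts describe_oem(0)
puts describe_oem(3)
```

With the gem loaded, the legacy engine would be requested with something like `TesseractFFI::Tesseract.new(file_name: 'test/images/4words.png', oem: TesseractFFI::LEGACY)`.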
157
+
158
+
159
+ ## Low Level Calls
160
+
161
+ If you look under the hood, there are intermediate Ruby methods that do most things, and some very low-level functions that call the Tesseract C API through the wonderful FFI library. The low-level functions give alarming error messages and often dump the stack if called in the wrong order, so they are not for the faint of heart. But if your terminal lets you scroll back 1000 lines, you can usually see where the call to Tesseract went wrong. This gem aims to hide the complexity of the direct calls to the C library. The examples directory includes several files that show how to proceed at the different levels of complexity, and the tests show more usage.
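The call order used by the bundled examples and by `Tesseract#setup` is: create, init, set image, recognize, then end and delete. A minimal sketch of that ordering, with a stand-in object in place of the real bindings so that it runs without libtesseract; `FakeTess` and `with_handle` are hypothetical names, and only the method names mirror the gem's `tess_*` functions:

```ruby
# Stand-in for the FFI bindings: records calls instead of touching libtesseract.
class FakeTess
  attr_reader :calls

  def initialize
    @calls = []
  end

  %i[tess_create tess_init tess_set_image tess_recognize tess_end tess_delete].each do |name|
    define_method(name) do |*_args|
      @calls << name
      name == :tess_create ? :handle : 0
    end
  end
end

# Hypothetical wrapper: always release the handle, even if the block raises.
def with_handle(api, language: 'eng')
  handle = api.tess_create
  raise 'init failed' unless api.tess_init(handle, 0, language, 3).zero?

  yield handle
ensure
  if handle
    api.tess_end(handle)
    api.tess_delete(handle)
  end
end

api = FakeTess.new
with_handle(api) do |handle|
  api.tess_set_image(handle, :image)
  api.tess_recognize(handle, 0)
end
puts api.calls.join(' -> ')
```

Keeping the release calls in an `ensure` clause is the same design choice `Tesseract#setup` makes: the handle is ended and deleted whether or not the block succeeds.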
162
+
163
+
164
+ ## Development
165
+
166
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `bundle exec rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
167
+
168
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
169
+
170
+ ## Contributing
171
+
172
+ Bug reports and pull requests are welcome on GitHub at https://github.com/dverrier/tesseract_ffi. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/dverrier/tesseract_ffi/blob/master/CODE_OF_CONDUCT.md).
173
+
174
+
175
+ ## License
176
+
177
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
178
+
179
+ ## Code of Conduct
180
+
181
+ Everyone interacting in the Tesseract_FFI project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/dverrier/tesseract_ffi/blob/master/CODE_OF_CONDUCT.md).
@@ -0,0 +1,24 @@
1
+ require "bundler/gem_tasks"
2
+ require "rake/testtask"
3
+
4
+ Rake::TestTask.new(:test) do |t|
5
+ t.libs << "test"
6
+ t.libs << "lib"
7
+ t.test_files = FileList["test/**/*_test.rb"]
8
+ end
9
+
10
+ Rake::TestTask.new(:test_system) do |test|
11
+ test.libs << 'lib' << 'test'
12
+ test.pattern = 'test/system/*_test.rb'
13
+ test.verbose = true
14
+ end
15
+
16
+ Rake::TestTask.new(:test_units) do |test|
17
+ test.libs << 'lib' << 'test'
18
+ test.pattern = 'test/unit/*_test.rb'
19
+ test.verbose = true
20
+ end
21
+
22
+ task :default => :test
23
+
24
+
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "tesseract_ffi"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
@@ -0,0 +1,7 @@
1
+
2
+ # execute from the top-level directory to get the
3
+ # path to the image correct
4
+ require 'tesseract_ffi'
5
+ puts TesseractFFI.to_text('test/images/bonjour.png','fra')
6
+ # => Bonjour les enfants
7
+
@@ -0,0 +1,31 @@
1
+ # execute from the top-level directory to get the
2
+ # path to the image correct
3
+
4
+ require 'tesseract_ffi'
5
+ #
6
+ # This example shows how to use the low-level functions directly
7
+ # to convert an image to a searchable pdf file
8
+ #
9
+
10
+ input_filename = 'test/images/4words.png'
11
+ output_stem = 'myfirst'
12
+ image = TesseractFFI.tess_pix_read(input_filename)
13
+
14
+
15
+ lang = 'eng'
16
+ handle = TesseractFFI.tess_create
17
+
18
+ if TesseractFFI.tess_init(handle, 0, lang, 0) != 0 # init returns 0 on success, -1 on failure
19
+ puts "Error initializing tesseract"
20
+ end
21
+
22
+ TesseractFFI.tess_set_image(handle, image)
23
+
24
+ datapath = TesseractFFI.tess_get_datapath(handle)
25
+ pdf_renderer = TesseractFFI.tess_pdf_renderer_create(output_stem, datapath, false)
26
+ TesseractFFI.tess_process_pages(handle, input_filename, nil, 5000, pdf_renderer)
27
+ puts "PDF file " + output_stem + ".pdf written"
28
+
29
+ TesseractFFI.tess_end(handle)
30
+ TesseractFFI.tess_delete(handle)
31
+
@@ -0,0 +1,39 @@
1
+ # execute from the top-level directory to get the
2
+ # path to the image correct
3
+
4
+ require 'tesseract_ffi'
5
+
6
+ #
7
+ # This example calls the recogniser and
8
+ # prints the results in raw text form and
9
+ # xml form
10
+
11
+ puts "Found Tesseract #{TesseractFFI.version}"
12
+ filename = 'test/images/4words.png'
13
+ image = TesseractFFI.tess_pix_read(filename)
14
+
15
+ lang = 'eng'
16
+ handle = TesseractFFI.tess_create
17
+
18
+ if TesseractFFI.tess_init(handle, 0, lang, 0) != 0 # init returns 0 on success, -1 on failure
19
+ puts "Error initializing tesseract"
20
+ end
21
+
22
+ TesseractFFI.tess_set_image(handle, image)
23
+
24
+
25
+ TesseractFFI.tess_set_source_resolution(handle, 90)
26
+
27
+ if TesseractFFI.tess_recognize(handle, 0) != 0 # recognize returns 0 on success
28
+ puts "Error in Tesseract recognition"
29
+ end
30
+
31
+ # print the text directly
32
+ puts TesseractFFI.tess_get_utf8(handle,0)
33
+
34
+ # print the HOCR format
35
+ puts TesseractFFI.tess_get_hocr(handle,0)
36
+
37
+ TesseractFFI.tess_end(handle)
38
+ TesseractFFI.tess_delete(handle)
39
+
@@ -0,0 +1,15 @@
1
+ require 'tesseract_ffi'
2
+ tess = TesseractFFI::Tesseract.new(
3
+ file_name: 'test/images/4words.png',
4
+ source_resolution:96)
5
+
6
+ tess.setup do
7
+ # x,y,w,h
8
+ tess.set_rectangle(300, 0, 40, 20)
9
+ tess.ocr
10
+ puts tess.utf8_text
11
+ tess.set_rectangle(0,0,340,17)
12
+ tess.ocr
13
+ puts tess.utf8_text
14
+ # and so on
15
+ end
@@ -0,0 +1,76 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'ffi'
4
+ require 'tesseract_ffi/version'
5
+ require 'tesseract_ffi/conf_vars' # mix-in to tesseract
6
+ require 'tesseract_ffi/oem' # mix-in to tesseract
7
+ require 'tesseract_ffi/tesseract'
8
+ require 'tesseract_ffi/tess_exception'
9
+ require 'tesseract_ffi/quick'
10
+
11
+ # module TesseractFFI
12
+ module TesseractFFI
13
+ extend FFI::Library
14
+
15
+ # OCR Engine Modes (OEM); LTSM below denotes Tesseract's LSTM (neural network) engine
16
+ LEGACY = 0
17
+ LTSM = 1
18
+ LEGACY_LTSM = 2
19
+ DEFAULT = 3
20
+
21
+ # class FFIIntPtr
22
+ class FFIIntPtr < FFI::Struct
23
+ layout :value, :int
24
+ end
25
+
26
+ # class FFIDoublePtr
27
+ class FFIDoublePtr < FFI::Struct
28
+ layout :value, :double
29
+ end
30
+
31
+ ffi_lib '/usr/lib/x86_64-linux-gnu/libtesseract.so'
32
+
33
+ attach_function :version, 'TessVersion', [], :string
34
+ attach_function :tess_create, 'TessBaseAPICreate', [], :pointer
35
+ attach_function :tess_delete, 'TessBaseAPIDelete', [:pointer], :int
36
+
37
+ attach_function :tess_init3,
38
+ 'TessBaseAPIInit3', %i[pointer int string], :int
39
+ attach_function :tess_init,
40
+ 'TessBaseAPIInit2', %i[pointer int string int], :int
41
+
42
+ attach_function :tess_end,
43
+ 'TessBaseAPIEnd', [:pointer], :int
44
+
45
+ attach_function :tess_set_image,
46
+ 'TessBaseAPISetImage2', %i[pointer buffer_in], :int
47
+ attach_function :tess_recognize,
48
+ 'TessBaseAPIRecognize', %i[pointer int], :int
49
+ attach_function :tess_pix_read,
50
+ 'pixRead', [:string], :pointer
51
+
52
+ attach_function :tess_get_utf8,
53
+ 'TessBaseAPIGetUTF8Text', %i[pointer int], :string
54
+ attach_function :tess_get_hocr,
55
+ 'TessBaseAPIGetHOCRText', %i[pointer int], :string
56
+
57
+ attach_function :tess_set_rectangle,
58
+ 'TessBaseAPISetRectangle', %i[pointer int int int int], :void
59
+ attach_function :tess_set_source_resolution,
60
+ 'TessBaseAPISetSourceResolution', %i[pointer int], :void
61
+ attach_function :tess_set_output_name, 'TessBaseAPISetOutputName', %i[pointer string], :void
62
+
63
+ attach_function :tess_print_to_file, 'TessBaseAPIPrintVariablesToFile', %i[pointer buffer_in], :bool
64
+
65
+ attach_function :tess_get_int_variable, 'TessBaseAPIGetIntVariable', [:pointer, :pointer, FFIIntPtr], :bool
66
+ attach_function :tess_get_double_variable, 'TessBaseAPIGetDoubleVariable', [:pointer, :pointer, FFIDoublePtr], :bool
67
+ attach_function :tess_set_variable, 'TessBaseAPISetVariable', %i[pointer pointer pointer], :bool
68
+ # GetVariableAsString not supported by C API
69
+ # attach_function :tess_get_variable_as_string, 'TessBaseAPIGetVariableAsString', [:pointer, :string, :pointer], :bool
70
+
71
+ attach_function :tess_get_init_languages_as_string, 'TessBaseAPIGetInitLanguagesAsString', [:pointer], :string
72
+ attach_function :tess_get_oem, 'TessBaseAPIOem', [:pointer], :int
73
+ attach_function :tess_get_datapath, 'TessBaseAPIGetDatapath', [:pointer], :pointer
74
+ attach_function :tess_pdf_renderer_create, 'TessPDFRendererCreate', %i[pointer pointer bool], :pointer
75
+ attach_function :tess_process_pages, 'TessBaseAPIProcessPages', %i[pointer pointer pointer int pointer], :bool
76
+ end
@@ -0,0 +1,42 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TesseractFFI
4
+ # module ConfVars
5
+ module ConfVars
6
+ def get_double_variable(var_name)
7
+ d_ptr = TesseractFFI::FFIDoublePtr.new
8
+
9
+ unless tess_get_double_variable(@handle, var_name, d_ptr)
10
+ raise TessException.new(error_msg: 'Unable to get config variable ' + var_name)
11
+ end
12
+
13
+ d_ptr[:value]
14
+ end
15
+
16
+ def get_integer_variable(var_name)
17
+ i_ptr = TesseractFFI::FFIIntPtr.new
18
+
19
+ unless tess_get_int_variable(@handle, var_name, i_ptr)
20
+ raise TessException.new(error_msg: 'Unable to get config variable ' + var_name)
21
+ end
22
+
23
+ i_ptr[:value]
24
+ end
25
+
26
+ def set_variable(var_name, value)
27
+ mem_ptr = FFI::MemoryPointer.from_string(value.to_s)
28
+ unless tess_set_variable(@handle, var_name, mem_ptr)
29
+ raise TessException.new(error_msg: 'Unable to set config variable ' + var_name)
30
+ end
31
+
32
+ true
33
+ end
34
+
35
+ def print_variables_to_file(file_name)
36
+ result = tess_print_to_file(@handle, file_name)
37
+ raise TessException.new(error_msg: 'Unable to print variables to ' + file_name) unless result
38
+
39
+ result
40
+ end
41
+ end
42
+ end
@@ -0,0 +1,14 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TesseractFFI
4
+ # module OEM
5
+ module OEM
6
+ def oem
7
+ ocr_engine_mode = nil
8
+ setup do
9
+ ocr_engine_mode = TesseractFFI.tess_get_oem(@handle)
10
+ end
11
+ ocr_engine_mode
12
+ end
13
+ end
14
+ end
@@ -0,0 +1,16 @@
1
+ # frozen_string_literal: true
2
+
3
+ # module TesseractFFI Quick API
4
+ module TesseractFFI
5
+ def self.to_text(file_name, language = 'eng')
6
+ t = Tesseract.new(file_name: file_name, language: language)
7
+ t.recognize
8
+ t.utf8_text
9
+ end
10
+
11
+ def self.to_pdf(in_file_name, out_file_root)
12
+ t = Tesseract.new(file_name: in_file_name)
13
+ t.convert_to_pdf(out_file_root)
14
+ t.utf8_text
15
+ end
16
+ end
@@ -0,0 +1,11 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TesseractFFI
4
+ # class TessException
5
+ class TessException < Gem::Exception
6
+ attr :error
7
+ def initialize(error_msg)
8
+ @error = error_msg
9
+ end
10
+ end
11
+ end
@@ -0,0 +1,94 @@
1
+ # frozen_string_literal: true
2
+
3
+ module TesseractFFI
4
+ # class Tesseract
5
+ class Tesseract
6
+ include TesseractFFI
7
+ include ConfVars
8
+ include OEM
9
+
10
+ attr_accessor :language, :file_name, :source_resolution
11
+ attr_reader :utf8_text, :hocr_text, :errors
12
+
13
+ def initialize(file_name: nil, language: 'eng', source_resolution: 72, oem: DEFAULT)
14
+ @language = language
15
+ raise TessException.new(error_msg: 'file_name must be provided') unless file_name
16
+
17
+ raise TessException.new(error_msg: "File #{file_name} not found") unless File.exist? file_name
18
+
19
+ @file_name = file_name
20
+ @source_resolution = source_resolution
21
+ @oem = oem
22
+ @errors = []
23
+ end
24
+
25
+ # rubocop:disable Metrics/AbcSize, Metrics/MethodLength
26
+ def setup
27
+ @handle = tess_create
28
+ raise TessException.new(error_msg: 'Library Error') unless @handle
29
+
30
+ result = tess_init(@handle, 0, @language, @oem)
31
+ raise TessException.new(error_msg: 'Init Error') if result != 0
32
+
33
+ @image = tess_pix_read(@file_name)
34
+ image_status = tess_set_image(@handle, @image)
35
+ raise TessException.new(error_msg: "Unable to set image #{@file_name}") if image_status != 0
36
+
37
+ yield # run the block for recognition etc
38
+ rescue TessException => e
39
+ @errors << "Tesseract Error #{e.error[:error_msg]}"
40
+ raise
41
+ ensure
42
+ tess_end(@handle)
43
+ tess_delete(@handle)
44
+ end
45
+ # rubocop:enable Metrics/AbcSize, Metrics/MethodLength
46
+
47
+ def ocr
48
+ tess_set_source_resolution(@handle, @source_resolution)
49
+ raise TessException.new(error_msg: 'Recognition Error') if tess_recognize(@handle, 0) != 0
50
+
51
+ @utf8_text = ''
52
+ text = tess_get_utf8(@handle, 0)
53
+ @utf8_text = text.encode('UTF-8') if text
54
+ @hocr_text = tess_get_hocr(@handle, 0)
55
+ end
56
+
57
+ def recognize
58
+ setup do
59
+ ocr
60
+ end
61
+ end
62
+
63
+ def convert_to_pdf(output_stem)
64
+ setup do
65
+ datapath = TesseractFFI.tess_get_datapath(@handle)
66
+ pdf_renderer = TesseractFFI.tess_pdf_renderer_create(output_stem, datapath, false)
67
+ TesseractFFI.tess_process_pages(@handle, @file_name, nil, 5000, pdf_renderer)
68
+ end
69
+ end
70
+
71
+ def set_rectangle(x_coord, y_coord, width, height)
72
+ tess_set_rectangle(@handle, x_coord, y_coord, width, height)
73
+ end
74
+
75
+ def recognize_rectangle(x_coord, y_coord, width, height)
76
+ setup do
77
+ set_rectangle(x_coord, y_coord, width, height)
78
+ ocr
79
+ end
80
+ end
81
+
82
+ def recognize_rectangles(rectangle_list)
83
+ texts = []
84
+ setup do
85
+ rectangle_list.each do |r|
86
+ set_rectangle(r[0], r[1], r[2], r[3])
87
+ ocr
88
+ texts << @utf8_text.strip
89
+ end
90
+ end
91
+ texts
92
+ end
93
+ end
94
+ end
@@ -0,0 +1,6 @@
1
+ # frozen_string_literal: true
2
+
3
+ # module with version
4
+ module TesseractFFI
5
+ VERSION = '0.2.0'
6
+ end
File without changes
metadata ADDED
@@ -0,0 +1,155 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: tesseract_ffi
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.2.0
5
+ platform: ruby
6
+ authors:
7
+ - David Verrier
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2020-08-12 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: ffi
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '0'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '0'
27
+ - !ruby/object:Gem::Dependency
28
+ name: minitest
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: 5.14.1
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: 5.14.1
41
+ - !ruby/object:Gem::Dependency
42
+ name: mocha
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: 1.11.2
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: 1.11.2
55
+ - !ruby/object:Gem::Dependency
56
+ name: simplecov
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: 0.18.5
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: 0.18.5
69
+ - !ruby/object:Gem::Dependency
70
+ name: awesome_print
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - ">"
74
+ - !ruby/object:Gem::Version
75
+ version: 1.8.0
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ">"
81
+ - !ruby/object:Gem::Version
82
+ version: 1.8.0
83
+ - !ruby/object:Gem::Dependency
84
+ name: nokogiri
85
+ requirement: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ">"
88
+ - !ruby/object:Gem::Version
89
+ version: 1.10.10
90
+ type: :development
91
+ prerelease: false
92
+ version_requirements: !ruby/object:Gem::Requirement
93
+ requirements:
94
+ - - ">"
95
+ - !ruby/object:Gem::Version
96
+ version: 1.10.10
97
+ description: This wrapper around the C-API allows use of the legacy modes of the recognition
98
+ engine.
99
+ email:
100
+ - dverrier@gmail.com
101
+ executables:
102
+ - console
103
+ - setup
104
+ extensions: []
105
+ extra_rdoc_files: []
106
+ files:
107
+ - ".gitignore"
108
+ - CODE_OF_CONDUCT.md
109
+ - Gemfile
110
+ - Gemfile.lock
111
+ - LICENSE.txt
112
+ - README.md
113
+ - Rakefile
114
+ - bin/console
115
+ - bin/setup
116
+ - examples/bonjour.rb
117
+ - examples/low_level_pdf.rb
118
+ - examples/low_level_recognition.rb
119
+ - examples/two_rectangles.rb
120
+ - lib/tesseract_ffi.rb
121
+ - lib/tesseract_ffi/conf_vars.rb
122
+ - lib/tesseract_ffi/oem.rb
123
+ - lib/tesseract_ffi/quick.rb
124
+ - lib/tesseract_ffi/tess_exception.rb
125
+ - lib/tesseract_ffi/tesseract.rb
126
+ - lib/tesseract_ffi/version.rb
127
+ - tmp/keep.txt
128
+ homepage: https://github.com/dverrier/tesseract_ffi
129
+ licenses:
130
+ - MIT
131
+ metadata:
132
+ allowed_push_host: https://rubygems.org
133
+ homepage_uri: https://github.com/dverrier/tesseract_ffi
134
+ source_code_uri: https://github.com/dverrier/tesseract_ffi
135
+ changelog_uri: https://github.com/dverrier/tesseract_ffi
136
+ post_install_message:
137
+ rdoc_options: []
138
+ require_paths:
139
+ - lib
140
+ required_ruby_version: !ruby/object:Gem::Requirement
141
+ requirements:
142
+ - - ">="
143
+ - !ruby/object:Gem::Version
144
+ version: 2.3.0
145
+ required_rubygems_version: !ruby/object:Gem::Requirement
146
+ requirements:
147
+ - - ">="
148
+ - !ruby/object:Gem::Version
149
+ version: '0'
150
+ requirements: []
151
+ rubygems_version: 3.1.2
152
+ signing_key:
153
+ specification_version: 4
154
+ summary: This is a Ruby-wrapper around the Tesseract C-API.
155
+ test_files: []