tesseract_ffi 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +2 -0
- data/CODE_OF_CONDUCT.md +74 -0
- data/Gemfile +18 -0
- data/Gemfile.lock +36 -0
- data/LICENSE.txt +21 -0
- data/README.md +181 -0
- data/Rakefile +24 -0
- data/bin/console +14 -0
- data/bin/setup +8 -0
- data/examples/bonjour.rb +7 -0
- data/examples/low_level_pdf.rb +31 -0
- data/examples/low_level_recognition.rb +39 -0
- data/examples/two_rectangles.rb +15 -0
- data/lib/tesseract_ffi.rb +76 -0
- data/lib/tesseract_ffi/conf_vars.rb +42 -0
- data/lib/tesseract_ffi/oem.rb +14 -0
- data/lib/tesseract_ffi/quick.rb +16 -0
- data/lib/tesseract_ffi/tess_exception.rb +11 -0
- data/lib/tesseract_ffi/tesseract.rb +94 -0
- data/lib/tesseract_ffi/version.rb +6 -0
- data/tmp/keep.txt +0 -0
- metadata +155 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: b299b8e91905a54a81999730de02ae2dbf5ed58cf2bbf826854c178fd5f33296
  data.tar.gz: 8906f6a6dd21011c076c8d785f6ac99e2297112831f829a924a32f3763b09713
SHA512:
  metadata.gz: 9d11f40349117f99080ba4d3eb7f75c4c3d44941c6a3e4aaeec71f22b0ac8e6bc0a361e89c6c657a3fd864853300112883c42162ee5af89f20b841a3cb91f544
  data.tar.gz: cb2e721f373141bd6f42ef97ebd39257a6735c1b34e241739cdc4cfcdf5f0d7c6747ad95df3b6aa13b12e8a49115081dda447c708c5a6b64718d9dc57c739d2c
data/.gitignore
ADDED
data/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,74 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
nationality, personal appearance, race, religion, or sexual identity and
orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at dverrier@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at [https://contributor-covenant.org/version/1/4][version]

[homepage]: https://contributor-covenant.org
[version]: https://contributor-covenant.org/version/1/4/
data/Gemfile
ADDED
@@ -0,0 +1,18 @@
source 'https://rubygems.org'

gem 'rake'
gem 'ffi'

group :dev, :test do
  gem 'awesome_print'
  gem 'rdoc'
  gem 'hocr_reader'

end

group :test do
  gem 'minitest'
  gem 'mocha'
  gem 'nokogiri'
  gem 'simplecov', require: false
end
data/Gemfile.lock
ADDED
@@ -0,0 +1,36 @@
GEM
  remote: https://rubygems.org/
  specs:
    awesome_print (1.8.0)
    docile (1.3.2)
    ffi (1.13.1)
    hocr_reader (0.1.0)
      nokogiri (~> 1.10.10)
    mini_portile2 (2.4.0)
    minitest (5.14.1)
    mocha (1.11.2)
    nokogiri (1.10.10)
      mini_portile2 (~> 2.4.0)
    rake (13.0.1)
    rdoc (6.2.1)
    simplecov (0.18.5)
      docile (~> 1.1)
      simplecov-html (~> 0.11)
    simplecov-html (0.12.2)

PLATFORMS
  ruby

DEPENDENCIES
  awesome_print
  ffi
  hocr_reader
  minitest
  mocha
  nokogiri
  rake
  rdoc
  simplecov

BUNDLED WITH
   2.1.4
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2020 David Verrier

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,181 @@
# Tesseract_FFI

Welcome to Tesseract_FFI!
This is a Ruby wrapper for the Tesseract library. _Before installing this gem, make sure that Tesseract runs. For example, run the command_

```bash
$ tesseract --version
```

On Linux and similar systems you should see something like:
```bash
tesseract 4.1.1-rc2-25-g9707
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.1.2
```
Native Windows is untested, but the Windows Subsystem for Linux works really well!



## Installation

Add this line to your application's Gemfile:

```ruby
gem 'tesseract_ffi'
```

And then execute:
```bash
$ bundle install
```
Or install it yourself as:
```bash
$ gem install tesseract_ffi
```

## Usage
The fastest way to get going is to use the high-level functions, which will probably suit most people most of the time.

#### To convert an image to a string
```ruby
require 'tesseract_ffi'

TesseractFFI.to_text('my_image.png')
```
#### To convert an image to a searchable PDF file
```ruby
require 'tesseract_ffi'
TesseractFFI.to_pdf('my_image.png', 'output_file') # writes output_file.pdf
```
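
When several images need converting, it can help to collect failures instead of stopping at the first one. The following is only a sketch: `ocr_files` is a hypothetical helper that is not part of the gem, and it assumes that `to_text` raises an ordinary `StandardError` subclass when a file cannot be read. The `engine` parameter exists so the helper can be tried with a stand-in object on a machine without Tesseract.

```ruby
# Hypothetical helper, not part of tesseract_ffi.
# `engine` is any object responding to to_text(path, language);
# when omitted, the gem's module-level function is used.
def ocr_files(paths, language: 'eng', engine: nil)
  engine ||= begin
    require 'tesseract_ffi' # loaded lazily, only when no engine is supplied
    TesseractFFI
  end

  texts = {}
  errors = {}
  paths.each do |path|
    begin
      texts[path] = engine.to_text(path, language)
    rescue StandardError => e
      # Assumption: the gem's errors are StandardError subclasses.
      errors[path] = e.message
    end
  end
  [texts, errors]
end
```

A call might look like `texts, errors = ocr_files(Dir['scans/*.png'], language: 'deu')`, after which `errors` lists the files that could not be processed.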

## Languages
When the default recognition of English is not suitable, you can change it. The abbreviations used for some common European languages are:

* deu - German
* eng - English
* fra - French
* ita - Italian
* nld - Dutch
* por - Portuguese
* spa - Spanish

Tesseract itself supports many more languages, including chi_sim (Chinese simplified), chi_tra (Chinese traditional), chr (Cherokee), cym (Welsh), frk (Frankish) and frm (French, Middle, ca. 1400-1600). Just ensure that you have the corresponding Tesseract language data installed. The best way to confirm this is directly from the command line. For example, to check that the French recognition files are available to Tesseract, run this command to recognise a test image in French:
```bash
tesseract imagename.png mytext -l fra
```
To call from within a Ruby file, the following snippet should work for an image in German:
```ruby
require 'tesseract_ffi'

TesseractFFI.to_text('my_image.png', 'deu')
```
or by creating Ruby objects:

```ruby
require 'tesseract_ffi'
tess = TesseractFFI::Tesseract.new(
  language: 'fra',
  file_name: 'test/images/bonjour.png')
tess.recognize
text = tess.utf8_text
```
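
The same check can be made from Ruby by asking the `tesseract` command which languages it has. This is a sketch under stated assumptions: the helper names are hypothetical and not part of the gem, and the parser assumes the output of `tesseract --list-langs` is a header line starting with "List of available languages" followed by one code per line. Standard output and standard error are merged because different Tesseract versions have used different streams for this listing.

```ruby
require 'open3'

# Hypothetical helpers, not part of tesseract_ffi.

# Extracts the language codes from the text printed by `tesseract --list-langs`.
def parse_language_list(output)
  output.lines
        .map(&:strip)
        .reject { |line| line.empty? || line.start_with?('List of available languages') }
end

# Runs the tesseract command line tool; raises Errno::ENOENT if it is not installed.
def installed_languages
  output, status = Open3.capture2e('tesseract', '--list-langs')
  raise 'tesseract --list-langs failed' unless status.success?

  parse_language_list(output)
end

# Accepts a single code ('fra') or a '+'-joined combination ('eng+fra').
def language_installed?(code, available = installed_languages)
  code.split('+').all? { |part| available.include?(part) }
end
```

Calling `language_installed?('deu')` before `TesseractFFI.to_text('my_image.png', 'deu')` gives a clearer failure than an initialisation error from the library.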

## To Generate HOCR
Wikipedia says HOCR 'is an open standard of data representation for formatted text
obtained from optical character recognition (OCR)'. Tesseract can produce it, including the bounding boxes of the words, lines and paragraphs on a page.

```ruby
require 'tesseract_ffi'
tess = TesseractFFI::Tesseract.new(
  file_name: 'test/images/4words.png',
  source_resolution: 96)
tess.recognize
text = tess.hocr_text
```

```xml
<div class='ocr_page' id='page_18' title='image ""; bbox 0 0 341 17; ppageno 17'>
  <div class='ocr_carea' id='block_18_1' title="bbox 0 0 341 17">
    <p class='ocr_par' id='par_18_1' lang='eng' title="bbox 0 0 341 17">
      <span class='ocr_line' id='line_18_1' title="bbox 0 0 341 17; baseline -0.012 -1; x_size 16; x_descenders 4; x_ascenders 4">
        <span class='ocrx_word' id='word_18_1' title='bbox 0 4 49 17; x_wconf 92'>Name</span>
        <span class='ocrx_word' id='word_18_2' title='bbox 54 4 94 17; x_wconf 90'>Arial</span>
        <span class='ocrx_word' id='word_18_3' title='bbox 237 0 296 15; x_wconf 90'>Century</span>
        <span class='ocrx_word' id='word_18_4' title='bbox 302 0 341 12; x_wconf 90'>Peter</span>
      </span>
    </p>
  </div>
</div>
```
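
To get at the word boxes without an XML library, a regular expression is enough for output shaped exactly like the sample above. This is a minimal sketch, not the gem's API: `hocr_words` is a hypothetical helper, and the pattern assumes the single-quoted `ocrx_word` spans with `bbox` followed by `x_wconf` as shown. For anything more varied, a real parser such as nokogiri (already in the Gemfile) is the safer choice, and word text may still contain HTML entities.

```ruby
# Hypothetical helper, not part of tesseract_ffi.
# Matches one word span of the form shown in the sample hOCR output.
HOCR_WORD = %r{<span class='ocrx_word'[^>]*title='bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)'>([^<]*)</span>}

# Returns one hash per recognised word: its text, its bounding box as
# [x0, y0, x1, y1] corner coordinates, and its confidence (0..100).
def hocr_words(hocr)
  hocr.scan(HOCR_WORD).map do |x0, y0, x1, y1, conf, text|
    { text: text, bbox: [x0, y0, x1, y1].map(&:to_i), confidence: conf.to_i }
  end
end
```

With the sample above, the entry for "Peter" would carry the box `[302, 0, 341, 12]` and a confidence of 90.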

## Recognise Part of an Image
```ruby
require 'tesseract_ffi'
tess = TesseractFFI::Tesseract.new(
  file_name: 'test/images/4words.png')

# tess.recognize_rectangle(x,y,w,h)
tess.recognize_rectangle(300, 0, 41, 15)
text = tess.utf8_text
# => "Peter"

```
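
Note that the `bbox` values in hOCR output are corner coordinates (`x0 y0 x1 y1`), while `recognize_rectangle` takes a corner plus a width and height. A small conversion avoids mixing the two up. The helper below is a sketch with a hypothetical name, not part of the gem; the optional margin and clamping to the image size are assumptions about what is usually convenient.

```ruby
# Hypothetical helper, not part of tesseract_ffi.
# Converts an hOCR bbox [x0, y0, x1, y1] into [x, y, width, height],
# growing it by `margin` pixels on every side and, when the image size
# is given, keeping the result inside the image.
def bbox_to_rectangle(bbox, margin: 0, image_width: nil, image_height: nil)
  x0, y0, x1, y1 = bbox
  x0 = [x0 - margin, 0].max
  y0 = [y0 - margin, 0].max
  x1 += margin
  y1 += margin
  x1 = [x1, image_width].min if image_width
  y1 = [y1, image_height].min if image_height
  [x0, y0, x1 - x0, y1 - y0]
end
```

The result can be splatted into the call, for example `tess.recognize_rectangle(*bbox_to_rectangle([302, 0, 341, 12], margin: 2))`.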

## General Structure

Create a TesseractFFI::Tesseract object specifying the image file, the language(s) and, optionally, the source resolution (dpi of the image) and the OCR Engine Mode (OEM). The default is to use the latest mode, which uses a neural network for the recognition. For some purposes, such as typeface/font recognition, it can be desirable to use the legacy mode even though the recognition is not usually as good.

```ruby
require 'tesseract_ffi'
tess = TesseractFFI::Tesseract.new(
  file_name: 'test/images/4words.png',
  source_resolution: 96)

```
Then call `setup` on the object, with the desired method calls in a block.

```ruby

tess.setup do
  tess.set_rectangle(300, 0, 40, 20)
  tess.ocr
  puts tess.utf8_text
  # => Peter
  tess.set_rectangle(0, 0, 340, 17)
  tess.ocr
  puts tess.utf8_text
  # => Name Arial Century Peter
end

```
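
The gem's `Tesseract` class also defines `recognize_rectangles`, which takes a list of `[x, y, w, h]` arrays and returns the stripped text of each rectangle from a single `setup` pass. One way to use it is to give each region a name. This is only a sketch: `read_regions` is a hypothetical helper, and it assumes nothing about `tess` beyond that one method.

```ruby
# Hypothetical helper, not part of tesseract_ffi.
# `regions` maps a label to an [x, y, width, height] rectangle.
# `tess` must respond to recognize_rectangles(list) and return one
# string per rectangle, in the same order as the list.
def read_regions(tess, regions)
  texts = tess.recognize_rectangles(regions.values)
  regions.keys.zip(texts).to_h
end
```

With the image used above, `read_regions(tess, { last_word: [300, 0, 40, 20], whole_line: [0, 0, 340, 17] })` would return a hash with the same two keys and the recognised text as values.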


## Low Level Calls

If you look under the hood, there are intermediate Ruby methods that do most things, and some very low level functions that make calls to the C-API of Tesseract using the wonderful FFI library. The low level functions give alarming error messages and often dump the stack if called in the wrong order, so they are not for the faint of heart. But if your terminal lets you scroll back 1000 lines, you can usually see where the call to Tesseract went wrong. This gem aims to hide the complexity of the direct calls to the C library. The examples directory includes a couple of files that show how to proceed at the different levels of complexity, and the tests show more usage.
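
Because the order of the low level calls matters, it can help to wrap the create/init and end/delete pairs around a block so the handle is always released. The sketch below uses the low level function names attached in `lib/tesseract_ffi.rb` (`tess_create`, `tess_init`, `tess_end`, `tess_delete`); the wrapper itself, `with_tess_handle`, is hypothetical and not part of the gem. It assumes `tess_init` returns 0 on success, which is what the gem's own `setup` method checks, and the `api` parameter exists only so the wrapper can be exercised with a stand-in object.

```ruby
# Hypothetical helper, not part of tesseract_ffi.
# Creates and initialises a Tesseract handle, yields it to the block,
# and always ends and deletes the handle afterwards.
def with_tess_handle(lang: 'eng', oem: 3, api: nil)
  api ||= begin
    require 'tesseract_ffi' # loaded lazily, only when no api is supplied
    TesseractFFI
  end

  handle = api.tess_create
  # Second argument is the data path; 0 is passed as in the gem's examples.
  status = api.tess_init(handle, 0, lang, oem)
  raise "Tesseract init failed for #{lang} (status #{status})" unless status.zero?

  yield handle
ensure
  if handle
    api.tess_end(handle)
    api.tess_delete(handle)
  end
end
```

Inside the block, calls such as `TesseractFFI.tess_set_image(handle, image)` and `TesseractFFI.tess_recognize(handle, 0)` can then be made as in the files under `examples`, without repeating the cleanup at every exit point.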


## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `bundle exec rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/dverrier/tesseract_ffi. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](https://github.com/dverrier/tesseract_ffi/blob/master/CODE_OF_CONDUCT.md).


## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).

## Code of Conduct

Everyone interacting in the Tesseract_FFI project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/[USERNAME]/tessy/blob/master/CODE_OF_CONDUCT.md).
data/Rakefile
ADDED
@@ -0,0 +1,24 @@
require "bundler/gem_tasks"
require "rake/testtask"

Rake::TestTask.new(:test) do |t|
  t.libs << "test"
  t.libs << "lib"
  t.test_files = FileList["test/**/*_test.rb"]
end

Rake::TestTask.new(:test_system) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/system/*_test.rb'
  test.verbose = true
end

Rake::TestTask.new(:test_units) do |test|
  test.libs << 'lib' << 'test'
  test.pattern = 'test/unit/*_test.rb'
  test.verbose = true
end

task :default => :test
data/bin/console
ADDED
@@ -0,0 +1,14 @@
#!/usr/bin/env ruby

require "bundler/setup"
require "tesseract_ffi"

# You can add fixtures and/or initialization code here to make experimenting
# with your gem easier. You can also use a different console, if you like.

# (If you use this, don't forget to add pry to your Gemfile!)
# require "pry"
# Pry.start

require "irb"
IRB.start(__FILE__)
data/bin/setup
ADDED
data/examples/bonjour.rb
ADDED
data/examples/low_level_pdf.rb
ADDED
@@ -0,0 +1,31 @@
# execute from the top-level directory to get the
# path to the image correct

require 'tesseract_ffi'
#
# This example shows how to use the low-level functions directly
# to convert an image to a searchable pdf file
#

input_filename = 'test/images/4words.png'
output_stem = 'myfirst'
image = TesseractFFI.tess_pix_read(input_filename)


lang = 'eng'
handle = TesseractFFI.tess_create

if TesseractFFI.tess_init(handle, 0, lang, 0) != 0 # non-zero status means failure
  puts "Error initializing tesseract"
end

TesseractFFI.tess_set_image(handle, image)

datapath = TesseractFFI.tess_get_datapath(handle)
pdf_renderer = TesseractFFI.tess_pdf_renderer_create(output_stem, datapath, false)
TesseractFFI.tess_process_pages(handle, input_filename, nil, 5000, pdf_renderer)
puts "PDF file " + output_stem + ".pdf written"

TesseractFFI.tess_end(handle)
TesseractFFI.tess_delete(handle)
data/examples/low_level_recognition.rb
ADDED
@@ -0,0 +1,39 @@
# execute from the top-level directory to get the
# path to the image correct

require 'tesseract_ffi'

#
# This example calls the recogniser and
# prints the results in raw text form and
# xml form

puts "Found Tesseract #{TesseractFFI.version}"
filename = 'test/images/4words.png'
image = TesseractFFI.tess_pix_read(filename)

lang = 'eng'
handle = TesseractFFI.tess_create

if TesseractFFI.tess_init(handle, 0, lang, 0) != 0 # non-zero status means failure
  puts "Error initializing tesseract"
end

TesseractFFI.tess_set_image(handle, image)


TesseractFFI.tess_set_source_resolution(handle, 90)

if TesseractFFI.tess_recognize(handle, 0) != 0 # non-zero status means failure
  puts "Error in Tesseract recognition"
end

# print the text directly
puts TesseractFFI.tess_get_utf8(handle, 0)

# print the HOCR format
puts TesseractFFI.tess_get_hocr(handle, 0)

TesseractFFI.tess_end(handle)
TesseractFFI.tess_delete(handle)
data/examples/two_rectangles.rb
ADDED
@@ -0,0 +1,15 @@
require 'tesseract_ffi'
tess = TesseractFFI::Tesseract.new(
  file_name: 'test/images/4words.png',
  source_resolution: 96)

tess.setup do
  # x,y,w,h
  tess.set_rectangle(300, 0, 40, 20)
  tess.ocr
  puts tess.utf8_text
  tess.set_rectangle(0, 0, 340, 17)
  tess.ocr
  puts tess.utf8_text
  # and so on
end
data/lib/tesseract_ffi.rb
ADDED
@@ -0,0 +1,76 @@
# frozen_string_literal: true

require 'ffi'
require 'tesseract_ffi/version'
require 'tesseract_ffi/conf_vars' # mix-in to tesseract
require 'tesseract_ffi/oem' # mix-in to tesseract
require 'tesseract_ffi/tesseract'
require 'tesseract_ffi/tess_exception'
require 'tesseract_ffi/quick'

# module TesseractFFI
module TesseractFFI
  extend FFI::Library

  # OCR Engine Modes OEM
  LEGACY = 0 # legacy engine only
  LTSM = 1 # LSTM neural-network engine only (constant name as published)
  LEGACY_LTSM = 2 # legacy and LSTM engines combined
  DEFAULT = 3 # whatever the installed Tesseract defaults to

  # class FFIIntPtr
  class FFIIntPtr < FFI::Struct
    layout :value, :int
  end

  # class FFIDoublePtr
  class FFIDoublePtr < FFI::Struct
    layout :value, :double
  end

  ffi_lib '/usr/lib/x86_64-linux-gnu/libtesseract.so'

  attach_function :version, 'TessVersion', [], :string
  attach_function :tess_create, 'TessBaseAPICreate', [], :pointer
  attach_function :tess_delete, 'TessBaseAPIDelete', [:pointer], :int

  attach_function :tess_init3,
                  'TessBaseAPIInit3', %i[pointer int string], :int
  attach_function :tess_init,
                  'TessBaseAPIInit2', %i[pointer int string int], :int

  attach_function :tess_end,
                  'TessBaseAPIEnd', [:pointer], :int

  attach_function :tess_set_image,
                  'TessBaseAPISetImage2', %i[pointer buffer_in], :int
  attach_function :tess_recognize,
                  'TessBaseAPIRecognize', %i[pointer int], :int
  attach_function :tess_pix_read,
                  'pixRead', [:string], :pointer

  attach_function :tess_get_utf8,
                  'TessBaseAPIGetUTF8Text', %i[pointer int], :string
  attach_function :tess_get_hocr,
                  'TessBaseAPIGetHOCRText', %i[pointer int], :string

  attach_function :tess_set_rectangle,
                  'TessBaseAPISetRectangle', %i[pointer int int int int], :void
  attach_function :tess_set_source_resolution,
                  'TessBaseAPISetSourceResolution', %i[pointer int], :void
  attach_function :tess_set_output_name, 'TessBaseAPISetOutputName', %i[pointer string], :void # C signature is (handle, name)

  attach_function :tess_print_to_file, 'TessBaseAPIPrintVariablesToFile', %i[pointer buffer_in], :bool

  attach_function :tess_get_int_variable, 'TessBaseAPIGetIntVariable', [:pointer, :pointer, FFIIntPtr], :bool
  attach_function :tess_get_double_variable, 'TessBaseAPIGetDoubleVariable', [:pointer, :pointer, FFIDoublePtr], :bool
  attach_function :tess_set_variable, 'TessBaseAPISetVariable', %i[pointer pointer pointer], :bool
  # GetVariableAsString not supported by C API
  # attach_function :tess_get_variable_as_string, 'TessBaseAPIGetVariableAsString', [:pointer, :string, :pointer], :bool

  attach_function :tess_get_init_languages_as_string, 'TessBaseAPIGetInitLanguagesAsString', [:pointer], :string
  attach_function :tess_get_oem, 'TessBaseAPIOem', [:pointer], :int
  attach_function :tess_get_datapath, 'TessBaseAPIGetDatapath', [:pointer], :pointer
  attach_function :tess_pdf_renderer_create, 'TessPDFRendererCreate', %i[pointer pointer bool], :pointer
  attach_function :tess_process_pages, 'TessBaseAPIProcessPages', %i[pointer pointer pointer int pointer], :bool
end
data/lib/tesseract_ffi/conf_vars.rb
ADDED
@@ -0,0 +1,42 @@
# frozen_string_literal: true

module TesseractFFI
  # module ConfVars
  module ConfVars
    def get_double_variable(var_name)
      d_ptr = TesseractFFI::FFIDoublePtr.new

      unless tess_get_double_variable(@handle, var_name, d_ptr)
        raise TessException.new(error_msg: 'Unable to get config variable ' + var_name)
      end

      d_ptr[:value]
    end

    def get_integer_variable(var_name)
      i_ptr = TesseractFFI::FFIIntPtr.new

      unless tess_get_int_variable(@handle, var_name, i_ptr)
        raise TessException.new(error_msg: 'Unable to get config variable ' + var_name)
      end

      i_ptr[:value]
    end

    def set_variable(var_name, value)
      mem_ptr = FFI::MemoryPointer.from_string(value.to_s)
      unless tess_set_variable(@handle, var_name, mem_ptr)
        raise TessException.new(error_msg: 'Unable to set config variable ' + var_name)
      end

      true
    end

    def print_variables_to_file(file_name)
      result = tess_print_to_file(@handle, file_name)
      raise TessException.new(error_msg: 'Unable to print variables to ' + file_name) unless result

      result
    end
  end
end
data/lib/tesseract_ffi/quick.rb
ADDED
@@ -0,0 +1,16 @@
# frozen_string_literal: true

# module TesseractFFI Quick API
module TesseractFFI
  def self.to_text(file_name, language = 'eng')
    t = Tesseract.new(file_name: file_name, language: language)
    t.recognize
    t.utf8_text
  end

  def self.to_pdf(in_file_name, out_file_root)
    t = Tesseract.new(file_name: in_file_name)
    t.convert_to_pdf(out_file_root)
    t.utf8_text
  end
end
data/lib/tesseract_ffi/tesseract.rb
ADDED
@@ -0,0 +1,94 @@
# frozen_string_literal: true

module TesseractFFI
  # class Tesseract
  class Tesseract
    include TesseractFFI
    include ConfVars
    include OEM

    attr_accessor :language, :file_name, :source_resolution
    attr_reader :utf8_text, :hocr_text, :errors

    def initialize(file_name: nil, language: 'eng', source_resolution: 72, oem: DEFAULT)
      @language = language
      raise TessException.new(error_msg: 'file_name must be provided') unless file_name

      raise TessException.new(error_msg: "File #{file_name} not found") unless File.exist? file_name

      @file_name = file_name
      @source_resolution = source_resolution
      @oem = oem
      @errors = []
    end

    # rubocop:disable Metrics/AbcSize, Metrics/MethodLength
    def setup
      @handle = tess_create
      raise TessException.new(error_msg: 'Library Error') unless @handle

      result = tess_init(@handle, 0, @language, @oem)
      raise TessException.new(error_msg: 'Init Error') if result != 0

      @image = tess_pix_read(@file_name)
      image_status = tess_set_image(@handle, @image)
      raise TessException.new(error_msg: "Unable to set image #{@file_name}") if image_status != 0

      yield # run the block for recognition etc
    rescue TessException => e
      @errors << "Tesseract Error #{e.error[:error_msg]}"
      raise
    ensure
      tess_end(@handle)
      tess_delete(@handle)
    end
    # rubocop:enable Metrics/AbcSize, Metrics/MethodLength

    def ocr
      tess_set_source_resolution(@handle, @source_resolution)
      raise TessException.new(error_msg: 'Recognition Error') if tess_recognize(@handle, 0) != 0

      @utf8_text = ''
      text = tess_get_utf8(@handle, 0)
      @utf8_text = text.encode('UTF-8') if text
      @hocr_text = tess_get_hocr(@handle, 0)
    end

    def recognize
      setup do
        ocr
      end
    end

    def convert_to_pdf(output_stem)
      setup do
        datapath = TesseractFFI.tess_get_datapath(@handle)
        pdf_renderer = TesseractFFI.tess_pdf_renderer_create(output_stem, datapath, false)
        TesseractFFI.tess_process_pages(@handle, @file_name, nil, 5000, pdf_renderer)
      end
    end

    def set_rectangle(x_coord, y_coord, width, height)
      tess_set_rectangle(@handle, x_coord, y_coord, width, height)
    end

    def recognize_rectangle(x_coord, y_coord, width, height)
      setup do
        set_rectangle(x_coord, y_coord, width, height)
        ocr
      end
    end

    def recognize_rectangles(rectangle_list)
      texts = []
      setup do
        rectangle_list.each do |r|
          set_rectangle(r[0], r[1], r[2], r[3])
          ocr
          texts << @utf8_text.strip
        end
      end
      texts
    end
  end
end
data/tmp/keep.txt
ADDED
File without changes
|
metadata
ADDED
@@ -0,0 +1,155 @@
--- !ruby/object:Gem::Specification
name: tesseract_ffi
version: !ruby/object:Gem::Version
  version: 0.2.0
platform: ruby
authors:
- David Verrier
autorequire:
bindir: bin
cert_chain: []
date: 2020-08-12 00:00:00.000000000 Z
dependencies:
- !ruby/object:Gem::Dependency
  name: ffi
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
- !ruby/object:Gem::Dependency
  name: minitest
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 5.14.1
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 5.14.1
- !ruby/object:Gem::Dependency
  name: mocha
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 1.11.2
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 1.11.2
- !ruby/object:Gem::Dependency
  name: simplecov
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 0.18.5
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - "~>"
      - !ruby/object:Gem::Version
        version: 0.18.5
- !ruby/object:Gem::Dependency
  name: awesome_print
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">"
      - !ruby/object:Gem::Version
        version: 1.8.0
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">"
      - !ruby/object:Gem::Version
        version: 1.8.0
- !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
    requirements:
    - - ">"
      - !ruby/object:Gem::Version
        version: 1.10.10
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ">"
      - !ruby/object:Gem::Version
        version: 1.10.10
description: This wrapper around the C-API allows use of the legacy modes of the recognition
  engine.
email:
- dverrier@gmail.com
executables:
- console
- setup
extensions: []
extra_rdoc_files: []
files:
- ".gitignore"
- CODE_OF_CONDUCT.md
- Gemfile
- Gemfile.lock
- LICENSE.txt
- README.md
- Rakefile
- bin/console
- bin/setup
- examples/bonjour.rb
- examples/low_level_pdf.rb
- examples/low_level_recognition.rb
- examples/two_rectangles.rb
- lib/tesseract_ffi.rb
- lib/tesseract_ffi/conf_vars.rb
- lib/tesseract_ffi/oem.rb
- lib/tesseract_ffi/quick.rb
- lib/tesseract_ffi/tess_exception.rb
- lib/tesseract_ffi/tesseract.rb
- lib/tesseract_ffi/version.rb
- tmp/keep.txt
homepage: https://github.com/dverrier/tesseract_ffi
licenses:
- MIT
metadata:
  allowed_push_host: https://rubygems.org
  homepage_uri: https://github.com/dverrier/tesseract_ffi
  source_code_uri: https://github.com/dverrier/tesseract_ffi
  changelog_uri: https://github.com/dverrier/tesseract_ffi
post_install_message:
rdoc_options: []
require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: 2.3.0
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: '0'
requirements: []
rubygems_version: 3.1.2
signing_key:
specification_version: 4
summary: This is a Ruby-wrapper around the Tesseract C-API.
test_files: []