tesseract-ocr 0.1.5 → 0.1.6
- checksums.yaml +7 -0
- data/README.md +36 -20
- data/bin/tesseract.rb +1 -1
- data/lib/tesseract-ocr.rb +2 -2
- data/lib/tesseract/api.rb +5 -2
- data/lib/tesseract/api/image.rb +5 -5
- data/lib/tesseract/api/iterator.rb +6 -3
- data/lib/tesseract/c.rb +6 -0
- data/lib/tesseract/c/leptonica.rb +1 -1
- data/lib/tesseract/version.rb +1 -1
- data/tesseract-ocr.gemspec +8 -7
- metadata +19 -27
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 1cb827519f64b7aba71ee22666e49e952a92ec41
+  data.tar.gz: 2687f882e6e704ebd53026b522e471d7aa57c914
+SHA512:
+  metadata.gz: 356b6ed6de748cfaf3dbb017346400a802e4a0c060759f29cc2ddb254448ab4584d655f4bc794ba198cbd43bda35d8cb802665ff2f74ae60ca1f305c89d59f88
+  data.tar.gz: 2340f05a490d53fb809a5bdcefed75a547674bbcc02b3200cf505335547e073f9fd2a2a04409e31de55865ae3afdc08d5121480246d06bce0432e4e4b17d794f
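The digests recorded in `checksums.yaml` cover the two inner archives of the gem, `metadata.gz` and `data.tar.gz`, so they can be recomputed and compared. A minimal sketch, assuming the archive bytes have already been read into memory; `digest_matches?` is a hypothetical helper (not part of RubyGems), and the sample bytes are a stand-in, not the real archive:

```ruby
require 'digest'

# Hypothetical helper (not part of RubyGems): true when the hex digest of
# +bytes+ under +algorithm+ equals +expected_hex+.
def digest_matches?(bytes, expected_hex, algorithm = Digest::SHA512)
  algorithm.hexdigest(bytes) == expected_hex.downcase
end

# Stand-in data; with a real gem these would be the bytes of data.tar.gz
# and the matching entries from checksums.yaml.
sample_bytes = 'stand-in for data.tar.gz'
expected_sha512 = Digest::SHA512.hexdigest(sample_bytes)
expected_sha1 = Digest::SHA1.hexdigest(sample_bytes)

puts digest_matches?(sample_bytes, expected_sha512)               # digest agrees
puts digest_matches?(sample_bytes, expected_sha1, Digest::SHA1)   # SHA1 variant
puts digest_matches?(sample_bytes + 'tampered', expected_sha512)  # altered bytes
```

Comparing both the SHA1 and the SHA512 entries is redundant for integrity, but it mirrors what the file records.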
data/README.md
CHANGED
@@ -1,17 +1,25 @@
 ruby-tesseract - Ruby bindings and wrapper
 ==========================================
-This wrapper binds the TessBaseAPI object through ffi-inline (which means it
-and then proceeds to wrap said API in a more ruby-esque
+This wrapper binds the TessBaseAPI object through ffi-inline (which means it
+will work on JRuby too) and then proceeds to wrap said API in a more ruby-esque
+Engine class.
 
 Making it work
 --------------
-To make this library work you need tesseract-ocr and leptonica libraries and
+To make this library work you need tesseract-ocr and leptonica libraries and
+headers and a C++ compiler.
 
 The gem is called `tesseract-ocr`.
 
+If you're on a distribution that separates the libraries from headers, remember
+to install the *-dev* package.
+
+On Debian you will need to `libleptonica-dev` and `libtesseract-dev`.
+
 Examples
 --------
-Following are some examples that show the functionalities provided by
+Following are some examples that show the functionalities provided by
+tesseract-ocr.
 
 ### Basic functionality of tesseract
 
@@ -26,21 +34,21 @@ e = Tesseract::Engine.new {|e|
 e.text_for('test/first.png').strip # => 'ABC'
 ```
 
-You can pass to `#text_for` either a path, an IO object, a string containing
-an object that responds to `#to_blob` (for example
-the format has to be supported by leptonica.
+You can pass to `#text_for` either a path, an IO object, a string containing
+the image or an object that responds to `#to_blob` (for example
+Magick::Image), keep in mind that the format has to be supported by leptonica.
 
 ### Accessing advanced features
 
-With advanced features you get access to blocks, paragraphs, lines, words and
+With advanced features you get access to blocks, paragraphs, lines, words and
+symbols.
 
-
-
-etc.)
+Replace **level** in method names with either `block`, `paragraph`, `line`,
+`word` or `symbol`.
 
-The following kind of accessors need a block to be passed and they pass to the
-`Element` object. The Element object has various getters to access
-talk about them later.
+The following kind of accessors need a block to be passed and they pass to the
+block each `Element` object. The Element object has various getters to access
+certain features, I'll talk about them later.
 
 The methods are:
 
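The README text in this hunk lists four kinds of input for `#text_for`: a path, an IO object, a string containing the image, and an object that responds to `#to_blob`. These can be told apart with ordinary duck typing. A minimal sketch of such a dispatch, not the gem's actual implementation; `existing_path?`, `classify_source`, `BlobLike`, and the returned symbols are hypothetical names introduced for illustration:

```ruby
require 'stringio'
require 'tempfile'

# Hypothetical helper: true when +value+ names an existing file.
def existing_path?(value)
  File.exist?(File.expand_path(value))
rescue ArgumentError
  false # e.g. raw image bytes that contain a NUL byte
end

# Hypothetical classifier (not part of tesseract-ocr): reports which of the
# input kinds described for #text_for a value falls under.
def classify_source(source)
  if source.is_a?(String) && existing_path?(source)
    :path
  elsif source.is_a?(String)
    :image_data
  elsif source.respond_to?(:read)
    :io
  elsif source.respond_to?(:to_blob)
    :blob_provider
  else
    raise ArgumentError, 'invalid image'
  end
end

# Stand-in for an object such as Magick::Image that exposes #to_blob.
BlobLike = Struct.new(:to_blob)

Tempfile.create('page') { |file| puts classify_source(file.path) }
puts classify_source("raw \0 image bytes")
puts classify_source(StringIO.new('raw image bytes'))
puts classify_source(BlobLike.new('raw image bytes'))
```

Checking for an existing path before treating a string as image data means a string that happens to name a file is always read from disk.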
@@ -48,9 +56,10 @@ The methods are:
 * `each_level_for`
 * `each_level_at`
 
-The following accessors instead return an `Array` of `Element`s with cached
-are cached beacause the values accessible in the `Element`
-API, and that state changes if you
+The following accessors instead return an `Array` of `Element`s with cached
+getters, the getters are cached beacause the values accessible in the `Element`
+are linked to the state of the internal API, and that state changes if you
+access something else.
 
 The methods are:
 
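The README uses **level** as a placeholder that stands for `block`, `paragraph`, `line`, `word` or `symbol`, so each listed name such as `each_level_for` denotes five concrete methods. A small sketch of that expansion; `LEVELS` and `expand_level` are hypothetical names used only for illustration:

```ruby
# The five levels named in the README.
LEVELS = %w[block paragraph line word symbol].freeze

# Hypothetical helper: expands a README-style template such as
# "each_level_for" into one concrete method name per level.
def expand_level(template)
  LEVELS.map { |level| template.sub('level', level) }
end

puts expand_level('each_level_for').inspect
puts expand_level('each_level_at').inspect
```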
@@ -65,16 +74,19 @@ Each `Element` object has the following getters:
 * `bounding_box`, this will return the box where the element is confined into
 * `binary_image`, this will return the bichromatic image of the element
 * `image`, this will return the image of the element
-* `baseline`, this will return the line where the text is with a pair of
+* `baseline`, this will return the line where the text is with a pair of
+coordinates
 * `orientation`, this will return the orientation of the element
 * `text`, this will return the text of the element
 * `confidence`, this will return the confidence of correctness for the element
 
 `Block` elements also have `type` accessors that specify the type of the block.
 
-`Word` elements also have `font_attributes`, `from_dictionary?` and `numeric?`
+`Word` elements also have `font_attributes`, `from_dictionary?` and `numeric?`
+getters.
 
-`Symbol` elements also have `superscript?`, `subscript?` and `dropcap?`
+`Symbol` elements also have `superscript?`, `subscript?` and `dropcap?`
+getters.
 
 Using the binary
 ----------------
@@ -97,3 +109,7 @@ ABC
 > tesseract.rb -c test/first.png
 86
 ```
+
+License
+-------
+The license is BSD one clause.
data/bin/tesseract.rb
CHANGED
data/lib/tesseract-ocr.rb
CHANGED
data/lib/tesseract/api.rb
CHANGED
@@ -155,12 +155,15 @@ class API
 
   def get_text
     pointer = C::BaseAPI.get_utf8_text(to_ffi)
-
+
+    return if pointer.null?
+
+    result = pointer.read_string
     result.force_encoding 'UTF-8'
 
     result
   ensure
-    C.free_array_of_char(pointer)
+    C.free_array_of_char(pointer) unless pointer.null?
   end
 
   def get_box (page = 0)
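The change to `get_text` returns early when the native call yields a null pointer and skips the free in the `ensure` clause for that case. The same control flow can be exercised without the native library. A minimal sketch, assuming a stand-in pointer type; `FakePointer`, `read_utf8`, and the `freed` list are hypothetical and only imitate the `null?` and `read_string` calls seen in the hunk:

```ruby
# Hypothetical stand-in for a native pointer: address 0 means null.
FakePointer = Struct.new(:address, :bytes) do
  def null?
    address.zero?
  end

  def read_string
    bytes.dup
  end
end

# Mirrors the guarded pattern: bail out on null, and release the buffer
# in `ensure` only when there is a buffer to release.
def read_utf8(pointer, freed)
  return if pointer.null?

  result = pointer.read_string
  result.force_encoding('UTF-8')

  result
ensure
  # Records the address instead of calling a real free function.
  freed << pointer.address unless pointer.null?
end

freed = []
puts read_utf8(FakePointer.new(0, nil), freed).inspect
puts read_utf8(FakePointer.new(42, 'ABC'.b), freed).inspect
puts freed.inspect
```

Because `ensure` runs on the early `return` as well, the null check has to appear in both places, as it does in the hunk.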
data/lib/tesseract/api/image.rb
CHANGED
@@ -25,7 +25,7 @@
 module Tesseract; class API
 
 class Image
-  def self.new (image)
+  def self.new (image, x = 0, y = 0)
     image = if image.is_a?(String) && (File.exists?(File.expand_path(image)) rescue nil)
       C::Leptonica.pix_read(File.expand_path(image))
     elsif image.is_a?(String)
@@ -36,11 +36,13 @@ class Image
       image = image.to_blob
 
       C::Leptonica.pix_read_mem(image, image.bytesize)
+    else
+      image
     end
 
     raise ArgumentError, 'invalid image' if image.nil? || image.null?
 
-    super(image)
+    super(image, x, y)
   end
 
   attr_accessor :x, :y
@@ -68,10 +70,8 @@ class Image
     size = FFI::MemoryPointer.new(:size_t)
 
     C::Leptonica.pix_write_mem(to_ffi, data, size, C.for_enum(format))
-
     result = data.typecast(:pointer).read_string(size.typecast(:size_t))
-
-    data.typecast(:pointer).free
+    C.free(data.typecast(:pointer))
 
     result
   end
data/lib/tesseract/api/iterator.rb
CHANGED
@@ -62,7 +62,7 @@ class Iterator
   def get_image (level = :word, padding = 0)
     image = C::Iterator.get_image(to_ffi, C.for_enum(level), padding)
 
-    Image.new(image
+    Image.new(image[:pix], image[:x], image[:y])
   end
 
   def baseline (level = :word)
@@ -75,12 +75,15 @@ class Iterator
 
   def get_text (level = :word)
     pointer = C::Iterator.get_utf8_text(to_ffi, C.for_enum(level))
-
+
+    return if pointer.null?
+
+    result = pointer.read_string
     result.force_encoding 'UTF-8'
 
     result
   ensure
-    C.free_array_of_char(pointer)
+    C.free_array_of_char(pointer) unless pointer.null?
   end
 
   def confidence (level = :word)
data/lib/tesseract/c.rb
CHANGED
data/lib/tesseract/version.rb
CHANGED
data/tesseract-ocr.gemspec
CHANGED
@@ -1,13 +1,14 @@
 Kernel.load 'lib/tesseract/version.rb'
 
 Gem::Specification.new {|s|
-  s.name
-  s.version
-  s.author
-  s.email
-  s.homepage
-  s.platform
-  s.summary
+  s.name = 'tesseract-ocr'
+  s.version = Tesseract.version
+  s.author = 'meh.'
+  s.email = 'meh@schizofreni.co'
+  s.homepage = 'http://github.com/meh/ruby-tesseract-ocr'
+  s.platform = Gem::Platform::RUBY
+  s.summary = 'A wrapper library to the tesseract-ocr API.'
+  s.license = 'BSD'
 
   s.files = `git ls-files`.split("\n")
   s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) }
metadata
CHANGED
@@ -1,82 +1,73 @@
 --- !ruby/object:Gem::Specification
 name: tesseract-ocr
 version: !ruby/object:Gem::Version
-  version: 0.1.
-  prerelease:
+  version: 0.1.6
 platform: ruby
 authors:
 - meh.
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2014-02-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: call-me
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: iso-639
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: ffi-extra
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: ffi-inline
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 description:
-email: meh@
+email: meh@schizofreni.co
 executables:
 - tesseract-train.rb
 - tesseract.rb
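Every dependency in this metadata carries the requirement `">="` with version `'0'`, which places no lower bound on the dependency beyond version 0. A short sketch using the RubyGems classes named in the metadata; the version strings are the two releases compared in this diff, and the variable names are illustrative:

```ruby
require 'rubygems'

# The requirement recorded for each runtime dependency.
requirement = Gem::Requirement.new('>= 0')

# The two releases compared in this diff.
old_version = Gem::Version.new('0.1.5')
new_version = Gem::Version.new('0.1.6')

puts requirement.satisfied_by?(new_version) # any release at or above 0 passes
puts old_version < new_version              # versions compare segment by segment
puts requirement
```

A tighter constraint such as `'~> 0.1'` would be written the same way in the gemspec; the metadata here shows that none was declared.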
@@ -221,28 +212,28 @@ files:
 - test/test-european.jpg
 - test/test.png
 homepage: http://github.com/meh/ruby-tesseract-ocr
-licenses:
+licenses:
+- BSD
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.2.0
 signing_key:
-specification_version:
+specification_version: 4
 summary: A wrapper library to the tesseract-ocr API.
 test_files:
 - test/first.png
@@ -252,3 +243,4 @@ test_files:
 - test/tesseract_spec.rb
 - test/test-european.jpg
 - test/test.png
+has_rdoc: