tesseract-ocr 0.1.5 → 0.1.6

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 1cb827519f64b7aba71ee22666e49e952a92ec41
+ data.tar.gz: 2687f882e6e704ebd53026b522e471d7aa57c914
+ SHA512:
+ metadata.gz: 356b6ed6de748cfaf3dbb017346400a802e4a0c060759f29cc2ddb254448ab4584d655f4bc794ba198cbd43bda35d8cb802665ff2f74ae60ca1f305c89d59f88
+ data.tar.gz: 2340f05a490d53fb809a5bdcefed75a547674bbcc02b3200cf505335547e073f9fd2a2a04409e31de55865ae3afdc08d5121480246d06bce0432e4e4b17d794f
data/README.md CHANGED
@@ -1,17 +1,25 @@
  ruby-tesseract - Ruby bindings and wrapper
  ==========================================
- This wrapper binds the TessBaseAPI object through ffi-inline (which means it will work on JRuby too)
- and then proceeds to wrap said API in a more ruby-esque Engine class.
+ This wrapper binds the TessBaseAPI object through ffi-inline (which means it
+ will work on JRuby too) and then proceeds to wrap said API in a more ruby-esque
+ Engine class.
 
  Making it work
  --------------
- To make this library work you need tesseract-ocr and leptonica libraries and headers and a C++ compiler.
+ To make this library work you need tesseract-ocr and leptonica libraries and
+ headers and a C++ compiler.
 
  The gem is called `tesseract-ocr`.
 
+ If you're on a distribution that separates the libraries from headers, remember
+ to install the *-dev* package.
+
+ On Debian you will need to `libleptonica-dev` and `libtesseract-dev`.
+
  Examples
  --------
- Following are some examples that show the functionalities provided by tesseract-ocr.
+ Following are some examples that show the functionalities provided by
+ tesseract-ocr.
 
  ### Basic functionality of tesseract
 
@@ -26,21 +34,21 @@ e = Tesseract::Engine.new {|e|
  e.text_for('test/first.png').strip # => 'ABC'
  ```
 
- You can pass to `#text_for` either a path, an IO object, a string containing the image or
- an object that responds to `#to_blob` (for example Magick::Image), keep in mind that
- the format has to be supported by leptonica.
+ You can pass to `#text_for` either a path, an IO object, a string containing
+ the image or an object that responds to `#to_blob` (for example
+ Magick::Image), keep in mind that the format has to be supported by leptonica.
 
  ### Accessing advanced features
 
- With advanced features you get access to blocks, paragraphs, lines, words and symbols.
+ With advanced features you get access to blocks, paragraphs, lines, words and
+ symbols.
 
- There are lot of way to access those levels, the methods are the following (replace level
- with one of the accessible features, so `each_level` can be `each_block` or `each_paragraph`
- etc.)
+ Replace **level** in method names with either `block`, `paragraph`, `line`,
+ `word` or `symbol`.
 
- The following kind of accessors need a block to be passed and they pass to the block each
- `Element` object. The Element object has various getters to access certain features, I'll
- talk about them later.
+ The following kind of accessors need a block to be passed and they pass to the
+ block each `Element` object. The Element object has various getters to access
+ certain features, I'll talk about them later.
 
  The methods are:
 
@@ -48,9 +56,10 @@ The methods are:
  * `each_level_for`
  * `each_level_at`
 
- The following accessors instead return an `Array` of `Element`s with cached getters, the getters
- are cached beacause the values accessible in the `Element` are linked to the state of the internal
- API, and that state changes if you access something else.
+ The following accessors instead return an `Array` of `Element`s with cached
+ getters, the getters are cached beacause the values accessible in the `Element`
+ are linked to the state of the internal API, and that state changes if you
+ access something else.
 
  The methods are:
 
@@ -65,16 +74,19 @@ Each `Element` object has the following getters:
  * `bounding_box`, this will return the box where the element is confined into
  * `binary_image`, this will return the bichromatic image of the element
  * `image`, this will return the image of the element
- * `baseline`, this will return the line where the text is with a pair of coordinates
+ * `baseline`, this will return the line where the text is with a pair of
+ coordinates
  * `orientation`, this will return the orientation of the element
  * `text`, this will return the text of the element
  * `confidence`, this will return the confidence of correctness for the element
 
  `Block` elements also have `type` accessors that specify the type of the block.
 
- `Word` elements also have `font_attributes`, `from_dictionary?` and `numeric?` getters.
+ `Word` elements also have `font_attributes`, `from_dictionary?` and `numeric?`
+ getters.
 
- `Symbol` elements also have `superscript?`, `subscript?` and `dropcap?` getters.
+ `Symbol` elements also have `superscript?`, `subscript?` and `dropcap?`
+ getters.
 
  Using the binary
  ----------------
@@ -97,3 +109,7 @@ ABC
  > tesseract.rb -c test/first.png
  86
  ```
+
+ License
+ -------
+ The license is BSD one clause.
@@ -22,7 +22,7 @@ OptionParser.new do |o|
  end
 
  o.on '-p', '--psm MODE', 'page segmentation mode to use' do |value|
- options[:psm] = value
+ options[:psm] = value.to_i
  end
 
  o.on '-u', '--unlv', 'output in UNLV format' do
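
The `--psm` hunk above adds `to_i`, presumably because OptionParser hands option arguments to the block as strings unless a coercion is requested. A minimal sketch of that behaviour follows; `parse_psm` is a hypothetical helper written for illustration and is not part of the gem.

```ruby
require 'optparse'

# Hypothetical helper (not in the gem): parses only the --psm option
# and returns the collected options hash.
def parse_psm(argv)
  options = {}

  OptionParser.new do |o|
    o.on '-p', '--psm MODE', 'page segmentation mode to use' do |value|
      # value arrives as a String; to_i turns "3" into 3
      options[:psm] = value.to_i
    end
  end.parse(argv)

  options
end

parse_psm(['-p', '3']) # => {:psm=>3}
```

Note that `String#to_i` returns `0` for non-numeric input rather than raising, so this conversion does not reject invalid modes. Passing `Integer` as a coercion argument to `o.on` would be one alternative that rejects them.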
@@ -23,11 +23,11 @@
  #++
 
  module Tesseract
- def prefix
+ def self.prefix
  ENV['TESSDATA_PREFIX']
  end
 
- def prefix= (path)
+ def self.prefix=(path)
  ENV['TESSDATA_PREFIX'] = path
  end
  end
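
The `self.` prefix in the hunk above matters because a plain `def` inside a module body defines an instance method, which cannot be called on the module itself. The sketch below illustrates the difference with a hypothetical `PrefixDemo` module and a made-up path; neither comes from the gem.

```ruby
# Hypothetical module for illustration only.
module PrefixDemo
  # Instance method: only reachable from objects that include PrefixDemo.
  def instance_style_prefix
    ENV['TESSDATA_PREFIX']
  end

  # Module-level methods: callable as PrefixDemo.prefix / PrefixDemo.prefix=
  def self.prefix
    ENV['TESSDATA_PREFIX']
  end

  def self.prefix=(path)
    ENV['TESSDATA_PREFIX'] = path
  end
end

PrefixDemo.prefix = '/tmp/tessdata-demo' # made-up path
PrefixDemo.prefix                               # => "/tmp/tessdata-demo"
PrefixDemo.respond_to?(:instance_style_prefix)  # => false
```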
@@ -155,12 +155,15 @@ class API
 
  def get_text
  pointer = C::BaseAPI.get_utf8_text(to_ffi)
- result = pointer.read_string
+
+ return if pointer.null?
+
+ result = pointer.read_string
  result.force_encoding 'UTF-8'
 
  result
  ensure
- C.free_array_of_char(pointer)
+ C.free_array_of_char(pointer) unless pointer.null?
  end
 
  def get_box (page = 0)
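
The `get_text` hunk above guards both the read and the free. The second guard is needed because Ruby runs an `ensure` clause even when the method body exits through an early `return`. The sketch below shows that control flow in plain Ruby; `FakePointer`, `FREED` and the top-level `free_array_of_char` are hypothetical stand-ins for the FFI pointer and the C helper, not the gem's real objects.

```ruby
# Hypothetical stand-in for an FFI pointer; a nil string models a NULL pointer.
FakePointer = Struct.new(:string) do
  def null?
    string.nil?
  end

  def read_string
    string.dup
  end
end

# Records which pointers were "freed" so the behaviour can be observed.
FREED = []

# Hypothetical stand-in for C.free_array_of_char.
def free_array_of_char(pointer)
  FREED << pointer
end

def get_text(pointer)
  return if pointer.null?

  result = pointer.read_string
  result.force_encoding 'UTF-8'

  result
ensure
  # Runs after the early return as well, hence the null check here too.
  free_array_of_char(pointer) unless pointer.null?
end
```

Calling `get_text(FakePointer.new(nil))` returns `nil` and frees nothing, while a non-null pointer is read and then freed exactly once.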
@@ -25,7 +25,7 @@
  module Tesseract; class API
 
  class Image
- def self.new (image)
+ def self.new (image, x = 0, y = 0)
  image = if image.is_a?(String) && (File.exists?(File.expand_path(image)) rescue nil)
  C::Leptonica.pix_read(File.expand_path(image))
  elsif image.is_a?(String)
@@ -36,11 +36,13 @@ class Image
  image = image.to_blob
 
  C::Leptonica.pix_read_mem(image, image.bytesize)
+ else
+ image
  end
 
  raise ArgumentError, 'invalid image' if image.nil? || image.null?
 
- super(image)
+ super(image, x, y)
  end
 
  attr_accessor :x, :y
@@ -68,10 +70,8 @@ class Image
  size = FFI::MemoryPointer.new(:size_t)
 
  C::Leptonica.pix_write_mem(to_ffi, data, size, C.for_enum(format))
-
  result = data.typecast(:pointer).read_string(size.typecast(:size_t))
-
- data.typecast(:pointer).free
+ C.free(data.typecast(:pointer))
 
  result
  end
@@ -62,7 +62,7 @@ class Iterator
  def get_image (level = :word, padding = 0)
  image = C::Iterator.get_image(to_ffi, C.for_enum(level), padding)
 
- Image.new(image.pix, image.x, image.y)
+ Image.new(image[:pix], image[:x], image[:y])
  end
 
  def baseline (level = :word)
@@ -75,12 +75,15 @@ class Iterator
 
  def get_text (level = :word)
  pointer = C::Iterator.get_utf8_text(to_ffi, C.for_enum(level))
- result = pointer.read_string
+
+ return if pointer.null?
+
+ result = pointer.read_string
  result.force_encoding 'UTF-8'
 
  result
  ensure
- C.free_array_of_char(pointer)
+ C.free_array_of_char(pointer) unless pointer.null?
  end
 
  def confidence (level = :word)
@@ -35,6 +35,12 @@ module C
  cpp.include 'tesseract/strngs.h'
  cpp.libraries 'tesseract'
 
+ cpp.function %{
+ void free (void* pointer) {
+ free(pointer);
+ }
+ }
+
  cpp.function %{
  void free_array_of_char (char* pointer) {
  delete [] pointer;
@@ -48,7 +48,7 @@ module Leptonica
  }, blocking: true
 
  cpp.function %{
- Pix* pix_read_fd (int fd) {
+ Pix* pix_read_stream (int fd) {
  return pixReadStream(fdopen(fd, "rb"), 0);
  }
  }, blocking: true
@@ -24,6 +24,6 @@
 
  module Tesseract
  def self.version
- '0.1.5'
+ '0.1.6'
  end
  end
@@ -1,13 +1,14 @@
  Kernel.load 'lib/tesseract/version.rb'
 
  Gem::Specification.new {|s|
- s.name = 'tesseract-ocr'
- s.version = Tesseract.version
- s.author = 'meh.'
- s.email = 'meh@paranoici.org'
- s.homepage = 'http://github.com/meh/ruby-tesseract-ocr'
- s.platform = Gem::Platform::RUBY
- s.summary = 'A wrapper library to the tesseract-ocr API.'
+ s.name = 'tesseract-ocr'
+ s.version = Tesseract.version
+ s.author = 'meh.'
+ s.email = 'meh@schizofreni.co'
+ s.homepage = 'http://github.com/meh/ruby-tesseract-ocr'
+ s.platform = Gem::Platform::RUBY
+ s.summary = 'A wrapper library to the tesseract-ocr API.'
+ s.license = 'BSD'
 
  s.files = `git ls-files`.split("\n")
  s.executables = `git ls-files -- bin/*`.split("\n").map { |f| File.basename(f) }
metadata CHANGED
@@ -1,82 +1,73 @@
  --- !ruby/object:Gem::Specification
  name: tesseract-ocr
  version: !ruby/object:Gem::Version
- version: 0.1.5
- prerelease:
+ version: 0.1.6
  platform: ruby
  authors:
  - meh.
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-04-29 00:00:00.000000000 Z
+ date: 2014-02-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: call-me
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
  name: iso-639
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
  name: ffi-extra
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  - !ruby/object:Gem::Dependency
  name: ffi-inline
  requirement: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  description:
- email: meh@paranoici.org
+ email: meh@schizofreni.co
  executables:
  - tesseract-train.rb
  - tesseract.rb
@@ -221,28 +212,28 @@ files:
  - test/test-european.jpg
  - test/test.png
  homepage: http://github.com/meh/ruby-tesseract-ocr
- licenses: []
+ licenses:
+ - BSD
+ metadata: {}
  post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
- none: false
  requirements:
- - - ! '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.23
+ rubygems_version: 2.2.0
  signing_key:
- specification_version: 3
+ specification_version: 4
  summary: A wrapper library to the tesseract-ocr API.
  test_files:
  - test/first.png
@@ -252,3 +243,4 @@ test_files:
  - test/tesseract_spec.rb
  - test/test-european.jpg
  - test/test.png
+ has_rdoc: