hocr_turtletext 0.1.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: da1d48bb9c9bbf8e723a10d039603670bba89ba5
4
+ data.tar.gz: 00f89918601c84176ccac11b41ae38a5f829f183
5
+ SHA512:
6
+ metadata.gz: f6c50bfb48f0483673b13a963535e0a783554a4c5cdbd55aca911f47cd084c512242e9c3868f46a1948d52128575d5efbfec8f1ce60f78fa2ad81f282abe61e2
7
+ data.tar.gz: 27727e5d7a6f640b0152dcc0695228ff21bd9cafd7d5f3fcde798e5c86a7f5d89e1c5ba8b455cbca737291c9cce80128e062903f764c54cb17e16ccc96baa84e
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /_yardoc/
4
+ /coverage/
5
+ /doc/
6
+ /pkg/
7
+ /spec/reports/
8
+ /tmp/
9
+
10
+ # rspec failure tracking
11
+ .rspec_status
12
+
13
+ /.idea
14
+ *.iml
data/.rspec ADDED
@@ -0,0 +1,3 @@
1
+ --format documentation
2
+ --color
3
+ --require spec_helper
data/.travis.yml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ sudo: false
3
+ language: ruby
4
+ cache: bundler
5
+ rvm:
6
+ - 2.4.4
7
+ before_install: gem install bundler -v 1.16.3
data/Gemfile ADDED
@@ -0,0 +1,6 @@
1
+ source "https://rubygems.org"
2
+
3
+ git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }
4
+
5
+ # Specify your gem's dependencies in hocr_turtletext.gemspec
6
+ gemspec
data/Gemfile.lock ADDED
@@ -0,0 +1,39 @@
1
+ PATH
2
+ remote: .
3
+ specs:
4
+ hocr_turtletext (0.1.0)
5
+ nokogiri (~> 1.10, >= 1.10.7)
6
+
7
+ GEM
8
+ remote: https://rubygems.org/
9
+ specs:
10
+ diff-lcs (1.3)
11
+ mini_portile2 (2.4.0)
12
+ nokogiri (1.10.7)
13
+ mini_portile2 (~> 2.4.0)
14
+ rake (10.5.0)
15
+ rspec (3.7.0)
16
+ rspec-core (~> 3.7.0)
17
+ rspec-expectations (~> 3.7.0)
18
+ rspec-mocks (~> 3.7.0)
19
+ rspec-core (3.7.1)
20
+ rspec-support (~> 3.7.0)
21
+ rspec-expectations (3.7.0)
22
+ diff-lcs (>= 1.2.0, < 2.0)
23
+ rspec-support (~> 3.7.0)
24
+ rspec-mocks (3.7.0)
25
+ diff-lcs (>= 1.2.0, < 2.0)
26
+ rspec-support (~> 3.7.0)
27
+ rspec-support (3.7.1)
28
+
29
+ PLATFORMS
30
+ ruby
31
+
32
+ DEPENDENCIES
33
+ bundler (~> 1.16)
34
+ hocr_turtletext!
35
+ rake (~> 10.0)
36
+ rspec (~> 3.0)
37
+
38
+ BUNDLED WITH
39
+ 1.16.3
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2020 Sue Zheng Hao
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,168 @@
1
+ # HocrTurtletext
2
+
3
+ Heavily inspired by [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext), HocrTurtletext provides convenient methods to extract content from a hOCR file. hOCR output is commonly produced by OCR software such as tesseract-ocr.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'hocr_turtletext'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install hocr_turtletext
20
+
21
+ ## Usage
22
+
23
+ ### Instantiate HocrTurtletext
24
+
25
+ Typical usage:
26
+ ```ruby
27
+ hocr_path = '/tmp/page1.hocr'
28
+ options = { :y_precision => 7 }
29
+ reader = HocrTurtletext::Reader.new(hocr_path, options)
30
+ ```
31
+
32
+ Options:
33
+ - `x_whitespace_threshold` (default 30): Words whose horizontal gap is within this threshold are concatenated with a space. Try increasing this value if words or letters that belong together are being separated.
34
+ - `y_precision` (default 3): Rows of text whose y positions differ by less than `y_precision` are merged into one row. Try increasing this value if words that should be on the same row are detected as separate rows.
35
+
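As a rough illustration of the `y_precision` option, the sketch below groups words into rows when their y positions differ by less than a given precision. It is a standalone approximation, not the gem's implementation; `group_rows` and the sample coordinates are hypothetical.

```ruby
# Hypothetical helper: merges words whose y positions differ by less than
# y_precision into one row, keyed by the first (smallest) y seen for that row.
def group_rows(words, y_precision)
  rows = []
  words.sort_by { |w| [w[:y], w[:x]] }.each do |w|
    row = rows.find { |r| (r[:y] - w[:y]).abs < y_precision }
    if row
      row[:words] << w
    else
      rows << { y: w[:y], words: [w] }
    end
  end
  # Within each row, order the text elements from left to right.
  rows.map { |r| r[:words].sort_by { |w| w[:x] }.map { |w| w[:text] } }
end

# Made-up sample data: two words on nearly the same line, one further down.
words = [
  { x: 10,  y: 100, text: 'Total' },
  { x: 200, y: 104, text: '42.00' },
  { x: 10,  y: 130, text: 'Tax' }
]

strict = group_rows(words, 3)
loose  = group_rows(words, 7)
```

With a precision of 3, `Total` (y = 100) and `42.00` (y = 104) end up on separate rows; raising the precision to 7 merges them into one row.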
36
+ ### Extract text within a region described in relation to other text
37
+
38
+ This method works nearly identically to its counterpart from PDF::Reader::Turtletext.
39
+ The main difference is that we are not dealing with multiple pages in our hOCR input, so
40
+ there is no need to support page selection.
41
+
42
+ Given that we know the text we want to find is relatively positioned (for example)
43
+ below a certain bit of text, to the left of another, and above some other text, use
44
+ the `bounding_box` method to describe the region and extract the matching text.
45
+ ```ruby
46
+ textangle = reader.bounding_box do
47
+ below /electricity/i
48
+ above 10
49
+ right_of 240.0
50
+ left_of "Total ($)"
51
+ end
52
+ textangle.text
53
+ => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
54
+ ```
55
+
56
+ The methods available within the `bounding_box` block are all optional:
57
+ - `inclusive` - whether region selection should be inclusive or exclusive of the specified positions
58
+ (default is false).
59
+ - `below` - a string, regex or number that describes the upper limit of the text box
60
+ (default is top border of the page).
61
+ - `above` - a string, regex or number that describes the lower limit of the text box
62
+ (default is bottom border of the page).
63
+ - `left_of` - a string, regex or number that describes the right limit of the text box
64
+ (default is right border of the page).
65
+ - `right_of` - a string, regex or number that describes the left limit of the text box
66
+ (default is left border of the page).
67
+
68
+ Note that `left_of` and `right_of` constraints do *not* need to be within the vertical
69
+ range of the box being described.
70
+ For example, you could use an element in the page header to describe the `left_of` limit
71
+ for a table at the bottom of the page, if it has the correct alignment needed to describe your text region.
72
+
73
+ Similarly, `above` and `below` constraints do *not* need to be within the horizontal
74
+ range of the box being described.
75
+
76
+ ### Using a block parameter with the `bounding_box` method
77
+
78
+ An explicit block parameter may be used with the `bounding_box` method:
79
+ ```ruby
80
+ textangle = reader.bounding_box do |r|
81
+ r.below /electricity/i
82
+ r.left_of "Total ($)"
83
+ end
84
+ textangle.text
85
+ => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
86
+ ```
87
+
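Both block styles can be supported by checking the block's arity, which is the approach `HocrTurtletext::Textangle#initialize` takes. The sketch below shows the pattern in isolation; the `Region` class is hypothetical and only implements `below`.

```ruby
# Hypothetical stand-in for a DSL object; not part of the gem.
class Region
  attr_writer :below

  def initialize(&block)
    return unless block

    if block.arity == 1
      yield self              # explicit receiver: { |r| r.below ... }
    else
      instance_eval(&block)   # implicit receiver: { below ... }
    end
  end

  # Sets the limit when given an argument; always returns the current value.
  def below(*args)
    @below = args.first unless args.empty?
    @below
  end
end

implicit = Region.new { below 'Header' }
explicit = Region.new { |r| r.below 'Header' }
```

The trade-off is that `instance_eval` changes `self` inside the block, so methods and instance variables of the calling object are not reachable there; the explicit block parameter avoids that.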
88
+ ### How to describe an inclusive `bounding_box` region
89
+
90
+ By default, the `bounding_box` method makes exclusive selection (i.e. not including the
91
+ region limits).
92
+
93
+ To specify an inclusive region, use the `inclusive!` command:
94
+ ```ruby
95
+ textangle = reader.bounding_box do
96
+ inclusive!
97
+ below /electricity/i
98
+ left_of "Total ($)"
99
+ end
100
+ ```
101
+ Alternatively, set `inclusive` to true:
102
+ ```ruby
103
+ textangle = reader.bounding_box do
104
+ inclusive true
105
+ below /electricity/i
106
+ left_of "Total ($)"
107
+ end
108
+ ```
109
+ Or with a block parameter, you may also assign `inclusive` to true:
110
+ ```ruby
111
+ textangle = reader.bounding_box do |r|
112
+ r.inclusive = true
113
+ r.below /electricity/i
114
+ r.left_of "Total ($)"
115
+ end
116
+ ```
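The difference between the two modes comes down to whether the comparison against each limit is strict. A minimal sketch, using a hypothetical `within?` helper and made-up y positions:

```ruby
# Hypothetical predicate: strict comparison for exclusive regions,
# non-strict comparison for inclusive ones.
def within?(value, min, max, inclusive)
  inclusive ? (value >= min && value <= max) : (value > min && value < max)
end

positions = [200, 250, 400]

exclusive_hits = positions.select { |y| within?(y, 200, 400, false) }
inclusive_hits = positions.select { |y| within?(y, 200, 400, true) }
```

Here only 250 falls inside the exclusive region, while the inclusive region also picks up the elements sitting exactly on the limits.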
117
+ ### Extract text for a region with known positional co-ordinates
118
+
119
+ If you know (or can calculate) the x,y positions of the required text region, you can extract the region's text using the `text_in_region` method.
120
+ ```ruby
121
+ text = reader.text_in_region(
122
+ 10, # minimum x (left-most)
123
+ 900, # maximum x (right-most)
124
+ 200, # minimum y (top-most)
125
+ 400, # maximum y (bottom-most)
126
+ false # inclusive of x/y position if true (default false)
127
+ )
128
+ => [['string','string'],['string']] # array of rows, each row is an array of text elements in the row
129
+ ```
130
+ Note that the x,y origin is at the **top-left**.
131
+ This differs from how it works in PDF::Reader::Turtletext, where the origin
132
+ was bottom-left of the page.
133
+
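When porting coordinates from a bottom-left-origin system, the y value has to be flipped against the page height. A small sketch, assuming the page height is known and both systems use the same units; `to_top_left_y` is a hypothetical helper, not a gem method.

```ruby
# Hypothetical helper: converts a y measured from the bottom edge
# into a y measured from the top edge of a page of known height.
def to_top_left_y(y_from_bottom, page_height)
  page_height - y_from_bottom
end

page_height = 1000 # assumed value for illustration

top_y = to_top_left_y(600, page_height)

# Flipping reverses the order of the limits, so re-sort them before use.
ymin, ymax = [to_top_left_y(600, page_height), to_top_left_y(300, page_height)].sort
```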
134
+ ### How to find the x,y co-ordinate of a specific text element
135
+
136
+ If you are doing low-level text extraction with `text_in_region` for example,
137
+ it is usually necessary to locate specific text to provide a positional reference.
138
+
139
+ Use the `text_position` method to locate text by exact or partial match.
140
+ It returns a Hash of x/y co-ordinates for the top-left corner of the text (the origin is top-left).
141
+ ```ruby
142
+ text_by_exact_match = reader.text_position("Transaction Table")
143
+ => { :x => 10.0, :y => 600.0 }
144
+ text_by_regex_match = reader.text_position(/transaction summary/i)
145
+ => { :x => 10.0, :y => 300.0 }
146
+ ```
147
+ Note: in the case of multiple matches, only the first match is returned.
148
+
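The lookup can be pictured as a scan over rows of positioned text. The sketch below assumes a simplified content structure (an array of `[y, [[x, text], ...]]` rows) and a hypothetical `find_position` helper; it returns the first match it encounters and is not the gem's implementation.

```ruby
# Hypothetical helper: scans rows top to bottom, left to right, and returns
# the position of the first element matching a String (exact) or Regexp.
def find_position(rows, pattern)
  rows.each do |y, row|
    row.each do |x, text|
      matched = pattern.is_a?(Regexp) ? pattern.match?(text) : text == pattern
      return { x: x, y: y } if matched
    end
  end
  nil
end

# Made-up sample content.
rows = [
  [300, [[10, 'Transaction Summary'], [400, 'Total ($)']]],
  [600, [[10, 'Transaction Table']]]
]

exact   = find_position(rows, 'Transaction Table')
fuzzy   = find_position(rows, /transaction summary/i)
missing = find_position(rows, 'Not present')
```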
149
+ ## Development
150
+
151
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
152
+
153
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
154
+
155
+ ## Contributing
156
+
157
+ - Check issue tracker if someone is working on what you plan to work on
158
+ - Fork project
159
+ - Create new branch
160
+ - Make changes in new branch
161
+ - Submit pull request
162
+
163
+ ## License
164
+
165
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
166
+
167
+ ## Special Thanks
168
+ - Paul Gallagher, creator of the [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext) gem, from which large sections of this gem were copied or adapted.
data/Rakefile ADDED
@@ -0,0 +1,6 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ task :default => :spec
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "hocr_turtletext"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start(__FILE__)
data/bin/setup ADDED
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+ set -vx
5
+
6
+ bundle install
7
+
8
+ # Do any other automated setup that you need to do here
data/hocr_turtletext.gemspec ADDED
@@ -0,0 +1,43 @@
1
+
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'hocr_turtletext/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = 'hocr_turtletext'
8
+ spec.version = HocrTurtletext::VERSION
9
+ spec.authors = ['Sue Zheng Hao']
10
+
11
+ spec.summary = 'Reads structured text from hOCR input.'
12
+ spec.description = <<-DESC
13
+ Parses hOCR input and provides methods to access text in a structured manner. Typical use
14
+ cases include parsing formatted text from a hOCR file produced by running a document
15
+ through OCR.
16
+ DESC
17
+ spec.homepage = 'https://github.com/emmeryn/hocr-turtletext'
18
+ spec.license = 'MIT'
19
+
20
+ # Prevent pushing this gem to RubyGems.org. To allow pushes either set the 'allowed_push_host'
21
+ # to allow pushing to a single host or delete this section to allow pushing to any host.
22
+ # if spec.respond_to?(:metadata)
23
+ # spec.metadata['allowed_push_host'] = "TODO: Set to 'http://mygemserver.com'"
24
+ # else
25
+ # raise 'RubyGems 2.0 or newer is required to protect against ' \
26
+ # 'public gem pushes.'
27
+ # end
28
+
29
+ # Specify which files should be added to the gem when it is released.
30
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
31
+ spec.files = Dir.chdir(File.expand_path('..', __FILE__)) do
32
+ `git ls-files -z`.split("\x0").reject { |f| f.match(%r{^(test|spec|features)/}) }
33
+ end
34
+ spec.bindir = 'exe'
35
+ spec.executables = spec.files.grep(%r{^exe/}) { |f| File.basename(f) }
36
+ spec.require_paths = ['lib']
37
+
38
+ spec.add_development_dependency 'bundler', '~> 1.16'
39
+ spec.add_development_dependency 'rake', '~> 10.0'
40
+ spec.add_development_dependency 'rspec', '~> 3.0'
41
+
42
+ spec.add_runtime_dependency 'nokogiri', '~> 1.10', '>= 1.10.7'
43
+ end
data/lib/hocr_turtletext.rb ADDED
@@ -0,0 +1,4 @@
1
+ require 'nokogiri'
2
+ require 'hocr_turtletext/version'
3
+ require 'hocr_turtletext/reader'
4
+ require 'hocr_turtletext/textangle'
data/lib/hocr_turtletext/reader.rb ADDED
@@ -0,0 +1,155 @@
1
+
2
+ # The text_in_region, text_position and fuzzed_y methods are modified from the
3
+ # originals in pdf-reader-turtletext: https://github.com/tardate/pdf-reader-turtletext
4
+ class HocrTurtletext::Reader
5
+
6
+ def initialize(hocr_path, options = {})
7
+ @hocr_path = hocr_path
8
+ @options = options
9
+ end
10
+
11
+ def content
12
+ hocr_content = File.read(@hocr_path)
13
+ lines = precise_content(hocr_content)
14
+ pos_hash = to_pos_hash(lines)
15
+ fuzzed_y(pos_hash)
16
+ end
17
+
18
+ def text_in_region(xmin,xmax,ymin,ymax,inclusive=false)
19
+ return [] unless xmin && xmax && ymin && ymax
20
+ text_map = content
21
+ box = []
22
+
23
+ text_map.each do |y,text_row|
24
+ if inclusive ? (y >= ymin && y <= ymax) : (y > ymin && y < ymax)
25
+ row = []
26
+ text_row.each do |x,element|
27
+ if inclusive ? (x >= xmin && x <= xmax) : (x > xmin && x < xmax)
28
+ row << element
29
+ end
30
+ end
31
+ box << row unless row.empty?
32
+ end
33
+ end
34
+ box
35
+ end
36
+
37
+ def text_position(text)
38
+ item = if text.class <= Regexp
39
+ content.map do |k,v|
40
+ if x = v.reduce(nil){|memo,vv| memo = (vv[1] =~ text) ? vv[0] : memo }
41
+ [k,x]
42
+ end
43
+ end
44
+ else
45
+ content.map {|k,v| if x = v.rassoc(text) ; [k,x] ; end }
46
+ end
47
+ item = item.compact.flatten
48
+ unless item.empty?
49
+ { :x => item[1], :y => item[0] }
50
+ end
51
+ end
52
+
53
+ def bounding_box(&block)
54
+ HocrTurtletext::Textangle.new(self,&block)
55
+ end
56
+
57
+ private
58
+
59
+ def x_whitespace_threshold
60
+ @options[:x_whitespace_threshold] ||= 30
61
+ end
62
+
63
+ def y_precision
64
+ @options[:y_precision] ||= 3
65
+ end
66
+
67
+ def fuzzed_y(input)
68
+ output = []
69
+ input.keys.sort.each do |precise_y|
70
+ matching_y = output.map(&:first).select { |new_y| (new_y - precise_y).abs < y_precision }.first || precise_y
71
+ y_index = output.index{ |y| y.first == matching_y }
72
+ new_row_content = input[precise_y].to_a
73
+ if y_index
74
+ row_content = output[y_index].last
75
+ row_content += new_row_content
76
+ output[y_index] = [matching_y,row_content.sort{ |a,b| a.first <=> b.first }]
77
+ else
78
+ output << [matching_y,new_row_content.sort{ |a,b| a.first <=> b.first }]
79
+ end
80
+ end
81
+ output
82
+ end
83
+
84
+ def precise_content(hocr_content)
85
+ html = Nokogiri::HTML(hocr_content)
86
+ lines = []
87
+ html.css('span.ocr_line').map do |line|
88
+ chunks = chunks_from_processed_ocr_line(line)
89
+ lines.concat(chunks)
90
+ end
91
+ lines
92
+ end
93
+
94
+ def chunks_from_processed_ocr_line(ocr_line)
95
+ pos_info_line = add_positional_info_to_line(ocr_line)
96
+ sorted_pos_info_line = sort_words_in_line(pos_info_line)
97
+ concat_words_in_line(sorted_pos_info_line)
98
+ end
99
+
100
+ def add_positional_info_to_line(ocr_line)
101
+ ocr_line.css('span.ocrx_word, span.ocr_word').map do |word|
102
+ word_attributes = word.attributes['title'].value.to_s
103
+ .delete(';').split(' ')
104
+ info(word, word_attributes)
105
+ end
106
+ end
107
+
108
+ def sort_words_in_line(pos_info_line)
109
+ # sort word by x value, concat if x2.x_start - x1.x_end < some_x_threshold
110
+ pos_info_line = pos_info_line.sort_by { |word| word[:x_start] }
111
+ pos_info_line.slice_when do |x, y|
112
+ y[:x_start] - x[:x_end] > x_whitespace_threshold
113
+ end.to_a
114
+ end
115
+
116
+ def concat_words_in_line(sorted_pos_info_line)
117
+ chunks = []
118
+ # merge all words in each chunk
119
+ sorted_pos_info_line.each do |chunk|
120
+ sentence = nil
121
+ chunk.each do |word|
122
+ if sentence.nil?
123
+ sentence = word
124
+ else
125
+ sentence[:word] = "#{sentence[:word]} #{word[:word]}"
126
+ sentence[:x_end] = word[:x_end]
127
+ end
128
+ end
129
+ chunks.push sentence
130
+ end
131
+ chunks
132
+ end
133
+
134
+ def to_pos_hash(lines)
135
+ lines = lines.sort_by { |line| line[:y_start] }
136
+
137
+ pos_hash = {}
138
+ lines.each do |run|
139
+ pos_hash[run[:y_start]] ||= {}
140
+ pos_hash[run[:y_start]][run[:x_start]] ||= ''
141
+ pos_hash[run[:y_start]][run[:x_start]] << run[:word]
142
+ end
143
+ pos_hash
144
+ end
145
+
146
+ def info(word, data)
147
+ {
148
+ word: word.text,
149
+ x_start: data[1].to_i,
150
+ y_start: data[2].to_i,
151
+ x_end: data[3].to_i,
152
+ y_end: data[4].to_i
153
+ }
154
+ end
155
+ end
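The `info` method above relies on the word's `title` attribute starting with `bbox x0 y0 x1 y1`, which is typical of tesseract output. A minimal sketch of the same extraction using a regular expression, which does not depend on `bbox` being the first property; `parse_bbox` is a hypothetical helper and the sample title strings are made up.

```ruby
# Hypothetical helper: extracts the bounding box from an hOCR title string.
# Returns nil when the string carries no bbox property.
def parse_bbox(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x_start, y_start, x_end, y_end = match.captures.map(&:to_i)
  { x_start: x_start, y_start: y_start, x_end: x_end, y_end: y_end }
end

bbox    = parse_bbox('bbox 36 92 96 116; x_wconf 95')
no_bbox = parse_bbox('x_wconf 95')
```

Returning `nil` for a missing box lets the caller skip malformed words instead of silently treating them as positioned at 0,0.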
data/lib/hocr_turtletext/textangle.rb ADDED
@@ -0,0 +1,117 @@
1
+ # A DSL syntax for text extraction.
2
+ # Modified from the original at https://github.com/tardate/pdf-reader-turtletext
3
+ class HocrTurtletext::Textangle
4
+
5
+ attr_reader :reader
6
+
7
+ # +hocr_turtletext_reader+ is a HocrTurtletext::Reader
8
+ def initialize(hocr_turtletext_reader,&block)
9
+ @reader = hocr_turtletext_reader
10
+ @inclusive = false
11
+ if block_given?
12
+ if block.arity == 1
13
+ yield self
14
+ else
15
+ instance_eval &block
16
+ end
17
+ end
18
+ end
19
+
20
+ attr_writer :inclusive
21
+
22
+ def inclusive(*args)
23
+ if value = args.first
24
+ @inclusive = value
25
+ end
26
+ @inclusive
27
+ end
28
+
29
+ # Command: sets +inclusive+ to true
30
+ def inclusive!
31
+ @inclusive = true
32
+ end
33
+
34
+ # Command: sets +inclusive+ to false
35
+ def exclusive!
36
+ @inclusive = false
37
+ end
38
+
39
+ attr_writer :above
40
+ def above(*args)
41
+ if value = args.first
42
+ @above = value
43
+ end
44
+ @above
45
+ end
46
+
47
+ attr_writer :below
48
+ def below(*args)
49
+ if value = args.first
50
+ @below = value
51
+ end
52
+ @below
53
+ end
54
+
55
+ attr_writer :left_of
56
+ def left_of(*args)
57
+ if value = args.first
58
+ @left_of = value
59
+ end
60
+ @left_of
61
+ end
62
+
63
+ attr_writer :right_of
64
+ def right_of(*args)
65
+ if value = args.first
66
+ @right_of = value
67
+ end
68
+ @right_of
69
+ end
70
+
71
+ # Returns the text array found within the defined region.
72
+ # Each line of text is an array of the separate text elements found on that line.
73
+ # [["first line first text", "first line last text"],["second line text"]]
74
+ def text
75
+ return unless reader
76
+
77
+ xmin = if right_of
78
+ if [Integer,Float].include?(right_of.class)
79
+ right_of
80
+ elsif xy = reader.text_position(right_of)
81
+ xy[:x]
82
+ end
83
+ else
84
+ 0
85
+ end
86
+ xmax = if left_of
87
+ if [Integer,Float].include?(left_of.class)
88
+ left_of
89
+ elsif xy = reader.text_position(left_of)
90
+ xy[:x]
91
+ end
92
+ else
93
+ 99999 # TODO: figure out the actual limit?
94
+ end
95
+
96
+ ymax = if above
97
+ if [Integer,Float].include?(above.class)
98
+ above
99
+ elsif xy = reader.text_position(above)
100
+ xy[:y]
101
+ end
102
+ else
103
+ 99999
104
+ end
105
+ ymin = if below
106
+ if [Integer,Float].include?(below.class)
107
+ below
108
+ elsif xy = reader.text_position(below)
109
+ xy[:y]
110
+ end
111
+ else
112
+ 0 # TODO: figure out the actual limit?
113
+ end
114
+
115
+ reader.text_in_region(xmin,xmax,ymin,ymax,inclusive)
116
+ end
117
+ end
data/lib/hocr_turtletext/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module HocrTurtletext
2
+ VERSION = '0.1.1'.freeze
3
+ end
metadata ADDED
@@ -0,0 +1,123 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: hocr_turtletext
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.1
5
+ platform: ruby
6
+ authors:
7
+ - Sue Zheng Hao
8
+ autorequire:
9
+ bindir: exe
10
+ cert_chain: []
11
+ date: 2020-01-24 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: bundler
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - "~>"
18
+ - !ruby/object:Gem::Version
19
+ version: '1.16'
20
+ type: :development
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - "~>"
25
+ - !ruby/object:Gem::Version
26
+ version: '1.16'
27
+ - !ruby/object:Gem::Dependency
28
+ name: rake
29
+ requirement: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - "~>"
32
+ - !ruby/object:Gem::Version
33
+ version: '10.0'
34
+ type: :development
35
+ prerelease: false
36
+ version_requirements: !ruby/object:Gem::Requirement
37
+ requirements:
38
+ - - "~>"
39
+ - !ruby/object:Gem::Version
40
+ version: '10.0'
41
+ - !ruby/object:Gem::Dependency
42
+ name: rspec
43
+ requirement: !ruby/object:Gem::Requirement
44
+ requirements:
45
+ - - "~>"
46
+ - !ruby/object:Gem::Version
47
+ version: '3.0'
48
+ type: :development
49
+ prerelease: false
50
+ version_requirements: !ruby/object:Gem::Requirement
51
+ requirements:
52
+ - - "~>"
53
+ - !ruby/object:Gem::Version
54
+ version: '3.0'
55
+ - !ruby/object:Gem::Dependency
56
+ name: nokogiri
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: '1.10'
62
+ - - ">="
63
+ - !ruby/object:Gem::Version
64
+ version: 1.10.7
65
+ type: :runtime
66
+ prerelease: false
67
+ version_requirements: !ruby/object:Gem::Requirement
68
+ requirements:
69
+ - - "~>"
70
+ - !ruby/object:Gem::Version
71
+ version: '1.10'
72
+ - - ">="
73
+ - !ruby/object:Gem::Version
74
+ version: 1.10.7
75
+ description: |2
76
+ Parses hOCR input and provides methods to access text in a structured manner. Typical use
77
+ cases include parsing formatted text from a hOCR file produced by running a document
78
+ through OCR.
79
+ email:
80
+ executables: []
81
+ extensions: []
82
+ extra_rdoc_files: []
83
+ files:
84
+ - ".gitignore"
85
+ - ".rspec"
86
+ - ".travis.yml"
87
+ - Gemfile
88
+ - Gemfile.lock
89
+ - LICENSE.txt
90
+ - README.md
91
+ - Rakefile
92
+ - bin/console
93
+ - bin/setup
94
+ - hocr_turtletext.gemspec
95
+ - lib/hocr_turtletext.rb
96
+ - lib/hocr_turtletext/reader.rb
97
+ - lib/hocr_turtletext/textangle.rb
98
+ - lib/hocr_turtletext/version.rb
99
+ homepage: https://github.com/emmeryn/hocr-turtletext
100
+ licenses:
101
+ - MIT
102
+ metadata: {}
103
+ post_install_message:
104
+ rdoc_options: []
105
+ require_paths:
106
+ - lib
107
+ required_ruby_version: !ruby/object:Gem::Requirement
108
+ requirements:
109
+ - - ">="
110
+ - !ruby/object:Gem::Version
111
+ version: '0'
112
+ required_rubygems_version: !ruby/object:Gem::Requirement
113
+ requirements:
114
+ - - ">="
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ requirements: []
118
+ rubyforge_project:
119
+ rubygems_version: 2.6.14.1
120
+ signing_key:
121
+ specification_version: 4
122
+ summary: Reads structured text from hOCR input.
123
+ test_files: []