henkei 1.14.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 8f90fc4730d0e9fe321311d38bad4b9ab34ccc5b
4
+ data.tar.gz: 2df0e595e7fdfa1b7e15c986755321dfdcaeafc8
5
+ SHA512:
6
+ metadata.gz: 97bedb4df8bc4665a756b4a5f92f15a74ef11ac415483a5eb0cb6d4d9b368f92edf9d51b9460bfafa3ae25dd68c909b5e12e27f8cf2237298dece177bc793430
7
+ data.tar.gz: 031e5c953328a5dff2578b13720a49cda89ebc7fc2b26f71619ea8144cbd56032a390c2bdead4230623eaf4b009d0cc4af88d5d6ad648152bc55dbaa8a8d4341
@@ -0,0 +1,18 @@
1
+ *.gem
2
+ *.rbc
3
+ .bundle
4
+ .config
5
+ .gs
6
+ .yardoc
7
+ Gemfile.lock
8
+ InstalledFiles
9
+ _yardoc
10
+ coverage
11
+ doc/
12
+ lib/bundler/man
13
+ pkg
14
+ rdoc
15
+ spec/reports
16
+ test/tmp
17
+ test/version_tmp
18
+ tmp
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --color
2
+ --format documentation
@@ -0,0 +1,9 @@
1
+ language: ruby
2
+ rvm:
3
+ - 1.9.3
4
+ - 2.0.0
5
+ - 2.1.0
6
+ - 2.2.5
7
+
8
+ before_install:
9
+ - gem update bundler
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in henkei.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,23 @@
1
+ Copyright (c) 2017 Andrew Bromwich
2
+ Copyright (c) 2012 Erol Fornoles
3
+
4
+ MIT License
5
+
6
+ Permission is hereby granted, free of charge, to any person obtaining
7
+ a copy of this software and associated documentation files (the
8
+ "Software"), to deal in the Software without restriction, including
9
+ without limitation the rights to use, copy, modify, merge, publish,
10
+ distribute, sublicense, and/or sell copies of the Software, and to
11
+ permit persons to whom the Software is furnished to do so, subject to
12
+ the following conditions:
13
+
14
+ The above copyright notice and this permission notice shall be
15
+ included in all copies or substantial portions of the Software.
16
+
17
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
18
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
19
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
20
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
21
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
22
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
23
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,11 @@
1
+ Henkei
2
+ Copyright 2017 Andrew Bromwich, released under the MIT license
3
+
4
+ Yomu
5
+ Copyright 2011 Erol Fornoles, released under the MIT license
6
+
7
+ Apache Tika
8
+ Copyright 2011 The Apache Software Foundation
9
+
10
+ This product includes software developed at
11
+ The Apache Software Foundation (http://www.apache.org/).
@@ -0,0 +1,109 @@
1
+ [![Travis Build Status](http://img.shields.io/travis/Erol/yomu.svg?style=flat)](https://travis-ci.org/Erol/yomu)
2
+ [![Code Climate Score](http://img.shields.io/codeclimate/github/Erol/yomu.svg?style=flat)](https://codeclimate.com/github/Erol/yomu)
3
+ [![Gem Version](http://img.shields.io/gem/v/yomu.svg?style=flat)](#)
4
+
5
+ # Henkei 変形
6
+
7
+ [Henkei](http://github.com/abrom/henkei) is a library for extracting text and metadata from files and documents using the [Apache Tika](http://tika.apache.org/) content analysis toolkit.
8
+
9
+ Here are some of the formats supported:
10
+
11
+ - Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx,
12
+ .ppt, .pptx)
13
+ - OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
14
+ - Apple iWorks Formats
15
+ - Rich Text Format (.rtf)
16
+ - Portable Document Format (.pdf)
17
+
18
+ For the complete list of supported formats, please visit the Apache Tika
19
+ [Supported Document Formats](http://tika.apache.org/0.9/formats.html) page.
20
+
21
+ ## Usage
22
+
23
+ Text, metadata and MIME type information can be extracted by calling `Henkei.read` directly:
24
+
25
+ ```ruby
26
+ require 'henkei'
27
+
28
+ data = File.read 'sample.pages'
29
+ text = Henkei.read :text, data
30
+ metadata = Henkei.read :metadata, data
31
+ mimetype = Henkei.read :mimetype, data
32
+ ```
33
+
34
+ ### Reading text from a given filename
35
+
36
+ Create a new instance of Henkei and pass a filename.
37
+
38
+ ```ruby
39
+ henkei = Henkei.new 'sample.pages'
40
+ text = henkei.text
41
+ ```
42
+
43
+ ### Reading text from a given URL
44
+
45
+ This is useful for reading remote files, like documents hosted on Amazon S3.
46
+
47
+ ```ruby
48
+ henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
49
+ text = henkei.text
50
+ ```
51
+
52
+ ### Reading text from a stream
53
+
54
+ Henkei can also read from a stream or any object that responds to `read`, including file uploads from Ruby on Rails or Sinatra.
55
+
56
+ ```ruby
57
+ post '/:name/:filename' do
58
+ henkei = Henkei.new params[:data][:tempfile]
59
+ henkei.text
60
+ end
61
+ ```
62
+
63
+ ### Reading metadata
64
+
65
+ Metadata is returned as a hash.
66
+
67
+ ```ruby
68
+ henkei = Henkei.new 'sample.pages'
69
+ henkei.metadata['Content-Type'] #=> "application/vnd.apple.pages"
70
+ ```
71
+
72
+ ### Reading MIME types
73
+
74
+ MIME type is returned as a MIME::Type object.
75
+
76
+ ```ruby
77
+ henkei = Henkei.new 'sample.docx'
78
+ henkei.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
79
+ henkei.mimetype.extensions #=> ['docx']
80
+ ```
81
+
82
+ ## Installation and Dependencies
83
+
84
+ ### Java Runtime
85
+
86
+ Henkei packages the Apache Tika application jar and requires a working JRE for it to work.
87
+
88
+ ### Gem
89
+
90
+ Add this line to your application's Gemfile:
91
+
92
+ gem 'henkei'
93
+
94
+ And then execute:
95
+
96
+ $ bundle
97
+
98
+ Or install it yourself as:
99
+
100
+ $ gem install henkei
101
+
102
+ ## Contributing
103
+
104
+ 1. Fork it
105
+ 2. Create your feature branch ( `git checkout -b my-new-feature` )
106
+ 3. Create tests and make them pass ( `rake test` )
107
+ 4. Commit your changes ( `git commit -am 'Added some feature'` )
108
+ 5. Push to the branch ( `git push origin my-new-feature` )
109
+ 6. Create a new Pull Request
@@ -0,0 +1,8 @@
1
+ #!/usr/bin/env rake
2
+
3
+ require 'bundler/gem_tasks'
4
+ require 'rspec/core/rake_task'
5
+
6
+ RSpec::Core::RakeTask.new 'spec'
7
+
8
+ task :default => :spec
@@ -0,0 +1,27 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'henkei/version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = 'henkei'
8
+ spec.version = Henkei::VERSION
9
+ spec.authors = ['Erol Fornoles', 'Andrew Bromwich']
10
+ spec.email = ['erol.fornoles@gmail.com', 'a.bromwich@gmail.com']
11
+ spec.description = %q{Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)}
12
+ spec.summary = %q{Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)}
13
+ spec.homepage = 'http://github.com/abrom/henkei'
14
+ spec.license = 'MIT'
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ['lib']
20
+
21
+ spec.add_runtime_dependency 'mime-types', '~> 1.23'
22
+ spec.add_runtime_dependency 'json', '~> 1.8'
23
+
24
+ spec.add_development_dependency 'bundler', '~> 1.3'
25
+ spec.add_development_dependency 'rake'
26
+ spec.add_development_dependency 'rspec', '~> 2.14'
27
+ end
Binary file
@@ -0,0 +1,268 @@
1
+ require 'henkei/version'
2
+
3
+ require 'net/http'
4
+ require 'mime/types'
5
+ require 'json'
6
+
7
+ require 'socket'
8
+ require 'stringio'
9
+
10
+ class Henkei
11
+ GEMPATH = File.dirname(File.dirname(__FILE__))
12
+ JARPATH = File.join(Henkei::GEMPATH, 'jar', 'tika-app-1.14.jar')
13
+ DEFAULT_SERVER_PORT = 9293 # an arbitrary, but perfectly cromulent, port
14
+
15
+ @@server_port = nil
16
+ @@server_pid = nil
17
+
18
+ # Read text or metadata from a data buffer.
19
+ #
20
+ # data = File.read 'sample.pages'
21
+ # text = Henkei.read :text, data
22
+ # metadata = Henkei.read :metadata, data
23
+
24
+ def self.read(type, data)
25
+ result = @@server_pid ? self._server_read(type, data) : self._client_read(type, data)
26
+
27
+ case type
28
+ when :text
29
+ result
30
+ when :html
31
+ result
32
+ when :metadata
33
+ JSON.parse(result)
34
+ when :mimetype
35
+ MIME::Types[JSON.parse(result)['Content-Type']].first
36
+ end
37
+ end
38
+
39
+ def self._client_read(type, data)
40
+ switch = case type
41
+ when :text
42
+ '-t'
43
+ when :html
44
+ '-h'
45
+ when :metadata
46
+ '-m -j'
47
+ when :mimetype
48
+ '-m -j'
49
+ end
50
+
51
+ IO.popen "#{java} -Djava.awt.headless=true -jar #{Henkei::JARPATH} #{switch}", 'r+' do |io|
52
+ io.write data
53
+ io.close_write
54
+ io.read
55
+ end
56
+ end
57
+
58
+
59
+ def self._server_read(_, data)
60
+ s = TCPSocket.new('localhost', @@server_port)
61
+ file = StringIO.new(data, 'r')
62
+
63
+ while 1
64
+ chunk = file.read(65536)
65
+ break unless chunk
66
+ s.write(chunk)
67
+ end
68
+
69
+ # tell Tika that we're done sending data
70
+ s.shutdown(Socket::SHUT_WR)
71
+
72
+ resp = ''
73
+ while 1
74
+ chunk = s.recv(65536)
75
+ break if chunk.empty? || !chunk
76
+ resp << chunk
77
+ end
78
+ resp
79
+ end
80
+
81
+ # Create a new instance of Henkei with a given document.
82
+ #
83
+ # Using a file path:
84
+ #
85
+ # Henkei.new 'sample.pages'
86
+ #
87
+ # Using a URL:
88
+ #
89
+ # Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
90
+ #
91
+ # From a stream or an object which responds to +read+
92
+ #
93
+ # Henkei.new File.open('sample.pages')
94
+
95
+ def initialize(input)
96
+ if input.is_a? String
97
+ if File.exists? input
98
+ @path = input
99
+ elsif input =~ URI::regexp
100
+ @uri = URI.parse input
101
+ else
102
+ raise Errno::ENOENT.new "missing file or invalid URI - #{input}"
103
+ end
104
+ elsif input.respond_to? :read
105
+ @stream = input
106
+ else
107
+ raise TypeError.new "can't read from #{input.class.name}"
108
+ end
109
+ end
110
+
111
+ # Returns the text content of the Henkei document.
112
+ #
113
+ # henkei = Henkei.new 'sample.pages'
114
+ # henkei.text
115
+
116
+ def text
117
+ return @text if defined? @text
118
+
119
+ @text = Henkei.read :text, data
120
+ end
121
+
122
+ # Returns the text content of the Henkei document in HTML.
123
+ #
124
+ # henkei = Henkei.new 'sample.pages'
125
+ # henkei.html
126
+
127
+ def html
128
+ return @html if defined? @html
129
+
130
+ @html = Henkei.read :html, data
131
+ end
132
+
133
+ # Returns the metadata hash of the Henkei document.
134
+ #
135
+ # henkei = Henkei.new 'sample.pages'
136
+ # henkei.metadata['Content-Type']
137
+
138
+ def metadata
139
+ return @metadata if defined? @metadata
140
+
141
+ @metadata = Henkei.read :metadata, data
142
+ end
143
+
144
+ # Returns the mimetype object of the Henkei document.
145
+ #
146
+ # henkei = Henkei.new 'sample.docx'
147
+ # henkei.mimetype.content_type #=> 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
148
+ # henkei.mimetype.extensions #=> ['docx']
149
+
150
+ def mimetype
151
+ return @mimetype if defined? @mimetype
152
+
153
+ type = metadata["Content-Type"].is_a?(Array) ? metadata["Content-Type"].first : metadata["Content-Type"]
154
+
155
+ @mimetype = MIME::Types[type].first
156
+ end
157
+
158
+ # Returns +true+ if the Henkei document was specified using a file path.
159
+ #
160
+ # henkei = Henkei.new 'sample.pages'
161
+ # henkei.path? #=> true
162
+
163
+
164
+ def creation_date
165
+ return @creation_date if defined? @creation_date
166
+
167
+ if metadata['Creation-Date']
168
+ @creation_date = Time.parse(metadata['Creation-Date'])
169
+ else
170
+ nil
171
+ end
172
+ end
173
+
174
+ def path?
175
+ defined? @path
176
+ end
177
+
178
+ # Returns +true+ if the Henkei document was specified using a URI.
179
+ #
180
+ # henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
181
+ # henkei.uri? #=> true
182
+
183
+ def uri?
184
+ defined? @uri
185
+ end
186
+
187
+ # Returns +true+ if the Henkei document was specified from a stream or an object which responds to +read+.
188
+ #
189
+ # file = File.open('sample.pages')
190
+ # henkei = Henkei.new file
191
+ # henkei.stream? #=> true
192
+
193
+ def stream?
194
+ defined? @stream
195
+ end
196
+
197
+ # Returns the raw/unparsed content of the Henkei document.
198
+ #
199
+ # henkei = Henkei.new 'sample.pages'
200
+ # henkei.data
201
+
202
+ def data
203
+ return @data if defined? @data
204
+
205
+ if path?
206
+ @data = File.read @path
207
+ elsif uri?
208
+ @data = Net::HTTP.get @uri
209
+ elsif stream?
210
+ @data = @stream.read
211
+ end
212
+
213
+ @data
214
+ end
215
+
216
+ # Returns pid of Tika server, started as a new spawned process.
217
+ #
218
+ # type :html, :text or :metadata
219
+ # custom_port e.g. 9293
220
+ #
221
+ # Henkei.server(:text, 9294)
222
+ #
223
+ def self.server(type, custom_port=nil)
224
+ switch = case type
225
+ when :text
226
+ '-t'
227
+ when :html
228
+ '-h'
229
+ when :metadata
230
+ '-m -j'
231
+ when :mimetype
232
+ '-m -j'
233
+ end
234
+
235
+ @@server_port = custom_port || DEFAULT_SERVER_PORT
236
+
237
+ @@server_pid = Process.spawn("#{java} -Djava.awt.headless=true -jar #{Henkei::JARPATH} --server --port #{@@server_port} #{switch}")
238
+ sleep(2) # Give the server 2 seconds to spin up.
239
+ @@server_pid
240
+ end
241
+
242
+ # Kills server started by Henkei.server
243
+ #
244
+ # Always run this when you're done, or else Tika might run until you kill it manually
245
+ # You might try putting your extraction in a begin..rescue...ensure...end block and
246
+ # putting this method in the ensure block.
247
+ #
248
+ # Henkei.server(:text)
249
+ # reports = ["report1.docx", "report2.doc", "report3.pdf"]
250
+ # begin
251
+ # my_texts = reports.map{|report_path| Henkei.new(report_path).text }
252
+ # rescue
253
+ # ensure
254
+ # Henkei.kill_server!
255
+ # end
256
+ def self.kill_server!
257
+ if @@server_pid
258
+ Process.kill('INT', @@server_pid)
259
+ @@server_pid = nil
260
+ @@server_port = nil
261
+ end
262
+ end
263
+
264
+ def self.java
265
+ ENV['JAVA_HOME'] ? ENV['JAVA_HOME'] + '/bin/java' : 'java'
266
+ end
267
+ private_class_method :java
268
+ end
@@ -0,0 +1,3 @@
1
+ class Henkei
2
+ VERSION = '1.14.1'
3
+ end
@@ -0,0 +1,6 @@
1
+ RSpec.configure do |config|
2
+ config.treat_symbols_as_metadata_keys_with_true_values = true
3
+ config.run_all_when_everything_filtered = true
4
+ config.filter_run :focus
5
+ config.order = 'random'
6
+ end
@@ -0,0 +1,182 @@
1
+ require 'helper.rb'
2
+ require 'henkei'
3
+
4
+ describe Henkei do
5
+ let(:data) { File.read 'spec/samples/sample.docx' }
6
+
7
+ before do
8
+ ENV['JAVA_HOME'] = nil
9
+ end
10
+
11
+ describe '.read' do
12
+ it 'reads text' do
13
+ text = Henkei.read :text, data
14
+
15
+ expect( text ).to include 'The quick brown fox jumped over the lazy cat.'
16
+ end
17
+
18
+ it 'reads metadata' do
19
+ metadata = Henkei.read :metadata, data
20
+
21
+ expect( metadata['Content-Type'] ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
22
+ end
23
+
24
+ it 'reads metadata values with colons as strings' do
25
+ data = File.read 'spec/samples/sample-metadata-values-with-colons.doc'
26
+ metadata = Henkei.read :metadata, data
27
+
28
+ expect( metadata['dc:title'] ).to eql 'problem: test'
29
+ end
30
+
31
+ it 'reads mimetype' do
32
+ mimetype = Henkei.read :mimetype, data
33
+
34
+ expect( mimetype.content_type ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
35
+ expect( mimetype.extensions ).to include 'docx'
36
+ end
37
+ end
38
+
39
+ describe '.new' do
40
+ it 'requires parameters' do
41
+ expect { Henkei.new }.to raise_error ArgumentError
42
+ end
43
+
44
+ it 'accepts a root path' do
45
+ henkei = Henkei.new 'spec/samples/sample.pages'
46
+
47
+ expect( henkei ).to be_path
48
+ expect( henkei ).not_to be_uri
49
+ expect( henkei ).not_to be_stream
50
+ end
51
+
52
+ it 'accepts a relative path' do
53
+ henkei = Henkei.new 'spec/samples/sample.pages'
54
+
55
+ expect( henkei ).to be_path
56
+ expect( henkei ).not_to be_uri
57
+ expect( henkei ).not_to be_stream
58
+ end
59
+
60
+ it 'accepts a path with spaces' do
61
+ henkei = Henkei.new 'spec/samples/sample filename with spaces.pages'
62
+
63
+ expect( henkei ).to be_path
64
+ expect( henkei ).not_to be_uri
65
+ expect( henkei ).not_to be_stream
66
+ end
67
+
68
+ it 'accepts a URI' do
69
+ henkei = Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
70
+
71
+ expect( henkei ).to be_uri
72
+ expect( henkei ).not_to be_path
73
+ expect( henkei ).not_to be_stream
74
+ end
75
+
76
+ it 'accepts a stream or object that can be read' do
77
+ File.open 'spec/samples/sample.pages', 'r' do |file|
78
+ henkei = Henkei.new file
79
+
80
+ expect( henkei ).to be_stream
81
+ expect( henkei ).not_to be_path
82
+ expect( henkei ).not_to be_uri
83
+ end
84
+ end
85
+
86
+ it 'refuses a path to a missing file' do
87
+ expect { Henkei.new 'test/sample/missing.pages'}.to raise_error Errno::ENOENT
88
+ end
89
+
90
+ it 'refuses other objects' do
91
+ [nil, 1, 1.1].each do |object|
92
+ expect { Henkei.new object }.to raise_error TypeError
93
+ end
94
+ end
95
+ end
96
+
97
+
98
+ describe '.creation_date' do
99
+ let(:henkei) { Henkei.new 'spec/samples/sample.pages' }
100
+ it 'should retur Time' do
101
+ expect( henkei.creation_date ).to be_a Time
102
+ end
103
+ end
104
+
105
+ describe '.java' do
106
+ specify 'with no specified JAVA_HOME' do
107
+ expect( Henkei.send(:java) ).to eql 'java'
108
+ end
109
+
110
+ specify 'with a specified JAVA_HOME' do
111
+ ENV['JAVA_HOME'] = '/path/to/java/home'
112
+
113
+ expect( Henkei.send(:java) ).to eql '/path/to/java/home/bin/java'
114
+ end
115
+ end
116
+
117
+ context 'initialized with a given path' do
118
+ let(:henkei) { Henkei.new 'spec/samples/sample.pages' }
119
+
120
+ specify '#text reads text' do
121
+ expect( henkei.text).to include 'The quick brown fox jumped over the lazy cat.'
122
+ end
123
+
124
+ specify '#metadata reads metadata' do
125
+ expect( henkei.metadata['Content-Type'] ).to eql ["application/vnd.apple.pages", "application/vnd.apple.pages"]
126
+ end
127
+ end
128
+
129
+ context 'initialized with a given URI' do
130
+ let(:henkei) { Henkei.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx' }
131
+
132
+ specify '#text reads text' do
133
+ expect( henkei.text ).to include 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit.'
134
+ end
135
+
136
+ specify '#metadata reads metadata' do
137
+ expect( henkei.metadata['Content-Type'] ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
138
+ end
139
+ end
140
+
141
+ context 'initialized with a given stream' do
142
+ let(:henkei) { Henkei.new File.open('spec/samples/sample.pages', 'rb') }
143
+
144
+ specify '#text reads text' do
145
+ expect( henkei.text ).to include 'The quick brown fox jumped over the lazy cat.'
146
+ end
147
+
148
+ specify '#metadata reads metadata' do
149
+ expect( henkei.metadata['Content-Type'] ).to eql ["application/vnd.apple.pages", "application/vnd.apple.pages"]
150
+ end
151
+ end
152
+
153
+ context 'working as server mode' do
154
+ specify '#starts and kills server' do
155
+ begin
156
+ Henkei.server(:text)
157
+ expect(Henkei.class_variable_get(:@@server_pid)).not_to be_nil
158
+ expect(Henkei.class_variable_get(:@@server_port)).not_to be_nil
159
+
160
+ s = TCPSocket.new('localhost', Henkei.class_variable_get(:@@server_port))
161
+ expect(s).to be_a TCPSocket
162
+ s.close
163
+ ensure
164
+ port = Henkei.class_variable_get(:@@server_port)
165
+ Henkei.kill_server!
166
+ sleep 2
167
+ expect { TCPSocket.new('localhost', port) }.to raise_error Errno::ECONNREFUSED
168
+ end
169
+ end
170
+
171
+ specify '#runs samples through server mode' do
172
+ begin
173
+ Henkei.server(:text)
174
+ expect(Henkei.new('spec/samples/sample.pages').text).to include 'The quick brown fox jumped over the lazy cat.'
175
+ expect(Henkei.new('spec/samples/sample filename with spaces.pages').text).to include 'The quick brown fox jumped over the lazy cat.'
176
+ expect(Henkei.new('spec/samples/sample.docx').text).to include 'The quick brown fox jumped over the lazy cat.'
177
+ ensure
178
+ Henkei.kill_server!
179
+ end
180
+ end
181
+ end
182
+ end
Binary file
Binary file
metadata ADDED
@@ -0,0 +1,142 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: henkei
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.14.1
5
+ platform: ruby
6
+ authors:
7
+ - Erol Fornoles
8
+ - Andrew Bromwich
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2017-02-18 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: mime-types
16
+ requirement: !ruby/object:Gem::Requirement
17
+ requirements:
18
+ - - "~>"
19
+ - !ruby/object:Gem::Version
20
+ version: '1.23'
21
+ type: :runtime
22
+ prerelease: false
23
+ version_requirements: !ruby/object:Gem::Requirement
24
+ requirements:
25
+ - - "~>"
26
+ - !ruby/object:Gem::Version
27
+ version: '1.23'
28
+ - !ruby/object:Gem::Dependency
29
+ name: json
30
+ requirement: !ruby/object:Gem::Requirement
31
+ requirements:
32
+ - - "~>"
33
+ - !ruby/object:Gem::Version
34
+ version: '1.8'
35
+ type: :runtime
36
+ prerelease: false
37
+ version_requirements: !ruby/object:Gem::Requirement
38
+ requirements:
39
+ - - "~>"
40
+ - !ruby/object:Gem::Version
41
+ version: '1.8'
42
+ - !ruby/object:Gem::Dependency
43
+ name: bundler
44
+ requirement: !ruby/object:Gem::Requirement
45
+ requirements:
46
+ - - "~>"
47
+ - !ruby/object:Gem::Version
48
+ version: '1.3'
49
+ type: :development
50
+ prerelease: false
51
+ version_requirements: !ruby/object:Gem::Requirement
52
+ requirements:
53
+ - - "~>"
54
+ - !ruby/object:Gem::Version
55
+ version: '1.3'
56
+ - !ruby/object:Gem::Dependency
57
+ name: rake
58
+ requirement: !ruby/object:Gem::Requirement
59
+ requirements:
60
+ - - ">="
61
+ - !ruby/object:Gem::Version
62
+ version: '0'
63
+ type: :development
64
+ prerelease: false
65
+ version_requirements: !ruby/object:Gem::Requirement
66
+ requirements:
67
+ - - ">="
68
+ - !ruby/object:Gem::Version
69
+ version: '0'
70
+ - !ruby/object:Gem::Dependency
71
+ name: rspec
72
+ requirement: !ruby/object:Gem::Requirement
73
+ requirements:
74
+ - - "~>"
75
+ - !ruby/object:Gem::Version
76
+ version: '2.14'
77
+ type: :development
78
+ prerelease: false
79
+ version_requirements: !ruby/object:Gem::Requirement
80
+ requirements:
81
+ - - "~>"
82
+ - !ruby/object:Gem::Version
83
+ version: '2.14'
84
+ description: Read text and metadata from files and documents (.doc, .docx, .pages,
85
+ .odt, .rtf, .pdf)
86
+ email:
87
+ - erol.fornoles@gmail.com
88
+ - a.bromwich@gmail.com
89
+ executables: []
90
+ extensions: []
91
+ extra_rdoc_files: []
92
+ files:
93
+ - ".gitignore"
94
+ - ".rspec"
95
+ - ".travis.yml"
96
+ - Gemfile
97
+ - LICENSE
98
+ - NOTICE.txt
99
+ - README.md
100
+ - Rakefile
101
+ - henkei.gemspec
102
+ - jar/tika-app-1.14.jar
103
+ - lib/henkei.rb
104
+ - lib/henkei/version.rb
105
+ - spec/helper.rb
106
+ - spec/henkei_spec.rb
107
+ - spec/samples/sample filename with spaces.pages
108
+ - spec/samples/sample-metadata-values-with-colons.doc
109
+ - spec/samples/sample.docx
110
+ - spec/samples/sample.pages
111
+ homepage: http://github.com/abrom/henkei
112
+ licenses:
113
+ - MIT
114
+ metadata: {}
115
+ post_install_message:
116
+ rdoc_options: []
117
+ require_paths:
118
+ - lib
119
+ required_ruby_version: !ruby/object:Gem::Requirement
120
+ requirements:
121
+ - - ">="
122
+ - !ruby/object:Gem::Version
123
+ version: '0'
124
+ required_rubygems_version: !ruby/object:Gem::Requirement
125
+ requirements:
126
+ - - ">="
127
+ - !ruby/object:Gem::Version
128
+ version: '0'
129
+ requirements: []
130
+ rubyforge_project:
131
+ rubygems_version: 2.4.8
132
+ signing_key:
133
+ specification_version: 4
134
+ summary: Read text and metadata from files and documents (.doc, .docx, .pages, .odt,
135
+ .rtf, .pdf)
136
+ test_files:
137
+ - spec/helper.rb
138
+ - spec/henkei_spec.rb
139
+ - spec/samples/sample filename with spaces.pages
140
+ - spec/samples/sample-metadata-values-with-colons.doc
141
+ - spec/samples/sample.docx
142
+ - spec/samples/sample.pages