yomu2 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.gitignore +18 -0
- data/.rspec +2 -0
- data/.travis.yml +5 -0
- data/Gemfile +4 -0
- data/LICENSE +22 -0
- data/NOTICE.txt +8 -0
- data/README.md +111 -0
- data/Rakefile +8 -0
- data/jar/tika-app-1.11.jar +0 -0
- data/lib/yomu.rb +274 -0
- data/lib/yomu/version.rb +3 -0
- data/spec/helper.rb +6 -0
- data/spec/samples/sample filename with spaces.pages +0 -0
- data/spec/samples/sample-metadata-values-with-colons.doc +0 -0
- data/spec/samples/sample.docx +0 -0
- data/spec/samples/sample.pages +0 -0
- data/spec/yomu_spec.rb +182 -0
- data/yomu2.gemspec +27 -0
- metadata +142 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: 6b01b6c47c318f92b499b25df734fad15f37a79b
|
4
|
+
data.tar.gz: 71430d5a68f0db97d3a80242d9a86f67aa6116da
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 0d29eeb6f7026c32e9bb7bf9c0d6a8f56eeede3af8eb6a4126b61753e343f97d37cd7b7bccd4a46e1ca9fe38c7569f840a9ea6a51c5f34d36641fc3fd232834c
|
7
|
+
data.tar.gz: af9252b76bd976c3e643fda9345f0232eca67a8e7924a6ece11f34fe0f2589236003deadf3f616eaa24d8ee9717e533e2b6b58d6055a5bc198c06589cc1cb729
|
data/.gitignore
ADDED
data/.rspec
ADDED
data/.travis.yml
ADDED
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
Copyright (c) 2012 Erol Fornoles
|
2
|
+
|
3
|
+
MIT License
|
4
|
+
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
6
|
+
a copy of this software and associated documentation files (the
|
7
|
+
"Software"), to deal in the Software without restriction, including
|
8
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
9
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
10
|
+
permit persons to whom the Software is furnished to do so, subject to
|
11
|
+
the following conditions:
|
12
|
+
|
13
|
+
The above copyright notice and this permission notice shall be
|
14
|
+
included in all copies or substantial portions of the Software.
|
15
|
+
|
16
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
17
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
18
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
19
|
+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
|
20
|
+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
21
|
+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
22
|
+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/NOTICE.txt
ADDED
data/README.md
ADDED
@@ -0,0 +1,111 @@
|
|
1
|
+
[](https://travis-ci.org/AlphaExchange/yomu2)
|
2
|
+
[](https://codeclimate.com/github/AlphaExchange/yomu2)
|
3
|
+
[](#)
|
4
|
+
|
5
|
+
# Yomu2 読む2
|
6
|
+
|
7
|
+
[Yomu2](http://github.com/AlphaExchange/yomu2) is a library for extracting text and metadata from files and documents using the [Apache Tika](http://tika.apache.org/) content analysis toolkit.
|
8
|
+
|
9
|
+
This is a up-to-date and maintained fork of Yomu gem.
|
10
|
+
|
11
|
+
Here are some of the formats supported:
|
12
|
+
|
13
|
+
- Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx,
|
14
|
+
.ppt, .pptx)
|
15
|
+
- OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
|
16
|
+
- Apple iWorks Formats
|
17
|
+
- Rich Text Format (.rtf)
|
18
|
+
- Portable Document Format (.pdf)
|
19
|
+
|
20
|
+
For the complete list of supported formats, please visit the Apache Tika
|
21
|
+
[Supported Document Formats](http://tika.apache.org/0.9/formats.html) page.
|
22
|
+
|
23
|
+
## Usage
|
24
|
+
|
25
|
+
Text, metadata and MIME type information can be extracted by calling `Yomu.read` directly:
|
26
|
+
|
27
|
+
```ruby
|
28
|
+
require 'yomu'
|
29
|
+
|
30
|
+
data = File.read 'sample.pages'
|
31
|
+
text = Yomu.read :text, data
|
32
|
+
metadata = Yomu.read :metadata, data
|
33
|
+
mimetype = Yomu.read :mimetype, data
|
34
|
+
```
|
35
|
+
|
36
|
+
### Reading text from a given filename
|
37
|
+
|
38
|
+
Create a new instance of Yomu and pass a filename.
|
39
|
+
|
40
|
+
```ruby
|
41
|
+
yomu = Yomu.new 'sample.pages'
|
42
|
+
text = yomu.text
|
43
|
+
```
|
44
|
+
|
45
|
+
### Reading text from a given URL
|
46
|
+
|
47
|
+
This is useful for reading remote files, like documents hosted on Amazon S3.
|
48
|
+
|
49
|
+
```ruby
|
50
|
+
yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
|
51
|
+
text = yomu.text
|
52
|
+
```
|
53
|
+
|
54
|
+
### Reading text from a stream
|
55
|
+
|
56
|
+
Yomu can also read from a stream or any object that responds to `read`, including file uploads from Ruby on Rails or Sinatra.
|
57
|
+
|
58
|
+
```ruby
|
59
|
+
post '/:name/:filename' do
|
60
|
+
yomu = Yomu.new params[:data][:tempfile]
|
61
|
+
yomu.text
|
62
|
+
end
|
63
|
+
```
|
64
|
+
|
65
|
+
### Reading metadata
|
66
|
+
|
67
|
+
Metadata is returned as a hash.
|
68
|
+
|
69
|
+
```ruby
|
70
|
+
yomu = Yomu.new 'sample.pages'
|
71
|
+
yomu.metadata['Content-Type'] #=> "application/vnd.apple.pages"
|
72
|
+
```
|
73
|
+
|
74
|
+
### Reading MIME types
|
75
|
+
|
76
|
+
MIME type is returned as a MIME::Type object.
|
77
|
+
|
78
|
+
```ruby
|
79
|
+
yomu = Yomu.new 'sample.docx'
|
80
|
+
yomu.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
81
|
+
yomu.mimetype.extensions #=> ['docx']
|
82
|
+
```
|
83
|
+
|
84
|
+
## Installation and Dependencies
|
85
|
+
|
86
|
+
### Java Runtime
|
87
|
+
|
88
|
+
Yomu packages the Apache Tika application jar and requires a working JRE for it to work.
|
89
|
+
|
90
|
+
### Gem
|
91
|
+
|
92
|
+
Add this line to your application's Gemfile:
|
93
|
+
|
94
|
+
gem 'yomu'
|
95
|
+
|
96
|
+
And then execute:
|
97
|
+
|
98
|
+
$ bundle
|
99
|
+
|
100
|
+
Or install it yourself as:
|
101
|
+
|
102
|
+
$ gem install yomu
|
103
|
+
|
104
|
+
## Contributing
|
105
|
+
|
106
|
+
1. Fork it
|
107
|
+
2. Create your feature branch ( `git checkout -b my-new-feature` )
|
108
|
+
3. Create tests and make them pass ( `rake` )
|
109
|
+
4. Commit your changes ( `git commit -am 'Added some feature'` )
|
110
|
+
5. Push to the branch ( `git push origin my-new-feature` )
|
111
|
+
6. Create a new Pull Request
|
data/Rakefile
ADDED
Binary file
|
data/lib/yomu.rb
ADDED
@@ -0,0 +1,274 @@
|
|
1
|
+
require 'yomu/version'
|
2
|
+
|
3
|
+
require 'net/http'
|
4
|
+
require 'mime/types'
|
5
|
+
require 'json'
|
6
|
+
|
7
|
+
require 'socket'
|
8
|
+
require 'stringio'
|
9
|
+
|
10
|
+
class Yomu
|
11
|
+
GEMPATH = File.dirname(File.dirname(__FILE__))
|
12
|
+
JARPATH = File.join(Yomu::GEMPATH, 'jar', 'tika-app-1.11.jar')
|
13
|
+
DEFAULT_SERVER_PORT = 9293 # an arbitrary, but perfectly cromulent, port
|
14
|
+
|
15
|
+
@@server_port = nil
|
16
|
+
@@server_pid = nil
|
17
|
+
|
18
|
+
# Read text or metadata from a data buffer.
|
19
|
+
#
|
20
|
+
# data = File.read 'sample.pages'
|
21
|
+
# text = Yomu.read :text, data
|
22
|
+
# metadata = Yomu.read :metadata, data
|
23
|
+
|
24
|
+
def self.read(type, data)
|
25
|
+
result = @@server_port ? self._server_read(type, data) : self._client_read(type, data)
|
26
|
+
|
27
|
+
case type
|
28
|
+
when :text
|
29
|
+
result
|
30
|
+
when :html
|
31
|
+
result
|
32
|
+
when :metadata
|
33
|
+
JSON.parse(result)
|
34
|
+
when :mimetype
|
35
|
+
MIME::Types[JSON.parse(result)['Content-Type']].first
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
def self._client_read(type, data)
|
40
|
+
switch = case type
|
41
|
+
when :text
|
42
|
+
'-t'
|
43
|
+
when :html
|
44
|
+
'-h'
|
45
|
+
when :metadata
|
46
|
+
'-m -j'
|
47
|
+
when :mimetype
|
48
|
+
'-m -j'
|
49
|
+
end
|
50
|
+
|
51
|
+
IO.popen "#{java} -Djava.awt.headless=true -jar #{Yomu::JARPATH} #{switch}", 'r+' do |io|
|
52
|
+
io.write data
|
53
|
+
io.close_write
|
54
|
+
io.read
|
55
|
+
end
|
56
|
+
end
|
57
|
+
|
58
|
+
|
59
|
+
def self._server_read(_, data)
|
60
|
+
s = TCPSocket.new('localhost', @@server_port)
|
61
|
+
file = StringIO.new(data, 'r')
|
62
|
+
|
63
|
+
while 1
|
64
|
+
chunk = file.read(65536)
|
65
|
+
break unless chunk
|
66
|
+
s.write(chunk)
|
67
|
+
end
|
68
|
+
|
69
|
+
# tell Tika that we're done sending data
|
70
|
+
s.shutdown(Socket::SHUT_WR)
|
71
|
+
|
72
|
+
resp = ''
|
73
|
+
while 1
|
74
|
+
chunk = s.recv(65536)
|
75
|
+
break if chunk.empty? || !chunk
|
76
|
+
resp << chunk
|
77
|
+
end
|
78
|
+
resp
|
79
|
+
ensure
|
80
|
+
s.close unless s.nil?
|
81
|
+
end
|
82
|
+
|
83
|
+
# Create a new instance of Yomu with a given document.
|
84
|
+
#
|
85
|
+
# Using a file path:
|
86
|
+
#
|
87
|
+
# Yomu.new 'sample.pages'
|
88
|
+
#
|
89
|
+
# Using a URL:
|
90
|
+
#
|
91
|
+
# Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
|
92
|
+
#
|
93
|
+
# From a stream or an object which responds to +read+
|
94
|
+
#
|
95
|
+
# Yomu.new File.open('sample.pages')
|
96
|
+
|
97
|
+
def initialize(input)
|
98
|
+
if input.is_a? String
|
99
|
+
if File.exists? input
|
100
|
+
@path = input
|
101
|
+
elsif input =~ URI::regexp
|
102
|
+
@uri = URI.parse input
|
103
|
+
else
|
104
|
+
raise Errno::ENOENT.new "missing file or invalid URI - #{input}"
|
105
|
+
end
|
106
|
+
elsif input.respond_to? :read
|
107
|
+
@stream = input
|
108
|
+
else
|
109
|
+
raise TypeError.new "can't read from #{input.class.name}"
|
110
|
+
end
|
111
|
+
end
|
112
|
+
|
113
|
+
# Returns the text content of the Yomu document.
|
114
|
+
#
|
115
|
+
# yomu = Yomu.new 'sample.pages'
|
116
|
+
# yomu.text
|
117
|
+
|
118
|
+
def text
|
119
|
+
return @text if defined? @text
|
120
|
+
|
121
|
+
@text = Yomu.read :text, data
|
122
|
+
end
|
123
|
+
|
124
|
+
# Returns the text content of the Yomu document in HTML.
|
125
|
+
#
|
126
|
+
# yomu = Yomu.new 'sample.pages'
|
127
|
+
# yomu.html
|
128
|
+
|
129
|
+
def html
|
130
|
+
return @html if defined? @html
|
131
|
+
|
132
|
+
@html = Yomu.read :html, data
|
133
|
+
end
|
134
|
+
|
135
|
+
# Returns the metadata hash of the Yomu document.
|
136
|
+
#
|
137
|
+
# yomu = Yomu.new 'sample.pages'
|
138
|
+
# yomu.metadata['Content-Type']
|
139
|
+
|
140
|
+
def metadata
|
141
|
+
return @metadata if defined? @metadata
|
142
|
+
|
143
|
+
@metadata = Yomu.read :metadata, data
|
144
|
+
end
|
145
|
+
|
146
|
+
# Returns the mimetype object of the Yomu document.
|
147
|
+
#
|
148
|
+
# yomu = Yomu.new 'sample.docx'
|
149
|
+
# yomu.mimetype.content_type #=> 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
|
150
|
+
# yomu.mimetype.extensions #=> ['docx']
|
151
|
+
|
152
|
+
def mimetype
|
153
|
+
return @mimetype if defined? @mimetype
|
154
|
+
|
155
|
+
type = metadata["Content-Type"].is_a?(Array) ? metadata["Content-Type"].first : metadata["Content-Type"]
|
156
|
+
|
157
|
+
@mimetype = MIME::Types[type].first
|
158
|
+
end
|
159
|
+
|
160
|
+
# Returns +true+ if the Yomu document was specified using a file path.
|
161
|
+
#
|
162
|
+
# yomu = Yomu.new 'sample.pages'
|
163
|
+
# yomu.path? #=> true
|
164
|
+
|
165
|
+
|
166
|
+
def creation_date
|
167
|
+
return @creation_date if defined? @creation_date
|
168
|
+
|
169
|
+
if metadata['Creation-Date']
|
170
|
+
@creation_date = Time.parse(metadata['Creation-Date'])
|
171
|
+
else
|
172
|
+
nil
|
173
|
+
end
|
174
|
+
end
|
175
|
+
|
176
|
+
def path?
|
177
|
+
defined? @path
|
178
|
+
end
|
179
|
+
|
180
|
+
# Returns +true+ if the Yomu document was specified using a URI.
|
181
|
+
#
|
182
|
+
# yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
|
183
|
+
# yomu.uri? #=> true
|
184
|
+
|
185
|
+
def uri?
|
186
|
+
defined? @uri
|
187
|
+
end
|
188
|
+
|
189
|
+
# Returns +true+ if the Yomu document was specified from a stream or an object which responds to +read+.
|
190
|
+
#
|
191
|
+
# file = File.open('sample.pages')
|
192
|
+
# yomu = Yomu.new file
|
193
|
+
# yomu.stream? #=> true
|
194
|
+
|
195
|
+
def stream?
|
196
|
+
defined? @stream
|
197
|
+
end
|
198
|
+
|
199
|
+
# Returns the raw/unparsed content of the Yomu document.
|
200
|
+
#
|
201
|
+
# yomu = Yomu.new 'sample.pages'
|
202
|
+
# yomu.data
|
203
|
+
|
204
|
+
def data
|
205
|
+
return @data if defined? @data
|
206
|
+
|
207
|
+
if path?
|
208
|
+
@data = File.read @path
|
209
|
+
elsif uri?
|
210
|
+
@data = Net::HTTP.get @uri
|
211
|
+
elsif stream?
|
212
|
+
@data = @stream.read
|
213
|
+
end
|
214
|
+
|
215
|
+
@data
|
216
|
+
end
|
217
|
+
|
218
|
+
# Returns pid of Tika server, started as a new spawned process.
|
219
|
+
#
|
220
|
+
# type :html, :text or :metadata
|
221
|
+
# custom_port e.g. 9293
|
222
|
+
#
|
223
|
+
# Yomu.server(:text, 9294)
|
224
|
+
#
|
225
|
+
def self.server(type, custom_port=nil)
|
226
|
+
switch = case type
|
227
|
+
when :text
|
228
|
+
'-t'
|
229
|
+
when :html
|
230
|
+
'-h'
|
231
|
+
when :metadata
|
232
|
+
'-m -j'
|
233
|
+
when :mimetype
|
234
|
+
'-m -j'
|
235
|
+
end
|
236
|
+
|
237
|
+
@@server_port = custom_port || DEFAULT_SERVER_PORT
|
238
|
+
|
239
|
+
begin
|
240
|
+
TCPSocket.new('localhost', @@server_port).close
|
241
|
+
rescue Errno::ECONNREFUSED
|
242
|
+
@@server_pid = Process.spawn("#{java} -Djava.awt.headless=true -jar #{Yomu::JARPATH} --server --port #{@@server_port} #{switch}")
|
243
|
+
sleep(2) # Give the server 2 seconds to spin up.
|
244
|
+
@@server_pid
|
245
|
+
end
|
246
|
+
end
|
247
|
+
|
248
|
+
# Kills server started by Yomu.server
|
249
|
+
#
|
250
|
+
# Always run this when you're done, or else Tika might run until you kill it manually
|
251
|
+
# You might try putting your extraction in a begin..rescue...ensure...end block and
|
252
|
+
# putting this method in the ensure block.
|
253
|
+
#
|
254
|
+
# Yomu.server(:text)
|
255
|
+
# reports = ["report1.docx", "report2.doc", "report3.pdf"]
|
256
|
+
# begin
|
257
|
+
# my_texts = reports.map{|report_path| Yomu.new(report_path).text }
|
258
|
+
# rescue
|
259
|
+
# ensure
|
260
|
+
# Yomu.kill_server!
|
261
|
+
# end
|
262
|
+
def self.kill_server!
|
263
|
+
if @@server_pid
|
264
|
+
Process.kill('INT', @@server_pid)
|
265
|
+
@@server_pid = nil
|
266
|
+
@@server_port = nil
|
267
|
+
end
|
268
|
+
end
|
269
|
+
|
270
|
+
def self.java
|
271
|
+
ENV['JAVA_HOME'] ? ENV['JAVA_HOME'] + '/bin/java' : 'java'
|
272
|
+
end
|
273
|
+
private_class_method :java
|
274
|
+
end
|
data/lib/yomu/version.rb
ADDED
data/spec/helper.rb
ADDED
Binary file
|
Binary file
|
Binary file
|
Binary file
|
data/spec/yomu_spec.rb
ADDED
@@ -0,0 +1,182 @@
|
|
1
|
+
require 'helper.rb'
|
2
|
+
require 'yomu'
|
3
|
+
|
4
|
+
describe Yomu do
|
5
|
+
let(:data) { File.read 'spec/samples/sample.docx' }
|
6
|
+
|
7
|
+
before do
|
8
|
+
ENV['JAVA_HOME'] = nil
|
9
|
+
end
|
10
|
+
|
11
|
+
describe '.read' do
|
12
|
+
it 'reads text' do
|
13
|
+
text = Yomu.read :text, data
|
14
|
+
|
15
|
+
expect( text ).to include 'The quick brown fox jumped over the lazy cat.'
|
16
|
+
end
|
17
|
+
|
18
|
+
it 'reads metadata' do
|
19
|
+
metadata = Yomu.read :metadata, data
|
20
|
+
|
21
|
+
expect( metadata['Content-Type'] ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
|
22
|
+
end
|
23
|
+
|
24
|
+
it 'reads metadata values with colons as strings' do
|
25
|
+
data = File.read 'spec/samples/sample-metadata-values-with-colons.doc'
|
26
|
+
metadata = Yomu.read :metadata, data
|
27
|
+
|
28
|
+
expect( metadata['dc:title'] ).to eql 'problem: test'
|
29
|
+
end
|
30
|
+
|
31
|
+
it 'reads mimetype' do
|
32
|
+
mimetype = Yomu.read :mimetype, data
|
33
|
+
|
34
|
+
expect( mimetype.content_type ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
|
35
|
+
expect( mimetype.extensions ).to include 'docx'
|
36
|
+
end
|
37
|
+
end
|
38
|
+
|
39
|
+
describe '.new' do
|
40
|
+
it 'requires parameters' do
|
41
|
+
expect { Yomu.new }.to raise_error ArgumentError
|
42
|
+
end
|
43
|
+
|
44
|
+
it 'accepts a root path' do
|
45
|
+
yomu = Yomu.new 'spec/samples/sample.pages'
|
46
|
+
|
47
|
+
expect( yomu ).to be_path
|
48
|
+
expect( yomu ).not_to be_uri
|
49
|
+
expect( yomu ).not_to be_stream
|
50
|
+
end
|
51
|
+
|
52
|
+
it 'accepts a relative path' do
|
53
|
+
yomu = Yomu.new 'spec/samples/sample.pages'
|
54
|
+
|
55
|
+
expect( yomu ).to be_path
|
56
|
+
expect( yomu ).not_to be_uri
|
57
|
+
expect( yomu ).not_to be_stream
|
58
|
+
end
|
59
|
+
|
60
|
+
it 'accepts a path with spaces' do
|
61
|
+
yomu = Yomu.new 'spec/samples/sample filename with spaces.pages'
|
62
|
+
|
63
|
+
expect( yomu ).to be_path
|
64
|
+
expect( yomu ).not_to be_uri
|
65
|
+
expect( yomu ).not_to be_stream
|
66
|
+
end
|
67
|
+
|
68
|
+
it 'accepts a URI' do
|
69
|
+
yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
|
70
|
+
|
71
|
+
expect( yomu ).to be_uri
|
72
|
+
expect( yomu ).not_to be_path
|
73
|
+
expect( yomu ).not_to be_stream
|
74
|
+
end
|
75
|
+
|
76
|
+
it 'accepts a stream or object that can be read' do
|
77
|
+
File.open 'spec/samples/sample.pages', 'r' do |file|
|
78
|
+
yomu = Yomu.new file
|
79
|
+
|
80
|
+
expect( yomu ).to be_stream
|
81
|
+
expect( yomu ).not_to be_path
|
82
|
+
expect( yomu ).not_to be_uri
|
83
|
+
end
|
84
|
+
end
|
85
|
+
|
86
|
+
it 'refuses a path to a missing file' do
|
87
|
+
expect { Yomu.new 'test/sample/missing.pages'}.to raise_error Errno::ENOENT
|
88
|
+
end
|
89
|
+
|
90
|
+
it 'refuses other objects' do
|
91
|
+
[nil, 1, 1.1].each do |object|
|
92
|
+
expect { Yomu.new object }.to raise_error TypeError
|
93
|
+
end
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
|
98
|
+
describe '.creation_date' do
|
99
|
+
let(:yomu) { Yomu.new 'spec/samples/sample.pages' }
|
100
|
+
it 'should retur Time' do
|
101
|
+
expect( yomu.creation_date ).to be_a Time
|
102
|
+
end
|
103
|
+
end
|
104
|
+
|
105
|
+
describe '.java' do
|
106
|
+
specify 'with no specified JAVA_HOME' do
|
107
|
+
expect( Yomu.send(:java) ).to eql 'java'
|
108
|
+
end
|
109
|
+
|
110
|
+
specify 'with a specified JAVA_HOME' do
|
111
|
+
ENV['JAVA_HOME'] = '/path/to/java/home'
|
112
|
+
|
113
|
+
expect( Yomu.send(:java) ).to eql '/path/to/java/home/bin/java'
|
114
|
+
end
|
115
|
+
end
|
116
|
+
|
117
|
+
context 'initialized with a given path' do
|
118
|
+
let(:yomu) { Yomu.new 'spec/samples/sample.pages' }
|
119
|
+
|
120
|
+
specify '#text reads text' do
|
121
|
+
expect( yomu.text).to include 'The quick brown fox jumped over the lazy cat.'
|
122
|
+
end
|
123
|
+
|
124
|
+
specify '#metadata reads metadata' do
|
125
|
+
expect( yomu.metadata['Content-Type'] ).to eql ["application/vnd.apple.pages", "application/vnd.apple.pages"]
|
126
|
+
end
|
127
|
+
end
|
128
|
+
|
129
|
+
context 'initialized with a given URI' do
|
130
|
+
let(:yomu) { Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx' }
|
131
|
+
|
132
|
+
specify '#text reads text' do
|
133
|
+
expect( yomu.text ).to include 'Lorem ipsum dolor sit amet, consectetuer adipiscing elit.'
|
134
|
+
end
|
135
|
+
|
136
|
+
specify '#metadata reads metadata' do
|
137
|
+
expect( yomu.metadata['Content-Type'] ).to eql 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
|
138
|
+
end
|
139
|
+
end
|
140
|
+
|
141
|
+
context 'initialized with a given stream' do
|
142
|
+
let(:yomu) { Yomu.new File.open('spec/samples/sample.pages', 'rb') }
|
143
|
+
|
144
|
+
specify '#text reads text' do
|
145
|
+
expect( yomu.text ).to include 'The quick brown fox jumped over the lazy cat.'
|
146
|
+
end
|
147
|
+
|
148
|
+
specify '#metadata reads metadata' do
|
149
|
+
expect( yomu.metadata['Content-Type'] ).to eql ["application/vnd.apple.pages", "application/vnd.apple.pages"]
|
150
|
+
end
|
151
|
+
end
|
152
|
+
|
153
|
+
context 'working as server mode' do
|
154
|
+
specify '#starts and kills server' do
|
155
|
+
begin
|
156
|
+
Yomu.server(:text)
|
157
|
+
expect(Yomu.class_variable_get(:@@server_pid)).not_to be_nil
|
158
|
+
expect(Yomu.class_variable_get(:@@server_port)).not_to be_nil
|
159
|
+
|
160
|
+
s = TCPSocket.new('localhost', Yomu.class_variable_get(:@@server_port))
|
161
|
+
expect(s).to be_a TCPSocket
|
162
|
+
s.close
|
163
|
+
ensure
|
164
|
+
port = Yomu.class_variable_get(:@@server_port)
|
165
|
+
Yomu.kill_server!
|
166
|
+
sleep 2
|
167
|
+
expect { TCPSocket.new('localhost', port) }.to raise_error Errno::ECONNREFUSED
|
168
|
+
end
|
169
|
+
end
|
170
|
+
|
171
|
+
specify '#runs samples through server mode' do
|
172
|
+
begin
|
173
|
+
Yomu.server(:text)
|
174
|
+
expect(Yomu.new('spec/samples/sample.pages').text).to include 'The quick brown fox jumped over the lazy cat.'
|
175
|
+
expect(Yomu.new('spec/samples/sample filename with spaces.pages').text).to include 'The quick brown fox jumped over the lazy cat.'
|
176
|
+
expect(Yomu.new('spec/samples/sample.docx').text).to include 'The quick brown fox jumped over the lazy cat.'
|
177
|
+
ensure
|
178
|
+
Yomu.kill_server!
|
179
|
+
end
|
180
|
+
end
|
181
|
+
end
|
182
|
+
end
|
data/yomu2.gemspec
ADDED
@@ -0,0 +1,27 @@
|
|
1
|
+
# coding: utf-8
|
2
|
+
lib = File.expand_path('../lib', __FILE__)
|
3
|
+
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
|
4
|
+
require 'yomu/version'
|
5
|
+
|
6
|
+
Gem::Specification.new do |spec|
|
7
|
+
spec.name = 'yomu2'
|
8
|
+
spec.version = Yomu::VERSION
|
9
|
+
spec.authors = ['Erol Fornoles', 'Diego Silva']
|
10
|
+
spec.email = ['erol.fornoles@gmail.com', 'diego@alpha-exchange.com']
|
11
|
+
spec.description = %q{Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)}
|
12
|
+
spec.summary = %q{Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)}
|
13
|
+
spec.homepage = 'http://erol.github.com/yomu'
|
14
|
+
spec.license = 'MIT'
|
15
|
+
|
16
|
+
spec.files = `git ls-files`.split($/)
|
17
|
+
spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
|
18
|
+
spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
|
19
|
+
spec.require_paths = ['lib']
|
20
|
+
|
21
|
+
spec.add_runtime_dependency 'mime-types', '~> 1.23'
|
22
|
+
spec.add_runtime_dependency 'json', '~> 1.8'
|
23
|
+
|
24
|
+
spec.add_development_dependency 'bundler', '~> 1.3'
|
25
|
+
spec.add_development_dependency 'rake'
|
26
|
+
spec.add_development_dependency 'rspec', '~> 3.4'
|
27
|
+
end
|
metadata
ADDED
@@ -0,0 +1,142 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: yomu2
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.3.0
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Erol Fornoles
|
8
|
+
- Diego Silva
|
9
|
+
autorequire:
|
10
|
+
bindir: bin
|
11
|
+
cert_chain: []
|
12
|
+
date: 2017-02-09 00:00:00.000000000 Z
|
13
|
+
dependencies:
|
14
|
+
- !ruby/object:Gem::Dependency
|
15
|
+
name: mime-types
|
16
|
+
requirement: !ruby/object:Gem::Requirement
|
17
|
+
requirements:
|
18
|
+
- - "~>"
|
19
|
+
- !ruby/object:Gem::Version
|
20
|
+
version: '1.23'
|
21
|
+
type: :runtime
|
22
|
+
prerelease: false
|
23
|
+
version_requirements: !ruby/object:Gem::Requirement
|
24
|
+
requirements:
|
25
|
+
- - "~>"
|
26
|
+
- !ruby/object:Gem::Version
|
27
|
+
version: '1.23'
|
28
|
+
- !ruby/object:Gem::Dependency
|
29
|
+
name: json
|
30
|
+
requirement: !ruby/object:Gem::Requirement
|
31
|
+
requirements:
|
32
|
+
- - "~>"
|
33
|
+
- !ruby/object:Gem::Version
|
34
|
+
version: '1.8'
|
35
|
+
type: :runtime
|
36
|
+
prerelease: false
|
37
|
+
version_requirements: !ruby/object:Gem::Requirement
|
38
|
+
requirements:
|
39
|
+
- - "~>"
|
40
|
+
- !ruby/object:Gem::Version
|
41
|
+
version: '1.8'
|
42
|
+
- !ruby/object:Gem::Dependency
|
43
|
+
name: bundler
|
44
|
+
requirement: !ruby/object:Gem::Requirement
|
45
|
+
requirements:
|
46
|
+
- - "~>"
|
47
|
+
- !ruby/object:Gem::Version
|
48
|
+
version: '1.3'
|
49
|
+
type: :development
|
50
|
+
prerelease: false
|
51
|
+
version_requirements: !ruby/object:Gem::Requirement
|
52
|
+
requirements:
|
53
|
+
- - "~>"
|
54
|
+
- !ruby/object:Gem::Version
|
55
|
+
version: '1.3'
|
56
|
+
- !ruby/object:Gem::Dependency
|
57
|
+
name: rake
|
58
|
+
requirement: !ruby/object:Gem::Requirement
|
59
|
+
requirements:
|
60
|
+
- - ">="
|
61
|
+
- !ruby/object:Gem::Version
|
62
|
+
version: '0'
|
63
|
+
type: :development
|
64
|
+
prerelease: false
|
65
|
+
version_requirements: !ruby/object:Gem::Requirement
|
66
|
+
requirements:
|
67
|
+
- - ">="
|
68
|
+
- !ruby/object:Gem::Version
|
69
|
+
version: '0'
|
70
|
+
- !ruby/object:Gem::Dependency
|
71
|
+
name: rspec
|
72
|
+
requirement: !ruby/object:Gem::Requirement
|
73
|
+
requirements:
|
74
|
+
- - "~>"
|
75
|
+
- !ruby/object:Gem::Version
|
76
|
+
version: '3.4'
|
77
|
+
type: :development
|
78
|
+
prerelease: false
|
79
|
+
version_requirements: !ruby/object:Gem::Requirement
|
80
|
+
requirements:
|
81
|
+
- - "~>"
|
82
|
+
- !ruby/object:Gem::Version
|
83
|
+
version: '3.4'
|
84
|
+
description: Read text and metadata from files and documents (.doc, .docx, .pages,
|
85
|
+
.odt, .rtf, .pdf)
|
86
|
+
email:
|
87
|
+
- erol.fornoles@gmail.com
|
88
|
+
- diego@alpha-exchange.com
|
89
|
+
executables: []
|
90
|
+
extensions: []
|
91
|
+
extra_rdoc_files: []
|
92
|
+
files:
|
93
|
+
- ".gitignore"
|
94
|
+
- ".rspec"
|
95
|
+
- ".travis.yml"
|
96
|
+
- Gemfile
|
97
|
+
- LICENSE
|
98
|
+
- NOTICE.txt
|
99
|
+
- README.md
|
100
|
+
- Rakefile
|
101
|
+
- jar/tika-app-1.11.jar
|
102
|
+
- lib/yomu.rb
|
103
|
+
- lib/yomu/version.rb
|
104
|
+
- spec/helper.rb
|
105
|
+
- spec/samples/sample filename with spaces.pages
|
106
|
+
- spec/samples/sample-metadata-values-with-colons.doc
|
107
|
+
- spec/samples/sample.docx
|
108
|
+
- spec/samples/sample.pages
|
109
|
+
- spec/yomu_spec.rb
|
110
|
+
- yomu2.gemspec
|
111
|
+
homepage: http://erol.github.com/yomu
|
112
|
+
licenses:
|
113
|
+
- MIT
|
114
|
+
metadata: {}
|
115
|
+
post_install_message:
|
116
|
+
rdoc_options: []
|
117
|
+
require_paths:
|
118
|
+
- lib
|
119
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
120
|
+
requirements:
|
121
|
+
- - ">="
|
122
|
+
- !ruby/object:Gem::Version
|
123
|
+
version: '0'
|
124
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
125
|
+
requirements:
|
126
|
+
- - ">="
|
127
|
+
- !ruby/object:Gem::Version
|
128
|
+
version: '0'
|
129
|
+
requirements: []
|
130
|
+
rubyforge_project:
|
131
|
+
rubygems_version: 2.5.1
|
132
|
+
signing_key:
|
133
|
+
specification_version: 4
|
134
|
+
summary: Read text and metadata from files and documents (.doc, .docx, .pages, .odt,
|
135
|
+
.rtf, .pdf)
|
136
|
+
test_files:
|
137
|
+
- spec/helper.rb
|
138
|
+
- spec/samples/sample filename with spaces.pages
|
139
|
+
- spec/samples/sample-metadata-values-with-colons.doc
|
140
|
+
- spec/samples/sample.docx
|
141
|
+
- spec/samples/sample.pages
|
142
|
+
- spec/yomu_spec.rb
|