opener-tokenizer 1.1.2 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: f9af1f4a79564c201746a78973f5fc3cdfe272f1
- data.tar.gz: 41e00e0819e7a16aa5d9e509fb4f28751811b292
+ metadata.gz: dab5882d292da38d032ad2de7dfb8c289372428b
+ data.tar.gz: 19ee04d381d370e64d606dc41e3472296980df67
  SHA512:
- metadata.gz: 563674a5a7f855eff87bd6659a5688f35502c0790fbde1797dc1b27f8f45f03f4c42ca8da2dfbe501da6c5a8db7523bc4cd308adf765c0e4a754e3874f1563c8
- data.tar.gz: e716377afcbc496894a89b95aeec064c89402c60839fb1213dfce2990a2baaab083e0f46950a1d8791c54a1028ac4c7faefd00dd49ee36b463af90204b55602d
+ metadata.gz: dcdb9e6f44524b5a1a23aa54dea0ed68e4b048b5b0b929c0964aeb576d2eecdc2877a0d054cd73e8afd67ad936016a5e044ae9b95a507219a9f930a7bf4775ec
+ data.tar.gz: e8660e048b3b58049bed277b207a69a5c763eac89049c251b750e0d4e111bb003030987540758686f175f4cfd2be76959a3534262faf57b5475f7982286a966c
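
The digests recorded in checksums.yaml can be compared against a locally unpacked gem. The following is a minimal sketch, assuming the `.gem` archive has already been extracted so that `metadata.gz` and `data.tar.gz` exist on disk. The helper name `checksum_matches?` is hypothetical and not part of RubyGems or this gem.

```ruby
require 'digest/sha2'

# Hypothetical helper: returns true when the SHA512 digest of the file at
# `path` equals `expected`, a lowercase hex string such as the ones listed in
# checksums.yaml.
def checksum_matches?(path, expected)
  Digest::SHA512.file(path).hexdigest == expected
end
```

Calling `checksum_matches?('data.tar.gz', expected)` with the SHA512 value from the `+ data.tar.gz` line should return `true` for an unmodified 2.0.0 archive; the file path here is an assumption about where the archive was unpacked.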
data/LICENSE.txt ADDED
@@ -0,0 +1,13 @@
+ Copyright 2014 OpeNER Project Consortium
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
data/README.md CHANGED
@@ -1,11 +1,15 @@
  Introduction
  ------------
 
- The tokenizer tokenizes a text into sentences and words.
+ The tokenizer tokenizes a text into sentences and words.
 
  ### Confused by some terminology?
 
- This software is part of a larger collection of natural language processing tools known as "the OpeNER project". You can find more information about the project at [the OpeNER portal](http://opener-project.github.io). There you can also find references to terms like KAF (an XML standard to represent linguistic annotations in texts), component, cores, scenario's and pipelines.
+ This software is part of a larger collection of natural language processing
+ tools known as "the OpeNER project". You can find more information about the
+ project at [the OpeNER portal](http://opener-project.github.io). There you can
+ also find references to terms like KAF (an XML standard to represent linguistic
+ annotations in texts), component, cores, scenario's and pipelines.
 
  Quick Use Example
  -----------------
@@ -20,13 +24,14 @@ output KAF by default.
 
  ### Command line interface
 
- You should now be able to call the tokenizer as a regular shell
- command: by its name. Once installed the gem normally sits in your path so you can call it directly from anywhere.
+ You should now be able to call the tokenizer as a regular shell command: by its
+ name. Once installed the gem normally sits in your path so you can call it
+ directly from anywhere.
 
  Tokenizing some text:
 
  echo "This is English text" | tokenizer -l en --no-kaf
-
+
  Will result in
 
  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
@@ -45,11 +50,13 @@ Will result in
  </text>
  </KAF>
 
- The available languages for tokenization are: English (en), German (de), Dutch (nl), French (fr), Spanish (es), Italian (it)
+ The available languages for tokenization are: English (en), German (de), Dutch
+ (nl), French (fr), Spanish (es), Italian (it)
 
  #### KAF input format
 
- The tokenizer is capable of taking KAF as input, and actually does so by default. You can do so like this:
+ The tokenizer is capable of taking KAF as input, and actually does so by
+ default. You can do so like this:
 
  echo "<?xml version='1.0' encoding='UTF-8' standalone='no'?><KAF version='v1.opener' xml:lang='en'><raw>This is what I call, a test!</raw></KAF>" | tokenizer
 
@@ -72,7 +79,8 @@ Will result in
  </text>
  </KAF>
 
- If the argument -k (--kaf) is passed, then the argument -l (--language) is ignored.
+ If the argument -k (--kaf) is passed, then the argument -l (--language) is
+ ignored.
 
  ### Webservices
 
@@ -80,7 +88,8 @@ You can launch a language identification webservice by executing:
 
  tokenizer-server
 
- This will launch a mini webserver with the webservice. It defaults to port 9292, so you can access it at <http://localhost:9292>.
+ This will launch a mini webserver with the webservice. It defaults to port 9292,
+ so you can access it at <http://localhost:9292>.
 
  To launch it on a different port provide the `-p [port-number]` option like this:
 
@@ -88,19 +97,25 @@ To launch it on a different port provide the `-p [port-number]` option like this
 
  It then launches at <http://localhost:1234>
 
- Documentation on the Webservice is provided by surfing to the urls provided above. For more information on how to launch a webservice run the command with the ```-h``` option.
+ Documentation on the Webservice is provided by surfing to the urls provided
+ above. For more information on how to launch a webservice run the command with
+ the `--help` option.
 
 
  ### Daemon
 
- Last but not least the tokenizer comes shipped with a daemon that can read jobs (and write) jobs to and from Amazon SQS queues. For more information type:
+ Last but not least the tokenizer comes shipped with a daemon that can read jobs
+ (and write) jobs to and from Amazon SQS queues. For more information type:
 
- tokenizer-daemon -h
+ tokenizer-daemon --help
 
  Description of dependencies
  ---------------------------
 
- This component runs best if you run it in an environment suited for OpeNER components. You can find an installation guide and helper tools in the [OpeNER installer](https://github.com/opener-project/opener-installer) and [an installation guide on the Opener Website](http://opener-project.github.io/getting-started/how-to/local-installation.html)
+ This component runs best if you run it in an environment suited for OpeNER
+ components. You can find an installation guide and helper tools in the
+ [OpeNER installer](https://github.com/opener-project/opener-installer) and
+ [an installation guide on the Opener Website](http://opener-project.github.io/getting-started/how-to/local-installation.html).
 
  At least you need the following system setup:
 
@@ -113,16 +128,20 @@ At least you need the following system setup:
 
  * Maven (for building the Gem)
 
-
  Language Extension
  ------------------
 
- The tokenizer module is a wrapping around a Perl script, which performs the actual tokenization based on rules (when to break a character sequence). The tokenizer already supports a lot of languages. Have a look to the core script to figure out how to extend to new languages.
+ The tokenizer module is a wrapping around a Perl script, which performs the
+ actual tokenization based on rules (when to break a character sequence). The
+ tokenizer already supports a lot of languages. Have a look to the core script to
+ figure out how to extend to new languages.
 
  The Core
  --------
 
- The component is a fat wrapper around the actual language technology core. The core is a rule based tokenizer implemented in Perl. You can find the core technologies in the following repositories:
+ The component is a fat wrapper around the actual language technology core. The
+ core is a rule based tokenizer implemented in Perl. You can find the core
+ technologies in the following repositories:
 
  * [tokenizer-base](http://github.com/opener-project/tokenizer-base)
 
@@ -135,9 +154,8 @@ Where to go from here
  Report problem/Get help
  -----------------------
 
- If you encounter problems, please email <support@opener-project.eu> or leave an issue in the
- [issue tracker](https://github.com/opener-project/tokenizer/issues).
-
+ If you encounter problems, please email <support@opener-project.eu> or leave an
+ issue in the [issue tracker](https://github.com/opener-project/tokenizer/issues).
 
  Contributing
  ------------
data/bin/tokenizer-daemon CHANGED
@@ -1,9 +1,10 @@
  #!/usr/bin/env ruby
- #
- require 'rubygems'
+
  require 'opener/daemons'
 
- exec_path = File.expand_path("../../exec/tokenizer.rb", __FILE__)
- Opener::Daemons::Controller.new(:name=>"tokenizer",
- :exec_path=>exec_path)
+ controller = Opener::Daemons::Controller.new(
+ :name => 'opener-tokenizer',
+ :exec_path => File.expand_path('../../exec/tokenizer.rb', __FILE__)
+ )
 
+ controller.run
data/bin/tokenizer-server CHANGED
@@ -1,8 +1,10 @@
  #!/usr/bin/env ruby
 
- require 'puma/cli'
+ require 'opener/webservice'
 
- rack_config = File.expand_path('../../config.ru', __FILE__)
+ parser = Opener::Webservice::OptionParser.new(
+ 'opener-tokenizer',
+ File.expand_path('../../config.ru', __FILE__)
+ )
 
- cli = Puma::CLI.new([rack_config] + ARGV)
- cli.run
+ parser.run
data/exec/tokenizer.rb CHANGED
@@ -1,8 +1,9 @@
  #!/usr/bin/env ruby
- #
+
  require 'opener/daemons'
- require 'opener/tokenizer'
 
- options = Opener::Daemons::OptParser.parse!(ARGV)
- daemon = Opener::Daemons::Daemon.new(Opener::Tokenizer, options)
+ require_relative '../lib/opener/tokenizer'
+
+ daemon = Opener::Daemons::Daemon.new(Opener::Tokenizer)
+
  daemon.start
@@ -2,7 +2,6 @@ require 'opener/tokenizers/base'
  require 'nokogiri'
  require 'open3'
  require 'optparse'
- require 'opener/core'
 
  require_relative 'tokenizer/version'
  require_relative 'tokenizer/cli'
@@ -41,8 +40,10 @@ module Opener
  #
  # @option options [Array] :args Collection of arbitrary arguments to pass
  # to the individual tokenizer commands.
+ #
  # @option options [String] :language The language to use for the
  # tokenization process.
+ #
  # @option options [TrueClass|FalseClass] :kaf When set to `true` the input
  # is assumed to be KAF.
  #
@@ -64,19 +65,21 @@ module Opener
  else
  language = options[:language]
  end
-
+
  unless valid_language?(language)
  raise ArgumentError, "The specified language (#{language}) is invalid"
  end
-
+
  kernel = language_constant(language).new(:args => options[:args])
-
- stdout, stderr, process = Open3.capture3(*kernel.command.split(" "), :stdin_data => input)
+
+ stdout, stderr, process = Open3.capture3(
+ *kernel.command.split(" "),
+ :stdin_data => input
+ )
+
  raise stderr unless process.success?
+
  return stdout
-
- rescue Exception => error
- return Opener::Core::ErrorLayer.new(input, error.message, self.class).add
  end
  end
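
With the `rescue Exception` branch removed in this hunk, a failing tokenizer command is no longer folded into the output through `Opener::Core::ErrorLayer`; the error raised from the captured STDERR reaches the caller instead. The sketch below reproduces only the `Open3.capture3` pattern from the hunk. It uses the running Ruby interpreter as a stand-in for the Perl tokenizer command, and `run_command` is a hypothetical helper that is not part of the gem.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper mirroring the 2.0.0 control flow: run a command, feed it
# `input` on STDIN, and raise the captured STDERR when the exit status is
# non-zero.
def run_command(argv, input)
  stdout, stderr, process = Open3.capture3(*argv, :stdin_data => input)

  raise stderr unless process.success?

  stdout
end

# Stand-in commands (assumptions for illustration, not the real tokenizer).
upcase_cmd  = [RbConfig.ruby, '-e', 'print STDIN.read.upcase']
failing_cmd = [RbConfig.ruby, '-e', 'STDERR.print "boom"; exit 1']

puts run_command(upcase_cmd, 'this is english text')

begin
  run_command(failing_cmd, '')
rescue RuntimeError => error
  # Callers now have to handle the failure themselves.
  puts "tokenization failed: #{error.message}"
end
```

Code that previously relied on receiving an error document back from the tokenizer would, under this reading of the diff, need its own `rescue` around the call.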
 
@@ -1,5 +1,3 @@
- require 'sinatra/base'
- require 'httpclient'
  require 'opener/webservice'
 
  module Opener
@@ -7,10 +5,11 @@ module Opener
  ##
  # Text tokenizer server powered by Sinatra.
  #
- class Server < Webservice
+ class Server < Webservice::Server
  set :views, File.expand_path('../views', __FILE__)
- text_processor Tokenizer
- accepted_params :input, :kaf, :language
+
+ self.text_processor = Tokenizer
+ self.accepted_params = [:input, :kaf, :language]
  end # Server
  end # Tokenizer
  end # Opener
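
The server class now configures itself through attribute assignment (`self.text_processor = ...`) rather than the macro-style calls used in 1.1.2. The sketch below shows how such class-level setters behave in plain Ruby. `ExampleBase`, `ExampleProcessor` and `ExampleServer` are hypothetical stand-ins; how opener-webservice actually defines these attributes is not shown in this diff and may differ.

```ruby
# Hypothetical base class; the real Opener::Webservice::Server may differ.
class ExampleBase
  class << self
    # Class-level accessors, so subclasses can write `self.text_processor = X`.
    attr_accessor :text_processor, :accepted_params
  end
end

# Stand-in for Opener::Tokenizer.
class ExampleProcessor; end

class ExampleServer < ExampleBase
  # `self.` is required: a bare `text_processor = ...` would create a local
  # variable instead of calling the setter.
  self.text_processor  = ExampleProcessor
  self.accepted_params = [:input, :kaf, :language]
end

puts ExampleServer.text_processor
puts ExampleServer.accepted_params.inspect
```

In this sketch the values are stored on the subclass itself, so `ExampleBase.text_processor` stays `nil`.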
@@ -1,5 +1,5 @@
  module Opener
  class Tokenizer
- VERSION = "1.1.2"
+ VERSION = '2.0.0'
  end
  end
@@ -9,6 +9,8 @@ Gem::Specification.new do |gem|
  gem.homepage = 'http://opener-project.github.com/'
  gem.has_rdoc = "yard"
 
+ gem.license = 'Apache 2.0'
+
  gem.required_ruby_version = '>= 1.9.2'
 
  gem.files = Dir.glob([
@@ -16,20 +18,17 @@ Gem::Specification.new do |gem|
  'lib/**/*',
  'config.ru',
  '*.gemspec',
- 'README.md'
+ 'README.md',
+ 'LICENSE.txt'
  ]).select { |file| File.file?(file) }
 
  gem.executables = Dir.glob('bin/*').map { |file| File.basename(file) }
 
- gem.add_dependency 'opener-tokenizer-base', '>= 0.3.1'
- gem.add_dependency 'opener-webservice'
-
  gem.add_dependency 'nokogiri'
- gem.add_dependency 'sinatra', '~>1.4.2'
- gem.add_dependency 'httpclient'
- gem.add_dependency 'opener-daemons'
- gem.add_dependency 'opener-core', '>= 1.0.2'
- gem.add_dependency 'puma'
+ gem.add_dependency 'opener-tokenizer-base', '~> 1.0'
+ gem.add_dependency 'opener-webservice', '~> 2.1'
+ gem.add_dependency 'opener-daemons', '~> 2.1'
+ gem.add_dependency 'opener-core', '~> 2.0'
 
  gem.add_development_dependency 'rspec'
  gem.add_development_dependency 'cucumber'
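
The dependency lines in this hunk move from open-ended (`>=`) or missing constraints to pessimistic ones (`~>`). As a small illustration of what a constraint such as `'~> 2.1'` admits, the sketch below asks RubyGems directly through `Gem::Requirement`; the candidate version numbers are made up for the example and do not refer to actual opener-webservice releases.

```ruby
require 'rubygems'

requirement = Gem::Requirement.new('~> 2.1')

# '~> 2.1' accepts 2.1 and later 2.x releases, but not 3.0 or anything older.
%w[2.0.9 2.1.0 2.5.3 3.0.0].each do |version|
  allowed = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{allowed}"
end
```

This prints `false`, `true`, `true`, `false` for the four candidates, which is why a future 3.x release of a dependency would not be picked up automatically by this gemspec.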
metadata CHANGED
@@ -1,185 +1,143 @@
  --- !ruby/object:Gem::Specification
  name: opener-tokenizer
  version: !ruby/object:Gem::Version
- version: 1.1.2
+ version: 2.0.0
  platform: ruby
  authors:
  - development@olery.com
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2014-06-19 00:00:00.000000000 Z
+ date: 2014-11-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
- name: opener-tokenizer-base
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: 0.3.1
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: 0.3.1
- prerelease: false
- type: :runtime
- - !ruby/object:Gem::Dependency
- name: opener-webservice
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: '0'
+ name: nokogiri
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- prerelease: false
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: nokogiri
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: opener-tokenizer-base
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
- prerelease: false
+ version: '1.0'
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: sinatra
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - ~>
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 1.4.2
+ version: '1.0'
+ - !ruby/object:Gem::Dependency
+ name: opener-webservice
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - ~>
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 1.4.2
- prerelease: false
+ version: '2.1'
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: httpclient
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: '0'
- prerelease: false
- type: :runtime
+ version: '2.1'
  - !ruby/object:Gem::Dependency
  name: opener-daemons
- version_requirements: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: '0'
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
- prerelease: false
+ version: '2.1'
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: opener-core
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 1.0.2
+ version: '2.1'
+ - !ruby/object:Gem::Dependency
+ name: opener-core
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: 1.0.2
- prerelease: false
+ version: '2.0'
  type: :runtime
- - !ruby/object:Gem::Dependency
- name: puma
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - "~>"
  - !ruby/object:Gem::Version
- version: '0'
+ version: '2.0'
+ - !ruby/object:Gem::Dependency
+ name: rspec
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ type: :development
  prerelease: false
- type: :runtime
- - !ruby/object:Gem::Dependency
- name: rspec
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: cucumber
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- prerelease: false
  type: :development
- - !ruby/object:Gem::Dependency
- name: cucumber
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: pry
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- prerelease: false
  type: :development
- - !ruby/object:Gem::Dependency
- name: pry
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: rake
  requirement: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- prerelease: false
  type: :development
- - !ruby/object:Gem::Dependency
- name: rake
+ prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
- - !ruby/object:Gem::Version
- version: '0'
- requirement: !ruby/object:Gem::Requirement
- requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
- prerelease: false
- type: :development
  description: Gem that wraps up the the tokenizer cores
- email:
+ email:
  executables:
  - tokenizer
  - tokenizer-daemon
@@ -187,6 +145,7 @@ executables:
  extensions: []
  extra_rdoc_files: []
  files:
+ - LICENSE.txt
  - README.md
  - bin/tokenizer
  - bin/tokenizer-daemon
@@ -202,26 +161,28 @@ files:
  - lib/opener/tokenizer/views/result.erb
  - opener-tokenizer.gemspec
  homepage: http://opener-project.github.com/
- licenses: []
+ licenses:
+ - Apache 2.0
  metadata: {}
- post_install_message:
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: 1.9.2
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
- - - '>='
+ - - ">="
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubyforge_project:
+ rubyforge_project:
  rubygems_version: 2.2.2
- signing_key:
+ signing_key:
  specification_version: 4
  summary: Gem that wraps up the the tokenizer cores
  test_files: []
+ has_rdoc: yard