opener-tokenizer 1.1.2 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz: f9af1f4a79564c201746a78973f5fc3cdfe272f1
-  data.tar.gz: 41e00e0819e7a16aa5d9e509fb4f28751811b292
+  metadata.gz: dab5882d292da38d032ad2de7dfb8c289372428b
+  data.tar.gz: 19ee04d381d370e64d606dc41e3472296980df67
 SHA512:
-  metadata.gz: 563674a5a7f855eff87bd6659a5688f35502c0790fbde1797dc1b27f8f45f03f4c42ca8da2dfbe501da6c5a8db7523bc4cd308adf765c0e4a754e3874f1563c8
-  data.tar.gz: e716377afcbc496894a89b95aeec064c89402c60839fb1213dfce2990a2baaab083e0f46950a1d8791c54a1028ac4c7faefd00dd49ee36b463af90204b55602d
+  metadata.gz: dcdb9e6f44524b5a1a23aa54dea0ed68e4b048b5b0b929c0964aeb576d2eecdc2877a0d054cd73e8afd67ad936016a5e044ae9b95a507219a9f930a7bf4775ec
+  data.tar.gz: e8660e048b3b58049bed277b207a69a5c763eac89049c251b750e0d4e111bb003030987540758686f175f4cfd2be76959a3534262faf57b5475f7982286a966c
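The checksums above are the standard RubyGems integrity digests for the packaged files (`metadata.gz` and `data.tar.gz`). As a rough sketch of what verifying them involves (the payload string below is a stand-in, not the real gem contents):

```ruby
require 'digest'

# Stand-in for the bytes of a packaged file such as metadata.gz.
payload = 'example gem payload'

# checksums.yaml stores hex digests; a verifier recomputes them and compares.
sha1   = Digest::SHA1.hexdigest(payload)
sha512 = Digest::SHA512.hexdigest(payload)

raise 'checksum mismatch' unless Digest::SHA512.hexdigest(payload) == sha512

puts sha1.length   # => 40
puts sha512.length # => 128
```

A mismatch between the recomputed and published digest indicates the package bytes changed after release.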
data/LICENSE.txt ADDED
@@ -0,0 +1,13 @@
+Copyright 2014 OpeNER Project Consortium
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
data/README.md CHANGED
@@ -1,11 +1,15 @@
 Introduction
 ------------
 
-The tokenizer tokenizes a text into sentences and words.
+The tokenizer tokenizes a text into sentences and words.
 
 ### Confused by some terminology?
 
-This software is part of a larger collection of natural language processing tools known as "the OpeNER project". You can find more information about the project at [the OpeNER portal](http://opener-project.github.io). There you can also find references to terms like KAF (an XML standard to represent linguistic annotations in texts), component, cores, scenario's and pipelines.
+This software is part of a larger collection of natural language processing
+tools known as "the OpeNER project". You can find more information about the
+project at [the OpeNER portal](http://opener-project.github.io). There you can
+also find references to terms like KAF (an XML standard to represent linguistic
+annotations in texts), component, cores, scenario's and pipelines.
 
 Quick Use Example
 -----------------
@@ -20,13 +24,14 @@ output KAF by default.
 
 ### Command line interface
 
-You should now be able to call the tokenizer as a regular shell
-command: by its name. Once installed the gem normally sits in your path so you can call it directly from anywhere.
+You should now be able to call the tokenizer as a regular shell command: by its
+name. Once installed the gem normally sits in your path so you can call it
+directly from anywhere.
 
 Tokenizing some text:
 
     echo "This is English text" | tokenizer -l en --no-kaf
-
+
 Will result in
 
     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
@@ -45,11 +50,13 @@ Will result in
      </text>
    </KAF>
 
-The available languages for tokenization are: English (en), German (de), Dutch (nl), French (fr), Spanish (es), Italian (it)
+The available languages for tokenization are: English (en), German (de), Dutch
+(nl), French (fr), Spanish (es), Italian (it)
 
 #### KAF input format
 
-The tokenizer is capable of taking KAF as input, and actually does so by default. You can do so like this:
+The tokenizer is capable of taking KAF as input, and actually does so by
+default. You can do so like this:
 
    echo "<?xml version='1.0' encoding='UTF-8' standalone='no'?><KAF version='v1.opener' xml:lang='en'><raw>This is what I call, a test!</raw></KAF>" | tokenizer
 
@@ -72,7 +79,8 @@ Will result in
      </text>
    </KAF>
 
-If the argument -k (--kaf) is passed, then the argument -l (--language) is ignored.
+If the argument -k (--kaf) is passed, then the argument -l (--language) is
+ignored.
 
 ### Webservices
 
@@ -80,7 +88,8 @@ You can launch a language identification webservice by executing:
 
    tokenizer-server
 
-This will launch a mini webserver with the webservice. It defaults to port 9292, so you can access it at <http://localhost:9292>.
+This will launch a mini webserver with the webservice. It defaults to port 9292,
+so you can access it at <http://localhost:9292>.
 
 To launch it on a different port provide the `-p [port-number]` option like this:
 
@@ -88,19 +97,25 @@ To launch it on a different port provide the `-p [port-number]` option like this
 
 It then launches at <http://localhost:1234>
 
-Documentation on the Webservice is provided by surfing to the urls provided above. For more information on how to launch a webservice run the command with the ```-h``` option.
+Documentation on the Webservice is provided by surfing to the urls provided
+above. For more information on how to launch a webservice run the command with
+the `--help` option.
 
 
 ### Daemon
 
-Last but not least the tokenizer comes shipped with a daemon that can read jobs (and write) jobs to and from Amazon SQS queues. For more information type:
+Last but not least the tokenizer comes shipped with a daemon that can read jobs
+(and write) jobs to and from Amazon SQS queues. For more information type:
 
-    tokenizer-daemon -h
+    tokenizer-daemon --help
 
 Description of dependencies
 ---------------------------
 
-This component runs best if you run it in an environment suited for OpeNER components. You can find an installation guide and helper tools in the [OpeNER installer](https://github.com/opener-project/opener-installer) and [an installation guide on the Opener Website](http://opener-project.github.io/getting-started/how-to/local-installation.html)
+This component runs best if you run it in an environment suited for OpeNER
+components. You can find an installation guide and helper tools in the
+[OpeNER installer](https://github.com/opener-project/opener-installer) and
+[an installation guide on the Opener Website](http://opener-project.github.io/getting-started/how-to/local-installation.html).
 
 At least you need the following system setup:
 
@@ -113,16 +128,20 @@ At least you need the following system setup:
 
 * Maven (for building the Gem)
 
-
 Language Extension
 ------------------
 
-The tokenizer module is a wrapping around a Perl script, which performs the actual tokenization based on rules (when to break a character sequence). The tokenizer already supports a lot of languages. Have a look to the core script to figure out how to extend to new languages.
+The tokenizer module is a wrapping around a Perl script, which performs the
+actual tokenization based on rules (when to break a character sequence). The
+tokenizer already supports a lot of languages. Have a look to the core script to
+figure out how to extend to new languages.
 
 The Core
 --------
 
-The component is a fat wrapper around the actual language technology core. The core is a rule based tokenizer implemented in Perl. You can find the core technologies in the following repositories:
+The component is a fat wrapper around the actual language technology core. The
+core is a rule based tokenizer implemented in Perl. You can find the core
+technologies in the following repositories:
 
 * [tokenizer-base](http://github.com/opener-project/tokenizer-base)
 
@@ -135,9 +154,8 @@ Where to go from here
 Report problem/Get help
 -----------------------
 
-If you encounter problems, please email <support@opener-project.eu> or leave an issue in the
-[issue tracker](https://github.com/opener-project/tokenizer/issues).
-
+If you encounter problems, please email <support@opener-project.eu> or leave an
+issue in the [issue tracker](https://github.com/opener-project/tokenizer/issues).
 
 Contributing
 ------------
data/bin/tokenizer-daemon CHANGED
@@ -1,9 +1,10 @@
 #!/usr/bin/env ruby
-#
-require 'rubygems'
+
 require 'opener/daemons'
 
-exec_path = File.expand_path("../../exec/tokenizer.rb", __FILE__)
-Opener::Daemons::Controller.new(:name=>"tokenizer",
-  :exec_path=>exec_path)
+controller = Opener::Daemons::Controller.new(
+  :name      => 'opener-tokenizer',
+  :exec_path => File.expand_path('../../exec/tokenizer.rb', __FILE__)
+)
 
+controller.run
data/bin/tokenizer-server CHANGED
@@ -1,8 +1,10 @@
 #!/usr/bin/env ruby
 
-require 'puma/cli'
+require 'opener/webservice'
 
-rack_config = File.expand_path('../../config.ru', __FILE__)
+parser = Opener::Webservice::OptionParser.new(
+  'opener-tokenizer',
+  File.expand_path('../../config.ru', __FILE__)
+)
 
-cli = Puma::CLI.new([rack_config] + ARGV)
-cli.run
+parser.run
data/exec/tokenizer.rb CHANGED
@@ -1,8 +1,9 @@
 #!/usr/bin/env ruby
-#
+
 require 'opener/daemons'
-require 'opener/tokenizer'
 
-options = Opener::Daemons::OptParser.parse!(ARGV)
-daemon  = Opener::Daemons::Daemon.new(Opener::Tokenizer, options)
+require_relative '../lib/opener/tokenizer'
+
+daemon = Opener::Daemons::Daemon.new(Opener::Tokenizer)
+
 daemon.start
@@ -2,7 +2,6 @@ require 'opener/tokenizers/base'
 require 'nokogiri'
 require 'open3'
 require 'optparse'
-require 'opener/core'
 
 require_relative 'tokenizer/version'
 require_relative 'tokenizer/cli'
@@ -41,8 +40,10 @@ module Opener
     #
     # @option options [Array] :args Collection of arbitrary arguments to pass
     #  to the individual tokenizer commands.
+    #
     # @option options [String] :language The language to use for the
     #  tokenization process.
+    #
     # @option options [TrueClass|FalseClass] :kaf When set to `true` the input
     #  is assumed to be KAF.
     #
@@ -64,19 +65,21 @@ module Opener
       else
         language = options[:language]
       end
-
+
       unless valid_language?(language)
         raise ArgumentError, "The specified language (#{language}) is invalid"
       end
-
+
       kernel = language_constant(language).new(:args => options[:args])
-
-      stdout, stderr, process = Open3.capture3(*kernel.command.split(" "), :stdin_data => input)
+
+      stdout, stderr, process = Open3.capture3(
+        *kernel.command.split(" "),
+        :stdin_data => input
+      )
+
       raise stderr unless process.success?
+
       return stdout
-
-    rescue Exception => error
-      return Opener::Core::ErrorLayer.new(input, error.message, self.class).add
     end
   end
 
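The rewritten tokenize method above shells out to the tokenizer core via `Open3.capture3` and now raises on failure instead of wrapping errors in `Opener::Core::ErrorLayer`. A minimal sketch of that same pattern, using `cat` as a stand-in for the Perl tokenizer command:

```ruby
require 'open3'

# Feed the text on stdin, capture stdout/stderr, and fail loudly when the
# command exits non-zero -- mirroring the capture3 call in the diff above.
# `cat` merely echoes stdin and stands in for the real tokenizer core.
input = 'This is English text'

stdout, stderr, status = Open3.capture3('cat', :stdin_data => input)

raise stderr unless status.success?

puts stdout # => "This is English text"
```

Because the `rescue Exception` clause was removed, callers of 2.0.0 see the raw error rather than an error layer embedded in the output document.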
@@ -1,5 +1,3 @@
-require 'sinatra/base'
-require 'httpclient'
 require 'opener/webservice'
 
 module Opener
@@ -7,10 +5,11 @@ module Opener
     ##
     # Text tokenizer server powered by Sinatra.
     #
-    class Server < Webservice
+    class Server < Webservice::Server
       set :views, File.expand_path('../views', __FILE__)
-      text_processor Tokenizer
-      accepted_params :input, :kaf, :language
+
+      self.text_processor  = Tokenizer
+      self.accepted_params = [:input, :kaf, :language]
     end # Server
   end # Tokenizer
 end # Opener
@@ -1,5 +1,5 @@
 module Opener
   class Tokenizer
-    VERSION = "1.1.2"
+    VERSION = '2.0.0'
   end
 end
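The jump from 1.1.2 to 2.0.0 is a major-version bump, which by semantic versioning convention signals breaking changes (here: the removed error layer and the reworked daemon/server entry points). RubyGems parses and orders these strings as `Gem::Version` objects:

```ruby
require 'rubygems'

# Version strings compare numerically segment by segment, not lexically.
old_version = Gem::Version.new('1.1.2')
new_version = Gem::Version.new('2.0.0')

puts new_version > old_version  # => true
puts new_version.segments.first # => 2 (the major version)
```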
@@ -9,6 +9,8 @@ Gem::Specification.new do |gem|
   gem.homepage = 'http://opener-project.github.com/'
   gem.has_rdoc = "yard"
 
+  gem.license = 'Apache 2.0'
+
   gem.required_ruby_version = '>= 1.9.2'
 
   gem.files = Dir.glob([
@@ -16,20 +18,17 @@ Gem::Specification.new do |gem|
     'lib/**/*',
     'config.ru',
     '*.gemspec',
-    'README.md'
+    'README.md',
+    'LICENSE.txt'
   ]).select { |file| File.file?(file) }
 
   gem.executables = Dir.glob('bin/*').map { |file| File.basename(file) }
 
-  gem.add_dependency 'opener-tokenizer-base', '>= 0.3.1'
-  gem.add_dependency 'opener-webservice'
-
   gem.add_dependency 'nokogiri'
-  gem.add_dependency 'sinatra', '~>1.4.2'
-  gem.add_dependency 'httpclient'
-  gem.add_dependency 'opener-daemons'
-  gem.add_dependency 'opener-core', '>= 1.0.2'
-  gem.add_dependency 'puma'
+  gem.add_dependency 'opener-tokenizer-base', '~> 1.0'
+  gem.add_dependency 'opener-webservice', '~> 2.1'
+  gem.add_dependency 'opener-daemons', '~> 2.1'
+  gem.add_dependency 'opener-core', '~> 2.0'
 
   gem.add_development_dependency 'rspec'
   gem.add_development_dependency 'cucumber'
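The gemspec now pins its runtime dependencies with the pessimistic operator (`~>`) instead of open-ended `>=` constraints, so future major releases of those gems cannot be pulled in silently. A small illustration of what a constraint like `~> 2.1` admits:

```ruby
require 'rubygems'

# '~> 2.1' permits any 2.x release at or above 2.1, but never 3.0.
requirement = Gem::Requirement.new('~> 2.1')

puts requirement.satisfied_by?(Gem::Version.new('2.1.0')) # => true
puts requirement.satisfied_by?(Gem::Version.new('2.9.9')) # => true
puts requirement.satisfied_by?(Gem::Version.new('3.0.0')) # => false
```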
metadata CHANGED
@@ -1,185 +1,143 @@
 --- !ruby/object:Gem::Specification
 name: opener-tokenizer
 version: !ruby/object:Gem::Version
-  version: 1.1.2
+  version: 2.0.0
 platform: ruby
 authors:
 - development@olery.com
-autorequire: 
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-06-19 00:00:00.000000000 Z
+date: 2014-11-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name: opener-tokenizer-base
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: 0.3.1
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: 0.3.1
-  prerelease: false
-  type: :runtime
-- !ruby/object:Gem::Dependency
-  name: opener-webservice
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: '0'
+  name: nokogiri
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  prerelease: false
   type: :runtime
-- !ruby/object:Gem::Dependency
-  name: nokogiri
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: opener-tokenizer-base
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
-  prerelease: false
+        version: '1.0'
   type: :runtime
-- !ruby/object:Gem::Dependency
-  name: sinatra
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.4.2
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: opener-webservice
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.4.2
-  prerelease: false
+        version: '2.1'
   type: :runtime
-- !ruby/object:Gem::Dependency
-  name: httpclient
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  prerelease: false
-  type: :runtime
+        version: '2.1'
 - !ruby/object:Gem::Dependency
   name: opener-daemons
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: '0'
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
-  prerelease: false
+        version: '2.1'
   type: :runtime
-- !ruby/object:Gem::Dependency
-  name: opener-core
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.0.2
+        version: '2.1'
+- !ruby/object:Gem::Dependency
+  name: opener-core
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: 1.0.2
-  prerelease: false
+        version: '2.0'
   type: :runtime
-- !ruby/object:Gem::Dependency
-  name: puma
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - "~>"
       - !ruby/object:Gem::Version
-        version: '0'
+        version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: rspec
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+  type: :development
   prerelease: false
-  type: :runtime
-- !ruby/object:Gem::Dependency
-  name: rspec
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: cucumber
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  prerelease: false
   type: :development
-- !ruby/object:Gem::Dependency
-  name: cucumber
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  prerelease: false
   type: :development
-- !ruby/object:Gem::Dependency
-  name: pry
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
+- !ruby/object:Gem::Dependency
+  name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  prerelease: false
   type: :development
-- !ruby/object:Gem::Dependency
-  name: rake
+  prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - '>='
-      - !ruby/object:Gem::Version
-        version: '0'
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - '>='
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-  prerelease: false
-  type: :development
 description: Gem that wraps up the the tokenizer cores
-email: 
+email:
 executables:
 - tokenizer
 - tokenizer-daemon
@@ -187,6 +145,7 @@ executables:
 extensions: []
 extra_rdoc_files: []
 files:
+- LICENSE.txt
 - README.md
 - bin/tokenizer
 - bin/tokenizer-daemon
@@ -202,26 +161,28 @@ files:
 - lib/opener/tokenizer/views/result.erb
 - opener-tokenizer.gemspec
 homepage: http://opener-project.github.com/
-licenses: []
+licenses:
+- Apache 2.0
 metadata: {}
-post_install_message: 
+post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: 1.9.2
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - '>='
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubyforge_project: 
+rubyforge_project:
 rubygems_version: 2.2.2
-signing_key: 
+signing_key:
 specification_version: 4
 summary: Gem that wraps up the the tokenizer cores
 test_files: []
+has_rdoc: yard