opener-tokenizer 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+   metadata.gz: fcf7f5ea8023ba15ef71ca6fb82839f7404cf1bf
+   data.tar.gz: 021f76b37d483bea54bc3f41ccb1c85a35432003
+ SHA512:
+   metadata.gz: d359e3fdfc7f5792958df137725fcdb02192f52e8a321654628bb5ed4a954b09ef461f5f6a69ed2215e813cad663b945c61cf8a262a80775d882dba98cc185a3
+   data.tar.gz: 77d3ab3fd86a8dd12d8fe4210f36d40e1fb1185cfaab0e5fe40eba34b44b84e62ffe80e042e14b6c90aa67f2b58feaf4fcffd6e8aac85339d6dae9ba7564d22a
data/README.md ADDED
@@ -0,0 +1,165 @@
+ Introduction
+ ------------
+
+ The tokenizer tokenizes a text into sentences and words.
+
+ ### Confused by some terminology?
+
+ This software is part of a larger collection of natural language processing
+ tools known as "the OpeNER project". You can find more information about the
+ project at [the OpeNER portal](http://opener-project.github.io). There you can
+ also find references to terms like KAF (an XML standard to represent linguistic
+ annotations in texts), component, cores, scenarios and pipelines.
+
+ Quick Use Example
+ -----------------
+
+ Installing the tokenizer can be done by executing:
+
+     gem install opener-tokenizer
+
+ Please bear in mind that all components in OpeNER take KAF as input and
+ output KAF by default.
+
+ ### Command line interface
+
+ You should now be able to call the tokenizer as a regular shell command by
+ its name. Once installed, the gem normally sits in your path so you can call
+ it directly from anywhere.
+
+ Tokenizing some text:
+
+     echo "This is English text" | tokenizer -l en --no-kaf
+
+ will result in:
+
+     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+     <KAF version="v1.opener" xml:lang="en">
+       <kafHeader>
+         <linguisticProcessors layer="text">
+           <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+           <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+         </linguisticProcessors>
+       </kafHeader>
+       <text>
+         <wf length="4" offset="0" para="1" sent="1" wid="w1">This</wf>
+         <wf length="2" offset="5" para="1" sent="1" wid="w2">is</wf>
+         <wf length="7" offset="8" para="1" sent="1" wid="w3">English</wf>
+         <wf length="4" offset="16" para="1" sent="1" wid="w4">text</wf>
+       </text>
+     </KAF>
+
+ The available languages for tokenization are: English (en), German (de),
+ Dutch (nl), French (fr), Spanish (es) and Italian (it).
+
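+ The `offset` and `length` attributes of each `<wf>` element above count
+ characters from the start of the raw text. As an illustration (this is a
+ hypothetical sketch, not the actual OpeNER core implementation), a
+ whitespace tokenizer can derive them like this:
+
+ ```ruby
+ # Sketch: derive wid/offset/length word forms from a sentence, the way
+ # they appear in the <wf> elements of the KAF output above.
+ def word_forms(text)
+   offset = 0
+
+   text.split(/\s+/).each_with_index.map do |word, index|
+     offset = text.index(word, offset)
+     wf     = { :wid => "w#{index + 1}", :offset => offset, :length => word.length, :text => word }
+     offset += word.length
+     wf
+   end
+ end
+
+ word_forms('This is English text').each do |wf|
+   puts format('<wf length="%d" offset="%d" wid="%s">%s</wf>',
+               wf[:length], wf[:offset], wf[:wid], wf[:text])
+ end
+ ```
+
+ Running this reproduces the offsets in the example output (0, 5, 8, 16).
+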
+ #### KAF input format
+
+ The tokenizer is capable of taking KAF as input, and actually does so by
+ default. You can do so like this:
+
+     echo "<?xml version='1.0' encoding='UTF-8' standalone='no'?><KAF version='v1.opener' xml:lang='en'><raw>this is an english text</raw></KAF>" | tokenizer
+
+ will result in:
+
+     <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+     <KAF version="v1.opener" xml:lang="en">
+       <kafHeader>
+         <linguisticProcessors layer="text">
+           <lp name="opener-sentence-splitter-en" timestamp="2013-05-31T11:39:31Z" version="0.0.1"/>
+           <lp name="opener-tokenizer-en" timestamp="2013-05-31T11:39:32Z" version="1.0.1"/>
+         </linguisticProcessors>
+       </kafHeader>
+       <text>
+         <wf length="4" offset="0" para="1" sent="1" wid="w1">this</wf>
+         <wf length="2" offset="5" para="1" sent="1" wid="w2">is</wf>
+         <wf length="2" offset="8" para="1" sent="1" wid="w3">an</wf>
+         <wf length="7" offset="11" para="1" sent="1" wid="w4">english</wf>
+         <wf length="4" offset="19" para="1" sent="1" wid="w5">text</wf>
+       </text>
+     </KAF>
+
+ If the argument -k (--kaf) is passed, the argument -l (--language) is ignored.
+
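+ With KAF input the language and raw text are read from the document itself
+ rather than from the command line. The gem does this with Nokogiri (see
+ `lib/opener/tokenizer.rb` below); a minimal sketch of the same extraction
+ using only Ruby's standard-library REXML, so it runs without extra
+ dependencies:
+
+ ```ruby
+ require 'rexml/document'
+
+ # Sketch: pull xml:lang and the raw text out of a KAF document, mirroring
+ # what the tokenizer does internally (which uses Nokogiri instead).
+ def kaf_elements(input)
+   document = REXML::Document.new(input)
+   language = document.root.attributes['xml:lang']
+   text     = REXML::XPath.first(document, '//raw').text
+
+   [language, text]
+ end
+
+ kaf = '<KAF version="v1.opener" xml:lang="en"><raw>this is an english text</raw></KAF>'
+ language, text = kaf_elements(kaf)
+ ```
+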
+ ### Webservices
+
+ You can launch a tokenizer webservice by executing:
+
+     tokenizer-server
+
+ This will launch a mini webserver with the webservice. It defaults to port
+ 9292, so you can access it at <http://localhost:9292>.
+
+ To launch it on a different port, provide the `-p [port-number]` option like
+ this:
+
+     tokenizer-server -p 1234
+
+ It then launches at <http://localhost:1234>.
+
+ Documentation on the webservice is available at the URLs provided above. For
+ more information on how to launch a webservice, run the command with the
+ `-h` option.
+
+ ### Daemon
+
+ Last but not least, the tokenizer comes shipped with a daemon that can read
+ jobs from and write jobs to Amazon SQS queues. For more information type:
+
+     tokenizer-daemon -h
+
+ Description of dependencies
+ ---------------------------
+
+ This component runs best if you run it in an environment suited for OpeNER
+ components. You can find an installation guide and helper tools in the
+ [OpeNER installer](https://github.com/opener-project/opener-installer) and
+ [an installation guide on the OpeNER
+ website](http://opener-project.github.io/getting-started/how-to/local-installation.html).
+
+ You need at least the following system setup:
+
+ ### Dependencies for normal use
+
+ * Perl 5
+ * MRI 1.9.3
+
+ ### Dependencies if you want to modify the component
+
+ * Maven (for building the Gem)
+
+ Language Extension
+ ------------------
+
+ TODO
+
+ The Core
+ --------
+
+ The component is a fat wrapper around the actual language technology core.
+ You can find the core technologies in the following repositories:
+
+ * [tokenizer-base](http://github.com/opener-project/tokenizer-base)
+
+ Where to go from here
+ ---------------------
+
+ * Check [the project website](http://opener-project.github.io)
+ * [Check out the webservice](http://opener.olery.com/tokenizer)
+
+ Report a problem / Get help
+ ---------------------------
+
+ If you encounter problems, please email support@opener-project.eu or leave an
+ issue in the [issue tracker](https://github.com/opener-project/tokenizer/issues).
+
+ Contributing
+ ------------
+
+ 1. Fork it ( http://github.com/opener-project/tokenizer/fork )
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
+ 4. Push to the branch (`git push origin my-new-feature`)
+ 5. Create a new Pull Request
data/bin/tokenizer ADDED
@@ -0,0 +1,7 @@
+ #!/usr/bin/env ruby
+
+ require_relative '../lib/opener/tokenizer'
+
+ cli = Opener::Tokenizer::CLI.new(:args => ARGV)
+
+ cli.run(STDIN.tty? ? nil : STDIN.read)
data/bin/tokenizer-daemon ADDED
@@ -0,0 +1,9 @@
+ #!/usr/bin/env ruby
+
+ require 'rubygems'
+ require 'opener/daemons'
+
+ exec_path = File.expand_path("../../exec/tokenizer.rb", __FILE__)
+
+ Opener::Daemons::Controller.new(:name => "tokenizer",
+                                 :exec_path => exec_path)
data/bin/tokenizer-server ADDED
@@ -0,0 +1,10 @@
+ #!/usr/bin/env ruby
+
+ require 'rack'
+
+ # Without calling `Rack::Server#options` manually the CLI arguments will never
+ # be passed, thus the application can't be specified as a constructor argument.
+ server = Rack::Server.new
+ server.options[:config] = File.expand_path('../../config.ru', __FILE__)
+
+ server.start
data/config.ru ADDED
@@ -0,0 +1,4 @@
+ require File.expand_path('../lib/opener/tokenizer', __FILE__)
+ require File.expand_path('../lib/opener/tokenizer/server', __FILE__)
+
+ run Opener::Tokenizer::Server
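The config.ru above hands every request to `Opener::Tokenizer::Server`. A Rack application is simply an object whose `call(env)` returns a `[status, headers, body]` triple; the hypothetical stand-in below sketches that contract without requiring the rack gem or the OpeNER code:

```ruby
# Sketch of the Rack contract that `run Opener::Tokenizer::Server` relies
# on: the app responds to #call(env) and returns [status, headers, body].
# This stand-in app is purely illustrative.
app = lambda do |env|
  if env['REQUEST_METHOD'] == 'POST'
    [200, { 'Content-Type' => 'application/xml' }, ['<KAF/>']]
  else
    [200, { 'Content-Type' => 'text/html' }, ['<h1>Tokenizer Web Service</h1>']]
  end
end

status, headers, body = app.call('REQUEST_METHOD' => 'GET')
```

`rackup` (invoked by `bin/tokenizer-server` via `Rack::Server`) does essentially this for every incoming HTTP request.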
data/exec/tokenizer.rb ADDED
@@ -0,0 +1,8 @@
+ #!/usr/bin/env ruby
+
+ require 'opener/daemons'
+ require 'opener/tokenizer'
+
+ options = Opener::Daemons::OptParser.parse!(ARGV)
+ daemon  = Opener::Daemons::Daemon.new(Opener::Tokenizer, options)
+
+ daemon.start
data/lib/opener/tokenizer/cli.rb ADDED
@@ -0,0 +1,117 @@
+ module Opener
+   class Tokenizer
+     ##
+     # CLI wrapper around {Opener::Tokenizer} using OptionParser.
+     #
+     # @!attribute [r] options
+     #  @return [Hash]
+     # @!attribute [r] option_parser
+     #  @return [OptionParser]
+     #
+     class CLI
+       attr_reader :options, :option_parser
+
+       ##
+       # @param [Hash] options
+       #
+       def initialize(options = {})
+         @options = DEFAULT_OPTIONS.merge(options)
+
+         @option_parser = OptionParser.new do |opts|
+           opts.program_name   = 'tokenizer'
+           opts.summary_indent = '  '
+
+           opts.on('-h', '--help', 'Shows this help message') do
+             show_help
+           end
+
+           opts.on('-v', '--version', 'Shows the current version') do
+             show_version
+           end
+
+           opts.on(
+             '-l',
+             '--language [VALUE]',
+             'Uses this specific language'
+           ) do |value|
+             @options[:language] = value
+             @options[:kaf]      = false
+           end
+
+           opts.on('-k', '--kaf', 'Treats the input as a KAF document') do
+             @options[:kaf] = true
+           end
+
+           opts.on('-p', '--plain', 'Treats the input as plain text') do
+             @options[:kaf] = false
+           end
+
+           opts.separator <<-EOF
+
+ Examples:
+
+   cat example.txt | #{opts.program_name} -l en # Manually specify the language
+   cat example.kaf | #{opts.program_name}       # Uses the xml:lang attribute
+
+ Languages:
+
+   * Dutch (nl)
+   * English (en)
+   * French (fr)
+   * German (de)
+   * Italian (it)
+   * Spanish (es)
+
+ KAF Input:
+
+   If you give a KAF file as input (-k or --kaf) the language is taken from
+   the xml:lang attribute inside the file. Otherwise the tokenizer expects
+   you to pass the language as an argument (-l or --language).
+
+ Sample KAF syntax:
+
+   <?xml version="1.0" encoding="UTF-8" standalone="no"?>
+   <KAF version="v1.opener" xml:lang="en">
+     <raw>This is some text.</raw>
+   </KAF>
+           EOF
+         end
+       end
+
+       ##
+       # @param [String] input
+       #
+       def run(input)
+         option_parser.parse!(options[:args])
+
+         tokenizer = Tokenizer.new(options)
+
+         stdout, stderr, process = tokenizer.run(input)
+
+         if process.success?
+           puts stdout
+
+           STDERR.puts(stderr) unless stderr.empty?
+         else
+           abort stderr
+         end
+       end
+
+       private
+
+       ##
+       # Shows the help message and exits the program.
+       #
+       def show_help
+         abort option_parser.to_s
+       end
+
+       ##
+       # Shows the version and exits the program.
+       #
+       def show_version
+         abort "#{option_parser.program_name} v#{VERSION} on #{RUBY_DESCRIPTION}"
+       end
+     end # CLI
+   end # Tokenizer
+ end # Opener
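The flag handling in the CLI has one subtlety worth noting: `-l` not only sets the language but also switches off KAF mode, while `-k` forces it back on. A self-contained sketch of just that pattern, with the defaults mirroring `DEFAULT_OPTIONS` (`:kaf => true`, `:language => 'en'`):

```ruby
require 'optparse'

# Sketch of the CLI's flag semantics: -l implies plain-text input,
# -k forces KAF mode. Only the option plumbing is reproduced here.
options = { :kaf => true, :language => 'en' }

parser = OptionParser.new do |opts|
  opts.program_name = 'tokenizer'

  opts.on('-l', '--language [VALUE]', 'Uses this specific language') do |value|
    options[:language] = value
    options[:kaf]      = false
  end

  opts.on('-k', '--kaf', 'Treats the input as a KAF document') do
    options[:kaf] = true
  end
end

# Simulate `tokenizer -l nl`: language becomes 'nl', KAF mode is disabled.
parser.parse!(['-l', 'nl'])
```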
data/lib/opener/tokenizer/public/markdown.css ADDED
@@ -0,0 +1,283 @@
+ input[type="text"], textarea
+ {
+   width: 500px;
+ }
+
+ body {
+   font-family: Helvetica, arial, sans-serif;
+   font-size: 14px;
+   line-height: 1.6;
+   padding-top: 10px;
+   padding-bottom: 10px;
+   background-color: white;
+   padding: 30px; }
+
+ body > *:first-child {
+   margin-top: 0 !important; }
+ body > *:last-child {
+   margin-bottom: 0 !important; }
+
+ a {
+   color: #4183C4; }
+ a.absent {
+   color: #cc0000; }
+ a.anchor {
+   display: block;
+   padding-left: 30px;
+   margin-left: -30px;
+   cursor: pointer;
+   position: absolute;
+   top: 0;
+   left: 0;
+   bottom: 0; }
+
+ h1, h2, h3, h4, h5, h6 {
+   margin: 20px 0 10px;
+   padding: 0;
+   font-weight: bold;
+   -webkit-font-smoothing: antialiased;
+   cursor: text;
+   position: relative; }
+
+ h1:hover a.anchor, h2:hover a.anchor, h3:hover a.anchor, h4:hover a.anchor, h5:hover a.anchor, h6:hover a.anchor {
+   background: url("../../images/modules/styleguide/para.png") no-repeat 10px center;
+   text-decoration: none; }
+
+ h1 tt, h1 code {
+   font-size: inherit; }
+
+ h2 tt, h2 code {
+   font-size: inherit; }
+
+ h3 tt, h3 code {
+   font-size: inherit; }
+
+ h4 tt, h4 code {
+   font-size: inherit; }
+
+ h5 tt, h5 code {
+   font-size: inherit; }
+
+ h6 tt, h6 code {
+   font-size: inherit; }
+
+ h1 {
+   font-size: 28px;
+   color: black; }
+
+ h2 {
+   font-size: 24px;
+   border-bottom: 1px solid #cccccc;
+   color: black; }
+
+ h3 {
+   font-size: 18px; }
+
+ h4 {
+   font-size: 16px; }
+
+ h5 {
+   font-size: 14px; }
+
+ h6 {
+   color: #777777;
+   font-size: 14px; }
+
+ p, blockquote, ul, ol, dl, li, table, pre {
+   margin: 15px 0; }
+
+ hr {
+   background: transparent url("../../images/modules/pulls/dirty-shade.png") repeat-x 0 0;
+   border: 0 none;
+   color: #cccccc;
+   height: 4px;
+   padding: 0; }
+
+ body > h2:first-child {
+   margin-top: 0;
+   padding-top: 0; }
+ body > h1:first-child {
+   margin-top: 0;
+   padding-top: 0; }
+ body > h1:first-child + h2 {
+   margin-top: 0;
+   padding-top: 0; }
+ body > h3:first-child, body > h4:first-child, body > h5:first-child, body > h6:first-child {
+   margin-top: 0;
+   padding-top: 0; }
+
+ a:first-child h1, a:first-child h2, a:first-child h3, a:first-child h4, a:first-child h5, a:first-child h6 {
+   margin-top: 0;
+   padding-top: 0; }
+
+ h1 p, h2 p, h3 p, h4 p, h5 p, h6 p {
+   margin-top: 0; }
+
+ li p.first {
+   display: inline-block; }
+
+ ul, ol {
+   padding-left: 30px; }
+
+ ul :first-child, ol :first-child {
+   margin-top: 0; }
+
+ ul :last-child, ol :last-child {
+   margin-bottom: 0; }
+
+ dl {
+   padding: 0; }
+ dl dt {
+   font-size: 14px;
+   font-weight: bold;
+   font-style: italic;
+   padding: 0;
+   margin: 15px 0 5px; }
+ dl dt:first-child {
+   padding: 0; }
+ dl dt > :first-child {
+   margin-top: 0; }
+ dl dt > :last-child {
+   margin-bottom: 0; }
+ dl dd {
+   margin: 0 0 15px;
+   padding: 0 15px; }
+ dl dd > :first-child {
+   margin-top: 0; }
+ dl dd > :last-child {
+   margin-bottom: 0; }
+
+ blockquote {
+   border-left: 4px solid #dddddd;
+   padding: 0 15px;
+   color: #777777; }
+ blockquote > :first-child {
+   margin-top: 0; }
+ blockquote > :last-child {
+   margin-bottom: 0; }
+
+ table {
+   padding: 0; }
+ table tr {
+   border-top: 1px solid #cccccc;
+   background-color: white;
+   margin: 0;
+   padding: 0; }
+ table tr:nth-child(2n) {
+   background-color: #f8f8f8; }
+ table tr th {
+   font-weight: bold;
+   border: 1px solid #cccccc;
+   text-align: left;
+   margin: 0;
+   padding: 6px 13px; }
+ table tr td {
+   border: 1px solid #cccccc;
+   text-align: left;
+   margin: 0;
+   padding: 6px 13px; }
+ table tr th :first-child, table tr td :first-child {
+   margin-top: 0; }
+ table tr th :last-child, table tr td :last-child {
+   margin-bottom: 0; }
+
+ img {
+   max-width: 100%; }
+
+ span.frame {
+   display: block;
+   overflow: hidden; }
+ span.frame > span {
+   border: 1px solid #dddddd;
+   display: block;
+   float: left;
+   overflow: hidden;
+   margin: 13px 0 0;
+   padding: 7px;
+   width: auto; }
+ span.frame span img {
+   display: block;
+   float: left; }
+ span.frame span span {
+   clear: both;
+   color: #333333;
+   display: block;
+   padding: 5px 0 0; }
+ span.align-center {
+   display: block;
+   overflow: hidden;
+   clear: both; }
+ span.align-center > span {
+   display: block;
+   overflow: hidden;
+   margin: 13px auto 0;
+   text-align: center; }
+ span.align-center span img {
+   margin: 0 auto;
+   text-align: center; }
+ span.align-right {
+   display: block;
+   overflow: hidden;
+   clear: both; }
+ span.align-right > span {
+   display: block;
+   overflow: hidden;
+   margin: 13px 0 0;
+   text-align: right; }
+ span.align-right span img {
+   margin: 0;
+   text-align: right; }
+ span.float-left {
+   display: block;
+   margin-right: 13px;
+   overflow: hidden;
+   float: left; }
+ span.float-left span {
+   margin: 13px 0 0; }
+ span.float-right {
+   display: block;
+   margin-left: 13px;
+   overflow: hidden;
+   float: right; }
+ span.float-right > span {
+   display: block;
+   overflow: hidden;
+   margin: 13px auto 0;
+   text-align: right; }
+
+ code, tt {
+   margin: 0 2px;
+   padding: 0 5px;
+   white-space: nowrap;
+   border: 1px solid #eaeaea;
+   background-color: #f8f8f8;
+   border-radius: 3px; }
+
+ pre code {
+   margin: 0;
+   padding: 0;
+   white-space: pre;
+   border: none;
+   background: transparent; }
+
+ .highlight pre {
+   background-color: #f8f8f8;
+   border: 1px solid #cccccc;
+   font-size: 13px;
+   line-height: 19px;
+   overflow: auto;
+   padding: 6px 10px;
+   border-radius: 3px; }
+
+ pre {
+   background-color: #f8f8f8;
+   border: 1px solid #cccccc;
+   font-size: 13px;
+   line-height: 19px;
+   overflow: auto;
+   padding: 6px 10px;
+   border-radius: 3px; }
+ pre code, pre tt {
+   background-color: transparent;
+   border: none; }
+
data/lib/opener/tokenizer/server.rb ADDED
@@ -0,0 +1,16 @@
+ require 'sinatra/base'
+ require 'httpclient'
+ require 'opener/webservice'
+
+ module Opener
+   class Tokenizer
+     ##
+     # Text tokenizer server powered by Sinatra.
+     #
+     class Server < Webservice
+       set :views, File.expand_path('../views', __FILE__)
+
+       text_processor  Tokenizer
+       accepted_params :input, :kaf, :language
+     end # Server
+   end # Tokenizer
+ end # Opener
data/lib/opener/tokenizer/version.rb ADDED
@@ -0,0 +1,5 @@
+ module Opener
+   class Tokenizer
+     VERSION = "1.0.0"
+   end
+ end
data/lib/opener/tokenizer/views/index.erb ADDED
@@ -0,0 +1,162 @@
+ <!DOCTYPE html>
+ <html>
+   <head>
+     <link type="text/css" rel="stylesheet" charset="UTF-8" href="markdown.css"/>
+     <title>Tokenizer Webservice</title>
+   </head>
+   <body>
+     <h1>Tokenizer Web Service</h1>
+
+     <h2>Example Usage</h2>
+
+     <p>
+     <pre>tokenizer-server start</pre>
+     <pre>curl -d "input=this is an english text&amp;language=en" http://localhost:9292 -XPOST</pre>
+
+     outputs:
+
+     <pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?&gt;
+ &lt;KAF version=&quot;v1.opener&quot; xml:lang=&quot;en&quot;&gt;
+ &lt;kafHeader&gt;
+ &lt;linguisticProcessors layer=&quot;text&quot;&gt;
+ &lt;lp name=&quot;opener-sentence-splitter-en&quot; timestamp=&quot;2013-06-11T13:29:21Z&quot; version=&quot;0.0.1&quot;/&gt;
+ &lt;lp name=&quot;opener-tokenizer-en&quot; timestamp=&quot;2013-06-11T13:29:22Z&quot; version=&quot;1.0.1&quot;/&gt;
+ &lt;/linguisticProcessors&gt;
+ &lt;/kafHeader&gt;
+ &lt;text&gt;
+ &lt;wf length=&quot;4&quot; offset=&quot;0&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w1&quot;&gt;this&lt;/wf&gt;
+ &lt;wf length=&quot;2&quot; offset=&quot;5&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w2&quot;&gt;is&lt;/wf&gt;
+ &lt;wf length=&quot;2&quot; offset=&quot;8&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w3&quot;&gt;an&lt;/wf&gt;
+ &lt;wf length=&quot;7&quot; offset=&quot;11&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w4&quot;&gt;english&lt;/wf&gt;
+ &lt;wf length=&quot;4&quot; offset=&quot;19&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w5&quot;&gt;text&lt;/wf&gt;
+ &lt;/text&gt;
+ &lt;/KAF&gt;</pre>
+
+     <pre>curl -d &#39;input=&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&gt;&lt;KAF xml:lang=&quot;en&quot;&gt;&lt;raw&gt;this is an english text&lt;/raw&gt;&lt;/KAF&gt;&amp;kaf=true&#39; http://localhost:9292 -XPOST</pre>
+
+     outputs:
+
+     <pre>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?&gt;
+ &lt;KAF version=&quot;v1.opener&quot; xml:lang=&quot;en&quot;&gt;
+ &lt;kafHeader&gt;
+ &lt;linguisticProcessors layer=&quot;text&quot;&gt;
+ &lt;lp name=&quot;opener-sentence-splitter-en&quot; timestamp=&quot;2013-06-11T13:26:15Z&quot; version=&quot;0.0.1&quot;/&gt;
+ &lt;lp name=&quot;opener-tokenizer-en&quot; timestamp=&quot;2013-06-11T13:26:16Z&quot; version=&quot;1.0.1&quot;/&gt;
+ &lt;/linguisticProcessors&gt;
+ &lt;/kafHeader&gt;
+ &lt;text&gt;
+ &lt;wf length=&quot;4&quot; offset=&quot;0&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w1&quot;&gt;this&lt;/wf&gt;
+ &lt;wf length=&quot;2&quot; offset=&quot;5&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w2&quot;&gt;is&lt;/wf&gt;
+ &lt;wf length=&quot;2&quot; offset=&quot;8&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w3&quot;&gt;an&lt;/wf&gt;
+ &lt;wf length=&quot;7&quot; offset=&quot;11&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w4&quot;&gt;english&lt;/wf&gt;
+ &lt;wf length=&quot;4&quot; offset=&quot;19&quot; para=&quot;1&quot; sent=&quot;1&quot; wid=&quot;w5&quot;&gt;text&lt;/wf&gt;
+ &lt;/text&gt;
+ &lt;/KAF&gt;</pre>
+     </p>
+
+     <h2>Try the webservice</h2>
+
+     <p>* required</p>
+     <p>** When entering a value no response will be displayed in the browser.</p>
+
+     <form action="<%=url("/")%>" method="POST">
+       <div>
+         <label for="input">Type your text here*</label>
+         <br/>
+
+         <textarea name="input" id="input" rows="10" cols="50"></textarea>
+       </div>
+
+       <% 10.times do |t| %>
+         <div>
+           <label for="callbacks">Callback URL <%=t+1%>(**)</label>
+           <br />
+
+           <input id="callbacks" type="text" name="callbacks[]" />
+         </div>
+       <% end %>
+
+       <div>
+         <label for="error_callback">Error Callback</label>
+         <br />
+
+         <input id="error_callback" type="text" name="error_callback" />
+       </div>
+       <div>
+         <label for="kaf">
+           <input type="checkbox" name="kaf" value="false" id="kaf"/>
+           The input is in raw text (as opposed to KAF) format.
+         </label>
+
+         <br/>
+
+         <label for="language">
+           Choose the language of the text from the list.
+           <select name="language" id="language">
+             <option value="en">English</option>
+             <option value="de">German</option>
+             <option value="nl">Dutch</option>
+             <option value="fr">French</option>
+             <option value="es">Spanish</option>
+             <option value="it">Italian</option>
+           </select>
+         </label>
+       </div>
+
+       <input type="submit" value="Submit" />
+     </form>
+
+     <h2>Actions</h2>
+
+     <dl>
+       <dt>POST /</dt>
+       <dd>Tokenize the input text. See the arguments listing for more options.</dd>
+       <dt>GET /</dt>
+       <dd>Show this page</dd>
+     </dl>
+
+     <h2>Arguments</h2>
+
+     <p>The webservice takes the following arguments:</p>
+     <p>* required</p>
+
+     <dl>
+       <dt>input*</dt>
+       <dd>The input text</dd>
+       <dt>kaf [true | false]</dt>
+       <dd>The input is in KAF format.</dd>
+       <dt>language [en | de | nl | fr | es | it]</dt>
+       <dd>The language of the provided text</dd>
+       <dt>callbacks</dt>
+       <dd>
+         You can provide a list of callback URLs. If you provide callback URLs
+         the tokenizer will run as a background job and a callback with the
+         results will be performed (POST) to the first URL in the callback
+         list. The other URLs in the callback list will be provided in the
+         "callbacks" argument.<br/><br/>
+         Using callbacks you can chain together several OpeNER webservices in
+         one call: the first will call the second, which will call the third,
+         and so on. For more information see the <a href="http://opener-project.github.io">
+         webservice documentation online</a>.
+       </dd>
+       <dt>error_callback</dt>
+       <dd>URL to notify if errors occur in the background process. The error
+       callback will do a POST with the error message in the 'error' field.</dd>
+     </dl>
+   </body>
+ </html>
data/lib/opener/tokenizer/views/result.erb ADDED
@@ -0,0 +1,15 @@
+ <!DOCTYPE html>
+ <html>
+   <head>
+     <link type="text/css" rel="stylesheet" charset="UTF-8" href="markdown.css"/>
+     <title>Tokenizer Webservice</title>
+   </head>
+   <body>
+     <h1>Output URL</h1>
+     <p>
+       When ready, you can view the result
+       <a href="<%= output_url %>">here</a>
+     </p>
+   </body>
+ </html>
data/lib/opener/tokenizer.rb ADDED
@@ -0,0 +1,109 @@
+ require 'opener/tokenizers/base'
+ require 'nokogiri'
+ require 'open3'
+ require 'optparse'
+
+ require_relative 'tokenizer/version'
+ require_relative 'tokenizer/cli'
+
+ module Opener
+   ##
+   # Primary tokenizer class that delegates the work to the various language
+   # specific tokenizers.
+   #
+   # @!attribute [r] options
+   #  @return [Hash]
+   #
+   class Tokenizer
+     attr_reader :options
+
+     ##
+     # The default language to use when no custom one is specified.
+     #
+     # @return [String]
+     #
+     DEFAULT_LANGUAGE = 'en'.freeze
+
+     ##
+     # Hash containing the default options to use.
+     #
+     # @return [Hash]
+     #
+     DEFAULT_OPTIONS = {
+       :args     => [],
+       :kaf      => true,
+       :language => DEFAULT_LANGUAGE
+     }.freeze
+
+     ##
+     # @param [Hash] options
+     #
+     # @option options [Array] :args Collection of arbitrary arguments to pass
+     #  to the individual tokenizer commands.
+     # @option options [String] :language The language to use for the
+     #  tokenization process.
+     # @option options [TrueClass|FalseClass] :kaf When set to `true` the input
+     #  is assumed to be KAF.
+     #
+     def initialize(options = {})
+       @options = DEFAULT_OPTIONS.merge(options)
+     end
+
+     ##
+     # Processes the input and returns an array containing the output of
+     # STDOUT, STDERR and an object containing process information.
+     #
+     # @param [String] input
+     # @return [Array]
+     #
+     def run(input)
+       if options[:kaf]
+         language, input = kaf_elements(input)
+       else
+         language = options[:language]
+       end
+
+       unless valid_language?(language)
+         raise ArgumentError, "The specified language (#{language}) is invalid"
+       end
+
+       kernel = language_constant(language).new(:args => options[:args])
+
+       return Open3.capture3(*kernel.command.split(" "), :stdin_data => input)
+     end
+
+     alias tokenize run
+
+     private
+
+     ##
+     # Returns an Array containing the language and input from a KAF document.
+     #
+     # @param [String] input The KAF document.
+     # @return [Array]
+     #
+     def kaf_elements(input)
+       document = Nokogiri::XML(input)
+       language = document.at('KAF').attr('xml:lang')
+       text     = document.at('raw').text
+
+       return language, text
+     end
+
+     ##
+     # @param [String] language
+     # @return [Class]
+     #
+     def language_constant(language)
+       Opener::Tokenizers.const_get(language.upcase)
+     end
+
+     ##
+     # @return [TrueClass|FalseClass]
+     #
+     def valid_language?(language)
+       return Opener::Tokenizers.const_defined?(language.upcase)
+     end
+   end # Tokenizer
+ end # Opener
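`Tokenizer#run` shells out to a language-specific core command via `Open3.capture3` and returns the `[stdout, stderr, status]` triple unchanged; the CLI then prints stdout on success and aborts with stderr otherwise. The mechanics can be seen with `cat` standing in for a real tokenizer core (the core commands themselves are not reproduced here):

```ruby
require 'open3'

# Sketch of the Open3.capture3 call in Tokenizer#run, with `cat` as a
# stand-in for the actual tokenizer core command. The input text is fed
# to the child process over STDIN via :stdin_data.
command = 'cat'
stdout, stderr, process = Open3.capture3(*command.split(' '),
                                         :stdin_data => 'This is English text')

# `process` is a Process::Status; process.success? drives the CLI's
# decision between printing stdout and aborting with stderr.
```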
data/opener-tokenizer.gemspec ADDED
@@ -0,0 +1,36 @@
+ require File.expand_path('../lib/opener/tokenizer/version', __FILE__)
+
+ Gem::Specification.new do |gem|
+   gem.name        = 'opener-tokenizer'
+   gem.version     = Opener::Tokenizer::VERSION
+   gem.authors     = ['development@olery.com']
+   gem.summary     = 'Gem that wraps up the tokenizer cores'
+   gem.description = gem.summary
+   gem.homepage    = 'http://opener-project.github.com/'
+   gem.has_rdoc    = "yard"
+
+   gem.required_ruby_version = '>= 1.9.2'
+
+   gem.files = Dir.glob([
+     'exec/**/*',
+     'lib/**/*',
+     'config.ru',
+     '*.gemspec',
+     'README.md'
+   ]).select { |file| File.file?(file) }
+
+   gem.executables = Dir.glob('bin/*').map { |file| File.basename(file) }
+
+   gem.add_dependency 'opener-tokenizer-base', '>= 0.3.1'
+   gem.add_dependency 'opener-webservice'
+
+   gem.add_dependency 'nokogiri'
+   gem.add_dependency 'sinatra', '~> 1.4.2'
+   gem.add_dependency 'httpclient'
+   gem.add_dependency 'opener-daemons'
+
+   gem.add_development_dependency 'rspec'
+   gem.add_development_dependency 'cucumber'
+   gem.add_development_dependency 'pry'
+   gem.add_development_dependency 'rake'
+ end
metadata ADDED
@@ -0,0 +1,200 @@
+ --- !ruby/object:Gem::Specification
+ name: opener-tokenizer
+ version: !ruby/object:Gem::Version
+   version: 1.0.0
+ platform: ruby
+ authors:
+ - development@olery.com
+ autorequire:
+ bindir: bin
+ cert_chain: []
+ date: 2014-05-20 00:00:00.000000000 Z
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: opener-tokenizer-base
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.1
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.3.1
+ - !ruby/object:Gem::Dependency
+   name: opener-webservice
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: sinatra
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.4.2
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: 1.4.2
+ - !ruby/object:Gem::Dependency
+   name: httpclient
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: opener-daemons
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rspec
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: cucumber
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: pry
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: '0'
+ description: Gem that wraps up the tokenizer cores
+ email:
+ executables:
+ - tokenizer
+ - tokenizer-daemon
+ - tokenizer-server
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - README.md
+ - bin/tokenizer
+ - bin/tokenizer-daemon
+ - bin/tokenizer-server
+ - config.ru
+ - exec/tokenizer.rb
+ - lib/opener/tokenizer.rb
+ - lib/opener/tokenizer/cli.rb
+ - lib/opener/tokenizer/public/markdown.css
+ - lib/opener/tokenizer/server.rb
+ - lib/opener/tokenizer/version.rb
+ - lib/opener/tokenizer/views/index.erb
+ - lib/opener/tokenizer/views/result.erb
+ - opener-tokenizer.gemspec
+ homepage: http://opener-project.github.com/
+ licenses: []
+ metadata: {}
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 1.9.2
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubyforge_project:
+ rubygems_version: 2.2.2
+ signing_key:
+ specification_version: 4
+ summary: Gem that wraps up the tokenizer cores
+ test_files: []
+ has_rdoc: yard