opener-language-identifier 3.0.3 → 3.0.4
- checksums.yaml +4 -4
- data/README.md +46 -41
- data/core/target/LanguageDetection-0.0.1.jar +0 -0
- data/lib/opener/language_identifier.rb +13 -7
- data/lib/opener/language_identifier/error_layer.rb +91 -0
- data/lib/opener/language_identifier/version.rb +1 -1
- data/opener-language-identifier.gemspec +3 -3
- metadata +22 -7
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ce2775b5964868c6ad0e00519dde29c5dc1654a4
+  data.tar.gz: be35d8f78c6a32b39a77e41b9ba673c70e37c728
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 527280005269de7dadc0e7a4c8169c8f9da4f922e596e04e2765f8e69b0d5873c96e1f2c5f4d902ff7ca5fd4fa85fa9758c693f6fa1eb0cc5a905c5b41365337
+  data.tar.gz: d4cf50e2110aa86c9c908068e9fc339e4124523edeef301fcd49b42c0776cc41d3e5100681b2f12f6e480ebef4a2abb743036ac430f7abe86882a27fe38f1586
data/README.md
CHANGED
@@ -2,41 +2,48 @@

 # Language Identifier

-The language identifier takes raw text and tries to figure out what language it
+The language identifier takes raw text and tries to figure out what language it
+was written in. The output can either be a plain-text i18n language code or a
+basic KAF document containing the language and raw input text.

-The output of the language identifier can then be used to drive further text
+The output of the language identifier can then be used to drive further text
+analysis of for example sentiments and or entities.

-
+## Confused by some terminology?

-This software is part of a larger collection of natural language processing
+This software is part of a larger collection of natural language processing
+tools known as "the OpeNER project". You can find more information about the
+project at [the OpeNER portal](http://opener-project.github.io). There you can
+also find references to terms like KAF (an XML standard to represent linguistic
+annotations in texts), component, cores, scenario's and pipelines.

-Quick Use Example
------------------
+## Quick Use Example

 Install the Gem:

 gem install opener-language-identifier

-Make sure you run
+Make sure you run `jruby` since the language-identifier uses Java.

 ### Command line interface

-You should now be able to call the language indentifier as a regular shell
+You should now be able to call the language indentifier as a regular shell
+command: by its name. Once installed the gem normally sits in your path so you
+can call it directly from anywhere.

-This aplication reads a text from standard input in order to identify the
+This aplication reads a text from standard input in order to identify the
+language.

 echo "This is an English text." | language-identifier

 This will output:

-
-
-<
-
-</KAF>
-~~~~
+<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
+<KAF xml:lang="en" version="2.1">
+<raw>This is an English text.</raw>
+</KAF>

-If you just want the language code returned add the
+If you just want the language code returned add the `--no-kaf` option like this

 echo "This is an English text." | language-identifier --no-kaf

@@ -50,7 +57,8 @@ You can launch a language identification webservice by executing:

 $ language-identifier-server

-This will launch a mini webserver with the webservice. It defaults to port
+This will launch a mini webserver with the webservice. It defaults to port
+9292, so you can access it at <http://localhost:9292/>.

 To launch it on a different port provide the `-p [port-number]` option like
 this:
@@ -61,61 +69,58 @@ It then launches at <http://localhost:1234/>

 Documentation on the Webservice is provided by surfing to the urls provided
 above. For more information on how to launch a webservice run the command with
-the
+the `-h` option.

 ### Daemon

-Last but not least the language identifier comes shipped with a daemon that
-
-
+Last but not least the language identifier comes shipped with a daemon that can
+read jobs (and write) jobs to and from Amazon SQS queues. For more information
+type:

 $ language-identifier-daemon -h

-Description of dependencies
----------------------------
+## Description of dependencies

 This component runs best if you run it in an environment suited for OpeNER
-components. You can find an installation guide and helper tools in the
+components. You can find an installation guide and helper tools in the
+[OpeNER installer](https://github.com/opener-project/opener-installer) and
+[an installation guide on the OpenerWebsite](http://opener-project.github.io/getting-started/how-to/local-installation.html).

 At least you need the following system setup:

 ### Dependencies for normal use:

-*
-*
-* Java 1.7 or newer (There are problems with encoding in older versions).
+* JRuby 1.7 or newer
+* Java 1.7 or newer (there are problems with encodings in older versions).

 ### Dependencies if you want to modify the component:

 * Maven (for building the Gem)

-Language Extension
-------------------
+## Language Extension

-The internal library that actually performs the language identification already
-For more information about how to extends it for
+The internal library that actually performs the language identification already
+supports a lot of languages. For more information about how to extends it for
+more languages or functionalities, please, visit the website of the tool at
+<https://code.google.com/p/language-detection/>.
+
+## The Core

-The Core
---------
-
 The component is a fat wrapper around the actual language technology core.
 Written in Java. Checkout the core/src directory of the package to get to the
 actual working component.

-Where to go from here
----------------------
+## Where to go from here

 * [Check the project website](http://opener-project.github.io)
 * [Checkout the webservice](http://opener.olery.com/language-identifier)

-Report problem/Get help
------------------------
+## Report problem/Get help

 If you encounter problems, please email support@opener-project.eu or leave an
-issue in the [issue tracker](https://github.com/opener-project/language-identifier/issues).
+issue in the [issue tracker](https://github.com/opener-project/language-identifier/issues).

-Contributing
-------------
+## Contributing

 1. Fork it <http://github.com/opener-project/language-identifier/fork>
 2. Create your feature branch (`git checkout -b my-new-feature`)
data/core/target/LanguageDetection-0.0.1.jar
CHANGED
Binary file
data/lib/opener/language_identifier.rb
CHANGED
@@ -9,6 +9,7 @@ import 'org.vicomtech.opennlp.LanguageDetection.CybozuDetector'
 require_relative 'language_identifier/version'
 require_relative 'language_identifier/kaf_builder'
 require_relative 'language_identifier/cli'
+require_relative 'language_identifier/error_layer'
 require_relative 'language_identifier/detector.rb'

 module Opener
@@ -57,14 +58,19 @@ module Opener
     # @return [Array]
     #
     def run(input)
-
-
-
-
-
-
+      begin
+        if options[:probs]
+          output = @detector.probabilities(input)
+        else
+          output = @detector.detect(input)
+          output = build_kaf(input, output) if @options[:kaf]
+        end

-
+        return output
+
+      rescue Exception => error
+        return ErrorLayer.new(input, error.message, self.class).add
+      end
     end

     alias identify run
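Note on the hunk above: with the new `rescue`, `run` reports a failure through its return value instead of raising. The following is only a minimal Ruby sketch of that pattern, not the gem's code. `StubDetector` and `SketchIdentifier` are hypothetical stand-ins, the error is returned as a plain Hash rather than a KAF document, and the sketch rescues `StandardError` where the diff rescues `Exception` (an assumption made here so that `Interrupt` and `SystemExit` still propagate).

```ruby
# Hypothetical detector that fails on blank input; not part of the gem.
class StubDetector
  def detect(text)
    raise ArgumentError, 'input is empty' if text.strip.empty?

    'en'
  end
end

# Hypothetical wrapper mirroring the shape of the new `run` method.
class SketchIdentifier
  def initialize(detector)
    @detector = detector
  end

  # Returns the detector's result, or a Hash describing the failure.
  def run(input)
    @detector.detect(input)
  rescue StandardError => error
    # Assumption: StandardError instead of Exception, so that Interrupt and
    # SystemExit are not swallowed.
    { error: error.message, component: self.class.name }
  end
end

identifier = SketchIdentifier.new(StubDetector.new)
ok     = identifier.run('This is an English text.')
failed = identifier.run('   ')

puts ok
puts failed[:error]
```

A caller of a method written this way has to inspect the return value to tell a result from an error, since nothing is raised.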
data/lib/opener/language_identifier/error_layer.rb
ADDED
@@ -0,0 +1,91 @@
+require 'nokogiri'
+
+module Opener
+  class LanguageIdentifier
+    ##
+    # Add Error Layer to KAF file instead of throwing an error.
+    #
+    class ErrorLayer
+      attr_accessor :input, :document, :error, :klass
+
+      def initialize(input, error, klass)
+        @input = input.to_s
+        # Make sure there is always a document, even if it is empty.
+        @document = Nokogiri::XML(input) rescue Nokogiri::XML(nil)
+        @error = error
+        @klass = klass
+      end
+
+      def add
+        if is_xml?
+          unless has_errors_layer?
+            add_errors_layer
+          end
+        else
+          add_root
+          add_text
+          add_errors_layer
+        end
+        add_error
+
+        xml = !!document.encoding ? document.to_xml : document.to_xml(:encoding => "UTF-8")
+
+        return xml
+      end
+
+      ##
+      # Check if the document is a valid XML file.
+      #
+      def is_xml?
+        !!document.root
+      end
+
+      ##
+      # Add root element to the XML file.
+      #
+      def add_root
+        root = Nokogiri::XML::Node.new "KAF", document
+        document.add_child(root)
+      end
+
+      ##
+      # Check if the document already has an errors layer.
+      #
+      def has_errors_layer?
+        !!document.at('errors')
+      end
+
+      ##
+      # Add errors element to the XML file.
+      #
+      def add_errors_layer
+        node = Nokogiri::XML::Node.new "errors", document
+        document.root.add_child(node)
+      end
+
+      ##
+      # Add the text file incase it is not a valid XML document. More
+      # info for debugging.
+      #
+      def add_text
+        node = Nokogiri::XML::Node.new "raw", document
+        node.inner_html = input
+        document.root.add_child(node)
+
+      end
+
+      ##
+      # Add the actual error to the errors layer.
+      #
+      def add_error
+        node = document.at('errors')
+        error_node = Nokogiri::XML::Node.new "error", node
+        error_node['class'] = klass.to_s
+        error_node['version'] = klass::VERSION
+        error_node.inner_html = error
+        node.add_child(error_node)
+      end
+
+    end # ErrorLayer
+  end # LanguageIdentifier
+end # Opener
data/opener-language-identifier.gemspec
CHANGED
@@ -12,7 +12,6 @@ Gem::Specification.new do |gem|

   gem.files = Dir.glob([
     'core/target/LanguageDetection-*.jar',
-    'core/target/classes/**/*.*',
     'core/target/classes/**/*',
     'exec/**/*',
     'lib/**/*',
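Note on the hunk above: the removed pattern `'core/target/classes/**/*.*'` appears to be redundant, because `**/*` already matches every entry that `**/*.*` matches, plus entries without a dot in their name. A small Ruby sketch that checks this on a throwaway directory; the file names `Detector.class` and `profiles/en` are made up for illustration and are not taken from the gem.

```ruby
require 'tmpdir'
require 'fileutils'

dotted, all = Dir.mktmpdir do |root|
  # Hypothetical layout: one file with an extension, one without.
  FileUtils.mkdir_p(File.join(root, 'classes', 'profiles'))
  FileUtils.touch(File.join(root, 'classes', 'Detector.class'))
  FileUtils.touch(File.join(root, 'classes', 'profiles', 'en'))

  # Paths are returned relative to `root` because of the `base:` keyword.
  [
    Dir.glob('classes/**/*.*', base: root).sort,
    Dir.glob('classes/**/*', base: root).sort
  ]
end

puts "names with a dot: #{dotted.inspect}"
puts "all entries:      #{all.inspect}"
puts "dotted pattern adds nothing: #{(dotted - all).empty?}"
```

Since `Dir.glob` is given an array of patterns in the gemspec, dropping the narrower pattern should leave the resulting file list unchanged apart from duplicates.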
@@ -28,11 +27,12 @@ Gem::Specification.new do |gem|
   gem.add_dependency 'sinatra', '~>1.4.2'
   gem.add_dependency 'httpclient'
   gem.add_dependency 'uuidtools'
-  gem.add_dependency 'opener-build-tools'
   gem.add_dependency 'opener-webservice'
   gem.add_dependency 'opener-daemons'
+  gem.add_dependency 'nokogiri'

-  gem.add_development_dependency 'rspec'
+  gem.add_development_dependency 'rspec', '~> 3.0'
   gem.add_development_dependency 'cucumber'
   gem.add_development_dependency 'rake'
+  gem.add_development_dependency 'cliver'
 end
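Note on the hunk above: the `rspec` development dependency is now pinned with the pessimistic operator `'~> 3.0'`. A short Ruby sketch using RubyGems' own `Gem::Requirement` shows which versions that constraint accepts; the candidate version numbers are arbitrary examples chosen for illustration, not versions referenced by the gem.

```ruby
require 'rubygems'

# The constraint from the gemspec hunk above.
requirement = Gem::Requirement.new('~> 3.0')

# Arbitrary candidate versions, chosen only for illustration.
results = %w[2.14.1 3.0.0 3.4.0 4.0.0].map do |candidate|
  [candidate, requirement.satisfied_by?(Gem::Version.new(candidate))]
end.to_h

results.each do |candidate, allowed|
  puts "rspec #{candidate}: #{allowed ? 'allowed' : 'rejected'}"
end
```

In other words, `'~> 3.0'` behaves like `>= 3.0` combined with `< 4.0`, so any 3.x release satisfies it while 2.x and 4.x do not.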
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: opener-language-identifier
 version: !ruby/object:Gem::Version
-  version: 3.0.
+  version: 3.0.4
 platform: ruby
 authors:
 - development@olery.com
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2014-
+date: 2014-06-12 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: builder
|
|
81
81
|
prerelease: false
|
82
82
|
type: :runtime
|
83
83
|
- !ruby/object:Gem::Dependency
|
84
|
-
name: opener-
|
84
|
+
name: opener-webservice
|
85
85
|
version_requirements: !ruby/object:Gem::Requirement
|
86
86
|
requirements:
|
87
87
|
- - '>='
|
@@ -95,7 +95,7 @@ dependencies:
|
|
95
95
|
prerelease: false
|
96
96
|
type: :runtime
|
97
97
|
- !ruby/object:Gem::Dependency
|
98
|
-
name: opener-
|
98
|
+
name: opener-daemons
|
99
99
|
version_requirements: !ruby/object:Gem::Requirement
|
100
100
|
requirements:
|
101
101
|
- - '>='
|
@@ -109,7 +109,7 @@ dependencies:
|
|
109
109
|
prerelease: false
|
110
110
|
type: :runtime
|
111
111
|
- !ruby/object:Gem::Dependency
|
112
|
-
name:
|
112
|
+
name: nokogiri
|
113
113
|
version_requirements: !ruby/object:Gem::Requirement
|
114
114
|
requirements:
|
115
115
|
- - '>='
|
@@ -124,6 +124,20 @@ dependencies:
|
|
124
124
|
type: :runtime
|
125
125
|
- !ruby/object:Gem::Dependency
|
126
126
|
name: rspec
|
127
|
+
version_requirements: !ruby/object:Gem::Requirement
|
128
|
+
requirements:
|
129
|
+
- - ~>
|
130
|
+
- !ruby/object:Gem::Version
|
131
|
+
version: '3.0'
|
132
|
+
requirement: !ruby/object:Gem::Requirement
|
133
|
+
requirements:
|
134
|
+
- - ~>
|
135
|
+
- !ruby/object:Gem::Version
|
136
|
+
version: '3.0'
|
137
|
+
prerelease: false
|
138
|
+
type: :development
|
139
|
+
- !ruby/object:Gem::Dependency
|
140
|
+
name: cucumber
|
127
141
|
version_requirements: !ruby/object:Gem::Requirement
|
128
142
|
requirements:
|
129
143
|
- - '>='
|
@@ -137,7 +151,7 @@ dependencies:
|
|
137
151
|
prerelease: false
|
138
152
|
type: :development
|
139
153
|
- !ruby/object:Gem::Dependency
|
140
|
-
name:
|
154
|
+
name: rake
|
141
155
|
version_requirements: !ruby/object:Gem::Requirement
|
142
156
|
requirements:
|
143
157
|
- - '>='
|
@@ -151,7 +165,7 @@ dependencies:
|
|
151
165
|
prerelease: false
|
152
166
|
type: :development
|
153
167
|
- !ruby/object:Gem::Dependency
|
154
|
-
name:
|
168
|
+
name: cliver
|
155
169
|
version_requirements: !ruby/object:Gem::Requirement
|
156
170
|
requirements:
|
157
171
|
- - '>='
|
@@ -253,6 +267,7 @@ files:
|
|
253
267
|
- lib/opener/language_identifier.rb
|
254
268
|
- lib/opener/language_identifier/cli.rb
|
255
269
|
- lib/opener/language_identifier/detector.rb
|
270
|
+
- lib/opener/language_identifier/error_layer.rb
|
256
271
|
- lib/opener/language_identifier/kaf_builder.rb
|
257
272
|
- lib/opener/language_identifier/public/markdown.css
|
258
273
|
- lib/opener/language_identifier/server.rb
|