lingo 1.8.4.2 → 1.8.5
- checksums.yaml +4 -4
- data/ChangeLog +413 -325
- data/README +380 -131
- data/Rakefile +19 -21
- data/de/lingo-abk.txt +15 -17
- data/de/lingo-dic.txt +20210 -20659
- data/de/lingo-mul.txt +5 -13
- data/de/lingo-syn.txt +5 -8
- data/de/test_dic.txt +2 -0
- data/de/test_gen.txt +8 -0
- data/de/{test_mul2.txt → test_mu2.txt} +0 -0
- data/de/{test_singleword.txt → test_sgw.txt} +0 -0
- data/de/user-dic.txt +5 -7
- data/de.lang +64 -49
- data/en/lingo-dic.txt +6398 -6404
- data/en/lingo-irr.txt +2 -3
- data/en/lingo-mul.txt +6 -7
- data/en/lingo-wdn.txt +881 -1762
- data/en/user-dic.txt +2 -5
- data/en.lang +39 -39
- data/lib/lingo/app.rb +10 -6
- data/lib/lingo/attendee/abbreviator.rb +1 -0
- data/lib/lingo/attendee/decomposer.rb +2 -1
- data/lib/lingo/attendee/multi_worder.rb +5 -6
- data/lib/lingo/attendee/stemmer.rb +1 -1
- data/lib/lingo/attendee/synonymer.rb +4 -2
- data/lib/lingo/attendee/text_reader.rb +77 -57
- data/lib/lingo/attendee/text_writer.rb +1 -1
- data/lib/lingo/attendee/tokenizer.rb +101 -50
- data/lib/lingo/attendee/variator.rb +2 -1
- data/lib/lingo/attendee/vector_filter.rb +28 -6
- data/lib/lingo/attendee/word_searcher.rb +2 -1
- data/lib/lingo/attendee.rb +8 -4
- data/lib/lingo/call.rb +7 -3
- data/lib/lingo/cli.rb +8 -16
- data/lib/lingo/config.rb +11 -6
- data/lib/lingo/ctl.rb +54 -3
- data/lib/lingo/database/crypter.rb +8 -14
- data/lib/lingo/database/hash_store.rb +1 -1
- data/lib/lingo/database/{show_progress.rb → progress.rb} +7 -8
- data/lib/lingo/database/source/key_value.rb +6 -5
- data/lib/lingo/database/source/multi_key.rb +5 -2
- data/lib/lingo/database/source/multi_value.rb +6 -4
- data/lib/lingo/database/source/single_word.rb +2 -3
- data/lib/lingo/database/source/word_class.rb +24 -5
- data/lib/lingo/database/source.rb +5 -3
- data/lib/lingo/database.rb +102 -41
- data/lib/lingo/error.rb +24 -2
- data/lib/lingo/language/dictionary.rb +26 -54
- data/lib/lingo/language/grammar.rb +19 -23
- data/lib/lingo/language/lexical.rb +5 -1
- data/lib/lingo/language/lexical_hash.rb +7 -12
- data/lib/lingo/language/token.rb +10 -1
- data/lib/lingo/language/word.rb +35 -23
- data/lib/lingo/language/word_form.rb +5 -4
- data/lib/lingo/{show_progress.rb → progress.rb} +43 -30
- data/lib/lingo/srv/lingosrv.cfg +1 -1
- data/lib/lingo/srv/public/.gitkeep +0 -0
- data/lib/lingo/srv.rb +11 -6
- data/lib/lingo/version.rb +2 -2
- data/lib/lingo/web/lingoweb.cfg +1 -1
- data/lib/lingo/web/views/index.erb +4 -4
- data/lib/lingo/web.rb +4 -6
- data/lib/lingo.rb +4 -12
- data/lingo.cfg +1 -1
- data/lir.cfg +1 -1
- data/ru/lingo-dic.txt +33473 -2113
- data/ru/lingo-mul.txt +8430 -1913
- data/ru/lingo-syn.txt +1634 -0
- data/ru/user-dic.txt +6 -0
- data/ru.lang +49 -47
- data/spec/spec_helper.rb +4 -0
- data/test/attendee/ts_decomposer.rb +2 -2
- data/test/attendee/ts_synonymer.rb +3 -3
- data/test/attendee/ts_tokenizer.rb +215 -2
- data/test/attendee/ts_variator.rb +2 -2
- data/test/attendee/ts_word_searcher.rb +10 -6
- data/test/ref/artikel.seq +2 -2
- data/test/ref/artikel.vec +5 -5
- data/test/ref/artikel.ven +11 -11
- data/test/ref/artikel.ver +11 -11
- data/test/ref/lir.seq +13 -13
- data/test/ref/lir.vec +31 -31
- data/test/test_helper.rb +19 -5
- data/test/ts_database.rb +206 -77
- data/test/ts_language.rb +86 -26
- metadata +93 -49
- data/.rspec +0 -1
- data/de/test_syn2.txt +0 -1
data/README
CHANGED
@@ -1,6 +1,5 @@
|
|
1
1
|
= Lingo - A full-featured automatic indexing system
|
2
2
|
|
3
|
-
<b></b>
|
4
3
|
* {Version}[rdoc-label:label-VERSION]
|
5
4
|
* {Description}[rdoc-label:label-DESCRIPTION]
|
6
5
|
* {Introduction}[rdoc-label:label-Introduction]
|
@@ -9,6 +8,10 @@
|
|
9
8
|
* {Markup}[rdoc-label:label-Markup]
|
10
9
|
* {Inline annotation}[rdoc-label:label-Inline+annotation]
|
11
10
|
* {Plugins}[rdoc-label:label-Plugins]
|
11
|
+
* {Server}[rdoc-label:label-Server]
|
12
|
+
* {JSON endpoint}[rdoc-label:label-JSON+endpoint]
|
13
|
+
* {Raw endpoint}[rdoc-label:label-Raw+endpoint]
|
14
|
+
* {Deployment}[rdoc-label:label-Deployment]
|
12
15
|
* {Example}[rdoc-label:label-EXAMPLE]
|
13
16
|
* {Installation and Usage}[rdoc-label:label-INSTALLATION+AND+USAGE]
|
14
17
|
* {Dictionary and configuration file lookup}[rdoc-label:label-Dictionary+and+configuration+file+lookup]
|
@@ -17,15 +20,22 @@
|
|
17
20
|
* {Configuration}[rdoc-label:label-Configuration]
|
18
21
|
* {Language definition}[rdoc-label:label-Language+definition]
|
19
22
|
* {Dictionaries}[rdoc-label:label-Dictionaries]
|
23
|
+
* {Encoding word classes and gender information}[rdoc-label:label-Encoding+word+classes+and+gender+information]
|
24
|
+
* {Lexicalizing multiword expressions}[rdoc-label:label-Lexicalizing+multiword+expressions]
|
25
|
+
* {Lexicalizing compounds}[rdoc-label:label-Lexicalizing+compounds]
|
20
26
|
* {Issues and Contributions}[rdoc-label:label-ISSUES+AND+CONTRIBUTIONS]
|
21
27
|
* {Links}[rdoc-label:label-LINKS]
|
22
28
|
* {Literature}[rdoc-label:label-LITERATURE]
|
29
|
+
* {Background and Theory}[rdoc-label:label-Background+and+Theory]
|
30
|
+
* {Research publications}[rdoc-label:label-Research+publications]
|
23
31
|
* {Credits}[rdoc-label:label-CREDITS]
|
32
|
+
* {Authors}[rdoc-label:label-Authors]
|
33
|
+
* {Contributors}[rdoc-label:label-Contributors]
|
24
34
|
* {License and Copyright}[rdoc-label:label-LICENSE+AND+COPYRIGHT]
|
25
35
|
|
26
36
|
== VERSION
|
27
37
|
|
28
|
-
This documentation refers to Lingo version 1.8.
|
38
|
+
This documentation refers to Lingo version 1.8.5
|
29
39
|
|
30
40
|
|
31
41
|
== DESCRIPTION
|
@@ -36,57 +46,54 @@ functions of Lingo are:
|
|
36
46
|
* identification of (i.e. reduction to) basic word form by means of dictionaries
|
37
47
|
and suffix lists
|
38
48
|
* algorithmic decomposition
|
39
|
-
* dictionary-based
|
49
|
+
* dictionary-based synonymization and identification of phrases
|
40
50
|
* generic identification of phrases/word sequences based on patterns of word
|
41
51
|
classes
|
42
52
|
|
43
53
|
=== Introduction
|
44
54
|
|
45
|
-
|
46
|
-
|
47
|
-
enables you to assemble a network of practically unlimited functionality
|
48
|
-
from modules with limited functions. This network is built by configuration
|
49
|
-
files. Here's a minimal example:
|
55
|
+
Lingo allows flexible and extendable linguistic analysis of text files. Here
|
56
|
+
is a minimal configuration example to analyse this README file:
|
50
57
|
|
51
58
|
meeting:
|
52
59
|
attendees:
|
53
60
|
- text_reader: { files: 'README' }
|
54
61
|
- debugger: { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
|
55
62
|
|
56
|
-
Lingo is told to invite two attendees
|
57
|
-
|
63
|
+
Lingo is told to invite two attendees and wants them to talk to each other,
|
64
|
+
hence the name Lingo (= the technical language).
|
58
65
|
|
59
|
-
The first attendee is the
|
60
|
-
read files
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
+
The first attendee is the text_reader[rdoc-ref:Lingo::Attendee::TextReader].
|
67
|
+
It can read files and communicate their content to other attendees. For this
|
68
|
+
purpose, the +text_reader+ is given an output channel. Everything that the
|
69
|
+
+text_reader+ has to say is steered through this channel. It will do nothing
|
70
|
+
further until Lingo tells the first attendee to speak. Then the +text_reader+
|
71
|
+
will open the file +README+ (as per the +files+ parameter) and pass the content
|
72
|
+
to the other attendees via its output channel.
|
66
73
|
|
67
|
-
The second attendee
|
68
|
-
than to put everything on the console (standard error
|
69
|
-
|
70
|
-
|
71
|
-
|
74
|
+
The second attendee, debugger[rdoc-ref:Lingo::Attendee::Debugger], does nothing
|
75
|
+
other than put everything on the console (standard error) that comes into its
|
76
|
+
input channel. If you write the Lingo configuration which is shown above as an
|
77
|
+
example into the file <tt>readme.cfg</tt> and then run <tt>lingo -c readme -l en</tt>,
|
78
|
+
the result will look something like this:
|
72
79
|
|
73
80
|
<debug>: *FILE('README')
|
74
81
|
<debug>: "= Lingo - [...]"
|
75
82
|
...
|
76
|
-
<debug>: "
|
77
|
-
<debug>: "
|
83
|
+
<debug>: "Lingo allows flexible and extendable linguistic analysis [...]"
|
84
|
+
<debug>: "is a minimal configuration example to analyse this README [...]"
|
78
85
|
...
|
79
86
|
<debug>: *EOF('README')
|
80
87
|
|
81
|
-
What we see are lines with an asterisk (
|
82
|
-
Lingo distinguishes between commands and data. The +text_reader+
|
83
|
-
read the content of the file, but also communicated through the
|
84
|
-
a file
|
85
|
-
information for other attendees that will be added later.
|
88
|
+
What we see are lines beginning with an asterisk (<tt>*</tt>) and lines without.
|
89
|
+
That's because Lingo distinguishes between commands and data. The +text_reader+
|
90
|
+
did not only read the content of the file, but also communicated through the
|
91
|
+
commands when a file began and when it ended. This can (and will) be an
|
92
|
+
important piece of information for other attendees that will be added later.
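The distinction between commands and data in the debugger output can be sketched in a few lines of Ruby. This is only an illustration: the +classify+ helper and its pattern are hypothetical and inferred from the sample output above, not taken from Lingo's source.

```ruby
# Hypothetical classifier for the debugger lines shown above. The pattern is
# inferred from the sample output, not from Lingo's implementation.
COMMAND = /\A\*(?<name>\w+)\((?<arg>.*)\)\z/

def classify(line)
  # Strip the debugger prompt configured in the example ('<debug>: ').
  text = line.sub(/\A<debug>: /, '')

  if (m = COMMAND.match(text))
    # Lines starting with an asterisk carry a command and its argument.
    { type: :command, name: m[:name], arg: m[:arg].delete("'") }
  else
    # Everything else is treated as data.
    { type: :data, text: text }
  end
end

SAMPLE = [
  "<debug>: *FILE('README')",
  '<debug>: "= Lingo - [...]"',
  "<debug>: *EOF('README')"
].map { |line| classify(line) }

SAMPLE.each { |entry| p entry }
```

An attendee that needs to know where a file starts and ends would react to the +FILE+ and +EOF+ commands and pass the data lines through.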
|
86
93
|
|
87
94
|
To try out Lingo's functionality without installing it first, have a look at
|
88
95
|
{Lingo Web}[http://ixtrieve.fh-koeln.de/lingoweb]. There you can enter some
|
89
|
-
text and see the debug output Lingo generated
|
96
|
+
text and see the debug output Lingo generated -- including tokenization, word
|
90
97
|
identification, decomposition, etc.
|
91
98
|
|
92
99
|
=== Attendees
|
@@ -94,35 +101,42 @@ identification, decomposition, etc.
|
|
94
101
|
Available attendees that can be used for solving a specific problem (for more
|
95
102
|
information see each attendee's documentation):
|
96
103
|
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
|
121
|
-
|
122
|
-
|
123
|
-
|
124
|
-
|
125
|
-
|
104
|
+
+text_reader+:: Reads files (or standard input) and puts their content into
|
105
|
+
the channels line by line. (see Lingo::Attendee::TextReader)
|
106
|
+
+tokenizer+:: Dissects lines into defined character strings, i.e. tokens.
|
107
|
+
(see Lingo::Attendee::Tokenizer)
|
108
|
+
+abbreviator+:: Identifies abbreviations and produces the long form if
|
109
|
+
listed in a dictionary. (see Lingo::Attendee::Abbreviator)
|
110
|
+
+word_searcher+:: Identifies tokens and turns them into words for further
|
111
|
+
processing. To this end, it consults the dictionaries.
|
112
|
+
(see Lingo::Attendee::WordSearcher)
|
113
|
+
+stemmer+:: Identifies tokens not identified by the +word_searcher+ by
|
114
|
+
means of stemming. (see Lingo::Attendee::Stemmer)
|
115
|
+
+decomposer+:: Tests any tokens not identified by the +word_searcher+ for
|
116
|
+
being compounds. (see Lingo::Attendee::Decomposer)
|
117
|
+
+synonymer+:: Extends words with their synonyms. (see
|
118
|
+
Lingo::Attendee::Synonymer)
|
119
|
+
+noneword_filter+:: Filters out everything and lets through only those tokens
|
120
|
+
that are unknown. (see Lingo::Attendee::NonewordFilter)
|
121
|
+
+vector_filter+:: Filters out everything and lets through only those tokens
|
122
|
+
that are considered useful for indexing. (see
|
123
|
+
Lingo::Attendee::VectorFilter)
|
124
|
+
+object_filter+:: Similar to the +vector_filter+. (see
|
125
|
+
Lingo::Attendee::ObjectFilter)
|
126
|
+
+text_writer+:: Writes anything that it receives into a file (or to
|
127
|
+
standard output). (see Lingo::Attendee::TextWriter)
|
128
|
+
+formatter+:: Similar to the +text_writer+, but allows for custom output
|
129
|
+
formats. (see Lingo::Attendee::Formatter)
|
130
|
+
+debugger+:: Shows everything for debugging. (see
|
131
|
+
Lingo::Attendee::Debugger)
|
132
|
+
+variator+:: Tries to correct spelling errors and the like. (see
|
133
|
+
Lingo::Attendee::Variator)
|
134
|
+
+dehyphenizer+:: Tries to undo hyphenation. (see
|
135
|
+
Lingo::Attendee::Dehyphenizer)
|
136
|
+
+multi_worder+:: Identifies phrases (word sequences) based on a multiword
|
137
|
+
dictionary. (see Lingo::Attendee::MultiWorder)
|
138
|
+
+sequencer+:: Identifies phrases (word sequences) based on patterns of
|
139
|
+
word classes. (see Lingo::Attendee::Sequencer)
|
126
140
|
|
127
141
|
Furthermore, it may be useful to have a look at the configuration files
|
128
142
|
<tt>lingo.cfg</tt> and <tt>en.lang</tt>.
|
@@ -131,33 +145,193 @@ Furthermore, it may be useful to have a look at the configuration files
|
|
131
145
|
|
132
146
|
Lingo is able to read HTML, XML, and PDF in addition to plain text.
|
133
147
|
|
134
|
-
|
148
|
+
_Examples_:
|
149
|
+
|
150
|
+
Read any file, guessing the correct type automatically:
|
151
|
+
|
152
|
+
- text_reader: { files: $(files), filter: true }
|
153
|
+
|
154
|
+
Read HTML files specifically (accordingly for XML):
|
155
|
+
|
156
|
+
- text_reader: { files: $(files), filter: 'html' }
|
157
|
+
|
158
|
+
Read PDF files, either with the pdf-reader[http://rubygems.org/gems/pdf-reader]
|
159
|
+
gem (default):
|
160
|
+
|
161
|
+
- text_reader: { files: $(files), filter: 'pdf' }
|
162
|
+
|
163
|
+
or with the pdftotext[http://en.wikipedia.org/wiki/Pdftotext] command line tool:
|
164
|
+
|
165
|
+
- text_reader: { files: $(files), filter: 'pdftotext' }
|
135
166
|
|
136
167
|
=== Markup
|
137
168
|
|
138
|
-
Lingo is able to parse HTML/XML and
|
169
|
+
Lingo is able to, in a limited form, parse HTML/XML and
|
170
|
+
MediaWiki[http://mediawiki.org/wiki/Help:Formatting] markup.
|
171
|
+
|
172
|
+
_Examples_:
|
173
|
+
|
174
|
+
Identify HTML/XML tags in the input stream:
|
175
|
+
|
176
|
+
- tokenizer: { tags: true }
|
139
177
|
|
140
|
-
|
178
|
+
Identify MediaWiki markup in the input stream:
|
179
|
+
|
180
|
+
- tokenizer: { wiki: true }
|
141
181
|
|
142
182
|
=== Inline annotation
|
143
183
|
|
144
|
-
Lingo is able to annotate input text inline, instead of printing results
|
145
|
-
external files.
|
184
|
+
Lingo is able to annotate input text inline, instead of printing results out
|
185
|
+
of context to external files.
|
186
|
+
|
187
|
+
_Example_:
|
146
188
|
|
147
|
-
|
189
|
+
# keep line endings
|
190
|
+
- text_reader: { files: $(files), chomp: false }
|
191
|
+
# keep whitespace
|
192
|
+
- tokenizer: { space: true }
|
193
|
+
# do processing...
|
194
|
+
- word_searcher: { source: sys-dic, mode: first }
|
195
|
+
# insert formatted results (e.g. "[[Name::lingo|Lingo]] got these [[Noun::word|words]].")
|
196
|
+
- formatter: { ext: out, format: '[[%3$s::%2$s|%1$s]]', map: { e: Name, s: Noun } }
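The format string in the +formatter+ line uses positional arguments, which Ruby's <tt>Kernel#format</tt> supports directly. The following sketch reproduces the annotated output from the comment above. It assumes, based only on that comment, that the three arguments are the surface form, the base form, and the mapped word class, in that order; the +annotate+ helper is hypothetical.

```ruby
# Format string and word class map copied from the configuration above.
FORMAT = '[[%3$s::%2$s|%1$s]]'
MAP    = { 'e' => 'Name', 's' => 'Noun' }

# Hypothetical helper: the argument order (surface form, base form, word
# class) is an assumption inferred from the README comment.
def annotate(surface, base, word_class)
  format(FORMAT, surface, base, MAP.fetch(word_class, word_class))
end

ANNOTATED = [
  annotate('Lingo', 'lingo', 'e'),
  annotate('words', 'word', 's')
]

puts ANNOTATED
```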
|
148
197
|
|
149
198
|
=== Plugins
|
150
199
|
|
151
200
|
Lingo has a plugin system that allows you to implement additional features
|
152
201
|
(e.g. add new attendees) or modify existing ones. Just create a file named
|
153
202
|
+lingo_plugin.rb+ in your Gem's +lib+ directory or any directory that's in
|
154
|
-
<tt>$LOAD_PATH</tt>. You can also define an environment variable
|
155
|
-
(by default <tt>~/.lingo/plugins</tt>) with additional
|
156
|
-
plugins from (<tt>*.rb</tt>).
|
203
|
+
<tt>$LOAD_PATH</tt>. You can also define an environment variable
|
204
|
+
+LINGO_PLUGIN_PATH+ (by default <tt>~/.lingo/plugins</tt>) with additional
|
205
|
+
directories to load plugins from (<tt>*.rb</tt>).
|
157
206
|
|
158
207
|
A dedicated API to support writing and integrating plugins will be added in
|
159
208
|
the future.
|
160
209
|
|
210
|
+
=== Server
|
211
|
+
|
212
|
+
Lingo comes with a server daemon Lingo::Srv that exposes an HTTP interface to
|
213
|
+
Lingo's functionality. The configuration needs to ensure that input is read
|
214
|
+
from standard input (<tt>files: STDIN</tt> on +text_reader+) and output is
|
215
|
+
written to standard output (<tt>ext: STDOUT</tt> on +text_writer+).
|
216
|
+
|
217
|
+
_Example_: Start Lingo server on port 6789 with language configuration +en+
|
218
|
+
and default configuration file; server options come before <tt>--</tt>, Lingo
|
219
|
+
options come after.
|
220
|
+
|
221
|
+
> lingosrv -p 6789 -- -l en
|
222
|
+
|
223
|
+
You can also pass Lingo options through the +LINGO_SRV_OPTS+ environment
|
224
|
+
variable (e.g., <tt>LINGO_SRV_OPTS='-l en -c /path/to/your/srv.cfg'</tt>).
|
225
|
+
|
226
|
+
==== JSON endpoint
|
227
|
+
|
228
|
+
_Example_: Ask the server about "Lingo server"; returns JSON data (output
|
229
|
+
formatted for clarity).
|
230
|
+
|
231
|
+
> curl 'http://localhost:6789/?q=Lingo+server'
|
232
|
+
{
|
233
|
+
"Lingo server" : [
|
234
|
+
" <Lingo = [(lingo/s), (lingo/e)]>",
|
235
|
+
" <server = [(server/s)]>"
|
236
|
+
]
|
237
|
+
}
|
238
|
+
|
239
|
+
_Example_: Ask the server about "Lingo" and "server"; returns JSON data (output
|
240
|
+
formatted for clarity).
|
241
|
+
|
242
|
+
> curl -g 'http://localhost:6789/?q[]=Lingo&q[]=server'
|
243
|
+
{
|
244
|
+
"[\"Lingo\", \"server\"]" : {
|
245
|
+
"Lingo" : [
|
246
|
+
" <Lingo = [(lingo/s), (lingo/e)]>"
|
247
|
+
],
|
248
|
+
"server" : [
|
249
|
+
" <server = [(server/s)]>"
|
250
|
+
]
|
251
|
+
}
|
252
|
+
}
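The two +curl+ calls above can also be issued from Ruby's standard library. This is a minimal sketch, assuming a server on <tt>localhost:6789</tt> as in the examples; +query_uri+ and +ask+ are hypothetical helper names. The square brackets are percent-encoded here, which Rack-style parameter parsers normally decode back to <tt>q[]</tt>.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumed server address, matching the examples above.
BASE_URL = 'http://localhost:6789/'

# Build the query URI: a single term is sent as q=..., several terms as
# repeated q[]=... parameters.
def query_uri(*terms)
  params = terms.size == 1 ? [['q', terms.first]] : terms.map { |term| ['q[]', term] }
  uri = URI(BASE_URL)
  uri.query = URI.encode_www_form(params)
  uri
end

# Send the request and parse the JSON response (requires a running server).
def ask(*terms)
  JSON.parse(Net::HTTP.get(query_uri(*terms)))
end

puts query_uri('Lingo server')
puts query_uri('Lingo', 'server')
```

Calling <tt>ask('Lingo server')</tt> against a running server should then return a Hash shaped like the JSON shown above.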
|
253
|
+
|
254
|
+
==== Raw endpoint
|
255
|
+
|
256
|
+
_Example_: Ask the server about "Lingo server"; returns raw Lingo response.
|
257
|
+
|
258
|
+
> curl --data 'Lingo server' http://localhost:6789/raw
|
259
|
+
<Lingo = [(lingo/s), (lingo/e)]>
|
260
|
+
<server = [(server/s)]>
|
261
|
+
|
262
|
+
_Example_: Ask the server about this file; returns raw Lingo response (output
|
263
|
+
truncated for clarity).
|
264
|
+
|
265
|
+
> curl --data @README -H 'Content-Type: text/plain' http://localhost:6789/raw
|
266
|
+
:=/OTHR:
|
267
|
+
<Lingo = [(lingo/s), (lingo/e)]>
|
268
|
+
<-|?>
|
269
|
+
<A|?>
|
270
|
+
<full-featured|KOM = [(full-featured/k), (full/s+), (full/a+), (full/v+), (featured/a+)]>
|
271
|
+
<automatic = [(automatic/s), (automatic/a)]>
|
272
|
+
<indexing = [(index/v)]>
|
273
|
+
<system = [(system/s)]>
|
274
|
+
[...]
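The raw response lines can be split into word form, attribute, and lexicals with two regular expressions. This is a rough sketch whose patterns are inferred from the sample output above only. The names +parse_raw+, +WORD+, and +LEXICAL+ are hypothetical, and the meaning of the attribute after the pipe (e.g. +KOM+, <tt>?</tt>) is not specified in this README.

```ruby
# Patterns inferred from the sample output; not Lingo's own parser.
WORD    = /\A<(?<form>[^|=]+?)(?:\|(?<attr>[^ =>]+))?(?: = \[(?<lexicals>.*)\])?>\z/
LEXICAL = %r{\(([^/()]+)/([^)]+)\)}

def parse_raw(line)
  text = line.strip
  m = WORD.match(text)
  # Lines that are not wrapped in <...> are returned as plain tokens.
  return { token: text } unless m

  lexicals = m[:lexicals].to_s.scan(LEXICAL).map do |base, word_class|
    { base: base, word_class: word_class }
  end
  { form: m[:form], attr: m[:attr], lexicals: lexicals }
end

PARSED = [
  ':=/OTHR:',
  '<Lingo = [(lingo/s), (lingo/e)]>',
  '<-|?>',
  '<full-featured|KOM = [(full-featured/k), (full/s+)]>'
].map { |line| parse_raw(line) }

PARSED.each { |entry| p entry }
```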
|
275
|
+
|
276
|
+
==== Deployment
|
277
|
+
|
278
|
+
Lingo::Srv can be started directly through the provided command-line executable
|
279
|
+
+lingosrv+ (see above) or through any other Rack[http://rack.github.com/]
|
280
|
+
-compatible deployment option; a +rackup+ file is included (see <tt>lingoctl
|
281
|
+
rackup srv</tt>).
|
282
|
+
|
283
|
+
_Example_: To deploy Lingo::Srv with Passenger[http://phusionpassenger.com/]
|
284
|
+
on Apache, create a symlink in the DocumentRoot pointing to the app's
|
285
|
+
<tt>public/</tt> directory; adjust the paths according to your environment
|
286
|
+
(you can use current_gem[http://blackwinter.github.com/current_gem] to
|
287
|
+
create a stable gem path):
|
288
|
+
|
289
|
+
/var/www
|
290
|
+
|
|
291
|
+
+-- lingo-srv -> /usr/lib/ruby/gems/2.1.0/gems/lingo-x.y.z/lib/lingo/srv/public
|
292
|
+
|
293
|
+
Then put the following snippet in Apache's VirtualHost configuration:
|
294
|
+
|
295
|
+
<VirtualHost *:80>
|
296
|
+
...
|
297
|
+
|
298
|
+
RackBaseURI /lingo-srv
|
299
|
+
<Directory /var/www/lingo-srv>
|
300
|
+
Options -MultiViews
|
301
|
+
SetEnv LINGO_SRV_OPTS "-l en" # <-- Optionally set Lingo options
|
302
|
+
</Directory>
|
303
|
+
</VirtualHost>
|
304
|
+
|
305
|
+
In order to provide your own +rackup+ file and Lingo configuration, create a
|
306
|
+
directory with those files:
|
307
|
+
|
308
|
+
/srv/lingo-srv
|
309
|
+
|
|
310
|
+
+-- config.ru
|
311
|
+
|
|
312
|
+
+-- lingosrv.cfg
|
313
|
+
|
314
|
+
And then point Passenger at it:
|
315
|
+
|
316
|
+
<VirtualHost *:80>
|
317
|
+
...
|
318
|
+
|
319
|
+
RackBaseURI /lingo-srv
|
320
|
+
<Directory /var/www/lingo-srv>
|
321
|
+
Options -MultiViews
|
322
|
+
PassengerAppRoot /srv/lingo-srv # <-- Add this line
|
323
|
+
</Directory>
|
324
|
+
</VirtualHost>
|
325
|
+
|
326
|
+
Restart Apache and test the result (output formatted for clarity):
|
327
|
+
|
328
|
+
> curl http://localhost/lingo-srv/about
|
329
|
+
{
|
330
|
+
"Lingo::Srv" : {
|
331
|
+
"version" : "x.y.z"
|
332
|
+
}
|
333
|
+
}
|
334
|
+
|
161
335
|
|
162
336
|
== EXAMPLE
|
163
337
|
|
@@ -167,20 +341,21 @@ for further discussion.
|
|
167
341
|
|
168
342
|
== INSTALLATION AND USAGE
|
169
343
|
|
170
|
-
Since version 1.8.0, Lingo is available as a
|
171
|
-
|
172
|
-
|
173
|
-
environment
|
174
|
-
|
344
|
+
Since version 1.8.0, Lingo is available as a
|
345
|
+
RubyGem[http://rubygems.org/gems/lingo]. So a simple <tt>gem install lingo</tt>
|
346
|
+
will install Lingo and its dependencies. You might want to run that command
|
347
|
+
with administrator privileges, depending on your environment. Then you can call
|
348
|
+
the +lingo+ executable to process your text files. See <tt>lingo --help</tt>
|
349
|
+
for available options.
|
175
350
|
|
176
|
-
Please note that Lingo requires Ruby version 1.9.
|
177
|
-
(2.
|
178
|
-
version). If you want to use Lingo on Ruby 1.8, please refer to the
|
179
|
-
version
|
351
|
+
Please note that Lingo requires Ruby version 1.9.3 or higher to run
|
352
|
+
(2.1.3[http://ruby-lang.org/en/downloads/] is the currently recommended
|
353
|
+
version). If you want to use Lingo on Ruby 1.8, please refer to the
|
354
|
+
{legacy version}[rdoc-label:label-Legacy+version].
|
180
355
|
|
181
356
|
Since Lingo depends on native extensions, you need to make sure that
|
182
357
|
development files for your Ruby version are installed. On Debian-based
|
183
|
-
Linux platforms they are included in the package <tt>
|
358
|
+
Linux platforms they are included in the package <tt>ruby-dev</tt>;
|
184
359
|
other distributions may have a similarly named package. On Windows those
|
185
360
|
development files are currently not required.
|
186
361
|
|
@@ -194,29 +369,29 @@ version of Lingo (see below).
|
|
194
369
|
=== Dictionary and configuration file lookup
|
195
370
|
|
196
371
|
Lingo will search different locations to find dictionaries and configuration
|
197
|
-
files. By default, these are the current directory, your personal Lingo
|
372
|
+
files. By default, these are the current working directory, your personal Lingo
|
198
373
|
directory (<tt>~/.lingo</tt>) and the installation directory (in that order).
|
199
374
|
You can control this lookup path by either moving files up the chain (using
|
200
375
|
the +lingoctl+ executable) or by setting various environment variables.
|
201
376
|
|
202
377
|
With +lingoctl+ you can copy dictionaries and configuration files from your
|
203
|
-
personal Lingo directory or the installation directory to the current
|
378
|
+
personal Lingo directory or the installation directory to the current working
|
204
379
|
directory so you can modify them and they will take precedence over the
|
205
380
|
original ones. See <tt>lingoctl --help</tt> for usage information.
|
206
381
|
|
207
|
-
In order to change the search path
|
382
|
+
In order to change the search path itself, you can define the
|
208
383
|
+LINGO_PATH+ environment variable as a whole or its individual parts
|
209
384
|
+LINGO_CURR+ (the local Lingo directory), +LINGO_HOME+ (your personal
|
210
385
|
Lingo directory), and +LINGO_BASE+ (the system-wide Lingo directory).
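The lookup order described above can be sketched as follows. This is not Lingo's implementation: +search_path+ and +find_file+ are hypothetical helpers, the use of <tt>File::PATH_SEPARATOR</tt> to split +LINGO_PATH+ is an assumption, and <tt>/path/to/lingo</tt> is a placeholder for the installation directory.

```ruby
require 'tmpdir'

# Hypothetical re-implementation of the lookup order described above.
def search_path(env)
  if env['LINGO_PATH']
    # Assumption: directories are joined with the platform's path separator.
    env['LINGO_PATH'].split(File::PATH_SEPARATOR)
  else
    [
      env.fetch('LINGO_CURR') { Dir.pwd },
      env.fetch('LINGO_HOME') { File.join(env.fetch('HOME', '.'), '.lingo') },
      env.fetch('LINGO_BASE') { '/path/to/lingo' } # placeholder
    ]
  end
end

# Return the first existing match, so earlier directories take precedence.
def find_file(name, dirs)
  dirs.map { |dir| File.join(dir, name) }.find { |path| File.exist?(path) }
end

def demo
  Dir.mktmpdir do |curr|
    Dir.mktmpdir do |home|
      dirs = search_path('LINGO_CURR' => curr, 'LINGO_HOME' => home,
                         'LINGO_BASE' => '/nonexistent')

      File.write(File.join(home, 'lingo.cfg'), "# personal copy\n")
      before = find_file('lingo.cfg', dirs)

      File.write(File.join(curr, 'lingo.cfg'), "# local copy\n")
      after = find_file('lingo.cfg', dirs)

      [before == File.join(home, 'lingo.cfg'), after == File.join(curr, 'lingo.cfg')]
    end
  end
end

RESULT = demo
p RESULT
```

The demo shows why copying a file "up the chain" works: once a local copy exists, it shadows the personal one.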
|
211
386
|
|
212
|
-
Inside of any of these directories dictionaries and configuration files are
|
387
|
+
Inside of any of these directories, dictionaries and configuration files are
|
213
388
|
typically organized in the following directory structure:
|
214
389
|
|
215
|
-
<tt>config
|
216
|
-
<tt>dict
|
217
|
-
|
218
|
-
<tt>lang
|
219
|
-
<tt>store
|
390
|
+
<tt>config/</tt>:: Configuration files (<tt>*.cfg</tt>).
|
391
|
+
<tt>dict/</tt>:: Dictionary source files (<tt>*.txt</tt>) in
|
392
|
+
language-specific subdirectories (+de/+, +en/+, ...).
|
393
|
+
<tt>lang/</tt>:: Language definition files (<tt>*.lang</tt>).
|
394
|
+
<tt>store/</tt>:: Compiled dictionaries, generated from source files.
|
220
395
|
|
221
396
|
But for compatibility reasons these naming conventions are not enforced.
|
222
397
|
|
@@ -224,14 +399,14 @@ But for compatibility reasons these naming conventions are not enforced.
|
|
224
399
|
|
225
400
|
As Lingo 1.8 introduced some major disruptions and no longer runs on Ruby 1.8,
|
226
401
|
there is a maintenance branch for Lingo 1.7.x that will remain compatible with
|
227
|
-
both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch
|
402
|
+
both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch may
|
228
403
|
receive occasional bug fixes and minor feature updates. However, the bulk of
|
229
404
|
the development efforts will be directed towards Lingo 1.8+.
|
230
405
|
|
231
|
-
To install the legacy version, download and extract the
|
232
|
-
|
233
|
-
are required. This version of Lingo works
|
234
|
-
and 1.9 (1.9.2 or higher).
|
406
|
+
To install the legacy version, download and extract the
|
407
|
+
{ZIP archive}[http://ixtrieve.fh-koeln.de/buch/lingo-1.7.1.zip].
|
408
|
+
No additional dependencies are required. This version of Lingo works
|
409
|
+
with both Ruby 1.8 (1.8.5 or higher) and 1.9 (1.9.2 or higher).
|
235
410
|
|
236
411
|
The executable is named +lingo.rb+. It's located at the root of the installation
|
237
412
|
directory and may only be run from there. See <tt>ruby lingo.rb -h</tt> for
|
@@ -239,49 +414,116 @@ usage instructions.
|
|
239
414
|
|
240
415
|
Configuration and language definition files are also located at the root of the
|
241
416
|
installation directory (<tt>*.cfg</tt> and <tt>*.lang</tt>, respectively).
|
242
|
-
Dictionary source files are found in language-specific subdirectories (+de
|
243
|
-
+en
|
244
|
-
beneath these subdirectories in a directory named <tt>store
|
417
|
+
Dictionary source files are found in language-specific subdirectories (+de/+,
|
418
|
+
+en/+, ...) and are named <tt>*.txt</tt>. The compiled dictionaries are found
|
419
|
+
beneath these language subdirectories in a directory named <tt>store/</tt>.
|
245
420
|
|
246
421
|
|
247
422
|
== FILE FORMATS
|
248
423
|
|
249
|
-
Lingo uses three different types of files to determine its behaviour
|
250
|
-
|
251
|
-
|
252
|
-
|
253
|
-
|
424
|
+
Lingo uses three different types of files to determine its behaviour:
|
425
|
+
{configuration files}[rdoc-label:label-Configuration] control the details of the
|
426
|
+
indexing process; {language definitions}[rdoc-label:label-Language+definition]
|
427
|
+
specify grammar rules and dictionaries available for indexing;
|
428
|
+
dictionaries[rdoc-label:label-Dictionaries], finally, hold the
|
429
|
+
vocabulary used in indexing the input text and producing the results.
|
254
430
|
|
255
431
|
=== Configuration
|
256
432
|
|
257
|
-
|
433
|
+
Configuration files are defined in the YAML[http://yaml.org/] syntax. They
|
434
|
+
specify the attendees[rdoc-label:label-Attendees] to call in order and the
|
435
|
+
options to provide them with. The first attendee in any indexing process is
|
436
|
+
the text_reader[rdoc-ref:Lingo::Attendee::TextReader], who reads the input
|
437
|
+
text and passes it on to the other attendees. Every attendee transforms or
|
438
|
+
extends the input stream and automatically sends everything down to the next
|
439
|
+
attendee. This process may be customized by explicitly specifying the input
|
440
|
+
and/or output channels of individual attendees with the +in+ and +out+ options.
|
441
|
+
|
442
|
+
_Example_:
|
443
|
+
|
444
|
+
# input is taken from the previous attendee,
|
445
|
+
# output is sent to the named channel "syn"
|
446
|
+
- synonymer: { skip: '?,t', source: sys-syn, out: syn }
|
447
|
+
|
448
|
+
# input is taken from the named channel "syn",
|
449
|
+
# output is sent to the next attendee
|
450
|
+
- vector_filter: { in: syn, lexicals: y, sort: term_abs }
|
451
|
+
|
452
|
+
# input is taken from the previous attendee,
|
453
|
+
# output is sent to the next attendee
|
454
|
+
- text_writer: { ext: syn, sep: "\n" }
|
455
|
+
|
456
|
+
# input is taken from the named channel "syn"
|
457
|
+
# (ignoring the output of the previous attendee),
|
458
|
+
# output is sent to the next attendee
|
459
|
+
- vector_filter: { in: syn, lexicals: m }
|
460
|
+
|
461
|
+
# input is taken from the previous attendee,
|
462
|
+
# output is sent to the next attendee
|
463
|
+
- text_writer: { ext: mul, sep: "\n" }
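To see how the +in+ and +out+ options affect the flow in the example above, the following sketch derives each attendee's input and output channel from a shortened copy of that configuration. The +wire+ helper is hypothetical, and the names given to implicit channels (<tt>name-index</tt>, +input+) are invented for illustration; Lingo's internal channel naming may differ.

```ruby
require 'yaml'

# Shortened copy of the example configuration above.
CONFIG = <<~YAML
  - synonymer:     { skip: '?,t', source: sys-syn, out: syn }
  - vector_filter: { in: syn, lexicals: y, sort: term_abs }
  - text_writer:   { ext: syn }
  - vector_filter: { in: syn, lexicals: m }
  - text_writer:   { ext: mul }
YAML

# Hypothetical helper: an attendee reads from its explicit "in" channel or,
# by default, from the previous attendee's output.
def wire(attendees)
  previous_out = 'input'
  attendees.each_with_index.map do |attendee, index|
    name, options = attendee.first
    input  = options.fetch('in', previous_out)
    output = options.fetch('out', "#{name}-#{index}") # invented implicit name
    previous_out = output
    { name: name, in: input, out: output }
  end
end

WIRING = wire(YAML.safe_load(CONFIG))
WIRING.each { |w| puts format('%-14s %-16s -> %s', w[:name], w[:in], w[:out]) }
```

Both +vector_filter+ attendees read from the named channel +syn+, while each +text_writer+ simply consumes the output of the attendee before it.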
|
258
464
|
|
259
465
|
=== Language definition
|
260
466
|
|
261
|
-
|
467
|
+
Language definitions, like {configuration files}[rdoc-label:label-Configuration],
|
468
|
+
are defined in the YAML[http://yaml.org/] syntax. They specify the
|
469
|
+
dictionaries[rdoc-label:label-Dictionaries] to be used as well as the grammar
|
470
|
+
rules according to which the input shall be processed. These settings do not
|
471
|
+
necessarily have to coincide with an existing language; they are
|
472
|
+
application-specific.
|
262
473
|
|
263
474
|
=== Dictionaries
|
264
475
|
|
476
|
+
Dictionaries come in different varieties and encode the knowledge about the
|
477
|
+
vocabulary used for indexing and analysis.
|
478
|
+
|
479
|
+
Supported dictionary formats:
|
480
|
+
|
481
|
+
+SingleWord+:: One word (projection) per line. E.g. <tt>open source</tt>. (see
|
482
|
+
Lingo::Database::Source::SingleWord)
|
483
|
+
+MultiValue+:: Multiple words per line (separated with a unique symbol), all of
|
484
|
+
which are interpreted as belonging to a single equivalence class.
|
485
|
+
E.g. <tt>fax;telefax;facsimile</tt>. (see
|
486
|
+
Lingo::Database::Source::MultiValue)
|
487
|
+
+MultiKey+:: Similar to +MultiValue+, except that the first word will be
|
488
|
+
treated as the preferred term (descriptor). E.g.
|
489
|
+
<tt>fax;telefax;facsimile</tt>. (see
|
490
|
+
Lingo::Database::Source::MultiKey)
|
491
|
+
+KeyValue+:: One word and its associated projection per line, separated with
|
492
|
+
a unique symbol. E.g. <tt>abfrage*query</tt>. (see
|
493
|
+
Lingo::Database::Source::KeyValue)
|
494
|
+
+WordClass+:: Similar to +KeyValue+, except that the projection may consist of
|
495
|
+
multiple lexicalizations, each with its own word class and
|
496
|
+
(optional) gender information. E.g. <tt>abort,abort #s|v</tt>,
|
497
|
+
which is equivalent to <tt>abort,abort #s abort #v</tt>. (see
|
498
|
+
Lingo::Database::Source::WordClass)
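The example lines above can be read with a few small parsers. This is an illustrative sketch only: the +parse_*+ helpers are hypothetical, the separators (<tt>;</tt>, <tt>*</tt>, <tt>,</tt>) are taken from the examples and may be configured differently in a real language definition, and gender information and multiword bases are not handled.

```ruby
# MultiValue: all words on a line form one equivalence class.
def parse_multi_value(line)
  line.strip.split(';')
end

# MultiKey: the first word is the preferred term (descriptor).
def parse_multi_key(line)
  descriptor, *variants = line.strip.split(';')
  { descriptor: descriptor, variants: variants }
end

# KeyValue: one word and its projection.
def parse_key_value(line)
  key, projection = line.strip.split('*', 2)
  { key: key, projection: projection }
end

# WordClass: "#s|v" is expanded to one lexicalization per word class.
def parse_word_class(line)
  key, projection = line.strip.split(',', 2)
  lexicals = projection.scan(/(\S+)\s+#(\S+)/).flat_map do |base, classes|
    classes.split('|').map { |word_class| [base, word_class] }
  end
  { key: key, lexicals: lexicals }
end

p parse_multi_value('fax;telefax;facsimile')
p parse_multi_key('fax;telefax;facsimile')
p parse_key_value('abfrage*query')
p parse_word_class('abort,abort #s|v')
```

With these helpers, the short form <tt>abort,abort #s|v</tt> and the long form <tt>abort,abort #s abort #v</tt> yield the same lexicals.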
|
499
|
+
|
500
|
+
==== Encoding word classes and gender information
|
501
|
+
|
502
|
+
TODO...
|
503
|
+
|
504
|
+
==== Lexicalizing multiword expressions
|
505
|
+
|
506
|
+
TODO...
|
507
|
+
|
508
|
+
==== Lexicalizing compounds
|
509
|
+
|
265
510
|
TODO...
|
266
511
|
|
267
512
|
|
268
513
|
== ISSUES AND CONTRIBUTIONS
|
269
514
|
|
270
|
-
If you find bugs or want to suggest new features, please
|
271
|
-
|
272
|
-
GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
|
515
|
+
If you find bugs or want to suggest new features, please report them
|
516
|
+
on GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
|
273
517
|
version (<tt>ruby --version</tt>) and the version of Lingo you are using
|
274
|
-
(
|
275
|
-
that flag).
|
518
|
+
(<tt>lingo --version</tt>).
|
276
519
|
|
277
|
-
If you want to contribute to Lingo, please fork the project
|
278
|
-
GitHub[http://github.com/lex-lingo/lingo] and submit a
|
279
|
-
{pull request}[http://github.com/lex-lingo/lingo/pulls]
|
280
|
-
|
281
|
-
and send your formatted patch to the {developer list}[mailto:lingo-core@rubyforge.org].
|
520
|
+
If you want to contribute to Lingo, please fork the project
|
521
|
+
on GitHub[http://github.com/lex-lingo/lingo] and submit a
|
522
|
+
{pull request}[http://github.com/lex-lingo/lingo/pulls]
|
523
|
+
(bonus points for topic branches).
|
282
524
|
|
283
525
|
To make sure that Lingo's tests pass, install hen[http://blackwinter.github.com/hen]
|
284
|
-
(typically <tt>gem install hen</tt>) and all development dependencies (either
|
526
|
+
(typically <tt>gem install hen</tt>) and all development dependencies (either with
|
285
527
|
<tt>gem install --development lingo</tt> or manually; see <tt>rake gem:dependencies</tt>).
|
286
528
|
Then run <tt>rake test</tt> for the basic tests or <tt>rake test:all</tt> for
|
287
529
|
the full test suite.
|
@@ -289,28 +531,35 @@ the full test suite.
|
|
289
531
|
|
290
532
|
== LINKS
|
291
533
|
|
292
|
-
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
Mailing list:: http://rubyforge.org/mailman/listinfo/lingo-users
|
300
|
-
Bug tracker:: http://github.com/lex-lingo/lingo/issues
|
534
|
+
Website:: http://lex-lingo.de
|
535
|
+
Demo:: http://ixtrieve.fh-koeln.de/lingoweb
|
536
|
+
Documentation:: https://lex-lingo.github.com/lingo
|
537
|
+
Source code:: https://github.com/lex-lingo/lingo
|
538
|
+
RubyGem:: https://rubygems.org/gems/lingo
|
539
|
+
Bug tracker:: https://github.com/lex-lingo/lingo/issues
|
540
|
+
Travis CI:: https://travis-ci.org/lex-lingo/lingo
|
301
541
|
|
302
542
|
|
303
543
|
== LITERATURE
|
304
544
|
|
305
|
-
|
306
|
-
|
307
|
-
* Gödert, W
|
545
|
+
=== Background and Theory
|
546
|
+
|
547
|
+
* Gödert, W.; Lepsky, K.; Nagelschmidt, M.: <em>{Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch}[http://dx.doi.org/10.1007/978-3-642-23513-9]</em>. (German) Berlin etc.: Springer, 2012.
|
548
|
+
* Lepsky, K.; Vorhauer, J.: <em>{Lingo: ein open source System für die automatische Indexierung deutschsprachiger Dokumente}[http://dx.doi.org/10.1515/ABITECH.2006.26.1.18]</em>. (German) In: ABI Technik 26 (1), 2006. pp 18-29.
|
308
549
|
* Nohr, H.: <em>{Grundlagen der automatischen Indexierung: ein Lehrbuch}[http://logos-verlag.de/cgi-bin/buch/isbn/0121]</em>. (German) Berlin: Logos, 2005.
|
309
|
-
* Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://
|
310
|
-
* Allen, J.: <em>{Natural language understanding}[http://
|
311
|
-
* Grishman, R.: <em>{Computational linguistics: an introduction}[http://
|
312
|
-
* Salton, G
|
313
|
-
* Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14, 1980.
|
550
|
+
* Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://zbmath.org/?q=an:0956.68141]</em>. (German) Berlin etc.: Springer, 2000.
|
551
|
+
* Allen, J.: <em>{Natural language understanding}[http://zbmath.org/?q=an:0851.68106]</em>. (English) Redwood City, CA: Benjamin/Cummings, 1995.
|
552
|
+
* Grishman, R.: <em>{Computational linguistics: an introduction}[http://cambridge.org/9780521310383]</em>. (English) Cambridge: Cambridge Univ. Press, 1986.
|
553
|
+
* Salton, G.; McGill, M.: <em>{Introduction to modern information retrieval}[http://zbmath.org/?q=an:0523.68084]</em>. (English) New York etc.: McGraw-Hill, 1983.
|
554
|
+
* Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14 (3), 1980. pp 130-137.
|
555
|
+
|
556
|
+
=== Research publications
|
557
|
+
|
558
|
+
* Bredack, J.; Lepsky, K.: <em>{Automatische Extraktion von Fachterminologie aus Volltexten}[http://dx.doi.org/10.1515/abitech-2014-0002]</em>. (German) In: ABI Technik 34 (1), 2014. pp 2-12.
|
559
|
+
* Bredack, J.: <em>{Terminologieextraktion von Mehrwortgruppen in kunsthistorischen Fachtexten}[http://ixtrieve.fh-koeln.de/lehre/bredack-2013.pdf]</em>. (German) Köln: Fachhochschule Köln, 2013.
|
560
|
+
* Maylein, L.; Langenstein, A.: <em>{Neues vom Relevanz-Ranking im HEIDI-Katalog der Universitätsbibliothek Heidelberg}[http://b-i-t-online.de/heft/2013-03-fachbeitrag-maylein.pdf]</em>. (German) In: b.i.t.online 16 (3), 2013. pp 190-200.
|
561
|
+
* Gödert, W.: <em>{Detecting multiword phrases in mathematical text corpora}[http://arxiv.org/abs/1210.0852]</em>. (English) arXiv:1210.0852 [cs.CL], 2012.
|
562
|
+
* Schiffer, R.: <em>{Automatisches Indexieren technischer Kongressschriften}[http://ixtrieve.fh-koeln.de/lehre/schiffer-2007.pdf]</em>. (German) Köln: Fachhochschule Köln, 2007.
|
314
563
|
|
315
564
|
|
316
565
|
== CREDITS
|
@@ -333,7 +582,7 @@ Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
|
|
333
582
|
== LICENSE AND COPYRIGHT
|
334
583
|
|
335
584
|
Copyright (C) 2005-2007 John Vorhauer
|
336
|
-
Copyright (C) 2007-
|
585
|
+
Copyright (C) 2007-2014 John Vorhauer, Jens Wille
|
337
586
|
|
338
587
|
Lingo is free software: you can redistribute it and/or modify it under the
|
339
588
|
terms of the GNU Affero General Public License as published by the Free
|