lingo 1.8.4.2 → 1.8.5
- checksums.yaml +4 -4
- data/ChangeLog +413 -325
- data/README +380 -131
- data/Rakefile +19 -21
- data/de/lingo-abk.txt +15 -17
- data/de/lingo-dic.txt +20210 -20659
- data/de/lingo-mul.txt +5 -13
- data/de/lingo-syn.txt +5 -8
- data/de/test_dic.txt +2 -0
- data/de/test_gen.txt +8 -0
- data/de/{test_mul2.txt → test_mu2.txt} +0 -0
- data/de/{test_singleword.txt → test_sgw.txt} +0 -0
- data/de/user-dic.txt +5 -7
- data/de.lang +64 -49
- data/en/lingo-dic.txt +6398 -6404
- data/en/lingo-irr.txt +2 -3
- data/en/lingo-mul.txt +6 -7
- data/en/lingo-wdn.txt +881 -1762
- data/en/user-dic.txt +2 -5
- data/en.lang +39 -39
- data/lib/lingo/app.rb +10 -6
- data/lib/lingo/attendee/abbreviator.rb +1 -0
- data/lib/lingo/attendee/decomposer.rb +2 -1
- data/lib/lingo/attendee/multi_worder.rb +5 -6
- data/lib/lingo/attendee/stemmer.rb +1 -1
- data/lib/lingo/attendee/synonymer.rb +4 -2
- data/lib/lingo/attendee/text_reader.rb +77 -57
- data/lib/lingo/attendee/text_writer.rb +1 -1
- data/lib/lingo/attendee/tokenizer.rb +101 -50
- data/lib/lingo/attendee/variator.rb +2 -1
- data/lib/lingo/attendee/vector_filter.rb +28 -6
- data/lib/lingo/attendee/word_searcher.rb +2 -1
- data/lib/lingo/attendee.rb +8 -4
- data/lib/lingo/call.rb +7 -3
- data/lib/lingo/cli.rb +8 -16
- data/lib/lingo/config.rb +11 -6
- data/lib/lingo/ctl.rb +54 -3
- data/lib/lingo/database/crypter.rb +8 -14
- data/lib/lingo/database/hash_store.rb +1 -1
- data/lib/lingo/database/{show_progress.rb → progress.rb} +7 -8
- data/lib/lingo/database/source/key_value.rb +6 -5
- data/lib/lingo/database/source/multi_key.rb +5 -2
- data/lib/lingo/database/source/multi_value.rb +6 -4
- data/lib/lingo/database/source/single_word.rb +2 -3
- data/lib/lingo/database/source/word_class.rb +24 -5
- data/lib/lingo/database/source.rb +5 -3
- data/lib/lingo/database.rb +102 -41
- data/lib/lingo/error.rb +24 -2
- data/lib/lingo/language/dictionary.rb +26 -54
- data/lib/lingo/language/grammar.rb +19 -23
- data/lib/lingo/language/lexical.rb +5 -1
- data/lib/lingo/language/lexical_hash.rb +7 -12
- data/lib/lingo/language/token.rb +10 -1
- data/lib/lingo/language/word.rb +35 -23
- data/lib/lingo/language/word_form.rb +5 -4
- data/lib/lingo/{show_progress.rb → progress.rb} +43 -30
- data/lib/lingo/srv/lingosrv.cfg +1 -1
- data/lib/lingo/srv/public/.gitkeep +0 -0
- data/lib/lingo/srv.rb +11 -6
- data/lib/lingo/version.rb +2 -2
- data/lib/lingo/web/lingoweb.cfg +1 -1
- data/lib/lingo/web/views/index.erb +4 -4
- data/lib/lingo/web.rb +4 -6
- data/lib/lingo.rb +4 -12
- data/lingo.cfg +1 -1
- data/lir.cfg +1 -1
- data/ru/lingo-dic.txt +33473 -2113
- data/ru/lingo-mul.txt +8430 -1913
- data/ru/lingo-syn.txt +1634 -0
- data/ru/user-dic.txt +6 -0
- data/ru.lang +49 -47
- data/spec/spec_helper.rb +4 -0
- data/test/attendee/ts_decomposer.rb +2 -2
- data/test/attendee/ts_synonymer.rb +3 -3
- data/test/attendee/ts_tokenizer.rb +215 -2
- data/test/attendee/ts_variator.rb +2 -2
- data/test/attendee/ts_word_searcher.rb +10 -6
- data/test/ref/artikel.seq +2 -2
- data/test/ref/artikel.vec +5 -5
- data/test/ref/artikel.ven +11 -11
- data/test/ref/artikel.ver +11 -11
- data/test/ref/lir.seq +13 -13
- data/test/ref/lir.vec +31 -31
- data/test/test_helper.rb +19 -5
- data/test/ts_database.rb +206 -77
- data/test/ts_language.rb +86 -26
- metadata +93 -49
- data/.rspec +0 -1
- data/de/test_syn2.txt +0 -1
data/README
CHANGED
@@ -1,6 +1,5 @@
|
|
1
1
|
= Lingo - A full-featured automatic indexing system
|
2
2
|
|
3
|
-
<b></b>
|
4
3
|
* {Version}[rdoc-label:label-VERSION]
|
5
4
|
* {Description}[rdoc-label:label-DESCRIPTION]
|
6
5
|
* {Introduction}[rdoc-label:label-Introduction]
|
@@ -9,6 +8,10 @@
|
|
9
8
|
* {Markup}[rdoc-label:label-Markup]
|
10
9
|
* {Inline annotation}[rdoc-label:label-Inline+annotation]
|
11
10
|
* {Plugins}[rdoc-label:label-Plugins]
|
11
|
+
* {Server}[rdoc-label:label-Server]
|
12
|
+
* {JSON endpoint}[rdoc-label:label-JSON+endpoint]
|
13
|
+
* {Raw endpoint}[rdoc-label:label-Raw+endpoint]
|
14
|
+
* {Deployment}[rdoc-label:label-Deployment]
|
12
15
|
* {Example}[rdoc-label:label-EXAMPLE]
|
13
16
|
* {Installation and Usage}[rdoc-label:label-INSTALLATION+AND+USAGE]
|
14
17
|
* {Dictionary and configuration file lookup}[rdoc-label:label-Dictionary+and+configuration+file+lookup]
|
@@ -17,15 +20,22 @@
|
|
17
20
|
* {Configuration}[rdoc-label:label-Configuration]
|
18
21
|
* {Language definition}[rdoc-label:label-Language+definition]
|
19
22
|
* {Dictionaries}[rdoc-label:label-Dictionaries]
|
23
|
+
* {Encoding word classes and gender information}[rdoc-label:label-Encoding+word+classes+and+gender+information]
|
24
|
+
* {Lexicalizing multiword expressions}[rdoc-label:label-Lexicalizing+multiword+expressions]
|
25
|
+
* {Lexicalizing compounds}[rdoc-label:label-Lexicalizing+compounds]
|
20
26
|
* {Issues and Contributions}[rdoc-label:label-ISSUES+AND+CONTRIBUTIONS]
|
21
27
|
* {Links}[rdoc-label:label-LINKS]
|
22
28
|
* {Literature}[rdoc-label:label-LITERATURE]
|
29
|
+
* {Background and Theory}[rdoc-label:label-Background+and+Theory]
|
30
|
+
* {Research publications}[rdoc-label:label-Research+publications]
|
23
31
|
* {Credits}[rdoc-label:label-CREDITS]
|
32
|
+
* {Authors}[rdoc-label:label-Authors]
|
33
|
+
* {Contributors}[rdoc-label:label-Contributors]
|
24
34
|
* {License and Copyright}[rdoc-label:label-LICENSE+AND+COPYRIGHT]
|
25
35
|
|
26
36
|
== VERSION
|
27
37
|
|
28
|
-
This documentation refers to Lingo version 1.8.
|
38
|
+
This documentation refers to Lingo version 1.8.5.
|
29
39
|
|
30
40
|
|
31
41
|
== DESCRIPTION
|
@@ -36,57 +46,54 @@ functions of Lingo are:
|
|
36
46
|
* identification of (i.e. reduction to) basic word form by means of dictionaries
|
37
47
|
and suffix lists
|
38
48
|
* algorithmic decomposition
|
39
|
-
* dictionary-based
|
49
|
+
* dictionary-based synonymization and identification of phrases
|
40
50
|
* generic identification of phrases/word sequences based on patterns of word
|
41
51
|
classes
|
42
52
|
|
43
53
|
=== Introduction
|
44
54
|
|
45
|
-
|
46
|
-
|
47
|
-
enables you to assemble a network of practically unlimited functionality
|
48
|
-
from modules with limited functions. This network is built by configuration
|
49
|
-
files. Here's a minimal example:
|
55
|
+
Lingo allows flexible and extendable linguistic analysis of text files. Here
|
56
|
+
is a minimal configuration example to analyse this README file:
|
50
57
|
|
51
58
|
meeting:
|
52
59
|
attendees:
|
53
60
|
- text_reader: { files: 'README' }
|
54
61
|
- debugger: { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
|
55
62
|
|
56
|
-
Lingo is told to invite two attendees
|
57
|
-
|
63
|
+
Lingo is told to invite two attendees and wants them to talk to each other,
|
64
|
+
hence the name Lingo (= the technical language).
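Since the configuration is plain YAML, the list of invited attendees can be inspected with Ruby's standard library alone. The following is a minimal sketch, not part of Lingo: the constant +README_CFG+ and the helper +attendees_in+ are hypothetical names, and the indentation of the embedded configuration is an assumption.

```ruby
require 'yaml'

# The minimal configuration from above, embedded as a string for illustration.
README_CFG = <<~'YAML'
  meeting:
    attendees:
      - text_reader: { files: 'README' }
      - debugger:    { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
YAML

# Hypothetical helper: returns [name, options] pairs in the order in which
# the attendees are listed in the configuration.
def attendees_in(config_text)
  config = YAML.safe_load(config_text)
  config.fetch('meeting').fetch('attendees').map { |entry| entry.first }
end

attendees_in(README_CFG).each do |name, options|
  puts "#{name}: #{options.inspect}"
end
```

Each list entry is a one-key mapping from the attendee's name to its options, which is why the order of the list is also the order of the meeting.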
|
58
65
|
|
59
|
-
The first attendee is the
|
60
|
-
read files
|
61
|
-
|
62
|
-
|
63
|
-
|
64
|
-
|
65
|
-
|
66
|
+
The first attendee is the text_reader[rdoc-ref:Lingo::Attendee::TextReader].
|
67
|
+
It reads files and communicates their content to other attendees. For this
|
68
|
+
purpose, the +text_reader+ is given an output channel. Everything that the
|
69
|
+
+text_reader+ has to say is steered through this channel. It will do nothing
|
70
|
+
further until Lingo tells the first attendee to speak. Then the +text_reader+
|
71
|
+
will open the file +README+ (as per the +files+ parameter) and pass the content
|
72
|
+
to the other attendees via its output channel.
|
66
73
|
|
67
|
-
The second attendee
|
68
|
-
than to put everything on the console (standard error
|
69
|
-
|
70
|
-
|
71
|
-
|
74
|
+
The second attendee, debugger[rdoc-ref:Lingo::Attendee::Debugger], does nothing
|
75
|
+
more than print to the console (standard error) everything that arrives on its
|
76
|
+
input channel. If you write the Lingo configuration which is shown above as an
|
77
|
+
example into the file <tt>readme.cfg</tt> and then run <tt>lingo -c readme -l en</tt>,
|
78
|
+
the result will look something like this:
|
72
79
|
|
73
80
|
<debug>: *FILE('README')
|
74
81
|
<debug>: "= Lingo - [...]"
|
75
82
|
...
|
76
|
-
<debug>: "
|
77
|
-
<debug>: "
|
83
|
+
<debug>: "Lingo allows flexible and extendable linguistic analysis [...]"
|
84
|
+
<debug>: "is a minimal configuration example to analyse this README [...]"
|
78
85
|
...
|
79
86
|
<debug>: *EOF('README')
|
80
87
|
|
81
|
-
What we see are lines with an asterisk (
|
82
|
-
Lingo distinguishes between commands and data. The +text_reader+
|
83
|
-
read the content of the file, but also communicated through the
|
84
|
-
a file
|
85
|
-
information for other attendees that will be added later.
|
88
|
+
What we see are lines beginning with an asterisk (<tt>*</tt>) and lines without.
|
89
|
+
That's because Lingo distinguishes between commands and data. The +text_reader+
|
90
|
+
not only read the content of the file, but also communicated, by means of
|
91
|
+
commands, when a file began and when it ended. This can (and will) be an
|
92
|
+
important piece of information for other attendees that will be added later.
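The distinction between commands and data can be made visible by sorting the debugger lines on their first character. This is a sketch that works on the sample output above only; +DEBUG_OUTPUT+, +PROMPT+ and +split_commands_and_data+ are hypothetical names and not part of Lingo.

```ruby
# Sample debugger output, copied from the listing above.
DEBUG_OUTPUT = <<~'OUT'
  <debug>: *FILE('README')
  <debug>: "= Lingo - [...]"
  <debug>: *EOF('README')
OUT

# The prompt configured for the debugger in the example configuration.
PROMPT = '<debug>: '

# Hypothetical helper: strips the prompt and splits the remaining lines into
# commands (leading asterisk) and data (everything else).
def split_commands_and_data(output, prompt = PROMPT)
  payloads = output.each_line.map { |line| line.chomp.delete_prefix(prompt) }
  payloads.partition { |payload| payload.start_with?('*') }
end

commands, data = split_commands_and_data(DEBUG_OUTPUT)
puts "commands: #{commands.inspect}"
puts "data:     #{data.inspect}"
```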
|
86
93
|
|
87
94
|
To try out Lingo's functionality without installing it first, have a look at
|
88
95
|
{Lingo Web}[http://ixtrieve.fh-koeln.de/lingoweb]. There you can enter some
|
89
|
-
text and see the debug output Lingo generated
|
96
|
+
text and see the debug output Lingo generated -- including tokenization, word
|
90
97
|
identification, decomposition, etc.
|
91
98
|
|
92
99
|
=== Attendees
|
@@ -94,35 +101,42 @@ identification, decomposition, etc.
|
|
94
101
|
Available attendees that can be used for solving a specific problem (for more
|
95
102
|
information see each attendee's documentation):
|
96
103
|
|
104
|
+
+text_reader+:: Reads files (or standard input) and puts their content into
|
105
|
+
the channels line by line. (see Lingo::Attendee::TextReader)
|
106
|
+
+tokenizer+:: Dissects lines into defined character strings, i.e. tokens.
|
107
|
+
(see Lingo::Attendee::Tokenizer)
|
108
|
+
+abbreviator+:: Identifies abbreviations and produces the long form if
|
109
|
+
listed in a dictionary. (see Lingo::Attendee::Abbreviator)
|
110
|
+
+word_searcher+:: Identifies tokens and turns them into words for further
|
111
|
+
processing. To this end, it consults the dictionaries.
|
112
|
+
(see Lingo::Attendee::WordSearcher)
|
113
|
+
+stemmer+:: Identifies tokens not identified by the +word_searcher+ by
|
114
|
+
means of stemming. (see Lingo::Attendee::Stemmer)
|
115
|
+
+decomposer+:: Tests any tokens not identified by the +word_searcher+ for
|
116
|
+
being compounds. (see Lingo::Attendee::Decomposer)
|
117
|
+
+synonymer+:: Extends words with their synonyms. (see
|
118
|
+
Lingo::Attendee::Synonymer)
|
119
|
+
+noneword_filter+:: Filters out everything and lets through only those tokens
|
120
|
+
that are unknown. (see Lingo::Attendee::NonewordFilter)
|
121
|
+
+vector_filter+:: Filters out everything and lets through only those tokens
|
122
|
+
that are considered useful for indexing. (see
|
123
|
+
Lingo::Attendee::VectorFilter)
|
124
|
+
+object_filter+:: Similar to the +vector_filter+. (see
|
125
|
+
Lingo::Attendee::ObjectFilter)
|
126
|
+
+text_writer+:: Writes anything that it receives into a file (or to
|
127
|
+
standard output). (see Lingo::Attendee::TextWriter)
|
128
|
+
+formatter+:: Similar to the +text_writer+, but allows for custom output
|
129
|
+
formats. (see Lingo::Attendee::Formatter)
|
130
|
+
+debugger+:: Shows everything for debugging. (see
|
131
|
+
Lingo::Attendee::Debugger)
|
132
|
+
+variator+:: Tries to correct spelling errors and the like. (see
|
133
|
+
Lingo::Attendee::Variator)
|
134
|
+
+dehyphenizer+:: Tries to undo hyphenation. (see
|
135
|
+
Lingo::Attendee::Dehyphenizer)
|
136
|
+
+multi_worder+:: Identifies phrases (word sequences) based on a multiword
|
137
|
+
dictionary. (see Lingo::Attendee::MultiWorder)
|
138
|
+
+sequencer+:: Identifies phrases (word sequences) based on patterns of
|
139
|
+
word classes. (see Lingo::Attendee::Sequencer)
|
126
140
|
|
127
141
|
Furthermore, it may be useful to have a look at the configuration files
|
128
142
|
<tt>lingo.cfg</tt> and <tt>en.lang</tt>.
|
@@ -131,33 +145,193 @@ Furthermore, it may be useful to have a look at the configuration files
|
|
131
145
|
|
132
146
|
Lingo is able to read HTML, XML, and PDF in addition to plain text.
|
133
147
|
|
134
|
-
|
148
|
+
_Examples_:
|
149
|
+
|
150
|
+
Read any file, guessing the correct type automatically:
|
151
|
+
|
152
|
+
- text_reader: { files: $(files), filter: true }
|
153
|
+
|
154
|
+
Read HTML files specifically (accordingly for XML):
|
155
|
+
|
156
|
+
- text_reader: { files: $(files), filter: 'html' }
|
157
|
+
|
158
|
+
Read PDF files, either with the pdf-reader[http://rubygems.org/gems/pdf-reader]
|
159
|
+
gem (default):
|
160
|
+
|
161
|
+
- text_reader: { files: $(files), filter: 'pdf' }
|
162
|
+
|
163
|
+
or with the pdftotext[http://en.wikipedia.org/wiki/Pdftotext] command line tool:
|
164
|
+
|
165
|
+
- text_reader: { files: $(files), filter: 'pdftotext' }
|
135
166
|
|
136
167
|
=== Markup
|
137
168
|
|
138
|
-
Lingo is able to parse HTML/XML and
|
169
|
+
Lingo has limited support for parsing HTML/XML and
|
170
|
+
MediaWiki[http://mediawiki.org/wiki/Help:Formatting] markup.
|
171
|
+
|
172
|
+
_Examples_:
|
173
|
+
|
174
|
+
Identify HTML/XML tags in the input stream:
|
175
|
+
|
176
|
+
- tokenizer: { tags: true }
|
139
177
|
|
140
|
-
|
178
|
+
Identify MediaWiki markup in the input stream:
|
179
|
+
|
180
|
+
- tokenizer: { wiki: true }
|
141
181
|
|
142
182
|
=== Inline annotation
|
143
183
|
|
144
|
-
Lingo is able to annotate input text inline, instead of printing results
|
145
|
-
external files.
|
184
|
+
Lingo is able to annotate input text inline, instead of printing results out
|
185
|
+
of context to external files.
|
186
|
+
|
187
|
+
_Example_:
|
146
188
|
|
147
|
-
|
189
|
+
# keep line endings
|
190
|
+
- text_reader: { files: $(files), chomp: false }
|
191
|
+
# keep whitespace
|
192
|
+
- tokenizer: { space: true }
|
193
|
+
# do processing...
|
194
|
+
- word_searcher: { source: sys-dic, mode: first }
|
195
|
+
# insert formatted results (e.g. "[[Name::lingo|Lingo]] got these [[Noun::word|words]].")
|
196
|
+
- formatter: { ext: out, format: '[[%3$s::%2$s|%1$s]]', map: { e: Name, s: Noun } }
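The +format+ option uses numbered placeholders, which Ruby's own +format+ method understands as well. The sketch below reproduces the sample annotation from the comment above. It assumes that the three arguments are the word as it appears in the text, its base form, and the mapped word class; this order is inferred from the sample output, not taken from Lingo's source, and +annotate+ is a hypothetical helper.

```ruby
# Format string and word class map, taken from the formatter options above.
FORMAT    = '[[%3$s::%2$s|%1$s]]'
CLASS_MAP = { 'e' => 'Name', 's' => 'Noun' }

# Hypothetical helper. The argument order (surface form, base form, word
# class) is an assumption inferred from the sample output.
def annotate(surface, base, word_class)
  format(FORMAT, surface, base, CLASS_MAP.fetch(word_class, word_class))
end

puts annotate('Lingo', 'lingo', 'e')  # prints [[Name::lingo|Lingo]]
puts annotate('words', 'word', 's')   # prints [[Noun::word|words]]
```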
|
148
197
|
|
149
198
|
=== Plugins
|
150
199
|
|
151
200
|
Lingo has a plugin system that allows you to implement additional features
|
152
201
|
(e.g. add new attendees) or modify existing ones. Just create a file named
|
153
202
|
+lingo_plugin.rb+ in your Gem's +lib+ directory or any directory that's in
|
154
|
-
<tt>$LOAD_PATH</tt>. You can also define an environment variable
|
155
|
-
(by default <tt>~/.lingo/plugins</tt>) with additional
|
156
|
-
plugins from (<tt>*.rb</tt>).
|
203
|
+
<tt>$LOAD_PATH</tt>. You can also define an environment variable
|
204
|
+
+LINGO_PLUGIN_PATH+ (by default <tt>~/.lingo/plugins</tt>) with additional
|
205
|
+
directories to load plugins from (<tt>*.rb</tt>).
|
157
206
|
|
158
207
|
A dedicated API to support writing and integrating plugins will be added in
|
159
208
|
the future.
|
160
209
|
|
210
|
+
=== Server
|
211
|
+
|
212
|
+
Lingo comes with a server daemon Lingo::Srv that exposes an HTTP interface to
|
213
|
+
Lingo's functionality. The configuration needs to ensure that input is read
|
214
|
+
from standard input (<tt>files: STDIN</tt> on +text_reader+) and output is
|
215
|
+
written to standard output (<tt>ext: STDOUT</tt> on +text_writer+).
|
216
|
+
|
217
|
+
_Example_: Start Lingo server on port 6789 with language configuration +en+
|
218
|
+
and default configuration file; server options come before <tt>--</tt>, Lingo
|
219
|
+
options come after.
|
220
|
+
|
221
|
+
> lingosrv -p 6789 -- -l en
|
222
|
+
|
223
|
+
You can also pass Lingo options through the +LINGO_SRV_OPTS+ environment
|
224
|
+
variable (e.g., <tt>LINGO_SRV_OPTS='-l en -c /path/to/your/srv.cfg'</tt>).
|
225
|
+
|
226
|
+
==== JSON endpoint
|
227
|
+
|
228
|
+
_Example_: Ask the server about "Lingo server"; returns JSON data (output
|
229
|
+
formatted for clarity).
|
230
|
+
|
231
|
+
> curl 'http://localhost:6789/?q=Lingo+server'
|
232
|
+
{
|
233
|
+
"Lingo server" : [
|
234
|
+
" <Lingo = [(lingo/s), (lingo/e)]>",
|
235
|
+
" <server = [(server/s)]>"
|
236
|
+
]
|
237
|
+
}
|
238
|
+
|
239
|
+
_Example_: Ask the server about "Lingo" and "server"; returns JSON data (output
|
240
|
+
formatted for clarity).
|
241
|
+
|
242
|
+
> curl -g 'http://localhost:6789/?q[]=Lingo&q[]=server'
|
243
|
+
{
|
244
|
+
"[\"Lingo\", \"server\"]" : {
|
245
|
+
"Lingo" : [
|
246
|
+
" <Lingo = [(lingo/s), (lingo/e)]>"
|
247
|
+
],
|
248
|
+
"server" : [
|
249
|
+
" <server = [(server/s)]>"
|
250
|
+
]
|
251
|
+
}
|
252
|
+
}
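The same request can be sent from Ruby with the standard library. This is a sketch, assuming a server started as shown above on +localhost+, port 6789; +lingo_query_uri+ and +lingo_query+ are hypothetical helper names, not part of Lingo.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds the query URI for the JSON endpoint. Host and port match the
# lingosrv example above; adjust them for your setup.
def lingo_query_uri(text, host: 'localhost', port: 6789)
  URI::HTTP.build(host: host, port: port, path: '/',
                  query: URI.encode_www_form(q: text))
end

# Sends the query and returns the parsed JSON response as a Hash.
# This needs a running server, so it is only defined here, not called.
def lingo_query(text, **opts)
  JSON.parse(Net::HTTP.get(lingo_query_uri(text, **opts)))
end

puts lingo_query_uri('Lingo server')
```

With a server running, <tt>lingo_query('Lingo server')</tt> would return a Hash keyed by the query string, as in the first +curl+ example.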
|
253
|
+
|
254
|
+
==== Raw endpoint
|
255
|
+
|
256
|
+
_Example_: Ask the server about "Lingo server"; returns raw Lingo response.
|
257
|
+
|
258
|
+
> curl --data 'Lingo server' http://localhost:6789/raw
|
259
|
+
<Lingo = [(lingo/s), (lingo/e)]>
|
260
|
+
<server = [(server/s)]>
|
261
|
+
|
262
|
+
_Example_: Ask the server about this file; returns raw Lingo response (output
|
263
|
+
truncated for clarity).
|
264
|
+
|
265
|
+
> curl --data @README -H 'Content-Type: text/plain' http://localhost:6789/raw
|
266
|
+
:=/OTHR:
|
267
|
+
<Lingo = [(lingo/s), (lingo/e)]>
|
268
|
+
<-|?>
|
269
|
+
<A|?>
|
270
|
+
<full-featured|KOM = [(full-featured/k), (full/s+), (full/a+), (full/v+), (featured/a+)]>
|
271
|
+
<automatic = [(automatic/s), (automatic/a)]>
|
272
|
+
<indexing = [(index/v)]>
|
273
|
+
<system = [(system/s)]>
|
274
|
+
[...]
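To post-process the raw response, the word lines can be taken apart with a regular expression. The pattern below is derived only from the sample output shown above and is not Lingo's own parser; lines of any other shape (such as <tt>:=/OTHR:</tt>) are skipped. +WORD_LINE+, +LEXICAL+ and +parse_word_line+ are hypothetical names.

```ruby
# Matches word lines such as "<Lingo = [(lingo/s), (lingo/e)]>" or "<-|?>".
# Derived from the sample output only; other line shapes do not match.
WORD_LINE = /\A<(?<form>.+?)(?:\|(?<attr>[^\s|>]+))?(?: = \[(?<lexicals>.*)\])?>\z/

# Matches a single lexical such as "(lingo/s)" or "(full/s+)".
LEXICAL = %r{\(([^/()]+)/([^()]+)\)}

# Hypothetical helper: returns a Hash for word lines, nil for anything else.
def parse_word_line(line)
  match = WORD_LINE.match(line.strip)
  return nil unless match

  {
    form: match[:form],
    attr: match[:attr],
    lexicals: match[:lexicals].to_s.scan(LEXICAL)
  }
end

p parse_word_line('<Lingo = [(lingo/s), (lingo/e)]>')
p parse_word_line('<-|?>')
p parse_word_line(':=/OTHR:')
```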
|
275
|
+
|
276
|
+
==== Deployment
|
277
|
+
|
278
|
+
Lingo::Srv can be started directly through the provided command-line executable
|
279
|
+
+lingosrv+ (see above) or through any other deployment option compatible with
|
280
|
+
Rack[http://rack.github.com/]; a +rackup+ file is included (see <tt>lingoctl
|
281
|
+
rackup srv</tt>).
|
282
|
+
|
283
|
+
_Example_: To deploy Lingo::Srv with Passenger[http://phusionpassenger.com/]
|
284
|
+
on Apache, create a symlink in the DocumentRoot pointing to the app's
|
285
|
+
<tt>public/</tt> directory; adjust the paths according to your environment
|
286
|
+
(you can use current_gem[http://blackwinter.github.com/current_gem] to
|
287
|
+
create a stable gem path):
|
288
|
+
|
289
|
+
/var/www
|
290
|
+
|
|
291
|
+
+-- lingo-srv -> /usr/lib/ruby/gems/2.1.0/gems/lingo-x.y.z/lib/lingo/srv/public
|
292
|
+
|
293
|
+
Then put the following snippet in Apache's VirtualHost configuration:
|
294
|
+
|
295
|
+
<VirtualHost *:80>
|
296
|
+
...
|
297
|
+
|
298
|
+
RackBaseURI /lingo-srv
|
299
|
+
<Directory /var/www/lingo-srv>
|
300
|
+
Options -MultiViews
|
301
|
+
SetEnv LINGO_SRV_OPTS "-l en" # <-- Optionally set Lingo options
|
302
|
+
</Directory>
|
303
|
+
</VirtualHost>
|
304
|
+
|
305
|
+
In order to provide your own +rackup+ file and Lingo configuration, create a
|
306
|
+
directory with those files:
|
307
|
+
|
308
|
+
/srv/lingo-srv
|
309
|
+
|
|
310
|
+
+-- config.ru
|
311
|
+
|
|
312
|
+
+-- lingosrv.cfg
|
313
|
+
|
314
|
+
And then point Passenger at it:
|
315
|
+
|
316
|
+
<VirtualHost *:80>
|
317
|
+
...
|
318
|
+
|
319
|
+
RackBaseURI /lingo-srv
|
320
|
+
<Directory /var/www/lingo-srv>
|
321
|
+
Options -MultiViews
|
322
|
+
PassengerAppRoot /srv/lingo-srv # <-- Add this line
|
323
|
+
</Directory>
|
324
|
+
</VirtualHost>
|
325
|
+
|
326
|
+
Restart Apache and test the result (output formatted for clarity):
|
327
|
+
|
328
|
+
> curl http://localhost/lingo-srv/about
|
329
|
+
{
|
330
|
+
"Lingo::Srv" : {
|
331
|
+
"version" : "x.y.z"
|
332
|
+
}
|
333
|
+
}
|
334
|
+
|
161
335
|
|
162
336
|
== EXAMPLE
|
163
337
|
|
@@ -167,20 +341,21 @@ for further discussion.
|
|
167
341
|
|
168
342
|
== INSTALLATION AND USAGE
|
169
343
|
|
170
|
-
Since version 1.8.0, Lingo is available as a
|
171
|
-
|
172
|
-
|
173
|
-
environment
|
174
|
-
|
344
|
+
Since version 1.8.0, Lingo is available as a
|
345
|
+
RubyGem[http://rubygems.org/gems/lingo]. So a simple <tt>gem install lingo</tt>
|
346
|
+
will install Lingo and its dependencies. You might want to run that command
|
347
|
+
with administrator privileges, depending on your environment. Then you can call
|
348
|
+
the +lingo+ executable to process your text files. See <tt>lingo --help</tt>
|
349
|
+
for available options.
|
175
350
|
|
176
|
-
Please note that Lingo requires Ruby version 1.9.
|
177
|
-
(2.
|
178
|
-
version). If you want to use Lingo on Ruby 1.8, please refer to the
|
179
|
-
version
|
351
|
+
Please note that Lingo requires Ruby version 1.9.3 or higher to run
|
352
|
+
(2.1.3[http://ruby-lang.org/en/downloads/] is the currently recommended
|
353
|
+
version). If you want to use Lingo on Ruby 1.8, please refer to the
|
354
|
+
{legacy version}[rdoc-label:label-Legacy+version].
|
180
355
|
|
181
356
|
Since Lingo depends on native extensions, you need to make sure that
|
182
357
|
development files for your Ruby version are installed. On Debian-based
|
183
|
-
Linux platforms they are included in the package <tt>
|
358
|
+
Linux platforms they are included in the package <tt>ruby-dev</tt>;
|
184
359
|
other distributions may have a similarly named package. On Windows those
|
185
360
|
development files are currently not required.
|
186
361
|
|
@@ -194,29 +369,29 @@ version of Lingo (see below).
|
|
194
369
|
=== Dictionary and configuration file lookup
|
195
370
|
|
196
371
|
Lingo will search different locations to find dictionaries and configuration
|
197
|
-
files. By default, these are the current directory, your personal Lingo
|
372
|
+
files. By default, these are the current working directory, your personal Lingo
|
198
373
|
directory (<tt>~/.lingo</tt>) and the installation directory (in that order).
|
199
374
|
You can control this lookup path by either moving files up the chain (using
|
200
375
|
the +lingoctl+ executable) or by setting various environment variables.
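The effect of the lookup order can be pictured as "first match wins". The following sketch is an illustration of that idea, not Lingo's actual resolution code: +find_in_lookup_path+ is a hypothetical helper, and <tt>/path/to/lingo</tt> stands in for the installation directory.

```ruby
# Hypothetical helper: returns the first existing file called +name+ in
# +dirs+ (searched in the given order), or nil if none of them has it.
def find_in_lookup_path(name, dirs)
  dirs.map { |dir| File.join(dir, name) }.find { |path| File.file?(path) }
end

# Default search order as described above. The last entry is a placeholder
# for the installation directory, which depends on your gem setup.
dirs = [
  Dir.pwd,
  File.join(ENV.fetch('HOME', Dir.pwd), '.lingo'),
  '/path/to/lingo'
]

puts find_in_lookup_path('lingo.cfg', dirs) || 'lingo.cfg not found'
```

A file copied to an earlier directory in the list shadows the same file further down, which is why modified copies take precedence over the originals.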
|
201
376
|
|
202
377
|
With +lingoctl+ you can copy dictionaries and configuration files from your
|
203
|
-
personal Lingo directory or the installation directory to the current
|
378
|
+
personal Lingo directory or the installation directory to the current working
|
204
379
|
directory so you can modify them and they will take precedence over the
|
205
380
|
original ones. See <tt>lingoctl --help</tt> for usage information.
|
206
381
|
|
207
|
-
In order to change the search path
|
382
|
+
In order to change the search path itself, you can define the
|
208
383
|
+LINGO_PATH+ environment variable as a whole or its individual parts
|
209
384
|
+LINGO_CURR+ (the local Lingo directory), +LINGO_HOME+ (your personal
|
210
385
|
Lingo directory), and +LINGO_BASE+ (the system-wide Lingo directory).
|
211
386
|
|
212
|
-
Inside of any of these directories dictionaries and configuration files are
|
387
|
+
Inside of any of these directories, dictionaries and configuration files are
|
213
388
|
typically organized in the following directory structure:
|
214
389
|
|
215
|
-
<tt>config
|
216
|
-
<tt>dict
|
217
|
-
|
218
|
-
<tt>lang
|
219
|
-
<tt>store
|
390
|
+
<tt>config/</tt>:: Configuration files (<tt>*.cfg</tt>).
|
391
|
+
<tt>dict/</tt>:: Dictionary source files (<tt>*.txt</tt>) in
|
392
|
+
language-specific subdirectories (+de/+, +en/+, ...).
|
393
|
+
<tt>lang/</tt>:: Language definition files (<tt>*.lang</tt>).
|
394
|
+
<tt>store/</tt>:: Compiled dictionaries, generated from source files.
|
220
395
|
|
221
396
|
But for compatibility reasons these naming conventions are not enforced.
|
222
397
|
|
@@ -224,14 +399,14 @@ But for compatibility reasons these naming conventions are not enforced.
|
|
224
399
|
|
225
400
|
As Lingo 1.8 introduced some major disruptions and no longer runs on Ruby 1.8,
|
226
401
|
there is a maintenance branch for Lingo 1.7.x that will remain compatible with
|
227
|
-
both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch
|
402
|
+
both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch may
|
228
403
|
receive occasional bug fixes and minor feature updates. However, the bulk of
|
229
404
|
the development efforts will be directed towards Lingo 1.8+.
|
230
405
|
|
231
|
-
To install the legacy version, download and extract the
|
232
|
-
|
233
|
-
are required. This version of Lingo works
|
234
|
-
and 1.9 (1.9.2 or higher).
|
406
|
+
To install the legacy version, download and extract the
|
407
|
+
{ZIP archive}[http://ixtrieve.fh-koeln.de/buch/lingo-1.7.1.zip].
|
408
|
+
No additional dependencies are required. This version of Lingo works
|
409
|
+
with both Ruby 1.8 (1.8.5 or higher) and 1.9 (1.9.2 or higher).
|
235
410
|
|
236
411
|
The executable is named +lingo.rb+. It's located at the root of the installation
|
237
412
|
directory and may only be run from there. See <tt>ruby lingo.rb -h</tt> for
|
@@ -239,49 +414,116 @@ usage instructions.
|
|
239
414
|
|
240
415
|
Configuration and language definition files are also located at the root of the
|
241
416
|
installation directory (<tt>*.cfg</tt> and <tt>*.lang</tt>, respectively).
|
242
|
-
Dictionary source files are found in language-specific subdirectories (+de
|
243
|
-
+en
|
244
|
-
beneath these subdirectories in a directory named <tt>store
|
417
|
+
Dictionary source files are found in language-specific subdirectories (+de/+,
|
418
|
+
+en/+, ...) and are named <tt>*.txt</tt>. The compiled dictionaries are found
|
419
|
+
beneath these language subdirectories in a directory named <tt>store/</tt>.
|
245
420
|
|
246
421
|
|
247
422
|
== FILE FORMATS
|
248
423
|
|
249
|
-
Lingo uses three different types of files to determine its behaviour
|
250
|
-
|
251
|
-
|
252
|
-
|
253
|
-
|
424
|
+
Lingo uses three different types of files to determine its behaviour:
|
425
|
+
{configuration files}[rdoc-label:label-Configuration] control the details of the
|
426
|
+
indexing process; {language definitions}[rdoc-label:label-Language+definition]
|
427
|
+
specify grammar rules and dictionaries available for indexing;
|
428
|
+
dictionaries[rdoc-label:label-Dictionaries], finally, hold the
|
429
|
+
vocabulary used in indexing the input text and producing the results.
|
254
430
|
|
255
431
|
=== Configuration
|
256
432
|
|
257
|
-
|
433
|
+
Configuration files are defined in the YAML[http://yaml.org/] syntax. They
|
434
|
+
specify the attendees[rdoc-label:label-Attendees] to call in order and the
|
435
|
+
options to provide them with. The first attendee in any indexing process is
|
436
|
+
the text_reader[rdoc-ref:Lingo::Attendee::TextReader], who reads the input
|
437
|
+
text and passes it on to the other attendees. Every attendee transforms or
|
438
|
+
extends the input stream and automatically sends everything down to the next
|
439
|
+
attendee. This process may be customized by explicitly specifying the input
|
440
|
+
and/or output channels of individual attendees with the +in+ and +out+ options.
|
441
|
+
|
442
|
+
_Example_:
|
443
|
+
|
444
|
+
# input is taken from the previous attendee,
|
445
|
+
# output is sent to the named channel "syn"
|
446
|
+
- synonymer: { skip: '?,t', source: sys-syn, out: syn }
|
447
|
+
|
448
|
+
# input is taken from the named channel "syn",
|
449
|
+
# output is sent to the next attendee
|
450
|
+
- vector_filter: { in: syn, lexicals: y, sort: term_abs }
|
451
|
+
|
452
|
+
# input is taken from the previous attendee,
|
453
|
+
# output is sent to the next attendee
|
454
|
+
- text_writer: { ext: syn, sep: "\n" }
|
455
|
+
|
456
|
+
# input is taken from the named channel "syn"
|
457
|
+
# (ignoring the output of the previous attendee),
|
458
|
+
# output is sent to the next attendee
|
459
|
+
- vector_filter: { in: syn, lexicals: m }
|
460
|
+
|
461
|
+
# input is taken from the previous attendee,
|
462
|
+
# output is sent to the next attendee
|
463
|
+
- text_writer: { ext: mul, sep: "\n" }
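The channel wiring described in the comments above can be traced with a small simulation. This is a simplified model, not Lingo's implementation: an attendee without +in+ listens to the previous attendee's output, and an attendee without +out+ gets an automatically named channel. The names <tt>auto-N</tt>, +ATTENDEES+ and +wire+ are hypothetical.

```ruby
require 'yaml'

# The attendee list from the example above.
ATTENDEES = <<~'YAML'
  - synonymer:     { skip: '?,t', source: sys-syn, out: syn }
  - vector_filter: { in: syn, lexicals: y, sort: term_abs }
  - text_writer:   { ext: syn, sep: "\n" }
  - vector_filter: { in: syn, lexicals: m }
  - text_writer:   { ext: mul, sep: "\n" }
YAML

# Hypothetical helper: resolves the input and output channel of every
# attendee. Unnamed channels get made-up names of the form "auto-N".
def wire(attendees, first_input: 'auto-0')
  previous_out = first_input
  attendees.each_with_index.map do |entry, index|
    name, options = entry.first
    input  = options.fetch('in', previous_out)
    output = options.fetch('out', "auto-#{index + 1}")
    previous_out = output
    { name: name, in: input, out: output }
  end
end

wire(YAML.safe_load(ATTENDEES)).each do |a|
  puts format('%-13s %-6s -> %s', a[:name], a[:in], a[:out])
end
```

Both +vector_filter+ entries read from the channel +syn+, so the second one sees the synonymer's output rather than what the first +text_writer+ passed on.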
|
258
464
|
|
259
465
|
=== Language definition
|
260
466
|
|
261
|
-
|
467
|
+
Language definitions, like {configuration files}[rdoc-label:label-Configuration],
|
468
|
+
are defined in the YAML[http://yaml.org/] syntax. They specify the
|
469
|
+
dictionaries[rdoc-label:label-Dictionaries] to be used as well as the grammar
|
470
|
+
rules according to which the input shall be processed. These settings do not
|
471
|
+
necessarily have to coincide with an existing language; they are
|
472
|
+
application-specific.
|
262
473
|
|
263
474
|
=== Dictionaries
|
264
475
|
|
476
|
+
Dictionaries come in different varieties and encode the knowledge about the
|
477
|
+
vocabulary used for indexing and analysis.
|
478
|
+
|
479
|
+
Supported dictionary formats:
|
480
|
+
|
481
|
+
+SingleWord+:: One word (projection) per line. E.g. <tt>open source</tt>. (see
|
482
|
+
Lingo::Database::Source::SingleWord)
|
483
|
+
+MultiValue+:: Multiple words per line (separated with a unique symbol), all of
|
484
|
+
which are interpreted as belonging to a single equivalence class.
|
485
|
+
E.g. <tt>fax;telefax;facsimile</tt>. (see
|
486
|
+
Lingo::Database::Source::MultiValue)
|
487
|
+
+MultiKey+:: Similar to +MultiValue+, except that the first word will be
|
488
|
+
treated as the preferred term (descriptor). E.g.
|
489
|
+
<tt>fax;telefax;facsimile</tt>. (see
|
490
|
+
Lingo::Database::Source::MultiKey)
|
491
|
+
+KeyValue+:: One word and its associated projection per line, separated with
|
492
|
+
a unique symbol. E.g. <tt>abfrage*query</tt>. (see
|
493
|
+
Lingo::Database::Source::KeyValue)
|
494
|
+
+WordClass+:: Similar to +KeyValue+, except that the projection may consist of
|
495
|
+
multiple lexicalizations, each with its own word class and
|
496
|
+
(optional) gender information. E.g. <tt>abort,abort #s|v</tt>,
|
497
|
+
which is equivalent to <tt>abort,abort #s abort #v</tt>. (see
|
498
|
+
Lingo::Database::Source::WordClass)
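The example lines above can be taken apart as follows. This is a sketch for illustration, not the code in <tt>Lingo::Database::Source</tt>: the separators are copied from the examples (in a real setup they are whatever the language definition declares), gender information is not handled, and all method names are hypothetical.

```ruby
# MultiValue: every word on the line belongs to one equivalence class.
def parse_multi_value(line, sep = ';')
  line.strip.split(sep)
end

# MultiKey: the first word is the preferred term for all the others.
def parse_multi_key(line, sep = ';')
  descriptor, *variants = line.strip.split(sep)
  variants.to_h { |variant| [variant, descriptor] }
end

# KeyValue: one word and its projection.
def parse_key_value(line, sep = '*')
  line.strip.split(sep, 2)
end

# WordClass: a key plus lexicalizations of the form "form #classes"; several
# classes joined with "|" expand to one lexicalization per class.
def parse_word_class(line, sep = ',')
  key, projection = line.strip.split(sep, 2)
  lexicals = projection.scan(/(\S.*?)\s+#(\S+)/).flat_map do |form, classes|
    classes.split('|').map { |word_class| [form, word_class] }
  end
  [key, lexicals]
end

p parse_multi_value('fax;telefax;facsimile')
p parse_multi_key('fax;telefax;facsimile')
p parse_key_value('abfrage*query')
p parse_word_class('abort,abort #s|v')
```

The short and the long +WordClass+ notation yield the same result here, which mirrors the equivalence stated above.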
|
499
|
+
|
500
|
+
==== Encoding word classes and gender information
|
501
|
+
|
502
|
+
TODO...
|
503
|
+
|
504
|
+
==== Lexicalizing multiword expressions
|
505
|
+
|
506
|
+
TODO...
|
507
|
+
|
508
|
+
==== Lexicalizing compounds
|
509
|
+
|
265
510
|
TODO...
|
266
511
|
|
267
512
|
|
268
513
|
== ISSUES AND CONTRIBUTIONS
|
269
514
|
|
270
|
-
If you find bugs or want to suggest new features, please
|
271
|
-
|
272
|
-
GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
|
515
|
+
If you find bugs or want to suggest new features, please report them
|
516
|
+
on GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
|
273
517
|
version (<tt>ruby --version</tt>) and the version of Lingo you are using
|
274
|
-
(
|
275
|
-
that flag).
|
518
|
+
(<tt>lingo --version</tt>).
|
276
519
|
|
277
|
-
If you want to contribute to Lingo, please fork the project
|
278
|
-
GitHub[http://github.com/lex-lingo/lingo] and submit a
|
279
|
-
{pull request}[http://github.com/lex-lingo/lingo/pulls]
|
280
|
-
|
281
|
-
and send your formatted patch to the {developer list}[mailto:lingo-core@rubyforge.org].
|
520
|
+
If you want to contribute to Lingo, please fork the project
|
521
|
+
on GitHub[http://github.com/lex-lingo/lingo] and submit a
|
522
|
+
{pull request}[http://github.com/lex-lingo/lingo/pulls]
|
523
|
+
(bonus points for topic branches).
|
282
524
|
|
283
525
|
To make sure that Lingo's tests pass, install hen[http://blackwinter.github.com/hen]
|
284
|
-
(typically <tt>gem install hen</tt>) and all development dependencies (either
|
526
|
+
(typically <tt>gem install hen</tt>) and all development dependencies (either with
|
285
527
|
<tt>gem install --development lingo</tt> or manually; see <tt>rake gem:dependencies</tt>).
|
286
528
|
Then run <tt>rake test</tt> for the basic tests or <tt>rake test:all</tt> for
|
287
529
|
the full test suite.
|
@@ -289,28 +531,35 @@ the full test suite.
|
|
289
531
|
|
290
532
|
== LINKS
|
291
533
|
|
292
|
-
|
293
|
-
|
294
|
-
|
295
|
-
|
296
|
-
|
297
|
-
|
298
|
-
|
299
|
-
Mailing list:: http://rubyforge.org/mailman/listinfo/lingo-users
|
300
|
-
Bug tracker:: http://github.com/lex-lingo/lingo/issues
|
534
|
+
Website:: http://lex-lingo.de
|
535
|
+
Demo:: http://ixtrieve.fh-koeln.de/lingoweb
|
536
|
+
Documentation:: https://lex-lingo.github.com/lingo
|
537
|
+
Source code:: https://github.com/lex-lingo/lingo
|
538
|
+
RubyGem:: https://rubygems.org/gems/lingo
|
539
|
+
Bug tracker:: https://github.com/lex-lingo/lingo/issues
|
540
|
+
Travis CI:: https://travis-ci.org/lex-lingo/lingo
|
301
541
|
|
302
542
|
|
303
543
|
== LITERATURE
|
304
544
|
|
305
|
-
|
306
|
-
|
307
|
-
* Gödert, W
|
545
|
+
=== Background and Theory
|
546
|
+
|
547
|
+
* Gödert, W.; Lepsky, K.; Nagelschmidt, M.: <em>{Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch}[http://dx.doi.org/10.1007/978-3-642-23513-9]</em>. (German) Berlin etc.: Springer, 2012.
|
548
|
+
* Lepsky, K.; Vorhauer, J.: <em>{Lingo: ein open source System für die automatische Indexierung deutschsprachiger Dokumente}[http://dx.doi.org/10.1515/ABITECH.2006.26.1.18]</em>. (German) In: ABI Technik 26 (1), 2006. pp 18-29.
|
308
549
|
* Nohr, H.: <em>{Grundlagen der automatischen Indexierung: ein Lehrbuch}[http://logos-verlag.de/cgi-bin/buch/isbn/0121]</em>. (German) Berlin: Logos, 2005.
|
309
|
-
* Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://
|
310
|
-
* Allen, J.: <em>{Natural language understanding}[http://
|
311
|
-
* Grishman, R.: <em>{Computational linguistics: an introduction}[http://
|
312
|
-
* Salton, G
|
313
|
-
* Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14, 1980.
|
550
|
+
* Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://zbmath.org/?q=an:0956.68141]</em>. (German) Berlin etc.: Springer, 2000.
|
551
|
+
* Allen, J.: <em>{Natural language understanding}[http://zbmath.org/?q=an:0851.68106]</em>. (English) Redwood City, CA: Benjamin/Cummings, 1995.
|
552
|
+
* Grishman, R.: <em>{Computational linguistics: an introduction}[http://cambridge.org/9780521310383]</em>. (English) Cambridge: Cambridge Univ. Press, 1986.
|
553
|
+
* Salton, G.; McGill, M.: <em>{Introduction to modern information retrieval}[http://zbmath.org/?q=an:0523.68084]</em>. (English) New York etc.: McGraw-Hill, 1983.
|
554
|
+
* Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14 (3), 1980. pp 130-137.
|
555
|
+
|
556
|
+
=== Research publications
|
557
|
+
|
558
|
+
* Bredack, J.; Lepsky, K.: <em>{Automatische Extraktion von Fachterminologie aus Volltexten}[http://dx.doi.org/10.1515/abitech-2014-0002]</em>. (German) In: ABI Technik 34 (1), 2014. pp 2-12.
|
559
|
+
* Bredack, J.: <em>{Terminologieextraktion von Mehrwortgruppen in kunsthistorischen Fachtexten}[http://ixtrieve.fh-koeln.de/lehre/bredack-2013.pdf]</em>. (German) Köln: Fachhochschule Köln, 2013.
|
560
|
+
* Maylein, L.; Langenstein, A.: <em>{Neues vom Relevanz-Ranking im HEIDI-Katalog der Universitätsbibliothek Heidelberg}[http://b-i-t-online.de/heft/2013-03-fachbeitrag-maylein.pdf]</em>. (German) In: b.i.t.online 16 (3), 2013. pp 190-200.
|
561
|
+
* Gödert, W.: <em>{Detecting multiword phrases in mathematical text corpora}[http://arxiv.org/abs/1210.0852]</em>. (English) arXiv:1210.0852 [cs.CL], 2012.
|
562
|
+
* Schiffer, R.: <em>{Automatisches Indexieren technischer Kongressschriften}[http://ixtrieve.fh-koeln.de/lehre/schiffer-2007.pdf]</em>. (German) Köln: Fachhochschule Köln, 2007.
|
314
563
|
|
315
564
|
|
316
565
|
== CREDITS
|
@@ -333,7 +582,7 @@ Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
|
|
333
582
|
== LICENSE AND COPYRIGHT
|
334
583
|
|
335
584
|
Copyright (C) 2005-2007 John Vorhauer
|
336
|
-
Copyright (C) 2007-
|
585
|
+
Copyright (C) 2007-2014 John Vorhauer, Jens Wille
|
337
586
|
|
338
587
|
Lingo is free software: you can redistribute it and/or modify it under the
|
339
588
|
terms of the GNU Affero General Public License as published by the Free
|