lingo 1.8.4.2 → 1.8.5

Files changed (89)
  1. checksums.yaml +4 -4
  2. data/ChangeLog +413 -325
  3. data/README +380 -131
  4. data/Rakefile +19 -21
  5. data/de/lingo-abk.txt +15 -17
  6. data/de/lingo-dic.txt +20210 -20659
  7. data/de/lingo-mul.txt +5 -13
  8. data/de/lingo-syn.txt +5 -8
  9. data/de/test_dic.txt +2 -0
  10. data/de/test_gen.txt +8 -0
  11. data/de/{test_mul2.txt → test_mu2.txt} +0 -0
  12. data/de/{test_singleword.txt → test_sgw.txt} +0 -0
  13. data/de/user-dic.txt +5 -7
  14. data/de.lang +64 -49
  15. data/en/lingo-dic.txt +6398 -6404
  16. data/en/lingo-irr.txt +2 -3
  17. data/en/lingo-mul.txt +6 -7
  18. data/en/lingo-wdn.txt +881 -1762
  19. data/en/user-dic.txt +2 -5
  20. data/en.lang +39 -39
  21. data/lib/lingo/app.rb +10 -6
  22. data/lib/lingo/attendee/abbreviator.rb +1 -0
  23. data/lib/lingo/attendee/decomposer.rb +2 -1
  24. data/lib/lingo/attendee/multi_worder.rb +5 -6
  25. data/lib/lingo/attendee/stemmer.rb +1 -1
  26. data/lib/lingo/attendee/synonymer.rb +4 -2
  27. data/lib/lingo/attendee/text_reader.rb +77 -57
  28. data/lib/lingo/attendee/text_writer.rb +1 -1
  29. data/lib/lingo/attendee/tokenizer.rb +101 -50
  30. data/lib/lingo/attendee/variator.rb +2 -1
  31. data/lib/lingo/attendee/vector_filter.rb +28 -6
  32. data/lib/lingo/attendee/word_searcher.rb +2 -1
  33. data/lib/lingo/attendee.rb +8 -4
  34. data/lib/lingo/call.rb +7 -3
  35. data/lib/lingo/cli.rb +8 -16
  36. data/lib/lingo/config.rb +11 -6
  37. data/lib/lingo/ctl.rb +54 -3
  38. data/lib/lingo/database/crypter.rb +8 -14
  39. data/lib/lingo/database/hash_store.rb +1 -1
  40. data/lib/lingo/database/{show_progress.rb → progress.rb} +7 -8
  41. data/lib/lingo/database/source/key_value.rb +6 -5
  42. data/lib/lingo/database/source/multi_key.rb +5 -2
  43. data/lib/lingo/database/source/multi_value.rb +6 -4
  44. data/lib/lingo/database/source/single_word.rb +2 -3
  45. data/lib/lingo/database/source/word_class.rb +24 -5
  46. data/lib/lingo/database/source.rb +5 -3
  47. data/lib/lingo/database.rb +102 -41
  48. data/lib/lingo/error.rb +24 -2
  49. data/lib/lingo/language/dictionary.rb +26 -54
  50. data/lib/lingo/language/grammar.rb +19 -23
  51. data/lib/lingo/language/lexical.rb +5 -1
  52. data/lib/lingo/language/lexical_hash.rb +7 -12
  53. data/lib/lingo/language/token.rb +10 -1
  54. data/lib/lingo/language/word.rb +35 -23
  55. data/lib/lingo/language/word_form.rb +5 -4
  56. data/lib/lingo/{show_progress.rb → progress.rb} +43 -30
  57. data/lib/lingo/srv/lingosrv.cfg +1 -1
  58. data/lib/lingo/srv/public/.gitkeep +0 -0
  59. data/lib/lingo/srv.rb +11 -6
  60. data/lib/lingo/version.rb +2 -2
  61. data/lib/lingo/web/lingoweb.cfg +1 -1
  62. data/lib/lingo/web/views/index.erb +4 -4
  63. data/lib/lingo/web.rb +4 -6
  64. data/lib/lingo.rb +4 -12
  65. data/lingo.cfg +1 -1
  66. data/lir.cfg +1 -1
  67. data/ru/lingo-dic.txt +33473 -2113
  68. data/ru/lingo-mul.txt +8430 -1913
  69. data/ru/lingo-syn.txt +1634 -0
  70. data/ru/user-dic.txt +6 -0
  71. data/ru.lang +49 -47
  72. data/spec/spec_helper.rb +4 -0
  73. data/test/attendee/ts_decomposer.rb +2 -2
  74. data/test/attendee/ts_synonymer.rb +3 -3
  75. data/test/attendee/ts_tokenizer.rb +215 -2
  76. data/test/attendee/ts_variator.rb +2 -2
  77. data/test/attendee/ts_word_searcher.rb +10 -6
  78. data/test/ref/artikel.seq +2 -2
  79. data/test/ref/artikel.vec +5 -5
  80. data/test/ref/artikel.ven +11 -11
  81. data/test/ref/artikel.ver +11 -11
  82. data/test/ref/lir.seq +13 -13
  83. data/test/ref/lir.vec +31 -31
  84. data/test/test_helper.rb +19 -5
  85. data/test/ts_database.rb +206 -77
  86. data/test/ts_language.rb +86 -26
  87. metadata +93 -49
  88. data/.rspec +0 -1
  89. data/de/test_syn2.txt +0 -1
data/README CHANGED
@@ -1,6 +1,5 @@
1
1
  = Lingo - A full-featured automatic indexing system
2
2
 
3
- <b></b>
4
3
  * {Version}[rdoc-label:label-VERSION]
5
4
  * {Description}[rdoc-label:label-DESCRIPTION]
6
5
  * {Introduction}[rdoc-label:label-Introduction]
@@ -9,6 +8,10 @@
9
8
  * {Markup}[rdoc-label:label-Markup]
10
9
  * {Inline annotation}[rdoc-label:label-Inline+annotation]
11
10
  * {Plugins}[rdoc-label:label-Plugins]
11
+ * {Server}[rdoc-label:label-Server]
12
+ * {JSON endpoint}[rdoc-label:label-JSON+endpoint]
13
+ * {Raw endpoint}[rdoc-label:label-Raw+endpoint]
14
+ * {Deployment}[rdoc-label:label-Deployment]
12
15
  * {Example}[rdoc-label:label-EXAMPLE]
13
16
  * {Installation and Usage}[rdoc-label:label-INSTALLATION+AND+USAGE]
14
17
  * {Dictionary and configuration file lookup}[rdoc-label:label-Dictionary+and+configuration+file+lookup]
@@ -17,15 +20,22 @@
17
20
  * {Configuration}[rdoc-label:label-Configuration]
18
21
  * {Language definition}[rdoc-label:label-Language+definition]
19
22
  * {Dictionaries}[rdoc-label:label-Dictionaries]
23
+ * {Encoding word classes and gender information}[rdoc-label:label-Encoding+word+classes+and+gender+information]
24
+ * {Lexicalizing multiword expressions}[rdoc-label:label-Lexicalizing+multiword+expressions]
25
+ * {Lexicalizing compounds}[rdoc-label:label-Lexicalizing+compounds]
20
26
  * {Issues and Contributions}[rdoc-label:label-ISSUES+AND+CONTRIBUTIONS]
21
27
  * {Links}[rdoc-label:label-LINKS]
22
28
  * {Literature}[rdoc-label:label-LITERATURE]
29
+ * {Background and Theory}[rdoc-label:label-Background+and+Theory]
30
+ * {Research publications}[rdoc-label:label-Research+publications]
23
31
  * {Credits}[rdoc-label:label-CREDITS]
32
+ * {Authors}[rdoc-label:label-Authors]
33
+ * {Contributors}[rdoc-label:label-Contributors]
24
34
  * {License and Copyright}[rdoc-label:label-LICENSE+AND+COPYRIGHT]
25
35
 
26
36
  == VERSION
27
37
 
28
- This documentation refers to Lingo version 1.8.4
38
+ This documentation refers to Lingo version 1.8.5
29
39
 
30
40
 
31
41
  == DESCRIPTION
@@ -36,57 +46,54 @@ functions of Lingo are:
36
46
  * identification of (i.e. reduction to) basic word form by means of dictionaries
37
47
  and suffix lists
38
48
  * algorithmic decomposition
39
- * dictionary-based synonymisation and identification of phrases
49
+ * dictionary-based synonymization and identification of phrases
40
50
  * generic identification of phrases/word sequences based on patterns of word
41
51
  classes
42
52
 
43
53
  === Introduction
44
54
 
45
- If you want to perform linguistic analysis on some text, Lingo is there to
46
- support your endeavour with all its flexibility and extendability. Lingo
47
- enables you to assemble a network of practically unlimited functionality
48
- from modules with limited functions. This network is built by configuration
49
- files. Here's a minimal example:
55
+ Lingo allows flexible and extendable linguistic analysis of text files. Here
56
+ is a minimal configuration example to analyse this README file:
50
57
 
51
58
  meeting:
52
59
  attendees:
53
60
  - text_reader: { files: 'README' }
54
61
  - debugger: { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
55
62
 
56
- Lingo is told to invite two attendees. And Lingo wants them to talk to each
57
- other, hence the name Lingo (= the technical language).
63
+ Lingo is told to invite two attendees and wants them to talk to each other,
64
+ hence the name Lingo (= the technical language).
58
65
 
59
- The first attendee is the +text_reader+ (Lingo::Attendee::TextReader). It can
60
- read files (as well as standard input) and communicate its content to other
61
- attendees. For this purpose the +text_reader+ is given an output channel.
62
- Everything that the +text_reader+ has to say is steered through this channel.
63
- It will do nothing further until Lingo will tell the first attendee to speak.
64
- Then the +text_reader+ will open the file +README+ (<tt>files</tt> parameter)
65
- and babble the content to the world via its output channel.
66
+ The first attendee is the text_reader[rdoc-ref:Lingo::Attendee::TextReader].
67
+ It can read files and communicates their content to other attendees. For this
68
+ purpose, the +text_reader+ is given an output channel. Everything that the
69
+ +text_reader+ has to say is steered through this channel. It will do nothing
70
+ further until Lingo tells the first attendee to speak. Then the +text_reader+
71
+ will open the file +README+ (as per the +files+ parameter) and pass the content
72
+ to the other attendees via its output channel.
66
73
 
67
- The second attendee +debugger+ (Lingo::Attendee::Debugger) does nothing else
68
- than to put everything on the console (standard error, actually) that comes
69
- into its input channel. If you write the Lingo configuration which is shown
70
- above as an example into the file <tt>readme.cfg</tt> and then run <tt>lingo
71
- -c readme -l en</tt>, the result will look something like this:
74
+ The second attendee, debugger[rdoc-ref:Lingo::Attendee::Debugger], does nothing
75
+ else than to put everything on the console (standard error) that comes into its
76
+ input channel. If you write the Lingo configuration which is shown above as an
77
+ example into the file <tt>readme.cfg</tt> and then run <tt>lingo -c readme -l en</tt>,
78
+ the result will look something like this:
72
79
 
73
80
  <debug>: *FILE('README')
74
81
  <debug>: "= Lingo - [...]"
75
82
  ...
76
- <debug>: "If you want to perform linguistic analysis on some text, [...]"
77
- <debug>: "support your endeavour with all its flexibility and [...]"
83
+ <debug>: "Lingo allows flexible and extendable linguistic analysis [...]"
84
+ <debug>: "is a minimal configuration example to analyse this README [...]"
78
85
  ...
79
86
  <debug>: *EOF('README')
80
87
 
81
- What we see are lines with an asterisk (*) and lines without. That's because
82
- Lingo distinguishes between commands and data. The +text_reader+ did not only
83
- read the content of the file, but also communicated through the commands when
84
- a file begins and when it ends. This can (and will) be an important piece of
85
- information for other attendees that will be added later.
88
+ What we see are lines beginning with an asterisk (<tt>*</tt>) and lines without.
89
+ That's because Lingo distinguishes between commands and data. The +text_reader+
90
+ did not only read the content of the file, but also communicated through the
91
+ commands when a file began and when it ended. This can (and will) be an
92
+ important piece of information for other attendees that will be added later.
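The distinction between commands and data can be illustrated with a small Ruby sketch that classifies lines of the debugger output shown above. This is not part of Lingo's API: the method name +classify_debug_line+ is hypothetical, and the line format (prompt, then <tt>*NAME('argument')</tt> for commands) is assumed from the sample output only. Commands without an argument are handled on a best-guess basis.

```ruby
# Hypothetical helper, not part of Lingo: splits debugger output lines
# into commands (leading asterisk) and data (everything else).
DEBUG_PROMPT = '<debug>: '

def classify_debug_line(line, prompt = DEBUG_PROMPT)
  body = line.chomp.delete_prefix(prompt)

  # Assumed command shape: *NAME or *NAME('argument')
  if (m = body.match(/\A\*(\w+)(?:\('(.*)'\))?\z/))
    { type: :command, name: m[1], argument: m[2] }
  else
    { type: :data, text: body }
  end
end

sample = [
  "<debug>: *FILE('README')",
  '<debug>: "= Lingo - [...]"',
  "<debug>: *EOF('README')"
]

sample.each { |line| p classify_debug_line(line) }
```

An attendee further down the chain could use the same distinction to reset per-file state when it sees a +FILE+ command and to flush results when it sees +EOF+.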
86
93
 
87
94
  To try out Lingo's functionality without installing it first, have a look at
88
95
  {Lingo Web}[http://ixtrieve.fh-koeln.de/lingoweb]. There you can enter some
89
- text and see the debug output Lingo generated, including tokenization, word
96
+ text and see the debug output Lingo generated -- including tokenization, word
90
97
  identification, decomposition, etc.
91
98
 
92
99
  === Attendees
@@ -94,35 +101,42 @@ identification, decomposition, etc.
94
101
  Available attendees that can be used for solving a specific problem (for more
95
102
  information see each attendee's documentation):
96
103
 
97
- <tt>text_reader</tt>:: Reads files and puts their content into the channels line by
98
- line. (Lingo::Attendee::TextReader)
99
- <tt>tokenizer</tt>:: Dissects lines into defined character strings, i.e. tokens.
100
- (Lingo::Attendee::Tokenizer)
101
- <tt>abbreviator</tt>:: Identifies abbreviations and produces the long form if listed
102
- in a dictionary. (Lingo::Attendee::Abbreviator)
103
- <tt>word_searcher</tt>:: Identifies tokens and turns them into words for further
104
- processing. To do this right it looks into the dictionary.
105
- (Lingo::Attendee::WordSearcher)
106
- <tt>decomposer</tt>:: Tests any character strings not identified by the +word_searcher+
107
- for being compounds. (Lingo::Attendee::Decomposer)
108
- <tt>synonymer</tt>:: Extends words with synonyms. (Lingo::Attendee::Synonymer)
109
- <tt>noneword_filter</tt>:: Filters out everything and lets through only those tokens that
110
- are unknown. (Lingo::Attendee::NonewordFilter)
111
- <tt>vector_filter</tt>:: Filters out everything and lets through only those tokens that are
112
- considered useful for indexing. (Lingo::Attendee::VectorFilter)
113
- <tt>object_filter</tt>:: Similar to the +vector_filter+. (Lingo::Attendee::ObjectFilter)
114
- <tt>text_writer</tt>:: Writes anything that it receives into a file (or to standard
115
- output). (Lingo::Attendee::TextWriter)
116
- <tt>formatter</tt>:: Similar to the +text_writer+, but allows for custom output formats.
117
- (Lingo::Attendee::Formatter)
118
- <tt>debugger</tt>:: Shows everything for debugging. (Lingo::Attendee::Debugger)
119
- <tt>variator</tt>:: Tries to correct spelling errors and the like.
120
- (Lingo::Attendee::Variator)
121
- <tt>dehyphenizer</tt>:: Tries to undo hyphenation. (Lingo::Attendee::Dehyphenizer)
122
- <tt>multi_worder</tt>:: Identifies phrases (word sequences) based on a multiword
123
- dictionary. (Lingo::Attendee::MultiWorder)
124
- <tt>sequencer</tt>:: Identifies phrases (word sequences) based on patterns of word
125
- classes. (Lingo::Attendee::Sequencer)
104
+ +text_reader+:: Reads files (or standard input) and puts their content into
105
+ the channels line by line. (see Lingo::Attendee::TextReader)
106
+ +tokenizer+:: Dissects lines into defined character strings, i.e. tokens.
107
+ (see Lingo::Attendee::Tokenizer)
108
+ +abbreviator+:: Identifies abbreviations and produces the long form if
109
+ listed in a dictionary. (see Lingo::Attendee::Abbreviator)
110
+ +word_searcher+:: Identifies tokens and turns them into words for further
111
+ processing. To this end, it consults the dictionaries.
112
+ (see Lingo::Attendee::WordSearcher)
113
+ +stemmer+:: Identifies tokens not identified by the +word_searcher+ by
114
+ means of stemming. (see Lingo::Attendee::Stemmer)
115
+ +decomposer+:: Tests any tokens not identified by the +word_searcher+ for
116
+ being compounds. (see Lingo::Attendee::Decomposer)
117
+ +synonymer+:: Extends words with their synonyms. (see
118
+ Lingo::Attendee::Synonymer)
119
+ +noneword_filter+:: Filters out everything and lets through only those tokens
120
+ that are unknown. (see Lingo::Attendee::NonewordFilter)
121
+ +vector_filter+:: Filters out everything and lets through only those tokens
122
+ that are considered useful for indexing. (see
123
+ Lingo::Attendee::VectorFilter)
124
+ +object_filter+:: Similar to the +vector_filter+. (see
125
+ Lingo::Attendee::ObjectFilter)
126
+ +text_writer+:: Writes anything that it receives into a file (or to
127
+ standard output). (see Lingo::Attendee::TextWriter)
128
+ +formatter+:: Similar to the +text_writer+, but allows for custom output
129
+ formats. (see Lingo::Attendee::Formatter)
130
+ +debugger+:: Shows everything for debugging. (see
131
+ Lingo::Attendee::Debugger)
132
+ +variator+:: Tries to correct spelling errors and the like. (see
133
+ Lingo::Attendee::Variator)
134
+ +dehyphenizer+:: Tries to undo hyphenation. (see
135
+ Lingo::Attendee::Dehyphenizer)
136
+ +multi_worder+:: Identifies phrases (word sequences) based on a multiword
137
+ dictionary. (see Lingo::Attendee::MultiWorder)
138
+ +sequencer+:: Identifies phrases (word sequences) based on patterns of
139
+ word classes. (see Lingo::Attendee::Sequencer)
126
140
 
127
141
  Furthermore, it may be useful to have a look at the configuration files
128
142
  <tt>lingo.cfg</tt> and <tt>en.lang</tt>.
@@ -131,33 +145,193 @@ Furthermore, it may be useful to have a look at the configuration files
131
145
 
132
146
  Lingo is able to read HTML, XML, and PDF in addition to plain text.
133
147
 
134
- TODO: Examples.
148
+ _Examples_:
149
+
150
+ Read any file, guessing the correct type automatically:
151
+
152
+ - text_reader: { files: $(files), filter: true }
153
+
154
+ Read HTML files specifically (accordingly for XML):
155
+
156
+ - text_reader: { files: $(files), filter: 'html' }
157
+
158
+ Read PDF files, either with the pdf-reader[http://rubygems.org/gems/pdf-reader]
159
+ gem (default):
160
+
161
+ - text_reader: { files: $(files), filter: 'pdf' }
162
+
163
+ or with the pdftotext[http://en.wikipedia.org/wiki/Pdftotext] command line tool:
164
+
165
+ - text_reader: { files: $(files), filter: 'pdftotext' }
135
166
 
136
167
  === Markup
137
168
 
138
- Lingo is able to parse HTML/XML and MediaWiki markup.
169
+ Lingo is able to, in a limited form, parse HTML/XML and
170
+ MediaWiki[http://mediawiki.org/wiki/Help:Formatting] markup.
171
+
172
+ _Examples_:
173
+
174
+ Identify HTML/XML tags in the input stream:
175
+
176
+ - tokenizer: { tags: true }
139
177
 
140
- TODO: Examples.
178
+ Identify MediaWiki markup in the input stream:
179
+
180
+ - tokenizer: { wiki: true }
141
181
 
142
182
  === Inline annotation
143
183
 
144
- Lingo is able to annotate input text inline, instead of printing results to
145
- external files.
184
+ Lingo is able to annotate input text inline, instead of printing results out
185
+ of context to external files.
186
+
187
+ _Example_:
146
188
 
147
- TODO: Examples.
189
+ # keep line endings
190
+ - text_reader: { files: $(files), chomp: false }
191
+ # keep whitespace
192
+ - tokenizer: { space: true }
193
+ # do processing...
194
+ - word_searcher: { source: sys-dic, mode: first }
195
+ # insert formatted results (e.g. "[[Name::lingo|Lingo]] got these [[Noun::word|words]].")
196
+ - formatter: { ext: out, format: '[[%3$s::%2$s|%1$s]]', map: { e: Name, s: Noun } }
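The +format+ option above uses positional placeholders, which behave like Ruby's own <tt>Kernel#format</tt>. The following sketch reproduces the example output from the comment; the argument order (surface form, base form, mapped word class) is inferred from that example rather than taken from the formatter's source, and the method name +annotate+ is made up for illustration.

```ruby
# Format string and word class map copied from the configuration example.
ANNOTATION_FORMAT = '[[%3$s::%2$s|%1$s]]'
WORD_CLASS_LABELS = { 'e' => 'Name', 's' => 'Noun' }

# Hypothetical helper: the argument order is an assumption
# derived from the sample output "[[Name::lingo|Lingo]]".
def annotate(surface, base, word_class)
  label = WORD_CLASS_LABELS.fetch(word_class, word_class)
  format(ANNOTATION_FORMAT, surface, base, label)
end

puts annotate('Lingo', 'lingo', 'e')  # prints [[Name::lingo|Lingo]]
puts annotate('words', 'word', 's')   # prints [[Noun::word|words]]
```

Because the placeholders are numbered, the same three values can be rearranged freely in the format string without touching the rest of the configuration.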
148
197
 
149
198
  === Plugins
150
199
 
151
200
  Lingo has a plugin system that allows you to implement additional features
152
201
  (e.g. add new attendees) or modify existing ones. Just create a file named
153
202
  +lingo_plugin.rb+ in your Gem's +lib+ directory or any directory that's in
154
- <tt>$LOAD_PATH</tt>. You can also define an environment variable +LINGO_PLUGIN_PATH+
155
- (by default <tt>~/.lingo/plugins</tt>) with additional directories to load
156
- plugins from (<tt>*.rb</tt>).
203
+ <tt>$LOAD_PATH</tt>. You can also define an environment variable
204
+ +LINGO_PLUGIN_PATH+ (by default <tt>~/.lingo/plugins</tt>) with additional
205
+ directories to load plugins from (<tt>*.rb</tt>).
157
206
 
158
207
  A dedicated API to support writing and integrating plugins will be added in
159
208
  the future.
160
209
 
210
+ === Server
211
+
212
+ Lingo comes with a server daemon Lingo::Srv that exposes an HTTP interface to
213
+ Lingo's functionality. The configuration needs to ensure that input is read
214
+ from standard input (<tt>files: STDIN</tt> on +text_reader+) and output is
215
+ written to standard output (<tt>ext: STDOUT</tt> on +text_writer+).
216
+
217
+ _Example_: Start Lingo server on port 6789 with language configuration +en+
218
+ and default configuration file; server options come before <tt>--</tt>, Lingo
219
+ options come after.
220
+
221
+ > lingosrv -p 6789 -- -l en
222
+
223
+ You can also pass Lingo options through the +LINGO_SRV_OPTS+ environment
224
+ variable (e.g., <tt>LINGO_SRV_OPTS='-l en -c /path/to/your/srv.cfg'</tt>).
225
+
226
+ ==== JSON endpoint
227
+
228
+ _Example_: Ask the server about "Lingo server"; returns JSON data (output
229
+ formatted for clarity).
230
+
231
+ > curl 'http://localhost:6789/?q=Lingo+server'
232
+ {
233
+ "Lingo server" : [
234
+ " <Lingo = [(lingo/s), (lingo/e)]>",
235
+ " <server = [(server/s)]>"
236
+ ]
237
+ }
238
+
239
+ _Example_: Ask the server about "Lingo" and "server"; returns JSON data (output
240
+ formatted for clarity).
241
+
242
+ > curl -g 'http://localhost:6789/?q[]=Lingo&q[]=server'
243
+ {
244
+ "[\"Lingo\", \"server\"]" : {
245
+ "Lingo" : [
246
+ " <Lingo = [(lingo/s), (lingo/e)]>"
247
+ ],
248
+ "server" : [
249
+ " <server = [(server/s)]>"
250
+ ]
251
+ }
252
+ }
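The two +curl+ calls above can also be issued from Ruby with the standard library. This is a minimal client sketch, assuming a server running on <tt>localhost:6789</tt> as in the example; the method names +lingo_query_uri+ and +lingo_query+ are hypothetical and not provided by Lingo.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds the request URI: a single string is sent as "q",
# an array of terms as repeated "q[]" parameters.
def lingo_query_uri(terms, host: 'localhost', port: 6789)
  params = terms.is_a?(Array) ? { 'q[]' => terms } : { 'q' => terms }
  URI::HTTP.build(host: host, port: port, path: '/',
                  query: URI.encode_www_form(params))
end

# Sends the query and parses the JSON response (requires a running server).
def lingo_query(terms, **options)
  response = Net::HTTP.get_response(lingo_query_uri(terms, **options))
  unless response.is_a?(Net::HTTPSuccess)
    raise "Lingo server returned HTTP #{response.code}"
  end
  JSON.parse(response.body)
end

puts lingo_query_uri('Lingo server')
puts lingo_query_uri(%w[Lingo server])
```

Note that <tt>URI.encode_www_form</tt> percent-encodes the brackets in <tt>q[]</tt>, which is equivalent to the literal brackets that <tt>curl -g</tt> sends.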
253
+
254
+ ==== Raw endpoint
255
+
256
+ _Example_: Ask the server about "Lingo server"; returns raw Lingo response.
257
+
258
+ > curl --data 'Lingo server' http://localhost:6789/raw
259
+ <Lingo = [(lingo/s), (lingo/e)]>
260
+ <server = [(server/s)]>
261
+
262
+ _Example_: Ask the server about this file; returns raw Lingo response (output
263
+ truncated for clarity).
264
+
265
+ > curl --data @README -H 'Content-Type: text/plain' http://localhost:6789/raw
266
+ :=/OTHR:
267
+ <Lingo = [(lingo/s), (lingo/e)]>
268
+ <-|?>
269
+ <A|?>
270
+ <full-featured|KOM = [(full-featured/k), (full/s+), (full/a+), (full/v+), (featured/a+)]>
271
+ <automatic = [(automatic/s), (automatic/a)]>
272
+ <indexing = [(index/v)]>
273
+ <system = [(system/s)]>
274
+ [...]
275
+
276
+ ==== Deployment
277
+
278
+ Lingo::Srv can be started directly through the provided command-line executable
279
+ +lingosrv+ (see above) or through any other Rack[http://rack.github.com/]
280
+ -compatible deployment option; a +rackup+ file is included (see <tt>lingoctl
281
+ rackup srv</tt>).
282
+
283
+ _Example_: To deploy Lingo::Srv with Passenger[http://phusionpassenger.com/]
284
+ on Apache, create a symlink in the DocumentRoot pointing to the app's
285
+ <tt>public/</tt> directory; adjust the paths according to your environment
286
+ (you can use current_gem[http://blackwinter.github.com/current_gem] to
287
+ create a stable gem path):
288
+
289
+ /var/www
290
+ |
291
+ +-- lingo-srv -> /usr/lib/ruby/gems/2.1.0/gems/lingo-x.y.z/lib/lingo/srv/public
292
+
293
+ Then put the following snippet in Apache's VirtualHost configuration:
294
+
295
+ <VirtualHost *:80>
296
+ ...
297
+
298
+ RackBaseURI /lingo-srv
299
+ <Directory /var/www/lingo-srv>
300
+ Options -MultiViews
301
+ SetEnv LINGO_SRV_OPTS "-l en" # <-- Optionally set Lingo options
302
+ </Directory>
303
+ </VirtualHost>
304
+
305
+ In order to provide your own +rackup+ file and Lingo configuration, create a
306
+ directory with those files:
307
+
308
+ /srv/lingo-srv
309
+ |
310
+ +-- config.ru
311
+ |
312
+ +-- lingosrv.cfg
313
+
314
+ And then point Passenger at it:
315
+
316
+ <VirtualHost *:80>
317
+ ...
318
+
319
+ RackBaseURI /lingo-srv
320
+ <Directory /var/www/lingo-srv>
321
+ Options -MultiViews
322
+ PassengerAppRoot /srv/lingo-srv # <-- Add this line
323
+ </Directory>
324
+ </VirtualHost>
325
+
326
+ Restart Apache and test the result (output formatted for clarity):
327
+
328
+ > curl http://localhost/lingo-srv/about
329
+ {
330
+ "Lingo::Srv" : {
331
+ "version" : "x.y.z"
332
+ }
333
+ }
334
+
161
335
 
162
336
  == EXAMPLE
163
337
 
@@ -167,20 +341,21 @@ for further discussion.
167
341
 
168
342
  == INSTALLATION AND USAGE
169
343
 
170
- Since version 1.8.0, Lingo is available as a RubyGem. So a simple <tt>gem
171
- install lingo</tt> will install Lingo and its dependencies (you might want
172
- to run that command with administrator privileges, depending on your
173
- environment). Then you can call the +lingo+ executable to process your
174
- data. See <tt>lingo --help</tt> for starters.
344
+ Since version 1.8.0, Lingo is available as a
345
+ RubyGem[http://rubygems.org/gems/lingo]. So a simple <tt>gem install lingo</tt>
346
+ will install Lingo and its dependencies. You might want to run that command
347
+ with administrator privileges, depending on your environment. Then you can call
348
+ the +lingo+ executable to process your text files. See <tt>lingo --help</tt>
349
+ for available options.
175
350
 
176
- Please note that Lingo requires Ruby version 1.9.2 or higher to run
177
- (2.0.0[http://ruby-lang.org/en/downloads/] is the currently recommended
178
- version). If you want to use Lingo on Ruby 1.8, please refer to the legacy
179
- version (see below).
351
+ Please note that Lingo requires Ruby version 1.9.3 or higher to run
352
+ (2.1.3[http://ruby-lang.org/en/downloads/] is the currently recommended
353
+ version). If you want to use Lingo on Ruby 1.8, please refer to the
354
+ {legacy version}[rdoc-label:label-Legacy+version].
180
355
 
181
356
  Since Lingo depends on native extensions, you need to make sure that
182
357
  development files for your Ruby version are installed. On Debian-based
183
- Linux platforms they are included in the package <tt>ruby1.9.1-dev</tt>;
358
+ Linux platforms they are included in the package <tt>ruby-dev</tt>;
184
359
  other distributions may have a similarly named package. On Windows those
185
360
  development files are currently not required.
186
361
 
@@ -194,29 +369,29 @@ version of Lingo (see below).
194
369
  === Dictionary and configuration file lookup
195
370
 
196
371
  Lingo will search different locations to find dictionaries and configuration
197
- files. By default, these are the current directory, your personal Lingo
372
+ files. By default, these are the current working directory, your personal Lingo
198
373
  directory (<tt>~/.lingo</tt>) and the installation directory (in that order).
199
374
  You can control this lookup path by either moving files up the chain (using
200
375
  the +lingoctl+ executable) or by setting various environment variables.
201
376
 
202
377
  With +lingoctl+ you can copy dictionaries and configuration files from your
203
- personal Lingo directory or the installation directory to the current
378
+ personal Lingo directory or the installation directory to the current working
204
379
  directory so you can modify them and they will take precedence over the
205
380
  original ones. See <tt>lingoctl --help</tt> for usage information.
206
381
 
207
- In order to change the search path in itself, you can define the
382
+ In order to change the search path itself, you can define the
208
383
  +LINGO_PATH+ environment variable as a whole or its individual parts
209
384
  +LINGO_CURR+ (the local Lingo directory), +LINGO_HOME+ (your personal
210
385
  Lingo directory), and +LINGO_BASE+ (the system-wide Lingo directory).
211
386
 
212
- Inside of any of these directories dictionaries and configuration files are
387
+ Inside of any of these directories, dictionaries and configuration files are
213
388
  typically organized in the following directory structure:
214
389
 
215
- <tt>config</tt>:: Configuration files (<tt>*.cfg</tt>).
216
- <tt>dict</tt>:: Dictionary source files (<tt>*.txt</tt>); in
217
- language-specific subdirectories (+de+, +en+, ...).
218
- <tt>lang</tt>:: Language definition files (<tt>*.lang</tt>).
219
- <tt>store</tt>:: Compiled dictionaries, generated from source files.
390
+ <tt>config/</tt>:: Configuration files (<tt>*.cfg</tt>).
391
+ <tt>dict/</tt>:: Dictionary source files (<tt>*.txt</tt>) in
392
+ language-specific subdirectories (+de/+, +en/+, ...).
393
+ <tt>lang/</tt>:: Language definition files (<tt>*.lang</tt>).
394
+ <tt>store/</tt>:: Compiled dictionaries, generated from source files.
220
395
 
221
396
  But for compatibility reasons these naming conventions are not enforced.
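The lookup order described in this section (current working directory, then personal Lingo directory, then installation directory) amounts to a first-match search along a path. The sketch below models only that idea with temporary directories; the method name +lookup+ is hypothetical, and Lingo's actual handling of +LINGO_PATH+ and related variables is not reproduced here.

```ruby
require 'fileutils'
require 'tmpdir'

# Returns the first existing file for relative_path along search_path,
# or nil if no directory contains it.
def lookup(relative_path, search_path)
  search_path.map { |dir| File.join(dir, relative_path) }
             .find { |candidate| File.file?(candidate) }
end

Dir.mktmpdir do |root|
  # Stand-ins for the current, personal, and installation directories.
  curr, home, base = %w[curr home base].map { |name| File.join(root, name) }

  [home, base].each do |dir|
    FileUtils.mkdir_p(File.join(dir, 'config'))
    File.write(File.join(dir, 'config', 'lingo.cfg'), "# from #{File.basename(dir)}\n")
  end
  FileUtils.mkdir_p(curr)

  # The personal directory wins over the installation directory.
  puts File.read(lookup('config/lingo.cfg', [curr, home, base]))
end
```

Copying a file "up the chain", as +lingoctl+ does, corresponds to placing it in an earlier directory of this list so that it is found first.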
222
397
 
@@ -224,14 +399,14 @@ But for compatibility reasons these naming conventions are not enforced.
224
399
 
225
400
  As Lingo 1.8 introduced some major disruptions and no longer runs on Ruby 1.8,
226
401
  there is a maintenance branch for Lingo 1.7.x that will remain compatible with
227
- both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch will
402
+ both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch may
228
403
  receive occasional bug fixes and minor feature updates. However, the bulk of
229
404
  the development efforts will be directed towards Lingo 1.8+.
230
405
 
231
- To install the legacy version, download and extract the ZIP archive from
232
- RubyForge[http://rubyforge.org/frs/?group_id=5663]. No additional dependencies
233
- are required. This version of Lingo works with both Ruby 1.8 (1.8.5 or higher)
234
- and 1.9 (1.9.2 or higher).
406
+ To install the legacy version, download and extract the
407
+ {ZIP archive}[http://ixtrieve.fh-koeln.de/buch/lingo-1.7.1.zip].
408
+ No additional dependencies are required. This version of Lingo works
409
+ with both Ruby 1.8 (1.8.5 or higher) and 1.9 (1.9.2 or higher).
235
410
 
236
411
  The executable is named +lingo.rb+. It's located at the root of the installation
237
412
  directory and may only be run from there. See <tt>ruby lingo.rb -h</tt> for
@@ -239,49 +414,116 @@ usage instructions.
239
414
 
240
415
  Configuration and language definition files are also located at the root of the
241
416
  installation directory (<tt>*.cfg</tt> and <tt>*.lang</tt>, respectively).
242
- Dictionary source files are found in language-specific subdirectories (+de+,
243
- +en+, ...) and are named <tt>*.txt</tt>. The compiled dictionaries are found
244
- beneath these subdirectories in a directory named <tt>store</tt>.
417
+ Dictionary source files are found in language-specific subdirectories (+de/+,
418
+ +en/+, ...) and are named <tt>*.txt</tt>. The compiled dictionaries are found
419
+ beneath these language subdirectories in a directory named <tt>store/</tt>.
245
420
 
246
421
 
247
422
  == FILE FORMATS
248
423
 
249
- Lingo uses three different types of files to determine its behaviour.
250
- Configuration files control the details of the indexing process. Language
251
- definitions specify grammar rules and dictionaries available for indexing.
252
- Dictionaries, finally, hold the vocabulary used in indexing the input text
253
- and producing the results.
424
+ Lingo uses three different types of files to determine its behaviour:
425
+ {configuration files}[rdoc-label:label-Configuration] control the details of the
426
+ indexing process; {language definitions}[rdoc-label:label-Language+definition]
427
+ specify grammar rules and dictionaries available for indexing;
428
+ dictionaries[rdoc-label:label-Dictionaries], finally, hold the
429
+ vocabulary used in indexing the input text and producing the results.
254
430
 
255
431
  === Configuration
256
432
 
257
- TODO...
433
+ Configuration files are defined in the YAML[http://yaml.org/] syntax. They
434
+ specify the attendees[rdoc-label:label-Attendees] to call in order and the
435
+ options to provide them with. The first attendee in any indexing process is
436
+ the text_reader[rdoc-ref:Lingo::Attendee::TextReader], who reads the input
437
+ text and passes it on to the other attendees. Every attendee transforms or
438
+ extends the input stream and automatically sends everything down to the next
439
+ attendee. This process may be customized by explicitly specifying the input
440
+ and/or output channels of individual attendees with the +in+ and +out+ options.
441
+
442
+ _Example_:
443
+
444
+ # input is taken from the previous attendee,
445
+ # output is sent to the named channel "syn"
446
+ - synonymer: { skip: '?,t', source: sys-syn, out: syn }
447
+ 
448
+ # input is taken from the named channel "syn",
449
+ # output is sent to the next attendee
450
+ - vector_filter: { in: syn, lexicals: y, sort: term_abs }
451
+ 
452
+ # input is taken from the previous attendee,
453
+ # output is sent to the next attendee
454
+ - text_writer: { ext: syn, sep: "\n" }
455
+ 
456
+ # input is taken from the named channel "syn"
457
+ # (ignoring the output of the previous attendee),
458
+ # output is sent to the next attendee
459
+ - vector_filter: { in: syn, lexicals: m }
460
+ 
461
+ # input is taken from the previous attendee,
462
+ # output is sent to the next attendee
463
+ - text_writer: { ext: mul, sep: "\n" }
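To make the channel wiring in this example easier to follow, here is a toy model in Ruby. It is not how Lingo implements channels; the method +wire+ and the generated channel names (<tt>auto-N</tt>) are purely illustrative. It only encodes the rule stated above: without +in+, an attendee reads from the previous attendee's output, and without +out+, it writes to a channel that the next attendee picks up.

```ruby
# Toy model of attendee wiring; each attendee is a Hash with :name
# and optional :in / :out channel names.
def wire(attendees)
  previous_out = nil

  attendees.each_with_index.map do |attendee, index|
    input  = attendee[:in]  || previous_out     # default: previous attendee
    output = attendee[:out] || "auto-#{index}"  # default: anonymous channel
    previous_out = output
    { name: attendee[:name], in: input, out: output }
  end
end

chain = [
  { name: 'synonymer', out: 'syn' },
  { name: 'vector_filter', in: 'syn' },
  { name: 'text_writer' },
  { name: 'vector_filter', in: 'syn' },
  { name: 'text_writer' }
]

wire(chain).each do |a|
  puts format('%-14s %-8s -> %s', a[:name], a[:in].inspect, a[:out])
end
```

Both +vector_filter+ instances end up reading from the same named channel <tt>syn</tt>, which is what allows a single synonymer result to feed two differently configured output files.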
258
464
 
259
465
  === Language definition
260
466
 
261
- TODO...
467
+ Language definitions, like {configuration files}[rdoc-label:label-Configuration],
468
+ are defined in the YAML[http://yaml.org/] syntax. They specify the
469
+ dictionaries[rdoc-label:label-Dictionaries] to be used as well as the grammar
470
+ rules according to which the input shall be processed. These settings do not
471
+ necessarily have to coincide with an existing language; they are
472
+ application-specific.
262
473
 
263
474
  === Dictionaries
264
475
 
476
+ Dictionaries come in different varieties and encode the knowledge about the
477
+ vocabulary used for indexing and analysis.
478
+
479
+ Supported dictionary formats:
480
+
481
+ +SingleWord+:: One word (projection) per line. E.g. <tt>open source</tt>. (see
482
+ Lingo::Database::Source::SingleWord)
483
+ +MultiValue+:: Multiple words per line (separated with a unique symbol), all of
484
+ which are interpreted as belonging to a single equivalence class.
485
+ E.g. <tt>fax;telefax;facsimile</tt>. (see
486
+ Lingo::Database::Source::MultiValue)
487
+ +MultiKey+:: Similar to +MultiValue+, except that the first word will be
488
+ treated as the preferred term (descriptor). E.g.
489
+ <tt>fax;telefax;facsimile</tt>. (see
490
+ Lingo::Database::Source::MultiKey)
491
+ +KeyValue+:: One word and its associated projection per line, separated with
492
+ a unique symbol. E.g. <tt>abfrage*query</tt>. (see
493
+ Lingo::Database::Source::KeyValue)
494
+ +WordClass+:: Similar to +KeyValue+, except that the projection may consist of
495
+ multiple lexicalizations, each with its own word class and
496
+ (optional) gender information. E.g. <tt>abort,abort #s|v</tt>,
497
+ which is equivalent to <tt>abort,abort #s abort #v</tt>. (see
498
+ Lingo::Database::Source::WordClass)
499
+
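
To make the line formats above concrete, here is a minimal Ruby sketch that splits the example lines into their parts. It is not how Lingo reads its dictionaries (see the Lingo::Database::Source classes for that); the method names are hypothetical, the separators are simply the ones used in the examples, and gender information as well as forms containing spaces are ignored.

```ruby
# MultiValue: all words on the line form one equivalence class.
def parse_multi_value(line, sep = ';')
  line.strip.split(sep)
end

# MultiKey: the first word is the preferred term (descriptor).
def parse_multi_key(line, sep = ';')
  descriptor, *variants = line.strip.split(sep)
  { descriptor: descriptor, variants: variants }
end

# KeyValue: one word and its projection.
def parse_key_value(line, sep = '*')
  key, projection = line.strip.split(sep, 2)
  { key => projection }
end

# WordClass: the projection lists forms, each followed by "#" and one
# or more word classes joined with "|"; "#s|v" expands to "#s" and "#v".
def parse_word_class(line, sep = ',')
  key, rest = line.strip.split(sep, 2)

  lexicals = rest.scan(/(\S+)\s+#(\S+)/).flat_map do |form, classes|
    classes.split('|').map { |word_class| [form, word_class] }
  end

  { key => lexicals }
end

p parse_multi_value('fax;telefax;facsimile')
p parse_multi_key('fax;telefax;facsimile')
p parse_key_value('abfrage*query')
p parse_word_class('abort,abort #s|v')
```

With this sketch, <tt>abort,abort #s|v</tt> and <tt>abort,abort #s abort #v</tt> produce the same result, matching the equivalence stated for the +WordClass+ format.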
500
+ ==== Encoding word classes and gender information
501
+
502
+ TODO...
503
+
504
+ ==== Lexicalizing multiword expressions
505
+
506
+ TODO...
507
+
508
+ ==== Lexicalizing compounds
509
+
265
510
  TODO...
266
511
 
267
512
 
268
513
  == ISSUES AND CONTRIBUTIONS
269
514
 
270
- If you find bugs or want to suggest new features, please write to the
271
- {mailing list}[mailto:lingo-users@rubyforge.org] or report them on
272
- GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
515
+ If you find bugs or want to suggest new features, please report them
516
+ on GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
273
517
  version (<tt>ruby --version</tt>) and the version of Lingo you are using
274
- (typically <tt>lingo --version</tt>, provided it's new enough to support
275
- that flag).
518
+ (<tt>lingo --version</tt>).
276
519
 
277
- If you want to contribute to Lingo, please fork the project on
278
- GitHub[http://github.com/lex-lingo/lingo] and submit a
279
- {pull request}[http://github.com/lex-lingo/lingo/pulls] (bonus points for topic
280
- branches) or clone the repository[http://github.com/lex-lingo/lingo] locally
281
- and send your formatted patch to the {developer list}[mailto:lingo-core@rubyforge.org].
520
+ If you want to contribute to Lingo, please fork the project
521
+ on GitHub[http://github.com/lex-lingo/lingo] and submit a
522
+ {pull request}[http://github.com/lex-lingo/lingo/pulls]
523
+ (bonus points for topic branches).
282
524
 
283
525
  To make sure that Lingo's tests pass, install hen[http://blackwinter.github.com/hen]
284
- (typically <tt>gem install hen</tt>) and all development dependencies (either
526
+ (typically <tt>gem install hen</tt>) and all development dependencies (either with
285
527
  <tt>gem install --development lingo</tt> or manually; see <tt>rake gem:dependencies</tt>).
286
528
  Then run <tt>rake test</tt> for the basic tests or <tt>rake test:all</tt> for
287
529
  the full test suite.
@@ -289,28 +531,35 @@ the full test suite.
289
531
 
290
532
  == LINKS
291
533
 
292
- <b></b>
293
- Website:: http://lex-lingo.de
294
- Demo:: http://ixtrieve.fh-koeln.de/lingoweb
295
- Documentation:: http://lex-lingo.github.com/lingo
296
- Source code:: http://github.com/lex-lingo/lingo
297
- RubyGem:: http://rubygems.org/gems/lingo
298
- RubyForge project:: http://rubyforge.org/projects/lingo
299
- Mailing list:: http://rubyforge.org/mailman/listinfo/lingo-users
300
- Bug tracker:: http://github.com/lex-lingo/lingo/issues
534
+ Website:: http://lex-lingo.de
535
+ Demo:: http://ixtrieve.fh-koeln.de/lingoweb
536
+ Documentation:: https://lex-lingo.github.com/lingo
537
+ Source code:: https://github.com/lex-lingo/lingo
538
+ RubyGem:: https://rubygems.org/gems/lingo
539
+ Bug tracker:: https://github.com/lex-lingo/lingo/issues
540
+ Travis CI:: https://travis-ci.org/lex-lingo/lingo
301
541
 
302
542
 
303
543
  == LITERATURE
304
544
 
305
- <b></b>
306
- * Lepsky, K., Vorhauer, J.: <em>{Lingo: ein open source System für die automatische Indexierung deutschsprachiger Dokumente}[http://dx.doi.org/10.1515/ABITECH.2006.26.1.18]</em>. (German) In: ABI Technik 26, 2006. p. 18-29.
307
- * Gödert, W., Lepsky, K., Nagelschmidt, M.: <em>{Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch}[http://dx.doi.org/10.1007/978-3-642-23513-9]</em>. (German) Berlin etc.: Springer, 2012.
545
+ === Background and Theory
546
+
547
+ * Gödert, W.; Lepsky, K.; Nagelschmidt, M.: <em>{Informationserschließung und Automatisches Indexieren: ein Lehr- und Arbeitsbuch}[http://dx.doi.org/10.1007/978-3-642-23513-9]</em>. (German) Berlin etc.: Springer, 2012.
548
+ * Lepsky, K.; Vorhauer, J.: <em>{Lingo: ein open source System für die automatische Indexierung deutschsprachiger Dokumente}[http://dx.doi.org/10.1515/ABITECH.2006.26.1.18]</em>. (German) In: ABI Technik 26 (1), 2006. pp 18-29.
308
549
  * Nohr, H.: <em>{Grundlagen der automatischen Indexierung: ein Lehrbuch}[http://logos-verlag.de/cgi-bin/buch/isbn/0121]</em>. (German) Berlin: Logos, 2005.
309
- * Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://zentralblatt-math.org/zbmath/search/?an=0956.68141]</em>. (German) Berlin etc.: Springer, 2000.
310
- * Allen, J.: <em>{Natural language understanding}[http://zentralblatt-math.org/zbmath/search/?an=0851.68106]</em>. (English) Redwood City, CA: Benjamin/Cummings, 1995.
311
- * Grishman, R.: <em>{Computational linguistics: an introduction}[http://dx.doi.org/10.2277/0521310385]</em>. (English) Cambridge: Cambridge Univ. Press, 1986.
312
- * Salton, G., McGill, M.: <em>{Introduction to modern information retrieval}[http://zentralblatt-math.org/zbmath/search/?an=0523.68084]</em>. (English) New York etc.: McGraw-Hill, 1983.
313
- * Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14, 1980. p. 130-137.
550
+ * Hausser, R.: <em>{Grundlagen der Computerlinguistik. Mensch-Maschine-Kommunikation in natürlicher Sprache}[http://zbmath.org/?q=an:0956.68141]</em>. (German) Berlin etc.: Springer, 2000.
551
+ * Allen, J.: <em>{Natural language understanding}[http://zbmath.org/?q=an:0851.68106]</em>. (English) Redwood City, CA: Benjamin/Cummings, 1995.
552
+ * Grishman, R.: <em>{Computational linguistics: an introduction}[http://cambridge.org/9780521310383]</em>. (English) Cambridge: Cambridge Univ. Press, 1986.
553
+ * Salton, G.; McGill, M.: <em>{Introduction to modern information retrieval}[http://zbmath.org/?q=an:0523.68084]</em>. (English) New York etc.: McGraw-Hill, 1983.
554
+ * Porter, M.: <em>{An algorithm for suffix stripping}[http://tartarus.org/~martin/PorterStemmer/]</em>. (English) In: Program 14 (3), 1980. pp 130-137.
555
+
556
+ === Research publications
557
+
558
+ * Bredack, J.; Lepsky, K.: <em>{Automatische Extraktion von Fachterminologie aus Volltexten}[http://dx.doi.org/10.1515/abitech-2014-0002]</em>. (German) In: ABI Technik 34 (1), 2014. pp 2-12.
559
+ * Bredack, J.: <em>{Terminologieextraktion von Mehrwortgruppen in kunsthistorischen Fachtexten}[http://ixtrieve.fh-koeln.de/lehre/bredack-2013.pdf]</em>. (German) Köln: Fachhochschule Köln, 2013.
560
+ * Maylein, L.; Langenstein, A.: <em>{Neues vom Relevanz-Ranking im HEIDI-Katalog der Universitätsbibliothek Heidelberg}[http://b-i-t-online.de/heft/2013-03-fachbeitrag-maylein.pdf]</em>. (German) In: b.i.t.online 16 (3), 2013. pp 190-200.
561
+ * Gödert, W.: <em>{Detecting multiword phrases in mathematical text corpora}[http://arxiv.org/abs/1210.0852]</em>. (English) arXiv:1210.0852 [cs.CL], 2012.
562
+ * Schiffer, R.: <em>{Automatisches Indexieren technischer Kongressschriften}[http://ixtrieve.fh-koeln.de/lehre/schiffer-2007.pdf]</em>. (German) Köln: Fachhochschule Köln, 2007.
314
563
 
315
564
 
316
565
  == CREDITS
@@ -333,7 +582,7 @@ Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
333
582
  == LICENSE AND COPYRIGHT
334
583
 
335
584
  Copyright (C) 2005-2007 John Vorhauer
336
- Copyright (C) 2007-2013 John Vorhauer, Jens Wille
585
+ Copyright (C) 2007-2014 John Vorhauer, Jens Wille
337
586
 
338
587
  Lingo is free software: you can redistribute it and/or modify it under the
339
588
  terms of the GNU Affero General Public License as published by the Free