lingo 1.8.0

Sign up to get free protection for your applications and to get access to all the features.
Files changed (108) hide show
  1. data/.rspec +1 -0
  2. data/COPYING +663 -0
  3. data/ChangeLog +754 -0
  4. data/README +322 -0
  5. data/Rakefile +100 -0
  6. data/TODO +28 -0
  7. data/bin/lingo +5 -0
  8. data/bin/lingoctl +6 -0
  9. data/de.lang +121 -0
  10. data/de/lingo-abk.txt +74 -0
  11. data/de/lingo-dic.txt +56822 -0
  12. data/de/lingo-mul.txt +3209 -0
  13. data/de/lingo-syn.txt +14841 -0
  14. data/de/test_dic.txt +24 -0
  15. data/de/test_mul.txt +17 -0
  16. data/de/test_mul2.txt +2 -0
  17. data/de/test_singleword.txt +2 -0
  18. data/de/test_syn.txt +4 -0
  19. data/de/test_syn2.txt +1 -0
  20. data/de/user-dic.txt +10 -0
  21. data/en.lang +113 -0
  22. data/en/lingo-dic.txt +55434 -0
  23. data/en/lingo-mul.txt +456 -0
  24. data/en/user-dic.txt +5 -0
  25. data/info/Objekte.png +0 -0
  26. data/info/Typen.png +0 -0
  27. data/info/database.png +0 -0
  28. data/info/db_small.png +0 -0
  29. data/info/download.png +0 -0
  30. data/info/gpl-hdr.txt +27 -0
  31. data/info/kerze.png +0 -0
  32. data/info/language.png +0 -0
  33. data/info/lingo.png +0 -0
  34. data/info/logo.png +0 -0
  35. data/info/meeting.png +0 -0
  36. data/info/types.png +0 -0
  37. data/lib/lingo.rb +321 -0
  38. data/lib/lingo/attendee/abbreviator.rb +119 -0
  39. data/lib/lingo/attendee/debugger.rb +111 -0
  40. data/lib/lingo/attendee/decomposer.rb +101 -0
  41. data/lib/lingo/attendee/dehyphenizer.rb +167 -0
  42. data/lib/lingo/attendee/multiworder.rb +301 -0
  43. data/lib/lingo/attendee/noneword_filter.rb +103 -0
  44. data/lib/lingo/attendee/objectfilter.rb +86 -0
  45. data/lib/lingo/attendee/sequencer.rb +190 -0
  46. data/lib/lingo/attendee/synonymer.rb +105 -0
  47. data/lib/lingo/attendee/textreader.rb +237 -0
  48. data/lib/lingo/attendee/textwriter.rb +196 -0
  49. data/lib/lingo/attendee/tokenizer.rb +218 -0
  50. data/lib/lingo/attendee/variator.rb +185 -0
  51. data/lib/lingo/attendee/vector_filter.rb +158 -0
  52. data/lib/lingo/attendee/wordsearcher.rb +96 -0
  53. data/lib/lingo/attendees.rb +289 -0
  54. data/lib/lingo/cli.rb +62 -0
  55. data/lib/lingo/config.rb +104 -0
  56. data/lib/lingo/const.rb +131 -0
  57. data/lib/lingo/ctl.rb +173 -0
  58. data/lib/lingo/database.rb +587 -0
  59. data/lib/lingo/language.rb +530 -0
  60. data/lib/lingo/modules.rb +98 -0
  61. data/lib/lingo/types.rb +285 -0
  62. data/lib/lingo/utilities.rb +40 -0
  63. data/lib/lingo/version.rb +27 -0
  64. data/lingo-all.cfg +85 -0
  65. data/lingo-call.cfg +15 -0
  66. data/lingo.cfg +78 -0
  67. data/lingo.rb +3 -0
  68. data/lir.cfg +72 -0
  69. data/porter/stem.cfg +311 -0
  70. data/porter/stem.rb +150 -0
  71. data/spec/spec_helper.rb +0 -0
  72. data/test.cfg +79 -0
  73. data/test/attendee/ts_abbreviator.rb +35 -0
  74. data/test/attendee/ts_decomposer.rb +31 -0
  75. data/test/attendee/ts_multiworder.rb +390 -0
  76. data/test/attendee/ts_noneword_filter.rb +19 -0
  77. data/test/attendee/ts_objectfilter.rb +19 -0
  78. data/test/attendee/ts_sequencer.rb +43 -0
  79. data/test/attendee/ts_synonymer.rb +33 -0
  80. data/test/attendee/ts_textreader.rb +58 -0
  81. data/test/attendee/ts_textwriter.rb +98 -0
  82. data/test/attendee/ts_tokenizer.rb +32 -0
  83. data/test/attendee/ts_variator.rb +24 -0
  84. data/test/attendee/ts_vector_filter.rb +62 -0
  85. data/test/attendee/ts_wordsearcher.rb +119 -0
  86. data/test/lir.csv +3 -0
  87. data/test/lir.txt +12 -0
  88. data/test/lir2.txt +12 -0
  89. data/test/mul.txt +1 -0
  90. data/test/ref/artikel.mul +1 -0
  91. data/test/ref/artikel.non +159 -0
  92. data/test/ref/artikel.seq +270 -0
  93. data/test/ref/artikel.syn +16 -0
  94. data/test/ref/artikel.vec +928 -0
  95. data/test/ref/artikel.ven +928 -0
  96. data/test/ref/artikel.ver +928 -0
  97. data/test/ref/lir.csv +328 -0
  98. data/test/ref/lir.mul +1 -0
  99. data/test/ref/lir.non +274 -0
  100. data/test/ref/lir.seq +249 -0
  101. data/test/ref/lir.syn +94 -0
  102. data/test/test_helper.rb +113 -0
  103. data/test/ts_database.rb +269 -0
  104. data/test/ts_language.rb +396 -0
  105. data/txt/artikel-en.txt +157 -0
  106. data/txt/artikel.txt +170 -0
  107. data/txt/lir.txt +1317 -0
  108. metadata +211 -0
data/README ADDED
@@ -0,0 +1,322 @@
1
+ = Lingo - A full-featured automatic indexing system
2
+
3
+ <b></b>
4
+ * {Version}[rdoc-label:label-VERSION]
5
+ * {Description}[rdoc-label:label-DESCRIPTION]
6
+ * {Introduction}[rdoc-label:label-Introduction]
7
+ * {Attendees}[rdoc-label:label-Attendees]
8
+ * {Filters}[rdoc-label:label-Filters]
9
+ * {Markup}[rdoc-label:label-Markup]
10
+ * {Inline annotation}[rdoc-label:label-Inline+annotation]
11
+ * {Plugins}[rdoc-label:label-Plugins]
12
+ * {Example}[rdoc-label:label-EXAMPLE]
13
+ * {Installation and Usage}[rdoc-label:label-INSTALLATION+AND+USAGE]
14
+ * {Dictionary and configuration file lookup}[rdoc-label:label-Dictionary+and+configuration+file+lookup]
15
+ * {Legacy version}[rdoc-label:label-Legacy+version]
16
+ * {File formats}[rdoc-label:label-FILE+FORMATS]
17
+ * {Configuration}[rdoc-label:label-Configuration]
18
+ * {Language definition}[rdoc-label:label-Language+definition]
19
+ * {Dictionaries}[rdoc-label:label-Dictionaries]
20
+ * {Issues and Contributions}[rdoc-label:label-ISSUES+AND+CONTRIBUTIONS]
21
+ * {Links}[rdoc-label:label-LINKS]
22
+ * {Credits}[rdoc-label:label-CREDITS]
23
+ * {License and Copyright}[rdoc-label:label-LICENSE+AND+COPYRIGHT]
24
+
25
+ == VERSION
26
+
27
+ This documentation refers to Lingo version 1.8.0
28
+
29
+
30
+ == DESCRIPTION
31
+
32
+ Lingo is an open source indexing system for research and teachings. The main
33
+ functions of Lingo are:
34
+
35
+ * identification of (i.e. reduction to) basic word form by means of dictionaries
36
+ and suffix lists
37
+ * algorithmic decomposition
38
+ * dictionary-based synonymisation and identification of phrases
39
+ * generic identification of phrases/word sequences based on patterns of word
40
+ classes
41
+
42
+ === Introduction
43
+
44
+ If you want to perform linguistic analysis on some text, Lingo is there to
45
+ support your endeavour with all its flexibility and extendability. Lingo
46
+ enables you to assemble a network of practically unlimited functionality
47
+ from modules with limited functions. This network is built by configuration
48
+ files. Here's a minimal example:
49
+
50
+ meeting:
51
+ attendees:
52
+ - textreader: { files: 'README' }
53
+ - debugger: { eval: 'true', ceval: 'cmd!="EOL"', prompt: '<debug>: ' }
54
+
55
+ Lingo is told to invite two attendees. And Lingo wants them to talk to each
56
+ other, hence the name Lingo (= the technical language).
57
+
58
+ The first attendee is the +textreader+ (Lingo::Attendee::Textreader). It can
59
+ read files (as well as standard input) and communicate its content to other
60
+ attendees. For this purpose the +textreader+ is given an output channel.
61
+ Everything that the +textreader+ has to say is steered through this channel.
62
+ It will do nothing further until Lingo will tell the first attendee to speak.
63
+ Then the +textreader+ will open the file +README+ (<tt>files</tt> parameter)
64
+ and babble the content to the world via its output channel.
65
+
66
+ The second attendee +debugger+ (Lingo::Attendee::Debugger) does nothing else
67
+ than to put everything on the console (standard error, actually) that comes
68
+ into its input channel. If you write the Lingo configuration which is shown
69
+ above as an example into the file readme.cfg and then run <tt>lingo -c readme
70
+ -l en</tt>, the result will look something like this:
71
+
72
+ <debug>: *FILE('README')
73
+ <debug>: "= Lingo - [...]"
74
+ ...
75
+ <debug>: "If you want to perform linguistic analysis on some text, [...]"
76
+ <debug>: "support your endeavour with all its flexibility and [...]"
77
+ ...
78
+ <debug>: *EOF('README')
79
+
80
+ What we see are lines with an asterisk (*) and lines without. That's because
81
+ Lingo distinguishes between commands and data. The +textreader+ did not only
82
+ read the content of the file, but also communicated through the commands when
83
+ a file begins and when it ends. This can (and will) be an important piece of
84
+ information for other attendees that will be added later.
85
+
86
+ To try out Lingo's functionality without installing it first, have a look at
87
+ {Lingo Web}[http://linux2.fbi.fh-koeln.de/lingoweb]. There you can enter some
88
+ text and see the debug output Lingo generated, including tokenization, word
89
+ identification, decomposition, etc.
90
+
91
+ === Attendees
92
+
93
+ Available attendees that can be used for solving a specific problem (for more
94
+ information see each attendee's documentation):
95
+
96
+ <tt>textreader</tt>:: Reads files and puts their content into the channels line by
97
+ line. (Lingo::Attendee::Textreader)
98
+ <tt>tokenizer</tt>:: Dissects lines into defined character strings, i.e. tokens.
99
+ (Lingo::Attendee::Tokenizer)
100
+ <tt>abbreviator</tt>:: Identifies abbreviations and produces the long form if listed
101
+ in a dictionary. (Lingo::Attendee::Abbreviator)
102
+ <tt>wordsearcher</tt>:: Identifies tokens and turns them into words for further
103
+ processing. To do this right it looks into the dictionary.
104
+ (Lingo::Attendee::Wordsearcher)
105
+ <tt>decomposer</tt>:: Tests any character strings not identified by the +wordsearcher+
106
+ for being compounds. (Lingo::Attendee::Decomposer)
107
+ <tt>synonymer</tt>:: Extends words with synonyms. (Lingo::Attendee::Synonymer)
108
+ <tt>noneword_filter</tt>:: Filters out everything and lets through only those tokens that
109
+ are unknown. (Lingo::Attendee::Noneword_filter)
110
+ <tt>vector_filter</tt>:: Filters out everything and lets through only those tokens that are
111
+ considered useful for indexing. (Lingo::Attendee::Vector_filter)
112
+ <tt>objectfilter</tt>:: Similar to the +vector_filter+. (Lingo::Attendee::Objectfilter)
113
+ <tt>textwriter</tt>:: Writes anything that it receives into a file (or to standard
114
+ output). (Lingo::Attendee::Textwriter)
115
+ <tt>formatter</tt>:: Similar to the +textwriter+, but allows for custom output formats.
116
+ (Lingo::Attendee::Formatter)
117
+ <tt>debugger</tt>:: Shows everything for debugging. (Lingo::Attendee::Debugger)
118
+ <tt>variator</tt>:: Tries to correct spelling errors and the like.
119
+ (Lingo::Attendee::Variator)
120
+ <tt>dehyphenizer</tt>:: Tries to undo hyphenation. (Lingo::Attendee::Dehyphenizer)
121
+ <tt>multiworder</tt>:: Identifies phrases (word sequences) based on a multiword
122
+ dictionary. (Lingo::Attendee::Multiworder)
123
+ <tt>sequencer</tt>:: Identifies phrases (word sequences) based on patterns of word
124
+ classes. (Lingo::Attendee::Sequencer)
125
+
126
+ Furthermore, it may be useful to have a look at the configuration files
127
+ <tt>lingo.cfg</tt> and <tt>en.lang</tt>.
128
+
129
+ === Filters
130
+
131
+ Lingo is able to read HTML, XML, and PDF.
132
+
133
+ TODO: Examples.
134
+
135
+ === Markup
136
+
137
+ Lingo is able to parse HTML/XML and MediaWiki markup.
138
+
139
+ TODO: Examples.
140
+
141
+ === Inline annotation
142
+
143
+ Lingo is able to annotate input text inline, instead of printing results to
144
+ external files.
145
+
146
+ TODO: Examples.
147
+
148
+ === Plugins
149
+
150
+ Lingo has a plugin system that allows you to implement additional features
151
+ (e.g. add new attendees) or modify existing ones. Just create a file named
152
+ +lingo_plugin.rb+ in your Gem's +lib+ directory or any directory that's in
153
+ <tt>$LOAD_PATH</tt>. You can also define an environment variable +LINGO_PLUGIN_PATH+
154
+ with additional directories to load plugins from (<tt>*.rb</tt>).
155
+
156
+ A dedicated API to support writing and integrating plugins will be added in
157
+ the future.
158
+
159
+
160
+ == EXAMPLE
161
+
162
+ TODO: Full-fledged example to show off Lingo's features and provide a basis
163
+ for further discussion.
164
+
165
+
166
+ == INSTALLATION AND USAGE
167
+
168
+ Since version 1.8.0, Lingo is available as a RubyGem. So a simple <tt>gem
169
+ install lingo</tt> will install Lingo and its dependencies (you might want
170
+ to run that command with administrator privileges, depending on your
171
+ environment). Then you can call the +lingo+ executable to process your
172
+ data. See <tt>lingo --help</tt> for starters.
173
+
174
+ Please note that Lingo requires Ruby version 1.9 to run
175
+ (1.9.3[http://ruby-lang.org/en/downloads/] is the currently recommended
176
+ version). If you want to use Lingo on Ruby 1.8, please refer to the legacy
177
+ version (see below).
178
+
179
+ Prior to version 1.8.0, Lingo expected to be run from its installation
180
+ directory. This is no longer necessary. But if you prefer that use case,
181
+ you can either download and extract an
182
+ {archive file}[http://github.com/lex-lingo/lingo/tags] or unpack the
183
+ Gem archive (<tt>gem unpack lingo</tt>); or you can install the legacy
184
+ version of Lingo (see below).
185
+
186
+ === Dictionary and configuration file lookup
187
+
188
+ Lingo will search different locations to find dictionaries and configuration
189
+ files. By default, these are the current directory, your personal Lingo
190
+ directory (<tt>~/.lingo</tt>) and the installation directory (in that order).
191
+ You can control this lookup path by either moving files up the chain (using
192
+ the +lingoctl+ executable) or by setting various environment variables.
193
+
194
+ With +lingoctl+ you can copy dictionaries and configuration files from your
195
+ personal Lingo directory or the installation directory to the current
196
+ directory so you can modify them and they will take precedence over the
197
+ original ones. See <tt>lingoctl --help</tt> for usage information.
198
+
199
+ In order to change the search path in itself, you can define the
200
+ +LINGO_PATH+ environment variable as a whole or its individual parts
201
+ +LINGO_CURR+ (the local Lingo directory), +LINGO_HOME+ (your personal
202
+ Lingo directory), and +LINGO_BASE+ (the system-wide Lingo directory).
203
+
204
+ Inside of any of these directories dictionaries and configuration files are
205
+ typically organized in the following directory structure:
206
+
207
+ <tt>config</tt>:: Configuration files (<tt>*.cfg</tt>).
208
+ <tt>dict</tt>:: Dictionary source files (<tt>*.txt</tt>); in
209
+ language-specific subdirectories (+de+, +en+, ...).
210
+ <tt>lang</tt>:: Language definition files (<tt>*.lang</tt>).
211
+ <tt>store</tt>:: Compiled dictionaries, generated from source files.
212
+
213
+ But for compatibility reasons these naming conventions are not enforced.
214
+
215
+ === Legacy version
216
+
217
+ As Lingo 1.8 introduced some major disruptions and no longer runs on Ruby 1.8,
218
+ there is a maintenance branch for Lingo 1.7.x that will remain compatible with
219
+ both Ruby 1.8 and the previous line of Lingo prior to 1.8. This branch will
220
+ receive occasional bug fixes and minor feature updates. However, the bulk of
221
+ the development efforts will be directed towards Lingo 1.8+.
222
+
223
+ To install the legacy version, download and extract the ZIP archive from
224
+ RubyForge[http://rubyforge.org/frs/?group_id=5663]. No additional dependencies
225
+ are required. This version of Lingo works with both Ruby 1.8 (1.8.5 or greater)
226
+ and 1.9.
227
+
228
+ The executable is named +lingo.rb+. It's located at the root of the installation
229
+ directory and may only be run from there. See <tt>ruby lingo.rb -h</tt> for
230
+ usage instructions.
231
+
232
+ Configuration and language definition files are also located at the root of the
233
+ installation directory (<tt>*.cfg</tt> and <tt>*.lang</tt>, respectively).
234
+ Dictionary source files are found in language-specific subdirectories (+de+,
235
+ +en+, ...) and are named <tt>*.txt</tt>. The compiled dictionaries are found
236
+ beneath these subdirectories in a directory named <tt>store</tt>.
237
+
238
+
239
+ == FILE FORMATS
240
+
241
+ Lingo uses three different types of files to determine its behaviour.
242
+ Configuration files control the details of the indexing process. Language
243
+ definitions specify grammar rules and dictionaries available for indexing.
244
+ Dictionaries, finally, hold the vocabulary used in indexing the input text
245
+ and producing the results.
246
+
247
+ === Configuration
248
+
249
+ TODO...
250
+
251
+ === Language definition
252
+
253
+ TODO...
254
+
255
+ === Dictionaries
256
+
257
+ TODO...
258
+
259
+
260
+ == ISSUES AND CONTRIBUTIONS
261
+
262
+ If you find bugs or want to suggest new features, please write to the
263
+ {mailing list}[mailto:lingo-users@rubyforge.org] or report them on
264
+ GitHub[http://github.com/lex-lingo/lingo/issues]. Include your Ruby
265
+ version (<tt>ruby --version</tt>) and the version of Lingo you are using
266
+ (typically <tt>lingo --version</tt>, provided it's new enough to support
267
+ that flag).
268
+
269
+ If you want to contribute to Lingo, please fork the project on
270
+ GitHub[http://github.com/lex-lingo/lingo] and submit a
271
+ {pull request}[http://github.com/lex-lingo/lingo/pulls] (bonus points for
272
+ topic branches) or clone the repository[http://github.com/lex-lingo/lingo]
273
+ locally and send your formatted patch to the
274
+ {developer list}[mailto:lingo-core@rubyforge.org].
275
+
276
+
277
+ == LINKS
278
+
279
+ <b></b>
280
+ Website:: http://lex-lingo.de
281
+ Demo:: http://linux2.fbi.fh-koeln.de/lingoweb
282
+ Documentation:: http://lex-lingo.github.com/lingo
283
+ Source code:: http://github.com/lex-lingo/lingo
284
+ RubyGem:: http://rubygems.org/gems/lingo
285
+ RubyForge project:: http://rubyforge.org/projects/lingo
286
+ Mailing list:: http://rubyforge.org/mailman/listinfo/lingo-users
287
+ Bug tracker:: http://github.com/lex-lingo/lingo/issues
288
+
289
+
290
+ == CREDITS
291
+
292
+ Lingo is based on a collective development by Klaus Lepsky and John Vorhauer.
293
+
294
+ === Authors
295
+
296
+ * John Vorhauer <mailto:lingo@vorhauer.de>
297
+ * Jens Wille <mailto:jens.wille@uni-koeln.de>
298
+
299
+ === Contributors
300
+
301
+ * Klaus Lepsky <mailto:klaus@lepsky.de>
302
+ * Jan-Helge Jacobs <mailto:plancton@web.de>
303
+ * Thomas Müller <mailto:thomas.mueller@fh-koeln.de>
304
+
305
+
306
+ == LICENSE AND COPYRIGHT
307
+
308
+ Copyright (C) 2005-2007 John Vorhauer
309
+ Copyright (C) 2007-2012 John Vorhauer, Jens Wille
310
+
311
+ Lingo is free software: you can redistribute it and/or modify it under the
312
+ terms of the GNU Affero General Public License as published by the Free
313
+ Software Foundation, either version 3 of the License, or (at your option)
314
+ any later version.
315
+
316
+ Lingo is distributed in the hope that it will be useful, but WITHOUT ANY
317
+ WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
318
+ FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more
319
+ details.
320
+
321
+ You should have received a copy of the GNU Affero General Public License along
322
+ with Lingo. If not, see <http://www.gnu.org/licenses/>.
data/Rakefile ADDED
@@ -0,0 +1,100 @@
1
+ # encoding: utf-8
2
+
3
+ __DIR__ = File.expand_path('..', __FILE__)
4
+
5
+ require 'rake/clean'
6
+ require File.join(__DIR__, %w[lib lingo version])
7
+
8
+ PACKAGE_NAME = 'lingo'
9
+ PACKAGE_PATH = File.join(__DIR__, 'pkg', "#{PACKAGE_NAME}-#{Lingo::VERSION}")
10
+
11
+ if RUBY_PLATFORM =~ /msdos|mswin|djgpp|mingw|windows/i
12
+ ZIP_COMMANDS = ['zip', '7z a'] # for hen's gem task
13
+ end
14
+
15
+ task default: :spec
16
+ task package: [:checkdoc, 'test:all', :clean]
17
+
18
+ begin
19
+ require 'hen'
20
+
21
+ Hen.lay! {{
22
+ gem: {
23
+ name: PACKAGE_NAME,
24
+ version: Lingo::VERSION,
25
+ summary: 'The full-featured automatic indexing system',
26
+ authors: ['John Vorhauer', 'Jens Wille'],
27
+ email: ['lingo@vorhauer.de', 'jens.wille@uni-koeln.de'],
28
+ homepage: 'http://lex-lingo.de',
29
+ extra_files: FileList[
30
+ 'lingo.rb', 'lingo{,-all,-call}.cfg', 'lingo.opt', 'doc/**/*',
31
+ '{de,en}.lang', '{de,en}/{lingo-*,user-dic}.txt', 'txt/artikel{,-en}.txt',
32
+ 'info/gpl-hdr.txt', 'info/*.png', 'lir.cfg', 'txt/lir.txt', 'porter/*',
33
+ 'test.cfg', '{de,en}/test_*.txt'
34
+ ].to_a,
35
+ required_ruby_version: '>= 1.9',
36
+ dependencies: [['ruby-nuggets', '>= 0.8.2'], 'unicode'],
37
+ development_dependencies: [['diff-lcs', '>= 1.1.3'], 'open4']
38
+ }
39
+ }}
40
+ rescue LoadError => err
41
+ warn "Please install the `hen' gem first. (#{err})"
42
+ end
43
+
44
+ CLEAN.include(
45
+ 'txt/*.{log,mul,non,seq,syn,ve?,csv}',
46
+ 'test/{test.*,text.non}',
47
+ 'store/*/*.rev'
48
+ )
49
+
50
+ CLOBBER.include(
51
+ 'store', 'doc' ,'pkg/*', PACKAGE_PATH + '.*'
52
+ )
53
+
54
+ task :checkdoc do
55
+ docfile = File.join(__DIR__, 'doc', 'index.html')
56
+ abort "Please run `rake doc' first." unless File.exists?(docfile)
57
+ end
58
+
59
+ desc 'Run ALL tests'
60
+ task 'test:all' => [:test, 'test:txt', 'test:lir']
61
+
62
+ Rake::TestTask.new(:test) do |t|
63
+ t.test_files = FileList.new('test/ts_*.rb', 'test/attendee/ts_*.rb')
64
+ end
65
+
66
+ desc 'Test against reference file (TXT)'
67
+ task 'test:txt' do
68
+ test_ref('artikel', 'test')
69
+ end
70
+
71
+ desc 'Test against reference file (LIR)'
72
+ task 'test:lir' do
73
+ test_ref('lir')
74
+ end
75
+
76
+ desc 'Run all tests on packaged distribution'
77
+ task 'test:remote' => [:package] do
78
+ chdir(PACKAGE_PATH) { system('rake test:all') } || abort
79
+ end
80
+
81
+ def test_ref(name, cfg = name)
82
+ require 'nuggets/util/ruby'
83
+
84
+ require 'diff/lcs'
85
+ require 'diff/lcs/ldiff'
86
+
87
+ cmd = %W[lingo.rb -c #{cfg} txt/#{name}.txt]
88
+ continue, msg = 0, ["Command failed: #{cmd.join(' ')}"]
89
+
90
+ Process.ruby(*cmd) { |_, _, o, e|
91
+ IO.interact({}, { o => msg, e => msg })
92
+ }.success? or abort msg.join("\n\n")
93
+
94
+ Dir["test/ref/#{name}.*"].each { |ref|
95
+ puts "#{'#' * 60} #{org = ref.sub(/test\/ref/, 'txt')}"
96
+ continue += Diff::LCS::Ldiff.run(ARGV.clear << '-a' << org << ref)
97
+ }
98
+
99
+ exit continue + 1 unless continue.zero?
100
+ end
data/TODO ADDED
@@ -0,0 +1,28 @@
1
+ = ToDo list for Lingo
2
+
3
+ (most important first)
4
+
5
+ * Update and translate old documentation.
6
+ * Allow for handling of documents in various encodings, not just the one the
7
+ dictionaries are encoded in.
8
+ * Provide automatic encoding detection.
9
+ * Provide automatic language detection (as fine-grained as possible).
10
+ * Make lingo run faster!? (benchmark - profile - optimize)
11
+ * Replace SDBM by DBM (more platform-independent, no 1k limit on record size);
12
+ maybe QDBM/Tokyo Cabinet or even CDB for faster access.
13
+ * In-memory (volatile) vs. on-disk (persistent) dictionaries. It should be
14
+ possible to simply use the Lingo API without caring about dictionary storage.
15
+ * That being said, provide an easy-to-use Lingo API -- just 'require "lingo"'
16
+ and go for it!
17
+ * In addition to that, provide sensible string extensions: String#tokenize,
18
+ String#lemmatize, ...
19
+ * Provide a DSL for configuration -- in addition to, or instead of, the current
20
+ YAML format.
21
+ * Make sure the Crypter is sufficiently secure.
22
+ * Use RSpec for testing.
23
+ * Make Lingo capable to use multiple cores or even machines to boost performance
24
+ by connecting Attendees through sockets and use separate processes
25
+
26
+ NOTE: New code *should* meet the guidelines outlined in the
27
+ RubyStyleGuide[https://github.com/bbatsov/ruby-style-guide],
28
+ existing code will be adjusted along the way.