mittens 0.1.1 → 0.3.0

Files changed (100)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +8 -0
  3. data/LICENSE.txt +1 -1
  4. data/README.md +4 -4
  5. data/lib/mittens/version.rb +1 -1
  6. data/mittens.gemspec +1 -1
  7. data/vendor/snowball/.github/workflows/ci.yml +216 -0
  8. data/vendor/snowball/CONTRIBUTING.rst +111 -62
  9. data/vendor/snowball/GNUmakefile +194 -136
  10. data/vendor/snowball/NEWS +798 -3
  11. data/vendor/snowball/README.rst +50 -1
  12. data/vendor/snowball/ada/src/stemmer.adb +25 -13
  13. data/vendor/snowball/ada/src/stemmer.ads +9 -9
  14. data/vendor/snowball/ada/stemmer_config.gpr +7 -7
  15. data/vendor/snowball/algorithms/basque.sbl +4 -19
  16. data/vendor/snowball/algorithms/catalan.sbl +2 -9
  17. data/vendor/snowball/algorithms/danish.sbl +1 -1
  18. data/vendor/snowball/algorithms/dutch.sbl +284 -122
  19. data/vendor/snowball/algorithms/dutch_porter.sbl +178 -0
  20. data/vendor/snowball/algorithms/english.sbl +52 -37
  21. data/vendor/snowball/algorithms/esperanto.sbl +157 -0
  22. data/vendor/snowball/algorithms/estonian.sbl +269 -0
  23. data/vendor/snowball/algorithms/finnish.sbl +2 -3
  24. data/vendor/snowball/algorithms/french.sbl +42 -16
  25. data/vendor/snowball/algorithms/german.sbl +35 -14
  26. data/vendor/snowball/algorithms/greek.sbl +76 -76
  27. data/vendor/snowball/algorithms/hungarian.sbl +8 -6
  28. data/vendor/snowball/algorithms/indonesian.sbl +14 -8
  29. data/vendor/snowball/algorithms/italian.sbl +11 -21
  30. data/vendor/snowball/algorithms/lithuanian.sbl +36 -37
  31. data/vendor/snowball/algorithms/lovins.sbl +0 -1
  32. data/vendor/snowball/algorithms/nepali.sbl +138 -37
  33. data/vendor/snowball/algorithms/norwegian.sbl +19 -5
  34. data/vendor/snowball/algorithms/porter.sbl +2 -2
  35. data/vendor/snowball/algorithms/portuguese.sbl +9 -13
  36. data/vendor/snowball/algorithms/romanian.sbl +17 -4
  37. data/vendor/snowball/algorithms/serbian.sbl +467 -468
  38. data/vendor/snowball/algorithms/spanish.sbl +5 -7
  39. data/vendor/snowball/algorithms/swedish.sbl +60 -6
  40. data/vendor/snowball/algorithms/tamil.sbl +207 -176
  41. data/vendor/snowball/algorithms/turkish.sbl +461 -445
  42. data/vendor/snowball/algorithms/yiddish.sbl +36 -38
  43. data/vendor/snowball/compiler/analyser.c +445 -192
  44. data/vendor/snowball/compiler/driver.c +109 -101
  45. data/vendor/snowball/compiler/generator.c +853 -464
  46. data/vendor/snowball/compiler/generator_ada.c +404 -366
  47. data/vendor/snowball/compiler/generator_csharp.c +297 -260
  48. data/vendor/snowball/compiler/generator_go.c +323 -254
  49. data/vendor/snowball/compiler/generator_java.c +326 -252
  50. data/vendor/snowball/compiler/generator_js.c +362 -252
  51. data/vendor/snowball/compiler/generator_pascal.c +349 -197
  52. data/vendor/snowball/compiler/generator_python.c +257 -240
  53. data/vendor/snowball/compiler/generator_rust.c +423 -251
  54. data/vendor/snowball/compiler/header.h +117 -71
  55. data/vendor/snowball/compiler/space.c +137 -68
  56. data/vendor/snowball/compiler/syswords.h +2 -2
  57. data/vendor/snowball/compiler/tokeniser.c +125 -107
  58. data/vendor/snowball/csharp/Snowball/Among.cs +14 -14
  59. data/vendor/snowball/csharp/Snowball/AssemblyInfo.cs +7 -7
  60. data/vendor/snowball/csharp/Snowball/Stemmer.cs +57 -37
  61. data/vendor/snowball/csharp/Stemwords/App.config +2 -2
  62. data/vendor/snowball/csharp/Stemwords/Program.cs +16 -12
  63. data/vendor/snowball/doc/libstemmer_c_README +7 -4
  64. data/vendor/snowball/doc/libstemmer_csharp_README +4 -1
  65. data/vendor/snowball/doc/libstemmer_java_README +12 -1
  66. data/vendor/snowball/doc/libstemmer_js_README +6 -4
  67. data/vendor/snowball/doc/libstemmer_python_README +9 -4
  68. data/vendor/snowball/examples/stemwords.c +12 -12
  69. data/vendor/snowball/go/env.go +107 -31
  70. data/vendor/snowball/go/util.go +0 -4
  71. data/vendor/snowball/include/libstemmer.h +4 -0
  72. data/vendor/snowball/java/org/tartarus/snowball/Among.java +32 -15
  73. data/vendor/snowball/java/org/tartarus/snowball/SnowballProgram.java +347 -261
  74. data/vendor/snowball/java/org/tartarus/snowball/SnowballStemmer.java +3 -0
  75. data/vendor/snowball/java/org/tartarus/snowball/TestApp.java +52 -37
  76. data/vendor/snowball/javascript/base-stemmer.js +186 -2
  77. data/vendor/snowball/javascript/stemwords.js +3 -6
  78. data/vendor/snowball/libstemmer/libstemmer_c.in +1 -1
  79. data/vendor/snowball/libstemmer/mkalgorithms.pl +6 -6
  80. data/vendor/snowball/libstemmer/mkmodules.pl +2 -2
  81. data/vendor/snowball/libstemmer/modules.txt +13 -10
  82. data/vendor/snowball/libstemmer/test.c +1 -1
  83. data/vendor/snowball/pascal/SnowballProgram.pas +84 -2
  84. data/vendor/snowball/pascal/generate.pl +13 -13
  85. data/vendor/snowball/python/create_init.py +4 -1
  86. data/vendor/snowball/python/setup.cfg +0 -3
  87. data/vendor/snowball/python/setup.py +8 -3
  88. data/vendor/snowball/python/snowballstemmer/basestemmer.py +20 -54
  89. data/vendor/snowball/python/stemwords.py +8 -12
  90. data/vendor/snowball/runtime/api.c +10 -5
  91. data/vendor/snowball/runtime/header.h +10 -9
  92. data/vendor/snowball/runtime/utilities.c +9 -9
  93. data/vendor/snowball/rust/build.rs +1 -1
  94. data/vendor/snowball/rust/src/snowball/snowball_env.rs +83 -5
  95. data/vendor/snowball/tests/stemtest.c +7 -4
  96. metadata +8 -12
  97. data/vendor/snowball/.travis.yml +0 -112
  98. data/vendor/snowball/algorithms/german2.sbl +0 -145
  99. data/vendor/snowball/algorithms/kraaij_pohlmann.sbl +0 -240
  100. data/vendor/snowball/compiler/syswords2.h +0 -13
data/vendor/snowball/NEWS CHANGED
@@ -1,3 +1,779 @@
1
+ Snowball 3.0.1 (2025-05-09)
2
+ ===========================
3
+
4
+ Python
5
+ ------
6
+
7
+ * The __init__.py in 3.0.0 was incorrectly generated due to a missing
8
+ build dependency and the list of algorithms was empty. First reported by
9
+ laymonage. Thanks to Dmitry Shachnev, Henry Schreiner and Adam Turner for
10
+ diagnosing and fixing. (#229, #230, #231)
11
+
12
+ * Add trove classifiers for Armenian and Yiddish which have now been registered
13
+ with PyPI. Thanks to Henry Schreiner and Dmitry Shachnev. (#228)
14
+
15
+ * Update documented details of Python 2 support in old versions.
16
+
17
+ Snowball 3.0.0 (2025-05-08)
18
+ ===========================
19
+
20
+ Ada
21
+ ---
22
+
23
+ * Bug fixes:
24
+
25
+ + Fix invalid Ada code generated for Snowball `loop` (it was partly Pascal!)
26
+ None of the stemmers shipped in previous releases triggered this bug, but
27
+ the Turkish stemmer now does.
28
+
29
+ + The Ada runtime was not tracking the current length of the string
30
+ but instead used the current limit value or some other substitute, which
31
+ manifested as various incorrect behaviours for code inside of `setlimit`.
32
+
33
+ + `size` was incorrectly returning the difference between the limit and the
34
+ backwards limit.
35
+
36
+ + `lenof` or `sizeof` on a string variable generated Ada code that didn't
37
+ even compile.
38
+
39
+ + Fix incorrect preconditions on some methods in the runtime.
40
+
41
+ + Fix bug in runtime code used by `attach`, `insert`, `<-` and string
42
+ variable assignment when a (sub)string was replaced with a larger string.
43
+ This bug was triggered by code in the Kraaij-Pohlmann Dutch stemmer
44
+ implementation (which was previously not enabled by default but is now the
45
+ standard Dutch stemmer).
46
+
47
+ + Fix invalid code generated for `insert`, `<-` and string variable
48
+ assignment. This bug was triggered by code in the Kraaij-Pohlmann
49
+ Dutch stemmer implementation (which was previously not enabled by default
50
+ but is now the standard Dutch stemmer).
51
+
52
+ + Generate valid code for programs which don't use `among`. This didn't
53
+ affect code generation for any algorithms we currently ship.
54
+
55
+ + If the end of a routine was unreachable code the Snowball compiler
56
+ would think the start of the next routine was also unreachable and would
57
+ not generate it. This didn't affect code generation for any algorithms we
58
+ currently ship.
59
+
60
+ * Code quality:
61
+
62
+ + Only declare variables A and C when each is needed.
63
+
64
+ + Fix indentation of generated declarations.
65
+
66
+ + Drop extra blank line before `Result := True`.
67
+
68
+ C/C++
69
+ -----
70
+
71
+ * Bug fixes:
72
+
73
+ + Fix potential NULL dereference in runtime code if we failed to allocate
74
+ memory for the p or S member for a Snowball program which uses one or more
75
+ string variables. Problem was introduced in Snowball 2.0.0. Fixes #206,
76
+ reported by Maxim Korotkov.
77
+
78
+ + Fix invalid C code generated when a failure is handled in a context with
79
+ the opposite direction to where it happened, for example:
80
+
81
+ externals (stem)
82
+ define stem as ( try backwards 'x' )
83
+
84
+ This was fixed by changing the C generator to work like all the other
85
+ generators and pre-generate the code to handle failure.
86
+
87
+ + Eliminate assumptions that NULL has all-zero bit pattern. We don't know
88
+ of any current platforms where this assumption fails, but the C standard
89
+ doesn't require an all-zero bit pattern for NULL. Fixes #207.
90
+
91
+ * Optimisations:
92
+
93
+ + Store index delta for among substring_i field. This makes trying
94
+ substrings after a failed match slightly faster because we can just add
95
+ the offset to the pointer we already have to the current element.
96
+
97
+ * Code quality:
98
+
99
+ + Improve formatting of generated code.
100
+
101
+ C#
102
+ --
103
+
104
+ * Bug fixes:
105
+
106
+ + Add missing runtime support for testing for a string var at the current
107
+ position when working forwards. This situation isn't exercised by any of
108
+ the stemming algorithms we currently ship.
109
+
110
+ + Adjust generated code to work around a code flow analysis bug in the `mcs`
111
+ C# compiler.
112
+
113
+ * Code quality:
114
+
115
+ + Prune unused `using System.Text;`.
116
+
117
+ + Generate C# with UTF-8 source encoding. This makes the generated code
118
+ easier to follow, which helps during development. It's also a bit smaller.
119
+ For now codepoints U+0590 and above are still emitted as escape sequences
120
+ to avoid confusing source code rendering when RTL scripts are involved.
121
+
122
+ Go
123
+ --
124
+
125
+ * Optimisations:
126
+
127
+ + Drop some unneeded Go code generated for string `$`. None of the shipped
128
+ stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
129
+ website does.
130
+
131
+ * Code quality:
132
+
133
+ + Dispatch among result with `switch` instead of an `if` ... `else if` chain
134
+ (which looks like we did because the Go generator evolved from the Python
135
+ generator and Python didn't use to have a switch-like construct). This
136
+ doesn't make a measurable speed difference so it seems the Go compiler is
137
+ optimising both to equivalent code, but using a switch here seems clearer,
138
+ a better match for the intent, and is a bit simpler to generate.
139
+
140
+ + Generate Go with UTF-8 source encoding. This makes the generated code
141
+ easier to follow, which helps during development. It's also a bit smaller.
142
+ For now codepoints U+0590 and above are still emitted as escape sequences
143
+ to avoid confusing source code rendering when RTL scripts are involved.
144
+
145
+ Java
146
+ ----
147
+
148
+ * The Java code generated by Snowball now requires Java >= 7. Java 7
149
+ was released in 2011, and Java 6's EOL was 2013 so we don't expect this
150
+ to be a problematic requirement. See #195.
151
+
152
+ * Optimisations:
153
+
154
+ + We now store the current string in a `char[]` rather than using a
155
+ `StringBuilder` to reduce overheads. The `getCurrent()` method continues
156
+ to return a Java `String`, but the `char[]` can be accessed using the new
157
+ `getCurrentBuffer()` and `getCurrentBufferLength()` methods. Patch from
158
+ Robert Muir (#195).
159
+
160
+ + Use a more efficient mechanism for calling `among` functions. Patch from
161
+ Robert Muir (#195).
162
+
163
+ * Code quality:
164
+
165
+ + Consistently put `[]` right after element type for array types, which seems
166
+ the most used style.
167
+
168
+ + Fix javac warnings in SnowballProgram.java.
169
+
170
+ + Improve formatting of generated code.
171
+
172
+ Javascript
173
+ ----------
174
+
175
+ * Bug fixes:
176
+
177
+ + Use base class specified by `-p` in string `$` rather than hard-coding
178
+ `BaseStemmer` (which is the default if you don't specify `-p`). None of
179
+ the shipped stemmers use string `$`, though the Schinke Latin stemmer
180
+ algorithm on the website does.
181
+
182
+ * Code quality:
183
+
184
+ + Modernise the generated code a bit. Loosely based on changes proposed in
185
+ #123 by Emily Marigold Klassen.
186
+
187
+ * Other changes:
188
+
189
+ + The Javascript runner is now specified by make variable `JSRUN` instead
190
+ of `NODE` (since node is just one JS implementation). The default value
191
+ is now `node` instead of `nodejs` (older Debian and Ubuntu packages used
192
+ `/usr/bin/nodejs` because `/usr/bin/node` was already in use by a
193
+ completely different package, but that has since changed).
194
+
195
+ Pascal
196
+ ------
197
+
198
+ * Bug fixes:
199
+
200
+ + Add missing semicolons to code generated in some cases for a function which
201
+ always succeeds or always fails. The new dutch.sbl was triggering this
202
+ bug.
203
+
204
+ + If the end of a routine was unreachable code the Snowball compiler
205
+ would think the start of the next routine was also unreachable and would
206
+ not generate it. This didn't affect code generation for any algorithms we
207
+ currently ship.
208
+
209
+ * Code quality:
210
+
211
+ + Eliminate commented out code generated for string `$`. None of the shipped
212
+ stemmers use string `$`, though the Schinke Latin stemmer algorithm on the
213
+ website does.
214
+
215
+ * Other changes:
216
+
217
+ + Enable warnings, etc from fpc.
218
+
219
+ + Select GNU-style diagnostic format.
220
+
221
+ Python
222
+ ------
223
+
224
+ * Optimisations:
225
+
226
+ + Use Python set for grouping checks. This speeds up running the Python
227
+ testsuite by about 4%.
228
+
229
+ + Routines used in `among` are now referenced by name directly in the
230
+ generated code, rather than using a string containing the name. This
231
+ avoids a `getattr()` call each time an among wants to call a routine. This
232
+ doesn't seem to make a measurable speed difference, but it's cleaner and
233
+ avoids problems with name mangling. Suggested by David Corbett in #217.
234
+
235
+ + Simplify code generated for `loop`. If the iteration count is constant and
236
+ at most 4 then iterate over a tuple which microbenchmarking shows is
237
+ faster. The only current uses of loop in the shipped stemmers are `loop 2`
238
+ so these benefit from this. Otherwise we now use `range(AE)` instead of
239
+ `range (AE, 0, -1)` (the actual value of the loop variable is never
240
+ used so only the number of iterations matters).
241
+
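A minimal Python sketch of two of the optimisations above: a grouping check backed by a set, and the two loop shapes. The names `VOWELS`, `in_grouping` and `repeat` are hypothetical and are not taken from the generated stemmers or the `snowballstemmer` runtime.

```python
# Hypothetical grouping; real generated code defines its own groupings.
VOWELS = frozenset("aeiouy")


def in_grouping(word, cursor, grouping=VOWELS):
    """Return True if the character at `cursor` belongs to `grouping`."""
    return cursor < len(word) and word[cursor] in grouping


def repeat(count, action):
    """Call `action` `count` times and return how many calls were made."""
    if count <= 4:
        # Small counts: iterate over a tuple. A generator could emit this
        # tuple as a literal; it is built at run time here for brevity.
        iterations = (1,) * count
    else:
        # range(count) yields as many items as range(count, 0, -1); the
        # loop variable is never read, so only the item count matters.
        iterations = range(count)
    calls = 0
    for _ in iterations:
        action()
        calls += 1
    return calls
```

Set membership is a single hash lookup, and both loop shapes run the body the same number of times.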
242
+ * Bug fixes:
243
+
244
+ + Correctly handle stemmer names with an underscore.
245
+
246
+ * Code quality:
247
+
248
+ + Generate Python with UTF-8 source encoding. This makes the generated code
249
+ easier to follow, which helps during development. It's also a bit smaller.
250
+ For now codepoints U+0590 and above are still emitted as escape sequences
251
+ to avoid confusing source code rendering when RTL scripts are involved.
252
+
253
+ * Other changes:
254
+
255
+ + Set python_requires to indicate to install tools that the generated code
256
+ won't work with Python 3.0.x, 3.1.x and 3.2.x (due to use of `u"foo"`
257
+ string literals). Closes #192 and #191, opened by Andreas Maier.
258
+
259
+ + Add classifiers to indicate support for Python 3.3 and for 3.8 to 3.13.
260
+ Fixes #158, reported by Dmitry Shachnev.
261
+
262
+ + Stop marking the wheel as universal, which had started to give a warning
263
+ message. Patch from Dmitry Shachnev (#210).
264
+
265
+ + Stop calling `setup.py` directly which is deprecated and now produces a
266
+ warning - use the `build` module instead. Patch from Dmitry Shachnev
267
+ (#210).
268
+
269
+ Rust
270
+ ----
271
+
272
+ * Optimisations:
273
+
274
+ + Shortcut unnecessary calls to find_among, porting an optimization from the
275
+ C generator. In some stemming benchmarks this improves the performance
276
+ of the rust english stemmer by about 27%. Patch from jedav (#202).
277
+
278
+ * Code quality:
279
+
280
+ + Suppress unused_parens warning, for example triggered by the code generated
281
+ for `$x = x*x` (where `x` is an integer).
282
+
283
+ + Dispatch `among` result with `match` instead of an `if` ... `else if` chain
284
+ (which looks like we did because the Rust generator evolved from the Python
285
+ generator and Python didn't use to have a switch-like construct). This
286
+ results in a 3% speed-up for an unoptimised Rust compile but doesn't seem
287
+ to make a measurable difference when optimising so it seems the Rust
288
+ compiler is optimising both to equivalent code. However using a `match`
289
+ here seems clearer, a better match for the intent, and is a bit simpler to
290
+ generate.
291
+
292
+ + Generate Rust with UTF-8 source encoding. This makes the generated code
293
+ easier to follow, which helps during development. It's also a bit smaller.
294
+ For now codepoints U+0590 and above are still emitted as escape sequences
295
+ to avoid confusing source code rendering when RTL scripts are involved.
296
+
297
+ New stemming algorithms
298
+ -----------------------
299
+
300
+ * Add Esperanto stemmer from David Corbett (#185).
301
+
302
+ * Add Estonian algorithm from Linda Freienthal (#108).
303
+
304
+ Behavioural changes to existing algorithms
305
+ ------------------------------------------
306
+
307
+ * Dutch: Switch to Kraaij-Pohlmann as the default for Dutch. In case you
308
+ want Martin Porter's Dutch stemming algorithm for compatibility, this is now
309
+ available as `dutch_porter`. Fixes #1, reported by gboer.
310
+
311
+ * Dutch (Kraaij-Pohlmann): Fix differences between the Snowball implementation
312
+ and the original C implementation.
313
+
314
+ * Dutch (Kraaij-Pohlmann): Add a small number of exceptions to the Snowball
315
+ implementation to avoid unwanted conflations. This addresses all cases so
316
+ far identified which Martin's Dutch stemmer handled better. Fixes #208.
317
+
318
+ * Dutch (Porter): The "at least 3 characters" part of the R1 definition was
319
+ actually implemented such that when working in UTF-8 it was "at least 3
320
+ bytes". We stripped accents normally found in Dutch except for `è` before
321
+ setting R1, and no Dutch words starting `è` seem to stem differently
322
+ depending on encoding, but proper nouns and other words of foreign origin may
323
+ contain other accented characters and it seems better for the stemmer to
324
+ handle such words the same way regardless of the encoding in use.
325
+
326
+ * English: Replace '-ogist' with '-og' to conflate "geologist" and "geology", etc.
327
+ Suggested by Marc Schipperheijn on snowball-discuss.
328
+
329
+ * English: Add extra condition to undoubling. We no longer undouble if the
330
+ double consonant is preceded by exactly "a", "e" or "o" to avoid conflating
331
+ "add"/"ad", "egg"/"eg", "off"/"of", etc. Fixes #182, reported by Ed Page.
332
+
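A rough Python sketch of the extra undoubling condition, assuming "preceded by exactly" means that the single letter is everything in front of the double consonant. The function name `undouble` and the `DOUBLES` tuple are illustrative, and the list of doubles is an assumption based on the published English algorithm. This is not the Snowball source.

```python
# Assumed list of double consonants handled by the English stemmer.
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def undouble(word):
    """Drop the last letter of a trailing double consonant, unless the
    double is preceded by exactly "a", "e" or "o" (add, egg, off)."""
    if word.endswith(DOUBLES) and word[:-2] not in ("a", "e", "o"):
        return word[:-1]
    return word
```

With this condition, an intermediate form such as "hopp" still becomes "hop", while "add", "egg" and "off" are left alone.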
333
+ * English: Avoid conflating 'emerge' and 'emergency'. Reported by Frederick Ross
334
+ on snowball-discuss.
335
+
336
+ * English: Avoid conflating 'evening' and 'even'. Reported by Ann B on
337
+ snowball-discuss.
338
+
339
+ * English: Avoid conflating 'lateral' and 'later'. Reported by Steve Tolkin on
340
+ snowball-discuss.
341
+
342
+ * English: Avoid conflating 'organ', 'organic' and 'organize'.
343
+
344
+ * English: Avoid conflating 'past' and 'paste'. Reported by Sonny on
345
+ snowball-discuss.
346
+
347
+ * English: Avoid conflating 'universe', 'universal' and 'university'. Reported
348
+ by Clem Wang on snowball-discuss.
349
+
350
+ * English: Handle -eed and -ing exceptions in their respective rules.
351
+ This avoids the overhead of checking for them for the majority of
352
+ words which don't end -eed or -ing. It also allows us to easily handle
353
+ vying->vie and hying->hie at basically no extra cost. Reduces the time to
354
+ stem all words in our English word list by nearly 2%.
355
+
356
+ * French: Remove elisions as first step. See #187. Originally reported by
357
+ Paul Rudin and kelson42.
358
+
359
+ * French: Remove -aise and -aises so for example, "française" and "françaises"
360
+ are now conflated with "français". Fixes #209. Originally reported by
361
+ ririsoft and Fred Fung.
362
+
363
+ * French: Avoid incorrect conflation of `mauvais` (bad) with `mauve` (mauve,
364
+ mallow or seagull); avoid conflating `mal` with `malais`, `pal` with
365
+ `palais`, etc.
366
+
367
+ * French: Avoid conflating `ni` (neither/nor) with `niais`
368
+ (inexperienced/silly) and `nie`/`nié`/`nier`/`nierais`/`nierons` (to deny).
369
+
370
+ * French: -oux -> -ou. Fixes #91, reported by merwok.
371
+
372
+ * German: Replace with the "german2" variant. This normalises umlauts ("ä" to
373
+ "ae", "ö" to "oe", "ü" to "ue") which is presumably much less common in
374
+ newly created text than it once was as modern computer systems generally
375
+ don't have the limitations which motivated this, but there will still be
376
+ large amounts of legacy text which it seems helpful for the stemmer to
377
+ handle without having to know to select a variant.
378
+
379
+ On our sample German vocabulary which contains 35033 words, 77 words give
380
+ different stems. A significant proportion of these are foreign words, and
381
+ some are proper nouns. Some cases definitely seem improved, and quite a few
382
+ are just different but effectively just change the stem for a word or group
383
+ of words to a stem that isn't otherwise generated. There don't seem to be any
384
+ changes that are clearly worse, though there are some changes that have both
385
+ good and bad aspects to them.
386
+
387
+ Fixes #92, reported by jrabensc.
388
+
389
+ * German: Don't remove -em if preceded by -syst to avoid overstemming words
390
+ ending -system. This change means we now conflate e.g. "system" and
391
+ "systemen". Partly addresses #161, reported by Olga Gusenikova.
392
+
393
+ * German: Remove -erin and -erinnen suffixes which conflates singular and
394
+ plural female versions of nouns with the male versions. Fixes #85 and
395
+ partly addresses #161, reported by Olga Gusenikova.
396
+
397
+ * German: Replace -ln and -lns with -l. This improves 82 cases in the current
398
+ sample data without making anything worse. Tests on a larger word list look
399
+ good too. Partly addresses #161, reported by Olga Gusenikova.
400
+
401
+ * German: Remove -et suffix when we safely can. Fixes #200, reported by Robert
402
+ Frunzke.
403
+
404
+ * Greek: Fix "faulty slice operation" for input `ισαισα`. The fix changes
405
+ `ισα` to stem to `ισ` instead of the empty string, which seems better (and to
406
+ be what the second paper actually says to do if read carefully). Fixes #204,
407
+ reported by subnix.
408
+
409
+ * Italian: Address overstemming of "divano" (sofa) which previously stemmed to
410
+ "div", which is the stem for 'diva' (diva). Now it is stemmed to 'divan',
411
+ which is what its plural form 'divani' already stemmed to. Fixes #49,
412
+ reported by francesco.
413
+
414
+ * Norwegian: Improve stemming of words ending -ers. Fixes #175, reported by
415
+ Karianne Berg.
416
+
417
+ * Norwegian: Include more accented vowels - treating "ê", "ò", "ó" and "ô"
418
+ as vowels improves the stemming of a fairly small number of words, but
419
+ there's basically no cost to having extra vowels in the grouping, and some
420
+ of these words are commonly used. Fixes #218, reported by András Jankovics.
421
+
422
+ * Romanian: Fix to work with Romanian text encoded using the correct Unicode
423
+ characters. Romanian uses a "comma below" diacritic on letters "s" and "t"
424
+ ("ș" and "ț"). Before Unicode these weren't easily available so Romanian
425
+ text was written using the visually similar "cedilla" diacritic on these
426
+ letters instead ("ş" and "ţ"). Previously our stemmer only recognised the
427
+ latter. Now it maps the cedilla forms to "comma below" as a first step.
428
+ Patch from Robert Muir.
429
+
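The mapping step described for Romanian can be sketched in Python as below. The names `CEDILLA_TO_COMMA_BELOW` and `normalise_romanian` are hypothetical. Only lowercase letters are mapped, on the assumption that input has already been lowercased before stemming.

```python
# Map the legacy cedilla forms to the "comma below" forms.
CEDILLA_TO_COMMA_BELOW = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})


def normalise_romanian(text):
    """Return `text` with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA_BELOW)
```

Text that already uses the comma-below characters passes through unchanged, so both encodings reach the rest of the stemmer in the same form.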
430
+ * Spanish: Handle -acion like -ación and -ucion like -ución. It's apparently
431
+ common to miss off accents in Spanish, and there are examples in our test
432
+ vocabulary that this change helps. Proposed by Damian Janowski.
433
+
434
+ * Swedish: Replace suffix "öst" with "ös" when preceded by any of 'iklnprtuv'
435
+ rather than just 'l'. The new rule only requires the "öst" to be in R1
436
+ whereas previously we required all of "löst" to be. This second tweak
437
+ doesn't seem to affect any words ending "löst" but it conflates a few extra
438
+ cases when combined with the expanded list of preceding letters, and seems
439
+ more logical linguistically (since "ös" is akin to "ous" in English). Fixes
440
+ #152, reported by znakeeye.
441
+
442
+ * Swedish: Remove -et/-ets in cases where it helps. Removing -et can't be done
443
+ unconditionally because many words end in -et where this isn't a suffix.
444
+ However it's a very common suffix so it seems worth crafting a more complex
445
+ condition under which to remove. Fixes #47.
446
+
447
+ * Turkish: Remove proper noun suffixes. For example, `Türkiye'dir` ("it is
448
+ Turkey") is now conflated with `Türkiye` ("Turkey"). Fixes #188.
449
+
450
+ * Yiddish: Avoid generating empty stem for input "גע" (not a valid word, but
451
+ it's better to avoid an empty stem for any non-empty input).
452
+
453
+ Optimisations to existing algorithms
454
+ ------------------------------------
455
+
456
+ * General change: Use `gopast` everywhere to establish R1 and R2 as it is a
457
+ little more efficient to do so.
458
+
459
+ * Basque: Use an empty action rather than replacing the suffix with itself
460
+ which seems clearer and is a little more efficient.
461
+
462
+ * Dutch (Porter): Optimise prelude routine.
463
+
464
+ * English: Remove unnecessary exception for `skis` as the algorithm stems
465
+ `skis` to `ski` by itself (`skies` and `sky` do still need a special case to
466
+ avoid conflation with `ski` though).
467
+
468
+ * Hungarian: We no longer take digraphs into account when determining where R1
469
+ starts. This can only make a difference to the stemming if we removed a
470
+ suffix that started with the last character of the digraph (or with "zs" in
471
+ the case of "dzs"), and that doesn't happen for any of the suffixes we remove
472
+ for any valid Hungarian words. This simplification speeds up stemming by
473
+ ~2% on the current sample vocabulary list. See #216. Thanks to András
474
+ Jankovics for confirming no Hungarian words are affected by this change.
475
+
476
+ * Lithuanian: Remove redundant R1 check.
477
+
478
+ * Nepali: Eliminate redundant check_category_2 routine.
479
+
480
+ * Tamil: Optimise by using `among` instead of long `or` chains. The generated
481
+ C version now takes 43% less time to process the test vocabulary.
482
+
483
+ * Tamil: Remove many cases which can't be triggered due to being handled by
484
+ another case.
485
+
486
+ * Tamil: Clean up some uses of `test`.
487
+
488
+ * Tamil: Make `fix_va_start` simpler and faster.
489
+
490
+ * Tamil: Localise use of `found_a_match` flag.
491
+
492
+ * Tamil: Eliminate pointless flag changes.
493
+
494
+ * Turkish: Minor optimisations.
495
+
496
+ Code clarity improvements to existing algorithms
497
+ ------------------------------------------------
498
+
499
+ * Stop noting dates changes were made in comments in the code - we now maintain
500
+ a changelog in each algorithm's description page on the website (and the
501
+ version control history provides a finer grained view).
502
+
503
+ * Always use `insert` instead of `<+` as the named command seems clearer.
504
+
505
+ * English: Add comments documenting motivating examples for all exceptional
506
+ cases.
507
+
508
+ * Lithuanian: Change to recommended latin stringdef codes. Using common codes
509
+ makes it easier to work across algorithms, but they are more mnemonic so also
510
+ seem clearer when just considering this one algorithm.
511
+
512
+ * Serbian: Change to recommended latin stringdef codes. Using common codes
513
+ makes it easier to work across algorithms, but they are more mnemonic so also
514
+ seem clearer when just considering this one algorithm.
515
+
516
+ * Turkish: Use `{sc}` for s-cedilla and `{i}` for dotless-i to match other
517
+ uses.
518
+
519
+ Compiler
520
+ --------
521
+
522
+ * Generic code generation improvements:
523
+
524
+ + Show Snowball source leafname in "generated" comment at start of files.
525
+
526
+ + Add generic reachability tracking machinery. This facilitates various new
527
+ optimisations, so far the following have been implemented:
528
+
529
+ - Tail-calling
530
+ - Simpler code for calling routines which always give the same signal
531
+ - Simpler code when a routine ends in an integer test (this also allows
532
+ eliminating an Ada-specific codegen optimisation which did something
533
+ similar but only for routines which consisted *entirely* of a single
534
+ integer test).
535
+ - Dead code reporting and removal (only in simple cases currently)
536
+
537
+ Currently this overlaps in functionality with the existing reachability
538
+ tracking which is implemented on a per-language basis, and only for some
539
+ languages. This reachability tracking was originally added for Java
540
+ where some unreachable code is invalid and results in a compile time error,
541
+ but then seems to have been copied for some other newer languages which
542
+ may or may not actually need it. The approach it uses unfortunately
543
+ relies on correctly updating the reachability flag anywhere in the
544
+ generator code where reachability can change which has proved to be a
545
+ source of bugs, some unfixed. This new approach seems better and with some
546
+ more work should allow us to eliminate the older code. Fixes #83.
547
+
548
+ + Omit check for `among` failing in generated code when we can tell at
549
+ compile time that it can't fail.
550
+
551
+ + Optimise `goto`/`gopast` applied to a grouping or inverted grouping (which
552
+ is by far the most common way to use `goto`/`gopast`) for all target
553
+ languages (new for Go, Java, Javascript, Pascal and Rust).
554
+
555
+ + We never need to restore the cursor after `not`. If `not` turns signal `f`
556
+ into `t` then it sets `c` back to its old position; otherwise, `not`
557
+ signals `f` and `c` will get reset by whatever ultimately handles this `f`
558
+ (or the program exits and the position of `c` no longer matters). This
559
+ slightly improves the generated code for the `english` and `porter`
560
+ stemmers.
561
+
562
+ + Don't generate code for undefined or unused routines.
563
+
564
+ + Avoid generating variable names and then not actually using them. This
565
+ eliminates mysterious gaps in the numbering of variables in the generated
566
+ code.
567
+
568
+ + Eliminate `!`/`not` from integer test code by generating the inverse
569
+ comparison operator instead for all languages, e.g. for Python we now
570
+ generate
571
+
572
+ if self.I_p1 >= self.I_x:
573
+
574
+ instead of
575
+
576
+ if not self.I_p1 < self.I_x:
577
+
578
+ This isn't going to be faster in compiled languages with an optimiser but
579
+ for scripting languages it may be faster, and even if not, it makes for a
580
+ little less work when loading the script.
581
+
582
+ + Canonicalise `hop 1` to `next` as the generated code for `next` can be
583
+ slightly more efficient. This will also apply to `hop` followed by a
584
+ constant expression which Snowball can reduce to `1`.
585
+
586
+ + Avoid trailing whitespace in generated files.
587
+
588
+ + Fix problems with --comments option:
589
+
590
+ - When generating C code we would segfault for code containing `atleast`,
591
+ `hop` or integer tests.
592
+ - Fix missing comments for some commands in some target languages.
593
+ - Fix inconsistent formatting of comments in some target languages.
594
+ - Comments in C are now always on their own line - previously some were
595
+ at the end of the line and some on their own line which made them
596
+ harder to follow.
597
+ - Emit comments before `among` and before routine/external definitions.
598
+
599
+ + Simplify more cases of numeric expressions (e.g. `x * 1` to `x`).
600
+
601
+ * Improve --help output.
602
+
603
+ * Division by zero during constant folding now gives an error.
604
+
605
+ * For `hop` followed by an unexpected token (e.g. `hop hop`) we were
606
+ already emitting a suitable error but would then segfault.
607
+
608
+ * Emit error for redefinition of a grouping.
609
+
610
+ * Improve errors for `define` of an undeclared name. We already peek at the
611
+ next token to decide whether to try to parse as a routine or grouping.
612
+ Previously we parsed as a routine if it was `as`, and a grouping otherwise,
613
+ but routine definitions are more common and a grouping can only start with
614
+ a literal string or a name, so now we assume a routine definition with a
615
+ missing `as` if the next token isn't valid for either.
616
+
617
+ * Suppress duplicate (or even triplicate) "unexpected" errors for the same
618
+ token when the compiler tried to recover from the error by adjusting the
619
+ parse state and marking the token to be reparsed, but the same token then
620
+ failed to parse in the new state.
621
+
622
+ * Fix NULL pointer dereference if an undefined grouping is used in the
623
+ definition of another grouping.
624
+
625
+ * Fix mangled error for `set` or `unset` on a non-boolean:
626
+
627
+ test.sbl:2: nameInvalid type 98 in name_of_type()
628
+
629
+ * Emit warning if `=>` is used. The documentation of how it works doesn't
630
+ match the implementation, and it seems it has only ever been used in the
631
+ Schinke stemmer implementation (which assumes the implemented behaviour).
632
+ We've updated the Schinke implementation to avoid it. If you're using it
633
+ in your own Snowball code please let us know.
634
+
635
+ * Improve errors for unterminated string literals.
636
+
637
+ * Fix NULL pointer dereference on invalid code such as `$x = $y`.
638
+
639
+ * If malloc fails while compiling the compiler will now report the failure
640
+ and exit. Previously the NULL return from malloc wasn't checked for so
641
+ we'd typically segfault.
642
+
643
+ * `lenof` and `sizeof` applied to a string variable now mark the variable
644
+ as used, which avoids a bogus error followed by a confusing additional
645
+ message if this is the only use of that variable:
646
+
647
+ lenofsizeofbug.sbl:3: warning: string 's' is set but never used
648
+ Unhandled type of dead assignment via sizeof
649
+
650
+ This situation is unlikely to occur in real world code.
651
+
652
+ * The reported line number for "string not terminated" error was one too high
653
+ in the case where we were in a stringdef (but correct if we weren't).
654
+
655
+ * Eliminate special handling for among starter. We now convert the starter
656
+ to be a command before the among, adding an explicit substring if there
657
+ isn't one.
658
+
659
+ * We now warn if the body of a `repeat` or `atleast` loop always signals
660
+ `t` (meaning it will loop forever which is very undesirable for a stemming
661
+ algorithm) or always signals `f` (meaning it will never loop, which seems
662
+ unlikely to be what was intended).
663
+
664
+ * Release memory in compiler before exit. The OS will free all allocated
665
+ memory when a process exits, so this memory isn't actually leaked, but it can
666
+ be annoying when using snowball as part of a larger build process with
667
+ some leak-finding tools. Patch from jsteemann in #166.
668
+
669
+ * Store textual data more efficiently in memory during Snowball compilation.
670
+ Previously almost all textual data was stored as 16 bit values, but most
671
+ such data only uses 8 bit character values. Doubling the memory usage
672
+ isn't really an issue as Snowball programs are tiny, but this also
673
+ complicated code handling such data. Now only literal strings use the
674
+ 16 bit values.
675
+
676
+ * Fix clang -Wunused-but-set-variable warning in compiler code.
677
+
678
+ * Fix a few -Wshadow warnings in compiler and enable this warning by default.
679
+
680
+ * Tighten parsing of `writef()` format strings. We now error out on
681
+ unrecognised escape codes or if a numbered escape is used with too high a
682
+ number or a non-digit. This change reveals that the Go and Rust generators
683
+ were using invalid escape ~A - the old writef() code was substituting this
684
+ with just A which is what is wanted so this case was harmless but being
685
+ lenient here could hide bugs, especially when copying code between
686
+ generators as they don't all support the same set of format codes.
687
+
688
+ Build system
689
+ ------------
690
+
691
+ * Turn on Java warnings and make them errors.
692
+
693
+ * Compile C code with -g by default. This makes debugging easier, and
694
+ matches the default for at least some other build systems (e.g. autotools).
695
+
696
+ * Fix "make clean" to remove all built Ada files.
697
+
698
+ * Clean `stemtest` too. Patch from Stefano Rivera.
699
+
700
+ * Add missing `COMMON_FILES` dependency to dist targets.
701
+
702
+ * GNUmakefile: Tidy up and make more consistent
703
+
704
+ * GNUmakefile: Make use of $* to improve speed and readability.
705
+
706
+ * Use $(patsubst ...) instead of sed in .java.class rule which gives cleaner
707
+ make output and is a bit more efficient.
708
+
709
+ * Add `WERROR` make variable to provide a way to add `-Werror` to existing
710
+ CFLAGS.
711
+
712
+ libstemmer
713
+ ----------
714
+
715
+ Testsuite
716
+ ---------
717
+
718
+ * Give a clear error if snowball-data isn't found. Fixes #196, reported by
719
+ Andrea Maccis.
720
+
721
+ * Handle not thinning testdata better. If THIN_FACTOR is set to 1 we no longer
722
+ run gzipped test data through awk. We also now handle THIN_FACTOR being set
723
+ empty as equivalent to 1 for convenience.
724
+
725
+ * csharp_stemwords: Correctly handle a stemmer name containing an underscore.
726
+
727
+ * csharp_stemwords: Make `-i` option optional and read from stdin if omitted,
728
+ like the C version does.
729
+
730
+ * csharp_stemwords: Process the input line by line which is more helpful for
731
+ interactive testing, and also a little faster.
732
+
733
+ * Fix Java TestApp to allow a single argument. The documented command line
734
+ syntax is that you only need to specify the language and there was already
735
+ code to read from stdin if no input file was specified, but at least two
736
+ command line options were required.
737
+
738
+ * Fix deprecation warning in TestApp.java.
739
+
740
+ * Optimise TestApp.java by creating fewer objects. Patch from Robert Muir.
741
+
742
+ * stemwords.py: We no longer create an empty output file if we fail to open the
743
+ input file.
744
+
745
+ * stemwords: Improve error message to say "Out of memory or internal error"
746
+ rather than just "Out of memory".
747
+
748
+ Documentation
749
+ -------------
750
+
751
+ * Include "what is stemming" section in each README.
752
+
753
+ * Include section on threads in each README. Based on patch for Python from
754
+ dbcerigo.
755
+
756
+ * Document that input should be lowercase with composed accents. See #186,
757
+ reported by 1993fpale.
758
+
759
+ * Add README section on building, including notes on cross-compiling. Fixes
760
+ #205, reported by sin-ack.
761
+
762
+ * CONTRIBUTING.rst: Clarify which charsets to list
763
+
764
+ * CONTRIBUTING.rst: Add general advice section. In particular, note to use
765
+ spaces-only for indentation in most cases. Thanks to Dmitry Shachnev for
766
+ raising this point.
767
+
768
+ * CONTRIBUTING.rst: Note that UTF-8 is OK in comments. Thanks to Dmitry
769
+ Shachnev for asking.
770
+
771
+ * Fix some typos. Patch from Josh Soref.
772
+
773
+ * Document that our CI now uses GitHub Actions.
774
+
775
+ * Update link to Greek stemmer PDF. Patch from Michael Bissett (#33).
776
+
1
777
  Snowball 2.2.0 (2021-11-10)
2
778
  ===========================
3
779
 
@@ -37,9 +813,13 @@ Python
37
813
 
38
814
  Noted by David Corbett.
39
815
 
40
- Code Quality Improvements
816
+ Code quality Improvements
41
817
  -------------------------
42
818
 
819
+ * C/C++: Generate INT_MIN and INT_MAX directly, including <limits.h> from
820
+ the generated C file if necessary, and remove the MAXINT and MININT macros
821
+ from runtime/header.h.
822
+
43
823
  * C#: An `among` without functions is now generated as `static` and groupings
44
824
  are now generated as constant. Patches from James Turner in #146 and #147.
45
825
 
@@ -51,6 +831,18 @@ Code generation improvements
51
831
  + Constant numeric subexpressions and constant numeric tests are now
52
832
  evaluated at Snowball compile time.
53
833
 
834
+ + Simplify the following degenerate `loop` and `atleast` constructs where
835
+ N is a compile-time constant:
836
+
837
+ - loop N C where N <= 0 is a no-op.
838
+
839
+ - loop N C where N == 1 is just C.
840
+
841
+ - atleast N C where N <= 0 is just repeat C.
842
+
843
+ If the value of N doesn't depend on the current target language, platform
844
+ or Unicode settings then we also issue a warning.
845
+
54
846
  Behavioural changes to existing algorithms
55
847
  ------------------------------------------
56
848
 
@@ -229,7 +1021,7 @@ Code generation improvements
229
1021
  + Don't generate anything for an unused `grouping`.
230
1022
 
231
1023
  + Stop warning "grouping X defined but not used" for a `grouping` which is
232
- only used to define other another `grouping`.
1024
+ only used to define another `grouping`.
233
1025
 
234
1026
  * C/C++:
235
1027
 
@@ -247,6 +1039,9 @@ Code generation improvements
247
1039
  + Avoid generating unused variables. The Pascal code generated for the
248
1040
  stemmers we ship is now warning free (tested with fpc 3.2.0).
249
1041
 
1042
+ + Don't emit empty `private` sections. Cosmetic, but makes the generated
1043
+ code a bit easier to follow.
1044
+
250
1045
  * Python:
251
1046
 
252
1047
  + End `if`-chain with `else` where possible, avoiding a redundant test
@@ -287,7 +1082,7 @@ Compiler
287
1082
  * Don't count `$x = x + 1` as initialising or using `x`, so it's now handled
288
1083
  like `$x += 1` already is.
289
1084
 
290
- * Comments are now only included in the generated code if command like option
1085
+ * Comments are now only included in the generated code if command line option
291
1086
  -comments is specified.
292
1087
 
293
1088
  The comments in the generated code are useful if you're trying to debug the