mittens 0.1.1 → 0.3.0

Files changed (100)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +8 -0
  3. data/LICENSE.txt +1 -1
  4. data/README.md +4 -4
  5. data/lib/mittens/version.rb +1 -1
  6. data/mittens.gemspec +1 -1
  7. data/vendor/snowball/.github/workflows/ci.yml +216 -0
  8. data/vendor/snowball/CONTRIBUTING.rst +111 -62
  9. data/vendor/snowball/GNUmakefile +194 -136
  10. data/vendor/snowball/NEWS +798 -3
  11. data/vendor/snowball/README.rst +50 -1
  12. data/vendor/snowball/ada/src/stemmer.adb +25 -13
  13. data/vendor/snowball/ada/src/stemmer.ads +9 -9
  14. data/vendor/snowball/ada/stemmer_config.gpr +7 -7
  15. data/vendor/snowball/algorithms/basque.sbl +4 -19
  16. data/vendor/snowball/algorithms/catalan.sbl +2 -9
  17. data/vendor/snowball/algorithms/danish.sbl +1 -1
  18. data/vendor/snowball/algorithms/dutch.sbl +284 -122
  19. data/vendor/snowball/algorithms/dutch_porter.sbl +178 -0
  20. data/vendor/snowball/algorithms/english.sbl +52 -37
  21. data/vendor/snowball/algorithms/esperanto.sbl +157 -0
  22. data/vendor/snowball/algorithms/estonian.sbl +269 -0
  23. data/vendor/snowball/algorithms/finnish.sbl +2 -3
  24. data/vendor/snowball/algorithms/french.sbl +42 -16
  25. data/vendor/snowball/algorithms/german.sbl +35 -14
  26. data/vendor/snowball/algorithms/greek.sbl +76 -76
  27. data/vendor/snowball/algorithms/hungarian.sbl +8 -6
  28. data/vendor/snowball/algorithms/indonesian.sbl +14 -8
  29. data/vendor/snowball/algorithms/italian.sbl +11 -21
  30. data/vendor/snowball/algorithms/lithuanian.sbl +36 -37
  31. data/vendor/snowball/algorithms/lovins.sbl +0 -1
  32. data/vendor/snowball/algorithms/nepali.sbl +138 -37
  33. data/vendor/snowball/algorithms/norwegian.sbl +19 -5
  34. data/vendor/snowball/algorithms/porter.sbl +2 -2
  35. data/vendor/snowball/algorithms/portuguese.sbl +9 -13
  36. data/vendor/snowball/algorithms/romanian.sbl +17 -4
  37. data/vendor/snowball/algorithms/serbian.sbl +467 -468
  38. data/vendor/snowball/algorithms/spanish.sbl +5 -7
  39. data/vendor/snowball/algorithms/swedish.sbl +60 -6
  40. data/vendor/snowball/algorithms/tamil.sbl +207 -176
  41. data/vendor/snowball/algorithms/turkish.sbl +461 -445
  42. data/vendor/snowball/algorithms/yiddish.sbl +36 -38
  43. data/vendor/snowball/compiler/analyser.c +445 -192
  44. data/vendor/snowball/compiler/driver.c +109 -101
  45. data/vendor/snowball/compiler/generator.c +853 -464
  46. data/vendor/snowball/compiler/generator_ada.c +404 -366
  47. data/vendor/snowball/compiler/generator_csharp.c +297 -260
  48. data/vendor/snowball/compiler/generator_go.c +323 -254
  49. data/vendor/snowball/compiler/generator_java.c +326 -252
  50. data/vendor/snowball/compiler/generator_js.c +362 -252
  51. data/vendor/snowball/compiler/generator_pascal.c +349 -197
  52. data/vendor/snowball/compiler/generator_python.c +257 -240
  53. data/vendor/snowball/compiler/generator_rust.c +423 -251
  54. data/vendor/snowball/compiler/header.h +117 -71
  55. data/vendor/snowball/compiler/space.c +137 -68
  56. data/vendor/snowball/compiler/syswords.h +2 -2
  57. data/vendor/snowball/compiler/tokeniser.c +125 -107
  58. data/vendor/snowball/csharp/Snowball/Among.cs +14 -14
  59. data/vendor/snowball/csharp/Snowball/AssemblyInfo.cs +7 -7
  60. data/vendor/snowball/csharp/Snowball/Stemmer.cs +57 -37
  61. data/vendor/snowball/csharp/Stemwords/App.config +2 -2
  62. data/vendor/snowball/csharp/Stemwords/Program.cs +16 -12
  63. data/vendor/snowball/doc/libstemmer_c_README +7 -4
  64. data/vendor/snowball/doc/libstemmer_csharp_README +4 -1
  65. data/vendor/snowball/doc/libstemmer_java_README +12 -1
  66. data/vendor/snowball/doc/libstemmer_js_README +6 -4
  67. data/vendor/snowball/doc/libstemmer_python_README +9 -4
  68. data/vendor/snowball/examples/stemwords.c +12 -12
  69. data/vendor/snowball/go/env.go +107 -31
  70. data/vendor/snowball/go/util.go +0 -4
  71. data/vendor/snowball/include/libstemmer.h +4 -0
  72. data/vendor/snowball/java/org/tartarus/snowball/Among.java +32 -15
  73. data/vendor/snowball/java/org/tartarus/snowball/SnowballProgram.java +347 -261
  74. data/vendor/snowball/java/org/tartarus/snowball/SnowballStemmer.java +3 -0
  75. data/vendor/snowball/java/org/tartarus/snowball/TestApp.java +52 -37
  76. data/vendor/snowball/javascript/base-stemmer.js +186 -2
  77. data/vendor/snowball/javascript/stemwords.js +3 -6
  78. data/vendor/snowball/libstemmer/libstemmer_c.in +1 -1
  79. data/vendor/snowball/libstemmer/mkalgorithms.pl +6 -6
  80. data/vendor/snowball/libstemmer/mkmodules.pl +2 -2
  81. data/vendor/snowball/libstemmer/modules.txt +13 -10
  82. data/vendor/snowball/libstemmer/test.c +1 -1
  83. data/vendor/snowball/pascal/SnowballProgram.pas +84 -2
  84. data/vendor/snowball/pascal/generate.pl +13 -13
  85. data/vendor/snowball/python/create_init.py +4 -1
  86. data/vendor/snowball/python/setup.cfg +0 -3
  87. data/vendor/snowball/python/setup.py +8 -3
  88. data/vendor/snowball/python/snowballstemmer/basestemmer.py +20 -54
  89. data/vendor/snowball/python/stemwords.py +8 -12
  90. data/vendor/snowball/runtime/api.c +10 -5
  91. data/vendor/snowball/runtime/header.h +10 -9
  92. data/vendor/snowball/runtime/utilities.c +9 -9
  93. data/vendor/snowball/rust/build.rs +1 -1
  94. data/vendor/snowball/rust/src/snowball/snowball_env.rs +83 -5
  95. data/vendor/snowball/tests/stemtest.c +7 -4
  96. metadata +8 -12
  97. data/vendor/snowball/.travis.yml +0 -112
  98. data/vendor/snowball/algorithms/german2.sbl +0 -145
  99. data/vendor/snowball/algorithms/kraaij_pohlmann.sbl +0 -240
  100. data/vendor/snowball/compiler/syswords2.h +0 -13
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 4fef541b86c3211b161df2de1d51f04a233d2bfaa035cb568f80e81566934066
4
- data.tar.gz: e14016289164fdf9dcd3e71229510a39514ac0d428f51632e9a80296642966ae
3
+ metadata.gz: 94a898f3a58eb04ef1ec07954623b14eea4161f21d330a3340ce98916c867755
4
+ data.tar.gz: 1fd6926ef1e62564b39ba3ea1295831493c4e52ea9a4316c2d06918b3d8d6a50
5
5
  SHA512:
6
- metadata.gz: 489e99984f058232291cb9fe57215605ea9193557037b3d3fec64d3b4cb94dec35cd1c721066ee2eea1d45c9939921f6fb88f9ae11b49f38167b9d9de32146ec
7
- data.tar.gz: 23bffb8f4b2d7dae4c100b1c7dafd491439b35f0db6b097eedf176429059b20c98cf55a1e4ee81b8d01eae4821a0f254604ed6145433ff0a22cfc91837310124
6
+ metadata.gz: ffcad206d69a72e22890032b19041af7f11abf4ef5ec7f51e6dce96ae1f2b2b9091e24cdf8a301202354508fd5cde1bd0421163b1351ec4ed133f46e120bd8f8
7
+ data.tar.gz: 0ece86c59547de3722aa5bf90afbb87e3f6a9b12c2db6323e56260e7dc3b95061b7118dd732d81ba33e7407b057c695a92aa60d4b66ea6811d6a933983037f26
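The digests above cover the two archives packed inside the `.gem` file. As a rough sketch, assuming you have extracted `data.tar.gz` from the gem yourself, values like these could be recomputed with Ruby's standard `digest` library. The helper names `digests_for` and `digest_matches?` are hypothetical and not part of mittens or RubyGems.

```ruby
require "digest"

# Hypothetical helper: compute the SHA256 and SHA512 hex digests of a file,
# mirroring the two sections of checksums.yaml.
def digests_for(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: true when the file's digest for the given algorithm
# equals the expected hex string (case and surrounding whitespace ignored).
def digest_matches?(path, algorithm, expected_hex)
  digests_for(path).fetch(algorithm) == expected_hex.strip.downcase
end

# Usage (assumes a local "data.tar.gz" taken from mittens-0.3.0.gem):
#   digest_matches?("data.tar.gz", "SHA256",
#     "1fd6926ef1e62564b39ba3ea1295831493c4e52ea9a4316c2d06918b3d8d6a50")
```

Comparing both algorithms is redundant for integrity purposes, but it matches the layout of `checksums.yaml`, which records both.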
data/CHANGELOG.md CHANGED
@@ -1,3 +1,11 @@
1
+ ## 0.3.0 (2025-09-12)
2
+
3
+ - Updated Snowball to 3.0.1
4
+
5
+ ## 0.2.0 (2025-04-03)
6
+
7
+ - Dropped support for Ruby < 3.2
8
+
1
9
  ## 0.1.1 (2023-05-11)
2
10
 
3
11
  - Fixed installation on Fedora
data/LICENSE.txt CHANGED
@@ -2,7 +2,7 @@ Copyright (c) 2001, Dr Martin Porter
2
2
  Copyright (c) 2004,2005, Richard Boulton
3
3
  Copyright (c) 2013, Yoshiki Shibukawa
4
4
  Copyright (c) 2006,2007,2009,2010,2011,2014-2019, Olly Betts
5
- Copyright (c) 2022, Andrew Kane
5
+ Copyright (c) 2022-2025, Andrew Kane
6
6
  All rights reserved.
7
7
 
8
8
  Redistribution and use in source and binary forms, with or without
data/README.md CHANGED
@@ -2,9 +2,9 @@
2
2
 
3
3
  Stemming for Ruby, powered by [Snowball](https://github.com/snowballstem/snowball)
4
4
 
5
- :snowflake: Supports 28 languages
5
+ :snowflake: Supports 30 languages
6
6
 
7
- [![Build Status](https://github.com/ankane/mittens/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mittens/actions)
7
+ [![Build Status](https://github.com/ankane/mittens/actions/workflows/build.yml/badge.svg)](https://github.com/ankane/mittens/actions)
8
8
 
9
9
  ## Installation
10
10
 
@@ -25,7 +25,7 @@ stemmer = Mittens::Stemmer.new
25
25
  Stem a word
26
26
 
27
27
  ```ruby
28
- stemmer.stem("tomatos") # "tomato"
28
+ stemmer.stem("tomatoes") # "tomato"
29
29
  ```
30
30
 
31
31
  ## Languages
@@ -36,7 +36,7 @@ Specify the language
36
36
  stemmer = Mittens::Stemmer.new(language: "french")
37
37
  ```
38
38
 
39
- Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
39
+ Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `dutch_porter`, `english`, `esperanto`, `estonian`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
40
40
 
41
41
  ## History
42
42
 
data/lib/mittens/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  module Mittens
2
- VERSION = "0.1.1"
2
+ VERSION = "0.3.0"
3
3
  end
data/mittens.gemspec CHANGED
@@ -18,5 +18,5 @@ Gem::Specification.new do |spec|
18
18
  spec.require_path = "lib"
19
19
  spec.extensions = ["ext/mittens/extconf.rb"]
20
20
 
21
- spec.required_ruby_version = ">= 3"
21
+ spec.required_ruby_version = ">= 3.2"
22
22
  end
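The practical effect of tightening `required_ruby_version` from `>= 3` to `>= 3.2` can be checked with RubyGems' own version classes. This is only an illustrative sketch: the constant names and the `ruby_allowed?` helper are hypothetical, while `Gem::Requirement` and `Gem::Version` are the standard RubyGems classes.

```ruby
require "rubygems"

# The constraints before and after this release, as written in mittens.gemspec.
OLD_RUBY_REQUIREMENT = Gem::Requirement.new(">= 3")
NEW_RUBY_REQUIREMENT = Gem::Requirement.new(">= 3.2")

# Hypothetical helper: does the given Ruby version satisfy the requirement?
def ruby_allowed?(requirement, ruby_version)
  requirement.satisfied_by?(Gem::Version.new(ruby_version))
end

ruby_allowed?(OLD_RUBY_REQUIREMENT, "3.1.4") # => true
ruby_allowed?(NEW_RUBY_REQUIREMENT, "3.1.4") # => false
ruby_allowed?(NEW_RUBY_REQUIREMENT, "3.2.0") # => true
```

Passing `RUBY_VERSION` as the second argument checks the interpreter that is currently running.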
data/vendor/snowball/.github/workflows/ci.yml ADDED
@@ -0,0 +1,216 @@
1
+ name: CI
2
+
3
+ # Use bash by default on all platforms.
4
+ defaults:
5
+ run:
6
+ shell: bash
7
+
8
+ on:
9
+ push:
10
+ paths-ignore:
11
+ - '*.rst'
12
+ - NEWS
13
+ pull_request:
14
+ branches: master
15
+ paths-ignore:
16
+ - '*.rst'
17
+ - NEWS
18
+
19
+ # Allows you to run this workflow manually from the Actions tab
20
+ workflow_dispatch:
21
+
22
+ jobs:
23
+ build:
24
+ strategy:
25
+ matrix:
26
+ include:
27
+ - name: "C distribution build"
28
+ CFLAGS_DIST_BUILD: '-O2 -Wall -W -std=c90 -Wmissing-prototypes -Wmissing-declarations -Wshadow -Wdeclaration-after-statement -Werror'
29
+ - name: "C distribution build (clang)"
30
+ CFLAGS_DIST_BUILD: '-O2 -Wall -W -std=c90 -Wmissing-prototypes -Wmissing-declarations -Wshadow -Wdeclaration-after-statement -Werror'
31
+ CC: clang
32
+ - name: "C"
33
+ c_tests: y
34
+ WERROR: '-std=c99 -Werror'
35
+ - name: "C (clang)"
36
+ c_tests: y
37
+ WERROR: '-std=c99 -Werror'
38
+ CC: clang
39
+ - name: java
40
+ JAVA: java
41
+ JAVAC: javac
42
+ JAVACFLAGS: '-Xlint:all -Werror'
43
+ - name: go_old
44
+ os: 'ubuntu-22.04'
45
+ apt_packages: 'golang-1.13'
46
+ GO: go
47
+ - name: go_new
48
+ os: 'ubuntu-24.04'
49
+ apt_packages: 'golang-1.22'
50
+ GO: go
51
+ - name: javascript_node
52
+ JSRUN: node
53
+ apt_packages: 'nodejs'
54
+ - name: rust
55
+ RUST: rust
56
+ apt_packages: 'rustc'
57
+ - name: csharp
58
+ MCS: mcs
59
+ apt_packages: 'mono-devel'
60
+ - name: Pascal
61
+ FPC: fpc
62
+ apt_packages: 'fpc'
63
+ - name: Python 3.8
64
+ PYTHON_VERSION: 3.8
65
+ os: 'ubuntu-22.04'
66
+ # The pure Python versions run slowly - when we used travis for CI
67
+ # we used to need to thin the testdata for languages such as Arabic
68
+ # where there's a lot to avoid the build exceeding the maximum time
69
+ # allowed for a CI job. GHA allows jobs to take up to 6 hours so
70
+ # we should no longer need to do this.
71
+ THIN_FACTOR: 1
72
+ - name: Python 3.10
73
+ PYTHON: python3.10
74
+ os: 'ubuntu-22.04'
75
+ apt_packages: 'python3.10'
76
+ THIN_FACTOR: 1
77
+ - name: Python 3.12
78
+ PYTHON: python3.12
79
+ os: 'ubuntu-24.04'
80
+ apt_packages: 'python3.12'
81
+ THIN_FACTOR: 1
82
+ - name: Python (pypy3)
83
+ PYTHON: pypy3
84
+ apt_packages: 'pypy3'
85
+ - name: Ada
86
+ gprbuild: gprbuild
87
+ apt_packages: 'gnat gprbuild'
88
+ - name: Windows (C)
89
+ os: windows-latest
90
+ c_tests: y
91
+ ccache: sccache
92
+ - name: Windows (Go)
93
+ os: windows-latest
94
+ GO: go
95
+ MAKE: mingw32-make
96
+ mingw64_packages: 'mingw-w64-ucrt-x86_64-go'
97
+ ccache: sccache
98
+
99
+ fail-fast: false
100
+
101
+ runs-on: ${{ matrix.os || 'ubuntu-latest' }}
102
+
103
+ env:
104
+ CC: ${{ matrix.CC || 'gcc' }}
105
+ MAKE: ${{ matrix.MAKE || 'make' }}
106
+ STEMMING_DATA: 'snowball-data'
107
+
108
+ steps:
109
+ - name: Checkout
110
+ uses: actions/checkout@v4
111
+ with:
112
+ show-progress: false
113
+
114
+ - name: Checkout data
115
+ run: |
116
+ # Try to check out a branch of the same name from the snowball-data
117
+ # repo sibling of this snowball repo, so that PRs requiring changes to
118
+ # both can be CI tested easily.
119
+ #
120
+ # For a PR, GHA will have merged the PR branch into upstream master so
121
+ # we need to similarly merge the snowball-data branch into upstream
122
+ # master of the snowball-data repo as there may be changes there
123
+ # required by snowball master.
124
+ #
125
+ # If there's no such branch (or repo) we just use the standard
126
+ # snowball-data repo's default branch. If there is such a branch but
127
+ # the merge fails, we treat that as a fatal error.
128
+ UPSTREAM_REPO_URL=https://github.com/snowballstem/snowball-data.git
129
+ if [ -n "$GITHUB_HEAD_REF" ] ; then
130
+ # Pull-request.
131
+ GH_BRANCH=${GITHUB_HEAD_REF}
132
+ GH_REPO_OWNER=${GITHUB_ACTOR}
133
+ GH_REPO_URL=https://github.com/$GH_REPO_OWNER/snowball-data.git
134
+ git clone "$UPSTREAM_REPO_URL"
135
+ cd snowball-data
136
+ git remote add pr "$GH_REPO_URL"
137
+ git config --global user.email "ci@example.org"
138
+ git config --global user.name "CI"
139
+ echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
140
+ if git fetch pr && git branch --track "$GH_BRANCH" pr/"$GH_BRANCH" ; then
141
+ git merge "$GH_BRANCH"
142
+ else
143
+ echo "Falling back to $UPSTREAM_REPO_URL"
144
+ fi
145
+ else
146
+ # Push.
147
+ GH_BRANCH=${GITHUB_REF_NAME}
148
+ GH_REPO_OWNER=${GITHUB_REPOSITORY_OWNER}
149
+ GH_REPO_URL=https://github.com/$GH_REPO_OWNER/snowball-data.git
150
+ echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
151
+ if ! git clone -b "$GH_BRANCH" "$GH_REPO_URL" ; then
152
+ echo "Falling back to $UPSTREAM_REPO_URL"
153
+ git clone "$UPSTREAM_REPO_URL"
154
+ fi
155
+ fi
156
+ - name: Install CCache
157
+ uses: hendrikmuhs/ccache-action@v1
158
+ with:
159
+ key: ${{ matrix.name }}
160
+ variant: ${{ matrix.ccache || 'ccache' }}
161
+ - name: Install Ubuntu packages
162
+ if: matrix.apt_packages
163
+ run: |
164
+ sudo apt-get update
165
+ sudo apt-get install -y ${{ matrix.apt_packages }}
166
+ - name: Install mingw64 packages
167
+ if: matrix.mingw64_packages
168
+ uses: msys2/setup-msys2@v2
169
+ with:
170
+ msystem: ucrt64
171
+ install: base-devel ${{ matrix.mingw64_packages }}
172
+ - name: Build
173
+ run: $MAKE CC="${{ matrix.ccache || 'ccache' }} $CC"
174
+ - name: Test C dist
175
+ if: matrix.CFLAGS_DIST_BUILD
176
+ run: |
177
+ pip install setuptools build
178
+ $MAKE dist
179
+ mkdir tmp
180
+ cd tmp
181
+ tar xf ../dist/libstemmer_c-*.tar.gz
182
+ cd libstemmer_c-*
183
+ $MAKE CFLAGS="${{ matrix.CFLAGS_DIST_BUILD }}"
184
+ - name: Test C
185
+ if: matrix.c_tests
186
+ run: $MAKE check CC="$CC"
187
+ - uses: actions/setup-python@v5
188
+ with:
189
+ python-version: ${{ matrix.PYTHON_VERSION }}
190
+ if: matrix.PYTHON_VERSION
191
+ - name: Test Python
192
+ if: matrix.PYTHON || matrix.PYTHON_VERSION
193
+ run: $MAKE check_python python="${{ matrix.PYTHON || 'python' }}" THIN_FACTOR="${{ matrix.THIN_FACTOR }}"
194
+ - name: Test Java
195
+ if: matrix.JAVA && matrix.JAVAC
196
+ run: $MAKE check_java JAVA="${{ matrix.JAVA }}" JAVAC="${{ matrix.JAVAC }}" JAVACFLAGS="${{ matrix.JAVACFLAGS }}"
197
+ - name: Test C#
198
+ if: matrix.MCS
199
+ run: $MAKE check_csharp MCS="${{ matrix.MCS }}"
200
+ - name: Test Javascript
201
+ if: matrix.JSRUN
202
+ run: $MAKE check_js JSRUN="${{ matrix.JSRUN }}"
203
+ - name: Test Rust
204
+ if: matrix.RUST
205
+ run: $MAKE check_rust RUST="${{ matrix.RUST }}"
206
+ - name: Test Go
207
+ if: matrix.GO
208
+ run: |
209
+ go mod init github.com/snowballstem/snowball
210
+ $MAKE check_go GO="${{ matrix.GO }}"
211
+ - name: Test Pascal
212
+ if: matrix.FPC
213
+ run: $MAKE check_pascal FPC="${{ matrix.FPC }}"
214
+ - name: Test Ada
215
+ if: matrix.gprbuild
216
+ run: $MAKE check_ada gprbuild="${{ matrix.gprbuild }}"
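The "Checkout data" step in the workflow above chooses which `snowball-data` branch to test against, depending on whether the run was triggered by a pull request or a push. The decision logic can be sketched as a pure Ruby function. This is only an illustration of the shell script's branching, not code from the workflow: the `data_checkout_plan` helper and the keys of the hash it returns are hypothetical, while the environment variable names are the ones the script reads.

```ruby
UPSTREAM_REPO_URL = "https://github.com/snowballstem/snowball-data.git"

# Hypothetical helper: given a hash of GitHub Actions environment variables,
# describe which snowball-data branch the step would try first.
def data_checkout_plan(env)
  head_ref = env["GITHUB_HEAD_REF"].to_s
  if head_ref.empty?
    # Push: try to clone the same-named branch from the pushing owner's
    # snowball-data repo, falling back to a plain clone of upstream.
    owner = env["GITHUB_REPOSITORY_OWNER"]
    { event: :push, strategy: :clone_branch, branch: env["GITHUB_REF_NAME"],
      repo_url: "https://github.com/#{owner}/snowball-data.git",
      fallback_url: UPSTREAM_REPO_URL }
  else
    # Pull request: clone upstream, then fetch the same-named branch from the
    # PR author's snowball-data repo and merge it (a failed merge is fatal).
    owner = env["GITHUB_ACTOR"]
    { event: :pull_request, strategy: :merge_into_upstream, branch: head_ref,
      repo_url: "https://github.com/#{owner}/snowball-data.git",
      fallback_url: UPSTREAM_REPO_URL }
  end
end
```

In both cases the fallback applies only when the candidate branch or repo does not exist, matching the comment at the top of the step.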
data/vendor/snowball/CONTRIBUTING.rst CHANGED
@@ -1,41 +1,35 @@
1
- Adding a new stemming algorithm
1
+ General contribution guidelines
2
2
  ===============================
3
3
 
4
- This needs PRs against three repositories. Name the branch the same for
5
- at least `snowball` and `snowball-data`, push to the latter repo first, and the
6
- CI should use your new vocabulary list when running the testsuite.
7
-
8
- Some points to note about algorithm implementations:
4
+ We don't have a formally defined coding style guide, but please strive to
5
+ make new/changed code look like the code around it.
9
6
 
10
- * Avoid literal non-ASCII characters in snowball string literals - they will
11
- work OK for languages that use UTF-8, but not wide-character Unicode or other
12
- encodings. Instead use ``stringdef`` like the existing stemmers do, and
13
- please use the newer `U+` notation rather than the older ``hex`` or
14
- ``decimal`` as this allows us to support different encodings without having
15
- to modify the source files - for example::
7
+ Use spaces-only for indentation except where there's a syntax reason (e.g.
8
+ ``GNUmakefile``) or a strong convention (e.g. Go's standard seems to be tabs,
9
+ and ``gofmt`` reindents code using tabs).
16
10
 
17
- stringdef o" {U+00F6}
18
- define foo 'o{o"}'
11
+ Avoid adding trailing whitespace on lines. Make sure there's a newline
12
+ character at the end of new text files.
19
13
 
20
- not::
14
+ Avoid mixing code reformatting changes with functional changes - doing so
15
+ makes it harder to review patches.
21
16
 
22
- stringdef o" hex F6
23
- define foo 'o{o"}'
17
+ Adding a new stemming algorithm
18
+ ===============================
24
19
 
25
- and definitely not::
20
+ To add a new stemming algorithm you need to submit PRs against three
21
+ repositories. See below for details of what's needed in each of
22
+ these.
26
23
 
27
- define foo 'oö'
28
-
29
- It's helpful to consistently use the same ``stringdef`` codes across the
30
- different stemmers - the website has `guidance on what to use
31
- <https://snowballstem.org/codesets/guide.html>`_ and a `list of stringdef
32
- lines for common characters to cut and paste from
33
- <https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
24
+ Name the branch the same for at least `snowball` and `snowball-data` and push
25
+ to `snowball-data` first, then the CI should use your new vocabulary list when
26
+ running the testsuite.
34
27
 
35
28
  snowball repo
36
29
  -------------
37
30
 
38
- Add `.sbl` source to algorithms subdirectory.
31
+ This is where the implementation of the new algorithm goes. Add the `.sbl`
32
+ source implementing it to the `algorithms/` subdirectory.
39
33
 
40
34
  Add entry to `libstemmer/modules.txt`, maintaining the current sorted order by
41
35
  the first column. The columns are:
@@ -53,6 +47,35 @@ the first column. The columns are:
53
47
  "639-2/T" and (if different) "639-2/B" columns. For example, for the `Afar`
54
48
  language you'd put `afar,aa,aar` here.
55
49
 
50
+ Some points to note about algorithm implementations:
51
+
52
+ * Avoid literal non-ASCII characters in snowball string literals - they will
53
+ work OK for languages that use UTF-8, but not wide-character Unicode or other
54
+ encodings. Instead use ``stringdef`` like the existing stemmers do, and
55
+ please use the newer `U+` notation rather than the older ``hex`` or
56
+ ``decimal`` as this allows us to support different encodings without having
57
+ to modify the source files - for example::
58
+
59
+ stringdef o" {U+00F6}
60
+ define foo 'o{o"}'
61
+
62
+ not::
63
+
64
+ stringdef o" hex F6
65
+ define foo 'o{o"}'
66
+
67
+ and definitely not::
68
+
69
+ define foo 'oö'
70
+
71
+ It's OK to use UTF-8 in comments.
72
+
73
+ * It's helpful to consistently use the same ``stringdef`` codes across the
74
+ different stemmers - for languages using the latin alphabet our website has
75
+ `guidance on what to use <https://snowballstem.org/codesets/guide.html>`_ and
76
+ a `list of stringdef lines for common characters to cut and paste from
77
+ <https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
78
+
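To see why naming a character by its Unicode code point keeps the source independent of the target encoding, it helps to look at how the same code point is serialized under different encodings. The Ruby sketch below is not part of Snowball and the `encoded_bytes` helper is hypothetical. It only shows that U+00F6, the character in the `stringdef` example above, maps to different byte sequences.

```ruby
# Hypothetical helper: the bytes a Unicode code point becomes in an encoding.
def encoded_bytes(code_point, encoding)
  # pack("U") builds a UTF-8 string holding the single code point.
  [code_point].pack("U").encode(encoding).bytes
end

encoded_bytes(0x00F6, "UTF-8")      # => [195, 182]
encoded_bytes(0x00F6, "ISO-8859-1") # => [246]
encoded_bytes(0x00F6, "UTF-16BE")   # => [0, 246]
```

Only the single-byte ISO-8859-1 form coincides with the value `F6`, which is one way to read the guideline's preference for `{U+00F6}` over `hex F6`.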
56
79
  snowball-data repo
57
80
  ------------------
58
81
 
@@ -68,9 +91,8 @@ you may be able to use the `wikipedia-most-common-words` script to generate
68
91
  one by extracting the most frequent words from a Wikipedia dump in the
69
92
  language the stemmer is for. You need to specify the Unicode "script" (that's
70
93
  "script" in the sense of alphabet) to use - you can find the appropriate one
71
- by looking in the Unicode `Scripts.txt`::
72
-
73
- https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt
94
+ by looking in the Unicode `Scripts.txt
95
+ <https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt>`_.
74
96
 
75
97
  The script name is the second column, between `;` and `#`. The first entries
76
98
  are all "Common" which isn't what you want - scroll down to get to the entries
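Reading the script name out of a `Scripts.txt` data line, as described above (second column, between `;` and `#`), can be sketched in Ruby as follows. The `script_name` helper is hypothetical, and the sample lines in the comments are written from memory of the file's general layout, so treat them as illustrative rather than exact quotations.

```ruby
# Hypothetical helper: return the script name from one line of Scripts.txt,
# or nil for blank lines and comment lines.
def script_name(line)
  stripped = line.strip
  return nil if stripped.empty? || stripped.start_with?("#")

  # The name is the text between the first ";" and the following "#".
  stripped[/;\s*([^#;]+?)\s*#/, 1]
end

# Illustrative lines in the general shape of Scripts.txt entries:
script_name("0000..001F    ; Common # Cc  [32] <control>..<control>")
script_name("0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z")
```

Collecting the distinct non-nil results over the whole file gives the list of script names to choose from.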
@@ -88,36 +110,62 @@ practice too), but at some point "more" stops being "better".
88
110
  snowball-website repo
89
111
  ---------------------
90
112
 
91
- Create subdirectory of `algorithms/` named after the language.
92
-
93
- Create `stemmer.tt` which describes the stemming algorithm. This is a
94
- "template toolkit" template which is essentially a mix of HTML and some
95
- macros for adding the navigation, sample vocabulary, etc. See the
96
- existing `stemmer.tt` files for other algorithms for inspiration.
97
-
98
- If it is based on an academic paper, cite the paper and describe any difference
99
- between your implementation and that described in the paper (for example,
100
- sometimes papers have ambiguities that need resolving to re-implement the
101
- algorithm described).
102
-
103
- If you have a stopword list, add that as `stop.txt` in your new subdirectory.
104
- The `generate` script checks if such a file exists and if it does a link to
105
- it is automatically added.
106
-
107
- Link to your new `stemmer.tt` from `algorithms/index.tt`.
108
-
109
- Add a news entry to `index.tt`.
110
-
111
- Add the new stemmer to the online demo. Assuming you have checkouts of the
112
- `snowball`, `snowball-data` and `snowball-website` repos in sibling
113
- directories:
114
-
115
- * run `make check_js` in the `snowball` repo
116
- * run `./update-js`
117
- * add the new stemmer to git with: `git add js/*-stemmer.js`
118
- * if the new language is written right-to-left (RTL) then add it to the check
119
- in `demo.tt` (search for `rtl` to find the place to change.)
120
- * `git commit`.
113
+ This is where a description of the new algorithm goes. Experience from
114
+ maintaining Snowball for many years has shown us that the most important
115
+ points to cover are **WHY** particular things are done or are not done.
116
+
117
+ For example, if a particular ending isn't removed because doing so causes
118
+ problems in other cases it's really helpful to have that recorded. Then
119
+ if years later we get a bug report because this ending isn't removed we
120
+ can easily answer, and don't have to try to contact you and hope you can
121
+ remember, or try to work out why for ourselves.
122
+
123
+ The original set of Snowball stemmers each have an English prose description
124
+ of the algorithm which focuses on **WHAT** the algorithm does. These might be
125
+ helpful if you want to implement the algorithm from scratch in a separate
126
+ language, but they've not proved very useful for maintaining the Snowball
127
+ implementations - if the prose and Snowball code disagree we know something is
128
+ wrong, but it's hard to know which is right! Therefore we recommend to let
129
+ the Snowball implementation describe what the algorithm does, and only comment
130
+ on "**WHAT**" in cases where the implementation needs explanation to help
131
+ the reader understand it.
132
+
133
+ If your algorithm is based on an academic paper, cite the paper and describe
134
+ any differences between your implementation and that described in the paper.
135
+ For example, sometimes papers have ambiguities that need resolving to
136
+ re-implement the algorithm described - see the `Hindi
137
+ <https://snowballstem.org/algorithms/hindi/stemmer.html>`_ and `Indonesian
138
+ <https://snowballstem.org/algorithms/indonesian/stemmer.html>`_
139
+ stemming algorithms descriptions for examples.
140
+
141
+ The mechanics of adding the algorithm description are:
142
+
143
+ * Create subdirectory of `algorithms/` named after the language.
144
+
145
+ * Create `stemmer.tt` which describes the stemming algorithm. This is a
146
+ "template toolkit" template which is essentially a mix of HTML and some
147
+ macros for adding the navigation, sample vocabulary, etc. See the
148
+ existing `stemmer.tt` files for other algorithms for how to use these
149
+ macros.
150
+
151
+ * If you have a stopword list, add that as `stop.txt` in your new subdirectory.
152
+ The `generate` script checks if such a file exists and if it does a link to
153
+ it is automatically added.
154
+
155
+ * Link to your new `stemmer.tt` from `algorithms/index.tt`.
156
+
157
+ * Add a news entry to `index.tt`.
158
+
159
+ * Add the new stemmer to the online demo. Assuming you have checkouts of the
160
+ `snowball`, `snowball-data` and `snowball-website` repos in sibling
161
+ directories:
162
+
163
+ * run `make check_js` in the `snowball` repo
164
+ * run `./update-js`
165
+ * add the new stemmer to git with: `git add js/*-stemmer.js`
166
+ * if the new language is written right-to-left (RTL) then add it to the check
167
+ in `demo.tt` (search for `rtl` to find the place to change.)
168
+ * `git commit`.
121
169
 
122
170
  Adding a new programming language generator
123
171
  ===========================================
@@ -150,7 +198,7 @@ to improve further in the future.
150
198
  Key problems to solve
151
199
  ---------------------
152
200
 
153
- * A key problem to solve is how to map the required flow of control in response
201
+ * You need to work out how to map the required flow of control in response
154
202
  to Snowball signals.
155
203
 
156
204
  In the generated C code this is mostly done using `goto`. If your language
@@ -172,7 +220,7 @@ Key problems to solve
172
220
  matching C - i.e. the result should be truncated (rounded towards zero).
173
221
  Some languages lack a built-in integer division operation, or have one
174
222
  which instead implements rounding towards negative infinity. Existing
175
- backends with special handling handling here which may be useful to look at
223
+ backends with special handling here which may be useful to look at
176
224
  include Javascript, Pascal and Python.
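Ruby is not one of the backends named above, but its `Integer#/` happens to round towards negative infinity, so it makes a convenient illustration of the problem. This sketch shows the difference and one possible way to obtain C-style truncation. The `c_style_div` helper is hypothetical and is not taken from any Snowball generator.

```ruby
# Hypothetical helper: integer division that truncates towards zero, as C does.
def c_style_div(a, b)
  # Divide the magnitudes (floor and truncation agree for non-negative values),
  # then restore the sign: negative only when exactly one operand is negative.
  quotient = a.abs / b.abs
  (a < 0) ^ (b < 0) ? -quotient : quotient
end

-7 / 2               # => -4 (Ruby's built-in division floors)
c_style_div(-7, 2)   # => -3 (truncated towards zero)
c_style_div(7, -2)   # => -3
c_style_div(-7, -2)  # => 3
```

Working on magnitudes avoids floating point entirely, so the result stays exact for arbitrarily large integers.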
177
225
 
178
226
  Don't hardcode algorithm names
@@ -199,6 +247,7 @@ Mechanics of adding a new generator
199
247
  Copy an existing `compiler/generator_*.c` for your new language and modify
200
248
  away (`generator.c` has the generator for C, but also some common functions
201
249
  so if you start from this one you'll need to remove those common functions).
250
+
202
251
  Please resist reformatting existing C code - there's currently a lot of code
203
252
  repeated in each generator which ought to be pulled out as common code, and
204
253
  if you reformat that just makes that job harder.
@@ -211,6 +260,6 @@ Add support to `compiler/driver.c`.
211
260
 
212
261
  Add targets to `GNUmakefile` to run tests for the new language.
213
262
 
214
- Hook up automated testing via CI in `.travis.yml`.
263
+ Hook up automated testing via CI in `.github/workflows/ci.yml`.
215
264
 
216
265
  Add to the list of languages in `README.rst`.