mittens 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +4 -0
- data/README.md +3 -3
- data/lib/mittens/version.rb +1 -1
- data/vendor/snowball/.github/workflows/ci.yml +216 -0
- data/vendor/snowball/CONTRIBUTING.rst +111 -62
- data/vendor/snowball/GNUmakefile +194 -136
- data/vendor/snowball/NEWS +798 -3
- data/vendor/snowball/README.rst +50 -1
- data/vendor/snowball/ada/src/stemmer.adb +25 -13
- data/vendor/snowball/ada/src/stemmer.ads +9 -9
- data/vendor/snowball/ada/stemmer_config.gpr +7 -7
- data/vendor/snowball/algorithms/basque.sbl +4 -19
- data/vendor/snowball/algorithms/catalan.sbl +2 -9
- data/vendor/snowball/algorithms/danish.sbl +1 -1
- data/vendor/snowball/algorithms/dutch.sbl +284 -122
- data/vendor/snowball/algorithms/dutch_porter.sbl +178 -0
- data/vendor/snowball/algorithms/english.sbl +52 -37
- data/vendor/snowball/algorithms/esperanto.sbl +157 -0
- data/vendor/snowball/algorithms/estonian.sbl +269 -0
- data/vendor/snowball/algorithms/finnish.sbl +2 -3
- data/vendor/snowball/algorithms/french.sbl +42 -16
- data/vendor/snowball/algorithms/german.sbl +35 -14
- data/vendor/snowball/algorithms/greek.sbl +76 -76
- data/vendor/snowball/algorithms/hungarian.sbl +8 -6
- data/vendor/snowball/algorithms/indonesian.sbl +14 -8
- data/vendor/snowball/algorithms/italian.sbl +11 -21
- data/vendor/snowball/algorithms/lithuanian.sbl +36 -37
- data/vendor/snowball/algorithms/lovins.sbl +0 -1
- data/vendor/snowball/algorithms/nepali.sbl +138 -37
- data/vendor/snowball/algorithms/norwegian.sbl +19 -5
- data/vendor/snowball/algorithms/porter.sbl +2 -2
- data/vendor/snowball/algorithms/portuguese.sbl +9 -13
- data/vendor/snowball/algorithms/romanian.sbl +17 -4
- data/vendor/snowball/algorithms/serbian.sbl +467 -468
- data/vendor/snowball/algorithms/spanish.sbl +5 -7
- data/vendor/snowball/algorithms/swedish.sbl +60 -6
- data/vendor/snowball/algorithms/tamil.sbl +207 -176
- data/vendor/snowball/algorithms/turkish.sbl +461 -445
- data/vendor/snowball/algorithms/yiddish.sbl +36 -38
- data/vendor/snowball/compiler/analyser.c +445 -192
- data/vendor/snowball/compiler/driver.c +109 -101
- data/vendor/snowball/compiler/generator.c +853 -464
- data/vendor/snowball/compiler/generator_ada.c +404 -366
- data/vendor/snowball/compiler/generator_csharp.c +297 -260
- data/vendor/snowball/compiler/generator_go.c +323 -254
- data/vendor/snowball/compiler/generator_java.c +326 -252
- data/vendor/snowball/compiler/generator_js.c +362 -252
- data/vendor/snowball/compiler/generator_pascal.c +349 -197
- data/vendor/snowball/compiler/generator_python.c +257 -240
- data/vendor/snowball/compiler/generator_rust.c +423 -251
- data/vendor/snowball/compiler/header.h +117 -71
- data/vendor/snowball/compiler/space.c +137 -68
- data/vendor/snowball/compiler/syswords.h +2 -2
- data/vendor/snowball/compiler/tokeniser.c +125 -107
- data/vendor/snowball/csharp/Snowball/Among.cs +14 -14
- data/vendor/snowball/csharp/Snowball/AssemblyInfo.cs +7 -7
- data/vendor/snowball/csharp/Snowball/Stemmer.cs +57 -37
- data/vendor/snowball/csharp/Stemwords/App.config +2 -2
- data/vendor/snowball/csharp/Stemwords/Program.cs +16 -12
- data/vendor/snowball/doc/libstemmer_c_README +7 -4
- data/vendor/snowball/doc/libstemmer_csharp_README +4 -1
- data/vendor/snowball/doc/libstemmer_java_README +12 -1
- data/vendor/snowball/doc/libstemmer_js_README +6 -4
- data/vendor/snowball/doc/libstemmer_python_README +9 -4
- data/vendor/snowball/examples/stemwords.c +12 -12
- data/vendor/snowball/go/env.go +107 -31
- data/vendor/snowball/go/util.go +0 -4
- data/vendor/snowball/include/libstemmer.h +4 -0
- data/vendor/snowball/java/org/tartarus/snowball/Among.java +32 -15
- data/vendor/snowball/java/org/tartarus/snowball/SnowballProgram.java +347 -261
- data/vendor/snowball/java/org/tartarus/snowball/SnowballStemmer.java +3 -0
- data/vendor/snowball/java/org/tartarus/snowball/TestApp.java +52 -37
- data/vendor/snowball/javascript/base-stemmer.js +186 -2
- data/vendor/snowball/javascript/stemwords.js +3 -6
- data/vendor/snowball/libstemmer/libstemmer_c.in +1 -1
- data/vendor/snowball/libstemmer/mkalgorithms.pl +6 -6
- data/vendor/snowball/libstemmer/mkmodules.pl +2 -2
- data/vendor/snowball/libstemmer/modules.txt +13 -10
- data/vendor/snowball/libstemmer/test.c +1 -1
- data/vendor/snowball/pascal/SnowballProgram.pas +84 -2
- data/vendor/snowball/pascal/generate.pl +13 -13
- data/vendor/snowball/python/create_init.py +4 -1
- data/vendor/snowball/python/setup.cfg +0 -3
- data/vendor/snowball/python/setup.py +8 -3
- data/vendor/snowball/python/snowballstemmer/basestemmer.py +20 -54
- data/vendor/snowball/python/stemwords.py +8 -12
- data/vendor/snowball/runtime/api.c +10 -5
- data/vendor/snowball/runtime/header.h +10 -9
- data/vendor/snowball/runtime/utilities.c +9 -9
- data/vendor/snowball/rust/build.rs +1 -1
- data/vendor/snowball/rust/src/snowball/snowball_env.rs +83 -5
- data/vendor/snowball/tests/stemtest.c +7 -4
- metadata +7 -7
- data/vendor/snowball/.travis.yml +0 -112
- data/vendor/snowball/algorithms/german2.sbl +0 -145
- data/vendor/snowball/algorithms/kraaij_pohlmann.sbl +0 -240
- data/vendor/snowball/compiler/syswords2.h +0 -13
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 94a898f3a58eb04ef1ec07954623b14eea4161f21d330a3340ce98916c867755
|
4
|
+
data.tar.gz: 1fd6926ef1e62564b39ba3ea1295831493c4e52ea9a4316c2d06918b3d8d6a50
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: ffcad206d69a72e22890032b19041af7f11abf4ef5ec7f51e6dce96ae1f2b2b9091e24cdf8a301202354508fd5cde1bd0421163b1351ec4ed133f46e120bd8f8
|
7
|
+
data.tar.gz: 0ece86c59547de3722aa5bf90afbb87e3f6a9b12c2db6323e56260e7dc3b95061b7118dd732d81ba33e7407b057c695a92aa60d4b66ea6811d6a933983037f26
|
data/CHANGELOG.md
CHANGED
data/README.md
CHANGED
@@ -2,7 +2,7 @@
|
|
2
2
|
|
3
3
|
Stemming for Ruby, powered by [Snowball](https://github.com/snowballstem/snowball)
|
4
4
|
|
5
|
-
:snowflake: Supports
|
5
|
+
:snowflake: Supports 30 languages
|
6
6
|
|
7
7
|
[](https://github.com/ankane/mittens/actions)
|
8
8
|
|
@@ -25,7 +25,7 @@ stemmer = Mittens::Stemmer.new
|
|
25
25
|
Stem a word
|
26
26
|
|
27
27
|
```ruby
|
28
|
-
stemmer.stem("
|
28
|
+
stemmer.stem("tomatoes") # "tomato"
|
29
29
|
```
|
30
30
|
|
31
31
|
## Languages
|
@@ -36,7 +36,7 @@ Specify the language
|
|
36
36
|
stemmer = Mittens::Stemmer.new(language: "french")
|
37
37
|
```
|
38
38
|
|
39
|
-
Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
|
39
|
+
Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `dutch_porter`, `english`, `esperanto`, `estonian`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
|
40
40
|
|
41
41
|
## History
|
42
42
|
|
data/lib/mittens/version.rb
CHANGED
data/vendor/snowball/.github/workflows/ci.yml
ADDED
@@ -0,0 +1,216 @@
|
|
1
|
+
name: CI
|
2
|
+
|
3
|
+
# Use bash by default on all platforms.
|
4
|
+
defaults:
|
5
|
+
run:
|
6
|
+
shell: bash
|
7
|
+
|
8
|
+
on:
|
9
|
+
push:
|
10
|
+
paths-ignore:
|
11
|
+
- '*.rst'
|
12
|
+
- NEWS
|
13
|
+
pull_request:
|
14
|
+
branches: master
|
15
|
+
paths-ignore:
|
16
|
+
- '*.rst'
|
17
|
+
- NEWS
|
18
|
+
|
19
|
+
# Allows you to run this workflow manually from the Actions tab
|
20
|
+
workflow_dispatch:
|
21
|
+
|
22
|
+
jobs:
|
23
|
+
build:
|
24
|
+
strategy:
|
25
|
+
matrix:
|
26
|
+
include:
|
27
|
+
- name: "C distribution build"
|
28
|
+
CFLAGS_DIST_BUILD: '-O2 -Wall -W -std=c90 -Wmissing-prototypes -Wmissing-declarations -Wshadow -Wdeclaration-after-statement -Werror'
|
29
|
+
- name: "C distribution build (clang)"
|
30
|
+
CFLAGS_DIST_BUILD: '-O2 -Wall -W -std=c90 -Wmissing-prototypes -Wmissing-declarations -Wshadow -Wdeclaration-after-statement -Werror'
|
31
|
+
CC: clang
|
32
|
+
- name: "C"
|
33
|
+
c_tests: y
|
34
|
+
WERROR: '-std=c99 -Werror'
|
35
|
+
- name: "C (clang)"
|
36
|
+
c_tests: y
|
37
|
+
WERROR: '-std=c99 -Werror'
|
38
|
+
CC: clang
|
39
|
+
- name: java
|
40
|
+
JAVA: java
|
41
|
+
JAVAC: javac
|
42
|
+
JAVACFLAGS: '-Xlint:all -Werror'
|
43
|
+
- name: go_old
|
44
|
+
os: 'ubuntu-22.04'
|
45
|
+
apt_packages: 'golang-1.13'
|
46
|
+
GO: go
|
47
|
+
- name: go_new
|
48
|
+
os: 'ubuntu-24.04'
|
49
|
+
apt_packages: 'golang-1.22'
|
50
|
+
GO: go
|
51
|
+
- name: javascript_node
|
52
|
+
JSRUN: node
|
53
|
+
apt_packages: 'nodejs'
|
54
|
+
- name: rust
|
55
|
+
RUST: rust
|
56
|
+
apt_packages: 'rustc'
|
57
|
+
- name: csharp
|
58
|
+
MCS: mcs
|
59
|
+
apt_packages: 'mono-devel'
|
60
|
+
- name: Pascal
|
61
|
+
FPC: fpc
|
62
|
+
apt_packages: 'fpc'
|
63
|
+
- name: Python 3.8
|
64
|
+
PYTHON_VERSION: 3.8
|
65
|
+
os: 'ubuntu-22.04'
|
66
|
+
# The pure Python versions run slowly - when we used travis for CI
|
67
|
+
# we used to need to thin the testdata for languages such as Arabic
|
68
|
+
# where there's a lot to avoid the build exceeding the maximum time
|
69
|
+
# allowed for a CI job. GHA allows jobs to take up to 6 hours so
|
70
|
+
# we should no longer need to do this.
|
71
|
+
THIN_FACTOR: 1
|
72
|
+
- name: Python 3.10
|
73
|
+
PYTHON: python3.10
|
74
|
+
os: 'ubuntu-22.04'
|
75
|
+
apt_packages: 'python3.10'
|
76
|
+
THIN_FACTOR: 1
|
77
|
+
- name: Python 3.12
|
78
|
+
PYTHON: python3.12
|
79
|
+
os: 'ubuntu-24.04'
|
80
|
+
apt_packages: 'python3.12'
|
81
|
+
THIN_FACTOR: 1
|
82
|
+
- name: Python (pypy3)
|
83
|
+
PYTHON: pypy3
|
84
|
+
apt_packages: 'pypy3'
|
85
|
+
- name: Ada
|
86
|
+
gprbuild: gprbuild
|
87
|
+
apt_packages: 'gnat gprbuild'
|
88
|
+
- name: Windows (C)
|
89
|
+
os: windows-latest
|
90
|
+
c_tests: y
|
91
|
+
ccache: sccache
|
92
|
+
- name: Windows (Go)
|
93
|
+
os: windows-latest
|
94
|
+
GO: go
|
95
|
+
MAKE: mingw32-make
|
96
|
+
mingw64_packages: 'mingw-w64-ucrt-x86_64-go'
|
97
|
+
ccache: sccache
|
98
|
+
|
99
|
+
fail-fast: false
|
100
|
+
|
101
|
+
runs-on: ${{ matrix.os || 'ubuntu-latest' }}
|
102
|
+
|
103
|
+
env:
|
104
|
+
CC: ${{ matrix.CC || 'gcc' }}
|
105
|
+
MAKE: ${{ matrix.MAKE || 'make' }}
|
106
|
+
STEMMING_DATA: 'snowball-data'
|
107
|
+
|
108
|
+
steps:
|
109
|
+
- name: Checkout
|
110
|
+
uses: actions/checkout@v4
|
111
|
+
with:
|
112
|
+
show-progress: false
|
113
|
+
|
114
|
+
- name: Checkout data
|
115
|
+
run: |
|
116
|
+
# Try to check out a branch of the same name from the snowball-data
|
117
|
+
# repo sibling of this snowball repo, so that PRs requiring changes to
|
118
|
+
# both can be CI tested easily.
|
119
|
+
#
|
120
|
+
# For a PR, GHA will have merged the PR branch into upstream master so
|
121
|
+
# we need to similarly merge the snowball-data branch into upstream
|
122
|
+
# master of the snowball-data repo as there may be changes there
|
123
|
+
# required by snowball master.
|
124
|
+
#
|
125
|
+
# If there's no such branch (or repo) we just use the standard
|
126
|
+
# snowball-data repo's default branch. If there is such a branch but
|
127
|
+
# the merge fails, we treat that as a fatal error.
|
128
|
+
UPSTREAM_REPO_URL=https://github.com/snowballstem/snowball-data.git
|
129
|
+
if [ -n "$GITHUB_HEAD_REF" ] ; then
|
130
|
+
# Pull-request.
|
131
|
+
GH_BRANCH=${GITHUB_HEAD_REF}
|
132
|
+
GH_REPO_OWNER=${GITHUB_ACTOR}
|
133
|
+
GH_REPO_URL=https://github.com/$GH_REPO_OWNER/snowball-data.git
|
134
|
+
git clone "$UPSTREAM_REPO_URL"
|
135
|
+
cd snowball-data
|
136
|
+
git remote add pr "$GH_REPO_URL"
|
137
|
+
git config --global user.email "ci@example.org"
|
138
|
+
git config --global user.name "CI"
|
139
|
+
echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
|
140
|
+
if git fetch pr && git branch --track "$GH_BRANCH" pr/"$GH_BRANCH" ; then
|
141
|
+
git merge "$GH_BRANCH"
|
142
|
+
else
|
143
|
+
echo "Falling back to $UPSTREAM_REPO_URL"
|
144
|
+
fi
|
145
|
+
else
|
146
|
+
# Push.
|
147
|
+
GH_BRANCH=${GITHUB_REF_NAME}
|
148
|
+
GH_REPO_OWNER=${GITHUB_REPOSITORY_OWNER}
|
149
|
+
GH_REPO_URL=https://github.com/$GH_REPO_OWNER/snowball-data.git
|
150
|
+
echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
|
151
|
+
if ! git clone -b "$GH_BRANCH" "$GH_REPO_URL" ; then
|
152
|
+
echo "Falling back to $UPSTREAM_REPO_URL"
|
153
|
+
git clone "$UPSTREAM_REPO_URL"
|
154
|
+
fi
|
155
|
+
fi
|
156
|
+
- name: Install CCache
|
157
|
+
uses: hendrikmuhs/ccache-action@v1
|
158
|
+
with:
|
159
|
+
key: ${{ matrix.name }}
|
160
|
+
variant: ${{ matrix.ccache || 'ccache' }}
|
161
|
+
- name: Install Ubuntu packages
|
162
|
+
if: matrix.apt_packages
|
163
|
+
run: |
|
164
|
+
sudo apt-get update
|
165
|
+
sudo apt-get install -y ${{ matrix.apt_packages }}
|
166
|
+
- name: Install mingw64 packages
|
167
|
+
if: matrix.mingw64_packages
|
168
|
+
uses: msys2/setup-msys2@v2
|
169
|
+
with:
|
170
|
+
msystem: ucrt64
|
171
|
+
install: base-devel ${{ matrix.mingw64_packages }}
|
172
|
+
- name: Build
|
173
|
+
run: $MAKE CC="${{ matrix.ccache || 'ccache' }} $CC"
|
174
|
+
- name: Test C dist
|
175
|
+
if: matrix.CFLAGS_DIST_BUILD
|
176
|
+
run: |
|
177
|
+
pip install setuptools build
|
178
|
+
$MAKE dist
|
179
|
+
mkdir tmp
|
180
|
+
cd tmp
|
181
|
+
tar xf ../dist/libstemmer_c-*.tar.gz
|
182
|
+
cd libstemmer_c-*
|
183
|
+
$MAKE CFLAGS="${{ matrix.CFLAGS_DIST_BUILD }}"
|
184
|
+
- name: Test C
|
185
|
+
if: matrix.c_tests
|
186
|
+
run: $MAKE check CC="$CC"
|
187
|
+
- uses: actions/setup-python@v5
|
188
|
+
with:
|
189
|
+
python-version: ${{ matrix.PYTHON_VERSION }}
|
190
|
+
if: matrix.PYTHON_VERSION
|
191
|
+
- name: Test Python
|
192
|
+
if: matrix.PYTHON || matrix.PYTHON_VERSION
|
193
|
+
run: $MAKE check_python python="${{ matrix.PYTHON || 'python' }}" THIN_FACTOR="${{ matrix.THIN_FACTOR }}"
|
194
|
+
- name: Test Java
|
195
|
+
if: matrix.JAVA && matrix.JAVAC
|
196
|
+
run: $MAKE check_java JAVA="${{ matrix.JAVA }}" JAVAC="${{ matrix.JAVAC }}" JAVACFLAGS="${{ matrix.JAVACFLAGS }}"
|
197
|
+
- name: Test C#
|
198
|
+
if: matrix.MCS
|
199
|
+
run: $MAKE check_csharp MCS="${{ matrix.MCS }}"
|
200
|
+
- name: Test Javascript
|
201
|
+
if: matrix.JSRUN
|
202
|
+
run: $MAKE check_js JSRUN="${{ matrix.JSRUN }}"
|
203
|
+
- name: Test Rust
|
204
|
+
if: matrix.RUST
|
205
|
+
run: $MAKE check_rust RUST="${{ matrix.RUST }}"
|
206
|
+
- name: Test Go
|
207
|
+
if: matrix.GO
|
208
|
+
run: |
|
209
|
+
go mod init github.com/snowballstem/snowball
|
210
|
+
$MAKE check_go GO="${{ matrix.GO }}"
|
211
|
+
- name: Test Pascal
|
212
|
+
if: matrix.FPC
|
213
|
+
run: $MAKE check_pascal FPC="${{ matrix.FPC }}"
|
214
|
+
- name: Test Ada
|
215
|
+
if: matrix.gprbuild
|
216
|
+
run: $MAKE check_ada gprbuild="${{ matrix.gprbuild }}"
|
data/vendor/snowball/CONTRIBUTING.rst
CHANGED
@@ -1,41 +1,35 @@
|
|
1
|
-
|
1
|
+
General contribution guidelines
|
2
2
|
===============================
|
3
3
|
|
4
|
-
|
5
|
-
|
6
|
-
CI should use your new vocabulary list when running the testsuite.
|
7
|
-
|
8
|
-
Some points to note about algorithm implementations:
|
4
|
+
We don't have a formally defined coding style guide, but please strive to
|
5
|
+
make new/changed code look like the code around it.
|
9
6
|
|
10
|
-
|
11
|
-
|
12
|
-
|
13
|
-
please use the newer `U+` notation rather than the older ``hex`` or
|
14
|
-
``decimal`` as this allows us to support different encodings without having
|
15
|
-
to modify the source files - for example::
|
7
|
+
Use spaces-only for indentation except where there's a syntax reason (e.g.
|
8
|
+
``GNUmakefile``) or a strong convention (e.g. Go's standard seems to be tabs,
|
9
|
+
and ``gofmt`` reindents code using tabs).
|
16
10
|
|
17
|
-
|
18
|
-
|
11
|
+
Avoid adding trailing whitespace on lines. Make sure there's a newline
|
12
|
+
character at the end of new text files.
|
19
13
|
|
20
|
-
|
14
|
+
Avoid mixing code reformatting changes with functional changes - doing so
|
15
|
+
makes it harder to review patches.
|
21
16
|
|
22
|
-
|
23
|
-
|
17
|
+
Adding a new stemming algorithm
|
18
|
+
===============================
|
24
19
|
|
25
|
-
|
20
|
+
To add a new stemming algorithm you need to submit PRs against three
|
21
|
+
repositories. See below for details of what's needed in each of
|
22
|
+
these.
|
26
23
|
|
27
|
-
|
28
|
-
|
29
|
-
|
30
|
-
different stemmers - the website has `guidance on what to use
|
31
|
-
<https://snowballstem.org/codesets/guide.html>`_ and a `list of stringdef
|
32
|
-
lines for common characters to cut and paste from
|
33
|
-
<https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
|
24
|
+
Name the branch the same for at least `snowball` and `snowball-data` and push
|
25
|
+
to `snowball-data` first, then the CI should use your new vocabulary list when
|
26
|
+
running the testsuite.
|
34
27
|
|
35
28
|
snowball repo
|
36
29
|
-------------
|
37
30
|
|
38
|
-
|
31
|
+
This is where the implementation of the new algorithm goes. Add the `.sbl`
|
32
|
+
source implementing it to the `algorithms/` subdirectory.
|
39
33
|
|
40
34
|
Add entry to `libstemmer/modules.txt`, maintaining the current sorted order by
|
41
35
|
the first column. The columns are:
|
@@ -53,6 +47,35 @@ the first column. The columns are:
|
|
53
47
|
"639-2/T" and (if different) "639-2/B" columns. For example, for the `Afar`
|
54
48
|
language you'd put `afar,aa,aar` here.
|
55
49
|
|
50
|
+
Some points to note about algorithm implementations:
|
51
|
+
|
52
|
+
* Avoid literal non-ASCII characters in snowball string literals - they will
|
53
|
+
work OK for languages that use UTF-8, but not wide-character Unicode or other
|
54
|
+
encodings. Instead use ``stringdef`` like the existing stemmers do, and
|
55
|
+
please use the newer `U+` notation rather than the older ``hex`` or
|
56
|
+
``decimal`` as this allows us to support different encodings without having
|
57
|
+
to modify the source files - for example::
|
58
|
+
|
59
|
+
stringdef o" {U+00F6}
|
60
|
+
define foo 'o{o"}'
|
61
|
+
|
62
|
+
not::
|
63
|
+
|
64
|
+
stringdef o" hex F6
|
65
|
+
define foo 'o{o"}'
|
66
|
+
|
67
|
+
and definitely not::
|
68
|
+
|
69
|
+
define foo 'oö'
|
70
|
+
|
71
|
+
It's OK to use UTF-8 in comments.
|
72
|
+
|
73
|
+
* It's helpful to consistently use the same ``stringdef`` codes across the
|
74
|
+
different stemmers - for languages using the latin alphabet our website has
|
75
|
+
`guidance on what to use <https://snowballstem.org/codesets/guide.html>`_ and
|
76
|
+
a `list of stringdef lines for common characters to cut and paste from
|
77
|
+
<https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
|
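The guidance above on `U+` notation can be illustrated with a minimal Ruby sketch (Ruby only because mittens is a Ruby gem; this is not part of Snowball, and the helper name `bytes_for` is hypothetical). It shows that the code point U+00F6 becomes different bytes depending on the target encoding, which is presumably why a hard-coded `hex F6` only suits a single-byte encoding such as ISO-8859-1.

```ruby
# Hypothetical helper (not part of Snowball or mittens): returns the bytes
# that represent a Unicode code point in the given encoding.
def bytes_for(codepoint, encoding)
  [codepoint].pack("U").encode(encoding).bytes
end

O_UMLAUT = 0x00F6 # the character named by `stringdef o" {U+00F6}`

utf8_bytes   = bytes_for(O_UMLAUT, "UTF-8")      # two bytes in UTF-8
latin1_bytes = bytes_for(O_UMLAUT, "ISO-8859-1") # one byte in ISO-8859-1

puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")   # prints "C3 B6"
puts latin1_bytes.map { |b| format("%02X", b) }.join(" ") # prints "F6"
```

The same code point yields `C3 B6` in UTF-8 but `F6` in ISO-8859-1, so only the code point form stays correct across both.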
78
|
+
|
56
79
|
snowball-data repo
|
57
80
|
------------------
|
58
81
|
|
@@ -68,9 +91,8 @@ you may be able to use the `wikipedia-most-common-words` script to generate
|
|
68
91
|
one by extracting the most frequent words from a Wikipedia dump in the
|
69
92
|
language the stemmer is for. You need to specify the Unicode "script" (that's
|
70
93
|
"script" in the sense of alphabet) to use - you can find the appropriate one
|
71
|
-
by looking in the Unicode `Scripts.txt
|
72
|
-
|
73
|
-
https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt
|
94
|
+
by looking in the Unicode `Scripts.txt
|
95
|
+
<https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt>`_.
|
74
96
|
|
75
97
|
The script name is the second column, between `;` and `#`. The first entries
|
76
98
|
are all "Common" which isn't what you want - scroll down to get to the entries
|
@@ -88,36 +110,62 @@ practice too), but at some point "more" stops being "better".
|
|
88
110
|
snowball-website repo
|
89
111
|
---------------------
|
90
112
|
|
91
|
-
|
92
|
-
|
93
|
-
|
94
|
-
|
95
|
-
|
96
|
-
|
97
|
-
|
98
|
-
|
99
|
-
|
100
|
-
|
101
|
-
|
102
|
-
|
103
|
-
|
104
|
-
|
105
|
-
|
106
|
-
|
107
|
-
|
108
|
-
|
109
|
-
|
110
|
-
|
111
|
-
|
112
|
-
|
113
|
-
|
114
|
-
|
115
|
-
|
116
|
-
|
117
|
-
|
118
|
-
|
119
|
-
|
120
|
-
|
113
|
+
This is where a description of the new algorithm goes. Experience from
|
114
|
+
maintaining Snowball for many years has shown us that the most important
|
115
|
+
points to cover are **WHY** particular things are done or are not done.
|
116
|
+
|
117
|
+
For example, if a particular ending isn't removed because doing so causes
|
118
|
+
problems in other cases it's really helpful to have that recorded. Then
|
119
|
+
if years later we get a bug report because this ending isn't removed we
|
120
|
+
can easily answer, and don't have to try to contact you and hope you can
|
121
|
+
remember, or try to work out why for ourselves.
|
122
|
+
|
123
|
+
The original set of Snowball stemmers each have an English prose description
|
124
|
+
of the algorithm which focuses on **WHAT** the algorithm does. These might be
|
125
|
+
helpful if you want to implement the algorithm from scratch in a separate
|
126
|
+
language, but they've not proved very useful for maintaining the Snowball
|
127
|
+
implementations - if the prose and Snowball code disagree we know something is
|
128
|
+
wrong, but it's hard to know which is right! Therefore we recommend to let
|
129
|
+
the Snowball implementation describe what the algorithm does, and only comment
|
130
|
+
on "**WHAT**" in cases where the implementation needs explanation to help
|
131
|
+
the reader understand it.
|
132
|
+
|
133
|
+
If your algorithm is based on an academic paper, cite the paper and describe
|
134
|
+
any differences between your implementation and that described in the paper.
|
135
|
+
For example, sometimes papers have ambiguities that need resolving to
|
136
|
+
re-implement the algorithm described - see the `Hindi
|
137
|
+
<https://snowballstem.org/algorithms/hindi/stemmer.html>`_ and `Indonesian
|
138
|
+
<https://snowballstem.org/algorithms/indonesian/stemmer.html>`_
|
139
|
+
stemming algorithms descriptions for examples.
|
140
|
+
|
141
|
+
The mechanics of adding the algorithm description are:
|
142
|
+
|
143
|
+
* Create subdirectory of `algorithms/` named after the language.
|
144
|
+
|
145
|
+
* Create `stemmer.tt` which describes the stemming algorithm. This is a
|
146
|
+
"template toolkit" template which is essentially a mix of HTML and some
|
147
|
+
macros for adding the navigation, sample vocabulary, etc. See the
|
148
|
+
existing `stemmer.tt` files for other algorithms for how to use these
|
149
|
+
macros.
|
150
|
+
|
151
|
+
* If you have a stopword list, add that as `stop.txt` in your new subdirectory.
|
152
|
+
The `generate` script checks if such a file exists and if it does a link to
|
153
|
+
it is automatically added.
|
154
|
+
|
155
|
+
* Link to your new `stemmer.tt` from `algorithms/index.tt`.
|
156
|
+
|
157
|
+
* Add a news entry to `index.tt`.
|
158
|
+
|
159
|
+
* Add the new stemmer to the online demo. Assuming you have checkouts of the
|
160
|
+
`snowball`, `snowball-data` and `snowball-website` repos in sibling
|
161
|
+
directories:
|
162
|
+
|
163
|
+
* run `make check_js` in the `snowball` repo
|
164
|
+
* run `./update-js`
|
165
|
+
* add the new stemmer to git with: `git add js/*-stemmer.js`
|
166
|
+
* if the new language is written right-to-left (RTL) then add it to the check
|
167
|
+
in `demo.tt` (search for `rtl` to find the place to change.)
|
168
|
+
* `git commit`.
|
121
169
|
|
122
170
|
Adding a new programming language generator
|
123
171
|
===========================================
|
@@ -150,7 +198,7 @@ to improve further in the future.
|
|
150
198
|
Key problems to solve
|
151
199
|
---------------------
|
152
200
|
|
153
|
-
*
|
201
|
+
* You need to work out how to map the required flow of control in response
|
154
202
|
to Snowball signals.
|
155
203
|
|
156
204
|
In the generated C code this is mostly done using `goto`. If your language
|
@@ -172,7 +220,7 @@ Key problems to solve
|
|
172
220
|
matching C - i.e. the result should be truncated (rounded towards zero).
|
173
221
|
Some languages lack a built-in integer division operation, or have one
|
174
222
|
which instead implements rounding towards negative infinity. Existing
|
175
|
-
backends with special handling
|
223
|
+
backends with special handling here which may be useful to look at
|
176
224
|
include Javascript, Pascal and Python.
|
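The integer division point above can be sketched in Ruby, whose built-in `/` on integers rounds towards negative infinity rather than towards zero. This is an illustration only, not code from any Snowball backend, and the helper name `trunc_div` is hypothetical.

```ruby
# Hypothetical helper (not from Snowball): integer division that truncates
# towards zero, matching C semantics, in a language whose built-in integer
# division floors instead.
def trunc_div(a, b)
  quotient = a.abs / b.abs
  # The result is negative only when the operands have opposite signs.
  (a < 0) ^ (b < 0) ? -quotient : quotient
end

puts(-7 / 2)           # prints -4 (floored)
puts trunc_div(-7, 2)  # prints -3 (truncated, as in C)
puts trunc_div(7, 2)   # prints 3
```

The two approaches only differ when exactly one operand is negative and the division is inexact, which is the case a new generator would need to test.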
177
225
|
|
178
226
|
Don't hardcode algorithm names
|
@@ -199,6 +247,7 @@ Mechanics of adding a new generator
|
|
199
247
|
Copy an existing `compiler/generator_*.c` for your new language and modify
|
200
248
|
away (`generator.c` has the generator for C, but also some common functions
|
201
249
|
so if you start from this one you'll need to remove those common functions).
|
250
|
+
|
202
251
|
Please resist reformatting existing C code - there's currently a lot of code
|
203
252
|
repeated in each generator which ought to be pulled out as common code, and
|
204
253
|
if you reformat that just makes that job harder.
|
@@ -211,6 +260,6 @@ Add support to `compiler/driver.c`.
|
|
211
260
|
|
212
261
|
Add targets to `GNUmakefile` to run tests for the new language.
|
213
262
|
|
214
|
-
Hook up automated testing via CI in `.
|
263
|
+
Hook up automated testing via CI in `.github/workflows/ci.yml`.
|
215
264
|
|
216
265
|
Add to the list of languages in `README.rst`.
|