mittens 0.1.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +3 -0
- data/Gemfile +7 -0
- data/LICENSE.txt +30 -0
- data/README.md +62 -0
- data/Rakefile +21 -0
- data/ext/mittens/ext.c +96 -0
- data/ext/mittens/extconf.rb +12 -0
- data/lib/mittens/version.rb +3 -0
- data/lib/mittens.rb +7 -0
- data/mittens.gemspec +22 -0
- data/vendor/snowball/.gitignore +26 -0
- data/vendor/snowball/.travis.yml +112 -0
- data/vendor/snowball/AUTHORS +27 -0
- data/vendor/snowball/CONTRIBUTING.rst +216 -0
- data/vendor/snowball/COPYING +29 -0
- data/vendor/snowball/GNUmakefile +742 -0
- data/vendor/snowball/NEWS +754 -0
- data/vendor/snowball/README.rst +37 -0
- data/vendor/snowball/ada/README.md +74 -0
- data/vendor/snowball/ada/generate/generate.adb +83 -0
- data/vendor/snowball/ada/generate.gpr +21 -0
- data/vendor/snowball/ada/src/stemmer.adb +620 -0
- data/vendor/snowball/ada/src/stemmer.ads +219 -0
- data/vendor/snowball/ada/src/stemwords.adb +70 -0
- data/vendor/snowball/ada/stemmer_config.gpr +83 -0
- data/vendor/snowball/ada/stemwords.gpr +21 -0
- data/vendor/snowball/algorithms/arabic.sbl +558 -0
- data/vendor/snowball/algorithms/armenian.sbl +301 -0
- data/vendor/snowball/algorithms/basque.sbl +149 -0
- data/vendor/snowball/algorithms/catalan.sbl +202 -0
- data/vendor/snowball/algorithms/danish.sbl +93 -0
- data/vendor/snowball/algorithms/dutch.sbl +164 -0
- data/vendor/snowball/algorithms/english.sbl +229 -0
- data/vendor/snowball/algorithms/finnish.sbl +197 -0
- data/vendor/snowball/algorithms/french.sbl +254 -0
- data/vendor/snowball/algorithms/german.sbl +139 -0
- data/vendor/snowball/algorithms/german2.sbl +145 -0
- data/vendor/snowball/algorithms/greek.sbl +701 -0
- data/vendor/snowball/algorithms/hindi.sbl +323 -0
- data/vendor/snowball/algorithms/hungarian.sbl +241 -0
- data/vendor/snowball/algorithms/indonesian.sbl +192 -0
- data/vendor/snowball/algorithms/irish.sbl +149 -0
- data/vendor/snowball/algorithms/italian.sbl +202 -0
- data/vendor/snowball/algorithms/kraaij_pohlmann.sbl +240 -0
- data/vendor/snowball/algorithms/lithuanian.sbl +373 -0
- data/vendor/snowball/algorithms/lovins.sbl +208 -0
- data/vendor/snowball/algorithms/nepali.sbl +92 -0
- data/vendor/snowball/algorithms/norwegian.sbl +80 -0
- data/vendor/snowball/algorithms/porter.sbl +139 -0
- data/vendor/snowball/algorithms/portuguese.sbl +218 -0
- data/vendor/snowball/algorithms/romanian.sbl +236 -0
- data/vendor/snowball/algorithms/russian.sbl +221 -0
- data/vendor/snowball/algorithms/serbian.sbl +2379 -0
- data/vendor/snowball/algorithms/spanish.sbl +230 -0
- data/vendor/snowball/algorithms/swedish.sbl +72 -0
- data/vendor/snowball/algorithms/tamil.sbl +405 -0
- data/vendor/snowball/algorithms/turkish.sbl +470 -0
- data/vendor/snowball/algorithms/yiddish.sbl +460 -0
- data/vendor/snowball/charsets/ISO-8859-2.sbl +98 -0
- data/vendor/snowball/charsets/KOI8-R.sbl +74 -0
- data/vendor/snowball/charsets/cp850.sbl +130 -0
- data/vendor/snowball/compiler/analyser.c +1547 -0
- data/vendor/snowball/compiler/driver.c +615 -0
- data/vendor/snowball/compiler/generator.c +1748 -0
- data/vendor/snowball/compiler/generator_ada.c +1702 -0
- data/vendor/snowball/compiler/generator_csharp.c +1322 -0
- data/vendor/snowball/compiler/generator_go.c +1278 -0
- data/vendor/snowball/compiler/generator_java.c +1313 -0
- data/vendor/snowball/compiler/generator_js.c +1316 -0
- data/vendor/snowball/compiler/generator_pascal.c +1387 -0
- data/vendor/snowball/compiler/generator_python.c +1337 -0
- data/vendor/snowball/compiler/generator_rust.c +1295 -0
- data/vendor/snowball/compiler/header.h +418 -0
- data/vendor/snowball/compiler/space.c +286 -0
- data/vendor/snowball/compiler/syswords.h +86 -0
- data/vendor/snowball/compiler/syswords2.h +13 -0
- data/vendor/snowball/compiler/tokeniser.c +567 -0
- data/vendor/snowball/csharp/.gitignore +8 -0
- data/vendor/snowball/csharp/Snowball/Algorithms/.gitignore +1 -0
- data/vendor/snowball/csharp/Snowball/Among.cs +108 -0
- data/vendor/snowball/csharp/Snowball/AssemblyInfo.cs +36 -0
- data/vendor/snowball/csharp/Snowball/Stemmer.cs +660 -0
- data/vendor/snowball/csharp/Stemwords/App.config +6 -0
- data/vendor/snowball/csharp/Stemwords/Program.cs +114 -0
- data/vendor/snowball/doc/TODO +12 -0
- data/vendor/snowball/doc/libstemmer_c_README +148 -0
- data/vendor/snowball/doc/libstemmer_csharp_README +53 -0
- data/vendor/snowball/doc/libstemmer_java_README +67 -0
- data/vendor/snowball/doc/libstemmer_js_README +48 -0
- data/vendor/snowball/doc/libstemmer_python_README +113 -0
- data/vendor/snowball/examples/stemwords.c +204 -0
- data/vendor/snowball/go/README.md +55 -0
- data/vendor/snowball/go/among.go +16 -0
- data/vendor/snowball/go/env.go +403 -0
- data/vendor/snowball/go/stemwords/generate.go +68 -0
- data/vendor/snowball/go/stemwords/main.go +68 -0
- data/vendor/snowball/go/util.go +34 -0
- data/vendor/snowball/iconv.py +50 -0
- data/vendor/snowball/include/libstemmer.h +78 -0
- data/vendor/snowball/java/org/tartarus/snowball/Among.java +29 -0
- data/vendor/snowball/java/org/tartarus/snowball/SnowballProgram.java +381 -0
- data/vendor/snowball/java/org/tartarus/snowball/SnowballStemmer.java +8 -0
- data/vendor/snowball/java/org/tartarus/snowball/TestApp.java +75 -0
- data/vendor/snowball/javascript/base-stemmer.js +294 -0
- data/vendor/snowball/javascript/stemwords.js +106 -0
- data/vendor/snowball/libstemmer/libstemmer_c.in +96 -0
- data/vendor/snowball/libstemmer/mkalgorithms.pl +90 -0
- data/vendor/snowball/libstemmer/mkmodules.pl +267 -0
- data/vendor/snowball/libstemmer/modules.txt +63 -0
- data/vendor/snowball/libstemmer/test.c +34 -0
- data/vendor/snowball/pascal/.gitignore +4 -0
- data/vendor/snowball/pascal/SnowballProgram.pas +430 -0
- data/vendor/snowball/pascal/generate.pl +23 -0
- data/vendor/snowball/pascal/stemwords-template.dpr +78 -0
- data/vendor/snowball/python/MANIFEST.in +7 -0
- data/vendor/snowball/python/create_init.py +54 -0
- data/vendor/snowball/python/setup.cfg +6 -0
- data/vendor/snowball/python/setup.py +81 -0
- data/vendor/snowball/python/snowballstemmer/among.py +13 -0
- data/vendor/snowball/python/snowballstemmer/basestemmer.py +323 -0
- data/vendor/snowball/python/stemwords.py +101 -0
- data/vendor/snowball/python/testapp.py +28 -0
- data/vendor/snowball/runtime/api.c +58 -0
- data/vendor/snowball/runtime/api.h +32 -0
- data/vendor/snowball/runtime/header.h +61 -0
- data/vendor/snowball/runtime/utilities.c +513 -0
- data/vendor/snowball/rust/Cargo.toml +7 -0
- data/vendor/snowball/rust/build.rs +55 -0
- data/vendor/snowball/rust/rust-pre-1.27-compat.patch +30 -0
- data/vendor/snowball/rust/src/main.rs +102 -0
- data/vendor/snowball/rust/src/snowball/algorithms/mod.rs +2 -0
- data/vendor/snowball/rust/src/snowball/among.rs +6 -0
- data/vendor/snowball/rust/src/snowball/mod.rs +6 -0
- data/vendor/snowball/rust/src/snowball/snowball_env.rs +421 -0
- data/vendor/snowball/tests/stemtest.c +95 -0
- metadata +178 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA256:
|
3
|
+
metadata.gz: 5685ca729c53bcfeadf2241ffecd3eb536197dcf3863efd33e9b07aac114e4dd
|
4
|
+
data.tar.gz: a2a07b4b5d16a691695d1a420ef4acde948276077c9bce7cb98111b172bb71ab
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 2b4f3abde51e786665b83c3846cf5aa0409cae5cbf91dd8f715d3aa850a4c967c951dcd1475d9c6887aa991fc7a460be6f9205196ed7c8f4389c301de38b4931
|
7
|
+
data.tar.gz: b9ea741946ac5af5a27ae6685297a7416d1b1e64b0c780c8e0871332393990832a59821bed08ff21428a8c703dfd089ec89e0b376e34514f25f3f4c151620d00
|
data/CHANGELOG.md
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
@@ -0,0 +1,30 @@
|
|
1
|
+
Copyright (c) 2001, Dr Martin Porter
|
2
|
+
Copyright (c) 2004,2005, Richard Boulton
|
3
|
+
Copyright (c) 2013, Yoshiki Shibukawa
|
4
|
+
Copyright (c) 2006,2007,2009,2010,2011,2014-2019, Olly Betts
|
5
|
+
Copyright (c) 2022, Andrew Kane
|
6
|
+
All rights reserved.
|
7
|
+
|
8
|
+
Redistribution and use in source and binary forms, with or without
|
9
|
+
modification, are permitted provided that the following conditions
|
10
|
+
are met:
|
11
|
+
|
12
|
+
1. Redistributions of source code must retain the above copyright notice,
|
13
|
+
this list of conditions and the following disclaimer.
|
14
|
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
15
|
+
this list of conditions and the following disclaimer in the documentation
|
16
|
+
and/or other materials provided with the distribution.
|
17
|
+
3. Neither the name of the Mittens project nor the names of its contributors
|
18
|
+
may be used to endorse or promote products derived from this software
|
19
|
+
without specific prior written permission.
|
20
|
+
|
21
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
22
|
+
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
23
|
+
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
24
|
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
25
|
+
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
26
|
+
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
27
|
+
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
|
28
|
+
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
29
|
+
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
30
|
+
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
data/README.md
ADDED
@@ -0,0 +1,62 @@
|
|
1
|
+
# Mittens
|
2
|
+
|
3
|
+
Stemming for Ruby, powered by [Snowball](https://github.com/snowballstem/snowball)
|
4
|
+
|
5
|
+
:snowflake: Supports 28 languages
|
6
|
+
|
7
|
+
[![Build Status](https://github.com/ankane/mittens/workflows/build/badge.svg?branch=master)](https://github.com/ankane/mittens/actions)
|
8
|
+
|
9
|
+
## Installation
|
10
|
+
|
11
|
+
Add this line to your application’s Gemfile:
|
12
|
+
|
13
|
+
```ruby
|
14
|
+
gem "mittens"
|
15
|
+
```
|
16
|
+
|
17
|
+
## Getting Started
|
18
|
+
|
19
|
+
Create a stemmer
|
20
|
+
|
21
|
+
```ruby
|
22
|
+
stemmer = Mittens::Stemmer.new
|
23
|
+
```
|
24
|
+
|
25
|
+
Stem a word
|
26
|
+
|
27
|
+
```ruby
|
28
|
+
stemmer.stem("tomatos")
|
29
|
+
```
|
30
|
+
|
31
|
+
## Languages
|
32
|
+
|
33
|
+
Specify the language
|
34
|
+
|
35
|
+
```ruby
|
36
|
+
stemmer = Mittens::Stemmer.new(language: "french")
|
37
|
+
```
|
38
|
+
|
39
|
+
Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
|
40
|
+
|
41
|
+
## History
|
42
|
+
|
43
|
+
View the [changelog](https://github.com/ankane/mittens/blob/master/CHANGELOG.md)
|
44
|
+
|
45
|
+
## Contributing
|
46
|
+
|
47
|
+
Everyone is encouraged to help improve this project. Here are a few ways you can help:
|
48
|
+
|
49
|
+
- [Report bugs](https://github.com/ankane/mittens/issues)
|
50
|
+
- Fix bugs and [submit pull requests](https://github.com/ankane/mittens/pulls)
|
51
|
+
- Write, clarify, or fix documentation
|
52
|
+
- Suggest or add new features
|
53
|
+
|
54
|
+
To get started with development:
|
55
|
+
|
56
|
+
```sh
|
57
|
+
git clone --recursive https://github.com/ankane/mittens.git
|
58
|
+
cd mittens
|
59
|
+
bundle install
|
60
|
+
bundle exec rake compile
|
61
|
+
bundle exec rake test
|
62
|
+
```
|
data/Rakefile
ADDED
@@ -0,0 +1,21 @@
|
|
1
|
+
require "bundler/gem_tasks"
|
2
|
+
require "rake/testtask"
|
3
|
+
require "rake/extensiontask"
|
4
|
+
|
5
|
+
task default: :test
|
6
|
+
Rake::TestTask.new do |t|
|
7
|
+
t.libs << "test"
|
8
|
+
t.pattern = "test/**/*_test.rb"
|
9
|
+
end
|
10
|
+
|
11
|
+
Rake::ExtensionTask.new("mittens") do |ext|
|
12
|
+
ext.name = "ext"
|
13
|
+
ext.lib_dir = "lib/mittens"
|
14
|
+
end
|
15
|
+
|
16
|
+
task :remove_ext do
|
17
|
+
path = "lib/mittens/ext.bundle"
|
18
|
+
File.unlink(path) if File.exist?(path)
|
19
|
+
end
|
20
|
+
|
21
|
+
Rake::Task["build"].enhance [:remove_ext]
|
data/ext/mittens/ext.c
ADDED
@@ -0,0 +1,96 @@
|
|
1
|
+
#include "libstemmer.h"
|
2
|
+
#include "ruby/ruby.h"
|
3
|
+
|
4
|
+
typedef struct stemmer {
|
5
|
+
struct sb_stemmer * stemmer;
|
6
|
+
} stemmer_t;
|
7
|
+
|
8
|
+
static void stemmer_free(void *ptr)
|
9
|
+
{
|
10
|
+
stemmer_t *stemmer = (stemmer_t *)ptr;
|
11
|
+
// safe to pass null pointer according to docs
|
12
|
+
sb_stemmer_delete(stemmer->stemmer);
|
13
|
+
xfree(ptr);
|
14
|
+
}
|
15
|
+
|
16
|
+
const rb_data_type_t stemmer_data_type = {
|
17
|
+
.wrap_struct_name = "stemmer",
|
18
|
+
.function = {
|
19
|
+
.dfree = stemmer_free,
|
20
|
+
},
|
21
|
+
.flags = RUBY_TYPED_FREE_IMMEDIATELY
|
22
|
+
};
|
23
|
+
|
24
|
+
static VALUE stemmer_allocate(VALUE klass)
|
25
|
+
{
|
26
|
+
stemmer_t *stemmer;
|
27
|
+
VALUE obj = TypedData_Make_Struct(klass, stemmer_t, &stemmer_data_type, stemmer);
|
28
|
+
stemmer->stemmer = NULL;
|
29
|
+
return obj;
|
30
|
+
}
|
31
|
+
|
32
|
+
static VALUE stemmer_initialize(int argc, VALUE* argv, VALUE self)
|
33
|
+
{
|
34
|
+
VALUE opts;
|
35
|
+
rb_scan_args(argc, argv, ":", &opts);
|
36
|
+
|
37
|
+
const char * algorithm = "english";
|
38
|
+
if (!NIL_P(opts)) {
|
39
|
+
VALUE language = rb_hash_aref(opts, ID2SYM(rb_intern("language")));
|
40
|
+
if (!NIL_P(language)) {
|
41
|
+
Check_Type(language, T_STRING);
|
42
|
+
algorithm = RSTRING_PTR(language);
|
43
|
+
}
|
44
|
+
}
|
45
|
+
|
46
|
+
stemmer_t *stemmer;
|
47
|
+
TypedData_Get_Struct(self, stemmer_t, &stemmer_data_type, stemmer);
|
48
|
+
|
49
|
+
// in case called multiple times
|
50
|
+
sb_stemmer_delete(stemmer->stemmer);
|
51
|
+
|
52
|
+
// if adding support for encoding, may want to change encoding returned from stem
|
53
|
+
stemmer->stemmer = sb_stemmer_new(algorithm, NULL);
|
54
|
+
if (stemmer->stemmer == NULL) {
|
55
|
+
rb_raise(rb_eArgError, "unknown language: %s", algorithm);
|
56
|
+
}
|
57
|
+
|
58
|
+
return self;
|
59
|
+
}
|
60
|
+
|
61
|
+
static VALUE stemmer_stem(VALUE self, VALUE value)
|
62
|
+
{
|
63
|
+
stemmer_t *stemmer;
|
64
|
+
TypedData_Get_Struct(self, stemmer_t, &stemmer_data_type, stemmer);
|
65
|
+
|
66
|
+
Check_Type(value, T_STRING);
|
67
|
+
|
68
|
+
const sb_symbol * word = (const sb_symbol *) RSTRING_PTR(value);
|
69
|
+
int size = (int) RSTRING_LEN(value);
|
70
|
+
const sb_symbol * pointer_out = sb_stemmer_stem(stemmer->stemmer, word, size);
|
71
|
+
|
72
|
+
return rb_utf8_str_new_cstr((char *) pointer_out);
|
73
|
+
}
|
74
|
+
|
75
|
+
static VALUE stemmer_languages(VALUE klass)
|
76
|
+
{
|
77
|
+
VALUE out = rb_ary_new();
|
78
|
+
|
79
|
+
const char **language = sb_stemmer_list();
|
80
|
+
while (*language != NULL) {
|
81
|
+
rb_ary_push(out, rb_utf8_str_new_cstr(*language));
|
82
|
+
language++;
|
83
|
+
}
|
84
|
+
|
85
|
+
return out;
|
86
|
+
}
|
87
|
+
|
88
|
+
void Init_ext(void)
|
89
|
+
{
|
90
|
+
VALUE rb_mMittens = rb_define_module("Mittens");
|
91
|
+
VALUE rb_cStemmer = rb_define_class_under(rb_mMittens, "Stemmer", rb_cObject);
|
92
|
+
rb_define_alloc_func(rb_cStemmer, stemmer_allocate);
|
93
|
+
rb_define_method(rb_cStemmer, "initialize", stemmer_initialize, -1);
|
94
|
+
rb_define_method(rb_cStemmer, "stem", stemmer_stem, 1);
|
95
|
+
rb_define_singleton_method(rb_cStemmer, "languages", stemmer_languages, 0);
|
96
|
+
}
|
data/ext/mittens/extconf.rb
ADDED
@@ -0,0 +1,12 @@
|
|
1
|
+
require "mkmf"
|
2
|
+
require "open3"
|
3
|
+
|
4
|
+
vendor = File.expand_path("../../vendor/snowball", __dir__)
|
5
|
+
output, status = Open3.capture2("make", chdir: vendor)
|
6
|
+
puts output
|
7
|
+
raise "Command failed" unless status.success?
|
8
|
+
|
9
|
+
$INCFLAGS += " -I$(srcdir)/../../vendor/snowball/include"
|
10
|
+
$LDFLAGS += " $(srcdir)/../../vendor/snowball/libstemmer.a"
|
11
|
+
|
12
|
+
create_makefile("mittens/ext")
|
data/lib/mittens.rb
ADDED
data/mittens.gemspec
ADDED
@@ -0,0 +1,22 @@
|
|
1
|
+
require_relative "lib/mittens/version"
|
2
|
+
|
3
|
+
Gem::Specification.new do |spec|
|
4
|
+
spec.name = "mittens"
|
5
|
+
spec.version = Mittens::VERSION
|
6
|
+
spec.summary = "Stemming for Ruby, powered by Snowball"
|
7
|
+
spec.homepage = "https://github.com/ankane/mittens"
|
8
|
+
spec.license = "BSD-3-Clause"
|
9
|
+
|
10
|
+
spec.author = "Andrew Kane"
|
11
|
+
spec.email = "andrew@ankane.org"
|
12
|
+
|
13
|
+
spec.files = Dir.chdir(__dir__) do
|
14
|
+
`git ls-files -z --recurse-submodules`.split("\x0").reject do |f|
|
15
|
+
(f == __FILE__) || f.match(%r{\A(?:(?:bin|test|spec|features)/|\.(?:git|travis|circleci)|appveyor)})
|
16
|
+
end
|
17
|
+
end
|
18
|
+
spec.require_path = "lib"
|
19
|
+
spec.extensions = ["ext/mittens/extconf.rb"]
|
20
|
+
|
21
|
+
spec.required_ruby_version = ">= 3"
|
22
|
+
end
|
data/vendor/snowball/.gitignore
ADDED
@@ -0,0 +1,26 @@
|
|
1
|
+
*.o
|
2
|
+
/ada/bin/
|
3
|
+
/ada/obj/
|
4
|
+
/algorithms.mk
|
5
|
+
/libstemmer/libstemmer.c
|
6
|
+
/libstemmer/libstemmer_utf8.c
|
7
|
+
/libstemmer/mkinc.mak
|
8
|
+
/libstemmer/mkinc_utf8.mak
|
9
|
+
/libstemmer/modules.h
|
10
|
+
/libstemmer/modules_utf8.h
|
11
|
+
/libstemmer.a
|
12
|
+
/snowball
|
13
|
+
/src_c
|
14
|
+
/stemtest
|
15
|
+
/stemwords
|
16
|
+
/dist
|
17
|
+
/java/org/tartarus/snowball/ext/
|
18
|
+
/js_out
|
19
|
+
/python_check
|
20
|
+
/python_out
|
21
|
+
*.generated.cs
|
22
|
+
/rust/Cargo.lock
|
23
|
+
/rust/src/snowball/algorithms/*.rs
|
24
|
+
/rust/target/
|
25
|
+
/go/algorithms/
|
26
|
+
/go/stemwords/algorithms.go
|
data/vendor/snowball/.travis.yml
ADDED
@@ -0,0 +1,112 @@
|
|
1
|
+
language: minimal
|
2
|
+
arch: arm64
|
3
|
+
dist: xenial
|
4
|
+
env:
|
5
|
+
global: MAKE=make
|
6
|
+
matrix:
|
7
|
+
include:
|
8
|
+
- name: "C distribution build"
|
9
|
+
language: c
|
10
|
+
dist: focal
|
11
|
+
compiler: gcc
|
12
|
+
env: CFLAGS_DIST_BUILD='-O2 -Wall -W -std=c90 -Wdeclaration-after-statement -Werror'
|
13
|
+
- name: "C distribution build (clang)"
|
14
|
+
language: c
|
15
|
+
dist: focal
|
16
|
+
compiler: clang
|
17
|
+
env: CFLAGS_DIST_BUILD='-O2 -Wall -W -std=c90 -Wdeclaration-after-statement -Werror'
|
18
|
+
- language: c
|
19
|
+
compiler: gcc
|
20
|
+
env: c_tests=y CFLAGS='-O2 -Wall -W -std=c99 -Werror'
|
21
|
+
- language: c
|
22
|
+
compiler: clang
|
23
|
+
env: c_tests=y CFLAGS='-O2 -Wall -W -std=c99 -Werror'
|
24
|
+
- language: java
|
25
|
+
env: JAVA=java JAVAC=javac
|
26
|
+
- language: go
|
27
|
+
go: "1.8"
|
28
|
+
env: GO=go
|
29
|
+
- language: go
|
30
|
+
dist: bionic
|
31
|
+
go: "1.17"
|
32
|
+
env: GO=go
|
33
|
+
- language: node_js
|
34
|
+
node_js: "node"
|
35
|
+
env: NODE=node
|
36
|
+
dist: focal # seems to be needed for working node
|
37
|
+
- language: rust
|
38
|
+
rust:
|
39
|
+
- stable
|
40
|
+
- beta
|
41
|
+
dist: bionic
|
42
|
+
env: RUST=rust
|
43
|
+
- language: csharp
|
44
|
+
arch: amd64 # csharp doesn't seem to work on arm64
|
45
|
+
env: MCS=mcs
|
46
|
+
dist: bionic
|
47
|
+
- name: Pascal
|
48
|
+
env: FPC=fpc
|
49
|
+
dist: bionic
|
50
|
+
addons:
|
51
|
+
apt:
|
52
|
+
packages:
|
53
|
+
- fpc
|
54
|
+
# The pure Python versions run slowly so we need to thin the testdata
|
55
|
+
# for languages such as Arabic where there's a lot, or else the build
|
56
|
+
# hits the travis time limit. With pypy, it's enough faster that we
|
57
|
+
# can run the full tests.
|
58
|
+
- language: python
|
59
|
+
python: "3.9"
|
60
|
+
env: PYTHON=python THIN_FACTOR=10
|
61
|
+
- language: python
|
62
|
+
python: "3.7"
|
63
|
+
env: PYTHON=python THIN_FACTOR=10
|
64
|
+
- language: python
|
65
|
+
python: "3.6"
|
66
|
+
env: PYTHON=python THIN_FACTOR=10
|
67
|
+
- language: python
|
68
|
+
python: "pypy3.7-7.3.5"
|
69
|
+
env: PYTHON=python
|
70
|
+
dist: bionic
|
71
|
+
- name: "Ada"
|
72
|
+
env: gprbuild=gprbuild
|
73
|
+
dist: bionic
|
74
|
+
addons:
|
75
|
+
apt:
|
76
|
+
packages:
|
77
|
+
- gnat
|
78
|
+
- gprbuild
|
79
|
+
- os: windows
|
80
|
+
language: c
|
81
|
+
env: c_tests=y MAKE=mingw32-make
|
82
|
+
- os: windows
|
83
|
+
language: go
|
84
|
+
env: GO=go MAKE=mingw32-make
|
85
|
+
|
86
|
+
before_install:
|
87
|
+
# Try to check out a branch of the same name from the snowball-data repo
|
88
|
+
# sibling of this snowball repo, so that PRs requiring changes to both can be
|
89
|
+
# CI tested easily.
|
90
|
+
#
|
91
|
+
# If that fails, just use the standard snowball-data repo's default branch.
|
92
|
+
- GH_BRANCH=${TRAVIS_PULL_REQUEST_BRANCH:-$TRAVIS_BRANCH}
|
93
|
+
- GH_REPO_SLUG=${TRAVIS_PULL_REQUEST_SLUG:-$TRAVIS_REPO_SLUG}
|
94
|
+
- GH_REPO_URL=https://github.com/${GH_REPO_SLUG%%/*}/snowball-data.git
|
95
|
+
- echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
|
96
|
+
- git clone --depth=1 -b "$GH_BRANCH" "$GH_REPO_URL" || git clone --depth=1 https://github.com/snowballstem/snowball-data.git
|
97
|
+
|
98
|
+
script:
|
99
|
+
# Ensure CC is set for building the compiler in non-C builds.
|
100
|
+
- test -n "$CC" || export CC=gcc
|
101
|
+
- $MAKE CC="$CC"
|
102
|
+
- test -z "$CFLAGS_DIST_BUILD" || { pip install setuptools && $MAKE dist && mkdir tmp && cd tmp && tar xf ../dist/libstemmer_c-*.tar.gz && cd libstemmer_c-* && $MAKE CFLAGS="$CFLAGS_DIST_BUILD" ; }
|
103
|
+
- test -z "$c_tests" || $MAKE check CC="$CC" STEMMING_DATA=snowball-data
|
104
|
+
- test -z "$PYTHON" || $MAKE check_python python="$PYTHON" STEMMING_DATA=snowball-data
|
105
|
+
- test -z "$JAVA" -o -z "$JAVAC" || $MAKE check_java STEMMING_DATA=snowball-data
|
106
|
+
- test -z "$MCS" || $MAKE check_csharp MCS="$MCS" STEMMING_DATA=snowball-data
|
107
|
+
- test -z "$NODE" || $MAKE check_js STEMMING_DATA=snowball-data
|
108
|
+
- test -z "$RUST" || $MAKE check_rust STEMMING_DATA=snowball-data
|
109
|
+
- test -z "$RUST" || $MAKE check_rust STEMMING_DATA=snowball-data
|
110
|
+
- test -z "$GO" || $MAKE check_go STEMMING_DATA=snowball-data
|
111
|
+
- test -z "$FPC" || $MAKE check_pascal STEMMING_DATA=snowball-data
|
112
|
+
- test -z "$gprbuild" || $MAKE check_ada STEMMING_DATA=snowball-data
|
data/vendor/snowball/AUTHORS
ADDED
@@ -0,0 +1,27 @@
|
|
1
|
+
Authors
|
2
|
+
=======
|
3
|
+
|
4
|
+
Martin Porter
|
5
|
+
-------------
|
6
|
+
|
7
|
+
- Designed the snowball language.
|
8
|
+
- Implemented the snowball to C compiler.
|
9
|
+
- Implemented the stemming algorithms in C.
|
10
|
+
- Wrote the documentation.
|
11
|
+
|
12
|
+
Richard Boulton
|
13
|
+
---------------
|
14
|
+
|
15
|
+
- Implemented Java backend of the snowball compiler.
|
16
|
+
- Developed build system.
|
17
|
+
- Assisted with website maintenance.
|
18
|
+
|
19
|
+
|
20
|
+
Assistance from
|
21
|
+
---------------
|
22
|
+
|
23
|
+
Olivier Bornet - fixes to java packaging and build system.
|
24
|
+
Andreas Jung - useful bug reports on the libstemmer library.
|
25
|
+
Olly Betts - several patches, bug reports, and performance improvements.
|
26
|
+
Sebastiano Vigna and Oerd Cukalla - patches for the Java stemming algorithms.
|
27
|
+
Ralf Junker - fix a potential memory leak in sb_stemmer_new().
|
data/vendor/snowball/CONTRIBUTING.rst
ADDED
@@ -0,0 +1,216 @@
|
|
1
|
+
Adding a new stemming algorithm
|
2
|
+
===============================
|
3
|
+
|
4
|
+
This needs PRs against three repositories. Name the branch the same for
|
5
|
+
at least `snowball` and `snowball-data`, push to the latter repo first, and the
|
6
|
+
CI should use your new vocabulary list when running the testsuite.
|
7
|
+
|
8
|
+
Some points to note about algorithm implementations:
|
9
|
+
|
10
|
+
* Avoid literal non-ASCII characters in snowball string literals - they will
|
11
|
+
work OK for languages that use UTF-8, but not wide-character Unicode or other
|
12
|
+
encodings. Instead use ``stringdef`` like the existing stemmers do, and
|
13
|
+
please use the newer `U+` notation rather than the older ``hex`` or
|
14
|
+
``decimal`` as this allows us to support different encodings without having
|
15
|
+
to modify the source files - for example::
|
16
|
+
|
17
|
+
stringdef o" {U+00F6}
|
18
|
+
define foo 'o{o"}'
|
19
|
+
|
20
|
+
not::
|
21
|
+
|
22
|
+
stringdef o" hex F6
|
23
|
+
define foo 'o{o"}'
|
24
|
+
|
25
|
+
and definitely not::
|
26
|
+
|
27
|
+
define foo 'oö'
|
28
|
+
|
29
|
+
It's helpful to consistently use the same ``stringdef`` codes across the
|
30
|
+
different stemmers - the website has `guidance on what to use
|
31
|
+
<https://snowballstem.org/codesets/guide.html>`_ and a `list of stringdef
|
32
|
+
lines for common characters to cut and paste from
|
33
|
+
<https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
|
34
|
+
|
35
|
+
snowball repo
|
36
|
+
-------------
|
37
|
+
|
38
|
+
Add `.sbl` source to algorithms subdirectory.
|
39
|
+
|
40
|
+
Add entry to `libstemmer/modules.txt`, maintaining the current sorted order by
|
41
|
+
the first column. The columns are:
|
42
|
+
|
43
|
+
* Algorithm name (needs to match the `.sbl` source without extension)
|
44
|
+
* Encodings to support. Wide-character Unicode is always supported
|
45
|
+
and doesn't need to be listed here. You should always include `UTF_8`, and
|
46
|
+
also any of `ISO_8859_1`, `ISO_8859_2` and `KOI8_R` which the language can
|
47
|
+
usefully be written using only characters from (in particular they need to
|
48
|
+
contain all the characters the stemmer explicitly uses). Support for other
|
49
|
+
single-byte character sets is easy to add if they're useful.
|
50
|
+
* Names and ISO-639 codes for the language. Wikipedia has a handy list of `all
|
51
|
+
the ISO-639 codes <https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes>`_ -
|
52
|
+
find the row for your new language and include the codes from the "639-1",
|
53
|
+
"639-2/T" and (if different) "639-2/B" columns. For example, for the `Afar`
|
54
|
+
language you'd put `afar,aa,aar` here.
|
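As a rough illustration of the three columns described above, the following Ruby sketch parses one entry. It assumes the columns are separated by whitespace, that values within a column are comma-separated, and that `#` starts a comment. `ModuleEntry` and `parse_module_line` are hypothetical names, and the sample line is built around the `afar,aa,aar` example rather than copied from `libstemmer/modules.txt`.

```ruby
# Hypothetical helper: not part of Snowball or Mittens.
ModuleEntry = Struct.new(:algorithm, :encodings, :names)

# Assumes whitespace-separated columns with comma-separated values.
def parse_module_line(line)
  text = line.sub(/#.*/, "").strip   # assumption: '#' starts a comment
  return nil if text.empty?

  algorithm, encodings, names = text.split(/\s+/, 3)
  raise ArgumentError, "expected 3 columns: #{line.inspect}" if names.nil?

  ModuleEntry.new(algorithm, encodings.split(","), names.split(","))
end

entry = parse_module_line("afar  UTF_8,ISO_8859_1  afar,aa,aar")
puts entry.algorithm          # afar
puts entry.encodings.inspect  # ["UTF_8", "ISO_8859_1"]
puts entry.names.inspect      # ["afar", "aa", "aar"]
```

Keeping the parse in one small function like this is one way for a backend to derive its algorithm list from `modules.txt` instead of hardcoding it.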
55
|
+
|
56
|
+
snowball-data repo
|
57
|
+
------------------
|
58
|
+
|
59
|
+
Add subdirectory named after new stemmer containing:
|
60
|
+
|
61
|
+
* voc.txt - word list
|
62
|
+
* output.txt - stemmed equivalents
|
63
|
+
* COPYING - licensing details (word lists need to be under an OSI-approved
|
64
|
+
licence)
|
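The word list and its stemmed equivalents lend themselves to a simple regression check. The Ruby sketch below assumes that `voc.txt` and `output.txt` are line-aligned (word N in one file corresponds to stem N in the other), which the text above implies but does not state. `stem_mismatches` and the toy stemmer are hypothetical stand-ins, not part of Snowball or Mittens.

```ruby
# Pair each word with its expected stem and collect the disagreements.
def stem_mismatches(words, expected)
  raise ArgumentError, "line counts differ" unless words.length == expected.length

  words.zip(expected).reject { |word, want| yield(word) == want }
end

# Toy stand-in for a real stemmer: strips a trailing "s".
toy = ->(word) { word.sub(/s\z/, "") }

words = %w[cats dog running]     # would come from voc.txt
expected = %w[cat dog run]       # would come from output.txt
p stem_mismatches(words, expected, &toy)  # [["running", "run"]]
```

With real data the two arrays would be read from the files and the block would call an actual stemmer.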
65
|
+
|
66
|
+
If you don't have access to a suitably licensed word list of a suitable size,
|
67
|
+
you may be able to use the `wikipedia-most-common-words` script to generate
|
68
|
+
one by extracting the most frequent words from a Wikipedia dump in the
|
69
|
+
language the stemmer is for. You need to specify the Unicode "script" (that's
|
70
|
+
"script" in the sense of alphabet) to use - you can find the appropriate one
|
71
|
+
by looking in the Unicode `Scripts.txt`::
|
72
|
+
|
73
|
+
https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt
|
74
|
+
|
75
|
+
The script name is the second column, between `;` and `#`. The first entries
|
76
|
+
are all "Common" which isn't what you want - scroll down to get to the entries
|
77
|
+
that are useful here.
|
78
|
+
|
79
|
+
You also need to specify the minimum frequency to select. Picking this value
|
80
|
+
will probably need some experimentation as the appropriate threshold depends on
|
81
|
+
how much data there is in the wikipedia dump for a particular language, as well
|
82
|
+
as the size of the vocabulary for the language, and how inflected the language
|
83
|
+
is. Try counting the number of unique words extracted (`wc -l voc.txt` on
|
84
|
+
Unix) and also looking through the list - some proper nouns, words from other
|
85
|
+
languages, typos, etc are OK (since the stemmer will encounter all these in
|
86
|
+
practice too), but at some point "more" stops being "better".
|
87
|
+
|
88
|
+
snowball-website repo
|
89
|
+
---------------------
|
90
|
+
|
91
|
+
Create subdirectory of `algorithms/` named after the language.
|
92
|
+
|
93
|
+
Create `stemmer.tt` which describes the stemming algorithm. This is a
|
94
|
+
"template toolkit" template which is essentially a mix of HTML and some
|
95
|
+
macros for adding the navigation, sample vocabulary, etc. See the
|
96
|
+
existing `stemmer.tt` files for other algorithms for inspiration.
|
97
|
+
|
98
|
+
If it is based on an academic paper, cite the paper and describe any difference
|
99
|
+
between your implementation and that described in the paper (for example,
|
100
|
+
sometimes papers have ambiguities that need resolving to re-implement the
|
101
|
+
algorithm described).
|
102
|
+
|
103
|
+
If you have a stopword list, add that as `stop.txt` in your new subdirectory.
|
104
|
+
The `generate` script checks if such a file exists and if it does a link to
|
105
|
+
it is automatically added.
|
106
|
+
|
107
|
+
Link to your new `stemmer.tt` from `algorithms/index.tt`.
|
108
|
+
|
109
|
+
Add a news entry to `index.tt`.
|
110
|
+
|
111
|
+
Add the new stemmer to the online demo. Assuming you have checkouts of the
|
112
|
+
`snowball`, `snowball-data` and `snowball-website` repos in sibling
|
113
|
+
directories:
|
114
|
+
|
115
|
+
* run `make check_js` in the `snowball` repo
|
116
|
+
* run `./update-js`
|
117
|
+
* add the new stemmer to git with: `git add js/*-stemmer.js`
|
118
|
+
* if the new language is written right-to-left (RTL) then add it to the check
|
119
|
+
in `demo.tt` (search for `rtl` to find the place to change.)
|
120
|
+
* `git commit`.
|
121
|
+
|
122
|
+
Adding a new programming language generator
|
123
|
+
===========================================
|
124
|
+
|
125
|
+
This is a short guide to adding support for generating code for another
|
126
|
+
programming language.
|
127
|
+
|
128
|
+
Is a new generator the right solution?
|
129
|
+
--------------------------------------
|
130
|
+
|
131
|
+
Adding a new code generator is probably not your only option if you want
|
132
|
+
to use Snowball from another language - most languages have support for
|
133
|
+
writing bindings to a C library, so this is probably another option.
|
134
|
+
|
135
|
+
Generating code can have advantages. For example, it can be simpler to
|
136
|
+
deploy without C bindings which need to be built for a specific platform.
|
137
|
+
|
138
|
+
However, it's likely to be significantly more work to implement a new generator
|
139
|
+
than to write bindings to the generated C code, especially as the libstemmer
|
140
|
+
C API is a very small and simple one. Generated code can also be slower -
|
141
|
+
currently the Snowball compiler often generates code that assumes an optimising
|
142
|
+
compiler will clean up redundant constructs, which is not a problem for C, and
|
143
|
+
probably not for most compiled languages, but for a language like Python C
|
144
|
+
bindings are much faster than the generated Python code (using pypy helps a
|
145
|
+
lot, but is still slower). See doc/libstemmer_python_README for some timings.
|
146
|
+
|
147
|
+
That said, the unoptimised generated code has improved over time, and is likely
|
148
|
+
to improve further in the future.
|
149
|
+
|
150
|
+
Key problems to solve
|
151
|
+
---------------------
|
152
|
+
|
153
|
+
* A key problem to solve is how to map the required flow of control in response
|
154
|
+
to Snowball signals.
|
155
|
+
|
156
|
+
In the generated C code this is mostly done using `goto`. If your language
|
157
|
+
doesn't provide an equivalent to `goto` then you'll need an alternative
|
158
|
+
solution.
|
159
|
+
|
160
|
+
In Java and JavaScript we use labelled `break` from blocks and loops
|
161
|
+
instead. If your language has an equivalent to this feature, that will
|
162
|
+
probably work.
|
163
|
+
|
164
|
+
For Python, we currently generate a `try:` ... `raise lab123` ...
|
165
|
+
`except lab123: pass` construct. This works, but doesn't seem ideal.
|
166
|
+
|
167
|
+
If one of the mechanisms above sounds suitable then take a look at the
|
168
|
+
generator for the respective generated output and generator code. If
|
169
|
+
not, come and talk to us on the snowball-discuss mailing list.
|
170
|
+
|
171
|
+
* Snowball's division is specified as integer division with semantics
|
172
|
+
matching C - i.e. the result should be truncated (rounded towards zero).
|
173
|
+
Some languages lack a built-in integer division operation, or have one
|
174
|
+
which instead implements rounding towards negative infinity. Existing
|
175
|
+
backends with special handling here which may be useful to look at
|
176
|
+
include JavaScript, Pascal and Python.
|
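To make the control-flow problem concrete, here is a minimal Ruby sketch of how a failing signal inside a Snowball `or` might be handled without `goto`, using `catch`/`throw` in much the same way as a labelled `break`. No Ruby backend appears in this source; the `ToyEnv` class, its `eq_s` helper and the label name `:lab0` are all hypothetical.

```ruby
# Hypothetical miniature runtime; not generated by the Snowball compiler.
class ToyEnv
  attr_reader :cursor

  def initialize(text)
    @text = text
    @cursor = 0
  end

  # Match literal s at the cursor; advance and signal true on success.
  def eq_s(s)
    return false unless @text[@cursor, s.length] == s

    @cursor += s.length
    true
  end

  # Roughly: ('ab' 'x') or ('ab' 'c')
  def ab_then_x_or_c
    start = @cursor
    catch(:lab0) do
      throw :lab0 unless eq_s("ab")  # failure jumps past the block
      throw :lab0 unless eq_s("x")
      return true                    # first alternative succeeded
    end
    @cursor = start                  # restore cursor before retrying
    eq_s("ab") && eq_s("c")
  end
end

env = ToyEnv.new("abc")
puts env.ab_then_x_or_c  # true
puts env.cursor          # 3
```

The `throw` plays the role of `goto lab0` in the generated C: it abandons the first alternative, after which the cursor is restored and the second alternative runs.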
177
|
+
|
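Ruby is one language whose built-in integer division rounds towards negative infinity, so it makes a convenient illustration of the division issue described above. The sketch below shows the difference and one possible way to get truncation towards zero; `truncating_div` is a hypothetical helper, not something taken from an existing backend.

```ruby
# Integer division that truncates towards zero, matching C semantics.
def truncating_div(a, b)
  quotient = a.abs / b.abs          # non-negative operands: floor == truncate
  negative = (a < 0) ^ (b < 0)      # result is negative if signs differ
  negative ? -quotient : quotient
end

puts(-7 / 2)                 # -4 (Ruby floors)
puts truncating_div(-7, 2)   # -3 (truncated towards zero)
puts truncating_div(7, -2)   # -3
puts truncating_div(-7, -2)  # 3
```

The two forms only disagree when exactly one operand is negative and the division is inexact, which is why the difference is easy to miss in testing.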
178
|
+
Don't hardcode algorithm names
|
179
|
+
------------------------------
|
180
|
+
|
181
|
+
We want to avoid hard-coded lists of algorithms in the language-specific code
|
182
|
+
that have to be manually updated each time a new algorithm is added, because
|
183
|
+
that adds some extra tedious work for adding a new algorithm, and mechanical
|
184
|
+
updates done by hand tend to miss places that need updating, or code gets
|
185
|
+
copied and pasted from an existing case but not fully updated.
|
186
|
+
|
187
|
+
All the existing language backends generate any such code at build time, and
|
188
|
+
adding a new algorithm just requires updating `libstemmer/modules.txt`.
|
189
|
+
|
190
|
+
You can probably copy the approach used for Pascal (script `pascal/generate.pl`
|
191
|
+
works from template `stemwords-template.dpr` which has marked blocks of code
|
192
|
+
that get expanded for each stemming algorithm with a placeholder replaced by
|
193
|
+
the algorithm name). For an alternative approach, see Rust where this is done
|
194
|
+
by `rust/build.rs`.
|
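The template approach described for Pascal can be sketched in a few lines of Ruby. The marker lines `%BEGIN%` / `%END%` and the `%NAME%` placeholder are invented for this illustration and are not the syntax used by `pascal/generate.pl`; `expand_template` is likewise a hypothetical name.

```ruby
# Hypothetical marker and placeholder syntax, chosen for illustration only.
BLOCK = /^%BEGIN%\n(.*?)^%END%\n/m

# Repeat each marked block once per algorithm, substituting its name.
def expand_template(template, algorithms)
  template.gsub(BLOCK) do
    body = Regexp.last_match(1)
    algorithms.map { |name| body.gsub("%NAME%", name) }.join
  end
end

template = <<~TEXT
  uses
  %BEGIN%
    %NAME%Stemmer,
  %END%
    SysUtils;
TEXT

puts expand_template(template, %w[danish dutch])
```

In a real build the algorithm list would be derived from `libstemmer/modules.txt`, so adding an algorithm needs no change to the template or the script.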
195
|
+
|
196
|
+
Mechanics of adding a new generator
|
197
|
+
-----------------------------------
|
198
|
+
|
199
|
+
Copy an existing `compiler/generator_*.c` for your new language and modify
|
200
|
+
away (`generator.c` has the generator for C, but also some common functions
|
201
|
+
so if you start from this one you'll need to remove those common functions).
|
202
|
+
Please resist reformatting existing C code - there's currently a lot of code
|
203
|
+
repeated in each generator which ought to be pulled out as common code, and
|
204
|
+
if you reformat that just makes that job harder.
|
205
|
+
|
206
|
+
Add your new source to `COMPILER_SOURCES` in `GNUmakefile`.
|
207
|
+
|
208
|
+
Add prototypes for the new functions to `compiler/header.h`.
|
209
|
+
|
210
|
+
Add support to `compiler/driver.c`.
|
211
|
+
|
212
|
+
Add targets to `GNUmakefile` to run tests for the new language.
|
213
|
+
|
214
|
+
Hook up automated testing via CI in `.travis.yml`.
|
215
|
+
|
216
|
+
Add to the list of languages in `README.rst`.
|