mittens 0.1.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +3 -0
- data/Gemfile +7 -0
- data/LICENSE.txt +30 -0
- data/README.md +62 -0
- data/Rakefile +21 -0
- data/ext/mittens/ext.c +96 -0
- data/ext/mittens/extconf.rb +12 -0
- data/lib/mittens/version.rb +3 -0
- data/lib/mittens.rb +7 -0
- data/mittens.gemspec +22 -0
- data/vendor/snowball/.gitignore +26 -0
- data/vendor/snowball/.travis.yml +112 -0
- data/vendor/snowball/AUTHORS +27 -0
- data/vendor/snowball/CONTRIBUTING.rst +216 -0
- data/vendor/snowball/COPYING +29 -0
- data/vendor/snowball/GNUmakefile +742 -0
- data/vendor/snowball/NEWS +754 -0
- data/vendor/snowball/README.rst +37 -0
- data/vendor/snowball/ada/README.md +74 -0
- data/vendor/snowball/ada/generate/generate.adb +83 -0
- data/vendor/snowball/ada/generate.gpr +21 -0
- data/vendor/snowball/ada/src/stemmer.adb +620 -0
- data/vendor/snowball/ada/src/stemmer.ads +219 -0
- data/vendor/snowball/ada/src/stemwords.adb +70 -0
- data/vendor/snowball/ada/stemmer_config.gpr +83 -0
- data/vendor/snowball/ada/stemwords.gpr +21 -0
- data/vendor/snowball/algorithms/arabic.sbl +558 -0
- data/vendor/snowball/algorithms/armenian.sbl +301 -0
- data/vendor/snowball/algorithms/basque.sbl +149 -0
- data/vendor/snowball/algorithms/catalan.sbl +202 -0
- data/vendor/snowball/algorithms/danish.sbl +93 -0
- data/vendor/snowball/algorithms/dutch.sbl +164 -0
- data/vendor/snowball/algorithms/english.sbl +229 -0
- data/vendor/snowball/algorithms/finnish.sbl +197 -0
- data/vendor/snowball/algorithms/french.sbl +254 -0
- data/vendor/snowball/algorithms/german.sbl +139 -0
- data/vendor/snowball/algorithms/german2.sbl +145 -0
- data/vendor/snowball/algorithms/greek.sbl +701 -0
- data/vendor/snowball/algorithms/hindi.sbl +323 -0
- data/vendor/snowball/algorithms/hungarian.sbl +241 -0
- data/vendor/snowball/algorithms/indonesian.sbl +192 -0
- data/vendor/snowball/algorithms/irish.sbl +149 -0
- data/vendor/snowball/algorithms/italian.sbl +202 -0
- data/vendor/snowball/algorithms/kraaij_pohlmann.sbl +240 -0
- data/vendor/snowball/algorithms/lithuanian.sbl +373 -0
- data/vendor/snowball/algorithms/lovins.sbl +208 -0
- data/vendor/snowball/algorithms/nepali.sbl +92 -0
- data/vendor/snowball/algorithms/norwegian.sbl +80 -0
- data/vendor/snowball/algorithms/porter.sbl +139 -0
- data/vendor/snowball/algorithms/portuguese.sbl +218 -0
- data/vendor/snowball/algorithms/romanian.sbl +236 -0
- data/vendor/snowball/algorithms/russian.sbl +221 -0
- data/vendor/snowball/algorithms/serbian.sbl +2379 -0
- data/vendor/snowball/algorithms/spanish.sbl +230 -0
- data/vendor/snowball/algorithms/swedish.sbl +72 -0
- data/vendor/snowball/algorithms/tamil.sbl +405 -0
- data/vendor/snowball/algorithms/turkish.sbl +470 -0
- data/vendor/snowball/algorithms/yiddish.sbl +460 -0
- data/vendor/snowball/charsets/ISO-8859-2.sbl +98 -0
- data/vendor/snowball/charsets/KOI8-R.sbl +74 -0
- data/vendor/snowball/charsets/cp850.sbl +130 -0
- data/vendor/snowball/compiler/analyser.c +1547 -0
- data/vendor/snowball/compiler/driver.c +615 -0
- data/vendor/snowball/compiler/generator.c +1748 -0
- data/vendor/snowball/compiler/generator_ada.c +1702 -0
- data/vendor/snowball/compiler/generator_csharp.c +1322 -0
- data/vendor/snowball/compiler/generator_go.c +1278 -0
- data/vendor/snowball/compiler/generator_java.c +1313 -0
- data/vendor/snowball/compiler/generator_js.c +1316 -0
- data/vendor/snowball/compiler/generator_pascal.c +1387 -0
- data/vendor/snowball/compiler/generator_python.c +1337 -0
- data/vendor/snowball/compiler/generator_rust.c +1295 -0
- data/vendor/snowball/compiler/header.h +418 -0
- data/vendor/snowball/compiler/space.c +286 -0
- data/vendor/snowball/compiler/syswords.h +86 -0
- data/vendor/snowball/compiler/syswords2.h +13 -0
- data/vendor/snowball/compiler/tokeniser.c +567 -0
- data/vendor/snowball/csharp/.gitignore +8 -0
- data/vendor/snowball/csharp/Snowball/Algorithms/.gitignore +1 -0
- data/vendor/snowball/csharp/Snowball/Among.cs +108 -0
- data/vendor/snowball/csharp/Snowball/AssemblyInfo.cs +36 -0
- data/vendor/snowball/csharp/Snowball/Stemmer.cs +660 -0
- data/vendor/snowball/csharp/Stemwords/App.config +6 -0
- data/vendor/snowball/csharp/Stemwords/Program.cs +114 -0
- data/vendor/snowball/doc/TODO +12 -0
- data/vendor/snowball/doc/libstemmer_c_README +148 -0
- data/vendor/snowball/doc/libstemmer_csharp_README +53 -0
- data/vendor/snowball/doc/libstemmer_java_README +67 -0
- data/vendor/snowball/doc/libstemmer_js_README +48 -0
- data/vendor/snowball/doc/libstemmer_python_README +113 -0
- data/vendor/snowball/examples/stemwords.c +204 -0
- data/vendor/snowball/go/README.md +55 -0
- data/vendor/snowball/go/among.go +16 -0
- data/vendor/snowball/go/env.go +403 -0
- data/vendor/snowball/go/stemwords/generate.go +68 -0
- data/vendor/snowball/go/stemwords/main.go +68 -0
- data/vendor/snowball/go/util.go +34 -0
- data/vendor/snowball/iconv.py +50 -0
- data/vendor/snowball/include/libstemmer.h +78 -0
- data/vendor/snowball/java/org/tartarus/snowball/Among.java +29 -0
- data/vendor/snowball/java/org/tartarus/snowball/SnowballProgram.java +381 -0
- data/vendor/snowball/java/org/tartarus/snowball/SnowballStemmer.java +8 -0
- data/vendor/snowball/java/org/tartarus/snowball/TestApp.java +75 -0
- data/vendor/snowball/javascript/base-stemmer.js +294 -0
- data/vendor/snowball/javascript/stemwords.js +106 -0
- data/vendor/snowball/libstemmer/libstemmer_c.in +96 -0
- data/vendor/snowball/libstemmer/mkalgorithms.pl +90 -0
- data/vendor/snowball/libstemmer/mkmodules.pl +267 -0
- data/vendor/snowball/libstemmer/modules.txt +63 -0
- data/vendor/snowball/libstemmer/test.c +34 -0
- data/vendor/snowball/pascal/.gitignore +4 -0
- data/vendor/snowball/pascal/SnowballProgram.pas +430 -0
- data/vendor/snowball/pascal/generate.pl +23 -0
- data/vendor/snowball/pascal/stemwords-template.dpr +78 -0
- data/vendor/snowball/python/MANIFEST.in +7 -0
- data/vendor/snowball/python/create_init.py +54 -0
- data/vendor/snowball/python/setup.cfg +6 -0
- data/vendor/snowball/python/setup.py +81 -0
- data/vendor/snowball/python/snowballstemmer/among.py +13 -0
- data/vendor/snowball/python/snowballstemmer/basestemmer.py +323 -0
- data/vendor/snowball/python/stemwords.py +101 -0
- data/vendor/snowball/python/testapp.py +28 -0
- data/vendor/snowball/runtime/api.c +58 -0
- data/vendor/snowball/runtime/api.h +32 -0
- data/vendor/snowball/runtime/header.h +61 -0
- data/vendor/snowball/runtime/utilities.c +513 -0
- data/vendor/snowball/rust/Cargo.toml +7 -0
- data/vendor/snowball/rust/build.rs +55 -0
- data/vendor/snowball/rust/rust-pre-1.27-compat.patch +30 -0
- data/vendor/snowball/rust/src/main.rs +102 -0
- data/vendor/snowball/rust/src/snowball/algorithms/mod.rs +2 -0
- data/vendor/snowball/rust/src/snowball/among.rs +6 -0
- data/vendor/snowball/rust/src/snowball/mod.rs +6 -0
- data/vendor/snowball/rust/src/snowball/snowball_env.rs +421 -0
- data/vendor/snowball/tests/stemtest.c +95 -0
- metadata +178 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 5685ca729c53bcfeadf2241ffecd3eb536197dcf3863efd33e9b07aac114e4dd
|
|
4
|
+
data.tar.gz: a2a07b4b5d16a691695d1a420ef4acde948276077c9bce7cb98111b172bb71ab
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 2b4f3abde51e786665b83c3846cf5aa0409cae5cbf91dd8f715d3aa850a4c967c951dcd1475d9c6887aa991fc7a460be6f9205196ed7c8f4389c301de38b4931
|
|
7
|
+
data.tar.gz: b9ea741946ac5af5a27ae6685297a7416d1b1e64b0c780c8e0871332393990832a59821bed08ff21428a8c703dfd089ec89e0b376e34514f25f3f4c151620d00
|
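The digests recorded in `checksums.yaml` can be compared against a downloaded copy of the gem. The sketch below is an illustration only: `sha256_matches?` is a hypothetical helper name (it is not part of RubyGems or Mittens), and it assumes `data.tar.gz` has already been extracted from the `.gem` archive.

```ruby
require "digest"

# Hypothetical helper: returns true when the SHA256 digest of `data`
# equals `expected_hex` (compared case-insensitively, ignoring
# surrounding whitespace).
def sha256_matches?(data, expected_hex)
  Digest::SHA256.hexdigest(data) == expected_hex.strip.downcase
end

# Example call (assumes data.tar.gz sits in the current directory):
#   sha256_matches?(
#     File.binread("data.tar.gz"),
#     "a2a07b4b5d16a691695d1a420ef4acde948276077c9bce7cb98111b172bb71ab"
#   )
```

Reading the file with `File.binread` avoids newline or encoding translation, which would otherwise change the digest.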
data/CHANGELOG.md
ADDED
data/Gemfile
ADDED
data/LICENSE.txt
ADDED
|
@@ -0,0 +1,30 @@
|
|
|
1
|
+
Copyright (c) 2001, Dr Martin Porter
|
|
2
|
+
Copyright (c) 2004,2005, Richard Boulton
|
|
3
|
+
Copyright (c) 2013, Yoshiki Shibukawa
|
|
4
|
+
Copyright (c) 2006,2007,2009,2010,2011,2014-2019, Olly Betts
|
|
5
|
+
Copyright (c) 2022, Andrew Kane
|
|
6
|
+
All rights reserved.
|
|
7
|
+
|
|
8
|
+
Redistribution and use in source and binary forms, with or without
|
|
9
|
+
modification, are permitted provided that the following conditions
|
|
10
|
+
are met:
|
|
11
|
+
|
|
12
|
+
1. Redistributions of source code must retain the above copyright notice,
|
|
13
|
+
this list of conditions and the following disclaimer.
|
|
14
|
+
2. Redistributions in binary form must reproduce the above copyright notice,
|
|
15
|
+
this list of conditions and the following disclaimer in the documentation
|
|
16
|
+
and/or other materials provided with the distribution.
|
|
17
|
+
3. Neither the name of the Mittens project nor the names of its contributors
|
|
18
|
+
may be used to endorse or promote products derived from this software
|
|
19
|
+
without specific prior written permission.
|
|
20
|
+
|
|
21
|
+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
|
22
|
+
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
|
23
|
+
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
|
24
|
+
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
|
|
25
|
+
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
|
26
|
+
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
|
27
|
+
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
|
|
28
|
+
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
|
29
|
+
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
|
30
|
+
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
data/README.md
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
# Mittens
|
|
2
|
+
|
|
3
|
+
Stemming for Ruby, powered by [Snowball](https://github.com/snowballstem/snowball)
|
|
4
|
+
|
|
5
|
+
:snowflake: Supports 28 languages
|
|
6
|
+
|
|
7
|
+
[Build Status](https://github.com/ankane/mittens/actions)
|
|
8
|
+
|
|
9
|
+
## Installation
|
|
10
|
+
|
|
11
|
+
Add this line to your application’s Gemfile:
|
|
12
|
+
|
|
13
|
+
```ruby
|
|
14
|
+
gem "mittens"
|
|
15
|
+
```
|
|
16
|
+
|
|
17
|
+
## Getting Started
|
|
18
|
+
|
|
19
|
+
Create a stemmer
|
|
20
|
+
|
|
21
|
+
```ruby
|
|
22
|
+
stemmer = Mittens::Stemmer.new
|
|
23
|
+
```
|
|
24
|
+
|
|
25
|
+
Stem a word
|
|
26
|
+
|
|
27
|
+
```ruby
|
|
28
|
+
stemmer.stem("tomatos")
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
## Languages
|
|
32
|
+
|
|
33
|
+
Specify the language
|
|
34
|
+
|
|
35
|
+
```ruby
|
|
36
|
+
stemmer = Mittens::Stemmer.new(language: "french")
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
Supports `arabic`, `armenian`, `basque`, `catalan`, `danish`, `dutch`, `english`, `finnish`, `french`, `german`, `greek`, `hindi`, `hungarian`, `indonesian`, `irish`, `italian`, `lithuanian`, `nepali`, `norwegian`, `porter`, `portuguese`, `romanian`, `russian`, `serbian`, `spanish`, `swedish`, `tamil`, `turkish`, and `yiddish`
|
|
40
|
+
|
|
41
|
+
## History
|
|
42
|
+
|
|
43
|
+
View the [changelog](https://github.com/ankane/mittens/blob/master/CHANGELOG.md)
|
|
44
|
+
|
|
45
|
+
## Contributing
|
|
46
|
+
|
|
47
|
+
Everyone is encouraged to help improve this project. Here are a few ways you can help:
|
|
48
|
+
|
|
49
|
+
- [Report bugs](https://github.com/ankane/mittens/issues)
|
|
50
|
+
- Fix bugs and [submit pull requests](https://github.com/ankane/mittens/pulls)
|
|
51
|
+
- Write, clarify, or fix documentation
|
|
52
|
+
- Suggest or add new features
|
|
53
|
+
|
|
54
|
+
To get started with development:
|
|
55
|
+
|
|
56
|
+
```sh
|
|
57
|
+
git clone --recursive https://github.com/ankane/mittens.git
|
|
58
|
+
cd mittens
|
|
59
|
+
bundle install
|
|
60
|
+
bundle exec rake compile
|
|
61
|
+
bundle exec rake test
|
|
62
|
+
```
|
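The README shows `Mittens::Stemmer.new(language:)` and `stem`, and `ext.c` in this diff also registers a `languages` singleton method. The sketch below shows one way these might be combined. `build_stemmer` and `stem_words` are hypothetical helper names, and the stemmer class is passed in as a parameter so the sketch does not depend on the compiled extension. Note that the pre-check may be stricter than the constructor itself if the underlying library also accepts language aliases.

```ruby
# Sketch only: `stemmer_class` is expected to behave like Mittens::Stemmer,
# i.e. respond to `.languages` and accept `new(language: ...)`.
def build_stemmer(stemmer_class, language = "english")
  supported = stemmer_class.languages
  unless supported.include?(language)
    raise ArgumentError,
          "unknown language: #{language} (supported: #{supported.sort.join(', ')})"
  end
  stemmer_class.new(language: language)
end

# Stem every String in `words`; other values are skipped because the
# C implementation of `stem` rejects non-String arguments.
def stem_words(stemmer, words)
  words.grep(String).map { |word| stemmer.stem(word) }
end

# Example call (requires the mittens gem to be installed):
#   stemmer = build_stemmer(Mittens::Stemmer, "french")
#   stem_words(stemmer, %w[tomates pommes])
```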
data/Rakefile
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
require "bundler/gem_tasks"
|
|
2
|
+
require "rake/testtask"
|
|
3
|
+
require "rake/extensiontask"
|
|
4
|
+
|
|
5
|
+
task default: :test
|
|
6
|
+
Rake::TestTask.new do |t|
|
|
7
|
+
t.libs << "test"
|
|
8
|
+
t.pattern = "test/**/*_test.rb"
|
|
9
|
+
end
|
|
10
|
+
|
|
11
|
+
Rake::ExtensionTask.new("mittens") do |ext|
|
|
12
|
+
ext.name = "ext"
|
|
13
|
+
ext.lib_dir = "lib/mittens"
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
task :remove_ext do
|
|
17
|
+
path = "lib/mittens/ext.bundle"
|
|
18
|
+
File.unlink(path) if File.exist?(path)
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
Rake::Task["build"].enhance [:remove_ext]
|
data/ext/mittens/ext.c
ADDED
|
@@ -0,0 +1,96 @@
|
|
|
1
|
+
#include "libstemmer.h"
|
|
2
|
+
#include "ruby/ruby.h"
|
|
3
|
+
|
|
4
|
+
typedef struct stemmer {
|
|
5
|
+
struct sb_stemmer * stemmer;
|
|
6
|
+
} stemmer_t;
|
|
7
|
+
|
|
8
|
+
static void stemmer_free(void *ptr)
|
|
9
|
+
{
|
|
10
|
+
stemmer_t *stemmer = (stemmer_t *)ptr;
|
|
11
|
+
// safe to pass null pointer according to docs
|
|
12
|
+
sb_stemmer_delete(stemmer->stemmer);
|
|
13
|
+
xfree(ptr);
|
|
14
|
+
}
|
|
15
|
+
|
|
16
|
+
const rb_data_type_t stemmer_data_type = {
|
|
17
|
+
.wrap_struct_name = "stemmer",
|
|
18
|
+
.function = {
|
|
19
|
+
.dfree = stemmer_free,
|
|
20
|
+
},
|
|
21
|
+
.flags = RUBY_TYPED_FREE_IMMEDIATELY
|
|
22
|
+
};
|
|
23
|
+
|
|
24
|
+
static VALUE stemmer_allocate(VALUE klass)
|
|
25
|
+
{
|
|
26
|
+
stemmer_t *stemmer;
|
|
27
|
+
VALUE obj = TypedData_Make_Struct(klass, stemmer_t, &stemmer_data_type, stemmer);
|
|
28
|
+
stemmer->stemmer = NULL;
|
|
29
|
+
return obj;
|
|
30
|
+
}
|
|
31
|
+
|
|
32
|
+
static VALUE stemmer_initialize(int argc, VALUE* argv, VALUE self)
|
|
33
|
+
{
|
|
34
|
+
VALUE opts;
|
|
35
|
+
rb_scan_args(argc, argv, ":", &opts);
|
|
36
|
+
|
|
37
|
+
const char * algorithm = "english";
|
|
38
|
+
if (!NIL_P(opts)) {
|
|
39
|
+
VALUE language = rb_hash_aref(opts, ID2SYM(rb_intern("language")));
|
|
40
|
+
if (!NIL_P(language)) {
|
|
41
|
+
Check_Type(language, T_STRING);
|
|
42
|
+
algorithm = RSTRING_PTR(language);
|
|
43
|
+
}
|
|
44
|
+
}
|
|
45
|
+
|
|
46
|
+
stemmer_t *stemmer;
|
|
47
|
+
TypedData_Get_Struct(self, stemmer_t, &stemmer_data_type, stemmer);
|
|
48
|
+
|
|
49
|
+
// in case called multiple times
|
|
50
|
+
sb_stemmer_delete(stemmer->stemmer);
|
|
51
|
+
|
|
52
|
+
// if adding support for encoding, may want to change encoding returned from stem
|
|
53
|
+
stemmer->stemmer = sb_stemmer_new(algorithm, NULL);
|
|
54
|
+
if (stemmer->stemmer == NULL) {
|
|
55
|
+
rb_raise(rb_eArgError, "unknown language: %s", algorithm);
|
|
56
|
+
}
|
|
57
|
+
|
|
58
|
+
return self;
|
|
59
|
+
}
|
|
60
|
+
|
|
61
|
+
static VALUE stemmer_stem(VALUE self, VALUE value)
|
|
62
|
+
{
|
|
63
|
+
stemmer_t *stemmer;
|
|
64
|
+
TypedData_Get_Struct(self, stemmer_t, &stemmer_data_type, stemmer);
|
|
65
|
+
|
|
66
|
+
Check_Type(value, T_STRING);
|
|
67
|
+
|
|
68
|
+
const sb_symbol * word = (const sb_symbol *) RSTRING_PTR(value);
|
|
69
|
+
int size = (int) RSTRING_LEN(value);
|
|
70
|
+
const sb_symbol * pointer_out = sb_stemmer_stem(stemmer->stemmer, word, size);
|
|
71
|
+
|
|
72
|
+
return rb_utf8_str_new_cstr((char *) pointer_out);
|
|
73
|
+
}
|
|
74
|
+
|
|
75
|
+
static VALUE stemmer_languages(VALUE klass)
|
|
76
|
+
{
|
|
77
|
+
VALUE out = rb_ary_new();
|
|
78
|
+
|
|
79
|
+
const char **language = sb_stemmer_list();
|
|
80
|
+
while (*language != NULL) {
|
|
81
|
+
rb_ary_push(out, rb_utf8_str_new_cstr(*language));
|
|
82
|
+
language++;
|
|
83
|
+
}
|
|
84
|
+
|
|
85
|
+
return out;
|
|
86
|
+
}
|
|
87
|
+
|
|
88
|
+
void Init_ext(void)
|
|
89
|
+
{
|
|
90
|
+
VALUE rb_mMittens = rb_define_module("Mittens");
|
|
91
|
+
VALUE rb_cStemmer = rb_define_class_under(rb_mMittens, "Stemmer", rb_cObject);
|
|
92
|
+
rb_define_alloc_func(rb_cStemmer, stemmer_allocate);
|
|
93
|
+
rb_define_method(rb_cStemmer, "initialize", stemmer_initialize, -1);
|
|
94
|
+
rb_define_method(rb_cStemmer, "stem", stemmer_stem, 1);
|
|
95
|
+
rb_define_singleton_method(rb_cStemmer, "languages", stemmer_languages, 0);
|
|
96
|
+
}
|
data/ext/mittens/extconf.rb
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
1
|
+
require "mkmf"
|
|
2
|
+
require "open3"
|
|
3
|
+
|
|
4
|
+
vendor = File.expand_path("../../vendor/snowball", __dir__)
|
|
5
|
+
output, status = Open3.capture2("make", chdir: vendor)
|
|
6
|
+
puts output
|
|
7
|
+
raise "Command failed" unless status.success?
|
|
8
|
+
|
|
9
|
+
$INCFLAGS += " -I$(srcdir)/../../vendor/snowball/include"
|
|
10
|
+
$LDFLAGS += " $(srcdir)/../../vendor/snowball/libstemmer.a"
|
|
11
|
+
|
|
12
|
+
create_makefile("mittens/ext")
|
data/lib/mittens.rb
ADDED
data/mittens.gemspec
ADDED
|
@@ -0,0 +1,22 @@
|
|
|
1
|
+
require_relative "lib/mittens/version"
|
|
2
|
+
|
|
3
|
+
Gem::Specification.new do |spec|
|
|
4
|
+
spec.name = "mittens"
|
|
5
|
+
spec.version = Mittens::VERSION
|
|
6
|
+
spec.summary = "Stemming for Ruby, powered by Snowball"
|
|
7
|
+
spec.homepage = "https://github.com/ankane/mittens"
|
|
8
|
+
spec.license = "BSD-3-Clause"
|
|
9
|
+
|
|
10
|
+
spec.author = "Andrew Kane"
|
|
11
|
+
spec.email = "andrew@ankane.org"
|
|
12
|
+
|
|
13
|
+
spec.files = Dir.chdir(__dir__) do
|
|
14
|
+
`git ls-files -z --recurse-submodules`.split("\x0").reject do |f|
|
|
15
|
+
(f == __FILE__) || f.match(%r{\A(?:(?:bin|test|spec|features)/|\.(?:git|travis|circleci)|appveyor)})
|
|
16
|
+
end
|
|
17
|
+
end
|
|
18
|
+
spec.require_path = "lib"
|
|
19
|
+
spec.extensions = ["ext/mittens/extconf.rb"]
|
|
20
|
+
|
|
21
|
+
spec.required_ruby_version = ">= 3"
|
|
22
|
+
end
|
data/vendor/snowball/.gitignore
ADDED
|
@@ -0,0 +1,26 @@
|
|
|
1
|
+
*.o
|
|
2
|
+
/ada/bin/
|
|
3
|
+
/ada/obj/
|
|
4
|
+
/algorithms.mk
|
|
5
|
+
/libstemmer/libstemmer.c
|
|
6
|
+
/libstemmer/libstemmer_utf8.c
|
|
7
|
+
/libstemmer/mkinc.mak
|
|
8
|
+
/libstemmer/mkinc_utf8.mak
|
|
9
|
+
/libstemmer/modules.h
|
|
10
|
+
/libstemmer/modules_utf8.h
|
|
11
|
+
/libstemmer.a
|
|
12
|
+
/snowball
|
|
13
|
+
/src_c
|
|
14
|
+
/stemtest
|
|
15
|
+
/stemwords
|
|
16
|
+
/dist
|
|
17
|
+
/java/org/tartarus/snowball/ext/
|
|
18
|
+
/js_out
|
|
19
|
+
/python_check
|
|
20
|
+
/python_out
|
|
21
|
+
*.generated.cs
|
|
22
|
+
/rust/Cargo.lock
|
|
23
|
+
/rust/src/snowball/algorithms/*.rs
|
|
24
|
+
/rust/target/
|
|
25
|
+
/go/algorithms/
|
|
26
|
+
/go/stemwords/algorithms.go
|
data/vendor/snowball/.travis.yml
ADDED
|
@@ -0,0 +1,112 @@
|
|
|
1
|
+
language: minimal
|
|
2
|
+
arch: arm64
|
|
3
|
+
dist: xenial
|
|
4
|
+
env:
|
|
5
|
+
global: MAKE=make
|
|
6
|
+
matrix:
|
|
7
|
+
include:
|
|
8
|
+
- name: "C distribution build"
|
|
9
|
+
language: c
|
|
10
|
+
dist: focal
|
|
11
|
+
compiler: gcc
|
|
12
|
+
env: CFLAGS_DIST_BUILD='-O2 -Wall -W -std=c90 -Wdeclaration-after-statement -Werror'
|
|
13
|
+
- name: "C distribution build (clang)"
|
|
14
|
+
language: c
|
|
15
|
+
dist: focal
|
|
16
|
+
compiler: clang
|
|
17
|
+
env: CFLAGS_DIST_BUILD='-O2 -Wall -W -std=c90 -Wdeclaration-after-statement -Werror'
|
|
18
|
+
- language: c
|
|
19
|
+
compiler: gcc
|
|
20
|
+
env: c_tests=y CFLAGS='-O2 -Wall -W -std=c99 -Werror'
|
|
21
|
+
- language: c
|
|
22
|
+
compiler: clang
|
|
23
|
+
env: c_tests=y CFLAGS='-O2 -Wall -W -std=c99 -Werror'
|
|
24
|
+
- language: java
|
|
25
|
+
env: JAVA=java JAVAC=javac
|
|
26
|
+
- language: go
|
|
27
|
+
go: "1.8"
|
|
28
|
+
env: GO=go
|
|
29
|
+
- language: go
|
|
30
|
+
dist: bionic
|
|
31
|
+
go: "1.17"
|
|
32
|
+
env: GO=go
|
|
33
|
+
- language: node_js
|
|
34
|
+
node_js: "node"
|
|
35
|
+
env: NODE=node
|
|
36
|
+
dist: focal # seems to be needed for working node
|
|
37
|
+
- language: rust
|
|
38
|
+
rust:
|
|
39
|
+
- stable
|
|
40
|
+
- beta
|
|
41
|
+
dist: bionic
|
|
42
|
+
env: RUST=rust
|
|
43
|
+
- language: csharp
|
|
44
|
+
arch: amd64 # csharp doesn't seem to work on arm64
|
|
45
|
+
env: MCS=mcs
|
|
46
|
+
dist: bionic
|
|
47
|
+
- name: Pascal
|
|
48
|
+
env: FPC=fpc
|
|
49
|
+
dist: bionic
|
|
50
|
+
addons:
|
|
51
|
+
apt:
|
|
52
|
+
packages:
|
|
53
|
+
- fpc
|
|
54
|
+
# The pure Python versions run slowly so we need to thin the testdata
|
|
55
|
+
# for languages such as Arabic where there's a lot, or else the build
|
|
56
|
+
# hits the travis time limit. With pypy, it's enough faster that we
|
|
57
|
+
# can run the full tests.
|
|
58
|
+
- language: python
|
|
59
|
+
python: "3.9"
|
|
60
|
+
env: PYTHON=python THIN_FACTOR=10
|
|
61
|
+
- language: python
|
|
62
|
+
python: "3.7"
|
|
63
|
+
env: PYTHON=python THIN_FACTOR=10
|
|
64
|
+
- language: python
|
|
65
|
+
python: "3.6"
|
|
66
|
+
env: PYTHON=python THIN_FACTOR=10
|
|
67
|
+
- language: python
|
|
68
|
+
python: "pypy3.7-7.3.5"
|
|
69
|
+
env: PYTHON=python
|
|
70
|
+
dist: bionic
|
|
71
|
+
- name: "Ada"
|
|
72
|
+
env: gprbuild=gprbuild
|
|
73
|
+
dist: bionic
|
|
74
|
+
addons:
|
|
75
|
+
apt:
|
|
76
|
+
packages:
|
|
77
|
+
- gnat
|
|
78
|
+
- gprbuild
|
|
79
|
+
- os: windows
|
|
80
|
+
language: c
|
|
81
|
+
env: c_tests=y MAKE=mingw32-make
|
|
82
|
+
- os: windows
|
|
83
|
+
language: go
|
|
84
|
+
env: GO=go MAKE=mingw32-make
|
|
85
|
+
|
|
86
|
+
before_install:
|
|
87
|
+
# Try to check out a branch of the same name from the snowball-data repo
|
|
88
|
+
# sibling of this snowball repo, so that PRs requiring changes to both can be
|
|
89
|
+
# CI tested easily.
|
|
90
|
+
#
|
|
91
|
+
# If that fails, just use the standard snowball-data repo's default branch.
|
|
92
|
+
- GH_BRANCH=${TRAVIS_PULL_REQUEST_BRANCH:-$TRAVIS_BRANCH}
|
|
93
|
+
- GH_REPO_SLUG=${TRAVIS_PULL_REQUEST_SLUG:-$TRAVIS_REPO_SLUG}
|
|
94
|
+
- GH_REPO_URL=https://github.com/${GH_REPO_SLUG%%/*}/snowball-data.git
|
|
95
|
+
- echo "Trying branch $GH_BRANCH from $GH_REPO_URL"
|
|
96
|
+
- git clone --depth=1 -b "$GH_BRANCH" "$GH_REPO_URL" || git clone --depth=1 https://github.com/snowballstem/snowball-data.git
|
|
97
|
+
|
|
98
|
+
script:
|
|
99
|
+
# Ensure CC is set for building the compiler in non-C builds.
|
|
100
|
+
- test -n "$CC" || export CC=gcc
|
|
101
|
+
- $MAKE CC="$CC"
|
|
102
|
+
- test -z "$CFLAGS_DIST_BUILD" || { pip install setuptools && $MAKE dist && mkdir tmp && cd tmp && tar xf ../dist/libstemmer_c-*.tar.gz && cd libstemmer_c-* && $MAKE CFLAGS="$CFLAGS_DIST_BUILD" ; }
|
|
103
|
+
- test -z "$c_tests" || $MAKE check CC="$CC" STEMMING_DATA=snowball-data
|
|
104
|
+
- test -z "$PYTHON" || $MAKE check_python python="$PYTHON" STEMMING_DATA=snowball-data
|
|
105
|
+
- test -z "$JAVA" -o -z "$JAVAC" || $MAKE check_java STEMMING_DATA=snowball-data
|
|
106
|
+
- test -z "$MCS" || $MAKE check_csharp MCS="$MCS" STEMMING_DATA=snowball-data
|
|
107
|
+
- test -z "$NODE" || $MAKE check_js STEMMING_DATA=snowball-data
|
|
108
|
+
- test -z "$RUST" || $MAKE check_rust STEMMING_DATA=snowball-data
|
|
109
|
+
- test -z "$RUST" || $MAKE check_rust STEMMING_DATA=snowball-data
|
|
110
|
+
- test -z "$GO" || $MAKE check_go STEMMING_DATA=snowball-data
|
|
111
|
+
- test -z "$FPC" || $MAKE check_pascal STEMMING_DATA=snowball-data
|
|
112
|
+
- test -z "$gprbuild" || $MAKE check_ada STEMMING_DATA=snowball-data
|
data/vendor/snowball/AUTHORS
ADDED
|
@@ -0,0 +1,27 @@
|
|
|
1
|
+
Authors
|
|
2
|
+
=======
|
|
3
|
+
|
|
4
|
+
Martin Porter
|
|
5
|
+
-------------
|
|
6
|
+
|
|
7
|
+
- Designed the snowball language.
|
|
8
|
+
- Implemented the snowball to C compiler.
|
|
9
|
+
- Implemented the stemming algorithms in C.
|
|
10
|
+
- Wrote the documentation.
|
|
11
|
+
|
|
12
|
+
Richard Boulton
|
|
13
|
+
---------------
|
|
14
|
+
|
|
15
|
+
- Implemented Java backend of the snowball compiler.
|
|
16
|
+
- Developed build system.
|
|
17
|
+
- Assisted with website maintenance.
|
|
18
|
+
|
|
19
|
+
|
|
20
|
+
Assistance from
|
|
21
|
+
---------------
|
|
22
|
+
|
|
23
|
+
Olivier Bornet - fixes to java packaging and build system.
|
|
24
|
+
Andreas Jung - useful bug reports on the libstemmer library.
|
|
25
|
+
Olly Betts - several patches, bug reports, and performance improvements.
|
|
26
|
+
Sebastiano Vigna and Oerd Cukalla - patches for the Java stemming algorithms.
|
|
27
|
+
Ralf Junker - fix a potential memory leak in sb_stemmer_new().
|
data/vendor/snowball/CONTRIBUTING.rst
ADDED
|
@@ -0,0 +1,216 @@
|
|
|
1
|
+
Adding a new stemming algorithm
|
|
2
|
+
===============================
|
|
3
|
+
|
|
4
|
+
This needs PRs against three repositories. Name the branch the same for
|
|
5
|
+
at least `snowball` and `snowball-data`, push to the latter repo first, and the
|
|
6
|
+
CI should use your new vocabulary list when running the testsuite.
|
|
7
|
+
|
|
8
|
+
Some points to note about algorithm implementations:
|
|
9
|
+
|
|
10
|
+
* Avoid literal non-ASCII characters in snowball string literals - they will
|
|
11
|
+
work OK for languages that use UTF-8, but not wide-character Unicode or other
|
|
12
|
+
encodings. Instead use ``stringdef`` like the existing stemmers do, and
|
|
13
|
+
please use the newer `U+` notation rather than the older ``hex`` or
|
|
14
|
+
``decimal`` as this allows us to support different encodings without having
|
|
15
|
+
to modify the source files - for example::
|
|
16
|
+
|
|
17
|
+
stringdef o" {U+00F6}
|
|
18
|
+
define foo 'o{o"}'
|
|
19
|
+
|
|
20
|
+
not::
|
|
21
|
+
|
|
22
|
+
stringdef o" hex F6
|
|
23
|
+
define foo 'o{o"}'
|
|
24
|
+
|
|
25
|
+
and definitely not::
|
|
26
|
+
|
|
27
|
+
define foo 'oö'
|
|
28
|
+
|
|
29
|
+
It's helpful to consistently use the same ``stringdef`` codes across the
|
|
30
|
+
different stemmers - the website has `guidance on what to use
|
|
31
|
+
<https://snowballstem.org/codesets/guide.html>`_ and a `list of stringdef
|
|
32
|
+
lines for common characters to cut and paste from
|
|
33
|
+
<https://snowballstem.org/codesets/latin-stringdef-list.txt>`_.
|
|
34
|
+
|
|
35
|
+
snowball repo
|
|
36
|
+
-------------
|
|
37
|
+
|
|
38
|
+
Add `.sbl` source to algorithms subdirectory.
|
|
39
|
+
|
|
40
|
+
Add entry to `libstemmer/modules.txt`, maintaining the current sorted order by
|
|
41
|
+
the first column. The columns are:
|
|
42
|
+
|
|
43
|
+
* Algorithm name (needs to match the `.sbl` source without extension)
|
|
44
|
+
* Encodings to support. Wide-character Unicode is always supported
|
|
45
|
+
and doesn't need to be listed here. You should always include `UTF_8`, and
|
|
46
|
+
also any of `ISO_8859_1`, `ISO_8859_2` and `KOI8_R` which the language can
|
|
47
|
+
usefully be written using only characters from (in particular they need to
|
|
48
|
+
contain all the characters the stemmer explicitly uses). Support for other
|
|
49
|
+
single-byte character sets is easy to add if they're useful.
|
|
50
|
+
* Names and ISO-639 codes for the language. Wikipedia has a handy list of `all
|
|
51
|
+
the ISO-639 codes <https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes>`_ -
|
|
52
|
+
find the row for your new language and include the codes from the "639-1",
|
|
53
|
+
"639-2/T" and (if different) "639-2/B" columns. For example, for the `Afar`
|
|
54
|
+
language you'd put `afar,aa,aar` here.
|
|
55
|
+
|
|
56
|
+
snowball-data repo
|
|
57
|
+
------------------
|
|
58
|
+
|
|
59
|
+
Add subdirectory named after new stemmer containing:
|
|
60
|
+
|
|
61
|
+
* voc.txt - word list
|
|
62
|
+
* output.txt - stemmed equivalents
|
|
63
|
+
* COPYING - licensing details (word lists need to be under an OSI-approved
|
|
64
|
+
licence)
|
|
65
|
+
|
|
66
|
+
If you don't have access to a suitably licensed word list of a suitable size,
|
|
67
|
+
you may be able to use the `wikipedia-most-common-words` script to generate
|
|
68
|
+
one by extracting the most frequent words from a Wikipedia dump in the
|
|
69
|
+
language the stemmer is for. You need to specify the Unicode "script" (that's
|
|
70
|
+
"script" in the sense of alphabet) to use - you can find the appropriate one
|
|
71
|
+
by looking in the Unicode `Scripts.txt`::
|
|
72
|
+
|
|
73
|
+
https://www.unicode.org/Public/13.0.0/ucd/Scripts.txt
|
|
74
|
+
|
|
75
|
+
The script name is the second column, between `;` and `#`. The first entries
|
|
76
|
+
are all "Common" which isn't what you want - scroll down to get to the entries
|
|
77
|
+
that are useful here.
|
|
78
|
+
|
|
79
|
+
You also need to specify the minimum frequency to select. Picking this value
|
|
80
|
+
will probably need some experimentation as the appropriate threshold depends on
|
|
81
|
+
how much data there is in the wikipedia dump for a particular language, as well
|
|
82
|
+
as the size of the vocabulary for the language, and how inflected the language
|
|
83
|
+
is. Try counting the number of unique words extracted (`wc -l voc.txt` on
|
|
84
|
+
Unix) and also looking through the list - some proper nouns, words from other
|
|
85
|
+
languages, typos, etc are OK (since the stemmer will encounter all these in
|
|
86
|
+
practice too), but at some point "more" stops being "better".
|
|
87
|
+
|
|
88
|
+
snowball-website repo
|
|
89
|
+
---------------------
|
|
90
|
+
|
|
91
|
+
Create subdirectory of `algorithms/` named after the language.
|
|
92
|
+
|
|
93
|
+
Create `stemmer.tt` which describes the stemming algorithm. This is a
|
|
94
|
+
"template toolkit" template which is essentially a mix of HTML and some
|
|
95
|
+
macros for adding the navigation, sample vocabulary, etc. See the
|
|
96
|
+
existing `stemmer.tt` files for other algorithms for inspiration.
|
|
97
|
+
|
|
98
|
+
If it is based on an academic paper, cite the paper and describe any difference
|
|
99
|
+
between your implementation and that described in the paper (for example,
|
|
100
|
+
sometimes papers have ambiguities that need resolving to re-implement the
|
|
101
|
+
algorithm described).
|
|
102
|
+
|
|
103
|
+
If you have a stopword list, add that as `stop.txt` in your new subdirectory.
|
|
104
|
+
The `generate` script checks if such a file exists and, if it does, a link to
|
|
105
|
+
it is automatically added.
|
|
106
|
+
|
|
107
|
+
Link to your new `stemmer.tt` from `algorithms/index.tt`.
|
|
108
|
+
|
|
109
|
+
Add a news entry to `index.tt`.
|
|
110
|
+
|
|
111
|
+
Add the new stemmer to the online demo. Assuming you have checkouts of the
|
|
112
|
+
`snowball`, `snowball-data` and `snowball-website` repos in sibling
|
|
113
|
+
directories:
|
|
114
|
+
|
|
115
|
+
* run `make check_js` in the `snowball` repo
|
|
116
|
+
* run `./update-js`
|
|
117
|
+
* add the new stemmer to git with: `git add js/*-stemmer.js`
|
|
118
|
+
* if the new language is written right-to-left (RTL) then add it to the check
|
|
119
|
+
in `demo.tt` (search for `rtl` to find the place to change.)
|
|
120
|
+
* `git commit`.
|
|
121
|
+
|
|
122
|
+
Adding a new programming language generator
|
|
123
|
+
===========================================
|
|
124
|
+
|
|
125
|
+
This is a short guide to adding support for generating code for another
|
|
126
|
+
programming language.
|
|
127
|
+
|
|
128
|
+
Is a new generator the right solution?
|
|
129
|
+
--------------------------------------
|
|
130
|
+
|
|
131
|
+
Adding a new code generator is probably not your only option if you want
|
|
132
|
+
to use Snowball from another language - most languages have support for
|
|
133
|
+
writing bindings to a C library, so this is probably another option.
|
|
134
|
+
|
|
135
|
+
Generating code can have advantages. For example, it can be simpler to
|
|
136
|
+
deploy without C bindings which need to be built for a specific platform.
|
|
137
|
+
|
|
138
|
+
However, it's likely to be significantly more work to implement a new generator
|
|
139
|
+
than to write bindings to the generated C code, especially as the libstemmer
|
|
140
|
+
C API is a very small and simple one. Generated code can also be slower -
|
|
141
|
+
currently the Snowball compiler often generates code that assumes an optimising
|
|
142
|
+
compiler will clean up redundant constructs, which is not a problem for C, and
|
|
143
|
+
probably not for most compiled languages, but for a language like Python, C
|
|
144
|
+
bindings are much faster than the generated Python code (using pypy helps a
|
|
145
|
+
lot, but is still slower). See doc/libstemmer_python_README for some timings.
|
|
146
|
+
|
|
147
|
+
That said, the unoptimised generated code has improved over time, and is likely
|
|
148
|
+
to improve further in the future.
|
|
149
|
+
|
|
150
|
+
Key problems to solve
|
|
151
|
+
---------------------
|
|
152
|
+
|
|
153
|
+
* A key problem to solve is how to map the required flow of control in response
|
|
154
|
+
to Snowball signals.
|
|
155
|
+
|
|
156
|
+
In the generated C code this is mostly done using `goto`. If your language
|
|
157
|
+
doesn't provide an equivalent to `goto` then you'll need an alternative
|
|
158
|
+
solution.
|
|
159
|
+
|
|
160
|
+
In Java and JavaScript we use labelled `break` from blocks and loops
|
|
161
|
+
instead. If your language has an equivalent to this feature, that will
|
|
162
|
+
probably work.
|
|
163
|
+
|
|
164
|
+
For Python, we currently generate a `try:` ... `raise lab123` ...
|
|
165
|
+
`except lab123: pass` construct. This works, but doesn't seem ideal.
|
|
166
|
+
|
|
167
|
+
If one of the mechanisms above sounds suitable then take a look at the
|
|
168
|
+
respective generated output and generator code. If
|
|
169
|
+
not, come and talk to us on the snowball-discuss mailing list.
|
|
170
|
+
|
|
171
|
+
* Snowball's division is specified as integer division with semantics
|
|
172
|
+
matching C - i.e. the result should be truncated (rounded towards zero).
|
|
173
|
+
Some languages lack a built-in integer division operation, or have one
|
|
174
|
+
which instead implements rounding towards negative infinity. Existing
|
|
175
|
+
backends with special handling here which may be useful to look at
|
|
176
|
+
include JavaScript, Pascal and Python.
|
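The division point above can be illustrated in Ruby, whose built-in integer division rounds towards negative infinity. Ruby is not one of the existing Snowball backends, and `trunc_div` is a hypothetical helper name. The sketch only shows the kind of adjustment a backend for such a language would need to get C-style truncation.

```ruby
# Integer division that truncates towards zero, matching C semantics.
# Ruby's built-in Integer#/ floors instead, so -7 / 2 evaluates to -4.
def trunc_div(a, b)
  raise ZeroDivisionError, "divided by 0" if b.zero?

  # Divide the magnitudes, then restore the sign of the result.
  quotient = a.abs / b.abs
  (a.negative? ^ b.negative?) ? -quotient : quotient
end

# trunc_div(-7, 2) returns -3, whereas -7 / 2 returns -4.
```

Working on magnitudes keeps the calculation in exact integer arithmetic, which avoids the rounding problems a floating-point division followed by truncation could introduce for large values.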
|
177
|
+
|
|
178
|
+
Don't hardcode algorithm names
|
|
179
|
+
------------------------------
|
|
180
|
+
|
|
181
|
+
We want to avoid hard-coded lists of algorithms in the language-specific code
|
|
182
|
+
that have to be manually updated each time a new algorithm is added, because
|
|
183
|
+
that adds some extra tedious work for adding a new algorithm, and mechanical
|
|
184
|
+
updates done by hand tend to miss places that need updating, or code gets
|
|
185
|
+
copied and pasted from an existing case but not fully updated.
|
|
186
|
+
|
|
187
|
+
All the existing language backends generate any such code at build time, and
|
|
188
|
+
adding a new algorithm just requires updating `libstemmer/modules.txt`.
|
|
189
|
+
|
|
190
|
+
You can probably copy the approach used for Pascal (script `pascal/generate.pl`
|
|
191
|
+
works from template `stemwords-template.dpr` which has marked blocks of code
|
|
192
|
+
that get expanded for each stemming algorithm with a placeholder replaced by
|
|
193
|
+
the algorithm name). For an alternative approach, see Rust where this is done
|
|
194
|
+
by `rust/build.rs`.
|
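The template-expansion idea described above might look roughly like this. This is a sketch under stated assumptions, not what `pascal/generate.pl` or `rust/build.rs` actually does: it assumes the module list uses whitespace-separated columns with `#` starting a comment, and the `%NAME%` placeholder and both helper names are hypothetical.

```ruby
# Return the algorithm names (first column) from text laid out like
# libstemmer/modules.txt. Assumed format: whitespace-separated columns,
# with `#` starting a comment that runs to the end of the line.
def algorithm_names(modules_text)
  modules_text.each_line.filter_map do |line|
    line = line.sub(/#.*/, "").strip
    next if line.empty?

    line.split.first
  end
end

# Expand `template` once per algorithm, replacing the hypothetical
# %NAME% placeholder with the algorithm name, and concatenate the results.
def expand_template(template, names)
  names.map { |name| template.gsub("%NAME%", name) }.join
end
```

Because the list of names is derived from the module list at build time, adding an algorithm needs no change to the language-specific code.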
|
195
|
+
|
|
196
|
+
Mechanics of adding a new generator
|
|
197
|
+
-----------------------------------
|
|
198
|
+
|
|
199
|
+
Copy an existing `compiler/generator_*.c` for your new language and modify
|
|
200
|
+
away (`generator.c` has the generator for C, but also some common functions
|
|
201
|
+
so if you start from this one you'll need to remove those common functions).
|
|
202
|
+
Please resist reformatting existing C code - there's currently a lot of code
|
|
203
|
+
repeated in each generator which ought to be pulled out as common code, and
|
|
204
|
+
if you reformat that just makes that job harder.
|
|
205
|
+
|
|
206
|
+
Add your new source to `COMPILER_SOURCES` in `GNUmakefile`.
|
|
207
|
+
|
|
208
|
+
Add prototypes for the new functions to `compiler/header.h`.
|
|
209
|
+
|
|
210
|
+
Add support to `compiler/driver.c`.
|
|
211
|
+
|
|
212
|
+
Add targets to `GNUmakefile` to run tests for the new language.
|
|
213
|
+
|
|
214
|
+
Hook up automated testing via CI in `.travis.yml`.
|
|
215
|
+
|
|
216
|
+
Add to the list of languages in `README.rst`.
|