engtagger 0.4.0 → 0.4.2
- checksums.yaml +4 -4
- data/.gitignore +3 -0
- data/.rubocop.yml +0 -3
- data/Gemfile +3 -1
- data/README.md +42 -12
- data/engtagger.gemspec +1 -1
- data/lib/engtagger/porter.rb +169 -170
- data/lib/engtagger/version.rb +1 -1
- data/lib/engtagger.rb +2 -2
- metadata +4 -4
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 043237e54c8a17bcf8871e4a45a6231fb84cf75a6a975cfb027f2bdc2cda7fa9
+  data.tar.gz: 6bc1e9161ade26750731d4d9c11ecad6e406dc3edfcc1774bd1e52970890c6dd
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: bcae03556ad6402de71668519418889b76b9b850e18719b5e91c5d9bd0095725676523fb4e9bc51114f11e01dd18c44d9d21bcab30cd5ef58b8780707030233e
+  data.tar.gz: 2f518f2f6968838cca458ec1ee51a614ae332b43dce59302ec6fe746b24923b868a682bb9b2d2c7c656989d68bc9d66ddc3a0d26869da6a98561748f23336cf9
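For context, `checksums.yaml` records digests of the `metadata.gz` and `data.tar.gz` members packed inside the `.gem` archive. A minimal Ruby sketch of checking such a digest, assuming the member has already been extracted to disk; the helper name `digest_matches?` and the file path are hypothetical and not part of RubyGems:

```ruby
require "digest/sha2"

# Hypothetical helper (not part of RubyGems): compare a file's SHA-256
# digest with an expected hex string such as one listed in checksums.yaml.
def digest_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

# Assumed location of an extracted archive member; adjust as needed.
path = "data.tar.gz"
expected = "6bc1e9161ade26750731d4d9c11ecad6e406dc3edfcc1774bd1e52970890c6dd"
puts digest_matches?(path, expected) if File.exist?(path)
```

The same comparison would work for the SHA512 entries by swapping in `Digest::SHA512`.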
data/.gitignore
CHANGED
data/.rubocop.yml
CHANGED
data/Gemfile
CHANGED
data/README.md
CHANGED
@@ -2,7 +2,7 @@
 
 English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
 
-
+## Description
 
 A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
 tagger that assigns POS tags to English text based on a lookup dictionary and
@@ -13,13 +13,13 @@ word morphology or can be set to be treated as nouns or other parts of speech.
 The tagger also extracts as many nouns and noun phrases as it can, using a set
 of regular expressions.
 
-
+## Features
 
 * Assigns POS tags to English text
 * Extract noun phrases from tagged text
 * etc.
 
-
+## Synopsis
 
 ```ruby
 require 'engtagger'
@@ -72,7 +72,7 @@ nps = tgr.get_noun_phrases(tagged)
 #=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
 ```
 
-
+## Tag Set
 
 The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.
 
@@ -122,26 +122,56 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
 LRB Punctuation, left bracket (, {, [
 RRB Punctuation, right bracket ), }, ]
 
-
+## Installation
 
-
+**Recommended Approach (without sudo):**
 
-
+It is recommended to install the `engtagger` gem within your user environment without root privileges. This ensures proper file permissions and avoids potential issues. You can achieve this by using Ruby version managers like `rbenv` or `rvm` to manage your Ruby versions and gemsets.
 
-
+To install without `sudo`, simply run:
 
-
+```bash
+gem install engtagger
+```
+
+**Alternative Approach (with sudo):**
+
+If you must use `sudo` for installation, you'll need to adjust file permissions afterward to ensure accessibility.
+
+1. Install the gem with `sudo`:
+
+```bash
+sudo gem install engtagger
+```
+
+2. Grant necessary permissions to your user:
+
+```bash
+sudo chown -R $(whoami) /Library/Ruby/Gems/2.6.0/gems/engtagger-0.4.2
+```
+
+**Note:** The path above assumes you are using Ruby version 2.6.0. If you are using a different version, you will need to modify the path accordingly. You can find your Ruby version by running `ruby -v`.
+
+## Troubleshooting
+
+**Permission Issues:**
+
+If you encounter "cannot load such file" errors after installation, it might be due to incorrect file permissions. Ensure you've followed the instructions for adjusting permissions if you used `sudo` during installation.
+
+## Author
+
+Yoichiro Hasebe (yohasebe [at] gmail.com)
 
-
+## Contributors
 
 Many thanks to the collaborators listed in the right column of this GitHub page.
 
-
+## Acknowledgement
 
 This Ruby library is a direct port of Lingua::EN::Tagger available at CPAN.
 The credit for the crucial part of its algorithm/design therefore goes to
 Aaron Coburn, the author of the original Perl version.
 
-
+## License
 
 This library is distributed under the GPL. Please see the LICENSE file.
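The troubleshooting note added to the README attributes "cannot load such file" errors to file permissions after a `sudo` install. As a rough sketch of how one might look for unreadable files under the installed gem's directory: the helper name `unreadable_files` is hypothetical, and the lookup assumes the gem is installed for the Ruby running the script.

```ruby
require "find"

# Hypothetical helper: list every path under dir that the current user
# cannot read. An empty result suggests permissions are not the problem.
def unreadable_files(dir)
  Find.find(dir).reject { |path| File.readable?(path) }
end

begin
  # Ask RubyGems where the installed gem lives instead of hard-coding a path.
  gem_dir = Gem::Specification.find_by_name("engtagger").gem_dir
  puts unreadable_files(gem_dir)
rescue Gem::LoadError
  warn "engtagger is not installed for this Ruby"
end
```

Looking the directory up through RubyGems avoids depending on a version-specific path like the one in the `chown` example.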
data/engtagger.gemspec
CHANGED
data/lib/engtagger/porter.rb
CHANGED
@@ -1,170 +1,169 @@
-# frozen_string_literal: true
-
-module Stemmable
-  STEP_2_LIST = {
-    "ational" => "ate", "tional" => "tion", "enci" => "ence", "anci" => "ance",
-    "izer" => "ize", "bli" => "ble",
-    "alli" => "al", "entli" => "ent", "eli" => "e", "ousli" => "ous",
-    "ization" => "ize", "ation" => "ate",
-    "ator" => "ate", "alism" => "al", "iveness" => "ive", "fulness" => "ful",
-    "ousness" => "ous", "aliti" => "al",
-    "iviti" => "ive", "biliti" => "ble", "logi" => "log"
-  }.freeze
-
-  STEP_3_LIST = {
-    "icate" => "ic", "ative" => "", "alize" => "al", "iciti" => "ic",
-    "ical" => "ic", "ful" => "", "ness" => ""
-  }.freeze
-
-  SUFFIX_1_REGEXP = /(
-    ational |
-    tional |
-    enci |
-    anci |
-    izer |
-    bli |
-    alli |
-    entli |
-    eli |
-    ousli |
-    ization |
-    ation |
-    ator |
-    alism |
-    iveness |
-    fulness |
-    ousness |
-    aliti |
-    iviti |
-    biliti |
-    logi)$/x.freeze
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-#
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-stem
-
-w
-
-when /(
-when
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-w
-
-
-
-#
-
-
-
-
-
-
-
-end
+# frozen_string_literal: true
+
+module Stemmable
+  STEP_2_LIST = {
+    "ational" => "ate", "tional" => "tion", "enci" => "ence", "anci" => "ance",
+    "izer" => "ize", "bli" => "ble",
+    "alli" => "al", "entli" => "ent", "eli" => "e", "ousli" => "ous",
+    "ization" => "ize", "ation" => "ate",
+    "ator" => "ate", "alism" => "al", "iveness" => "ive", "fulness" => "ful",
+    "ousness" => "ous", "aliti" => "al",
+    "iviti" => "ive", "biliti" => "ble", "logi" => "log"
+  }.freeze
+
+  STEP_3_LIST = {
+    "icate" => "ic", "ative" => "", "alize" => "al", "iciti" => "ic",
+    "ical" => "ic", "ful" => "", "ness" => ""
+  }.freeze
+
+  SUFFIX_1_REGEXP = /(
+    ational |
+    tional |
+    enci |
+    anci |
+    izer |
+    bli |
+    alli |
+    entli |
+    eli |
+    ousli |
+    ization |
+    ation |
+    ator |
+    alism |
+    iveness |
+    fulness |
+    ousness |
+    aliti |
+    iviti |
+    biliti |
+    logi)$/x.freeze
+
+  SUFFIX_2_REGEXP = /(
+    al |
+    ance |
+    ence |
+    er |
+    ic |
+    able |
+    ible |
+    ant |
+    ement |
+    ment |
+    ent |
+    ou |
+    ism |
+    ate |
+    iti |
+    ous |
+    ive |
+    ize)$/x.freeze
+
+  C = "[^aeiou]" # consonant
+  V = "[aeiouy]" # vowel
+  CC = "#{C}(?>[^aeiouy]*)" # consonant sequence
+  VV = "#{V}(?>[aeiou]*)" # vowel sequence
+
+  MGR0 = /^(#{CC})?#{VV}#{CC}/o.freeze # [cc]vvcc... is m>0
+  MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/o.freeze # [cc]vvcc[vv] is m=1
+  MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o.freeze # [cc]vvccvvcc... is m>1
+  VOWEL_IN_STEM = /^(#{CC})?#{V}/o.freeze # vowel in stem
+
+  # Porter stemmer in Ruby.
+  #
+  # This is the Porter stemming algorithm, ported to Ruby from the
+  # version coded up in Perl. It's easy to follow against the rules
+  # in the original paper in:
+  #
+  # Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
+  # no. 3, pp 130-137,
+  #
+  # See also http://www.tartarus.org/~martin/PorterStemmer
+  #
+  # Send comments to raypereda@hotmail.com
+  #
+
+  def stem_porter
+    # make a copy of the given object and convert it to a string.
+    w = dup.to_str
+
+    return w if w.length < 3
+
+    # now map initial y to Y so that the patterns never treat it as vowel
+    w[0] = "Y" if w[0] == "y"
+
+    # Step 1a
+    case w
+    when /(ss|i)es$/
+      w = $` + $1
+    when /([^s])s$/
+      w = $` + $1
+    end
+
+    # Step 1b
+    case w
+    when /eed$/
+      w.chop! if $` =~ MGR0
+    when /(ed|ing)$/
+      stem = $`
+      if stem =~ VOWEL_IN_STEM
+        w = stem
+        case w
+        when /(at|bl|iz)$/ then w << "e"
+        when /([^aeiouylsz])\1$/ then w.chop!
+        when /^#{CC}#{V}[^aeiouwxy]$/o then w << "e"
+        end
+      end
+    end
+
+    if w =~ /y$/
+      stem = $`
+      w = stem + "i" if stem =~ VOWEL_IN_STEM
+    end
+
+    # Step 2
+    if w =~ SUFFIX_1_REGEXP
+      stem = $`
+      suffix = $1
+      # print "stem= " + stem + "\n" + "suffix=" + suffix + "\n"
+      w = stem + STEP_2_LIST[suffix] if stem =~ MGR0
+    end
+
+    # Step 3
+    if w =~ /(icate|ative|alize|iciti|ical|ful|ness)$/
+      stem = $`
+      suffix = $1
+      w = stem + STEP_3_LIST[suffix] if stem =~ MGR0
+    end
+
+    # Step 4
+    if w =~ SUFFIX_2_REGEXP
+      stem = $`
+      w = stem if stem =~ MGR1
+    elsif w =~ /(s|t)(ion)$/
+      stem = $` + $1
+      w = stem if stem =~ MGR1
+    end
+
+    # Step 5
+    if w =~ /e$/
+      stem = $`
+      w = stem if (stem =~ MGR1) || (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
+    end
+
+    w.chop! if w =~ /ll$/ && w =~ MGR1
+
+    # and turn initial Y back to y
+    w[0] = "y" if w[0] == "Y"
+    w
+  end
+
+  # make the stem_porter the default stem method, just in case we
+  # feel like having multiple stemmers available later.
+  alias stem stem_porter
+end
+
+# Add stem method to all Strings
+class String
+  include Stemmable
+end
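The `MGR0` and `MGR1` patterns in the added `porter.rb` test Porter's "measure" m of a stem, roughly the number of vowel-then-consonant sequences it contains. The sketch below copies the four string constants and those two patterns from the hunk into a standalone module to show how a few stems classify. The module name `MeasureDemo` and the helper `measure_class` are illustrative only and not part of engtagger; the `/o` flag and `.freeze` are dropped for brevity.

```ruby
module MeasureDemo
  C  = "[^aeiou]"             # consonant
  V  = "[aeiouy]"             # vowel
  CC = "#{C}(?>[^aeiouy]*)"   # consonant sequence
  VV = "#{V}(?>[aeiou]*)"     # vowel sequence

  MGR0 = /^(#{CC})?#{VV}#{CC}/           # m > 0
  MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/ # m > 1

  # Illustrative helper: bucket a stem by its measure.
  def self.measure_class(stem)
    return "m>1" if stem =~ MGR1
    return "m=1" if stem =~ MGR0

    "m=0"
  end
end

%w[tree trouble troubles].each do |stem|
  puts "#{stem}: #{MeasureDemo.measure_class(stem)}"
end
# prints:
#   tree: m=0
#   trouble: m=1
#   troubles: m>1
```

Steps 2 and 3 of `stem_porter` only rewrite a suffix when the remaining stem has m > 0, and step 4 requires m > 1, which is why short stems such as "tree" are left alone.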
data/lib/engtagger/version.rb
CHANGED
data/lib/engtagger.rb
CHANGED
@@ -4,7 +4,7 @@
 
 require "rubygems"
 require "lru_redux"
-require_relative "engtagger/porter"
+require_relative "./engtagger/porter"
 
 module BoundedSpaceMemoizable
   def memoize(method, max_cache_size = 100_000)
@@ -12,7 +12,7 @@ module BoundedSpaceMemoizable
     alias_method "__memoized__#{method}", method
     module_eval <<-MODEV
       def #{method}(*a)
-        @__memoized_#{method}_cache ||= LruRedux::Cache.new(#{max_cache_size})
+        @__memoized_#{method}_cache ||= LruRedux::Cache.new(#{max_cache_size}, true)
         @__memoized_#{method}_cache[a] ||= __memoized__#{method}(*a)
       end
     MODEV
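The `memoize` helper in this hunk wraps a method with a size-bounded cache from the `lru_redux` API. To illustrate the general pattern without that dependency, here is a sketch that substitutes a small hand-rolled LRU built on Ruby's insertion-ordered `Hash`. `TinyLru`, `BoundedMemo`, and `Squarer` are hypothetical names, and this is not how `LruRedux::Cache` is implemented.

```ruby
# Hypothetical stand-in for a bounded cache; NOT the lru_redux implementation.
class TinyLru
  def initialize(max_size)
    @max_size = max_size
    @store = {}
  end

  def [](key)
    return nil unless @store.key?(key)

    value = @store.delete(key) # re-insert to mark as most recently used
    @store[key] = value
  end

  def []=(key, value)
    @store.delete(key)
    @store[key] = value
    # Evict the least recently used entry once the bound is exceeded.
    @store.delete(@store.keys.first) if @store.size > @max_size
  end
end

module BoundedMemo
  # Replace +method+ with a version that caches results per argument list.
  def memoize(method, max_cache_size = 100)
    original = instance_method(method)
    define_method(method) do |*args|
      cache = (@__memo_caches ||= {})[method] ||= TinyLru.new(max_cache_size)
      cache[args] ||= original.bind(self).call(*args)
    end
  end
end

class Squarer
  extend BoundedMemo
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def square(n)
    @calls += 1
    n * n
  end
  memoize :square, 2
end

s = Squarer.new
s.square(3)
s.square(3)  # served from the cache
puts s.calls # prints 1
```

As in the hunk above, the `||=` lookup means a `nil` or `false` result is never served from the cache and is recomputed on every call.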
metadata
CHANGED
@@ -1,17 +1,17 @@
 --- !ruby/object:Gem::Specification
 name: engtagger
 version: !ruby/object:Gem::Version
-  version: 0.4.
+  version: 0.4.2
 platform: ruby
 authors:
 - Yoichiro Hasebe
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2025-01-16 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
-  name:
+  name: sin_lru_redux
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
@@ -69,7 +69,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
+rubygems_version: 3.5.9
 signing_key:
 specification_version: 4
 summary: A probability based, corpus-trained English POS tagger