engtagger 0.4.0 → 0.4.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 0b61370e322595bd880097f51fe0728780fa6a01ee9975e6eb333c8720ff36d8
-   data.tar.gz: 0f990be4f4d5f71908d76f0fb52f2c925a2a01891a815cbc70eaf7a39f77edfe
+   metadata.gz: 043237e54c8a17bcf8871e4a45a6231fb84cf75a6a975cfb027f2bdc2cda7fa9
+   data.tar.gz: 6bc1e9161ade26750731d4d9c11ecad6e406dc3edfcc1774bd1e52970890c6dd
  SHA512:
-   metadata.gz: ade5d1cf6fc11553519fe9217dffb06453e0ab7d69ab1532b3f2e2079dd05d035d90ce5ce92e4d0e1195f2a8f79df5b4d44c4cedb27f14df529ac0b0e91cf730
-   data.tar.gz: ff085546b0db152df0983dabea49ec5b0cf47525cca6118d3776378e908ea04fd675f0bb1daceb944d6be141615e3a5d9da5774025a0dc6ef609dd8b311b1412
+   metadata.gz: bcae03556ad6402de71668519418889b76b9b850e18719b5e91c5d9bd0095725676523fb4e9bc51114f11e01dd18c44d9d21bcab30cd5ef58b8780707030233e
+   data.tar.gz: 2f518f2f6968838cca458ec1ee51a614ae332b43dce59302ec6fe746b24923b868a682bb9b2d2c7c656989d68bc9d66ddc3a0d26869da6a98561748f23336cf9
data/.gitignore CHANGED
@@ -16,3 +16,6 @@ test/tmp
  test/version_tmp
  tmp
  /.idea
+ .rubocop.yml
+ .solargraph.yml
+ .yardopts
data/.rubocop.yml CHANGED
@@ -18,9 +18,6 @@ Naming/FileName:
  Security/MarshalLoad:
    Enabled: false

- Layout/EndOfLine:
-   Enabled: False
-
  Style/ClassVars:
    Enabled: false

data/Gemfile CHANGED
@@ -4,4 +4,6 @@ source "https://rubygems.org"

  gemspec

- gem "lru_redux"
+ gem "rake"
+ gem "test-unit"
+ gem "sin_lru_redux"
data/README.md CHANGED
@@ -2,7 +2,7 @@

  English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger

- ### Description
+ ## Description

  A Ruby port of Perl Lingua::EN::Tagger, a probability based, corpus-trained
  tagger that assigns POS tags to English text based on a lookup dictionary and
@@ -13,13 +13,13 @@ word morphology or can be set to be treated as nouns or other parts of speech.
  The tagger also extracts as many nouns and noun phrases as it can, using a set
  of regular expressions.

- ### Features
+ ## Features

  * Assigns POS tags to English text
  * Extract noun phrases from tagged text
  * etc.

- ### Synopsis
+ ## Synopsis

  ```ruby
  require 'engtagger'
@@ -72,7 +72,7 @@ nps = tgr.get_noun_phrases(tagged)
  #=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
  ```

- ### Tag Set
+ ## Tag Set

  The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. Also, the "Determiner" tag (DET) has been changed from 'DT', in order to avoid confusion with the HTML tag, `<DT>`.

@@ -122,26 +122,56 @@ The set of POS tags used here is a modified version of the Penn Treebank tagset.
  LRB Punctuation, left bracket (, {, [
  RRB Punctuation, right bracket ), }, ]

- ### Install
+ ## Installation

- gem install engtagger
+ **Recommended Approach (without sudo):**

- ### Author
+ It is recommended to install the `engtagger` gem within your user environment without root privileges. This ensures proper file permissions and avoids potential issues. You can achieve this by using Ruby version managers like `rbenv` or `rvm` to manage your Ruby versions and gemsets.

- of this Ruby library
+ To install without `sudo`, simply run:

- * Yoichiro Hasebe (yohasebe [at] gmail.com)
+ ```bash
+ gem install engtagger
+ ```
+
+ **Alternative Approach (with sudo):**
+
+ If you must use `sudo` for installation, you'll need to adjust file permissions afterward to ensure accessibility.
+
+ 1. Install the gem with `sudo`:
+
+ ```bash
+ sudo gem install engtagger
+ ```
+
+ 2. Grant necessary permissions to your user:
+
+ ```bash
+ sudo chown -R $(whoami) /Library/Ruby/Gems/2.6.0/gems/engtagger-0.4.2
+ ```
+
+ **Note:** The path above assumes you are using Ruby version 2.6.0. If you are using a different version, you will need to modify the path accordingly. You can find your Ruby version by running `ruby -v`.
+
+ ## Troubleshooting
+
+ **Permission Issues:**
+
+ If you encounter "cannot load such file" errors after installation, it might be due to incorrect file permissions. Ensure you've followed the instructions for adjusting permissions if you used `sudo` during installation.
+
+ ## Author
+
+ Yoichiro Hasebe (yohasebe [at] gmail.com)

- ### Contributors
+ ## Contributors

  Many thanks to the collaborators listed in the right column of this GitHub page.

- ### Acknowledgement
+ ## Acknowledgement

  This Ruby library is a direct port of Lingua::EN::Tagger available at CPAN.
  The credit for the crucial part of its algorithm/design therefore goes to
  Aaron Coburn, the author of the original Perl version.

- ### License
+ ## License

  This library is distributed under the GPL. Please see the LICENSE file.
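
The `chown` command in the new README text hard-codes a system gem directory for Ruby 2.6.0. As a hedged alternative, RubyGems itself can report where a gem is installed, so the path does not have to be guessed. The sketch below is illustrative only: the helper name `installed_gem_dir` is hypothetical and not part of engtagger, and it relies on the standard `Gem::Specification.find_by_name` and `gem_dir` calls.

```ruby
# Hypothetical helper (not part of engtagger): return the directory in which
# RubyGems installed the named gem, or nil when the gem is not installed.
def installed_gem_dir(name)
  Gem::Specification.find_by_name(name).gem_dir
rescue Gem::LoadError
  # find_by_name raises a Gem::LoadError subclass when no matching gem exists.
  nil
end

dir = installed_gem_dir("engtagger")
puts dir ? "engtagger is installed at #{dir}" : "engtagger is not installed"
```

The reported directory could then be substituted for the hard-coded path if permissions really do need adjusting.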
data/engtagger.gemspec CHANGED
@@ -18,5 +18,5 @@ Gem::Specification.new do |gem|
    gem.name = "engtagger"
    gem.require_paths = ["lib"]
    gem.version = EngTagger::VERSION
-   gem.add_dependency "lru_redux"
+   gem.add_dependency "sin_lru_redux"
  end
@@ -1,170 +1,169 @@
- # frozen_string_literal: true
-
- module Stemmable
-   STEP_2_LIST = {
-     "ational" => "ate", "tional" => "tion", "enci" => "ence", "anci" => "ance",
-     "izer" => "ize", "bli" => "ble",
-     "alli" => "al", "entli" => "ent", "eli" => "e", "ousli" => "ous",
-     "ization" => "ize", "ation" => "ate",
-     "ator" => "ate", "alism" => "al", "iveness" => "ive", "fulness" => "ful",
-     "ousness" => "ous", "aliti" => "al",
-     "iviti" => "ive", "biliti" => "ble", "logi" => "log"
-   }.freeze
-
-   STEP_3_LIST = {
-     "icate" => "ic", "ative" => "", "alize" => "al", "iciti" => "ic",
-     "ical" => "ic", "ful" => "", "ness" => ""
-   }.freeze
-
-   SUFFIX_1_REGEXP = /(
-     ational |
-     tional |
-     enci |
-     anci |
-     izer |
-     bli |
-     alli |
-     entli |
-     eli |
-     ousli |
-     ization |
-     ation |
-     ator |
-     alism |
-     iveness |
-     fulness |
-     ousness |
-     aliti |
-     iviti |
-     biliti |
-     logi)$/x.freeze
-
-
-   SUFFIX_2_REGEXP = /(
-     al |
-     ance |
-     ence |
-     er |
-     ic |
-     able |
-     ible |
-     ant |
-     ement |
-     ment |
-     ent |
-     ou |
-     ism |
-     ate |
-     iti |
-     ous |
-     ive |
-     ize)$/x.freeze
-
-   C = "[^aeiou]" # consonant
-   V = "[aeiouy]" # vowel
-   CC = "#{C}(?>[^aeiouy]*)" # consonant sequence
-   VV = "#{V}(?>[aeiou]*)" # vowel sequence
-
-   MGR0 = /^(#{CC})?#{VV}#{CC}/o.freeze # [cc]vvcc... is m>0
-   MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/o.freeze # [cc]vvcc[vv] is m=1
-   MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o.freeze # [cc]vvccvvcc... is m>1
-   VOWEL_IN_STEM = /^(#{CC})?#{V}/o.freeze # vowel in stem
-
-   # Porter stemmer in Ruby.
-   #
-   # This is the Porter stemming algorithm, ported to Ruby from the
-   # version coded up in Perl. It's easy to follow against the rules
-   # in the original paper in:
-   #
-   # Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
-   # no. 3, pp 130-137,
-   #
-   # See also http://www.tartarus.org/~martin/PorterStemmer
-   #
-   # Send comments to raypereda@hotmail.com
-   #
-
-   def stem_porter
-     # make a copy of the given object and convert it to a string.
-     w = dup.to_str
-
-     return w if w.length < 3
-
-     # now map initial y to Y so that the patterns never treat it as vowel
-     w[0] = "Y" if w[0] == "y"
-
-     # Step 1a
-     case w
-     when /(ss|i)es$/
-       w = $` + $1
-     when /([^s])s$/
-       w = $` + $1
-     end
-
-     # Step 1b
-     case w
-     when /eed$/
-       w.chop! if $` =~ MGR0
-     when /(ed|ing)$/
-       stem = $`
-       if stem =~ VOWEL_IN_STEM
-         w = stem
-         case w
-         when /(at|bl|iz)$/ then w << "e"
-         when /([^aeiouylsz])\1$/ then w.chop!
-         when /^#{CC}#{V}[^aeiouwxy]$/o then w << "e"
-         end
-       end
-     end
-
-     if w =~ /y$/
-       stem = $`
-       w = stem + "i" if stem =~ VOWEL_IN_STEM
-     end
-
-     # Step 2
-     if w =~ SUFFIX_1_REGEXP
-       stem = $`
-       suffix = $1
-       # print "stem= " + stem + "\n" + "suffix=" + suffix + "\n"
-       w = stem + STEP_2_LIST[suffix] if stem =~ MGR0
-     end
-
-     # Step 3
-     if w =~ /(icate|ative|alize|iciti|ical|ful|ness)$/
-       stem = $`
-       suffix = $1
-       w = stem + STEP_3_LIST[suffix] if stem =~ MGR0
-     end
-
-     # Step 4
-     if w =~ SUFFIX_2_REGEXP
-       stem = $`
-       w = stem if stem =~ MGR1
-     elsif w =~ /(s|t)(ion)$/
-       stem = $` + $1
-       w = stem if stem =~ MGR1
-     end
-
-     # Step 5
-     if w =~ /e$/
-       stem = $`
-       w = stem if (stem =~ MGR1) || (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
-     end
-
-     w.chop! if w =~ /ll$/ && w =~ MGR1
-
-     # and turn initial Y back to y
-     w[0] = "y" if w[0] == "Y"
-     w
-   end
-
-   # make the stem_porter the default stem method, just in case we
-   # feel like having multiple stemmers available later.
-   alias stem stem_porter
- end
-
- # Add stem method to all Strings
- class String
-   include Stemmable
- end
+ # frozen_string_literal: true
+
+ module Stemmable
+   STEP_2_LIST = {
+     "ational" => "ate", "tional" => "tion", "enci" => "ence", "anci" => "ance",
+     "izer" => "ize", "bli" => "ble",
+     "alli" => "al", "entli" => "ent", "eli" => "e", "ousli" => "ous",
+     "ization" => "ize", "ation" => "ate",
+     "ator" => "ate", "alism" => "al", "iveness" => "ive", "fulness" => "ful",
+     "ousness" => "ous", "aliti" => "al",
+     "iviti" => "ive", "biliti" => "ble", "logi" => "log"
+   }.freeze
+
+   STEP_3_LIST = {
+     "icate" => "ic", "ative" => "", "alize" => "al", "iciti" => "ic",
+     "ical" => "ic", "ful" => "", "ness" => ""
+   }.freeze
+
+   SUFFIX_1_REGEXP = /(
+     ational |
+     tional |
+     enci |
+     anci |
+     izer |
+     bli |
+     alli |
+     entli |
+     eli |
+     ousli |
+     ization |
+     ation |
+     ator |
+     alism |
+     iveness |
+     fulness |
+     ousness |
+     aliti |
+     iviti |
+     biliti |
+     logi)$/x.freeze
+
+   SUFFIX_2_REGEXP = /(
+     al |
+     ance |
+     ence |
+     er |
+     ic |
+     able |
+     ible |
+     ant |
+     ement |
+     ment |
+     ent |
+     ou |
+     ism |
+     ate |
+     iti |
+     ous |
+     ive |
+     ize)$/x.freeze
+
+   C = "[^aeiou]" # consonant
+   V = "[aeiouy]" # vowel
+   CC = "#{C}(?>[^aeiouy]*)" # consonant sequence
+   VV = "#{V}(?>[aeiou]*)" # vowel sequence
+
+   MGR0 = /^(#{CC})?#{VV}#{CC}/o.freeze # [cc]vvcc... is m>0
+   MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/o.freeze # [cc]vvcc[vv] is m=1
+   MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/o.freeze # [cc]vvccvvcc... is m>1
+   VOWEL_IN_STEM = /^(#{CC})?#{V}/o.freeze # vowel in stem
+
+   # Porter stemmer in Ruby.
+   #
+   # This is the Porter stemming algorithm, ported to Ruby from the
+   # version coded up in Perl. It's easy to follow against the rules
+   # in the original paper in:
+   #
+   # Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
+   # no. 3, pp 130-137,
+   #
+   # See also http://www.tartarus.org/~martin/PorterStemmer
+   #
+   # Send comments to raypereda@hotmail.com
+   #
+
+   def stem_porter
+     # make a copy of the given object and convert it to a string.
+     w = dup.to_str
+
+     return w if w.length < 3
+
+     # now map initial y to Y so that the patterns never treat it as vowel
+     w[0] = "Y" if w[0] == "y"
+
+     # Step 1a
+     case w
+     when /(ss|i)es$/
+       w = $` + $1
+     when /([^s])s$/
+       w = $` + $1
+     end
+
+     # Step 1b
+     case w
+     when /eed$/
+       w.chop! if $` =~ MGR0
+     when /(ed|ing)$/
+       stem = $`
+       if stem =~ VOWEL_IN_STEM
+         w = stem
+         case w
+         when /(at|bl|iz)$/ then w << "e"
+         when /([^aeiouylsz])\1$/ then w.chop!
+         when /^#{CC}#{V}[^aeiouwxy]$/o then w << "e"
+         end
+       end
+     end
+
+     if w =~ /y$/
+       stem = $`
+       w = stem + "i" if stem =~ VOWEL_IN_STEM
+     end
+
+     # Step 2
+     if w =~ SUFFIX_1_REGEXP
+       stem = $`
+       suffix = $1
+       # print "stem= " + stem + "\n" + "suffix=" + suffix + "\n"
+       w = stem + STEP_2_LIST[suffix] if stem =~ MGR0
+     end
+
+     # Step 3
+     if w =~ /(icate|ative|alize|iciti|ical|ful|ness)$/
+       stem = $`
+       suffix = $1
+       w = stem + STEP_3_LIST[suffix] if stem =~ MGR0
+     end
+
+     # Step 4
+     if w =~ SUFFIX_2_REGEXP
+       stem = $`
+       w = stem if stem =~ MGR1
+     elsif w =~ /(s|t)(ion)$/
+       stem = $` + $1
+       w = stem if stem =~ MGR1
+     end
+
+     # Step 5
+     if w =~ /e$/
+       stem = $`
+       w = stem if (stem =~ MGR1) || (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
+     end
+
+     w.chop! if w =~ /ll$/ && w =~ MGR1
+
+     # and turn initial Y back to y
+     w[0] = "y" if w[0] == "Y"
+     w
+   end
+
+   # make the stem_porter the default stem method, just in case we
+   # feel like having multiple stemmers available later.
+   alias stem stem_porter
+ end
+
+ # Add stem method to all Strings
+ class String
+   include Stemmable
+ end
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  class EngTagger
-   VERSION = "0.4.0"
+   VERSION = "0.4.2"
  end
data/lib/engtagger.rb CHANGED
@@ -4,7 +4,7 @@

  require "rubygems"
  require "lru_redux"
- require_relative "engtagger/porter"
+ require_relative "./engtagger/porter"

  module BoundedSpaceMemoizable
    def memoize(method, max_cache_size = 100_000)
  def memoize(method, max_cache_size = 100_000)
@@ -12,7 +12,7 @@ module BoundedSpaceMemoizable
      alias_method "__memoized__#{method}", method
      module_eval <<-MODEV
        def #{method}(*a)
-         @__memoized_#{method}_cache ||= LruRedux::Cache.new(#{max_cache_size})
+         @__memoized_#{method}_cache ||= LruRedux::Cache.new(#{max_cache_size}, true)
          @__memoized_#{method}_cache[a] ||= __memoized__#{method}(*a)
        end
      MODEV
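
The hunk above touches the bounded memoization wrapper: each memoized method keeps a per-object cache capped at `max_cache_size` entries. To illustrate the general technique without depending on `lru_redux`, here is a minimal pure-Ruby sketch. Everything in it is an assumption made for illustration: `TinyLru`, `BoundedMemo` and `Squarer` are hypothetical names, and `TinyLru` is a stand-in that uses Hash insertion order, not the LruRedux implementation. The meaning of the new second argument to `LruRedux::Cache.new` is not stated in this diff, so the sketch does not model it. Unlike the `||=` in the original, this version also caches `nil` and `false` results.

```ruby
# Stand-in LRU cache for illustration only (NOT the LruRedux implementation).
# Ruby hashes keep insertion order, so the first key is the least recently used.
class TinyLru
  def initialize(max_size)
    @max_size = max_size
    @store = {}
  end

  # Return the cached value for key; on a miss, compute it with the block,
  # store it, and evict the least recently used entry if over capacity.
  def fetch(key)
    if @store.key?(key)
      value = @store.delete(key) # remove and re-insert to mark as most recent
      @store[key] = value
    else
      value = yield
      @store[key] = value
      @store.shift if @store.size > @max_size # drop the oldest entry
      value
    end
  end
end

# Hypothetical analogue of BoundedSpaceMemoizable, built on define_method
# instead of module_eval with a heredoc.
module BoundedMemo
  def memoize(method_name, max_cache_size = 100)
    original = instance_method(method_name)
    define_method(method_name) do |*args|
      @__memo_caches ||= {}
      cache = (@__memo_caches[method_name] ||= TinyLru.new(max_cache_size))
      cache.fetch(args) { original.bind(self).call(*args) }
    end
  end
end

# Toy class that counts how often the real method body runs.
class Squarer
  extend BoundedMemo
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def square(n)
    @calls += 1
    n * n
  end
  memoize :square, 2 # keep at most two cached argument lists
end

s = Squarer.new
s.square(3)
s.square(3)
puts s.calls # → 1, the second call is served from the cache
```

Capping the cache trades recomputation for a bounded memory footprint, which matters for a tagger that may see an unbounded vocabulary.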
metadata CHANGED
@@ -1,17 +1,17 @@
  --- !ruby/object:Gem::Specification
  name: engtagger
  version: !ruby/object:Gem::Version
-   version: 0.4.0
+   version: 0.4.2
  platform: ruby
  authors:
  - Yoichiro Hasebe
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-01-21 00:00:00.000000000 Z
+ date: 2025-01-16 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
-   name: lru_redux
+   name: sin_lru_redux
    requirement: !ruby/object:Gem::Requirement
      requirements:
      - - ">="
@@ -69,7 +69,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
    - !ruby/object:Gem::Version
      version: '0'
  requirements: []
- rubygems_version: 3.4.2
+ rubygems_version: 3.5.9
  signing_key:
  specification_version: 4
  summary: A probability based, corpus-trained English POS tagger