neologdish-normalizer 0.0.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: f2f96d57b26c4e44d37d47bcabcec5a820470d4e89beb59b709c2dc36ecd449b
+   data.tar.gz: fc7479594872db9c262cd13acf8e4f717936b289ead7fe136c65f0f211d32417
+ SHA512:
+   metadata.gz: 84457e6fd2b5fae0b87d51a86b0e9a5c4aaf808b436ba97654c0939630fe19142c6bb2eb6729baeb0773882726aacffb62f67139be5c03b8c2404e74ca1b2408
+   data.tar.gz: 312f9993dceed61d3443f7ddb6c06904d49b21a52725be48721bf0683e75111b76b6b3bd53cde1d7a34315adb157f91306a75f3561f584af74d269bd71eb19aa
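The digests in `checksums.yaml` can be checked by hand once the `.gem` archive (a plain tar file holding `metadata.gz` and `data.tar.gz`) has been unpacked. The following is a minimal sketch using only Ruby's standard `digest` library; the helper name `sha256_matches?` is hypothetical and is not part of RubyGems or of this package.

```ruby
require 'digest'

# Hypothetical helper: true when the file's SHA-256 hex digest equals the
# expected value taken from checksums.yaml (comparison is case-insensitive).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

For example, `sha256_matches?('metadata.gz', 'f2f96d57...')` would be called with the full digest listed under `SHA256:`; the same pattern applies to `Digest::SHA512` for the second group.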
data/.rubocop.yml ADDED
@@ -0,0 +1,18 @@
+ require:
+   - rubocop-rake
+   - rubocop-minitest
+
+ AllCops:
+   NewCops: enable
+
+ Metrics:
+   Enabled: false
+ Layout/LineLength:
+   Max: 140
+ Metrics/MethodLength:
+   Max: 14
+ Layout/LeadingCommentSpace:
+   AllowRBSInlineAnnotation: true
+ Minitest/MultipleAssertions:
+   Enabled: false
+
data/.ruby-version ADDED
@@ -0,0 +1 @@
+ 3.3
data/LICENSE ADDED
@@ -0,0 +1,8 @@
+ Copyright 2024 Taiki Kawakami (a.k.a moznion) https://moznion.net
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
data/README.md ADDED
@@ -0,0 +1,51 @@
+ # Neologdish::Normalizer for Ruby
+
+ A Japanese text normalization library for Ruby that follows the conventions of [neologd/mecab-ipadic-neologd](https://github.com/neologd/mecab-ipadic-neologd), with some performance optimizations and no external dependencies. It is designed to preprocess Japanese text before applying NLP techniques.
+
+ The specific rules are documented here: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
+
+ ## Usage
+
+ ```ruby
+ require "neologdish/normalizer"
+
+ Neologdish::Normalizer.normalize("南アルプスの　天然水-　Ｓｐａｒｋｉｎｇ*　Ｌｅｍｏｎ+　レモン一絞り")
+ # => 南アルプスの天然水-Sparking*Lemon+レモン一絞り
+ ```
+
+ ## Benchmark
+
+ The performance comparison between the official Ruby example (https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#ruby-written-by-kimoto-and-overlast) and this library is as follows:
+
+ ```
+                           user     system      total        real
+ original normalizer:  4.200670   0.032004   4.232674 (  4.274573)
+ this library:         1.158801   0.005238   1.164039 (  1.170226)
+ ```
+
+ The benchmark script is here: [./scripts/benchmark.rb](./scripts/benchmark.rb)
+
+ ## Installation
+
+ Install the gem and add it to the application's Gemfile by executing:
+
+ ```bash
+ bundle add neologdish-normalizer
+ ```
+
+ If Bundler is not being used to manage dependencies, install the gem by executing:
+
+ ```bash
+ gem install neologdish-normalizer
+ ```
+
+ ## Development
+
+ After checking out the repo, run `bin/setup` to install dependencies. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
+
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/neologdish-normalizer.
+
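The table in the README's Benchmark section has the shape of output printed by Ruby's standard `Benchmark.bm`. The package's own `scripts/benchmark.rb` is not part of this diff, so the following is only a sketch of how such a comparison could be set up. Both callables are hypothetical stand-ins (a regex-chain squeeze and a single-pass squeeze); neither is the gem's code nor the official neologd example, and the sample text is invented.

```ruby
require 'benchmark'

# Hypothetical stand-in for a regex-chain normalizer: each rule is a
# separate pass over the string.
REGEX_STYLE = ->(s) { s.gsub(/[ 　]+/, ' ').gsub(/ー+/, 'ー').strip }

# Hypothetical stand-in for a single-pass normalizer: one walk over the
# characters, skipping a repeated space or long-vowel mark.
SINGLE_PASS = lambda do |s|
  prev = nil
  s.each_char.with_object(+'') do |c, out|
    c = ' ' if c == '　' # fold the ideographic space to an ASCII space
    next if [' ', 'ー'].include?(c) && c == prev

    out << c
    prev = c
  end.strip
end

sample = 'ラーーメン　　 食べたい ' * 200 # invented sample text

Benchmark.bm(12) do |x|
  x.report('regex style:') { 20.times { REGEX_STYLE.call(sample) } }
  x.report('single pass:') { 20.times { SINGLE_PASS.call(sample) } }
end
```

Timings from a sketch like this depend on the input and the Ruby version, so they say nothing about the figures reported in the README.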
data/Rakefile ADDED
@@ -0,0 +1,20 @@
+ # frozen_string_literal: true
+
+ require 'bundler/gem_tasks'
+ require 'rake/testtask'
+ require 'rubocop/rake_task'
+
+ task default: %i[]
+
+ namespace :rbs do
+   task gen: %i[] do
+     sh 'rbs-inline --output --opt-out lib'
+   end
+ end
+
+ Rake::TestTask.new do |task|
+   task.libs = %w[lib test]
+   task.test_files = FileList['test/**/*.rb']
+ end
+
+ RuboCop::RakeTask.new
data/lib/neologdish/normalizer/version.rb ADDED
@@ -0,0 +1,7 @@
+ # frozen_string_literal: true
+
+ module Neologdish
+   module Normalizer
+     VERSION = '0.0.1' #: String
+   end
+ end
data/lib/neologdish/normalizer.rb ADDED
@@ -0,0 +1,127 @@
+ # frozen_string_literal: true
+
+ require_relative 'normalizer/version'
+
+ module Neologdish
+   # A Japanese text normalizer module according to the neologd convention.
+   module Normalizer
+     CONVERSION_MAP = {
+       # Normalize [0-9a-zA-Z] to half-width
+       '０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4', '５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
+       'Ａ' => 'A', 'Ｂ' => 'B', 'Ｃ' => 'C', 'Ｄ' => 'D', 'Ｅ' => 'E', 'Ｆ' => 'F', 'Ｇ' => 'G', 'Ｈ' => 'H', 'Ｉ' => 'I', 'Ｊ' => 'J',
+       'Ｋ' => 'K', 'Ｌ' => 'L', 'Ｍ' => 'M', 'Ｎ' => 'N', 'Ｏ' => 'O', 'Ｐ' => 'P', 'Ｑ' => 'Q', 'Ｒ' => 'R', 'Ｓ' => 'S', 'Ｔ' => 'T',
+       'Ｕ' => 'U', 'Ｖ' => 'V', 'Ｗ' => 'W', 'Ｘ' => 'X', 'Ｙ' => 'Y', 'Ｚ' => 'Z',
+       'ａ' => 'a', 'ｂ' => 'b', 'ｃ' => 'c', 'ｄ' => 'd', 'ｅ' => 'e', 'ｆ' => 'f', 'ｇ' => 'g', 'ｈ' => 'h', 'ｉ' => 'i', 'ｊ' => 'j',
+       'ｋ' => 'k', 'ｌ' => 'l', 'ｍ' => 'm', 'ｎ' => 'n', 'ｏ' => 'o', 'ｐ' => 'p', 'ｑ' => 'q', 'ｒ' => 'r', 'ｓ' => 's', 'ｔ' => 't',
+       'ｕ' => 'u', 'ｖ' => 'v', 'ｗ' => 'w', 'ｘ' => 'x', 'ｙ' => 'y', 'ｚ' => 'z',
+       # normalize the hyphen/minus-ish characters to '-'
+       '˗' => '-', '֊' => '-', '‐' => '-', '‑' => '-', '‒' => '-', '–' => '-', '⁃' => '-', '⁻' => '-', '₋' => '-', '−' => '-',
+       # normalize the long-vowel mark-ish characters to 'ー'
+       '﹣' => 'ー', '－' => 'ー', 'ｰ' => 'ー', '—' => 'ー', '―' => 'ー', '─' => 'ー', '━' => 'ー',
+       # remove the tilde-ish characters
+       '~' => '', '∼' => '', '∾' => '', '〜' => '', '〰' => '', '～' => '',
+       # normalize the full-width special symbol characters (／！”＃＄％＆’（）＊＋，−．／：；＜＞？＠［￥］＾＿｀｛｜｝) and space characters to half-width
+       '　' => ' ', '！' => '!', '”' => '"', '＃' => '#', '＄' => '$', '％' => '%', '＆' => '&', '’' => "'", '（' => '(', '）' => ')',
+       '＊' => '*', '＋' => '+', '，' => ',', '．' => '.', '／' => '/', '：' => ':', '；' => ';', '＜' => '<', '＞' => '>', '？' => '?',
+       '＠' => '@', '［' => '[', '￥' => '¥', '］' => ']', '＾' => '^', '＿' => '_', '｀' => '`', '｛' => '{', '｜' => '|', '｝' => '}',
+       # normalize the half-width special symbol characters (｡､･=｢｣) to full-width
+       '｡' => '。', '､' => '、', '･' => '・', '｢' => '「', '｣' => '」'
+     }.freeze #: Hash[String, String]
+
+     LATIN_MAP = {
+       '0' => true, '1' => true, '2' => true, '3' => true, '4' => true, '5' => true, '6' => true, '7' => true, '8' => true, '9' => true,
+       'A' => true, 'B' => true, 'C' => true, 'D' => true, 'E' => true, 'F' => true, 'G' => true, 'H' => true, 'I' => true, 'J' => true,
+       'K' => true, 'L' => true, 'M' => true, 'N' => true, 'O' => true, 'P' => true, 'Q' => true, 'R' => true, 'S' => true, 'T' => true,
+       'U' => true, 'V' => true, 'W' => true, 'X' => true, 'Y' => true, 'Z' => true,
+       'a' => true, 'b' => true, 'c' => true, 'd' => true, 'e' => true, 'f' => true, 'g' => true, 'h' => true, 'i' => true, 'j' => true,
+       'k' => true, 'l' => true, 'm' => true, 'n' => true, 'o' => true, 'p' => true, 'q' => true, 'r' => true, 's' => true, 't' => true,
+       'u' => true, 'v' => true, 'w' => true, 'x' => true, 'y' => true, 'z' => true
+     }.freeze #: Hash[String, bool]
+
+     HALF_WIDTH_KANA_MAP = {
+       'ｱ' => 'ア', 'ｲ' => 'イ', 'ｳ' => 'ウ', 'ｴ' => 'エ', 'ｵ' => 'オ',
+       'ｶ' => 'カ', 'ｷ' => 'キ', 'ｸ' => 'ク', 'ｹ' => 'ケ', 'ｺ' => 'コ',
+       'ｻ' => 'サ', 'ｼ' => 'シ', 'ｽ' => 'ス', 'ｾ' => 'セ', 'ｿ' => 'ソ',
+       'ﾀ' => 'タ', 'ﾁ' => 'チ', 'ﾂ' => 'ツ', 'ﾃ' => 'テ', 'ﾄ' => 'ト',
+       'ﾅ' => 'ナ', 'ﾆ' => 'ニ', 'ﾇ' => 'ヌ', 'ﾈ' => 'ネ', 'ﾉ' => 'ノ',
+       'ﾊ' => 'ハ', 'ﾋ' => 'ヒ', 'ﾌ' => 'フ', 'ﾍ' => 'ヘ', 'ﾎ' => 'ホ',
+       'ﾏ' => 'マ', 'ﾐ' => 'ミ', 'ﾑ' => 'ム', 'ﾒ' => 'メ', 'ﾓ' => 'モ',
+       'ﾔ' => 'ヤ', 'ﾕ' => 'ユ', 'ﾖ' => 'ヨ',
+       'ﾗ' => 'ラ', 'ﾘ' => 'リ', 'ﾙ' => 'ル', 'ﾚ' => 'レ', 'ﾛ' => 'ロ',
+       'ﾜ' => 'ワ', 'ｦ' => 'ヲ', 'ﾝ' => 'ン',
+       'ｧ' => 'ァ', 'ｨ' => 'ィ', 'ｩ' => 'ゥ', 'ｪ' => 'ェ', 'ｫ' => 'ォ',
+       'ｯ' => 'ッ', 'ｬ' => 'ャ', 'ｭ' => 'ュ', 'ｮ' => 'ョ'
+     }.freeze #: Hash[String, String]
+
+     DAKUON_KANA_MAP = {
+       'カ' => 'ガ', 'キ' => 'ギ', 'ク' => 'グ', 'ケ' => 'ゲ', 'コ' => 'ゴ',
+       'サ' => 'ザ', 'シ' => 'ジ', 'ス' => 'ズ', 'セ' => 'ゼ', 'ソ' => 'ゾ',
+       'タ' => 'ダ', 'チ' => 'ヂ', 'ツ' => 'ヅ', 'テ' => 'デ', 'ト' => 'ド',
+       'ハ' => 'バ', 'ヒ' => 'ビ', 'フ' => 'ブ', 'ヘ' => 'ベ', 'ホ' => 'ボ'
+     }.freeze #: Hash[String, String]
+
+     HANDAKUON_KANA_MAP = {
+       'ハ' => 'パ', 'ヒ' => 'ピ', 'フ' => 'プ', 'ヘ' => 'ペ', 'ホ' => 'ポ'
+     }.freeze #: Hash[String, String]
+
+     private_constant :CONVERSION_MAP, :LATIN_MAP, :HALF_WIDTH_KANA_MAP, :DAKUON_KANA_MAP, :HANDAKUON_KANA_MAP
+
+     # Normalize the given text.
+     #
+     # @rbs str: String
+     # @rbs return: String
+     def normalize(str)
+       squeezee = ''
+       prev_latin = false
+       whitespace_encountered = false
+       encountered_half_width_kana = nil
+       normalized = str.chars.map do |c|
+         prefix = ''
+         c = CONVERSION_MAP[c] || c
+
+         # normalize the half-width kana to full-width
+         if encountered_half_width_kana
+           if (c == 'ﾞ' && (k = DAKUON_KANA_MAP[encountered_half_width_kana])) ||
+              (c == 'ﾟ' && (k = HANDAKUON_KANA_MAP[encountered_half_width_kana]))
+             c = ''
+             prefix = k
+           else
+             prefix = encountered_half_width_kana
+           end
+         end
+
+         if (encountered_half_width_kana = HALF_WIDTH_KANA_MAP[c])
+           c = ''
+         end
+
+         # squash consecutive special characters (space or long-vowel)
+         if [' ', 'ー'].include?(c)
+           if squeezee == c
+             c = ''
+           else
+             squeezee = c
+           end
+         else
+           squeezee = ''
+         end
+
+         # remove the white space character in the middle of non-latin characters
+         is_latin = LATIN_MAP[c] || false
+         if c == ' '
+           whitespace_encountered = prev_latin
+           c = ''
+         else
+           prefix = ' ' if is_latin && whitespace_encountered
+           whitespace_encountered = false
+         end
+         prev_latin = is_latin
+
+         prefix + c
+       end.join + (encountered_half_width_kana || '')
+
+       normalized.strip
+     end
+
+     module_function :normalize
+   end
+ end
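The half-width kana branch of `normalize` above holds one converted kana back until the next character shows whether a voiced or semi-voiced mark follows. The sketch below isolates that idea as a standalone function. The tables are small illustrative excerpts, the function and constant names are hypothetical, and it assumes the marks in question are the half-width forms U+FF9E and U+FF9F; it is not the package's implementation.

```ruby
# Illustrative excerpts only; the real tables cover the whole kana set.
HALF_TO_FULL = { 'ｱ' => 'ア', 'ｶ' => 'カ', 'ﾊ' => 'ハ' }.freeze
VOICED = { 'カ' => 'ガ', 'ハ' => 'バ' }.freeze
SEMI_VOICED = { 'ハ' => 'パ' }.freeze

VOICED_MARK = "\uFF9E"      # half-width voiced sound mark (assumed input form)
SEMI_VOICED_MARK = "\uFF9F" # half-width semi-voiced sound mark (assumed input form)

# Convert half-width kana to full-width, merging a following mark into the
# kana that precedes it when a voiced or semi-voiced form exists.
def merge_half_width_kana(str)
  pending = nil # full-width kana waiting to see whether a mark follows
  out = +''
  str.each_char do |c|
    if pending
      merged = (c == VOICED_MARK && VOICED[pending]) ||
               (c == SEMI_VOICED_MARK && SEMI_VOICED[pending])
      held = pending
      pending = nil
      if merged
        out << merged
        next
      end
      out << held # no merge: flush the held kana unchanged
    end

    if (full = HALF_TO_FULL[c])
      pending = full
    else
      out << c
    end
  end
  out << pending if pending # flush a kana held at the end of the input
  out
end
```

Holding one character back keeps the conversion to a single pass and avoids a second scan to glue marks onto the preceding kana.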
data/sig/generated/neologdish/normalizer/version.rbs ADDED
@@ -0,0 +1,7 @@
+ # Generated from lib/neologdish/normalizer/version.rb with RBS::Inline
+
+ module Neologdish
+   module Normalizer
+     VERSION: String
+   end
+ end
data/sig/generated/neologdish/normalizer.rbs ADDED
@@ -0,0 +1,19 @@
+ # Generated from lib/neologdish/normalizer.rb with RBS::Inline
+
+ module Neologdish
+   module Normalizer
+     CONVERSION_MAP: Hash[String, String]
+
+     LATIN_MAP: Hash[String, bool]
+
+     HALF_WIDTH_KANA_MAP: Hash[String, String]
+
+     DAKUON_KANA_MAP: Hash[String, String]
+
+     HANDAKUON_KANA_MAP: Hash[String, String]
+
+     # @rbs str: String
+     # @rbs return: String
+     def normalize: (String str) -> String
+   end
+ end
metadata ADDED
@@ -0,0 +1,57 @@
+ --- !ruby/object:Gem::Specification
+ name: neologdish-normalizer
+ version: !ruby/object:Gem::Version
+   version: 0.0.1
+ platform: ruby
+ authors:
+ - moznion
+ autorequire:
+ bindir: exe
+ cert_chain: []
+ date: 2024-10-28 00:00:00.000000000 Z
+ dependencies: []
+ description: A Japanese text normalization library that follows the conventions of neologd
+   with some performance optimizations. It is designed to preprocess Japanese text
+   before applying NLP techniques.
+ email:
+ - moznion@mail.moznion.net
+ executables: []
+ extensions: []
+ extra_rdoc_files: []
+ files:
+ - ".rubocop.yml"
+ - ".ruby-version"
+ - LICENSE
+ - README.md
+ - Rakefile
+ - lib/neologdish/normalizer.rb
+ - lib/neologdish/normalizer/version.rb
+ - sig/generated/neologdish/normalizer.rbs
+ - sig/generated/neologdish/normalizer/version.rbs
+ homepage: https://github.com/moznion/neologdish-normalizer
+ licenses: []
+ metadata:
+   homepage_uri: https://github.com/moznion/neologdish-normalizer
+   source_code_uri: https://github.com/moznion/neologdish-normalizer
+   changelog_uri: https://github.com/moznion/neologdish-normalizer/releases
+   rubygems_mfa_required: 'true'
+ post_install_message:
+ rdoc_options: []
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 3.0.0
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '0'
+ requirements: []
+ rubygems_version: 3.5.22
+ signing_key:
+ specification_version: 4
+ summary: A Japanese text normalization library that follows the conventions of neologd
+ test_files: []