RubyGems - neologdish-normalizer - Versions diffs - 0.0.1 → 0.2.0 - Mend

neologdish-normalizer 0.0.1 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

checksums.yaml +4 -4
data/.rubocop.yml +2 -3
data/.ruby-version +1 -1
data/README.md +4 -6
data/Rakefile +5 -0
data/lib/neologdish/normalizer/version.rb +1 -1
data/lib/neologdish/normalizer.rb +68 -15
data/renovate.json +6 -0
data/sig/generated/neologdish/normalizer/version.rbs +1 -1
data/sig/generated/neologdish/normalizer.rbs +11 -1
metadata +8 -10

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: f2f96d57b26c4e44d37d47bcabcec5a820470d4e89beb59b709c2dc36ecd449b
-  data.tar.gz: fc7479594872db9c262cd13acf8e4f717936b289ead7fe136c65f0f211d32417
+  metadata.gz: 57bda9b5b4b5de8dc0b6582e3a5eb5d2103d3207a405bb7b977b52e0eff02322
+  data.tar.gz: a35a668f517d31229e370dd4839c911f2e1b16f6c868e8537d9e61f2e44c7eee
 SHA512:
-  metadata.gz: 84457e6fd2b5fae0b87d51a86b0e9a5c4aaf808b436ba97654c0939630fe19142c6bb2eb6729baeb0773882726aacffb62f67139be5c03b8c2404e74ca1b2408
-  data.tar.gz: 312f9993dceed61d3443f7ddb6c06904d49b21a52725be48721bf0683e75111b76b6b3bd53cde1d7a34315adb157f91306a75f3561f584af74d269bd71eb19aa
+  metadata.gz: bf2788e135ccecdfdea6a4be0e52e4e1b151e4fc6c8847844299d2bbc1578c23553f92d6d47bea0e955dff2ddb941f9944e94cbfe420bbe51b733c2094497f6e
+  data.tar.gz: 8da7c4aa8848bef56fd5d612f22e08e0cf33cde2ccc0af829d2394da3f8bd9cb5c2ae38223adcd7d3b871df6b3909f1c550a36beb86f0bdcf6a9ab14c0d65345

data/.rubocop.yml CHANGED

@@ -1,10 +1,9 @@
 require:
-  - rubocop-rake
   - rubocop-minitest
+plugins:
+  - rubocop-rake
 AllCops:
   NewCops: enable
 Metrics:
   Enabled: false
 Layout/LineLength:

data/.ruby-version CHANGED

@@ -1 +1 @@
-3.3
+3.4.4

data/README.md CHANGED

@@ -1,4 +1,4 @@
-# Neologdish::Normalizer for Ruby
+# Neologdish::Normalizer for Ruby [![Check](https://github.com/moznion/neologdish-normalizer-ruby/actions/workflows/check.yml/badge.svg)](https://github.com/moznion/neologdish-normalizer-ruby/actions/workflows/check.yml) [![Gem Version](https://badge.fury.io/rb/neologdish-normalizer.svg)](https://badge.fury.io/rb/neologdish-normalizer)
 A Japanese text normalization library for Ruby follows the conventions of [neologd/mecab-ipadic-neologd](https://github.com/neologd/mecab-ipadic-neologd), with some performance optimizations, without external dependencies. It is designed to preprocess Japanese text before applying NLP techniques.
@@ -27,18 +27,16 @@ The benchmark script is here: [./scripts/benchmark.rb](./scripts/benchmark.rb)
 ## Installation
-TODO: Replace `UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG` with your gem name right after releasing it to RubyGems.org. Please do not do it earlier due to security reasons. Alternatively, replace this section with instructions to install your gem from git if you don't plan to release to RubyGems.org.
 Install the gem and add to the application's Gemfile by executing:
 ```bash
-bundle add UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG
+bundle add 'neologdish-normalizer'
 ```
 If bundler is not being used to manage dependencies, install the gem by executing:
 ```bash
-gem install UPDATE_WITH_YOUR_GEM_NAME_IMMEDIATELY_AFTER_RELEASE_TO_RUBYGEMS_ORG
+gem install 'neologdish-normalizer'
 ```
 ## Development
@@ -49,5 +47,5 @@ To install this gem onto your local machine, run `bundle exec rake install`. To
 ## Contributing
-Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/neologdish-normalizer.
+Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/neologdish-normalizer-ruby.

data/Rakefile CHANGED

@@ -12,6 +12,11 @@ namespace :rbs do
   end
 end
+desc 'run benchmark script'
+task :benchmark do
+  sh 'ruby ./scripts/benchmark.rb'
+end
 Rake::TestTask.new do |task|
   task.libs = %w[lib test]
   task.test_files = FileList['test/**/*.rb']

data/lib/neologdish/normalizer/version.rb CHANGED

@@ -2,6 +2,6 @@
 module Neologdish
   module Normalizer
-    VERSION = '0.0.1' #: String
+    VERSION = '0.2.0'
   end
 end

data/lib/neologdish/normalizer.rb CHANGED

@@ -5,6 +5,9 @@ require_relative 'normalizer/version'
 module Neologdish
   # A Japanese text normalizer module according to the neologd convention.
   module Normalizer
+    NORMALIZED_HYPHEN = "\u002d" # -
+    NORMALIZED_VOWEL = "\u30fc" # ー
     CONVERSION_MAP = {
       # Normalize [0-9a-zA-Z] to half-width
       '０' => '0', '１' => '1', '２' => '2', '３' => '3', '４' => '4', '５' => '5', '６' => '6', '７' => '7', '８' => '8', '９' => '9',
@@ -15,11 +18,31 @@ module Neologdish
       'ｋ' => 'k', 'ｌ' => 'l', 'ｍ' => 'm', 'ｎ' => 'n', 'ｏ' => 'o', 'ｐ' => 'p', 'ｑ' => 'q', 'ｒ' => 'r', 'ｓ' => 's', 'ｔ' => 't',
       'ｕ' => 'u', 'ｖ' => 'v', 'ｗ' => 'w', 'ｘ' => 'x', 'ｙ' => 'y', 'ｚ' => 'z',
       # normalize the hyphen/minus-ish characters to '-'
-      '˗' => '-', '֊' => '-', '‐' => '-', '‑' => '-', '‒' => '-', '–' => '-', '⁃' => '-', '⁻' => '-', '₋' => '-', '−' => '-',
+      "\u02d7" => NORMALIZED_HYPHEN, # ˗
+      "\u058a" => NORMALIZED_HYPHEN, # ֊
+      "\u2010" => NORMALIZED_HYPHEN, # ‐
+      "\u2011" => NORMALIZED_HYPHEN, # ‑
+      "\u2012" => NORMALIZED_HYPHEN, # ‒
+      "\u2013" => NORMALIZED_HYPHEN, # –
+      "\u2043" => NORMALIZED_HYPHEN, # ⁃
+      "\u207b" => NORMALIZED_HYPHEN, # ⁻
+      "\u208b" => NORMALIZED_HYPHEN, # ₋
+      "\u2212" => NORMALIZED_HYPHEN, # −
       # normalize the long-vowel mark-ish characters to 'ー'
-      '﹣' => 'ー', '－' => 'ー', 'ｰ' => 'ー', '—' => 'ー', '―' => 'ー', '─' => 'ー', '━' => 'ー',
+      "\u2014" => NORMALIZED_VOWEL, # —
+      "\u2015" => NORMALIZED_VOWEL, # ―
+      "\u2500" => NORMALIZED_VOWEL, # ─
+      "\u2501" => NORMALIZED_VOWEL, # ━
+      "\ufe63" => NORMALIZED_VOWEL, # ﹣
+      "\uff0d" => NORMALIZED_VOWEL, # －
+      "\uff70" => NORMALIZED_VOWEL, # ｰ
       # remove the tilde-ish characters
-      '~' => '', '∼' => '', '∾' => '', '〜' => '', '〰' => '', '～' => '',
+      "\u007e" => '', # ~
+      "\u223c" => '', # ∼
+      "\u223e" => '', # ∾
+      "\u301c" => '', # 〜
+      "\u3030" => '', # 〰
+      "\uff5e" => '', # ～
       # normalize the full-width special symbol characters (/！”＃＄％＆’（）＊＋，−．／：；＜＞？＠［￥］＾＿｀｛｜｝) and space characters to half-width
       '　' => ' ', '！' => '!', '”' => '"', '＃' => '#', '＄' => '$', '％' => '%', '＆' => '&', '’' => "'", '（' => '(', '）' => ')',
       '＊' => '*', '＋' => '+', '，' => ',', '．' => '.', '／' => '/', '：' => ':', '；' => ';', '＜' => '<', '＞' => '>', '？' => '?',
@@ -53,44 +76,74 @@ module Neologdish
       'ｯ' => 'ッ', 'ｬ' => 'ヤ', 'ｭ' => 'ユ', 'ｮ' => 'ヨ'
     }.freeze #: Hash[String, String]
+    DAKUON_HANDAKUON_POSSIBLES = {
+      'ウ' => true,
+      'カ' => true, 'キ' => true, 'ク' => true, 'ケ' => true, 'コ' => true,
+      'サ' => true, 'シ' => true, 'ス' => true, 'セ' => true, 'ソ' => true,
+      'タ' => true, 'チ' => true, 'ツ' => true, 'テ' => true, 'ト' => true,
+      'ハ' => true, 'ヒ' => true, 'フ' => true, 'ヘ' => true, 'ホ' => true,
+      'う' => true,
+      'か' => true, 'き' => true, 'く' => true, 'け' => true, 'こ' => true,
+      'さ' => true, 'し' => true, 'す' => true, 'せ' => true, 'そ' => true,
+      'た' => true, 'ち' => true, 'つ' => true, 'て' => true, 'と' => true,
+      'は' => true, 'ひ' => true, 'ふ' => true, 'へ' => true, 'ほ' => true
+    }.freeze #: Hash[String, bool]
     DAKUON_KANA_MAP = {
+      'ウ' => 'ヴ',
       'カ' => 'ガ', 'キ' => 'ギ', 'ク' => 'グ', 'ケ' => 'ゲ', 'コ' => 'ゴ',
       'サ' => 'ザ', 'シ' => 'ジ', 'ス' => 'ズ', 'セ' => 'ゼ', 'ソ' => 'ゾ',
       'タ' => 'ダ', 'チ' => 'ヂ', 'ツ' => 'ヅ', 'テ' => 'デ', 'ト' => 'ド',
-      'ハ' => 'バ', 'ヒ' => 'ビ', 'フ' => 'ブ', 'ヘ' => 'ベ', 'ホ' => 'ボ'
+      'ハ' => 'バ', 'ヒ' => 'ビ', 'フ' => 'ブ', 'ヘ' => 'ベ', 'ホ' => 'ボ',
+      'う' => 'ゔ',
+      'か' => 'が', 'き' => 'ぎ', 'く' => 'ぐ', 'け' => 'げ', 'こ' => 'ご',
+      'さ' => 'ざ', 'し' => 'じ', 'す' => 'ず', 'せ' => 'ぜ', 'そ' => 'ぞ',
+      'た' => 'だ', 'ち' => 'ぢ', 'つ' => 'づ', 'て' => 'で', 'と' => 'ど',
+      'は' => 'ば', 'ひ' => 'び', 'ふ' => 'ぶ', 'へ' => 'べ', 'ほ' => 'ぼ'
     }.freeze #: Hash[String, String]
     HANDAKUON_KANA_MAP = {
-      'ハ' => 'パ', 'ヒ' => 'ピ', 'フ' => 'プ', 'ヘ' => 'ペ', 'ホ' => 'ポ'
+      'ハ' => 'パ', 'ヒ' => 'ピ', 'フ' => 'プ', 'ヘ' => 'ペ', 'ホ' => 'ポ',
+      'は' => 'ぱ', 'ひ' => 'ぴ', 'ふ' => 'ぷ', 'へ' => 'ぺ', 'ほ' => 'ぽ'
     }.freeze #: Hash[String, String]
-    private_constant :CONVERSION_MAP, :LATIN_MAP, :HALF_WIDTH_KANA_MAP, :DAKUON_KANA_MAP, :HANDAKUON_KANA_MAP
+    private_constant :CONVERSION_MAP, :LATIN_MAP, :HALF_WIDTH_KANA_MAP, :DAKUON_KANA_MAP, :HANDAKUON_KANA_MAP, :DAKUON_HANDAKUON_POSSIBLES,
+                     :NORMALIZED_HYPHEN, :NORMALIZED_VOWEL
     # Normalize the given text.
     #
     # @rbs str: String
+    # @rbs override_conversion_map: Hash[String, String]
     # @rbs return: String
-    def normalize(str)
+    def normalize(str, override_conversion_map = {})
+      conversion_map = CONVERSION_MAP.merge(override_conversion_map)
       squeezee = ''
       prev_latin = false
       whitespace_encountered = false
-      encountered_half_width_kana = nil
+      dakuon_handakuon_possible = nil
       normalized = str.chars.map do |c|
         prefix = ''
-        c = CONVERSION_MAP[c] || c
+        c = conversion_map[c] || c
         # normalize the Half-width kana to full-width
-        if encountered_half_width_kana
-          if (c == 'ﾞ' && (k = DAKUON_KANA_MAP[encountered_half_width_kana])) ||
-             (c == 'ﾟ' && (k = HANDAKUON_KANA_MAP[encountered_half_width_kana]))
+        if dakuon_handakuon_possible
+          if (["\u309b", "\u3099", "\uff9e"].include?(c) && (k = DAKUON_KANA_MAP[dakuon_handakuon_possible])) ||
+             (["\u309c", "\u309a", "\uff9f"].include?(c) && (k = HANDAKUON_KANA_MAP[dakuon_handakuon_possible]))
             c = ''
             prefix = k
           else
-            prefix = encountered_half_width_kana
+            prefix = dakuon_handakuon_possible
           end
         end
         if (encountered_half_width_kana = HALF_WIDTH_KANA_MAP[c])
+          c = encountered_half_width_kana
+        end
+        dakuon_handakuon_possible = nil
+        if DAKUON_HANDAKUON_POSSIBLES[c]
+          dakuon_handakuon_possible = c
           c = ''
         end
@@ -112,12 +165,12 @@ module Neologdish
           c = ''
         else
           prefix = ' ' if is_latin && whitespace_encountered
-          whitespace_encountered = false
+          whitespace_encountered &&= c == '' # take care for consecutive spaces on the right side
         end
         prev_latin = is_latin
         prefix + c
-      end.join + (encountered_half_width_kana || '')
+      end.join + (dakuon_handakuon_possible || '')
       normalized.strip
     end
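
The hunk above extends dakuten/handakuten folding from half-width katakana to hiragana and full-width katakana, and accepts three code points for each mark (spacing, combining, and half-width forms). The following is only a minimal sketch of that folding step, not the gem's implementation: the helper name `compose_voiced_kana` and the constant names are hypothetical, the maps are small subsets of `DAKUON_KANA_MAP` and `HANDAKUON_KANA_MAP`, and the half-width kana conversion, Latin spacing, and tilde handling are left out.

```ruby
# Illustrative subsets of the gem's DAKUON_KANA_MAP / HANDAKUON_KANA_MAP.
DAKUON_SUBSET = { 'ウ' => 'ヴ', 'カ' => 'ガ', 'ハ' => 'バ', 'か' => 'が', 'は' => 'ば' }.freeze
HANDAKUON_SUBSET = { 'ハ' => 'パ', 'は' => 'ぱ' }.freeze

# The mark code points that the 0.2.0 diff checks for.
DAKUTEN_MARKS = ["\u309b", "\u3099", "\uff9e"].freeze    # spacing, combining, half-width
HANDAKUTEN_MARKS = ["\u309c", "\u309a", "\uff9f"].freeze # spacing, combining, half-width

# Hypothetical helper: fold each "base kana + mark" pair into one precomposed kana.
# A mark that follows a character with no voiced form is passed through unchanged.
def compose_voiced_kana(str)
  out = []
  str.each_char do |c|
    prev = out.last
    composed =
      if prev && DAKUTEN_MARKS.include?(c) then DAKUON_SUBSET[prev]
      elsif prev && HANDAKUTEN_MARKS.include?(c) then HANDAKUON_SUBSET[prev]
      end
    if composed
      out[-1] = composed # replace the base kana with its voiced form
    else
      out << c
    end
  end
  out.join
end

puts compose_voiced_kana("か\u3099")  # hiragana ka + combining dakuten
puts compose_voiced_kana("ハ\uff9f")  # katakana ha + half-width handakuten
```

Unlike this sketch, the gem defers emitting a candidate kana until it has seen the next character (the `dakuon_handakuon_possible` variable), which is why it appends any pending kana after the `join`.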

data/renovate.json ADDED

@@ -0,0 +1,6 @@
+{
+  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
+  "extends": [
+    "config:recommended"
+  ]
+}

data/sig/generated/neologdish/normalizer/version.rbs CHANGED

@@ -2,6 +2,6 @@
 module Neologdish
   module Normalizer
-    VERSION: String
+    VERSION: ::String
   end
 end

data/sig/generated/neologdish/normalizer.rbs CHANGED

@@ -1,19 +1,29 @@
 # Generated from lib/neologdish/normalizer.rb with RBS::Inline
 module Neologdish
+  # A Japanese text normalizer module according to the neologd convention.
   module Normalizer
+    NORMALIZED_HYPHEN: ::String
+    NORMALIZED_VOWEL: ::String
     CONVERSION_MAP: Hash[String, String]
     LATIN_MAP: Hash[String, bool]
     HALF_WIDTH_KANA_MAP: Hash[String, String]
+    DAKUON_HANDAKUON_POSSIBLES: Hash[String, bool]
     DAKUON_KANA_MAP: Hash[String, String]
     HANDAKUON_KANA_MAP: Hash[String, String]
+    # Normalize the given text.
+    #
     # @rbs str: String
+    # @rbs override_conversion_map: Hash[String, String]
     # @rbs return: String
-    def normalize: (String str) -> String
+    def self?.normalize: (String str, ?Hash[String, String] override_conversion_map) -> String
   end
 end
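
The new signature makes `override_conversion_map` an optional second argument, and the implementation diff merges it over the frozen `CONVERSION_MAP` before the per-character lookup. The sketch below isolates only that merge-and-lookup step, assuming a three-entry stand-in map; `DEFAULT_MAP` and `convert_chars` are hypothetical names, not part of the gem.

```ruby
# Hypothetical stand-in for the gem's private CONVERSION_MAP (three entries only).
DEFAULT_MAP = {
  "\uff11" => '1', # １
  "\u2212" => '-', # −
  "\u007e" => ''   # ~ (removed by default)
}.freeze

# Hypothetical helper: mirrors only the merge-and-lookup step shown in the diff.
def convert_chars(str, override_map = {})
  # Hash#merge returns a new Hash, so the frozen default is never mutated.
  map = DEFAULT_MAP.merge(override_map)
  # An empty-string value is truthy in Ruby, so '' still means "remove this character".
  str.chars.map { |c| map[c] || c }.join
end

puts convert_chars("\uff11~2")                 # tilde removed by the default map
puts convert_chars("\uff11~2", { '~' => '~' }) # tilde kept via the override
```

Because the lookup uses `||`, an override value of `nil` would not remove a mapping in this sketch; it would fall back to the original character.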

metadata CHANGED

@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: neologdish-normalizer
 version: !ruby/object:Gem::Version
-  version: 0.0.1
+  version: 0.2.0
 platform: ruby
 authors:
 - moznion
-autorequire:
 bindir: exe
 cert_chain: []
-date: 2024-10-28 00:00:00.000000000 Z
+date: 1980-01-02 00:00:00.000000000 Z
 dependencies: []
 description: A Japanese text normalization library follows the conventions of neologd
   with some performance optimizations. It is designed to preprocess Japanese text
@@ -26,16 +25,16 @@ files:
 - Rakefile
 - lib/neologdish/normalizer.rb
 - lib/neologdish/normalizer/version.rb
+- renovate.json
 - sig/generated/neologdish/normalizer.rbs
 - sig/generated/neologdish/normalizer/version.rbs
-homepage: https://github.com/moznion/neologdish-normalizer
+homepage: https://github.com/moznion/neologdish-normalizer-ruby
 licenses: []
 metadata:
-  homepage_uri: https://github.com/moznion/neologdish-normalizer
-  source_code_uri: https://github.com/moznion/neologdish-normalizer
-  changelog_uri: https://github.com/moznion/neologdish-normalizer/releases
+  homepage_uri: https://github.com/moznion/neologdish-normalizer-ruby
+  source_code_uri: https://github.com/moznion/neologdish-normalizer-ruby
+  changelog_uri: https://github.com/moznion/neologdish-normalizer-ruby/releases
   rubygems_mfa_required: 'true'
-post_install_message:
 rdoc_options: []
 require_paths:
 - lib
@@ -50,8 +49,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.22
-signing_key:
+rubygems_version: 3.6.7
 specification_version: 4
 summary: A Japanese text normalization library follows the conventions of neologd
 test_files: []