unihan_lang 0.1.0 → 0.2.0 (RubyGems version diff, Mend)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (12)

checksums.yaml +4 -4
data/LICENSE.md +24 -0
data/README.ja.md +85 -0
data/README.md +54 -23
data/lib/unihan_lang/chinese_processor.rb +12 -16
data/lib/unihan_lang/version.rb +1 -1
data/lib/unihan_lang.rb +0 -1
data/unihan_lang.gemspec +2 -2
metadata +7 -8
data/data/traditional_chinese_list.txt +0 -6017
data/test.rb +0 -58
data/traditional_characters.txt +0 -0

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bc3bac523b20f37850e5d1e43587a736cb1922bb8fc0ea01a39b3c534d257045
-  data.tar.gz: f29a6426a0f23ac69d4865a0988586b628c7dd10869adafff412b0f4daabfafb
+  metadata.gz: 81d1394f3bee01c607c5440f3682b6401c9aa1d1ff9ba5ad286c196f8eebc54b
+  data.tar.gz: 8be0a8218adbe226b8f2079a893200912e8cbcba6b2c00aa405b0b886dd01f50
 SHA512:
-  metadata.gz: 490b594addd8d6517bbbba34da1c3c7528c12e8a5a914101ccd5c0e71ce8fe0f406e282bf29ab0281dc8317ee7d240a58ccdfc3928911d009e5859147474081c
-  data.tar.gz: 13b72769f0e4e10f1e02c56659f56a39aa0f205a82312eeb14be0f431f36e5b8794c312d98306eeea063c56d5f1c711abc829539c83256d7772556eb1fb54d3b
+  metadata.gz: 01ae746510cad08ab38db9f21752049e7fd99141fdc6886f6a93a2c521feb0554b47fafa8d7d3bfa4fd6aee560bbd1dfa3a08ad0ac477e399583971fa7258ebe
+  data.tar.gz: 302c234bc6616021682ff2b9f86df610f24bc9bd7894e6107369d44a5e2e4164738130ec31d785f45485b5e508fc03934cd50bb5bea52d9423299ca5679e182c
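
The digests above cover the `metadata.gz` and `data.tar.gz` members of the `.gem` archive (a `.gem` file is a tar archive). A minimal sketch of how one might recompute them for comparison, assuming the members have already been extracted to local files; the helper name `digests_for` is hypothetical and not part of the gem:

```ruby
require "digest"

# Hypothetical helper: compute the two digest types listed in checksums.yaml
# for a single file on disk.
def digests_for(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end
```

Calling `digests_for` on an extracted `data.tar.gz` and comparing the result with the `+` lines above would be one way to confirm that a downloaded 0.2.0 package matches this diff.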

data/LICENSE.md ADDED

@@ -0,0 +1,24 @@
+<!-- @format -->
+# The MIT License (MIT)
+Copyright 2024 kyubey1228
+---
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the “Software”), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.ja.md ADDED

@@ -0,0 +1,85 @@
+<!-- @format -->
+# UnihanLang
+`unihan_lang` は、テキストの言語（繁体字中国語、簡体字中国語）を識別し、中国語の文字に関する様々な判定を行うための Ruby ライブラリです。
+## インストール
+Gemfile に以下の行を追加してください：
+```ruby
+gem 'unihan_lang'
+```
+そして、以下のコマンドを実行してください：
+```sh
+bundle install
+```
+または、直接インストールする場合は以下のコマンドを使用してください：
+```sh
+gem install unihan_lang
+```
+## 使用方法
+```ruby
+require 'unihan_lang'
+unihan = UnihanLang::Unihan.new
+# 言語の判定
+puts unihan.determine_language("這是繁體中文") # => "ZH_TW"
+puts unihan.determine_language("这是简体中文") # => "ZH_CN"
+# 繁体字中国語かどうかの判定
+puts unihan.zh_tw?("這是繁體中文") # => true
+puts unihan.zh_tw?("这不是繁体中文") # => false
+# 簡体字中国語かどうかの判定
+puts unihan.zh_cn?("这是简体中文") # => true
+puts unihan.zh_cn?("這不是簡體中文") # => false
+# テキストに中国語の文字が含まれているかの判定
+puts unihan.contains_chinese?("This text contains 中文") # => true
+puts unihan.contains_chinese?("This text has no Chinese") # => false
+# テキストから中国語の文字を抽出
+puts unihan.extract_chinese_characters("This text contains 中文").join # => "中文"
+# 繁体字のみで構成されているかの判定
+puts unihan.only_zh_tw?("繁體") # => true
+puts unihan.only_zh_tw?("繁體简体") # => false
+# 簡体字のみで構成されているかの判定
+puts unihan.only_zh_cn?("简体") # => true
+puts unihan.only_zh_cn?("简体繁體") # => false
+# 繁体字を含むかどうかの判定
+puts unihan.contains_zh_tw?("這個text包含繁體字") # => true
+puts unihan.contains_zh_tw?("这个text不包含繁体字") # => false
+# 簡体字を含むかどうかの判定
+puts unihan.contains_zh_cn?("这个text包含简体字") # => true
+puts unihan.contains_zh_cn?("這個text不包含簡體字") # => false
+```
+## 機能説明
+- `determine_language(text)`: テキストの言語を判定します（"ZH_TW", "ZH_CN", "Unknown"）。
+- `zh_tw?(text)`: テキストが繁体字中国語かどうかを判定します。
+- `zh_cn?(text)`: テキストが簡体字中国語かどうかを判定します。
+- `contains_chinese?(text)`: テキストに中国語の文字が含まれているかを判定します。
+- `extract_chinese_characters(text)`: テキストから中国語の文字を抽出します。
+- `only_zh_tw?(text)`: テキストが繁体字のみで構成されているかを判定します。
+- `only_zh_cn?(text)`: テキストが簡体字のみで構成されているかを判定します。
+- `contains_zh_tw?(text)`: テキストに繁体字が含まれているかを判定します。
+- `contains_zh_cn?(text)`: テキストに簡体字が含まれているかを判定します。
+## 注意事項
+このライブラリは、テキストの言語を完全に正確に判定することを保証するものではありません。
+特に、短いテキストや複数の言語が混在するテキストの場合、判定が難しい場合があります。

data/README.md CHANGED

@@ -2,56 +2,87 @@
 # UnihanLang
-UnihanLang は、テキストの言語（日本語、繁体字中国語、簡体字中国語）を識別するための Ruby ライブラリです。
+`unihan_lang` is a Ruby library for identifying text language (Traditional Chinese, Simplified Chinese) and performing various checks on Chinese characters.
-## インストール
+This document can also be read in [Japanese](https://github.com/kyubey1228/unihan_lang/blob/master/README.ja.md).
-Gemfile に以下の行を追加してください：
+## Installation
+Add this line to your application's Gemfile:
 ```ruby
 gem 'unihan_lang'
 ```
-そして、以下のコマンドを実行してください：
+And then execute:
 ```sh
 bundle install
 ```
-または、直接インストールする場合は以下のコマンドを使用してください：
+Or install it yourself as:
 ```sh
 gem install unihan_lang
 ```
-## 使用方法
+## Usage
 ```ruby
 require 'unihan_lang'
 unihan = UnihanLang::Unihan.new
-# 言語の判定
-puts unihan.determine_language("這是繁體中文")      # => "ZH_TW"
-puts unihan.determine_language("这是简体中文")      # => "ZH_CN"
+# Language determination
+puts unihan.determine_language("這是繁體中文") # => "ZH_TW"
+puts unihan.determine_language("这是简体中文") # => "ZH_CN"
+# Check if text is Traditional Chinese
+puts unihan.zh_tw?("這是繁體中文") # => true
+puts unihan.zh_tw?("这不是繁体中文") # => false
+# Check if text is Simplified Chinese
+puts unihan.zh_cn?("这是简体中文") # => true
+puts unihan.zh_cn?("這不是簡體中文") # => false
+# Check if text contains Chinese characters
+puts unihan.contains_chinese?("This text contains 中文") # => true
+puts unihan.contains_chinese?("This text has no Chinese") # => false
-# 繁体字中国語かどうかの判定
-puts unihan.zh_tw?("這是繁體中文")  # => true
-puts unihan.zh_tw?("这不是繁体中文")  # => false
+# Extract Chinese characters from text
+puts unihan.extract_chinese_characters("This text contains 中文").join # => "中文"
-# 簡体字中国語かどうかの判定
-puts unihan.zh_cn?("这是简体中文")  # => true
-puts unihan.zh_cn?("這不是簡體中文")  # => false
+# Check if text consists only of Traditional Chinese characters
+puts unihan.only_zh_tw?("繁體") # => true
+puts unihan.only_zh_tw?("繁體简体") # => false
-# テキストに中国語の文字が含まれているかの判定
-puts unihan.contains_chinese?("This text contains 中文")  # => true
-puts unihan.contains_chinese?("This text has no Chinese")  # => false
+# Check if text consists only of Simplified Chinese characters
+puts unihan.only_zh_cn?("简体") # => true
+puts unihan.only_zh_cn?("简体繁體") # => false
-# テキストから中国語の文字を抽出
-puts unihan.extract_chinese_characters("This text contains 中文").join  # => "中文"
+# Check if text contains Traditional Chinese characters
+puts unihan.contains_zh_tw?("這個text包含繁體字") # => true
+puts unihan.contains_zh_tw?("这个text不包含繁体字") # => false
+# Check if text contains Simplified Chinese characters
+puts unihan.contains_zh_cn?("这个text包含简体字") # => true
+puts unihan.contains_zh_cn?("這個text不包含簡體字") # => false
 ```
-## 注意事項
+## Features
+- `determine_language(text)`: Determines the language of the text ("ZH_TW", "ZH_CN", "JA", "Unknown").
+- `zh_tw?(text)`: Checks if the text is in Traditional Chinese.
+- `zh_cn?(text)`: Checks if the text is in Simplified Chinese.
+- `contains_chinese?(text)`: Checks if the text contains Chinese characters.
+- `extract_chinese_characters(text)`: Extracts Chinese characters from the text.
+- `only_zh_tw?(text)`: Checks if the text consists only of Traditional Chinese characters.
+- `only_zh_cn?(text)`: Checks if the text consists only of Simplified Chinese characters.
+- `contains_zh_tw?(text)`: Checks if the text contains Traditional Chinese characters.
+- `contains_zh_cn?(text)`: Checks if the text contains Simplified Chinese characters.
+## Note
-このライブラリは、テキストの言語を完全に正確に判定することを保証するものではありません。
-特に、短いテキストや複数の言語が混在するテキストの場合、判定が難しい場合があります。
+This library does not guarantee 100% accuracy in language identification.
+Particularly for short texts or texts containing multiple languages, determination may be challenging.
+The distinction between Traditional and Simplified Chinese is based on the Unihan database.

data/lib/unihan_lang/chinese_processor.rb CHANGED

@@ -43,7 +43,6 @@ module UnihanLang
     def load_chinese_characters
       load_unihan_variants
-      load_traditional_chinese_list
       process_character_sets
     end
@@ -58,31 +57,28 @@ module UnihanLang
     end
     def process_unihan_fields(fields)
-      char = [fields[0].gsub(/^U\+/, "").hex].pack("U")
+      from = [fields[0].gsub(/^U\+/, "").hex].pack("U")
       # Remove dictionary name.
       # Example: U+348B kSemanticVariant U+5EDD<kMatthews U+53AE<kMatthews
-      variant = [fields[2].split("<")[0].gsub(/^U\+/, "").hex].pack("U")
+      to = [fields[2].split("<")[0].gsub(/^U\+/, "").hex].pack("U")
       case fields[1]
       when "kTraditionalVariant"
-        @zh_tw << variant
-        @zh_cn << char
+        @zh_cn << from
+        @zh_tw << to
       when "kSimplifiedVariant"
-        @zh_cn << variant
-        @zh_tw << char
+        @zh_tw << from
+        @zh_cn << to
       end
     end
-    def load_traditional_chinese_list
-      file_path = File.join(File.dirname(__FILE__), "..", "..", "data",
-                            "traditional_chinese_list.txt")
-      File.foreach(file_path, encoding: "UTF-8") { |line| @zh_tw << line.strip }
-    end
     def process_character_sets
+      # There are same code point both zh_tw and zh_cn in Unihan_Variants.txt.
+      # Example: 台(U+53F0)
+      # U+53F0	kSimplifiedVariant	U+53F0
+      # U+53F0	kTraditionalVariant	U+53F0 U+6AAF U+81FA U+98B1
       @common = @zh_tw & @zh_cn
-      @zh_tw -= @zh_cn
-      @zh_cn -= @zh_tw
-      @zh_cn |= @common
+      @zh_tw -= @common
+      @zh_cn -= @common
     end
   end
 end
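
The hunks above make three changes: the separate traditional-character list is no longer loaded, `char`/`variant` are renamed to `from`/`to`, and characters found in both sets are now removed from both `@zh_tw` and `@zh_cn` (0.1.0 folded them back into `@zh_cn`). The following is a rough sketch of that flow, not the gem's actual code. The helper names `codepoint_to_char`, `classify_line` and `partition` are hypothetical, the use of `Set` and of tab-separated fields is an assumption, and the third sample line is only an illustrative line in the same format:

```ruby
require "set"

# Hypothetical helper: turn a token such as "U+53F0" into its character.
def codepoint_to_char(token)
  [token.sub(/\AU\+/, "").hex].pack("U")
end

# Classify one line of Unihan_Variants.txt into the two sets.
# Assumes tab-separated fields: source, field name, target(s).
def classify_line(line, zh_tw, zh_cn)
  fields = line.strip.split("\t")
  return if fields.size < 3

  from = codepoint_to_char(fields[0])
  # Keep the first target only and drop any "<kMatthews"-style suffix.
  to = codepoint_to_char(fields[2].split(" ").first.split("<").first)

  case fields[1]
  when "kTraditionalVariant"
    zh_cn << from # the source character is the simplified form
    zh_tw << to
  when "kSimplifiedVariant"
    zh_tw << from # the source character is the traditional form
    zh_cn << to
  end
end

# 0.2.0 behaviour: shared characters belong to neither exclusive set.
def partition(zh_tw, zh_cn)
  common = zh_tw & zh_cn
  { tw_only: zh_tw - common, cn_only: zh_cn - common, common: common }
end

zh_tw = Set.new
zh_cn = Set.new
[
  "U+53F0\tkSimplifiedVariant\tU+53F0",
  "U+53F0\tkTraditionalVariant\tU+53F0 U+6AAF U+81FA U+98B1",
  "U+4F53\tkTraditionalVariant\tU+9AD4" # illustrative line, same format
].each { |line| classify_line(line, zh_tw, zh_cn) }

sets = partition(zh_tw, zh_cn)
puts "common:  #{sets[:common].to_a.join}"
puts "tw only: #{sets[:tw_only].to_a.join}"
puts "cn only: #{sets[:cn_only].to_a.join}"
```

In this sketch U+53F0 lands in `common` because it appears as both source and target, which is the case the new code comment in `process_character_sets` describes. Like the original expression in the hunk, the sketch reads only the first target code point when a line lists several.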

data/lib/unihan_lang/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module UnihanLang
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

data/lib/unihan_lang.rb CHANGED

@@ -43,7 +43,6 @@ module UnihanLang
     def determine_language(text)
       case language_ratio(text)
-      when :ja then "JA"
       when :tw then "ZH_TW"
       when :cn then "ZH_CN"
       else "Unknown"

data/unihan_lang.gemspec CHANGED

@@ -9,8 +9,8 @@ Gem::Specification.new do |spec|
   spec.authors       = ["kyubey1228"]
   spec.email         = ["kyuuka1228@gmail.com"]
-  spec.summary       = "Language detection for Chinese and Japanese characters"
-  spec.description   = "A gem to detect and differentiate between Traditional Chinese, Simplified Chinese, and Japanese characters based on Unihan data."
+  spec.summary       = "Language detection for Chinese characters"
+  spec.description   = "A gem to detect and differentiate between Traditional Chinese, Simplified Chinese based on Unihan data."
   spec.homepage      = "https://github.com/kyubey1228/unihan_lang"
   spec.license       = "MIT"
   spec.required_ruby_version = Gem::Requirement.new(">= 2.5.0")

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: unihan_lang
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - kyubey1228
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2024-09-08 00:00:00.000000000 Z
+date: 2024-10-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
@@ -53,7 +53,7 @@ dependencies:
       - !ruby/object:Gem::Version
         version: '3.0'
 description: A gem to detect and differentiate between Traditional Chinese, Simplified
-  Chinese, and Japanese characters based on Unihan data.
+  Chinese based on Unihan data.
 email:
 - kyuuka1228@gmail.com
 executables: []
@@ -66,15 +66,14 @@ files:
 - ".rubocop.yml"
 - Gemfile
 - Gemfile.lock
+- LICENSE.md
+- README.ja.md
 - README.md
 - Rakefile
 - data/Unihan_Variants.txt
-- data/traditional_chinese_list.txt
 - lib/unihan_lang.rb
 - lib/unihan_lang/chinese_processor.rb
 - lib/unihan_lang/version.rb
-- test.rb
-- traditional_characters.txt
 - unihan_lang.gemspec
 homepage: https://github.com/kyubey1228/unihan_lang
 licenses:
@@ -95,8 +94,8 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.5.11
+rubygems_version: 3.5.3
 signing_key:
 specification_version: 4
-summary: Language detection for Chinese and Japanese characters
+summary: Language detection for Chinese characters
 test_files: []