kabosu 0.6.10.dev.20260225.4c46cc6 → 0.6.10.dev.20260225.c3c6711

Files changed (4)
  1. checksums.yaml +4 -4
  2. data/README.md +32 -46
  3. data/lib/kabosu/version.rb +1 -1
  4. metadata +1 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 777e090671b61cde329346c6179b4fb89373d52f93f89dff9cacbf3fe31806b7
-   data.tar.gz: 17f5680cbcf45ad4cbcdac0c7c8e28827fd86ec480d999827eb2da1a192e75fd
+   metadata.gz: 3d02b7f222732b36afe4c3da668cc8aa347a00e53d57dc59905dbb40f60cb937
+   data.tar.gz: f105c63fda73d071b4e659891afa868656cb299ee7e04b6c53e0595a34634751
  SHA512:
-   metadata.gz: ca0311c2bc431ffb2cfb4a144933db1f59530624aa29f87106ed8f97ed87f5bc858debe4168ab6a5af4b693627a2eb72b25060aaa2be5778f367b7830b8a9614
-   data.tar.gz: e31111ad3f7ce865d667702b977db999dc4519c8f894ef879ab9191ccf6bc62dc6c65d6233f79e380f76b697c9f23451ec248a01bbc8568e38fdd7c5b79956ae
+   metadata.gz: e776b3e802b8a6d6a9439367010b81580cb36d59069406d66ee2746ed059b089004e08d62d962b89d787dcfbe30463ef6ac5fba5128f469f944ef6f662e4fe37
+   data.tar.gz: a26907ac57248903d6531d62b35888942dbdd15cdb488a64fa6e782c18ffc7419b0ad52153c0e1656242c7fe9d671d6d80a8ea9fcc9fa8c7059d3ec24ec846ff
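The `checksums.yaml` entries above are hex digests of the two archives packed inside the `.gem` file. A minimal sketch of how such digests could be recomputed with Ruby's standard `digest` library follows. The helper names `archive_digests` and `digest_matches?` are hypothetical, and the sketch assumes `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` (which is a plain tar archive).

```ruby
require "digest/sha2"

# Hypothetical helper: returns the SHA256 and SHA512 hex digests of a file,
# keyed the same way as the top-level sections of checksums.yaml.
def archive_digests(path)
  {
    "SHA256" => Digest::SHA256.file(path).hexdigest,
    "SHA512" => Digest::SHA512.file(path).hexdigest
  }
end

# Hypothetical helper: compares a recomputed digest against the value
# published in checksums.yaml for the given algorithm.
def digest_matches?(path, algorithm, expected_hex)
  archive_digests(path).fetch(algorithm) == expected_hex
end
```

For example, `digest_matches?("data.tar.gz", "SHA256", "f105c63f...")` (with the full 64-character value from the `+` line) would be expected to return `true` for the archive extracted from the newer version.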
data/README.md CHANGED
@@ -111,69 +111,41 @@ tok_c.tokenize("東京都").surfaces # => ["東京都"]
  Modes are symbols only (`:a`, `:b`, `:c` or `Kabosu::MODE_A/B/C`).
  Invalid modes now raise `ArgumentError` (for example, `"A"`).

- ## Dictionary and Tokenizer Internal API
-
- For more control over dictionary and tokenizer configuration, create them directly:
+ ## Advanced Use Cases

  ```ruby
+ # Custom system dictionary + optional user dictionaries
  dict = Kabosu::Dictionary.new(
    system_dict: "/path/to/custom/system.dic",
    user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
  )
+
+ # Create tokenizer with explicit mode/fields
  tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])

- morphemes = tokenizer.tokenize("国会議事堂前駅")
- # MorphemeList is lazy: morphemes are hydrated on first indexed/iterated access.
- # surfaces uses a fast path and does not force full morpheme hydration.
- morphemes.surfaces
- morphemes.first.part_of_speech
+ # Tokenize (returns MorphemeList; lazily hydrates morphemes)
+ list = tokenizer.tokenize("国会議事堂前駅")
+ list.surfaces
+ list.first.part_of_speech

- # Lexicon lookup (prefix matches from position 0), returns MorphemeList
+ # Dictionary prefix lookup
  dict.lookup("東京都").surfaces

- # Morpheme split returns MorphemeList
+ # Morpheme split
  m = tokenizer.tokenize("東京都").first
  m.split(mode: :a).surfaces

- # Native bulk extraction helpers (fewer Ruby<->Rust crossings)
+ # Bulk extractors
  tokenizer.tokenize_surfaces("東京都に住んでいる")
  tokenizer.tokenize_readings("東京都に住んでいる")
  tokenizer.tokenize_dictionary_forms("東京都に住んでいる")
  tokenizer.tokenize_normalized_forms("東京都に住んでいる")

- # Sentence splitting options
+ # Sentence splitting
  Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
  Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)
-
- # ranges: true returns SentenceRange objects
- ranges = Kabosu.split_sentences("東京都に住んでいる。", ranges: true)
- ranges.first.start
- ranges.first.end
- ranges.first.text
  ```

- Dictionary initialization failures raise typed errors:
- - `Kabosu::ConfigError` for configuration issues
- - `Kabosu::DictionaryError` for dictionary loading issues
-
- Runtime failures in analysis APIs are also typed:
- - `Kabosu::TokenizationError` for tokenization/split failures
- - `Kabosu::SentenceSplitError` for sentence splitter failures
- - `Kabosu::LookupError` for dictionary lookup failures
-
- ## Public API Contract
-
- | API | Parameters | Return | Notes |
- |---|---|---|---|
- | `Kabosu::Dictionary.new` | `config: String?`, `system_dict: String?`, `user_dicts: Array<String>?` | `Kabosu::Dictionary` | One of `config` or `system_dict` is required |
- | `Dictionary#create` | `mode: :a|:b|:c`, `fields: Array<String\|Symbol>?`, `debug: bool`, `projection: nil` | `Kabosu::Tokenizer` | Unknown kwargs raise `ArgumentError`; `projection` currently raises `NotImplementedError` |
- | `Dictionary#lookup` | `text: String` | `Kabosu::MorphemeList` | Prefix lookup from byte offset 0 |
- | `Tokenizer#tokenize` | `text: String` | `Kabosu::MorphemeList` | Lazy morpheme hydration; raises `Kabosu::TokenizationError` on native failures |
- | `Tokenizer#tokenize_surfaces/readings/dictionary_forms/normalized_forms` | `text: String` | `Array<String>` | Raises `Kabosu::TokenizationError` on native failures |
- | `Morpheme#split` | `mode: :a|:b|:c`, `add_single: bool` | `Kabosu::MorphemeList` | Standardized with `tokenize` return type |
- | `Kabosu.split_sentences` | `text: String`, `limit: Integer?`, `with_checker: bool`, `ranges: bool`, `dictionary: String?` | `Array<String>` or `Array<Kabosu::SentenceRange>` | `limit` must be `>= 1` |
- | `Kabosu.tokenize` | `text: String`, `tokenizer: Kabosu::Tokenizer` | `Kabosu::MorphemeList` | No hidden global tokenizer cache |
-
  ## Benchmarks

  Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw [sudachi.rs](https://github.com/WorksApplications/sudachi.rs).
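The README text in this hunk describes `MorphemeList` as lazy: morphemes are hydrated on first indexed or iterated access, while `surfaces` takes a fast path that avoids hydration. The pattern can be sketched in plain Ruby as below. `LazyList` is a hypothetical stand-in written for illustration only; it is not Kabosu's implementation, and the hash it builds per item is an assumed placeholder for a real morpheme object.

```ruby
# Hypothetical stand-in for a lazily hydrated list (not Kabosu's code).
class LazyList
  include Enumerable

  # Number of times the full item array has been built.
  attr_reader :hydrations

  def initialize(surfaces)
    @surfaces = surfaces
    @items = nil
    @hydrations = 0
  end

  # Fast path: returns the surface strings without building item objects.
  def surfaces
    @surfaces.dup
  end

  # Indexed access forces hydration.
  def [](index)
    hydrate[index]
  end

  # Iteration (and Enumerable methods such as #first) forces hydration.
  def each(&block)
    hydrate.each(&block)
  end

  private

  # Builds the item objects once and memoizes them.
  def hydrate
    @items ||= begin
      @hydrations += 1
      @surfaces.map { |surface| { surface: surface } }
    end
  end
end
```

In this sketch, calling `surfaces` leaves `hydrations` at zero, and the first call to `first`, `each`, or `[]` raises it to one, where it stays for later accesses.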
@@ -184,15 +156,29 @@ This benchmark uses [Wagahai wa Neko de Aru](https://www.aozora.gr.jp/cards/0001

  Measured on an AMD Ryzen 7 5800X, `full` dictionary edition, Ruby 3.4, Rust 1.84:

+ Single-thread (10 iterations):
+
  | Scenario | Rust | Ruby | Ratio |
  |---|---|---|---|
- | split_sentences | 1.597s | 1.677s | 1.0x |
- | tokenize (mode C) | 3.274s | 4.034s | 1.2x |
- | tokenize (mode A) | 3.429s | 4.273s | 1.2x |
- | tokenize (mode B) | 3.465s | 4.297s | 1.2x |
- | **Throughput** | **2.66 MB/s** | **2.18 MB/s** | **1.2x** |
+ | split_sentences | 1.550s | 1.615s | 1.0x |
+ | tokenize (mode C) | 3.148s | 3.395s | 1.1x |
+ | tokenize (mode A) | 3.227s | 3.525s | 1.1x |
+ | tokenize (mode B) | 3.226s | 3.582s | 1.1x |
+ | **Throughput** | **2.94 MB/s** | **2.69 MB/s** | **1.1x** |

- The Ruby bindings add ~20% overhead over raw Rust, primarily from FFI boundary crossings and Ruby object allocation for each morpheme.
+ Multithread (8 threads x 20,000 requests):
+
+ | Scenario | Rust | Ruby | Ratio |
+ |---|---|---|---|
+ | rails-style shared tokenizer | 1.475s | 2.101s | 1.4x |
+ | tokenizer per thread | 1.381s | 2.154s | 1.6x |
+ | **Throughput ST** | **20.44 MB/s** | **14.35 MB/s** | **1.4x** |
+ | **Throughput PT** | **21.84 MB/s** | **14.00 MB/s** | **1.6x** |
+
+ Notes:
+ - `shared tokenizer` matches Rails-style access where all request threads call one tokenizer instance.
+ - `per thread` creates one tokenizer per worker thread.
+ - Ratios are `Ruby / Rust`, and values vary by CPU, Ruby version, and dictionary edition.

  To reproduce these results, run:

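The added benchmark notes define the Ratio column as `Ruby / Rust`. As a quick check of the timing rows, the short sketch below recomputes that ratio from the new values in this hunk and rounds to one decimal place. The constant `TIMINGS` and the helper `ratio` are names chosen for this illustration. The throughput rows are consistent with the inverse quotient (Rust MB/s divided by Ruby MB/s, for example 2.94 / 2.69 ≈ 1.09), which is an inference from the numbers rather than something the README states.

```ruby
# Timings in seconds, copied from the "+" rows of the benchmark tables:
# scenario => [rust_seconds, ruby_seconds]
TIMINGS = {
  "split_sentences"              => [1.550, 1.615],
  "tokenize (mode C)"            => [3.148, 3.395],
  "tokenize (mode A)"            => [3.227, 3.525],
  "tokenize (mode B)"            => [3.226, 3.582],
  "rails-style shared tokenizer" => [1.475, 2.101],
  "tokenizer per thread"         => [1.381, 2.154]
}.freeze

# Ratio as defined in the README notes: Ruby time divided by Rust time,
# rounded to one decimal place as in the tables.
def ratio(rust_seconds, ruby_seconds)
  (ruby_seconds / rust_seconds).round(1)
end

TIMINGS.each do |scenario, (rust, ruby)|
  puts format("%-30s %.1fx", scenario, ratio(rust, ruby))
end
```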
data/lib/kabosu/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Kabosu
-   VERSION = "0.6.10.dev.20260225.4c46cc6"
+   VERSION = "0.6.10.dev.20260225.c3c6711"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: kabosu
  version: !ruby/object:Gem::Version
-   version: 0.6.10.dev.20260225.4c46cc6
+   version: 0.6.10.dev.20260225.c3c6711
  platform: ruby
  authors:
  - davafons