kabosu 0.6.10.dev.20260225.4c46cc6 → 0.6.10.dev.20260225.c3c6711
- checksums.yaml +4 -4
- data/README.md +32 -46
- data/lib/kabosu/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3d02b7f222732b36afe4c3da668cc8aa347a00e53d57dc59905dbb40f60cb937
+  data.tar.gz: f105c63fda73d071b4e659891afa868656cb299ee7e04b6c53e0595a34634751
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e776b3e802b8a6d6a9439367010b81580cb36d59069406d66ee2746ed059b089004e08d62d962b89d787dcfbe30463ef6ac5fba5128f469f944ef6f662e4fe37
+  data.tar.gz: a26907ac57248903d6531d62b35888942dbdd15cdb488a64fa6e782c18ffc7419b0ad52153c0e1656242c7fe9d671d6d80a8ea9fcc9fa8c7059d3ec24ec846ff
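The `checksums.yaml` entries above are hex digests of the gem's `metadata.gz` and `data.tar.gz` members. A minimal sketch of checking them, assuming both members have already been extracted from the `.gem` archive and read into memory. The helper name `verify_checksums` and the sample contents are hypothetical and not part of RubyGems or kabosu:

```ruby
require "digest/sha2"
require "yaml"

# Hypothetical helper: compare the digests recorded in a checksums.yaml
# document against the actual bytes of each archive member.
# `files` maps member names (e.g. "metadata.gz") to their raw contents.
# Returns a list of "ALGORITHM member" strings that did not match.
def verify_checksums(checksums_yaml, files)
  algorithms = { "SHA256" => Digest::SHA256, "SHA512" => Digest::SHA512 }
  recorded = YAML.safe_load(checksums_yaml)

  mismatches = []
  recorded.each do |algorithm, entries|
    digest_class = algorithms.fetch(algorithm)
    entries.each do |name, expected_hex|
      actual_hex = digest_class.hexdigest(files.fetch(name))
      mismatches << "#{algorithm} #{name}" unless actual_hex == expected_hex
    end
  end
  mismatches
end

# Illustrative data only: stand-in contents, not the real gem members.
sample_files = {
  "metadata.gz" => "example metadata",
  "data.tar.gz" => "example data"
}
sample_yaml = {
  "SHA256" => sample_files.transform_values { |bytes| Digest::SHA256.hexdigest(bytes) },
  "SHA512" => sample_files.transform_values { |bytes| Digest::SHA512.hexdigest(bytes) }
}.to_yaml

puts verify_checksums(sample_yaml, sample_files).empty? # prints true
```

An empty result means every recorded digest matched. With a real gem, the `files` hash would hold the bytes of the members unpacked from the archive.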
data/README.md
CHANGED
@@ -111,69 +111,41 @@ tok_c.tokenize("東京都").surfaces # => ["東京都"]
 Modes are symbols only (`:a`, `:b`, `:c` or `Kabosu::MODE_A/B/C`).
 Invalid modes now raise `ArgumentError` (for example, `"A"`).
 
-##
-
-For more control over dictionary and tokenizer configuration, create them directly:
+## Advanced Use Cases
 
 ```ruby
+# Custom system dictionary + optional user dictionaries
 dict = Kabosu::Dictionary.new(
   system_dict: "/path/to/custom/system.dic",
   user_dicts: ["/path/to/domain.dic", "/path/to/names.dic"]
 )
+
+# Create tokenizer with explicit mode/fields
 tokenizer = dict.create(mode: :c, fields: %i[surface pos_id reading_form])
 
-
-
-
-
-morphemes.first.part_of_speech
+# Tokenize (returns MorphemeList; lazily hydrates morphemes)
+list = tokenizer.tokenize("国会議事堂前駅")
+list.surfaces
+list.first.part_of_speech
 
-#
+# Dictionary prefix lookup
 dict.lookup("東京都").surfaces
 
-# Morpheme split
+# Morpheme split
 m = tokenizer.tokenize("東京都").first
 m.split(mode: :a).surfaces
 
-#
+# Bulk extractors
 tokenizer.tokenize_surfaces("東京都に住んでいる")
 tokenizer.tokenize_readings("東京都に住んでいる")
 tokenizer.tokenize_dictionary_forms("東京都に住んでいる")
 tokenizer.tokenize_normalized_forms("東京都に住んでいる")
 
-# Sentence splitting
+# Sentence splitting
 Kabosu.split_sentences("東京都に住んでいる。大阪も好きだ。", ranges: true)
 Kabosu.split_sentences("長い文...", limit: 12, with_checker: true)
-
-# ranges: true returns SentenceRange objects
-ranges = Kabosu.split_sentences("東京都に住んでいる。", ranges: true)
-ranges.first.start
-ranges.first.end
-ranges.first.text
 ```
 
-Dictionary initialization failures raise typed errors:
-- `Kabosu::ConfigError` for configuration issues
-- `Kabosu::DictionaryError` for dictionary loading issues
-
-Runtime failures in analysis APIs are also typed:
-- `Kabosu::TokenizationError` for tokenization/split failures
-- `Kabosu::SentenceSplitError` for sentence splitter failures
-- `Kabosu::LookupError` for dictionary lookup failures
-
-## Public API Contract
-
-| API | Parameters | Return | Notes |
-|---|---|---|---|
-| `Kabosu::Dictionary.new` | `config: String?`, `system_dict: String?`, `user_dicts: Array<String>?` | `Kabosu::Dictionary` | One of `config` or `system_dict` is required |
-| `Dictionary#create` | `mode: :a|:b|:c`, `fields: Array<String\|Symbol>?`, `debug: bool`, `projection: nil` | `Kabosu::Tokenizer` | Unknown kwargs raise `ArgumentError`; `projection` currently raises `NotImplementedError` |
-| `Dictionary#lookup` | `text: String` | `Kabosu::MorphemeList` | Prefix lookup from byte offset 0 |
-| `Tokenizer#tokenize` | `text: String` | `Kabosu::MorphemeList` | Lazy morpheme hydration; raises `Kabosu::TokenizationError` on native failures |
-| `Tokenizer#tokenize_surfaces/readings/dictionary_forms/normalized_forms` | `text: String` | `Array<String>` | Raises `Kabosu::TokenizationError` on native failures |
-| `Morpheme#split` | `mode: :a|:b|:c`, `add_single: bool` | `Kabosu::MorphemeList` | Standardized with `tokenize` return type |
-| `Kabosu.split_sentences` | `text: String`, `limit: Integer?`, `with_checker: bool`, `ranges: bool`, `dictionary: String?` | `Array<String>` or `Array<Kabosu::SentenceRange>` | `limit` must be `>= 1` |
-| `Kabosu.tokenize` | `text: String`, `tokenizer: Kabosu::Tokenizer` | `Kabosu::MorphemeList` | No hidden global tokenizer cache |
-
 ## Benchmarks
 
 Kabosu ships with a benchmark suite that measures tokenization throughput and compares the Ruby bindings against raw [sudachi.rs](https://github.com/WorksApplications/sudachi.rs).
@@ -184,15 +156,29 @@ This benchmark uses [Wagahai wa Neko de Aru](https://www.aozora.gr.jp/cards/0001
 
 Measured on an AMD Ryzen 7 5800X, `full` dictionary edition, Ruby 3.4, Rust 1.84:
 
+Single-thread (10 iterations):
+
 | Scenario | Rust | Ruby | Ratio |
 |---|---|---|---|
-| split_sentences | 1.
-| tokenize (mode C) | 3.
-| tokenize (mode A) | 3.
-| tokenize (mode B) | 3.
-| **Throughput** | **2.
+| split_sentences | 1.550s | 1.615s | 1.0x |
+| tokenize (mode C) | 3.148s | 3.395s | 1.1x |
+| tokenize (mode A) | 3.227s | 3.525s | 1.1x |
+| tokenize (mode B) | 3.226s | 3.582s | 1.1x |
+| **Throughput** | **2.94 MB/s** | **2.69 MB/s** | **1.1x** |
 
-
+Multithread (8 threads x 20,000 requests):
+
+| Scenario | Rust | Ruby | Ratio |
+|---|---|---|---|
+| rails-style shared tokenizer | 1.475s | 2.101s | 1.4x |
+| tokenizer per thread | 1.381s | 2.154s | 1.6x |
+| **Throughput ST** | **20.44 MB/s** | **14.35 MB/s** | **1.4x** |
+| **Throughput PT** | **21.84 MB/s** | **14.00 MB/s** | **1.6x** |
+
+Notes:
+- `shared tokenizer` matches Rails-style access where all request threads call one tokenizer instance.
+- `per thread` creates one tokenizer per worker thread.
+- Ratios are `Ruby / Rust`, and values vary by CPU, Ruby version, and dictionary edition.
 
 To reproduce these results, run:
 
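The Ratio column in the added benchmark tables can be recomputed from the Rust and Ruby columns. A minimal sketch using figures from the added rows, with hypothetical helper names. Elapsed-time rows divide Ruby by Rust, as the README note says. The throughput rows appear to divide Rust by Ruby instead (MB/s is inversely proportional to elapsed time), which is inferred from the numbers and not stated in the README:

```ruby
# Hypothetical helpers for recomputing the "Ratio" column.

# For elapsed times, lower is better, so the slowdown is Ruby / Rust.
def time_ratio(rust_seconds, ruby_seconds)
  format("%.1fx", ruby_seconds / rust_seconds)
end

# For throughput, higher is better, so the same slowdown is Rust / Ruby.
def throughput_ratio(rust_mb_per_s, ruby_mb_per_s)
  format("%.1fx", rust_mb_per_s / ruby_mb_per_s)
end

puts time_ratio(1.550, 1.615)        # split_sentences      -> 1.0x
puts time_ratio(1.381, 2.154)        # tokenizer per thread -> 1.6x
puts throughput_ratio(21.84, 14.00)  # Throughput PT        -> 1.6x
```

Both helpers express the same quantity, how many times slower the Ruby bindings are than raw Rust, so the time and throughput rows of one scenario should agree after rounding.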
data/lib/kabosu/version.rb
CHANGED