spellkit 0.1.0.pre.1
- checksums.yaml +7 -0
- data/LICENSE +21 -0
- data/README.md +448 -0
- data/ext/spellkit/Cargo.lock +402 -0
- data/ext/spellkit/Cargo.toml +21 -0
- data/ext/spellkit/extconf.rb +4 -0
- data/ext/spellkit/src/guards.rs +57 -0
- data/ext/spellkit/src/lib.rs +255 -0
- data/ext/spellkit/src/symspell.rs +264 -0
- data/ext/spellkit/target/debug/build/clang-sys-051521a65ca8f402/out/common.rs +355 -0
- data/ext/spellkit/target/debug/build/clang-sys-051521a65ca8f402/out/dynamic.rs +276 -0
- data/ext/spellkit/target/debug/build/clang-sys-051521a65ca8f402/out/macros.rs +49 -0
- data/ext/spellkit/target/debug/build/rb-sys-4cf7db3819c4a6ed/out/bindings-0.9.117-mri-arm64-darwin24-3.3.0.rs +8902 -0
- data/ext/spellkit/target/debug/build/serde-b1b39c86cf577219/out/private.rs +6 -0
- data/ext/spellkit/target/debug/build/serde_core-7a7752261f0e4007/out/private.rs +5 -0
- data/ext/spellkit/target/debug/incremental/spellkit-10n1yon0n2c8v/s-hbha7isu2i-02ly2uq.lock +0 -0
- data/ext/spellkit/target/debug/incremental/spellkit-2jusczkp089xp/s-hbhcyx6yob-0pqrnyt.lock +0 -0
- data/ext/spellkit/target/debug/incremental/spellkit-39nm03wp54lxw/s-hbhcyx6ynq-08lhwc0.lock +0 -0
- data/lib/spellkit/version.rb +5 -0
- data/lib/spellkit.rb +216 -0
- metadata +123 -0
checksums.yaml
ADDED
|
@@ -0,0 +1,7 @@
|
|
|
1
|
+
---
|
|
2
|
+
SHA256:
|
|
3
|
+
metadata.gz: 8b0bb02947ee896cb9fab3cef53771c6a80e4e14123e518ffacb360b2c136b4d
|
|
4
|
+
data.tar.gz: 2a5af3e67e414f37fb6ff0a3c3a56a1d771e1480f29380b6196000f9f27e7071
|
|
5
|
+
SHA512:
|
|
6
|
+
metadata.gz: 1cf8d17fbdbcea925c32414e0d746ab4533604941b1603b127667f17f224b5c027f1aeb40b31f712c02a4bc584a1694ca75ef2a717563fe98d5d85d505a427da
|
|
7
|
+
data.tar.gz: 379eb6fa669ea2f13906cb121a69877414c1338f2b68509f9be55d79c29946eaf293136bbee7a766c75841db2d7fc6527cecbb01550a0ebd4cbfb97bcaf13f7c
|
data/LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
MIT License
|
|
2
|
+
|
|
3
|
+
Copyright (c) 2025 Chris Petersen
|
|
4
|
+
|
|
5
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
|
6
|
+
of this software and associated documentation files (the "Software"), to deal
|
|
7
|
+
in the Software without restriction, including without limitation the rights
|
|
8
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
|
9
|
+
copies of the Software, and to permit persons to whom the Software is
|
|
10
|
+
furnished to do so, subject to the following conditions:
|
|
11
|
+
|
|
12
|
+
The above copyright notice and this permission notice shall be included in all
|
|
13
|
+
copies or substantial portions of the Software.
|
|
14
|
+
|
|
15
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
16
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
17
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
|
18
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
19
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
|
20
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
|
21
|
+
SOFTWARE.
|
data/README.md
ADDED
|
@@ -0,0 +1,448 @@
|
|
|
1
|
+
# SpellKit
|
|
2
|
+
|
|
3
|
+
Fast, safe typo correction for search-term extraction, wrapping the SymSpell algorithm in Rust via Magnus.
|
|
4
|
+
|
|
5
|
+
SpellKit provides:
|
|
6
|
+
- **Fast correction** using SymSpell with configurable edit distance (1 or 2)
|
|
7
|
+
- **Term protection** - never alter protected terms using exact matches or regex patterns
|
|
8
|
+
- **Hot reload** - update dictionaries without restarting your application
|
|
9
|
+
- **Sub-millisecond latency** - p95 < 2µs on small dictionaries
|
|
10
|
+
- **Thread-safe** - built with Rust's `Arc<RwLock>` for safe concurrent access
|
|
11
|
+
|
|
12
|
+
## Installation
|
|
13
|
+
|
|
14
|
+
Add to your Gemfile:
|
|
15
|
+
|
|
16
|
+
```ruby
|
|
17
|
+
gem "spellkit"
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
Or install directly:
|
|
21
|
+
|
|
22
|
+
```bash
|
|
23
|
+
gem install spellkit
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
## Quick Start
|
|
27
|
+
|
|
28
|
+
SpellKit works with dictionaries from URLs or local files. Try it immediately:
|
|
29
|
+
|
|
30
|
+
```ruby
|
|
31
|
+
require "spellkit"
|
|
32
|
+
|
|
33
|
+
# Load from URL (downloads and caches automatically)
|
|
34
|
+
SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
|
|
35
|
+
|
|
36
|
+
# Or use a configure block (recommended for Rails)
|
|
37
|
+
SpellKit.configure do |config|
|
|
38
|
+
config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
|
|
39
|
+
config.edit_distance = 1
|
|
40
|
+
end
|
|
41
|
+
|
|
42
|
+
# Or load from local file
|
|
43
|
+
# SpellKit.load!(dictionary: "path/to/dictionary.tsv")
|
|
44
|
+
|
|
45
|
+
# Get suggestions for a misspelled word
|
|
46
|
+
suggestions = SpellKit.suggest("helo", 5)
|
|
47
|
+
puts suggestions.inspect
|
|
48
|
+
# => [{"term"=>"hello", "distance"=>1, "freq"=>...}]
|
|
49
|
+
|
|
50
|
+
# Correct a typo
|
|
51
|
+
corrected = SpellKit.correct_if_unknown("helo")
|
|
52
|
+
puts corrected
|
|
53
|
+
# => "hello"
|
|
54
|
+
|
|
55
|
+
# Batch correction
|
|
56
|
+
tokens = %w[helllo wrld ruby teset]
|
|
57
|
+
corrected_tokens = SpellKit.correct_tokens(tokens)
|
|
58
|
+
puts corrected_tokens.inspect
|
|
59
|
+
# => ["hello", "world", "ruby", "test"]
|
|
60
|
+
|
|
61
|
+
# Check stats
|
|
62
|
+
puts SpellKit.stats.inspect
|
|
63
|
+
# => {"loaded"=>true, "dictionary_size"=>..., "edit_distance"=>1, "loaded_at"=>...}
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
## Usage
|
|
67
|
+
|
|
68
|
+
### Basic Correction
|
|
69
|
+
|
|
70
|
+
```ruby
|
|
71
|
+
require "spellkit"
|
|
72
|
+
|
|
73
|
+
# Load from URL (auto-downloads and caches)
|
|
74
|
+
SpellKit.load!(dictionary: "https://example.com/dict.tsv")
|
|
75
|
+
|
|
76
|
+
# Or from local file
|
|
77
|
+
SpellKit.load!(dictionary: "models/dictionary.tsv", edit_distance: 1)
|
|
78
|
+
|
|
79
|
+
# Get suggestions
|
|
80
|
+
SpellKit.suggest("lyssis", 5)
|
|
81
|
+
# => [{"term"=>"lysis", "distance"=>1, "freq"=>2000}, ...]
|
|
82
|
+
|
|
83
|
+
# Correct a typo
|
|
84
|
+
SpellKit.correct_if_unknown("helo")
|
|
85
|
+
# => "hello"
|
|
86
|
+
|
|
87
|
+
# Batch correction
|
|
88
|
+
tokens = %w[helo wrld ruby]
|
|
89
|
+
SpellKit.correct_tokens(tokens)
|
|
90
|
+
# => ["hello", "world", "ruby"]
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
### Term Protection
|
|
94
|
+
|
|
95
|
+
Protect specific terms from correction using exact matches or regex patterns:
|
|
96
|
+
|
|
97
|
+
```ruby
|
|
98
|
+
# Load with exact-match protected terms
|
|
99
|
+
SpellKit.load!(
|
|
100
|
+
dictionary: "models/dictionary.tsv",
|
|
101
|
+
protected_path: "models/protected.txt" # file with terms to protect
|
|
102
|
+
)
|
|
103
|
+
|
|
104
|
+
# Protect terms matching regex patterns
|
|
105
|
+
SpellKit.load!(
|
|
106
|
+
dictionary: "models/dictionary.tsv",
|
|
107
|
+
protected_patterns: [
|
|
108
|
+
/^[A-Z]{3,4}\d+$/, # gene symbols like CDK10, BRCA1
|
|
109
|
+
/^\d{2,7}-\d{2}-\d$/, # CAS numbers like 7732-18-5
|
|
110
|
+
/^[A-Z]{2,3}-\d+$/ # SKU patterns like ABC-123
|
|
111
|
+
]
|
|
112
|
+
)
|
|
113
|
+
|
|
114
|
+
# Or combine both
|
|
115
|
+
SpellKit.load!(
|
|
116
|
+
dictionary: "models/dictionary.tsv",
|
|
117
|
+
protected_path: "models/protected.txt",
|
|
118
|
+
protected_patterns: [/^[A-Z]{3,4}\d+$/]
|
|
119
|
+
)
|
|
120
|
+
|
|
121
|
+
# Use guard: :domain to enable protection
|
|
122
|
+
SpellKit.correct_if_unknown("CDK10", guard: :domain)
|
|
123
|
+
# => "CDK10" # protected, never changed
|
|
124
|
+
|
|
125
|
+
# Batch correction with guards
|
|
126
|
+
tokens = %w[helo wrld ABC-123 for CDK10]
|
|
127
|
+
SpellKit.correct_tokens(tokens, guard: :domain)
|
|
128
|
+
# => ["hello", "world", "ABC-123", "for", "CDK10"]
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
### Multiple Instances
|
|
132
|
+
|
|
133
|
+
SpellKit supports multiple independent checker instances, useful for different domains or languages:
|
|
134
|
+
|
|
135
|
+
```ruby
|
|
136
|
+
# Create separate instances for different domains
|
|
137
|
+
medical_checker = SpellKit::Checker.new
|
|
138
|
+
medical_checker.load!(
|
|
139
|
+
dictionary: "models/medical_dictionary.tsv",
|
|
140
|
+
protected_path: "models/medical_terms.txt"
|
|
141
|
+
)
|
|
142
|
+
|
|
143
|
+
legal_checker = SpellKit::Checker.new
|
|
144
|
+
legal_checker.load!(
|
|
145
|
+
dictionary: "models/legal_dictionary.tsv",
|
|
146
|
+
protected_path: "models/legal_terms.txt"
|
|
147
|
+
)
|
|
148
|
+
|
|
149
|
+
# Use them independently
|
|
150
|
+
medical_checker.suggest("lyssis", 5)
|
|
151
|
+
legal_checker.suggest("contractt", 5)
|
|
152
|
+
|
|
153
|
+
# Each maintains its own state
|
|
154
|
+
medical_checker.stats # Shows medical dictionary stats
|
|
155
|
+
legal_checker.stats # Shows legal dictionary stats
|
|
156
|
+
```
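When several checkers are in play, it can help to look them up by domain name. The following is a minimal sketch, not part of SpellKit: `CheckerRegistry` is a hypothetical class, and it only assumes that each registered object responds to `suggest(word, max)` as shown above.

```ruby
# Hypothetical registry (not part of SpellKit): route calls to a checker by domain.
# Any object that responds to suggest(word, max) can be registered.
class CheckerRegistry
  def initialize
    @checkers = {}
  end

  # Register a checker under a domain name; returns self so calls can be chained.
  def register(domain, checker)
    @checkers[domain.to_sym] = checker
    self
  end

  # Delegate to the checker for the given domain.
  # Raises KeyError if the domain was never registered.
  def suggest(domain, word, max = 5)
    @checkers.fetch(domain.to_sym).suggest(word, max)
  end
end

# Usage with the instances created above (assumes they are already loaded):
#   registry = CheckerRegistry.new
#     .register(:medical, medical_checker)
#     .register(:legal, legal_checker)
#   registry.suggest(:medical, "lyssis")
```

Raising `KeyError` for an unknown domain is a design choice of this sketch; falling back to a default checker would be equally reasonable.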
|
|
157
|
+
|
|
158
|
+
### Configuration Block
|
|
159
|
+
|
|
160
|
+
Use the configure block pattern for Rails initializers:
|
|
161
|
+
|
|
162
|
+
```ruby
|
|
163
|
+
SpellKit.configure do |config|
|
|
164
|
+
config.dictionary = "models/dictionary.tsv"
|
|
165
|
+
config.protected_path = "models/protected.txt"
|
|
166
|
+
config.protected_patterns = [/^[A-Z]{3,4}\d+$/]
|
|
167
|
+
config.edit_distance = 1
|
|
168
|
+
config.frequency_threshold = 10.0
|
|
169
|
+
end
|
|
170
|
+
|
|
171
|
+
# This becomes the default instance
|
|
172
|
+
SpellKit.suggest("word", 5) # Uses configured dictionary
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
## Dictionary Format
|
|
176
|
+
|
|
177
|
+
### Dictionary (required)
|
|
178
|
+
|
|
179
|
+
Whitespace-separated file with term and frequency (supports both space and tab delimiters):
|
|
180
|
+
|
|
181
|
+
```
|
|
182
|
+
hello 10000
|
|
183
|
+
world 8000
|
|
184
|
+
lysis 2000
|
|
185
|
+
```
|
|
186
|
+
|
|
187
|
+
Or space-separated:
|
|
188
|
+
```
|
|
189
|
+
hello 10000
|
|
190
|
+
world 8000
|
|
191
|
+
lysis 2000
|
|
192
|
+
```
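To illustrate the format above, here is a rough sketch that parses dictionary lines in plain Ruby. The `parse_dictionary_line` helper is hypothetical and not part of SpellKit; the real parsing happens in the Rust extension and may treat malformed lines differently.

```ruby
# Hypothetical helper (not part of SpellKit): parse one "term frequency" line.
# Returns [term, frequency] or nil when the line is blank or malformed.
def parse_dictionary_line(line)
  term, freq = line.split # String#split with no argument splits on runs of spaces or tabs
  return nil if term.nil? || freq.nil?

  count = Integer(freq, exception: false) # nil instead of an exception for non-numeric input
  count ? [term, count] : nil
end

lines = ["hello\t10000", "world 8000", "", "broken"]
entries = lines.filter_map { |line| parse_dictionary_line(line) }
p entries # prints [["hello", 10000], ["world", 8000]]
```

Skipping malformed lines silently is an assumption of this sketch, chosen to keep the example short.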
|
|
193
|
+
|
|
194
|
+
### Protected Terms (optional)
|
|
195
|
+
|
|
196
|
+
One term per line. Terms are matched case-insensitively:
|
|
197
|
+
|
|
198
|
+
**protected.txt**
|
|
199
|
+
```
|
|
200
|
+
# Product codes
|
|
201
|
+
ABC-123
|
|
202
|
+
XYZ-999
|
|
203
|
+
|
|
204
|
+
# Technical terms
|
|
205
|
+
CDK10
|
|
206
|
+
BRCA1
|
|
207
|
+
|
|
208
|
+
# Brand names
|
|
209
|
+
MyBrand
|
|
210
|
+
SpecialTerm
|
|
211
|
+
```
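As a sketch of what case-insensitive matching means here, the following plain-Ruby helper builds a lookup set from the lines of a protected-terms file. It is hypothetical and not part of SpellKit. It also assumes that blank lines and lines starting with `#` are ignored, which the example file suggests but the text above does not state.

```ruby
require "set"

# Hypothetical helper (not part of SpellKit): build a case-insensitive lookup
# from the lines of a protected-terms file.
# Assumption: blank lines and lines starting with "#" are skipped.
def protected_set(lines)
  lines.map(&:strip)
       .reject { |line| line.empty? || line.start_with?("#") }
       .map(&:downcase)
       .to_set
end

# Hypothetical helper: compare in lowercase, leaving the caller's token untouched.
def protected_term?(token, set)
  set.include?(token.downcase)
end

terms = protected_set(["# Product codes", "ABC-123", "", "CDK10"])
puts protected_term?("cdk10", terms) # prints true
puts protected_term?("hello", terms) # prints false
```

Because only the lowercase copy is used for the lookup, the original casing of the token is available for output.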
|
|
212
|
+
|
|
213
|
+
## Dictionary Sources
|
|
214
|
+
|
|
215
|
+
SpellKit doesn't bundle dictionaries, but works with several sources:
|
|
216
|
+
|
|
217
|
+
### Use the Default Dictionary (Recommended)
|
|
218
|
+
```ruby
|
|
219
|
+
# English 80k word dictionary from SymSpell
|
|
220
|
+
SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
### Public Dictionary URLs
|
|
224
|
+
- **SymSpell English 80k**: `https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt`
|
|
225
|
+
- **SymSpell English 500k**: `https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-500k.txt`
|
|
226
|
+
|
|
227
|
+
### Build Your Own
|
|
228
|
+
See the "Building Dictionaries" section below for how to create domain-specific dictionaries.
|
|
229
|
+
|
|
230
|
+
### Caching
|
|
231
|
+
Dictionaries downloaded from URLs are cached in `~/.cache/spellkit/` for faster subsequent loads.
|
|
232
|
+
|
|
233
|
+
## Configuration
|
|
234
|
+
|
|
235
|
+
```ruby
|
|
236
|
+
SpellKit.load!(
|
|
237
|
+
dictionary: "models/dictionary.tsv", # required: path or URL
|
|
238
|
+
protected_path: "models/protected.txt", # optional
|
|
239
|
+
protected_patterns: [/^[A-Z]{3,4}\d+$/], # optional
|
|
240
|
+
edit_distance: 1, # 1 (default) or 2
|
|
241
|
+
frequency_threshold: 10.0 # default: 10.0
|
|
242
|
+
)
|
|
243
|
+
```
|
|
244
|
+
|
|
245
|
+
## API Reference
|
|
246
|
+
|
|
247
|
+
### `SpellKit.load!(**options)`
|
|
248
|
+
|
|
249
|
+
Loads or reloads dictionaries. The new dictionary is swapped in atomically, so the call is thread-safe. Accepts URLs (downloaded and cached automatically) or local file paths.
|
|
250
|
+
|
|
251
|
+
**Options:**
|
|
252
|
+
- `dictionary:` (required) - URL or path to a dictionary file with one `term<TAB>frequency` entry per line (a space delimiter also works)
|
|
253
|
+
- `protected_path:` (optional) - Path to file with protected terms (one per line)
|
|
254
|
+
- `protected_patterns:` (optional) - Array of Regexp or String patterns to protect
|
|
255
|
+
- `edit_distance:` (default: 1) - Maximum edit distance (1 or 2)
|
|
256
|
+
- `frequency_threshold:` (default: 10.0) - Minimum frequency ratio for corrections
|
|
257
|
+
|
|
258
|
+
**Examples:**
|
|
259
|
+
```ruby
|
|
260
|
+
# From URL (recommended for getting started)
|
|
261
|
+
SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
|
|
262
|
+
|
|
263
|
+
# From custom URL
|
|
264
|
+
SpellKit.load!(dictionary: "https://example.com/dict.tsv")
|
|
265
|
+
|
|
266
|
+
# From local file
|
|
267
|
+
SpellKit.load!(dictionary: "/path/to/dictionary.tsv")
|
|
268
|
+
```
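Because `load!` can be called again to reload, one way to get hot reload is to call it whenever the dictionary file changes. The sketch below is an assumption-laden illustration: `DictionaryReloader` is a hypothetical class, not part of SpellKit, and it only relies on the `load!(dictionary:)` keyword documented above.

```ruby
# Hypothetical helper (not part of SpellKit): reload when the dictionary file changes.
# `checker` is anything that responds to load!(dictionary:), such as SpellKit itself
# or a SpellKit::Checker instance.
class DictionaryReloader
  def initialize(checker, path)
    @checker = checker
    @path = path
    @last_mtime = nil
  end

  # Calls load! only when the file's modification time differs from the last
  # one seen. Returns true if a reload happened, false otherwise.
  def reload_if_changed
    mtime = File.mtime(@path)
    return false if mtime == @last_mtime

    @checker.load!(dictionary: @path)
    @last_mtime = mtime
    true
  end
end

# Usage (assumes the spellkit gem is installed and the file exists):
#   reloader = DictionaryReloader.new(SpellKit, "models/dictionary.tsv")
#   reloader.reload_if_changed # call periodically, e.g. from a background thread
```

Polling the modification time is the simplest approach. Other options passed to `load!` (such as `protected_path:`) would need to be forwarded as well in a real setup.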
|
|
269
|
+
|
|
270
|
+
### `SpellKit.suggest(word, max = 5)`
|
|
271
|
+
|
|
272
|
+
Get ranked suggestions for a word.
|
|
273
|
+
|
|
274
|
+
**Parameters:**
|
|
275
|
+
- `word` (required) - The word to get suggestions for
|
|
276
|
+
- `max` (optional, default: 5) - Maximum number of suggestions to return
|
|
277
|
+
|
|
278
|
+
**Returns:** Array of hashes with `"term"`, `"distance"`, and `"freq"` keys
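Since results come back ranked, the first element is usually the one to use. If you want to apply your own tie-breaking, the sketch below shows one way to do it on the documented hash shape. `best_suggestion` is a hypothetical helper, not part of SpellKit, and the sample values are made up.

```ruby
# Hypothetical helper (not part of SpellKit): pick one suggestion from the
# documented array-of-hashes shape, preferring the smallest edit distance
# and breaking ties by the highest frequency. Returns nil for an empty array.
def best_suggestion(suggestions)
  suggestions.min_by { |s| [s["distance"], -s["freq"]] }
end

# Sample data in the documented shape; the values are made up for illustration.
sample = [
  {"term" => "help",  "distance" => 1, "freq" => 9_000},
  {"term" => "hello", "distance" => 1, "freq" => 10_000},
  {"term" => "shell", "distance" => 2, "freq" => 3_000}
]

best = best_suggestion(sample)
puts best["term"] # prints "hello"
```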
|
|
279
|
+
|
|
280
|
+
### `SpellKit.correct_if_unknown(word, guard:)`
|
|
281
|
+
|
|
282
|
+
Returns the corrected word, or the original word if no better match is found.
|
|
283
|
+
|
|
284
|
+
**Options:**
|
|
285
|
+
- `guard:` (optional) - Set to `:domain` to enable protection checks
|
|
286
|
+
|
|
287
|
+
### `SpellKit.correct_tokens(tokens, guard:)`
|
|
288
|
+
|
|
289
|
+
Batch correction of an array of tokens.
|
|
290
|
+
|
|
291
|
+
**Returns:** Array of corrected strings
|
|
292
|
+
|
|
293
|
+
### `SpellKit.stats`
|
|
294
|
+
|
|
295
|
+
Get current state statistics.
|
|
296
|
+
|
|
297
|
+
**Returns:** Hash with:
|
|
298
|
+
- `"loaded"` - Boolean
|
|
299
|
+
- `"dictionary_size"` - Number of terms
|
|
300
|
+
- `"edit_distance"` - Configured edit distance
|
|
301
|
+
- `"loaded_at"` - Unix timestamp
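The `"loaded_at"` value is a Unix timestamp, so it can be turned into a readable time with `Time.at`. The sketch below formats a stats hash for logging. `describe_stats` is a hypothetical helper, not part of SpellKit, and the sample hash uses illustrative values only.

```ruby
# Hypothetical helper (not part of SpellKit): summarize a stats hash in one line.
def describe_stats(stats)
  return "dictionary not loaded" unless stats["loaded"]

  loaded_at = Time.at(stats["loaded_at"]).utc.strftime("%Y-%m-%d %H:%M:%S")
  "#{stats["dictionary_size"]} terms, edit distance #{stats["edit_distance"]}, loaded at #{loaded_at} UTC"
end

# Sample hash in the documented shape; the values are illustrative only.
sample_stats = {
  "loaded" => true,
  "dictionary_size" => 80_000,
  "edit_distance" => 1,
  "loaded_at" => 1_700_000_000
}

puts describe_stats(sample_stats)
# prints "80000 terms, edit distance 1, loaded at 2023-11-14 22:13:20 UTC"
```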
|
|
302
|
+
|
|
303
|
+
### `SpellKit.healthcheck`
|
|
304
|
+
|
|
305
|
+
Verifies that the dictionary is properly loaded. Raises an error if it is not.
|
|
306
|
+
|
|
307
|
+
## Term Protection
|
|
308
|
+
|
|
309
|
+
The `guard: :domain` option enables protection for specific terms:
|
|
310
|
+
|
|
311
|
+
### Exact Matches
|
|
312
|
+
Terms listed in the `protected_path` file are never corrected, even if similar dictionary words exist. Matching is case-insensitive, but the original casing is preserved in the output.
|
|
313
|
+
|
|
314
|
+
### Pattern Matching
|
|
315
|
+
Terms matching any pattern in `protected_patterns` are protected. Patterns can be:
|
|
316
|
+
- Ruby Regexp objects: `/^[A-Z]{3,4}\d+$/`
|
|
317
|
+
- Regex strings: `"^[A-Z]{3,4}\\d+$"`
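The two forms describe the same pattern; the string form just needs its backslashes escaped. The sketch below checks tokens against a mixed list in plain Ruby. `matches_any_pattern?` is a hypothetical helper, not part of SpellKit, and SpellKit evaluates patterns in its Rust extension, so edge-case behavior there may differ from Ruby's `Regexp`.

```ruby
# Hypothetical helper (not part of SpellKit): true if any pattern matches the token.
# Accepts Regexp objects or regex strings, mirroring the two documented forms.
def matches_any_pattern?(token, patterns)
  patterns.any? do |pattern|
    regexp = pattern.is_a?(Regexp) ? pattern : Regexp.new(pattern)
    regexp.match?(token)
  end
end

patterns = [
  /^[A-Z]{3,4}\d+$/,       # Regexp object
  "^\\d{2,7}-\\d{2}-\\d$"  # regex string with escaped backslashes
]

puts matches_any_pattern?("BRCA1", patterns)     # prints true
puts matches_any_pattern?("7732-18-5", patterns) # prints true
puts matches_any_pattern?("helo", patterns)      # prints false
```

In Ruby these patterns are case-sensitive, so a token that was lowercased beforehand (for example `"cdk10"`) would not match `/^[A-Z]{3,4}\d+$/`.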
|
|
318
|
+
|
|
319
|
+
### Examples
|
|
320
|
+
```ruby
|
|
321
|
+
# Protect specific terms
|
|
322
|
+
protected_patterns: [
|
|
323
|
+
/^[A-Z]{3,4}\d+$/, # Gene symbols: CDK10, BRCA1
|
|
324
|
+
/^\d{2,7}-\d{2}-\d$/, # CAS numbers: 7732-18-5
|
|
325
|
+
/^[A-Z]{2,3}-\d+$/ # Product codes: ABC-123
|
|
326
|
+
]
|
|
327
|
+
```
|
|
328
|
+
|
|
329
|
+
## Rails Integration
|
|
330
|
+
|
|
331
|
+
```ruby
|
|
332
|
+
# config/initializers/spellkit.rb
|
|
333
|
+
|
|
334
|
+
# Option 1: Use default dictionary (easiest)
|
|
335
|
+
SpellKit.configure do |config|
|
|
336
|
+
config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
|
|
337
|
+
end
|
|
338
|
+
|
|
339
|
+
# Option 2: Use local dictionary with full configuration
|
|
340
|
+
SpellKit.configure do |config|
|
|
341
|
+
config.dictionary = Rails.root.join("models/dictionary.tsv")
|
|
342
|
+
config.protected_path = Rails.root.join("models/protected.txt")
|
|
343
|
+
config.protected_patterns = [
|
|
344
|
+
/^[A-Z]{3,4}\d+$/, # Product codes
|
|
345
|
+
/^\d{2,7}-\d{2}-\d$/ # Reference numbers
|
|
346
|
+
]
|
|
347
|
+
config.edit_distance = 1
|
|
348
|
+
config.frequency_threshold = 10.0
|
|
349
|
+
end
|
|
350
|
+
|
|
351
|
+
# Option 3: Multiple domain-specific instances
|
|
352
|
+
# config/initializers/spellkit.rb
|
|
353
|
+
module SpellCheckers
|
|
354
|
+
MEDICAL = SpellKit::Checker.new.tap do |c|
|
|
355
|
+
c.load!(
|
|
356
|
+
dictionary: Rails.root.join("models/medical_dictionary.tsv"),
|
|
357
|
+
protected_path: Rails.root.join("models/medical_terms.txt")
|
|
358
|
+
)
|
|
359
|
+
end
|
|
360
|
+
|
|
361
|
+
LEGAL = SpellKit::Checker.new.tap do |c|
|
|
362
|
+
c.load!(
|
|
363
|
+
dictionary: Rails.root.join("models/legal_dictionary.tsv"),
|
|
364
|
+
protected_path: Rails.root.join("models/legal_terms.txt")
|
|
365
|
+
)
|
|
366
|
+
end
|
|
367
|
+
end
|
|
368
|
+
|
|
369
|
+
# In your search preprocessing
|
|
370
|
+
class SearchPreprocessor
|
|
371
|
+
def self.correct_query(text)
|
|
372
|
+
tokens = text.downcase.split(/\s+/)
|
|
373
|
+
SpellKit.correct_tokens(tokens, guard: :domain).join(" ")
|
|
374
|
+
end
|
|
375
|
+
end
|
|
376
|
+
```
|
|
377
|
+
|
|
378
|
+
## Performance
|
|
379
|
+
|
|
380
|
+
Benchmarked on an M1 MacBook Pro with a 20-term test dictionary:
|
|
381
|
+
|
|
382
|
+
- **Load time**: < 100ms
|
|
383
|
+
- **Suggestion latency**: p50 < 2µs, p95 < 2µs
|
|
384
|
+
- **Guard checks**: p95 < 1µs
|
|
385
|
+
- **Memory**: ~150MB for 1M term dictionary (estimated)
|
|
386
|
+
|
|
387
|
+
Target for production (1-5M terms):
|
|
388
|
+
- Load: < 500ms
|
|
389
|
+
- p50: < 30µs, p95: < 100µs
|
|
390
|
+
- Memory: 50-150MB
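To check these numbers against your own dictionary, you can time repeated calls and read off percentiles. The sketch below is one way to do that in plain Ruby. `latency_percentiles` is a hypothetical helper, not the benchmark used for the figures above, and it uses a simple nearest-rank percentile.

```ruby
# Hypothetical helper (not part of SpellKit): run the given block `iterations`
# times and return p50/p95 latency in microseconds (nearest-rank percentiles).
def latency_percentiles(iterations: 1_000)
  samples = Array.new(iterations) do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - start) / 1_000.0
  end.sort

  percentile = ->(p) { samples[((samples.size - 1) * p).round] }
  {p50: percentile.call(0.50), p95: percentile.call(0.95)}
end

# Stand-in workload so the sketch runs without the gem; replace the block body
# with a real call such as SpellKit.suggest("helo", 5) once a dictionary is loaded.
result = latency_percentiles { "helo".reverse }
puts format("p50=%.2fµs p95=%.2fµs", result[:p50], result[:p95])
```

Measurements at this scale include the overhead of the timing calls themselves, so treat the result as an upper bound rather than an exact figure.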
|
|
391
|
+
|
|
392
|
+
## Building Dictionaries
|
|
393
|
+
|
|
394
|
+
Create your dictionary from your corpus:
|
|
395
|
+
|
|
396
|
+
```ruby
|
|
397
|
+
# example_builder.rb
|
|
398
|
+
require "set"
|
|
399
|
+
|
|
400
|
+
counts = Hash.new(0)
|
|
401
|
+
|
|
402
|
+
# Read your corpus
|
|
403
|
+
File.foreach("corpus.txt") do |line|
|
|
404
|
+
line.downcase.split(/\W+/).each do |word|
|
|
405
|
+
next if word.length < 3
|
|
406
|
+
counts[word] += 1
|
|
407
|
+
end
|
|
408
|
+
end
|
|
409
|
+
|
|
410
|
+
# Filter by minimum count and write
|
|
411
|
+
min_count = 5
|
|
412
|
+
File.open("dictionary.tsv", "w") do |f|
|
|
413
|
+
counts.select { |_, count| count >= min_count }
|
|
414
|
+
.sort_by { |_, count| -count }
|
|
415
|
+
.each { |term, count| f.puts "#{term}\t#{count}" }
|
|
416
|
+
end
|
|
417
|
+
```
|
|
418
|
+
|
|
419
|
+
## Development
|
|
420
|
+
|
|
421
|
+
After checking out the repo:
|
|
422
|
+
|
|
423
|
+
```bash
|
|
424
|
+
bundle install
|
|
425
|
+
bundle exec rake compile
|
|
426
|
+
bundle exec rake spec
|
|
427
|
+
```
|
|
428
|
+
|
|
429
|
+
To build the gem:
|
|
430
|
+
|
|
431
|
+
```bash
|
|
432
|
+
bundle exec rake build
|
|
433
|
+
```
|
|
434
|
+
|
|
435
|
+
## Platform Support
|
|
436
|
+
|
|
437
|
+
Pre-built gems are available for:
|
|
438
|
+
- macOS (x86_64, arm64)
|
|
439
|
+
- Linux (glibc, musl)
|
|
440
|
+
- Ruby 3.1, 3.2, 3.3
|
|
441
|
+
|
|
442
|
+
## Contributing
|
|
443
|
+
|
|
444
|
+
Bug reports and pull requests are welcome at https://github.com/scientist-labs/spellkit
|
|
445
|
+
|
|
446
|
+
## License
|
|
447
|
+
|
|
448
|
+
MIT License - see [LICENSE](LICENSE) file for details.
|