spellkit 0.1.0.pre.1

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: 8b0bb02947ee896cb9fab3cef53771c6a80e4e14123e518ffacb360b2c136b4d
4
+ data.tar.gz: 2a5af3e67e414f37fb6ff0a3c3a56a1d771e1480f29380b6196000f9f27e7071
5
+ SHA512:
6
+ metadata.gz: 1cf8d17fbdbcea925c32414e0d746ab4533604941b1603b127667f17f224b5c027f1aeb40b31f712c02a4bc584a1694ca75ef2a717563fe98d5d85d505a427da
7
+ data.tar.gz: 379eb6fa669ea2f13906cb121a69877414c1338f2b68509f9be55d79c29946eaf293136bbee7a766c75841db2d7fc6527cecbb01550a0ebd4cbfb97bcaf13f7c
data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Chris Petersen
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,448 @@
1
+ # SpellKit
2
+
3
+ Fast, safe typo correction for search-term extraction, wrapping the SymSpell algorithm in Rust via Magnus.
4
+
5
+ SpellKit provides:
6
+ - **Fast correction** using SymSpell with configurable edit distance (1 or 2)
7
+ - **Term protection** - never alter protected terms using exact matches or regex patterns
8
+ - **Hot reload** - update dictionaries without restarting your application
9
+ - **Sub-millisecond latency** - p95 < 2µs on small dictionaries
10
+ - **Thread-safe** - built with Rust's Arc<RwLock> for safe concurrent access
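For readers unfamiliar with SymSpell, the symmetric-delete idea behind it can be sketched in a few lines of plain Ruby. This is an illustration only, not SpellKit's Rust implementation, and the helper names (`deletes`, `build_index`, `candidates`) are hypothetical. The sketch only generates candidates; a complete implementation would also verify each candidate's true edit distance and rank the survivors by frequency.

```ruby
require "set"

# Every string reachable from `word` by deleting up to `max_distance` characters
# (the word itself is included).
def deletes(word, max_distance)
  results = Set[word]
  frontier = [word]
  max_distance.times do
    frontier = frontier.flat_map do |w|
      (0...w.length).map { |i| w[0...i] + w[(i + 1)..] }
    end.uniq
    results.merge(frontier)
  end
  results
end

# Map each delete variant back to the dictionary terms that produce it.
def build_index(terms, max_distance)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  terms.each do |term|
    deletes(term, max_distance).each { |variant| index[variant] << term }
  end
  index
end

# Candidate terms share at least one delete variant with the input word.
def candidates(index, word, max_distance)
  deletes(word, max_distance).flat_map { |variant| index.fetch(variant, []).to_a }.uniq
end

index = build_index(%w[hello world help], 1)
p candidates(index, "helo", 1)
```

With this toy dictionary, "helo" yields both "hello" and "help" as candidates, which is why a ranking step (distance first, then frequency) is needed afterwards.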
11
+
12
+ ## Installation
13
+
14
+ Add to your Gemfile:
15
+
16
+ ```ruby
17
+ gem "spellkit"
18
+ ```
19
+
20
+ Or install directly:
21
+
22
+ ```bash
23
+ gem install spellkit
24
+ ```
25
+
26
+ ## Quick Start
27
+
28
+ SpellKit works with dictionaries from URLs or local files. Try it immediately:
29
+
30
+ ```ruby
31
+ require "spellkit"
32
+
33
+ # Load from URL (downloads and caches automatically)
34
+ SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
35
+
36
+ # Or use a configure block (recommended for Rails)
37
+ SpellKit.configure do |config|
38
+ config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
39
+ config.edit_distance = 1
40
+ end
41
+
42
+ # Or load from local file
43
+ # SpellKit.load!(dictionary: "path/to/dictionary.tsv")
44
+
45
+ # Get suggestions for a misspelled word
46
+ suggestions = SpellKit.suggest("helo", 5)
47
+ puts suggestions.inspect
48
+ # => [{"term"=>"hello", "distance"=>1, "freq"=>...}]
49
+
50
+ # Correct a typo
51
+ corrected = SpellKit.correct_if_unknown("helo")
52
+ puts corrected
53
+ # => "hello"
54
+
55
+ # Batch correction
56
+ tokens = %w[helllo wrld ruby teset]
57
+ corrected_tokens = SpellKit.correct_tokens(tokens)
58
+ puts corrected_tokens.inspect
59
+ # => ["hello", "world", "ruby", "test"]
60
+
61
+ # Check stats
62
+ puts SpellKit.stats.inspect
63
+ # => {"loaded"=>true, "dictionary_size"=>..., "edit_distance"=>1, "loaded_at"=>...}
64
+ ```
65
+
66
+ ## Usage
67
+
68
+ ### Basic Correction
69
+
70
+ ```ruby
71
+ require "spellkit"
72
+
73
+ # Load from URL (auto-downloads and caches)
74
+ SpellKit.load!(dictionary: "https://example.com/dict.tsv")
75
+
76
+ # Or from local file
77
+ SpellKit.load!(dictionary: "models/dictionary.tsv", edit_distance: 1)
78
+
79
+ # Get suggestions
80
+ SpellKit.suggest("lyssis", 5)
81
+ # => [{"term"=>"lysis", "distance"=>1, "freq"=>2000}, ...]
82
+
83
+ # Correct a typo
84
+ SpellKit.correct_if_unknown("helo")
85
+ # => "hello"
86
+
87
+ # Batch correction
88
+ tokens = %w[helo wrld ruby]
89
+ SpellKit.correct_tokens(tokens)
90
+ # => ["hello", "world", "ruby"]
91
+ ```
92
+
93
+ ### Term Protection
94
+
95
+ Protect specific terms from correction using exact matches or regex patterns:
96
+
97
+ ```ruby
98
+ # Load with exact-match protected terms
99
+ SpellKit.load!(
100
+ dictionary: "models/dictionary.tsv",
101
+ protected_path: "models/protected.txt" # file with terms to protect
102
+ )
103
+
104
+ # Protect terms matching regex patterns
105
+ SpellKit.load!(
106
+ dictionary: "models/dictionary.tsv",
107
+ protected_patterns: [
108
+ /^[A-Z]{3,4}\d+$/, # gene symbols like CDK10, BRCA1
109
+ /^\d{2,7}-\d{2}-\d$/, # CAS numbers like 7732-18-5
110
+ /^[A-Z]{2,3}-\d+$/ # SKU patterns like ABC-123
111
+ ]
112
+ )
113
+
114
+ # Or combine both
115
+ SpellKit.load!(
116
+ dictionary: "models/dictionary.tsv",
117
+ protected_path: "models/protected.txt",
118
+ protected_patterns: [/^[A-Z]{3,4}\d+$/]
119
+ )
120
+
121
+ # Use guard: :domain to enable protection
122
+ SpellKit.correct_if_unknown("CDK10", guard: :domain)
123
+ # => "CDK10" # protected, never changed
124
+
125
+ # Batch correction with guards
126
+ tokens = %w[helo wrld ABC-123 for CDK10]
127
+ SpellKit.correct_tokens(tokens, guard: :domain)
128
+ # => ["hello", "world", "ABC-123", "for", "CDK10"]
129
+ ```
130
+
131
+ ### Multiple Instances
132
+
133
+ SpellKit supports multiple independent checker instances, useful for different domains or languages:
134
+
135
+ ```ruby
136
+ # Create separate instances for different domains
137
+ medical_checker = SpellKit::Checker.new
138
+ medical_checker.load!(
139
+ dictionary: "models/medical_dictionary.tsv",
140
+ protected_path: "models/medical_terms.txt"
141
+ )
142
+
143
+ legal_checker = SpellKit::Checker.new
144
+ legal_checker.load!(
145
+ dictionary: "models/legal_dictionary.tsv",
146
+ protected_path: "models/legal_terms.txt"
147
+ )
148
+
149
+ # Use them independently
150
+ medical_checker.suggest("lyssis", 5)
151
+ legal_checker.suggest("contractt", 5)
152
+
153
+ # Each maintains its own state
154
+ medical_checker.stats # Shows medical dictionary stats
155
+ legal_checker.stats # Shows legal dictionary stats
156
+ ```
157
+
158
+ ### Configuration Block
159
+
160
+ Use the configure block pattern for Rails initializers:
161
+
162
+ ```ruby
163
+ SpellKit.configure do |config|
164
+ config.dictionary = "models/dictionary.tsv"
165
+ config.protected_path = "models/protected.txt"
166
+ config.protected_patterns = [/^[A-Z]{3,4}\d+$/]
167
+ config.edit_distance = 1
168
+ config.frequency_threshold = 10.0
169
+ end
170
+
171
+ # This becomes the default instance
172
+ SpellKit.suggest("word", 5) # Uses configured dictionary
173
+ ```
174
+
175
+ ## Dictionary Format
176
+
177
+ ### Dictionary (required)
178
+
179
+ A plain-text file with one term and its frequency per line, separated by a tab or by spaces. Tab-separated:
180
+
181
+ ```
182
+ hello 10000
183
+ world 8000
184
+ lysis 2000
185
+ ```
186
+
187
+ Or space-separated:
188
+ ```
189
+ hello 10000
190
+ world 8000
191
+ lysis 2000
192
+ ```
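To sanity-check a dictionary file before loading it, a small parser for the format above can help. This is a sketch in plain Ruby, not SpellKit's own loader; `parse_dictionary` is a hypothetical helper, and it assumes frequencies are non-negative integers and that malformed lines should simply be skipped.

```ruby
# Parse dictionary text into a { term => frequency } hash.
# Assumption: lines without a term and an integer frequency are skipped.
def parse_dictionary(text)
  text.each_line.with_object({}) do |line, dict|
    term, freq = line.split # splits on any run of tabs or spaces
    next unless term && freq&.match?(/\A\d+\z/)

    dict[term] = freq.to_i
  end
end

sample = "hello\t10000\nworld 8000\n\nlysis  2000\nbroken\n"
dictionary = parse_dictionary(sample)
dictionary.each { |term, freq| puts "#{term}: #{freq}" }
```

The blank line and the `broken` line (no frequency) are dropped, so the sample yields three entries.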
193
+
194
+ ### Protected Terms (optional)
195
+
196
+ One term per line. Terms are matched case-insensitively:
197
+
198
+ **protected.txt**
199
+ ```
200
+ # Product codes
201
+ ABC-123
202
+ XYZ-999
203
+
204
+ # Technical terms
205
+ CDK10
206
+ BRCA1
207
+
208
+ # Brand names
209
+ MyBrand
210
+ SpecialTerm
211
+ ```
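The case-insensitive matching described above can be pictured with a short Ruby sketch. It is not SpellKit's implementation; `load_protected` and `protected_term?` are hypothetical helpers, and treating blank lines and lines starting with `#` as ignorable is an assumption based on the sample file.

```ruby
require "set"

# Build a lookup set from the contents of a protected-terms file.
# Assumptions: blank lines and "#" comment lines are ignored.
def load_protected(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
      .map(&:downcase)
      .to_set
end

# Case-insensitive membership test; the caller keeps the original token.
def protected_term?(protected_set, token)
  protected_set.include?(token.downcase)
end

protected_set = load_protected("# Product codes\nABC-123\n\n# Technical terms\nCDK10\n")
%w[abc-123 CDK10 hello].each do |token|
  puts "#{token}: #{protected_term?(protected_set, token)}"
end
```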
212
+
213
+ ## Dictionary Sources
214
+
215
+ SpellKit doesn't bundle dictionaries, but works with several sources:
216
+
217
+ ### Use the Default Dictionary (Recommended)
218
+ ```ruby
219
+ # English 80k word dictionary from SymSpell
220
+ SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
221
+ ```
222
+
223
+ ### Public Dictionary URLs
224
+ - **SymSpell English 80k**: `https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-80k.txt`
225
+ - **SymSpell English 500k**: `https://raw.githubusercontent.com/wolfgarbe/SymSpell/master/SymSpell.FrequencyDictionary/en-500k.txt`
226
+
227
+ ### Build Your Own
228
+ See the "Building Dictionaries" section below for how to create domain-specific dictionaries.
229
+
230
+ ### Caching
231
+ Dictionaries downloaded from URLs are cached in `~/.cache/spellkit/` for faster subsequent loads.
232
+
233
+ ## Configuration
234
+
235
+ ```ruby
236
+ SpellKit.load!(
237
+ dictionary: "models/dictionary.tsv", # required: path or URL
238
+ protected_path: "models/protected.txt", # optional
239
+ protected_patterns: [/^[A-Z]{3,4}\d+$/], # optional
240
+ edit_distance: 1, # 1 (default) or 2
241
+ frequency_threshold: 10.0 # default: 10.0
242
+ )
243
+ ```
244
+
245
+ ## API Reference
246
+
247
+ ### `SpellKit.load!(**options)`
248
+
249
+ Load or reload dictionaries. Thread-safe atomic swap. Accepts URLs (auto-downloads and caches) or local file paths.
250
+
251
+ **Options:**
252
+ - `dictionary:` (required) - URL or path to a dictionary file with one term and frequency per line (`term<TAB>frequency`; space-delimited files also work)
253
+ - `protected_path:` (optional) - Path to file with protected terms (one per line)
254
+ - `protected_patterns:` (optional) - Array of Regexp or String patterns to protect
255
+ - `edit_distance:` (default: 1) - Maximum edit distance (1 or 2)
256
+ - `frequency_threshold:` (default: 10.0) - Minimum frequency ratio for corrections
257
+
258
+ **Examples:**
259
+ ```ruby
260
+ # From URL (recommended for getting started)
261
+ SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)
262
+
263
+ # From custom URL
264
+ SpellKit.load!(dictionary: "https://example.com/dict.tsv")
265
+
266
+ # From local file
267
+ SpellKit.load!(dictionary: "/path/to/dictionary.tsv")
268
+ ```
269
+
270
+ ### `SpellKit.suggest(word, max = 5)`
271
+
272
+ Get ranked suggestions for a word.
273
+
274
+ **Parameters:**
275
+ - `word` (required) - The word to get suggestions for
276
+ - `max` (optional, default: 5) - Maximum number of suggestions to return
277
+
278
+ **Returns:** Array of hashes with `"term"`, `"distance"`, and `"freq"` keys
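As a sketch of how the returned array might be consumed, the snippet below picks the first suggestion within a distance limit. The sample data is hypothetical but shaped like the documented return value, `best_term` is not part of SpellKit, and the code assumes the array is ordered best-first, as "ranked" suggests.

```ruby
# Hypothetical sample shaped like the documented return value of suggest.
suggestions = [
  { "term" => "lysis", "distance" => 1, "freq" => 2000 },
  { "term" => "lysin", "distance" => 2, "freq" => 150 }
]

# Return the first suggestion within max_distance, or nil if there is none.
# Assumption: the array is already ranked best-first.
def best_term(suggestions, max_distance: 1)
  hit = suggestions.find { |s| s["distance"] <= max_distance }
  hit && hit["term"]
end

puts best_term(suggestions)
```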
279
+
280
+ ### `SpellKit.correct_if_unknown(word, guard:)`
281
+
282
+ Returns the corrected word, or the original word if no better match is found.
283
+
284
+ **Options:**
285
+ - `guard:` - Set to `:domain` to enable protection checks
286
+
287
+ ### `SpellKit.correct_tokens(tokens, guard:)`
288
+
289
+ Batch correction of an array of tokens.
290
+
291
+ **Returns:** Array of corrected strings
292
+
293
+ ### `SpellKit.stats`
294
+
295
+ Get current state statistics.
296
+
297
+ **Returns:** Hash with:
298
+ - `"loaded"` - Boolean
299
+ - `"dictionary_size"` - Number of terms
300
+ - `"edit_distance"` - Configured edit distance
301
+ - `"loaded_at"` - Unix timestamp
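Since `"loaded_at"` is a Unix timestamp, it can be converted with `Time.at` for logging or monitoring. The sketch below works on a hypothetical sample hash shaped like the documented return value; `describe_stats` is an illustrative helper, not part of SpellKit.

```ruby
require "time"

# Hypothetical sample shaped like the documented stats hash.
stats = {
  "loaded" => true,
  "dictionary_size" => 80_000,
  "edit_distance" => 1,
  "loaded_at" => 1_700_000_000
}

# Summarize the stats hash as a one-line, human-readable string.
def describe_stats(stats, now: Time.now)
  return "dictionary not loaded" unless stats["loaded"]

  loaded_at = Time.at(stats["loaded_at"]).utc
  age_seconds = (now - loaded_at).round
  "#{stats["dictionary_size"]} terms, loaded #{age_seconds}s ago (#{loaded_at.iso8601})"
end

puts describe_stats(stats)
```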
302
+
303
+ ### `SpellKit.healthcheck`
304
+
305
+ Verifies that SpellKit is properly loaded and raises an error if it is not.
306
+
307
+ ## Term Protection
308
+
309
+ The `guard: :domain` option enables protection for specific terms:
310
+
311
+ ### Exact Matches
312
+ Terms listed in the `protected_path` file are never corrected, even if similar dictionary words exist. Matching is case-insensitive, but the original casing is preserved in the output.
313
+
314
+ ### Pattern Matching
315
+ Terms matching any pattern in `protected_patterns` are protected. Patterns can be:
316
+ - Ruby Regexp objects: `/^[A-Z]{3,4}\d+$/`
317
+ - Regex strings: `"^[A-Z]{3,4}\\d+$"`
318
+
319
+ ### Examples
320
+ ```ruby
321
+ # Protect specific terms
322
+ protected_patterns: [
323
+ /^[A-Z]{3,4}\d+$/, # Gene symbols: CDK10, BRCA1
324
+ /^\d{2,7}-\d{2}-\d$/, # CAS numbers: 7732-18-5
325
+ /^[A-Z]{2,3}-\d+$/ # Product codes: ABC-123
326
+ ]
327
+ ```
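Before passing patterns to SpellKit, it can be useful to check in plain Ruby which tokens they match. The sketch below uses only Ruby's own `Regexp`; `matches_any?` is a hypothetical helper. It also shows that the string form compiles to the same pattern as the literal form. How SpellKit evaluates the patterns internally is not shown here.

```ruby
patterns = [
  /^[A-Z]{3,4}\d+$/,               # gene symbols
  /^\d{2,7}-\d{2}-\d$/,            # CAS numbers
  Regexp.new("^[A-Z]{2,3}-\\d+$")  # string form, compiled to a Regexp
]

# True if any pattern matches the token.
def matches_any?(patterns, token)
  patterns.any? { |pattern| pattern.match?(token) }
end

%w[CDK10 BRCA1 7732-18-5 ABC-123 hello cdk10].each do |token|
  puts "#{token}: #{matches_any?(patterns, token)}"
end
```

In plain Ruby these patterns are case-sensitive, so `cdk10` does not match. Tokens that have been downcased before matching would therefore only be covered by the case-insensitive exact-match list, unless the patterns are written to allow lowercase.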
328
+
329
+ ## Rails Integration
330
+
331
+ ```ruby
332
+ # config/initializers/spellkit.rb
333
+
334
+ # Option 1: Use default dictionary (easiest)
335
+ SpellKit.configure do |config|
336
+ config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
337
+ end
338
+
339
+ # Option 2: Use local dictionary with full configuration
340
+ SpellKit.configure do |config|
341
+ config.dictionary = Rails.root.join("models/dictionary.tsv")
342
+ config.protected_path = Rails.root.join("models/protected.txt")
343
+ config.protected_patterns = [
344
+ /^[A-Z]{3,4}\d+$/, # Product codes
345
+ /^\d{2,7}-\d{2}-\d$/ # Reference numbers
346
+ ]
347
+ config.edit_distance = 1
348
+ config.frequency_threshold = 10.0
349
+ end
350
+
351
+ # Option 3: Multiple domain-specific instances
352
+ # config/initializers/spellkit.rb
353
+ module SpellCheckers
354
+ MEDICAL = SpellKit::Checker.new.tap do |c|
355
+ c.load!(
356
+ dictionary: Rails.root.join("models/medical_dictionary.tsv"),
357
+ protected_path: Rails.root.join("models/medical_terms.txt")
358
+ )
359
+ end
360
+
361
+ LEGAL = SpellKit::Checker.new.tap do |c|
362
+ c.load!(
363
+ dictionary: Rails.root.join("models/legal_dictionary.tsv"),
364
+ protected_path: Rails.root.join("models/legal_terms.txt")
365
+ )
366
+ end
367
+ end
368
+
369
+ # In your search preprocessing
370
+ class SearchPreprocessor
371
+ def self.correct_query(text)
372
+ tokens = text.downcase.split(/\s+/)
373
+ SpellKit.correct_tokens(tokens, guard: :domain).join(" ")
374
+ end
375
+ end
376
+ ```
377
+
378
+ ## Performance
379
+
380
+ Benchmarked on M1 MacBook Pro with 20-term test dictionary:
381
+
382
+ - **Load time**: < 100ms
383
+ - **Suggestion latency**: p50 < 2µs, p95 < 2µs
384
+ - **Guard checks**: p95 < 1µs
385
+ - **Memory**: ~150MB for 1M term dictionary (estimated)
386
+
387
+ Target for production (1-5M terms):
388
+ - Load: < 500ms
389
+ - p50: < 30µs, p95: < 100µs
390
+ - Memory: 50-150MB
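To check figures like these against your own dictionary, latency percentiles can be measured with a few lines of Ruby. This is a sketch under assumptions: `percentile` uses the nearest-rank method, `measure_us` is a hypothetical helper, and the timed block is a stand-in that should be replaced with a real call such as `SpellKit.suggest("helo", 5)` once a dictionary is loaded.

```ruby
# Nearest-rank percentile of an array of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank - 1, 0].max]
end

# Run the given block `iterations` times; returns each latency in microseconds.
def measure_us(iterations)
  Array.new(iterations) do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
    yield
    (Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - start) / 1000.0
  end
end

# Stand-in workload; swap in a SpellKit call to measure real latency.
latencies = measure_us(1_000) { "helo".reverse }
puts "p50=#{percentile(latencies, 50)}µs p95=#{percentile(latencies, 95)}µs"
```

Timer overhead is a noticeable share of any measurement at the microsecond scale, so results near the low end are best read as upper bounds.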
391
+
392
+ ## Building Dictionaries
393
+
394
+ Create your dictionary from your corpus:
395
+
396
+ ```ruby
397
+ # example_builder.rb
398
+ # Counts word frequencies in corpus.txt and writes them to dictionary.tsv.
399
+
400
+ counts = Hash.new(0)
401
+
402
+ # Read your corpus
403
+ File.foreach("corpus.txt") do |line|
404
+ line.downcase.split(/\W+/).each do |word|
405
+ next if word.length < 3
406
+ counts[word] += 1
407
+ end
408
+ end
409
+
410
+ # Filter by minimum count and write
411
+ min_count = 5
412
+ File.open("dictionary.tsv", "w") do |f|
413
+ counts.select { |_, count| count >= min_count }
414
+ .sort_by { |_, count| -count }
415
+ .each { |term, count| f.puts "#{term}\t#{count}" }
416
+ end
417
+ ```
418
+
419
+ ## Development
420
+
421
+ After checking out the repo:
422
+
423
+ ```bash
424
+ bundle install
425
+ bundle exec rake compile
426
+ bundle exec rake spec
427
+ ```
428
+
429
+ To build the gem:
430
+
431
+ ```bash
432
+ bundle exec rake build
433
+ ```
434
+
435
+ ## Platform Support
436
+
437
+ Pre-built gems available for:
438
+ - macOS (x86_64, arm64)
439
+ - Linux (glibc, musl)
440
+ - Ruby 3.1, 3.2, 3.3
441
+
442
+ ## Contributing
443
+
444
+ Bug reports and pull requests are welcome at https://github.com/scientist-labs/spellkit
445
+
446
+ ## License
447
+
448
+ MIT License - see [LICENSE](LICENSE) file for details.