ruby-spellchecker 0.1.1 → 0.1.2
- checksums.yaml +4 -4
- data/.github/workflows/benchmark.yml +14 -0
- data/.github/workflows/rspec.yml +26 -0
- data/.github/workflows/rubocop.yml +26 -0
- data/.rspec +0 -1
- data/.rubocop.yml +0 -7
- data/README.md +71 -12
- data/benchmark/benchmark.rb +29 -0
- data/dictionaries/company_names.txt +0 -3
- data/dictionaries/ngrams.csv +0 -8
- data/dictionaries/typos.csv +5 -7
- data/lib/spellchecker.rb +1 -3
- data/lib/spellchecker/detect_duplicate.rb +9 -5
- data/lib/spellchecker/detect_typo.rb +2 -35
- data/lib/spellchecker/dictionaries/typos_list.rb +2 -2
- data/lib/spellchecker/dictionaries/us_toponyms.rb +4 -3
- data/lib/spellchecker/tokenizer.rb +25 -16
- data/lib/spellchecker/tokenizer/token.rb +5 -0
- data/lib/spellchecker/version.rb +1 -1
- data/ruby-spellchecker.gemspec +4 -2
- metadata +42 -12
- data/.travis.yml +0 -6
- data/LICENSE +0 -21
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 19fe4bc1957bb2abc21b2cdba7ccc55ab33cefc7c064055d35a338a01fbd910d
|
4
|
+
data.tar.gz: 74f000046be2ba09622d6bf725058a4b018e7fdd81e340b9a61593709312c626
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 18c5dfde1bb90223e24a87da7a68c15f85b57cfd8709c8a1938d2e0c2e9dfbb12382cc99091209afd5f6d9951840113658a9c0d5e57d34a24dc394a935522481
|
7
|
+
data.tar.gz: 46c48cb356d5f3f825bfe3bb0ce4998e36d3b69ed233592905e9d7eee1a456c524019f8606673ef8bdbe329bf8a68ea3663c5395fdb8e6b47647379942a6dbce
|
data/.github/workflows/benchmark.yml
ADDED
@@ -0,0 +1,14 @@
|
|
1
|
+
name: Benchmark
|
2
|
+
on: push
|
3
|
+
|
4
|
+
jobs:
|
5
|
+
verify:
|
6
|
+
runs-on: ubuntu-latest
|
7
|
+
steps:
|
8
|
+
- uses: actions/checkout@v2
|
9
|
+
- name: Set up Ruby 2.6.0
|
10
|
+
uses: ruby/setup-ruby@v1
|
11
|
+
with:
|
12
|
+
ruby-version: 2.6.0
|
13
|
+
- name: Run benchmarks
|
14
|
+
run: ruby benchmark/benchmark.rb
|
data/.github/workflows/rspec.yml
ADDED
@@ -0,0 +1,26 @@
|
|
1
|
+
name: Rspec
|
2
|
+
on: push
|
3
|
+
|
4
|
+
jobs:
|
5
|
+
verify:
|
6
|
+
runs-on: ubuntu-latest
|
7
|
+
steps:
|
8
|
+
- uses: actions/checkout@v2
|
9
|
+
- name: Set up Ruby 2.6.0
|
10
|
+
uses: ruby/setup-ruby@v1
|
11
|
+
with:
|
12
|
+
ruby-version: 2.6.0
|
13
|
+
- name: Cache gems
|
14
|
+
uses: actions/cache@v1
|
15
|
+
with:
|
16
|
+
path: vendor/bundle
|
17
|
+
key: ${{ runner.os }}-gem-${{ hashFiles('**/Gemfile.lock') }}
|
18
|
+
restore-keys: |
|
19
|
+
${{ runner.os }}-gem-
|
20
|
+
- name: Install gems
|
21
|
+
run: |
|
22
|
+
gem install bundler
|
23
|
+
bundle config path vendor/bundle
|
24
|
+
bundle install --jobs 4 --retry 3
|
25
|
+
- name: Run RSpec
|
26
|
+
run: bundle exec rspec
|
data/.github/workflows/rubocop.yml
ADDED
@@ -0,0 +1,26 @@
|
|
1
|
+
name: Rubocop
|
2
|
+
on: push
|
3
|
+
|
4
|
+
jobs:
|
5
|
+
rubocop:
|
6
|
+
runs-on: ubuntu-latest
|
7
|
+
steps:
|
8
|
+
- uses: actions/checkout@v2
|
9
|
+
- name: Set up Ruby 2.6.0
|
10
|
+
uses: ruby/setup-ruby@v1
|
11
|
+
with:
|
12
|
+
ruby-version: 2.6.0
|
13
|
+
- name: Cache gems
|
14
|
+
uses: actions/cache@v1
|
15
|
+
with:
|
16
|
+
path: vendor/bundle
|
17
|
+
key: ${{ runner.os }}-gem-${{ hashFiles('**/Gemfile.lock') }}
|
18
|
+
restore-keys: |
|
19
|
+
${{ runner.os }}-gem-
|
20
|
+
- name: Install gems
|
21
|
+
run: |
|
22
|
+
gem install bundler
|
23
|
+
bundle config path vendor/bundle
|
24
|
+
bundle install --jobs 4 --retry 3
|
25
|
+
- name: Run RuboCop
|
26
|
+
run: bundle exec rubocop
|
data/.rspec
CHANGED
data/.rubocop.yml
CHANGED
data/README.md
CHANGED
@@ -1,15 +1,11 @@
|
|
1
|
-
# Spellchecker
|
2
|
-
|
3
|
-
Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/spellchecker`. To experiment with that code, run `bin/console` for an interactive prompt.
|
4
|
-
|
5
|
-
TODO: Delete this and the text above, and describe your gem
|
1
|
+
# Ruby Spellchecker
|
6
2
|
|
7
3
|
## Installation
|
8
4
|
|
9
5
|
Add this line to your application's Gemfile:
|
10
6
|
|
11
7
|
```ruby
|
12
|
-
gem 'spellchecker'
|
8
|
+
gem 'ruby-spellchecker'
|
13
9
|
```
|
14
10
|
|
15
11
|
And then execute:
|
@@ -22,14 +18,77 @@ Or install it yourself as:
|
|
22
18
|
|
23
19
|
## Usage
|
24
20
|
|
25
|
-
|
21
|
+
### Get list of errors
|
22
|
+
|
23
|
+
```ruby
|
24
|
+
Spellchecker.check(text)
|
25
|
+
```
|
26
|
+
|
27
|
+
### Autocorrection
|
26
28
|
|
27
|
-
|
29
|
+
```ruby
|
30
|
+
text = <<~TEXT
|
31
|
+
I started my schooling as the majority did in my area, at the local
|
32
|
+
primarry school. I then went to the local secondarry school and
|
33
|
+
recieved grades in English, Maths, Phisics, Biology, Geography,
|
34
|
+
Art, Graphical Comunication and Philosophy of Religeon. I'll not
|
35
|
+
bore you with the 'A' levels and above.
|
36
|
+
|
37
|
+
Notice the ambigous English qualification above. It was, in truth,
|
38
|
+
a cource dedicated to reading "Lord of the flies" and other gems,
|
39
|
+
and a weak atempt at getting us to commprehend them. Luckilly my
|
40
|
+
middle-class upbringing gave me a head start as I was was already
|
41
|
+
aquainted with that sort of langauge these books used (and not just
|
42
|
+
the Peter and Jane books) and had read simillar books before. I will
|
43
|
+
never be able to put that paticular course down as much as I desire
|
44
|
+
to because, for all its faults, it introduced me to Steinbeck,
|
45
|
+
Malkovich and the wonders of Lenny, mice and pockets.
|
46
|
+
|
47
|
+
My education never included one iota of grammar. Lynn Truss points
|
48
|
+
out in "Eats, shoots and leaves" that many people were excused from
|
49
|
+
the rigours of learning English grammar during their schooling over
|
50
|
+
the last 30 or so years because the majority or decision-makers
|
51
|
+
decided one day that it might hinder imagination and expresion (so
|
52
|
+
what, I ask, happened to all those expresive and imaginative people
|
53
|
+
before the ruling?).
|
54
|
+
TEXT
|
55
|
+
|
56
|
+
corrected = Spellchecker.correct(text)
|
57
|
+
```
|
58
|
+
|
59
|
+
Wdiff:
|
28
60
|
|
29
|
-
|
61
|
+
```ruby
|
62
|
+
require 'wdiff'
|
30
63
|
|
31
|
-
|
64
|
+
Wdiff.diff(text, corrected)
|
32
65
|
|
33
|
-
|
66
|
+
```
|
34
67
|
|
35
|
-
|
68
|
+
Result:
|
69
|
+
|
70
|
+
```diff
|
71
|
+
I started my schooling as the majority did in my area, at the local
|
72
|
+
[-primarry-] {+primary+} school. I then went to the local [-secondarry-] {+secondary+} school and
|
73
|
+
[-recieved-] {+received+} grades in English, Maths, [-Phisics,-] {+Physics,+} Biology, Geography,
|
74
|
+
Art, Graphical Comunication and Philosophy of [-Religeon.-] {+Religion.+} I'll not
|
75
|
+
bore you with the 'A' levels and above.
|
76
|
+
|
77
|
+
Notice the [-ambigous-] {+ambiguous+} English qualification above. It was, in truth,
|
78
|
+
a [-cource-] {+course+} dedicated to reading "Lord of the flies" and other gems,
|
79
|
+
and a weak [-atempt-] {+attempt+} at getting us to [-commprehend-] {+comprehend+} them. [-Luckilly-] {+Luckily+} my
|
80
|
+
middle-class upbringing gave me a head start as I was [-was-] already
|
81
|
+
[-aquainted-] {+acquainted+} with that sort of [-langauge-] {+language+} these books used (and not just
|
82
|
+
the Peter and Jane books) and had read [-simillar-] {+similar+} books before. I will
|
83
|
+
never be able to put that [-paticular-] {+particular+} course down as much as I desire
|
84
|
+
to because, for all its faults, it introduced me to Steinbeck,
|
85
|
+
Malkovich and the wonders of Lenny, mice and pockets.
|
86
|
+
|
87
|
+
My education never included one iota of grammar. Lynn Truss points
|
88
|
+
out in "Eats, shoots and leaves" that many people were excused from
|
89
|
+
the rigours of learning English grammar during their schooling over
|
90
|
+
the last 30 or so years because the majority or decision-makers
|
91
|
+
decided one day that it might hinder imagination and [-expresion-] {+expression+} (so
|
92
|
+
what, I ask, happened to all those [-expresive-] {+expressive+} and imaginative people
|
93
|
+
before the ruling?).
|
94
|
+
```
|
data/benchmark/benchmark.rb
ADDED
@@ -0,0 +1,29 @@
|
|
1
|
+
# frozen_string_literal: true
|
2
|
+
|
3
|
+
require 'benchmark'
|
4
|
+
require_relative '../lib/spellchecker'
|
5
|
+
|
6
|
+
text1 = <<~TEXT
|
7
|
+
I started my schooling as the majority did in my area, at the local primarry school. I then went to the local secondarry school and recieved grades in English, Maths, Phisics, Biology, Geography, Art, Graphical Comunication and Philosophy of Religeon. I'll not bore you with the 'A' levels and above.
|
8
|
+
Notice the ambigous English qualification above. It was, in truth, a cource dedicated to reading "Lord of the flies" and other gems, and a weak atempt at getting us to commprehend them. Luckilly my middle-class upbringing gave me a head start as I was already aquainted with that sort of langauge these books used (and not just the Peter and Jane books) and had read simillar books before. I will never be able to put that paticular course down as much as I desire to because, for all its faults, it introduced me to Steinbeck, Malkovich and the wonders of Lenny, mice and pockets.
|
9
|
+
My education never included one iota of grammar. Lynn Truss points out in "Eats, shoots and leaves" that many people were excused from the rigours of learning English grammar during their schooling over the last 30 or so years because the majority or decision-makers decided one day that it might hinder imagination and expresion (so what, I ask, happened to all those expresive and imaginative people before the ruling?).
|
10
|
+
|
11
|
+
I started my schooling as the majority did in my area, at the local primary school. I then went to the local secondary school and received grades in English, Maths, Physics, Biology, Geography, Art, Graphical Communication and Philosophy of Religion. I'll not bore you with the 'A' levels and above.
|
12
|
+
Notice the ambiguous English qualification above. It was, in truth, a course dedicated to reading "Lord of the flies" and other gems, and a weak attempt at getting us to comprehend them. Luckily my middle-class upbringing gave me a head start as I was already acquainted with that sort of language these books used (and not just the Peter and Jane books) and had read similar books before. I will never be able to put that particular course down as much as I desire to because, for all its faults, it introduced me to Steinbeck, Malkovich and the wonders of Lenny, mice and pockets.
|
13
|
+
My education never included one iota of grammar. Lynn Truss points out in "Eats, shoots and leaves" that many people were excused from the rigours of learning English grammar during their schooling over the last 30 or so years because the majority or decision-makers decided one day that it might hinder imagination and expression (so what, I ask, happened to all those expressive and imaginative people before the ruling?).
|
14
|
+
TEXT
|
15
|
+
|
16
|
+
text2 = <<~TEXT
|
17
|
+
Mail Attachment Support Viewable document types (apple.com)
|
18
|
+
.jpg, .tiff, .gif (images); .doc and .docx (Microsoft Word); .htm and .html (web pages); .key (Keynote); .numbers (Numbers); .pages (Pages); .pdf (Preview and Adobe Acrobat); .ppt and .pptx (Microsoft PowerPoint); .txt (text); .rtf (rich text format); .vcf (contact information); .xls and .xlsx (Microsoft Excel); .zip; .ics; .usdz (USDZ-Universal).
|
19
|
+
TEXT
|
20
|
+
|
21
|
+
text = text1 + ([text2] * 5).join("\n")
|
22
|
+
|
23
|
+
Spellchecker.check(text)
|
24
|
+
|
25
|
+
Benchmark.bm do |x|
|
26
|
+
x.report('tokenize') { 500.times { Spellchecker::Tokenizer.call(text) } }
|
27
|
+
x.report('check ') { 500.times { Spellchecker.check(text) } }
|
28
|
+
x.report('correct ') { 500.times { Spellchecker.correct(text) } }
|
29
|
+
end
|
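Note on the benchmark script: it calls `Spellchecker.check(text)` once before the timed section. The diff does not say why; a plausible reading (an assumption) is that the first call pays a one-time cost such as loading dictionaries, which should not be counted. A minimal sketch of that warm-up-then-measure pattern follows. `LazyLookup` is a hypothetical stand-in, not part of the gem.

```ruby
require 'benchmark'

# Hypothetical stand-in for a library whose first call pays a one-time
# setup cost (for example, reading a dictionary into memory).
module LazyLookup
  module_function

  def words
    @words ||= begin
      sleep 0.05 # simulate the expensive one-time load
      %w[primary secondary course]
    end
  end

  def known?(word)
    words.include?(word)
  end
end

# Warm-up call: triggers the one-time setup outside the measured block.
LazyLookup.known?('course')

# Benchmark.bm prints a table and returns one Benchmark::Tms per report.
results = Benchmark.bm(8) do |x|
  x.report('known?') { 500.times { LazyLookup.known?('course') } }
end
```

Without the warm-up call, the simulated load time would be folded into the first report and distort the comparison between reports.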
data/dictionaries/company_names.txt
CHANGED
@@ -350588,9 +350588,6 @@ Comunicatii
|
|
350588
350588
|
Comunicatiilor
|
350589
350589
|
Comunicating
|
350590
350590
|
Comunicatio
|
350591
|
-
Comunication
|
350592
|
-
Comunicational
|
350593
|
-
Comunications
|
350594
350591
|
Comunicatistampa
|
350595
350592
|
Comunicativa
|
350596
350593
|
Comunicativas
|
data/dictionaries/ngrams.csv
CHANGED
@@ -452,7 +452,6 @@ atlanta journal and constitution,Atlanta Journal-Constitution
|
|
452
452
|
atlanta journal constitution,Atlanta Journal-Constitution
|
453
453
|
atlanta-journal and constitution,Atlanta Journal-Constitution
|
454
454
|
atlanta-journal constitution,Atlanta Journal-Constitution
|
455
|
-
atlantic ocean,atlantic Ocean
|
456
455
|
award winning,award-winning
|
457
456
|
b'nai brith,B'nai B'rith
|
458
457
|
b'nai b’rith,B'nai B'rith
|
@@ -1318,7 +1317,6 @@ in tact,intact
|
|
1318
1317
|
in their life time,in their lifetime
|
1319
1318
|
in their life-time,in their lifetime
|
1320
1319
|
in united states,in the United States
|
1321
|
-
indian ocean,indian Ocean
|
1322
1320
|
indira gahndi,Indira Gandhi
|
1323
1321
|
indira ghandi,Indira Gandhi
|
1324
1322
|
inherlife time,inher lifetime
|
@@ -1585,7 +1583,6 @@ lloyds of london,Lloyd's of London
|
|
1585
1583
|
long awaited,long-awaited
|
1586
1584
|
longer then,longer than
|
1587
1585
|
loosing on penalties,losing on penalties
|
1588
|
-
lorem ipsum dolor sit,[default text]
|
1589
1586
|
los angelas,los Angeles
|
1590
1587
|
los angels,los Angeles
|
1591
1588
|
los angles,los Angeles
|
@@ -1732,8 +1729,6 @@ mostly knowed as,mostly known as
|
|
1732
1729
|
mostly knowed for,mostly known for
|
1733
1730
|
mostly knows as,mostly known as
|
1734
1731
|
mostly knows for,mostly known for
|
1735
|
-
moyen age,moyen Âge
|
1736
|
-
moyen âge,moyen Âge
|
1737
1732
|
muhammed ali,Muhammad Ali
|
1738
1733
|
mullerian duct,Müllerian Duct
|
1739
1734
|
mullerian ducts,Müllerian Ducts
|
@@ -1911,7 +1906,6 @@ over sized,oversized
|
|
1911
1906
|
over-size,oversize
|
1912
1907
|
over-sized,oversized
|
1913
1908
|
owning to,owing to
|
1914
|
-
pacific ocean,pacific Ocean
|
1915
1909
|
palm d'or,Palme d'Or
|
1916
1910
|
palm d`or,Palme d'Or
|
1917
1911
|
palm d’or,Palme d'Or
|
@@ -2263,8 +2257,6 @@ the fist time,the first time
|
|
2263
2257
|
the frist time,the first time
|
2264
2258
|
the just the,just the
|
2265
2259
|
the least the least,the least
|
2266
|
-
the on going,the ongoing
|
2267
|
-
the on-going,the ongoing
|
2268
2260
|
the question how,the question of how
|
2269
2261
|
the question where,the question of where
|
2270
2262
|
the roughly the,roughly the
|
data/dictionaries/typos.csv
CHANGED
@@ -8898,7 +8898,6 @@ arful,awful
|
|
8898
8898
|
arfull,awful
|
8899
8899
|
arfully,artfully
|
8900
8900
|
arfument,argument
|
8901
|
-
arg,argument
|
8902
8901
|
argement,argument
|
8903
8902
|
argentia,argentina
|
8904
8903
|
argentinia,argentina
|
@@ -10568,7 +10567,6 @@ attrbibutes,attributes
|
|
10568
10567
|
attrbiutes,attributes
|
10569
10568
|
attrbute,attribute
|
10570
10569
|
attrbutes,attributes
|
10571
|
-
attrib,attribute
|
10572
10570
|
attribbutes,attributes
|
10573
10571
|
attribites,attributes
|
10574
10572
|
attribte,attribute
|
@@ -10609,7 +10607,6 @@ attrivute,attribute
|
|
10609
10607
|
attrocious,atrocious
|
10610
10608
|
attrocities,atrocities
|
10611
10609
|
attrocity,atrocity
|
10612
|
-
attrs,attributes
|
10613
10610
|
attruibutes,attributes
|
10614
10611
|
atttempts,attempts
|
10615
10612
|
atttract,attract
|
@@ -20329,6 +20326,8 @@ commpletion,completion
|
|
20329
20326
|
commplexity,complexity
|
20330
20327
|
commplishion,completion
|
20331
20328
|
commpm,common
|
20329
|
+
commprehend,comprehend
|
20330
|
+
commprehended,comprehended
|
20332
20331
|
commpression,compression
|
20333
20332
|
commptiblity,commptibility
|
20334
20333
|
commpunted,competent
|
@@ -26494,7 +26493,8 @@ countufersey,controversy
|
|
26494
26493
|
countuness,countenance
|
26495
26494
|
couontable,countable
|
26496
26495
|
coupld,couple
|
26497
|
-
cource,
|
26496
|
+
cource,course
|
26497
|
+
primarry,primary
|
26498
26498
|
cources,courses
|
26499
26499
|
courcework,coursework
|
26500
26500
|
courching,crouching
|
@@ -40347,7 +40347,6 @@ enusre,ensure
|
|
40347
40347
|
enusres,ensures
|
40348
40348
|
enusring,ensuring
|
40349
40349
|
enuthic,enthusiastic
|
40350
|
-
env,environment
|
40351
40350
|
enveloppe,envelope
|
40352
40351
|
envelopped,envelope
|
40353
40352
|
enveloppen,envelope
|
@@ -61708,7 +61707,6 @@ isolatuon,isolation
|
|
61708
61707
|
isoldation,isolation
|
61709
61708
|
isomorphim,isomorphism
|
61710
61709
|
isomorphims,isomorphisms
|
61711
|
-
isort,frosted
|
61712
61710
|
isotretioin,isotretion
|
61713
61711
|
isotrop,isotope
|
61714
61712
|
ispired,inspired
|
@@ -96580,6 +96578,7 @@ secodns,seconds
|
|
96580
96578
|
secods,seconds
|
96581
96579
|
secomdary,secondary
|
96582
96580
|
secondady,secondary
|
96581
|
+
secondarry,secondary
|
96583
96582
|
seconday,secondary
|
96584
96583
|
seconderies,secondaries
|
96585
96584
|
secondery,secondary
|
@@ -112694,7 +112693,6 @@ unitl,until
|
|
112694
112693
|
unitoligist,unitologist
|
112695
112694
|
unitoligists,unitologists
|
112696
112695
|
unitomious,unanimous
|
112697
|
-
unittests,unit
|
112698
112696
|
uniue,unique
|
112699
112697
|
univeral,universal
|
112700
112698
|
univeralism,universalism
|
data/lib/spellchecker.rb
CHANGED
@@ -13,8 +13,6 @@ require_relative 'spellchecker/detect_typo'
|
|
13
13
|
require_relative 'spellchecker/detect_ngram'
|
14
14
|
|
15
15
|
module Spellchecker
|
16
|
-
NGRAM_NUMBER = 5
|
17
|
-
|
18
16
|
module MistakeTypes
|
19
17
|
ALL = [
|
20
18
|
DUPLICATE = 'duplicate',
|
@@ -60,7 +58,7 @@ module Spellchecker
|
|
60
58
|
# @param mistakes [Array<Spellchecker::Mistake>]
|
61
59
|
# @return [String]
|
62
60
|
def apply_fixes(text, mistakes)
|
63
|
-
mistakes_hash = mistakes.map { |m| [m.
|
61
|
+
mistakes_hash = mistakes.map { |m| [m.text, m.correction] }.to_h
|
64
62
|
regexp = Regexp.union(mistakes_hash.keys)
|
65
63
|
|
66
64
|
text.gsub(regexp, mistakes_hash)
|
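Note on `apply_fixes`: the new line maps each mistake's text to its correction, then passes that hash to `String#gsub` along with a `Regexp.union` of the keys. The sketch below reproduces the technique in isolation. The `Mistake` struct is a hypothetical stand-in for `Spellchecker::Mistake` and models only the two fields used here.

```ruby
# Hypothetical stand-in for Spellchecker::Mistake; only the fields that
# the replacement step reads are modelled.
Mistake = Struct.new(:text, :correction, keyword_init: true)

def apply_fixes(text, mistakes)
  # Map each misspelled fragment to its replacement.
  fixes = mistakes.map { |m| [m.text, m.correction] }.to_h
  # Regexp.union escapes the strings and joins them as alternatives.
  regexp = Regexp.union(fixes.keys)
  # Given a hash, gsub replaces each match with the value stored under it.
  text.gsub(regexp, fixes)
end

mistakes = [
  Mistake.new(text: 'recieved', correction: 'received'),
  Mistake.new(text: 'cource', correction: 'course')
]

puts apply_fixes('I recieved grades in a cource.', mistakes)
# prints "I received grades in a course."
```

Because the lookup is keyed by text and not by position, this sketch replaces every occurrence of a listed fragment in the input.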
data/lib/spellchecker/detect_duplicate.rb
CHANGED
@@ -12,7 +12,8 @@ module Spellchecker
|
|
12
12
|
yum yummy agar kori lai please mumble extremely
|
13
13
|
highly root whoa knock check woof bounce bouncy
|
14
14
|
million tut wow mola paw hubba histrio cha nom
|
15
|
-
chop same extra more bang big go no pom
|
15
|
+
chop same extra more bang big go no pom la ah
|
16
|
+
ha oh ew]
|
16
17
|
).freeze
|
17
18
|
|
18
19
|
SKIP_PHRASES = Set.new(['try and', 'and try', 'and again', 'again and',
|
@@ -46,13 +47,13 @@ module Spellchecker
|
|
46
47
|
text, correction = find_duplicate(t1, t2, t3, t4)
|
47
48
|
|
48
49
|
return unless text
|
49
|
-
return if t2.
|
50
|
+
return if t2.capital? || t3.capital?
|
50
51
|
return if SKIP_PHRASES.include?(correction.downcase)
|
51
52
|
return unless Dictionaries::EnglishWords.include?(t2.text)
|
52
53
|
|
53
54
|
return if skip_phrase?(t1, t2, t3, t4)
|
54
55
|
return if repetition?(t1, t2, t3, t4)
|
55
|
-
return if from_to_phrase?(t1, t2, t3
|
56
|
+
return if from_to_phrase?(t1, t2, t3)
|
56
57
|
return if quoted?(t1, t2, t3, t4)
|
57
58
|
|
58
59
|
Mistake.new(text: text, correction: correction,
|
@@ -79,22 +80,25 @@ module Spellchecker
|
|
79
80
|
false
|
80
81
|
end
|
81
82
|
|
83
|
+
# rubocop:disable Metrics/AbcSize
|
82
84
|
def repetition?(t1, t2, t3, t4)
|
83
85
|
return true if t1.downcased == t3.downcased && t1.downcased == t4.next.downcased
|
84
86
|
return true if t1.prev.downcased == t2.downcased && t2.downcased == t4.downcased
|
87
|
+
return true if t1.prev.downcased == t1.downcased && t1.downcased == t3.downcased
|
85
88
|
return true if t1.downcased == t2.downcased && (t1.downcased == t3.downcased ||
|
86
89
|
t1.downcased == t1.prev.downcased ||
|
87
90
|
t1.downcased == t4.downcased)
|
88
91
|
|
89
92
|
false
|
90
93
|
end
|
94
|
+
# rubocop:enable Metrics/AbcSize
|
91
95
|
|
92
96
|
def quoted?(t1, _t2, t3, t4)
|
93
97
|
t1.prev.text == '"' && (t3.text == '"' || t4.text == '"')
|
94
98
|
end
|
95
99
|
|
96
|
-
def from_to_phrase?(t1, t2, t3
|
97
|
-
t1.downcased == 'from' &&
|
100
|
+
def from_to_phrase?(t1, t2, t3)
|
101
|
+
t1.prev.downcased == 'from' && t2.downcased == 'to' && t1.downcased == t3.downcased
|
98
102
|
end
|
99
103
|
end
|
100
104
|
end
|
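Note on `from_to_phrase?`: the rewritten guard skips phrases of the form "from X to X", such as "from time to time", so that the repeated word is not reported as a duplicate. The sketch below expresses the same condition over a plain array of words. The array-and-index interface is an assumption made for illustration; the gem works on linked tokens with `prev` and `downcased`.

```ruby
# Returns true when words[i] begins an "X to X" run that is directly
# preceded by "from", for example "from time to time".
def from_to_phrase?(words, i)
  prev = i.positive? ? words[i - 1] : nil
  t1, t2, t3 = words[i, 3]
  return false unless prev && t1 && t2 && t3

  prev.downcase == 'from' && t2.downcase == 'to' && t1.downcase == t3.downcase
end

words = %w[From time to time it rains]
p from_to_phrase?(words, 1) # prints true: "time to time" follows "From"
p from_to_phrase?(words, 2) # prints false: "to time it" is not an "X to X" run
```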
data/lib/spellchecker/detect_typo.rb
CHANGED
@@ -4,15 +4,9 @@ module Spellchecker
|
|
4
4
|
module DetectTypo
|
5
5
|
PROPER_NAME_REGEXP = /\A(?:[a-z]+[A-Z])|(?:[A-Z]+.+[A-Z]+)|(?:[A-Z]{2,}[^A-Z]+)/.freeze
|
6
6
|
ABBREVIATION_REGEXP = /\A(?:[A-Z]{2,4})|(?:[A-Z][a-z])\z/.freeze
|
7
|
-
MUTEX = Mutex.new
|
8
7
|
|
9
8
|
LENGTH_LIMIT = 2
|
10
9
|
|
11
|
-
POSTFILTERS = {
|
12
|
-
'aan' => :all_english_words?,
|
13
|
-
'dont' => :any_english_word?
|
14
|
-
}.freeze
|
15
|
-
|
16
10
|
module_function
|
17
11
|
|
18
12
|
# @param token [Spellchecker::Tokenizer::Token]
|
@@ -29,12 +23,9 @@ module Spellchecker
|
|
29
23
|
return if ABBREVIATION_REGEXP.match?(word)
|
30
24
|
return if Dictionaries::EnglishWords.include?(Utils.replace_quote(word))
|
31
25
|
|
32
|
-
|
33
|
-
|
34
|
-
return if is_capital && proper_noun?(word)
|
35
|
-
return if postfilter?(token)
|
26
|
+
return if token.capital? && proper_noun?(word)
|
36
27
|
|
37
|
-
correction = correction.sub(/\S/, &:upcase) if
|
28
|
+
correction = correction.sub(/\S/, &:upcase) if token.capital?
|
38
29
|
|
39
30
|
Mistake.new(text: word, correction: correction,
|
40
31
|
position: token.position, type: MistakeTypes::SPELLING)
|
@@ -47,29 +38,5 @@ module Spellchecker
|
|
47
38
|
Dictionaries::CompanyNames.include?(word) ||
|
48
39
|
Dictionaries::UsToponyms.include?(word)
|
49
40
|
end
|
50
|
-
|
51
|
-
# @param token [Spellchecker::Tokenizer::Token]
|
52
|
-
# @return [Boolean]
|
53
|
-
def postfilter?(token)
|
54
|
-
filter = POSTFILTERS[token.downcased]
|
55
|
-
|
56
|
-
return false unless filter
|
57
|
-
|
58
|
-
!method(filter).call(token)
|
59
|
-
end
|
60
|
-
|
61
|
-
# @param token [Spellchecker::Tokenizer::Token]
|
62
|
-
# @return [Boolean]
|
63
|
-
def all_english_words?(token)
|
64
|
-
Dictionaries::EnglishWords.include?(token.prev.text) &&
|
65
|
-
Dictionaries::EnglishWords.include?(token.next.text)
|
66
|
-
end
|
67
|
-
|
68
|
-
# @param token [Spellchecker::Tokenizer::Token]
|
69
|
-
# @return [Boolean]
|
70
|
-
def any_english_word?(token)
|
71
|
-
Dictionaries::EnglishWords.include?(token.prev.text) ||
|
72
|
-
Dictionaries::EnglishWords.include?(token.next.text)
|
73
|
-
end
|
74
41
|
end
|
75
42
|
end
|
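Note on capitalisation: the new code upcases the first non-space character of the correction when `token.capital?` is true. The definition of `capital?` is not shown in this diff view, so the sketch below assumes it means "starts with an uppercase letter". `match_capitalization` is a hypothetical helper name.

```ruby
# Capitalize the correction when the original word started with an
# uppercase letter. The uppercase test is an assumption about what
# Token#capital? checks; its definition is not visible in this diff.
def match_capitalization(word, correction)
  return correction unless word.match?(/\A[[:upper:]]/)

  # sub with a block replaces only the first non-space character.
  correction.sub(/\S/, &:upcase)
end

puts match_capitalization('Luckilly', 'luckily') # prints "Luckily"
puts match_capitalization('cource', 'course')    # prints "course"
```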
data/lib/spellchecker/dictionaries/us_toponyms.rb
CHANGED
@@ -4,6 +4,7 @@ module Spellchecker
|
|
4
4
|
module Dictionaries
|
5
5
|
module UsToponyms
|
6
6
|
MUTEX = Mutex.new
|
7
|
+
# https://github.com/grammakov/USA-cities-and-states
|
7
8
|
PATH = Dictionaries.path.join('us_toponyms.csv')
|
8
9
|
|
9
10
|
module_function
|
@@ -28,10 +29,10 @@ module Spellchecker
|
|
28
29
|
csv = CSV.parse(PATH.read, headers: true, col_sep: '|')
|
29
30
|
|
30
31
|
csv.each_with_object(Set.new) do |row, set|
|
31
|
-
set.add(row['City'])
|
32
|
-
set.add(row['State full'])
|
32
|
+
set.add(row['City']) if row['City']
|
33
|
+
set.add(row['State full']) if row['State full']
|
33
34
|
set.add(row['County'].to_s.split(/\s+/).map(&:capitalize).join(' ')) unless row['County'].to_s.empty?
|
34
|
-
set.add(row['City alias'])
|
35
|
+
set.add(row['City alias']) if row['City alias']
|
35
36
|
end
|
36
37
|
end
|
37
38
|
end
|
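Note on the `if row[...]` guards: Ruby's CSV parser returns `nil` for an empty unquoted field, and `Set#add(nil)` stores that `nil` as a member. The sketch below shows the difference the guards make. The sample rows and the `build_set` helper are made up for illustration; only the pipe-separated layout and column names come from the diff.

```ruby
require 'csv'
require 'set'

# Made-up sample rows in the same pipe-separated layout. The second row
# has empty County and City alias fields.
SAMPLE = <<~ROWS
  City|State full|County|City alias
  Holtsville|New York|SUFFOLK|Internal Revenue Service
  Adjuntas|Puerto Rico||
ROWS

# Builds a set of place names; guard: true mirrors the new `if` checks.
def build_set(data, guard:)
  csv = CSV.parse(data, headers: true, col_sep: '|')

  csv.each_with_object(Set.new) do |row, set|
    ['City', 'State full', 'City alias'].each do |column|
      set.add(row[column]) if row[column] || !guard
    end
  end
end

p build_set(SAMPLE, guard: false).include?(nil) # prints true
p build_set(SAMPLE, guard: true).include?(nil)  # prints false
```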
data/lib/spellchecker/tokenizer.rb
CHANGED
@@ -10,6 +10,8 @@ module Spellchecker
|
|
10
10
|
WORD_REGEXP = /[[:word:]]/.freeze
|
11
11
|
LINEBREAK = "\n"
|
12
12
|
|
13
|
+
DOT = '.'
|
14
|
+
|
13
15
|
SIMPLE_PRE = ['¿', '¡'].freeze
|
14
16
|
SIMPLE_POST = ['!', '?', ',', ':', ';', '.'].freeze
|
15
17
|
PAIR_PRE = ['(', '{', '[', '<', '«', '„', '‘'].freeze
|
@@ -22,9 +24,9 @@ module Spellchecker
|
|
22
24
|
|
23
25
|
module_function
|
24
26
|
|
25
|
-
# rubocop:disable Metrics/AbcSize
|
26
|
-
# @param [String]
|
27
|
-
# @return [
|
27
|
+
# rubocop:disable Metrics/AbcSize, Metrics/MethodLength, Metrics/PerceivedComplexity
|
28
|
+
# @param str [String] string to be tokenized.
|
29
|
+
# @return [Spellchecker::Tokenizer::List]
|
28
30
|
def call(str)
|
29
31
|
chars = str.chars
|
30
32
|
pos = 0
|
@@ -36,33 +38,40 @@ module Spellchecker
|
|
36
38
|
if char.nil?
|
37
39
|
list << Token.new(acc.join, pos) unless acc.empty?
|
38
40
|
|
39
|
-
break
|
41
|
+
break
|
40
42
|
end
|
41
43
|
|
42
44
|
if char.match?(BLANK_REGEXP)
|
43
45
|
list << Token.new(acc.join, pos) unless acc.empty?
|
44
46
|
acc.clear
|
45
|
-
elsif splitable?(char
|
46
|
-
|
47
|
-
list << Token.new(char, i)
|
47
|
+
elsif splitable?(char)
|
48
|
+
is_next_wordchar = word_char?(chars[i + 1])
|
48
49
|
|
49
|
-
acc.
|
50
|
+
if acc.empty? && char == DOT && is_next_wordchar
|
51
|
+
pos = i
|
52
|
+
acc << char
|
53
|
+
elsif !word_char?(chars[i - 1]) || !is_next_wordchar
|
54
|
+
list << Token.new(acc.join, pos) unless acc.empty?
|
55
|
+
list << Token.new(char, i)
|
56
|
+
|
57
|
+
acc.clear
|
58
|
+
else
|
59
|
+
acc << char
|
60
|
+
end
|
50
61
|
else
|
51
62
|
pos = i if acc.empty?
|
52
63
|
acc << char
|
53
64
|
end
|
54
65
|
end
|
66
|
+
|
67
|
+
list
|
55
68
|
end
|
56
|
-
# rubocop:enable Metrics/AbcSize
|
69
|
+
# rubocop:enable Metrics/AbcSize, Metrics/MethodLength, Metrics/PerceivedComplexity
|
57
70
|
|
58
|
-
# @param
|
59
|
-
# @param prev [String]
|
60
|
-
# @param nxt [String]
|
71
|
+
# @param char [String]
|
61
72
|
# @return [Boolean]
|
62
|
-
def splitable?(
|
63
|
-
|
64
|
-
|
65
|
-
cur == LINEBREAK
|
73
|
+
def splitable?(char)
|
74
|
+
SPLITTABLES_REGEXP.match?(char) || char == LINEBREAK
|
66
75
|
end
|
67
76
|
|
68
77
|
# @param char [String]
|
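Note on the tokenizer change: a dot that starts a token and is followed by a word character now stays attached (for example `.jpg`), and punctuation with word characters on both sides stays inside the token (for example `apple.com`). The sketch below is a simplified reimplementation of that branching, not the gem's code. The `BLANK` and `SPLITTABLE` patterns are assumptions, since the gem's `BLANK_REGEXP` and `SPLITTABLES_REGEXP` are not visible in the diff, and tokens are plain `[text, position]` pairs.

```ruby
# Assumed character classes; the gem's own regexps are not shown here.
BLANK = /[ \t]/.freeze
SPLITTABLE = /[!?,:;.()\[\]{}"\n]/.freeze
WORD = /[[:word:]]/.freeze

def word_char?(char)
  !char.nil? && char.match?(WORD)
end

# Returns an array of [text, start_position] pairs.
def tokenize(str)
  chars = str.chars
  tokens = []
  acc = []
  pos = 0
  # Emit the accumulated characters as one token, then reset.
  flush = lambda do
    tokens << [acc.join, pos] unless acc.empty?
    acc.clear
  end

  chars.each_with_index do |char, i|
    if char.match?(BLANK)
      flush.call
    elsif char.match?(SPLITTABLE)
      next_is_word = word_char?(chars[i + 1])
      prev_is_word = i.positive? && word_char?(chars[i - 1])

      if acc.empty? && char == '.' && next_is_word
        # Leading dot, as in ".jpg": start a new token with it.
        pos = i
        acc << char
      elsif !prev_is_word || !next_is_word
        # Punctuation at a word boundary becomes its own token.
        flush.call
        tokens << [char, i]
      else
        # Punctuation inside a word, as in "apple.com": keep it.
        acc << char
      end
    else
      pos = i if acc.empty?
      acc << char
    end
  end

  flush.call
  tokens
end

p tokenize('.jpg, .tiff').map(&:first) # prints [".jpg", ",", ".tiff"]
```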
data/lib/spellchecker/version.rb
CHANGED
data/ruby-spellchecker.gemspec
CHANGED
@@ -25,6 +25,8 @@ Gem::Specification.new do |spec|
|
|
25
25
|
|
26
26
|
spec.require_paths = ['lib']
|
27
27
|
|
28
|
-
spec.add_development_dependency 'rspec'
|
29
|
-
spec.add_development_dependency 'rubocop'
|
28
|
+
spec.add_development_dependency 'rspec'
|
29
|
+
spec.add_development_dependency 'rubocop'
|
30
|
+
spec.add_development_dependency 'simplecov'
|
31
|
+
spec.add_development_dependency 'yard'
|
30
32
|
end
|
metadata
CHANGED
@@ -1,43 +1,71 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: ruby-spellchecker
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.1.
|
4
|
+
version: 0.1.2
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Pete Matsyburka
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2020-11-
|
11
|
+
date: 2020-11-27 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: rspec
|
15
15
|
requirement: !ruby/object:Gem::Requirement
|
16
16
|
requirements:
|
17
|
-
- - "
|
17
|
+
- - ">="
|
18
18
|
- !ruby/object:Gem::Version
|
19
|
-
version: '
|
19
|
+
version: '0'
|
20
20
|
type: :development
|
21
21
|
prerelease: false
|
22
22
|
version_requirements: !ruby/object:Gem::Requirement
|
23
23
|
requirements:
|
24
|
-
- - "
|
24
|
+
- - ">="
|
25
25
|
- !ruby/object:Gem::Version
|
26
|
-
version: '
|
26
|
+
version: '0'
|
27
27
|
- !ruby/object:Gem::Dependency
|
28
28
|
name: rubocop
|
29
29
|
requirement: !ruby/object:Gem::Requirement
|
30
30
|
requirements:
|
31
|
-
- - "
|
31
|
+
- - ">="
|
32
32
|
- !ruby/object:Gem::Version
|
33
|
-
version: '
|
33
|
+
version: '0'
|
34
34
|
type: :development
|
35
35
|
prerelease: false
|
36
36
|
version_requirements: !ruby/object:Gem::Requirement
|
37
37
|
requirements:
|
38
|
-
- - "
|
38
|
+
- - ">="
|
39
39
|
- !ruby/object:Gem::Version
|
40
|
-
version: '
|
40
|
+
version: '0'
|
41
|
+
- !ruby/object:Gem::Dependency
|
42
|
+
name: simplecov
|
43
|
+
requirement: !ruby/object:Gem::Requirement
|
44
|
+
requirements:
|
45
|
+
- - ">="
|
46
|
+
- !ruby/object:Gem::Version
|
47
|
+
version: '0'
|
48
|
+
type: :development
|
49
|
+
prerelease: false
|
50
|
+
version_requirements: !ruby/object:Gem::Requirement
|
51
|
+
requirements:
|
52
|
+
- - ">="
|
53
|
+
- !ruby/object:Gem::Version
|
54
|
+
version: '0'
|
55
|
+
- !ruby/object:Gem::Dependency
|
56
|
+
name: yard
|
57
|
+
requirement: !ruby/object:Gem::Requirement
|
58
|
+
requirements:
|
59
|
+
- - ">="
|
60
|
+
- !ruby/object:Gem::Version
|
61
|
+
version: '0'
|
62
|
+
type: :development
|
63
|
+
prerelease: false
|
64
|
+
version_requirements: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - ">="
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: '0'
|
41
69
|
description: Ruby spelling and grammar checker that can be used for autocorrection.
|
42
70
|
email:
|
43
71
|
- pete.matsy@gmail.com
|
@@ -45,14 +73,16 @@ executables: []
|
|
45
73
|
extensions: []
|
46
74
|
extra_rdoc_files: []
|
47
75
|
files:
|
76
|
+
- ".github/workflows/benchmark.yml"
|
77
|
+
- ".github/workflows/rspec.yml"
|
78
|
+
- ".github/workflows/rubocop.yml"
|
48
79
|
- ".gitignore"
|
49
80
|
- ".rspec"
|
50
81
|
- ".rubocop.yml"
|
51
|
-
- ".travis.yml"
|
52
82
|
- Gemfile
|
53
|
-
- LICENSE
|
54
83
|
- README.md
|
55
84
|
- Rakefile
|
85
|
+
- benchmark/benchmark.rb
|
56
86
|
- bin/console
|
57
87
|
- bin/setup
|
58
88
|
- dictionaries/company_names.txt
|
data/.travis.yml
DELETED
data/LICENSE
DELETED
@@ -1,21 +0,0 @@
|
|
1
|
-
The MIT License (MIT)
|
2
|
-
|
3
|
-
Copyright (c) 2020 Pete Matsyburka
|
4
|
-
|
5
|
-
Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
-
of this software and associated documentation files (the "Software"), to deal
|
7
|
-
in the Software without restriction, including without limitation the rights
|
8
|
-
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
-
copies of the Software, and to permit persons to whom the Software is
|
10
|
-
furnished to do so, subject to the following conditions:
|
11
|
-
|
12
|
-
The above copyright notice and this permission notice shall be included in
|
13
|
-
all copies or substantial portions of the Software.
|
14
|
-
|
15
|
-
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
-
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
-
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
-
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
-
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
-
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
21
|
-
THE SOFTWARE.
|