fuzzy_set 1.0.0 → 1.1.0
- checksums.yaml +4 -4
- data/.rspec +1 -0
- data/.rubocop.yml +3 -0
- data/README.md +49 -4
- data/fuzzy_set.gemspec +2 -2
- data/lib/fuzzy_set.rb +51 -9
- data/lib/fuzzy_set/version.rb +1 -1
- metadata +6 -5
- data/lib/core_ext/string.rb +0 -14
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ea177c8ec92af90bff837cb332e1ed19878d448f
+  data.tar.gz: 15b67fc336b12cd46b9c5e109dadcc65655b7fcf
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4efd199d4afc9caf35bb46099779fdc86096f17cd1a7c3f5dac4113f251c215672c57fc35a8fb1aabe3efb19a078758ba8d79579006d0f9a84ac7a73efc43c43
+  data.tar.gz: 87e9d04d3043ec1842cbca56b097fef76ff17c8859552b4b133fc4ee38c12b9758178a5076445ed9e873f45f95fb48664447c4f71dddb89a7453dee3713c0fff
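The entries in checksums.yaml are SHA1 and SHA512 hex digests of the two archives packed inside the gem. As a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the `.gem` file, such digests could be recomputed with Ruby's standard `digest` library. The helper name `archive_digests` is hypothetical and not part of RubyGems.

```ruby
require 'digest'

# Hypothetical helper: returns the SHA1 and SHA512 hex digests of a file,
# keyed the same way as the sections of checksums.yaml.
def archive_digests(path)
  {
    'SHA1'   => Digest::SHA1.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end
```

The value of `archive_digests('metadata.gz')['SHA1']` would then be compared against the `metadata.gz` entry listed under `SHA1` above.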
data/.rspec
CHANGED
data/.rubocop.yml
ADDED
data/README.md
CHANGED
@@ -1,8 +1,24 @@
 # FuzzySet
 
-
+[![Gem Version](https://badge.fury.io/rb/fuzzy_set.svg)](http://badge.fury.io/rb/fuzzy_set)
+[![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](http://rubydoc.org/gems/fuzzy_set/frames)
+[![Build Status](https://travis-ci.org/mhutter/fuzzy_set.svg)](https://travis-ci.org/mhutter/fuzzy_set)
+[![Code Climate](https://codeclimate.com/github/mhutter/fuzzy_set/badges/gpa.svg)](https://codeclimate.com/github/mhutter/fuzzy_set)
+[![Test Coverage](https://codeclimate.com/github/mhutter/fuzzy_set/badges/coverage.svg)](https://codeclimate.com/github/mhutter/fuzzy_set/coverage)
 
-
+
+FuzzySet represents a set which allows searching its entries by using [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).
+
+It allows you to create a fuzzy-search!
+
+## How does it work?
+
+When `add`ing an element to the Set, it first gets indexed. This is, on a very basic level, cutting it up into ngrams and building an index with each ngram pointing to the element.
+
+If you then query the set with `get`, the query itself is also sliced into ngrams. We then select all elements in the set which share at least one common ngram with the query. The results are then ordered by their [cosine string similarity](https://github.com/mhutter/string-similarity) to the query.
+
+**TODO**:
+See [Issues labeled #feature](https://github.com/mhutter/fuzzy_set/labels/feature)
 
 ## Installation
 
@@ -24,17 +40,46 @@ Or install it yourself as:
 
 ```ruby
 require 'fuzzy_set'
-
 states = open('states.txt').read.split(/\n/)
-fs = FuzzySet.new(*states)
 
+# Create a new set and add some elements:
+fs = FuzzySet.new
+fs.add 'Some'
+fs.add 'Words'
+fs.add "or", "even", "multiple", "words!"
+
+# Or provide your elements when creating the set:
+fs = FuzzySet.new(states)
+
+# Use #exact_match to find exact matches (= the normalized query
+# matches a normalized element in the set):
 fs.exact_match('michigan!') # => "Michigan"
 fs.exact_match('mischigen') # => nil
 
+# Use #get to get all approximate matches:
 fs.get('mischigen')
 # => ["Michigan", "Wisconsin", "Mississippi", "Minnesota", "Missouri"]
+
+# With the default settings, #get will always first try to get an
+# exact match (see above), and return if there is one:
+fs.get('mississippi') # => ["Mississippi"]
+
+# set `all_matches` to true, to do a full query, even if there is
+# an exact match:
+fs = FuzzySet.new(states, all_matches: true)
+fs.get('mississippi') # => ["Mississippi", "Missouri", "Michigan", "Minnesota"]
+
+# You can configure more stuff (see below)
+fs = FuzzySet.new(states, all_matches: true, ngram_size_min: 1)
 ```
 
+### Options
+
+- `:all_matches` - If `false` and there is an exact match for `#get`, return the match immediately. If `true`, do the ngram-query to get more possible matches.
+- `:ngram_size_max` - The maximum Ngram size to use (if there is no match using the max ngram size, try again with a smaller ngran size).
+- `:ngram_size_min` - The minimum Ngram size to use.
+
+
 ## Development
 
 After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
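The "How does it work?" text added to the README describes an ngram index and a lookup that selects every element sharing at least one ngram with the query. The following is a toy sketch of that idea, not the gem's implementation: it fixes the ngram size at 3, normalizes only by lowercasing, and omits the final ordering by cosine similarity. The names `trigrams` and `candidates` are made up for illustration.

```ruby
# Toy trigram splitter: pads the string with "-" on both sides and
# returns every substring of length 3.
def trigrams(str)
  padded = "-#{str.downcase}-"
  (0..padded.length - 3).map { |i| padded[i, 3] }
end

items = %w[Michigan Mississippi Ohio]

# Inverted index: each trigram points to the ids of the items containing it.
index = Hash.new { |hash, gram| hash[gram] = [] }
items.each_with_index do |item, id|
  trigrams(item).each { |gram| index[gram] << id }
end

# Candidates are all items that share at least one trigram with the query.
def candidates(query, index, items)
  ids = trigrams(query).flat_map { |gram| index.fetch(gram, []) }.uniq
  ids.map { |id| items[id] }
end
```

With this data, `candidates('mischigen', index, items)` selects "Michigan" and "Mississippi" (both share the trigram `-mi` with the query) and leaves out "Ohio".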
data/fuzzy_set.gemspec
CHANGED
@@ -9,8 +9,8 @@ Gem::Specification.new do |spec|
   spec.authors = ['Manuel Hutter']
   spec.email = ['manuel@hutter.io']
 
-  spec.summary = %q{
-  spec.description = %q{FuzzySet allows
+  spec.summary = %q{Set which allows you to fuzzy-search Strings}
+  spec.description = %q{FuzzySet represents a set which allows searching its entries by using [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).}
   spec.homepage = 'https://github.com/mhutter/fuzzy_set'
   spec.license = 'MIT'
 
data/lib/fuzzy_set.rb
CHANGED
@@ -1,20 +1,35 @@
 require 'string/similarity'
 
 require 'fuzzy_set/version'
-require 'core_ext/string'
 
 # FuzzySet implements a fuzzy-searchable set of strings.
 #
 # As a set, it cannot contain duplicate elements.
 class FuzzySet
-
+  # default options for creating new instances
+  DEFAULT_OPTS = {
+    all_matches: false,
+    ngram_size_max: 3,
+    ngram_size_min: 2
+  }
+
+  # @param items [#each,#to_s] item(s) to add
+  # @param opts [Hash] options, see {DEFAULT_OPTS}
+  # @option opts [Boolean] :all_matches
+  #   return all matches, even if an exact match is found
+  # @option opts [Fixnum] :ngram_size_max upper limit for ngram sizes
+  # @option opts [Fixnum] :ngram_size_min lower limit for ngram sizes
+  def initialize(*items, **opts)
+    opts = DEFAULT_OPTS.merge(opts)
 
-  def initialize(*items)
     @items = []
     @denormalize = {}
     @index = {}
+    @all_matches = opts[:all_matches]
+    @ngram_size_max = opts[:ngram_size_max]
+    @ngram_size_min = opts[:ngram_size_min]
 
-    add(
+    add(items)
   end
 
   # Normalizes +query+, and looks up an entry by its normalized value.
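The new constructor resolves its options with `DEFAULT_OPTS.merge(opts)`, so any key the caller passes overrides the default and every other key keeps its default value. A small sketch of that merge in isolation, where the helper name `effective_opts` is hypothetical:

```ruby
# Same defaults as in the diff above.
DEFAULT_OPTS = {
  all_matches: false,
  ngram_size_max: 3,
  ngram_size_min: 2
}

# Hypothetical helper mirroring the first line of the new #initialize.
def effective_opts(**opts)
  DEFAULT_OPTS.merge(opts)
end
```

For example, `effective_opts(ngram_size_min: 1)` keeps `all_matches: false` and `ngram_size_max: 3` and only changes the minimum size. Note that `Hash#merge` also carries unknown keys through unchanged, so a misspelled option name would be accepted silently in this sketch.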
@@ -29,9 +44,10 @@ class FuzzySet
   #
   # Each item will be converted into a string and indexed upon adding.
   #
-  # @param items [#to_s] item(s) to add
+  # @param items [#each,#to_s] item(s) to add
   # @return [FuzzySet] +self+
   def add(*items)
+    items = [items].flatten
     items.each do |item|
       item = item.to_s
       return self if @items.include?(item)
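Two details of this hunk are worth a closer look. `[items].flatten` lets `add` accept either several arguments or a single array. And, judging only from the context lines visible here, `return self` inside the `each` block leaves the whole method at the first duplicate, so any items after it in the same call would not be added. The sketch below uses a hypothetical stand-in class (`TinySet`, not the gem's code) to contrast that with `next`, which skips only the duplicate.

```ruby
# Hypothetical stand-in class for illustration; it stores plain strings
# and does no indexing.
class TinySet
  attr_reader :items

  def initialize
    @items = []
  end

  # Same control flow as the visible lines of FuzzySet#add.
  def add_with_return(*items)
    [items].flatten.each do |item|
      item = item.to_s
      return self if @items.include?(item) # exits the method entirely
      @items << item
    end
    self
  end

  # Variant that skips a duplicate and keeps processing the rest.
  def add_with_next(*items)
    [items].flatten.each do |item|
      item = item.to_s
      next if @items.include?(item) # skips only this element
      @items << item
    end
    self
  end
end
```

After `set.add_with_return('a')`, a call to `set.add_with_return(%w[a b])` leaves the set at `["a"]`, while the `next` variant ends up with `["a", "b"]`.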
@@ -53,13 +69,15 @@ class FuzzySet
   # 2. check for an exact match and return, if present
   # 3. find matches based on Ngrams
   # 4. sort matches by their cosine similarity to +query+
+  #
+  # @param query [String] search query
   def get(query)
     query = normalize(query)
 
     # check for exact match
-    return [@denormalize[query]] if @denormalize[query]
+    return [@denormalize[query]] if !@all_matches && @denormalize[query]
 
-    match_ids = query
+    match_ids = matches_for(query)
     match_ids = match_ids.flatten.compact.uniq
     matches = match_ids.map { |id| @items[id] }
 
@@ -85,6 +103,14 @@ class FuzzySet
 
   private
 
+  def matches_for(query)
+    @ngram_size_max.downto(@ngram_size_min).each do |size|
+      match_ids = ngram(query, size).map { |ng| @index[ng] }
+      return match_ids if match_ids.any?
+    end
+    []
+  end
+
   # Normalize a string by removing all non-word characters
   # except spaces and then converting it to lowercase.
   def normalize(str)
@@ -98,9 +124,25 @@ class FuzzySet
     @items.index(item)
   end
 
+  # calculate Ngrams and add them to the items
   def calculate_grams_for(string, id)
-
-
+    @ngram_size_max.downto(@ngram_size_min).each do |size|
+      ngram(string, size).each do |gram|
+        @index[gram] = (@index[gram] || []).push(id)
+      end
+    end
+  end
+
+  # break apart the string into strings of length `n`
+  #
+  # @example
+  #   'foobar'.ngram(3)
+  #   # => ["-fo", "foo", "oob", "oba", "bar", "ar-"]
+  def ngram(str, n)
+    fail ArgumentError, "n must be >= 1, is #{n}" if n < 1
+    str = "-#{str}-" if n > 1
+    (str.length - n + 1).times.map do |i|
+      str.slice(i, n)
     end
   end
 end
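The private `ngram` helper replaces the removed `String#ngram` core extension and now also accepts `n = 1`, in which case the string is not padded. Its `@example` comment still shows the old call style; with the new signature the equivalent call is `ngram('foobar', 3)`. The sketch below copies the helper as a standalone method so it can be tried outside the class, and adds a simplified version of the size fallback from `matches_for`. The standalone `matches_for` is an approximation: it takes the index and the size limits as arguments (keyword names `max` and `min` are assumptions) and flattens the ids itself, which the gem does later in `get`.

```ruby
# Standalone copy of the helper from the diff.
def ngram(str, n)
  fail ArgumentError, "n must be >= 1, is #{n}" if n < 1
  str = "-#{str}-" if n > 1 # padding is skipped for single characters
  (str.length - n + 1).times.map { |i| str.slice(i, n) }
end

# Simplified fallback: try the largest ngram size first and only move to
# smaller sizes when no ngram of the current size is found in the index.
def matches_for(query, index, max: 3, min: 2)
  max.downto(min) do |size|
    ids = ngram(query, size).map { |gram| index[gram] }.compact.flatten.uniq
    return ids unless ids.empty?
  end
  []
end
```

Here `ngram('foobar', 3)` returns `["-fo", "foo", "oob", "oba", "bar", "ar-"]`, matching the example in the comment, and `ngram('foo', 1)` returns `["f", "o", "o"]`.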
data/lib/fuzzy_set/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: fuzzy_set
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.1.0
 platform: ruby
 authors:
 - Manuel Hutter
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2015-09-
+date: 2015-09-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: string-similarity
@@ -94,7 +94,8 @@ dependencies:
     - - ">="
       - !ruby/object:Gem::Version
         version: '0'
-description: FuzzySet allows
+description: FuzzySet represents a set which allows searching its entries by using
+  [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).
 email:
 - manuel@hutter.io
 executables: []
@@ -103,6 +104,7 @@ extra_rdoc_files: []
 files:
 - ".gitignore"
 - ".rspec"
+- ".rubocop.yml"
 - ".travis.yml"
 - Gemfile
 - Guardfile
@@ -112,7 +114,6 @@ files:
 - bin/console
 - bin/setup
 - fuzzy_set.gemspec
-- lib/core_ext/string.rb
 - lib/fuzzy_set.rb
 - lib/fuzzy_set/version.rb
 homepage: https://github.com/mhutter/fuzzy_set
@@ -138,6 +139,6 @@ rubyforge_project:
 rubygems_version: 2.4.5.1
 signing_key:
 specification_version: 4
-summary:
+summary: Set which allows you to fuzzy-search Strings
 test_files: []
 has_rdoc:
data/lib/core_ext/string.rb
DELETED
@@ -1,14 +0,0 @@
-class String
-  # break apart the string into strings of length `n`
-  #
-  # @example
-  #   'foobar'.ngram(3)
-  #   # => ["-fo", "foo", "oob", "oba", "bar", "ar-"]
-  def ngram(n)
-    fail ArgumentError, "n must be > 1, is #{n}" if n < 2
-    str = "-#{self}-"
-    (str.length - n + 1).times.map do |i|
-      str.slice(i, n)
-    end
-  end
-end