fuzzy_set 1.0.0 → 1.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
-   metadata.gz: b8ca0bdbe55c19972e9df5c589d606a81b94270d
-   data.tar.gz: da1885c2d00aa8ba8043f72294cbad17e90bc8ec
+   metadata.gz: ea177c8ec92af90bff837cb332e1ed19878d448f
+   data.tar.gz: 15b67fc336b12cd46b9c5e109dadcc65655b7fcf
  SHA512:
-   metadata.gz: b0b987aa8cd7d0143424fe6d6fcff33f90dccf7a1c7fb60d5bb23c699fb1df229119f0e8de97405bb682cd0ecbc631beed51a19800460e67c0e8f4385f690ff0
-   data.tar.gz: 18a257eb888a1feffbfecb9065116b19add560c5354a598f75605f1660a132d861abe86fd69f6afc641b2341647cbb0e64a4128fb70de3bf6f65d28b11df6556
+   metadata.gz: 4efd199d4afc9caf35bb46099779fdc86096f17cd1a7c3f5dac4113f251c215672c57fc35a8fb1aabe3efb19a078758ba8d79579006d0f9a84ac7a73efc43c43
+   data.tar.gz: 87e9d04d3043ec1842cbca56b097fef76ff17c8859552b4b133fc4ee38c12b9758178a5076445ed9e873f45f95fb48664447c4f71dddb89a7453dee3713c0fff
data/.rspec CHANGED
@@ -1,2 +1,3 @@
  --color
  --require spec_helper
+ --format doc
data/.rubocop.yml ADDED
@@ -0,0 +1,3 @@
+ AllCops:
+   Exclude:
+     - lib/fuzzy_set/version.rb
data/README.md CHANGED
@@ -1,8 +1,24 @@
  # FuzzySet

- Welcome to your new gem! In this directory, you'll find the files you need to be able to package up your Ruby library into a gem. Put your Ruby code in the file `lib/fuzzy_set`. To experiment with that code, run `bin/console` for an interactive prompt.
+ [![Gem Version](https://badge.fury.io/rb/fuzzy_set.svg)](http://badge.fury.io/rb/fuzzy_set)
+ [![Documentation](http://img.shields.io/badge/docs-rdoc.info-blue.svg)](http://rubydoc.org/gems/fuzzy_set/frames)
+ [![Build Status](https://travis-ci.org/mhutter/fuzzy_set.svg)](https://travis-ci.org/mhutter/fuzzy_set)
+ [![Code Climate](https://codeclimate.com/github/mhutter/fuzzy_set/badges/gpa.svg)](https://codeclimate.com/github/mhutter/fuzzy_set)
+ [![Test Coverage](https://codeclimate.com/github/mhutter/fuzzy_set/badges/coverage.svg)](https://codeclimate.com/github/mhutter/fuzzy_set/coverage)

- TODO: Delete this and the text above, and describe your gem
+
+ FuzzySet represents a set which allows searching its entries by using [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).
+
+ It allows you to create a fuzzy-search!
+
+ ## How does it work?
+
+ When `add`ing an element to the Set, it first gets indexed. This is, on a very basic level, cutting it up into ngrams and building an index with each ngram pointing to the element.
+
+ If you then query the set with `get`, the query itself is also sliced into ngrams. We then select all elements in the set which share at least one common ngram with the query. The results are then ordered by their [cosine string similarity](https://github.com/mhutter/string-similarity) to the query.
+
+ **TODO**:
+ See [Issues labeled #feature](https://github.com/mhutter/fuzzy_set/labels/feature)

  ## Installation

@@ -24,17 +40,46 @@ Or install it yourself as:

  ```ruby
  require 'fuzzy_set'
-
  states = open('states.txt').read.split(/\n/)
- fs = FuzzySet.new(*states)

+ # Create a new set and add some elements:
+ fs = FuzzySet.new
+ fs.add 'Some'
+ fs.add 'Words'
+ fs.add "or", "even", "multiple", "words!"
+
+ # Or provide your elements when creating the set:
+ fs = FuzzySet.new(states)
+
+ # Use #exact_match to find exact matches (= the normalized query
+ # matches a normalized element in the set):
  fs.exact_match('michigan!') # => "Michigan"
  fs.exact_match('mischigen') # => nil

+ # Use #get to get all approximate matches:
  fs.get('mischigen')
  # => ["Michigan", "Wisconsin", "Mississippi", "Minnesota", "Missouri"]
+
+ # With the default settings, #get will always first try to get an
+ # exact match (see above), and return if there is one:
+ fs.get('mississippi') # => ["Mississippi"]
+
+ # set `all_matches` to true, to do a full query, even if there is
+ # an exact match:
+ fs = FuzzySet.new(states, all_matches: true)
+ fs.get('mississippi') # => ["Mississippi", "Missouri", "Michigan", "Minnesota"]
+
+ # You can configure more stuff (see below)
+ fs = FuzzySet.new(states, all_matches: true, ngram_size_min: 1)
  ```

+ ### Options
+
+ - `:all_matches` - If `false` and there is an exact match for `#get`, return the match immediately. If `true`, do the ngram-query to get more possible matches.
+ - `:ngram_size_max` - The maximum Ngram size to use (if there is no match using the max ngram size, try again with a smaller ngram size).
+ - `:ngram_size_min` - The minimum Ngram size to use.
+
+
  ## Development

  After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
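
The "How does it work?" text added to the README describes two steps: index each element by its ngrams, then rank the candidates that share an ngram with the query by cosine similarity. A minimal, dependency-free sketch of that idea might look like the following. `TinyFuzzyIndex` and its character-count `cosine` helper are hypothetical names written for illustration only; the gem itself delegates similarity to the string-similarity library, normalizes input differently, and uses a range of ngram sizes.

```ruby
# Hypothetical illustration of an ngram index with cosine ranking.
# Not the gem's implementation.
class TinyFuzzyIndex
  def initialize(size = 3)
    @size = size
    @items = []
    @index = Hash.new { |hash, gram| hash[gram] = [] }
  end

  # Store the item and point each of its ngrams at the item's id.
  def add(item)
    return self if @items.include?(item)
    @items << item
    id = @items.length - 1
    ngrams(item.downcase).each { |gram| @index[gram] << id }
    self
  end

  # Collect every item sharing at least one ngram with the query,
  # then order the candidates by descending cosine similarity.
  def get(query)
    query = query.downcase
    ids = ngrams(query).flat_map { |gram| @index.fetch(gram, []) }.uniq
    ids.map { |id| @items[id] }
       .sort_by { |item| -cosine(query, item.downcase) }
  end

  private

  # Pad with "-" on both sides and slide a window of @size characters.
  def ngrams(str)
    padded = "-#{str}-"
    (padded.length - @size + 1).times.map { |i| padded.slice(i, @size) }
  end

  # Cosine similarity over character-count vectors (a simplification).
  def cosine(a, b)
    va = char_counts(a)
    vb = char_counts(b)
    dot = va.sum { |char, count| count * vb[char] }
    norm = Math.sqrt(va.values.sum { |c| c * c }) *
           Math.sqrt(vb.values.sum { |c| c * c })
    norm.zero? ? 0.0 : dot / norm
  end

  def char_counts(str)
    str.each_char.with_object(Hash.new(0)) { |char, counts| counts[char] += 1 }
  end
end

index = TinyFuzzyIndex.new
%w[Michigan Wisconsin Mississippi].each { |state| index.add(state) }
index.get('mischigen') # => ["Michigan", "Wisconsin", "Mississippi"]
```

Items that share no ngram with the query never become candidates, which is why a query such as `'zzz'` returns an empty array in this sketch.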
data/fuzzy_set.gemspec CHANGED
@@ -9,8 +9,8 @@ Gem::Specification.new do |spec|
    spec.authors = ['Manuel Hutter']
    spec.email = ['manuel@hutter.io']

-   spec.summary = %q{FuzzySet allows you to fuzzy-search Strings!}
-   spec.description = %q{FuzzySet allows you to fuzzy-search Strings!}
+   spec.summary = %q{Set which allows you to fuzzy-search Strings}
+   spec.description = %q{FuzzySet represents a set which allows searching its entries by using [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).}
    spec.homepage = 'https://github.com/mhutter/fuzzy_set'
    spec.license = 'MIT'

data/lib/fuzzy_set.rb CHANGED
@@ -1,20 +1,35 @@
  require 'string/similarity'

  require 'fuzzy_set/version'
- require 'core_ext/string'

  # FuzzySet implements a fuzzy-searchable set of strings.
  #
  # As a set, it cannot contain duplicate elements.
  class FuzzySet
-   NGRAM_SIZE = 3
+   # default options for creating new instances
+   DEFAULT_OPTS = {
+     all_matches: false,
+     ngram_size_max: 3,
+     ngram_size_min: 2
+   }
+
+   # @param items [#each,#to_s] item(s) to add
+   # @param opts [Hash] options, see {DEFAULT_OPTS}
+   # @option opts [Boolean] :all_matches
+   #   return all matches, even if an exact match is found
+   # @option opts [Fixnum] :ngram_size_max upper limit for ngram sizes
+   # @option opts [Fixnum] :ngram_size_min lower limit for ngram sizes
+   def initialize(*items, **opts)
+     opts = DEFAULT_OPTS.merge(opts)

-   def initialize(*items)
      @items = []
      @denormalize = {}
      @index = {}
+     @all_matches = opts[:all_matches]
+     @ngram_size_max = opts[:ngram_size_max]
+     @ngram_size_min = opts[:ngram_size_min]

-     add(*items)
+     add(items)
    end

    # Normalizes +query+, and looks up an entry by its normalized value.
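
The new constructor signature `initialize(*items, **opts)` separates positional items from keyword options and fills in anything unspecified from `DEFAULT_OPTS`. A small sketch of that pattern, using a hypothetical `split_arguments` helper and a local `DEFAULTS` constant that copies the values shown in the hunk:

```ruby
# Values copied from DEFAULT_OPTS in the hunk; the constant name and the
# helper below are hypothetical and exist only for this illustration.
DEFAULTS = { all_matches: false, ngram_size_max: 3, ngram_size_min: 2 }.freeze

# Mirrors the shape of `initialize(*items, **opts)`: positional arguments
# are collected (and flattened), keyword arguments override the defaults.
def split_arguments(*items, **opts)
  [items.flatten, DEFAULTS.merge(opts)]
end

items, opts = split_arguments(%w[Ohio Iowa], all_matches: true)
items # => ["Ohio", "Iowa"]
opts  # => {:all_matches=>true, :ngram_size_max=>3, :ngram_size_min=>2}
```

Because `Hash#merge` accepts any key, a misspelled option would be carried along silently in this sketch rather than rejected.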
@@ -29,9 +44,10 @@ class FuzzySet
    #
    # Each item will be converted into a string and indexed upon adding.
    #
-   # @param items [#to_s] item(s) to add
+   # @param items [#each,#to_s] item(s) to add
    # @return [FuzzySet] +self+
    def add(*items)
+     items = [items].flatten
      items.each do |item|
        item = item.to_s
        return self if @items.include?(item)
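
One point a reviewer might check in this hunk: `return self` inside `items.each` leaves the method at the first duplicate, so items listed after a duplicate in the same call would not be added. The diff does not say whether that is intended. The sketch below uses a hypothetical `MiniSet` class (not the gem's code, and without any indexing) to contrast `return` with `next`, and to show what `[items].flatten` does with nested input:

```ruby
# Hypothetical stand-in used only to compare the two control flows.
class MiniSet
  attr_reader :items

  def initialize
    @items = []
  end

  # Same control flow as the hunk: stop at the first duplicate.
  def add_with_return(*items)
    [items].flatten.each do |item|
      item = item.to_s
      return self if @items.include?(item)
      @items << item
    end
    self
  end

  # Alternative: skip the duplicate and keep going.
  def add_with_next(*items)
    [items].flatten.each do |item|
      item = item.to_s
      next if @items.include?(item)
      @items << item
    end
    self
  end
end

MiniSet.new.add_with_return('x', 'x', 'y').items # => ["x"]
MiniSet.new.add_with_next('x', 'x', 'y').items   # => ["x", "y"]
MiniSet.new.add_with_next(['a', ['b']], 'c').items # => ["a", "b", "c"]
```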
@@ -53,13 +69,15 @@ class FuzzySet
    # 2. check for an exact match and return, if present
    # 3. find matches based on Ngrams
    # 4. sort matches by their cosine similarity to +query+
+   #
+   # @param query [String] search query
    def get(query)
      query = normalize(query)

      # check for exact match
-     return [@denormalize[query]] if @denormalize[query]
+     return [@denormalize[query]] if !@all_matches && @denormalize[query]

-     match_ids = query.ngram(NGRAM_SIZE).map { |ng| @index[ng] }
+     match_ids = matches_for(query)
      match_ids = match_ids.flatten.compact.uniq
      matches = match_ids.map { |id| @items[id] }

@@ -85,6 +103,14 @@ class FuzzySet

    private

+   def matches_for(query)
+     @ngram_size_max.downto(@ngram_size_min).each do |size|
+       match_ids = ngram(query, size).map { |ng| @index[ng] }
+       return match_ids if match_ids.any?
+     end
+     []
+   end
+
    # Normalize a string by removing all non-word characters
    # except spaces and then converting it to lowercase.
    def normalize(str)
@@ -98,9 +124,25 @@ class FuzzySet
      @items.index(item)
    end

+   # calculate Ngrams and add them to the items
    def calculate_grams_for(string, id)
-     string.ngram(NGRAM_SIZE).each do |gram|
-       @index[gram] = (@index[gram] || []).push(id)
+     @ngram_size_max.downto(@ngram_size_min).each do |size|
+       ngram(string, size).each do |gram|
+         @index[gram] = (@index[gram] || []).push(id)
+       end
+     end
+   end
+
+   # break apart the string into strings of length `n`
+   #
+   # @example
+   #   'foobar'.ngram(3)
+   #   # => ["-fo", "foo", "oob", "oba", "bar", "ar-"]
+   def ngram(str, n)
+     fail ArgumentError, "n must be >= 1, is #{n}" if n < 1
+     str = "-#{str}-" if n > 1
+     (str.length - n + 1).times.map do |i|
+       str.slice(i, n)
      end
    end
  end
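
To see how the new `ngram` helper and the size fallback in `matches_for` interact, here is a standalone sketch. The `ngram` body follows the hunk; `build_index` and the four-argument `matches_for` are hypothetical free functions over a plain Hash, written so the example runs outside the class. Note that the `@example` comment in the hunk still shows the old `'foobar'.ngram(3)` receiver form; with the private helper the equivalent call is `ngram('foobar', 3)`.

```ruby
# Same logic as the private helper in the hunk, as a free function.
def ngram(str, n)
  raise ArgumentError, "n must be >= 1, is #{n}" if n < 1
  str = "-#{str}-" if n > 1 # unigrams are not padded
  (str.length - n + 1).times.map { |i| str.slice(i, n) }
end

# Hypothetical helper: index every word under all ngram sizes max..min.
def build_index(words, max, min)
  index = {}
  words.each_with_index do |word, id|
    max.downto(min) do |size|
      ngram(word, size).each { |gram| (index[gram] ||= []) << id }
    end
  end
  index
end

# Hypothetical variant of matches_for: try the largest ngram size first
# and fall back to smaller sizes only when nothing matched.
def matches_for(index, query, max, min)
  max.downto(min) do |size|
    ids = ngram(query, size).map { |gram| index[gram] }.compact.flatten.uniq
    return ids unless ids.empty?
  end
  []
end

ngram('foobar', 3) # => ["-fo", "foo", "oob", "oba", "bar", "ar-"]
ngram('foobar', 1) # => ["f", "o", "o", "b", "a", "r"]

index = build_index(%w[ohio iowa], 3, 2)
matches_for(index, 'ow', 3, 2) # => [0, 1]  (no trigram hit, bigrams match)
matches_for(index, 'ow', 3, 3) # => []      (trigrams only)
```

Lowering `ngram_size_min` therefore trades precision for recall: short or badly misspelled queries get candidates they would otherwise miss.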
data/lib/fuzzy_set/version.rb CHANGED
@@ -1,3 +1,3 @@
  class FuzzySet
-   VERSION = '1.0.0'
+   VERSION = '1.1.0'
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: fuzzy_set
  version: !ruby/object:Gem::Version
-   version: 1.0.0
+   version: 1.1.0
  platform: ruby
  authors:
  - Manuel Hutter
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2015-09-04 00:00:00.000000000 Z
+ date: 2015-09-10 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: string-similarity
@@ -94,7 +94,8 @@ dependencies:
      - - ">="
        - !ruby/object:Gem::Version
          version: '0'
- description: FuzzySet allows you to fuzzy-search Strings!
+ description: FuzzySet represents a set which allows searching its entries by using
+   [Approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching).
  email:
  - manuel@hutter.io
  executables: []
@@ -103,6 +104,7 @@ extra_rdoc_files: []
  files:
  - ".gitignore"
  - ".rspec"
+ - ".rubocop.yml"
  - ".travis.yml"
  - Gemfile
  - Guardfile
@@ -112,7 +114,6 @@ files:
  - bin/console
  - bin/setup
  - fuzzy_set.gemspec
- - lib/core_ext/string.rb
  - lib/fuzzy_set.rb
  - lib/fuzzy_set/version.rb
  homepage: https://github.com/mhutter/fuzzy_set
@@ -138,6 +139,6 @@ rubyforge_project:
  rubygems_version: 2.4.5.1
  signing_key:
  specification_version: 4
- summary: FuzzySet allows you to fuzzy-search Strings!
+ summary: Set which allows you to fuzzy-search Strings
  test_files: []
  has_rdoc:
data/lib/core_ext/string.rb DELETED
@@ -1,14 +0,0 @@
- class String
-   # break apart the string into strings of length `n`
-   #
-   # @example
-   #   'foobar'.ngram(3)
-   #   # => ["-fo", "foo", "oob", "oba", "bar", "ar-"]
-   def ngram(n)
-     fail ArgumentError, "n must be > 1, is #{n}" if n < 2
-     str = "-#{self}-"
-     (str.length - n + 1).times.map do |i|
-       str.slice(i, n)
-     end
-   end
- end