jaccard 1.0.1 → 1.1.0

Files changed (6)
  1. checksums.yaml +7 -0
  2. data/Gemfile +5 -0
  3. data/LICENSE +20 -0
  4. data/README.md +81 -0
  5. data/lib/jaccard.rb +117 -0
  6. metadata +84 -25
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: a7f59e59910e3a93f27753822076ab579651c5b138406d71802779c14996726b
4
+ data.tar.gz: 711d75975eca08a4d6f1ac79d4926bef9d359baa6e80bab2bd54c2f8929259e8
5
+ SHA512:
6
+ metadata.gz: 979b053c4a4ca1fe294d532fc0d53bbdfef7d4d61ad644ed114181526095a85a961ae69e7a65208659ac84e685b6ab8679312e9141671bfbb81a61fb1090ef0a
7
+ data.tar.gz: 41e8b09e279c6afb490ebb30bf834bfd5bccc1dc19d29b6d3bb021c66b50841b1ac51a2e96257d5c31b9de671f60aeeb9c63059d50ba2fdeea13ed068024b070
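The SHA256 and SHA512 entries above are hex digests recorded for `metadata.gz` and `data.tar.gz`. As a rough sketch only (it is not part of the gem), a digest of this kind can be recomputed with Ruby's standard `digest` library and compared against the recorded value. The helper name `checksum_matches?` and any file path passed to it are hypothetical.

```ruby
require "digest"

# Hypothetical helper: returns true when the SHA256 hex digest of the file
# at +path+ equals +expected_hex+ (for example a value from checksums.yaml).
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end
```

Assuming an extracted `data.tar.gz` sits in the current directory, it could be checked against the SHA256 value listed above by passing that path and that hex string to the helper.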
data/Gemfile ADDED
@@ -0,0 +1,5 @@
1
+ source "https://rubygems.org"
2
+
3
+ ruby ">= 1.9.2"
4
+
5
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2010 François Beausoleil
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,81 @@
1
+ Jaccard
2
+ =======
3
+
4
+ The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
5
+
6
+ Examples
7
+ ========
8
+
9
+ Calculate how similar two sets are:
10
+
11
+ ```ruby
12
+ require 'jaccard'
13
+
14
+ a = ["likes:jeans", "likes:blue"]
15
+ b = ["likes:jeans", "likes:women", "likes:red"]
16
+ c = ["likes:women", "likes:red"]
17
+
18
+ # Determines how similar a pair of sets are
19
+ Jaccard.coefficient(a, b)
20
+ #=> 0.25
21
+
22
+ Jaccard.coefficient(a, c)
23
+ #=> 0.0
24
+
25
+ Jaccard.coefficient(b, c)
26
+ #=> 0.6666666666666666
27
+
28
+ # According to the input data, b and c have the most similar likes.
29
+ ```
30
+
31
+ We can also extract the distance quite easily:
32
+
33
+ ```ruby
34
+ Jaccard.distance(a, b)
35
+ #=> 0.75
36
+ ```
37
+
38
+ The Jaccard distance is the complement of the coefficient: `1 - coefficient`.
39
+
40
+ Find out which set is closest to a given set of attributes (returns the candidate whose distance is smallest):
41
+
42
+ ```ruby
43
+ Jaccard.closest_to(a, [b, c])
44
+ #=> ["likes:jeans", "likes:women", "likes:red"]
45
+
46
+ Jaccard.closest_to(b, [a, c])
47
+ #=> ["likes:women", "likes:red"]
48
+ ```
49
+
50
+ Finally, we can find the best pair in a set:
51
+
52
+ ```ruby
53
+ require "pp"
54
+ pp Jaccard.best_match([a, b, c])
55
+ # [["likes:jeans", "likes:women", "likes:red"],
56
+ # ["likes:women", "likes:red"]]
57
+ #=> nil
58
+ ```
59
+
60
+ Notes on scalability
61
+ ====================
62
+
63
+ This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
64
+
65
+ Note on Patches/Pull Requests
66
+ =============================
67
+
68
+ * Fork the project.
69
+ * Make your feature addition or bug fix.
70
+ * Add tests for it. This is important so I don't break it in a
71
+ future version unintentionally.
72
+ * Commit, but do not change the Rakefile, version, or history.
73
+ (If you want your own version, that is fine, but bump the version in a separate commit that I can ignore when I pull.)
74
+ * Send me a pull request. Bonus points for topic branches.
75
+
76
+ Copyright
77
+ =========
78
+
79
+ Copyright (c) 2010 François Beausoleil. See LICENSE for details.
80
+
81
+ [1]: http://en.wikipedia.org/wiki/Jaccard_index
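The README figures above follow directly from the definition: the size of the intersection divided by the size of the union. The sketch below recomputes them without loading the gem. `jaccard_coefficient` is a hypothetical stand-alone helper written for illustration, not the library's own method.

```ruby
require "set"

# Hypothetical helper mirroring the definition: |A & B| / |A | B|.
# Illustration only; this is not the gem's implementation.
def jaccard_coefficient(a, b)
  a = a.to_set
  b = b.to_set
  (a & b).size.to_f / (a | b).size
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

puts jaccard_coefficient(a, b)       # 1 shared item out of 4 distinct => 0.25
puts jaccard_coefficient(a, c)       # nothing shared => 0.0
puts jaccard_coefficient(b, c)       # 2 shared items out of 3 distinct
puts 1.0 - jaccard_coefficient(a, b) # the distance => 0.75
```

Converting both inputs to `Set` first means duplicates in either input cannot inflate the union.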
data/lib/jaccard.rb ADDED
@@ -0,0 +1,117 @@
1
+ # We must keep this due to Ruby 2.7 being supported
2
+ # rubocop:disable Lint/RedundantRequireStatement
3
+ require "set"
4
+ # rubocop:enable Lint/RedundantRequireStatement
5
+
6
+ # Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
7
+ #
8
+ # (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
9
+ # as the size of the intersection divided by the size of the union of the sample sets.
10
+ #
11
+ # The closer to 1.0 this number is, the more similar two items are.
12
+ module Jaccard
13
+ # Calculates the Jaccard Coefficient Index.
14
+ #
15
+ # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
16
+ # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
17
+ # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
18
+ # wrong if the union contains duplicate elements.
19
+ #
20
+ # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
21
+ # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
22
+ # a correct notion of equality. Other instances might have to be checked to ensure correct
23
+ # behavior.
24
+ #
25
+ # @param [#&, #+] a A set of items
26
+ # @param [#&, #+] b A second set of items
27
+ #
28
+ # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
29
+ #
30
+ # @example
31
+ #
32
+ # a = [1, 2, 3, 4]
33
+ # b = [1, 3, 4]
34
+ # Jaccard.coefficient(a, b) #=> 0.75
35
+ #
36
+ # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
37
+ def self.coefficient(a, b)
38
+ raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
39
+ raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
40
+
41
+ intersection = a & b
42
+ union = a + b
43
+
44
+ # Set does not implement #uniq or #uniq! since elements are
45
+ # always guaranteed to be present only once. That's the only
46
+ # reason we need to guard against that here.
47
+ union.uniq! if union.respond_to?(:uniq!)
48
+
49
+ intersection.length.to_f / union.length.to_f
50
+ end
51
+
52
+ # Calculates the Jaccard distance, the complement of the Jaccard coefficient.
53
+ #
54
+ # The closer to 0.0 the distance is, the more similar two items are.
55
+ #
56
+ # @return [Float] <code>1.0 - #coefficient(a, b)</code>
57
+ #
58
+ # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
59
+ def self.distance(a, b)
60
+ 1.0 - coefficient(a, b)
61
+ end
62
+
63
+ # Determines which member of +others+ has the smallest distance vs +a+.
64
+ #
65
+ # Because of the implementation, if multiple items from +others+ have
66
+ # the same distance, the last one will be returned. If this is undesirable,
67
+ # reverse +others+ before calling #closest_to.
68
+ #
69
+ # @param [#&, #+] a A set of attributes
70
+ # @param [#inject] others A collection of sets of attributes
71
+ #
72
+ # @return The item from +others+ whose distance to +a+ is smallest (closest to 0.0).
73
+ #
74
+ # @example
75
+ #
76
+ # a = [1, 2, 3]
77
+ # b = [1, 3]
78
+ # c = [1, 2, 3]
79
+ # Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
80
+ # # Note that the actual instance returned will be c
81
+ def self.closest_to(a, others)
82
+ others.inject([2.0, nil]) do |memo, other|
83
+ dist = distance(a, other)
84
+ next memo if memo.first < dist
85
+
86
+ [dist, other]
87
+ end.last
88
+ end
89
+
90
+ # Returns the pair of items whose distance is minimized.
91
+ #
92
+ # @param [#each] items A collection of attributes.
93
+ #
94
+ # @return [Array<a, b>] A pair of sets of attributes whose Jaccard distance is minimal among all pairs in the input.
95
+ #
96
+ # @example
97
+ #
98
+ # a = [1, 2, 3]
99
+ # b = [1, 2]
100
+ # c = [1, 3]
101
+ # Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
102
+ def self.best_match(items)
103
+ seen = Set.new
104
+ matches = []
105
+
106
+ items.each do |row|
107
+ items.each do |col|
108
+ next if row == col
109
+ next if seen.include?([row, col]) || seen.include?([col, row])
110
+ seen << [row, col]
111
+ matches << [distance(row, col), [row, col]]
112
+ end
113
+ end
114
+
115
+ matches.min.last
116
+ end
117
+ end
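Two behaviours follow from the `coefficient` code above and may be worth keeping in mind: `Array#+` concatenates rather than unions (hence the `uniq!` guard), and two empty inputs lead to `0.0 / 0.0`. The snippet below reproduces both with plain Ruby, without loading the gem. The variable names are illustrative only.

```ruby
a = [1, 2, 3, 4]
b = [1, 3, 4]

concatenated = a + b       # Array#+ keeps duplicates
union = concatenated.uniq  # what the uniq! guard reduces it to
intersection = a & b

puts concatenated.length   # 7
puts union.length          # 4
puts intersection.length.to_f / union.length # 0.75, matching the @example

# With two empty collections both lengths are zero, so the division is
# 0.0 / 0.0, which Ruby evaluates to NaN instead of raising an error.
empty_result = ([] & []).length.to_f / ([] + []).length.to_f
puts empty_result.nan?     # true
```

Callers that may pass two empty collections might want to check for `nan?` before comparing results.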
metadata CHANGED
@@ -1,68 +1,127 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: jaccard
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.1
5
- prerelease:
4
+ version: 1.1.0
6
5
  platform: ruby
7
6
  authors:
8
7
  - François Beausoleil
9
- autorequire:
8
+ autorequire:
10
9
  bindir: bin
11
10
  cert_chain: []
12
- date: 2012-02-24 00:00:00.000000000Z
11
+ date: 2023-06-20 00:00:00.000000000 Z
13
12
  dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: rake
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: 13.0.6
20
+ - - "<"
21
+ - !ruby/object:Gem::Version
22
+ version: '14.0'
23
+ type: :development
24
+ prerelease: false
25
+ version_requirements: !ruby/object:Gem::Requirement
26
+ requirements:
27
+ - - ">="
28
+ - !ruby/object:Gem::Version
29
+ version: 13.0.6
30
+ - - "<"
31
+ - !ruby/object:Gem::Version
32
+ version: '14.0'
14
33
  - !ruby/object:Gem::Dependency
15
34
  name: rspec
16
- requirement: &2153795880 !ruby/object:Gem::Requirement
17
- none: false
35
+ requirement: !ruby/object:Gem::Requirement
18
36
  requirements:
19
- - - ! '>='
37
+ - - ">="
20
38
  - !ruby/object:Gem::Version
21
39
  version: 1.2.9
40
+ - - "<"
41
+ - !ruby/object:Gem::Version
42
+ version: '4.0'
22
43
  type: :development
23
44
  prerelease: false
24
- version_requirements: *2153795880
45
+ version_requirements: !ruby/object:Gem::Requirement
46
+ requirements:
47
+ - - ">="
48
+ - !ruby/object:Gem::Version
49
+ version: 1.2.9
50
+ - - "<"
51
+ - !ruby/object:Gem::Version
52
+ version: '4.0'
53
+ - !ruby/object:Gem::Dependency
54
+ name: standardrb
55
+ requirement: !ruby/object:Gem::Requirement
56
+ requirements:
57
+ - - ">="
58
+ - !ruby/object:Gem::Version
59
+ version: 1.0.1
60
+ - - "<"
61
+ - !ruby/object:Gem::Version
62
+ version: '2.0'
63
+ type: :development
64
+ prerelease: false
65
+ version_requirements: !ruby/object:Gem::Requirement
66
+ requirements:
67
+ - - ">="
68
+ - !ruby/object:Gem::Version
69
+ version: 1.0.1
70
+ - - "<"
71
+ - !ruby/object:Gem::Version
72
+ version: '2.0'
25
73
  - !ruby/object:Gem::Dependency
26
74
  name: yard
27
- requirement: &2153795400 !ruby/object:Gem::Requirement
28
- none: false
75
+ requirement: !ruby/object:Gem::Requirement
29
76
  requirements:
30
- - - ! '>='
77
+ - - ">="
78
+ - !ruby/object:Gem::Version
79
+ version: 0.9.34
80
+ - - "<"
31
81
  - !ruby/object:Gem::Version
32
- version: '0'
82
+ version: '1.0'
33
83
  type: :development
34
84
  prerelease: false
35
- version_requirements: *2153795400
85
+ version_requirements: !ruby/object:Gem::Requirement
86
+ requirements:
87
+ - - ">="
88
+ - !ruby/object:Gem::Version
89
+ version: 0.9.34
90
+ - - "<"
91
+ - !ruby/object:Gem::Version
92
+ version: '1.0'
36
93
  description: The Jaccard Coefficient Index is a measure of how similar two sets are.
37
94
  This library makes calculating the coefficient very easy, and provides useful helpers.
38
95
  email: francois@teksol.info
39
96
  executables: []
40
97
  extensions: []
41
98
  extra_rdoc_files: []
42
- files: []
99
+ files:
100
+ - Gemfile
101
+ - LICENSE
102
+ - README.md
103
+ - lib/jaccard.rb
43
104
  homepage: http://github.com/francois/jaccard
44
- licenses: []
45
- post_install_message:
105
+ licenses:
106
+ - MIT
107
+ metadata: {}
108
+ post_install_message:
46
109
  rdoc_options: []
47
110
  require_paths:
48
111
  - lib
49
112
  required_ruby_version: !ruby/object:Gem::Requirement
50
- none: false
51
113
  requirements:
52
- - - ! '>='
114
+ - - ">="
53
115
  - !ruby/object:Gem::Version
54
116
  version: '0'
55
117
  required_rubygems_version: !ruby/object:Gem::Requirement
56
- none: false
57
118
  requirements:
58
- - - ! '>='
119
+ - - ">="
59
120
  - !ruby/object:Gem::Version
60
121
  version: '0'
61
122
  requirements: []
62
- rubyforge_project:
63
- rubygems_version: 1.8.6
64
- signing_key:
65
- specification_version: 3
123
+ rubygems_version: 3.4.10
124
+ signing_key:
125
+ specification_version: 4
66
126
  summary: A library to make calculating the Jaccard Coefficient Index a snap
67
127
  test_files: []
68
- has_rdoc:
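The development dependencies in the 1.1.0 metadata now carry both a lower and an upper bound. As a small sketch of how such a pair of constraints behaves, RubyGems' own `Gem::Requirement` can be asked whether a given version satisfies them. The candidate version numbers below are arbitrary examples, not versions referenced by this diff.

```ruby
require "rubygems"

# The rake constraint recorded in the metadata: >= 13.0.6 and < 14.0
rake_requirement = Gem::Requirement.new(">= 13.0.6", "< 14.0")

puts rake_requirement.satisfied_by?(Gem::Version.new("13.0.6")) # true
puts rake_requirement.satisfied_by?(Gem::Version.new("13.2.1")) # true
puts rake_requirement.satisfied_by?(Gem::Version.new("14.0.0")) # false
puts rake_requirement.satisfied_by?(Gem::Version.new("12.3.3")) # false
```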