jaccard 1.0.1 → 1.1.0

Files changed (6)
  1. checksums.yaml +7 -0
  2. data/Gemfile +5 -0
  3. data/LICENSE +20 -0
  4. data/README.md +81 -0
  5. data/lib/jaccard.rb +117 -0
  6. metadata +84 -25
checksums.yaml ADDED
@@ -0,0 +1,7 @@
+ ---
+ SHA256:
+   metadata.gz: a7f59e59910e3a93f27753822076ab579651c5b138406d71802779c14996726b
+   data.tar.gz: 711d75975eca08a4d6f1ac79d4926bef9d359baa6e80bab2bd54c2f8929259e8
+ SHA512:
+   metadata.gz: 979b053c4a4ca1fe294d532fc0d53bbdfef7d4d61ad644ed114181526095a85a961ae69e7a65208659ac84e685b6ab8679312e9141671bfbb81a61fb1090ef0a
+   data.tar.gz: 41e8b09e279c6afb490ebb30bf834bfd5bccc1dc19d29b6d3bb021c66b50841b1ac51a2e96257d5c31b9de671f60aeeb9c63059d50ba2fdeea13ed068024b070
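The SHA256 entries above can be compared against the digests of the archives packed inside the `.gem` file. The following is a minimal sketch, assuming `data.tar.gz` has already been extracted into the current directory; `sha256_matches?` is a hypothetical helper written for illustration and is not part of the gem or of RubyGems.

```ruby
require "digest"

# Hypothetical helper: true when the file's SHA256 hex digest equals the
# expected value. The expected string is lowercased before comparing.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Expected digest copied from the SHA256 section of checksums.yaml.
expected = "711d75975eca08a4d6f1ac79d4926bef9d359baa6e80bab2bd54c2f8929259e8"

# Assumption: the archive was extracted next to this script. The check is
# skipped when the file is absent so the sketch runs anywhere.
if File.exist?("data.tar.gz")
  puts sha256_matches?("data.tar.gz", expected) ? "data.tar.gz: OK" : "data.tar.gz: MISMATCH"
end
```

The same helper would apply to `metadata.gz` with its own expected value; a SHA512 check would swap in `Digest::SHA512`.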
data/Gemfile ADDED
@@ -0,0 +1,5 @@
+ source "https://rubygems.org"
+
+ ruby ">= 1.9.2"
+
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2010 François Beausoleil
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,81 @@
+ Jaccard
+ =======
+
+ The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
+
+ Examples
+ ========
+
+ Calculate how similar two sets are:
+
+ ```ruby
+ require 'jaccard'
+
+ a = ["likes:jeans", "likes:blue"]
+ b = ["likes:jeans", "likes:women", "likes:red"]
+ c = ["likes:women", "likes:red"]
+
+ # Determines how similar a pair of sets are
+ Jaccard.coefficient(a, b)
+ #=> 0.25
+
+ Jaccard.coefficient(a, c)
+ #=> 0.0
+
+ Jaccard.coefficient(b, c)
+ #=> 0.6666666666666666
+
+ # According to the input data, b and c have the most similar likes.
+ ```
+
+ We can also extract the distance quite easily:
+
+ ```ruby
+ Jaccard.distance(a, b)
+ #=> 0.75
+ ```
+
+ The Jaccard distance is the inverse relation of the coefficient: `1 - coefficient`.
+
+ Find out which set is closest to a given set of attributes (return a value where the distance is the minimum):
+
+ ```ruby
+ Jaccard.closest_to(a, [b, c])
+ #=> ["likes:jeans", "likes:women", "likes:red"]
+
+ Jaccard.closest_to(b, [a, c])
+ #=> ["likes:women", "likes:red"]
+ ```
+
+ Finally, we can find the best pair in a set:
+
+ ```ruby
+ require "pp"
+ pp Jaccard.best_match([a, b, c])
+ # [["likes:jeans", "likes:women", "likes:red"],
+ # ["likes:women", "likes:red"]]
+ #=> nil
+ ```
+
+ Notes on scalability
+ ====================
+
+ This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
+
+ Note on Patches/Pull Requests
+ =============================
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a
+ future version unintentionally.
+ * Commit, do not mess with rakefile, version, or history.
+ (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+ * Send me a pull request. Bonus points for topic branches.
+
+ Copyright
+ =========
+
+ Copyright (c) 2010 François Beausoleil. See LICENSE for details.
+
+ [1]: http://en.wikipedia.org/wiki/Jaccard_index
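The numbers in the README examples above can be reproduced by hand from the definition (size of the intersection divided by size of the union). The following is a minimal sketch that runs without the gem installed; `jaccard_index` is a hypothetical standalone helper, not the gem's API, and it assumes at least one input is non-empty.

```ruby
require "set"

# Hypothetical standalone helper (not the gem's API):
# |A intersect B| / |A union B|, computed on Set copies of the inputs.
# Assumes at least one of the inputs is non-empty.
def jaccard_index(a, b)
  set_a = Set.new(a)
  set_b = Set.new(b)
  (set_a & set_b).size.to_f / (set_a | set_b).size
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

puts jaccard_index(a, b)        # 1 shared item out of 4 distinct items: 0.25
puts 1.0 - jaccard_index(a, b)  # the matching distance: 0.75
puts jaccard_index(b, c)        # 2 shared items out of 3 distinct items
```

Converting to `Set` first sidesteps the duplicate-element concern that arises when the union is built by concatenating arrays.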
data/lib/jaccard.rb ADDED
@@ -0,0 +1,117 @@
+ # We must keep this due to Ruby 2.7 being supported
+ # rubocop:disable Lint/RedundantRequireStatement
+ require "set"
+ # rubocop:enable Lint/RedundantRequireStatement
+
+ # Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
+ #
+ # (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
+ # as the size of the intersection divided by the size of the union of the sample sets.
+ #
+ # The closer to 1.0 this number is, the more similar two items are.
+ module Jaccard
+   # Calculates the Jaccard Coefficient Index.
+   #
+   # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
+   # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
+   # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
+   # wrong if the union contains duplicate elements.
+   #
+   # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
+   # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
+   # a correct notion of equality. Other instances might have to be checked to ensure correct
+   # behavior.
+   #
+   # @param [#&, #+] a A set of items
+   # @param [#&, #+] b A second set of items
+   #
+   # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3, 4]
+   #   b = [1, 3, 4]
+   #   Jaccard.coefficient(a, b) #=> 0.75
+   #
+   # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
+   def self.coefficient(a, b)
+     raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
+     raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
+
+     intersection = a & b
+     union = a + b
+
+     # Set does not implement #uniq or #uniq! since elements are
+     # always guaranteed to be present only once. That's the only
+     # reason we need to guard against that here.
+     union.uniq! if union.respond_to?(:uniq!)
+
+     intersection.length.to_f / union.length.to_f
+   end
+
+   # Calculates the inverse of the Jaccard coefficient.
+   #
+   # The closer to 0.0 the distance is, the more similar two items are.
+   #
+   # @return [Float] <code>1.0 - #coefficient(a, b)</code>
+   #
+   # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
+   def self.distance(a, b)
+     1.0 - coefficient(a, b)
+   end
+
+   # Determines which member of +others+ has the smallest distance vs +a+.
+   #
+   # Because of the implementation, if multiple items from +others+ have
+   # the same distance, the last one will be returned. If this is undesirable,
+   # reverse +others+ before calling #closest_to.
+   #
+   # @param [#&, #+] a A set of attributes
+   # @param [#inject] others A collection of set of attributes
+   #
+   # @return The item from +others+ with the distance minimized to 0.0.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3]
+   #   b = [1, 3]
+   #   c = [1, 2, 3]
+   #   Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
+   #   # Note that the actual instance returned will be c
+   def self.closest_to(a, others)
+     others.inject([2.0, nil]) do |memo, other|
+       dist = distance(a, other)
+       next memo if memo.first < dist
+
+       [dist, other]
+     end.last
+   end
+
+   # Returns the pair of items whose distance is minimized.
+   #
+   # @param [#each] items A collection of attributes.
+   #
+   # @return [Array<a, b>] A pair of set of attributes whose Jaccard distance is the minimal, given the input set.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3]
+   #   b = [1, 2]
+   #   c = [1, 3]
+   #   Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
+   def self.best_match(items)
+     seen = Set.new
+     matches = []
+
+     items.each do |row|
+       items.each do |col|
+         next if row == col
+         next if seen.include?([row, col]) || seen.include?([col, row])
+         seen << [row, col]
+         matches << [distance(row, col), [row, col]]
+       end
+     end
+
+     matches.min.last
+   end
+ end
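Two behaviours follow from the code in this file as shown: two empty inputs make `coefficient` divide 0.0 by 0.0, and the strict `<` in `closest_to` lets the last of several equally distant candidates win. The following is a minimal sketch that demonstrates both without the gem installed; `sketch_coefficient` and `sketch_closest_to` are hypothetical standalone copies of the logic above (argument checks omitted), named differently so they are not mistaken for the gem's API.

```ruby
# Standalone copy of the coefficient logic from the diff, for illustration only.
def sketch_coefficient(a, b)
  intersection = a & b
  union = a + b
  union.uniq! if union.respond_to?(:uniq!)
  intersection.length.to_f / union.length.to_f
end

# Standalone copy of the closest_to logic from the diff, for illustration only.
def sketch_closest_to(a, others)
  others.inject([2.0, nil]) do |memo, other|
    dist = 1.0 - sketch_coefficient(a, other)
    next memo if memo.first < dist # strict comparison: a tie replaces memo

    [dist, other]
  end.last
end

# Both inputs empty: 0.0 / 0.0 is NaN in Ruby's Float arithmetic.
puts sketch_coefficient([], []).nan?

# Tie-breaking: a and c are equal in content and equally distant from b.
a = [1, 2, 3]
b = [1, 3]
c = [1, 2, 3]
puts sketch_closest_to(b, [a, c]).equal?(c) # the last candidate is returned
puts sketch_closest_to(b, [c, a]).equal?(a) # reversing the list flips the winner
```

Callers that may pass two empty collections would likely want to check for that case before calling, since which value to assign to a pair of empty sets is a convention the source does not specify.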
metadata CHANGED
@@ -1,68 +1,127 @@
  --- !ruby/object:Gem::Specification
  name: jaccard
  version: !ruby/object:Gem::Version
-   version: 1.0.1
-   prerelease:
+   version: 1.1.0
  platform: ruby
  authors:
  - François Beausoleil
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2012-02-24 00:00:00.000000000Z
+ date: 2023-06-20 00:00:00.000000000 Z
  dependencies:
+ - !ruby/object:Gem::Dependency
+   name: rake
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 13.0.6
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '14.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 13.0.6
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '14.0'
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &2153795880 !ruby/object:Gem::Requirement
-     none: false
+   requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - ! '>='
+     - - ">="
        - !ruby/object:Gem::Version
          version: 1.2.9
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '4.0'
    type: :development
    prerelease: false
-   version_requirements: *2153795880
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.2.9
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '4.0'
+ - !ruby/object:Gem::Dependency
+   name: standardrb
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.0.1
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '2.0'
+   type: :development
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 1.0.1
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '2.0'
  - !ruby/object:Gem::Dependency
    name: yard
-   requirement: &2153795400 !ruby/object:Gem::Requirement
-     none: false
+   requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - ! '>='
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.9.34
+     - - "<"
        - !ruby/object:Gem::Version
-         version: '0'
+         version: '1.0'
    type: :development
    prerelease: false
-   version_requirements: *2153795400
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         version: 0.9.34
+     - - "<"
+       - !ruby/object:Gem::Version
+         version: '1.0'
  description: The Jaccard Coefficient Index is a measure of how similar two sets are.
    This library makes calculating the coefficient very easy, and provides useful helpers.
  email: francois@teksol.info
  executables: []
  extensions: []
  extra_rdoc_files: []
- files: []
+ files:
+ - Gemfile
+ - LICENSE
+ - README.md
+ - lib/jaccard.rb
  homepage: http://github.com/francois/jaccard
- licenses: []
- post_install_message:
+ licenses:
+ - MIT
+ metadata: {}
+ post_install_message:
  rdoc_options: []
  require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  required_rubygems_version: !ruby/object:Gem::Requirement
-   none: false
    requirements:
-   - - ! '>='
+   - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubyforge_project:
- rubygems_version: 1.8.6
- signing_key:
- specification_version: 3
+ rubygems_version: 3.4.10
+ signing_key:
+ specification_version: 4
  summary: A library to make calculating the Jaccard Coefficient Index a snap
  test_files: []
- has_rdoc:
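The metadata change above replaces open-ended development dependency constraints with bounded ranges. The following is a minimal sketch of what that means in practice, using RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the version numbers being tested (`3.12.0`, `4.0.0`, `1.0.0`) are illustrative assumptions and do not come from the diff.

```ruby
require "rubygems"

# Constraints copied from the metadata diff.
old_yard  = Gem::Requirement.new(">= 0")
new_yard  = Gem::Requirement.new(">= 0.9.34", "< 1.0")
new_rspec = Gem::Requirement.new(">= 1.2.9", "< 4.0")

# Illustrative versions (assumptions, not taken from the diff).
puts new_rspec.satisfied_by?(Gem::Version.new("3.12.0")) # inside the new range
puts new_rspec.satisfied_by?(Gem::Version.new("4.0.0"))  # excluded by "< 4.0"
puts old_yard.satisfied_by?(Gem::Version.new("1.0.0"))   # the old ">= 0" accepted any release
puts new_yard.satisfied_by?(Gem::Version.new("1.0.0"))   # excluded by "< 1.0"
```

All four dependencies are declared with `type: :development`, so these ranges apply when working on the gem itself rather than when installing it as a runtime dependency.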