jaccard 1.0.1 → 1.0.2
- data/Gemfile +9 -0
- data/LICENSE +20 -0
- data/README.md +71 -0
- data/lib/jaccard.rb +114 -0
- metadata +11 -7
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2010 François Beausoleil
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,71 @@
+Jaccard
+=======
+
+The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
+
+Examples
+========
+
+Calculate how similar two sets are:
+
+    a = ["likes:jeans", "likes:blue"]
+    b = ["likes:jeans", "likes:women", "likes:red"]
+    c = ["likes:women", "likes:red"]
+
+    # Determines how similar a pair of sets are
+    Jaccard.coefficient(a, b)
+    #=> 0.25
+
+    Jaccard.coefficient(a, c)
+    #=> 0.0
+
+    Jaccard.coefficient(b, c)
+    #=> 0.6666666666666666
+
+    # According to the input data, b and c have the most similar likes.
+
+We can also extract the distance quite easily:
+
+    Jaccard.distance(a, b)
+    #=> 0.75
+
+The Jaccard distance is the inverse relation of the coefficient: `1 - coefficient`.
+
+Find out which set is closest to a given set of attributes (return a value where the distance is the minimum):
+
+    Jaccard.closest_to(a, [b, c])
+    #=> ["likes:jeans", "likes:women", "likes:red"]
+
+    Jaccard.closest_to(b, [a, c])
+    #=> ["likes:women", "likes:red"]
+
+Finally, we can find the best pair in a set:
+
+    require "pp"
+    pp Jaccard.best_match([a, b, c])
+    # [["likes:jeans", "likes:women", "likes:red"],
+    # ["likes:women", "likes:red"]]
+    #=> nil
+
+Notes on scalability
+====================
+
+This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
+
+Note on Patches/Pull Requests
+=============================
+
+* Fork the project.
+* Make your feature addition or bug fix.
+* Add tests for it. This is important so I don't break it in a
+  future version unintentionally.
+* Commit, do not mess with rakefile, version, or history.
+  (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+* Send me a pull request. Bonus points for topic branches.
+
+Copyright
+=========
+
+Copyright (c) 2010 François Beausoleil. See LICENSE for details.
+
+[1]: http://en.wikipedia.org/wiki/Jaccard_index
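The figures quoted in the README examples can be checked by hand with plain Array operations. The sketch below does not call the gem: `jaccard_coefficient` is a hypothetical stand-in that follows the definition the README links to (size of the intersection divided by size of the union), and the three sets are the ones from the README.

```ruby
# Hypothetical stand-in for illustration only; not the gem's own code.
# Size of the intersection (Array#&) divided by size of the union (Array#|).
def jaccard_coefficient(a, b)
  (a & b).length.to_f / (a | b).length
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

coefficient_ab = jaccard_coefficient(a, b) # 1 shared item out of 4 distinct ones
distance_ab    = 1.0 - coefficient_ab      # distance is the complement of the coefficient
coefficient_ac = jaccard_coefficient(a, c) # no shared items
coefficient_bc = jaccard_coefficient(b, c) # 2 shared items out of 3 distinct ones

puts coefficient_ab, distance_ab, coefficient_ac, coefficient_bc
```

Under these assumptions the values agree with the README: 0.25 and 0.75 for `a` and `b`, 0.0 for `a` and `c`, and two thirds for `b` and `c`.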
data/lib/jaccard.rb
ADDED
@@ -0,0 +1,114 @@
+require "set"
+
+# Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
+#
+# (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
+# as the size of the intersection divided by the size of the union of the sample sets.
+#
+# The closer to 1.0 this number is, the more similar two items are.
+module Jaccard
+  # Calculates the Jaccard Coefficient Index.
+  #
+  # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
+  # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
+  # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
+  # wrong if the union contains duplicate elements.
+  #
+  # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
+  # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
+  # a correct notion of equality. Other instances might have to be checked to ensure correct
+  # behavior.
+  #
+  # @param [#&, #+] a A set of items
+  # @param [#&, #+] b A second set of items
+  #
+  # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
+  #
+  # @example
+  #
+  # a = [1, 2, 3, 4]
+  # b = [1, 3, 4]
+  # Jaccard.coefficient(a, b) #=> 0.75
+  #
+  # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
+  def self.coefficient(a, b)
+    raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
+    raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
+
+    intersection = a & b
+    union = a + b
+
+    # Set does not implement #uniq or #uniq! since elements are
+    # always guaranteed to be present only once. That's the only
+    # reason we need to guard against that here.
+    union.uniq! if union.respond_to?(:uniq!)
+
+    intersection.length.to_f / union.length.to_f
+  end
+
+  # Calculates the inverse of the Jaccard coefficient.
+  #
+  # The closer to 0.0 the distance is, the more similar two items are.
+  #
+  # @return [Float] <code>1.0 - #coefficient(a, b)</code>
+  #
+  # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
+  def self.distance(a, b)
+    1.0 - coefficient(a, b)
+  end
+
+  # Determines which member of +others+ has the smallest distance vs +a+.
+  #
+  # Because of the implementation, if multiple items from +others+ have
+  # the same distance, the last one will be returned. If this is undesirable,
+  # reverse +others+ before calling #closest_to.
+  #
+  # @param [#&, #+] a A set of attributes
+  # @param [#inject] others A collection of set of attributes
+  #
+  # @return The item from +others+ with the distance minimized to 0.0.
+  #
+  # @example
+  #
+  # a = [1, 2, 3]
+  # b = [1, 3]
+  # c = [1, 2, 3]
+  # Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
+  # # Note that the actual instance returned will be c
+  def self.closest_to(a, others)
+    others.inject([2.0, nil]) do |memo, other|
+      dist = distance(a, other)
+      next memo if memo.first < dist
+
+      [dist, other]
+    end.last
+  end
+
+  # Returns the pair of items whose distance is minimized.
+  #
+  # @param [#each] items A collection of attributes.
+  #
+  # @return [Array<a, b>] A pair of set of attributes whose Jaccard distance is the minimal, given the input set.
+  #
+  # @example
+  #
+  # a = [1, 2, 3]
+  # b = [1, 2]
+  # c = [1, 3]
+  # Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
+  def self.best_match(items)
+    seen = Set.new
+    matches = []
+
+    items.each do |row|
+      items.each do |col|
+        next if row == col
+        next if seen.include?([row, col]) || seen.include?([col, row])
+        seen << [row, col]
+        matches << [distance(row, col), [row, col]]
+      end
+    end
+
+    matches.sort.first.last
+  end
+end
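The `best_match` method added above walks every ordered pair, de-duplicates with a `Set`, and then sorts `[distance, pair]` entries. When two distances are equal, `Array#sort` falls back to comparing the pairs themselves. One possible alternative, shown here only as a sketch, is to enumerate unordered pairs with `Array#combination(2)` and pick the smallest distance with `min_by`. The module `JaccardSketch` and its methods `coefficient`, `distance` and `best_pair` are hypothetical, simplified stand-ins and not part of the gem; the sketch also assumes `items` responds to `to_a`.

```ruby
require "set"

# Hypothetical, simplified stand-ins for illustration; not the gem's API.
module JaccardSketch
  # Size of the intersection over size of the union.
  # Uses #| for the union, so Arrays need no separate uniq! step.
  def self.coefficient(a, b)
    (a & b).size.to_f / (a | b).size
  end

  # Complement of the coefficient.
  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end

  # Visits each unordered pair once and compares distances only,
  # so the items themselves never need to support <=>.
  def self.best_pair(items)
    items.to_a.combination(2).min_by { |x, y| distance(x, y) }
  end
end

a = Set["likes:jeans", "likes:blue"]
b = Set["likes:jeans", "likes:women", "likes:red"]
c = Set["likes:women", "likes:red"]

best = JaccardSketch.best_pair([a, b, c])
p best
```

Two behaviours of this sketch differ from the diffed code: pairs of distinct but equal items are not skipped, and an input with fewer than two items yields `nil`.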
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jaccard
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.2
 prerelease:
 platform: ruby
 authors:
@@ -13,7 +13,7 @@ date: 2012-02-24 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &2153317540 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: 1.2.9
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2153317540
 - !ruby/object:Gem::Dependency
   name: yard
-  requirement: &
+  requirement: &2153317060 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,14 +32,18 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2153317060
 description: The Jaccard Coefficient Index is a measure of how similar two sets are.
   This library makes calculating the coefficient very easy, and provides useful helpers.
 email: francois@teksol.info
 executables: []
 extensions: []
 extra_rdoc_files: []
-files:
+files:
+- lib/jaccard.rb
+- README.md
+- LICENSE
+- Gemfile
 homepage: http://github.com/francois/jaccard
 licenses: []
 post_install_message:
@@ -60,7 +64,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.
+rubygems_version: 1.8.17
 signing_key:
 specification_version: 3
 summary: A library to make calculating the Jaccard Coefficient Index a snap