jaccard 1.0.1 → 1.0.2
- data/Gemfile +9 -0
- data/LICENSE +20 -0
- data/README.md +71 -0
- data/lib/jaccard.rb +114 -0
- metadata +11 -7
data/Gemfile
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,20 @@
+Copyright (c) 2010 François Beausoleil
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,71 @@
+Jaccard
+=======
+
+The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
+
+Examples
+========
+
+Calculate how similar two sets are:
+
+    a = ["likes:jeans", "likes:blue"]
+    b = ["likes:jeans", "likes:women", "likes:red"]
+    c = ["likes:women", "likes:red"]
+
+    # Determines how similar a pair of sets are
+    Jaccard.coefficient(a, b)
+    #=> 0.25
+
+    Jaccard.coefficient(a, c)
+    #=> 0.0
+
+    Jaccard.coefficient(b, c)
+    #=> 0.6666666666666666
+
+    # According to the input data, b and c have the most similar likes.
+
+We can also extract the distance quite easily:
+
+    Jaccard.distance(a, b)
+    #=> 0.75
+
+The Jaccard distance is the inverse relation of the coefficient: `1 - coefficient`.
+
+Find out which set is closest to a given set of attributes (return a value where the distance is the minimum):
+
+    Jaccard.closest_to(a, [b, c])
+    #=> ["likes:jeans", "likes:women", "likes:red"]
+
+    Jaccard.closest_to(b, [a, c])
+    #=> ["likes:women", "likes:red"]
+
+Finally, we can find the best pair in a set:
+
+    require "pp"
+    pp Jaccard.best_match([a, b, c])
+    # [["likes:jeans", "likes:women", "likes:red"],
+    # ["likes:women", "likes:red"]]
+    #=> nil
+
+Notes on scalability
+====================
+
+This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
+
+Note on Patches/Pull Requests
+=============================
+
+* Fork the project.
+* Make your feature addition or bug fix.
+* Add tests for it. This is important so I don't break it in a
+  future version unintentionally.
+* Commit, do not mess with rakefile, version, or history.
+  (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+* Send me a pull request. Bonus points for topic branches.
+
+Copyright
+=========
+
+Copyright (c) 2010 François Beausoleil. See LICENSE for details.
+
+[1]: http://en.wikipedia.org/wiki/Jaccard_index
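The figures quoted in the README can be checked by hand with Ruby's built-in Array operators. The sketch below does not load the gem: `jaccard_coefficient` is a hypothetical stand-alone helper written only for this check, and it uses `Array#|` for the union, which is an assumption of the sketch rather than what the library does.

```ruby
# Hypothetical helper, not part of the gem: |A ∩ B| / |A ∪ B|.
def jaccard_coefficient(a, b)
  intersection = a & b   # elements present in both arrays
  union = a | b          # elements present in either array, duplicates removed
  intersection.length.to_f / union.length
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

# a and b share 1 of 4 distinct elements.
puts jaccard_coefficient(a, b)         # prints 0.25
# a and c share nothing.
puts jaccard_coefficient(a, c)         # prints 0.0
# The distance is the complement of the coefficient.
puts 1.0 - jaccard_coefficient(a, b)   # prints 0.75
```

The `b`/`c` pair shares 2 of 3 distinct elements, which is where the README's 0.6666666666666666 comes from.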
data/lib/jaccard.rb
ADDED
@@ -0,0 +1,114 @@
+require "set"
+
+# Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
+#
+# (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
+# as the size of the intersection divided by the size of the union of the sample sets.
+#
+# The closer to 1.0 this number is, the more similar two items are.
+module Jaccard
+  # Calculates the Jaccard Coefficient Index.
+  #
+  # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
+  # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
+  # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
+  # wrong if the union contains duplicate elements.
+  #
+  # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
+  # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
+  # a correct notion of equality. Other instances might have to be checked to ensure correct
+  # behavior.
+  #
+  # @param [#&, #+] a A set of items
+  # @param [#&, #+] b A second set of items
+  #
+  # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
+  #
+  # @example
+  #
+  #   a = [1, 2, 3, 4]
+  #   b = [1, 3, 4]
+  #   Jaccard.coefficient(a, b) #=> 0.75
+  #
+  # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
+  def self.coefficient(a, b)
+    raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
+    raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
+
+    intersection = a & b
+    union = a + b
+
+    # Set does not implement #uniq or #uniq! since elements are
+    # always guaranteed to be present only once. That's the only
+    # reason we need to guard against that here.
+    union.uniq! if union.respond_to?(:uniq!)
+
+    intersection.length.to_f / union.length.to_f
+  end
+
+  # Calculates the inverse of the Jaccard coefficient.
+  #
+  # The closer to 0.0 the distance is, the more similar two items are.
+  #
+  # @return [Float] <code>1.0 - #coefficient(a, b)</code>
+  #
+  # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
+  def self.distance(a, b)
+    1.0 - coefficient(a, b)
+  end
+
+  # Determines which member of +others+ has the smallest distance vs +a+.
+  #
+  # Because of the implementation, if multiple items from +others+ have
+  # the same distance, the last one will be returned. If this is undesirable,
+  # reverse +others+ before calling #closest_to.
+  #
+  # @param [#&, #+] a A set of attributes
+  # @param [#inject] others A collection of set of attributes
+  #
+  # @return The item from +others+ with the distance minimized to 0.0.
+  #
+  # @example
+  #
+  #   a = [1, 2, 3]
+  #   b = [1, 3]
+  #   c = [1, 2, 3]
+  #   Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
+  #   # Note that the actual instance returned will be c
+  def self.closest_to(a, others)
+    others.inject([2.0, nil]) do |memo, other|
+      dist = distance(a, other)
+      next memo if memo.first < dist
+
+      [dist, other]
+    end.last
+  end
+
+  # Returns the pair of items whose distance is minimized.
+  #
+  # @param [#each] items A collection of attributes.
+  #
+  # @return [Array<a, b>] A pair of set of attributes whose Jaccard distance is the minimal, given the input set.
+  #
+  # @example
+  #
+  #   a = [1, 2, 3]
+  #   b = [1, 2]
+  #   c = [1, 3]
+  #   Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
+  def self.best_match(items)
+    seen = Set.new
+    matches = []
+
+    items.each do |row|
+      items.each do |col|
+        next if row == col
+        next if seen.include?([row, col]) || seen.include?([col, row])
+        seen << [row, col]
+        matches << [distance(row, col), [row, col]]
+      end
+    end
+
+    matches.sort.first.last
+  end
+end
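The comments on `closest_to` say that the last of several equally distant candidates is returned, and the comments on `coefficient` say that Array and Set inputs are both accepted. A minimal sketch of both points follows. It assumes nothing beyond the code in the diff above: `JaccardSketch` is a hypothetical module name holding a condensed copy of `distance` and `closest_to`, so that the example runs without the gem installed.

```ruby
require "set"

# Hypothetical condensed copy of two methods from the diff above.
module JaccardSketch
  def self.distance(a, b)
    union = a + b
    # Array#+ concatenates, so duplicates must be dropped; Set has no #uniq!.
    union.uniq! if union.respond_to?(:uniq!)
    1.0 - (a & b).length.to_f / union.length
  end

  def self.closest_to(a, others)
    others.inject([2.0, nil]) do |memo, other|
      dist = distance(a, other)
      # Keep the current best only when it is strictly closer.
      next memo if memo.first < dist

      [dist, other]
    end.last
  end
end

a = [1, 2, 3]
b = [1, 3]
c = [1, 2, 3] # equal to a, but a distinct object

# On a tie the later candidate replaces the earlier one.
winner = JaccardSketch.closest_to(b, [a, c])
puts winner.equal?(c) # prints true

# Reversing the candidates flips the tie-break.
winner = JaccardSketch.closest_to(b, [a, c].reverse)
puts winner.equal?(a) # prints true

# Set inputs follow the same path; Set#+ already yields a duplicate-free union.
puts JaccardSketch.distance(Set[1, 2, 3], Set[1, 3])
```

The strict `<` comparison is what makes the last tied candidate win; with `<=` the first one would be kept instead.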
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: jaccard
 version: !ruby/object:Gem::Version
-  version: 1.0.
+  version: 1.0.2
 prerelease:
 platform: ruby
 authors:
@@ -13,7 +13,7 @@ date: 2012-02-24 00:00:00.000000000Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &2153317540 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: 1.2.9
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2153317540
 - !ruby/object:Gem::Dependency
   name: yard
-  requirement: &
+  requirement: &2153317060 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,14 +32,18 @@ dependencies:
         version: '0'
   type: :development
   prerelease: false
-  version_requirements: *
+  version_requirements: *2153317060
 description: The Jaccard Coefficient Index is a measure of how similar two sets are.
   This library makes calculating the coefficient very easy, and provides useful helpers.
 email: francois@teksol.info
 executables: []
 extensions: []
 extra_rdoc_files: []
-files:
+files:
+- lib/jaccard.rb
+- README.md
+- LICENSE
+- Gemfile
 homepage: http://github.com/francois/jaccard
 licenses: []
 post_install_message:
@@ -60,7 +64,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version: 1.8.
+rubygems_version: 1.8.17
 signing_key:
 specification_version: 3
 summary: A library to make calculating the Jaccard Coefficient Index a snap