jaccard 1.0.1 → 1.0.2

Files changed (5)
  1. data/Gemfile +9 -0
  2. data/LICENSE +20 -0
  3. data/README.md +71 -0
  4. data/lib/jaccard.rb +114 -0
  5. metadata +11 -7
data/Gemfile ADDED
@@ -0,0 +1,9 @@
+ source :rubygems
+
+ gem "rake"
+ gem "yard"
+ gem "bluecloth" # yard dependency for Markdown formatting
+ gem "rspec", "> 2"
+ gem "autotest"
+ gem "ruby-debug", :platform => :ruby_18
+ gem "ruby-debug19", :platform => :ruby_19
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2010 François Beausoleil
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,71 @@
+ Jaccard
+ =======
+
+ The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
+
+ Examples
+ ========
+
+ Calculate how similar two sets are:
+
+     a = ["likes:jeans", "likes:blue"]
+     b = ["likes:jeans", "likes:women", "likes:red"]
+     c = ["likes:women", "likes:red"]
+
+     # Determines how similar a pair of sets are
+     Jaccard.coefficient(a, b)
+     #=> 0.25
+
+     Jaccard.coefficient(a, c)
+     #=> 0.0
+
+     Jaccard.coefficient(b, c)
+     #=> 0.6666666666666666
+
+     # According to the input data, b and c have the most similar likes.
+
+ We can also extract the distance quite easily:
+
+     Jaccard.distance(a, b)
+     #=> 0.75
+
+ The Jaccard distance is the inverse relation of the coefficient: `1 - coefficient`.
+
+ Find out which set is closest to a given set of attributes (return a value where the distance is the minimum):
+
+     Jaccard.closest_to(a, [b, c])
+     #=> ["likes:jeans", "likes:women", "likes:red"]
+
+     Jaccard.closest_to(b, [a, c])
+     #=> ["likes:women", "likes:red"]
+
+ Finally, we can find the best pair in a set:
+
+     require "pp"
+     pp Jaccard.best_match([a, b, c])
+     # [["likes:jeans", "likes:women", "likes:red"],
+     # ["likes:women", "likes:red"]]
+     #=> nil
+
+ Notes on scalability
+ ====================
+
+ This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
+
+ Note on Patches/Pull Requests
+ =============================
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a
+   future version unintentionally.
+ * Commit, do not mess with rakefile, version, or history.
+   (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+ * Send me a pull request. Bonus points for topic branches.
+
+ Copyright
+ =========
+
+ Copyright (c) 2010 François Beausoleil. See LICENSE for details.
+
+ [1]: http://en.wikipedia.org/wiki/Jaccard_index
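
Review note: the figures in the README's first example can be checked without installing the gem. The following is a minimal sketch that uses only Ruby's standard `Set`; the helper name `jaccard_coefficient` is hypothetical and chosen here for illustration, it is not part of the gem's API.

```ruby
require "set"

# Hypothetical helper (not part of the gem): size of the intersection
# divided by the size of the union, computed on Set instances.
def jaccard_coefficient(a, b)
  a = a.to_set
  b = b.to_set
  (a & b).size.to_f / (a | b).size
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

puts jaccard_coefficient(a, b)       # 1 shared item out of 4 distinct => 0.25
puts jaccard_coefficient(a, c)       # nothing shared => 0.0
puts 1.0 - jaccard_coefficient(a, b) # the corresponding distance => 0.75
```

Converting both inputs to `Set` sidesteps the duplicate-element concern that applies when a union is built by concatenating arrays.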
data/lib/jaccard.rb ADDED
@@ -0,0 +1,114 @@
+ require "set"
+
+ # Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
+ #
+ # (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
+ # as the size of the intersection divided by the size of the union of the sample sets.
+ #
+ # The closer to 1.0 this number is, the more similar two items are.
+ module Jaccard
+   # Calculates the Jaccard Coefficient Index.
+   #
+   # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
+   # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
+   # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
+   # wrong if the union contains duplicate elements.
+   #
+   # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
+   # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
+   # a correct notion of equality. Other instances might have to be checked to ensure correct
+   # behavior.
+   #
+   # @param [#&, #+] a A set of items
+   # @param [#&, #+] b A second set of items
+   #
+   # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3, 4]
+   #   b = [1, 3, 4]
+   #   Jaccard.coefficient(a, b) #=> 0.75
+   #
+   # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
+   def self.coefficient(a, b)
+     raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
+     raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
+
+     intersection = a & b
+     union = a + b
+
+     # Set does not implement #uniq or #uniq! since elements are
+     # always guaranteed to be present only once. That's the only
+     # reason we need to guard against that here.
+     union.uniq! if union.respond_to?(:uniq!)
+
+     intersection.length.to_f / union.length.to_f
+   end
+
+   # Calculates the inverse of the Jaccard coefficient.
+   #
+   # The closer to 0.0 the distance is, the more similar two items are.
+   #
+   # @return [Float] <code>1.0 - #coefficient(a, b)</code>
+   #
+   # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
+   def self.distance(a, b)
+     1.0 - coefficient(a, b)
+   end
+
+   # Determines which member of +others+ has the smallest distance vs +a+.
+   #
+   # Because of the implementation, if multiple items from +others+ have
+   # the same distance, the last one will be returned. If this is undesirable,
+   # reverse +others+ before calling #closest_to.
+   #
+   # @param [#&, #+] a A set of attributes
+   # @param [#inject] others A collection of set of attributes
+   #
+   # @return The item from +others+ with the distance minimized to 0.0.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3]
+   #   b = [1, 3]
+   #   c = [1, 2, 3]
+   #   Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
+   #   # Note that the actual instance returned will be c
+   def self.closest_to(a, others)
+     others.inject([2.0, nil]) do |memo, other|
+       dist = distance(a, other)
+       next memo if memo.first < dist
+
+       [dist, other]
+     end.last
+   end
+
+   # Returns the pair of items whose distance is minimized.
+   #
+   # @param [#each] items A collection of attributes.
+   #
+   # @return [Array<a, b>] A pair of set of attributes whose Jaccard distance is the minimal, given the input set.
+   #
+   # @example
+   #
+   #   a = [1, 2, 3]
+   #   b = [1, 2]
+   #   c = [1, 3]
+   #   Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
+   def self.best_match(items)
+     seen = Set.new
+     matches = []
+
+     items.each do |row|
+       items.each do |col|
+         next if row == col
+         next if seen.include?([row, col]) || seen.include?([col, row])
+         seen << [row, col]
+         matches << [distance(row, col), [row, col]]
+       end
+     end
+
+     matches.sort.first.last
+   end
+ end
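
Review note: the pair search in `best_match` above walks every ordered pair and de-duplicates with a `Set`. A possible alternative, sketched here under the assumption that `items` is an `Array` (the diffed code only requires `#each`), visits each unordered pair once with `Array#combination`. The names `jaccard_distance` and `best_pair` are hypothetical stand-ins so the sketch runs on its own; they are not part of the gem.

```ruby
# Hypothetical stand-in for Jaccard.distance so the sketch runs without the gem.
# Array#| returns the union without duplicates.
def jaccard_distance(a, b)
  1.0 - (a & b).length.to_f / (a | b).length
end

# Hypothetical alternative to the nested loops: enumerate each unordered
# pair exactly once and keep the pair with the smallest distance.
def best_pair(items)
  items.combination(2).min_by { |x, y| jaccard_distance(x, y) }
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

p best_pair([a, b, c])
# prints [["likes:jeans", "likes:women", "likes:red"], ["likes:women", "likes:red"]]
```

This sketch is not a drop-in replacement: it does not skip pairs whose members compare equal with `==`, and when several pairs share the minimum distance it may pick a different pair than the sort-based code in the diff.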
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: jaccard
  version: !ruby/object:Gem::Version
-   version: 1.0.1
+   version: 1.0.2
  prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ date: 2012-02-24 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &2153795880 !ruby/object:Gem::Requirement
+   requirement: &2153317540 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
          version: 1.2.9
    type: :development
    prerelease: false
-   version_requirements: *2153795880
+   version_requirements: *2153317540
  - !ruby/object:Gem::Dependency
    name: yard
-   requirement: &2153795400 !ruby/object:Gem::Requirement
+   requirement: &2153317060 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -32,14 +32,18 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2153795400
+   version_requirements: *2153317060
  description: The Jaccard Coefficient Index is a measure of how similar two sets are.
    This library makes calculating the coefficient very easy, and provides useful helpers.
  email: francois@teksol.info
  executables: []
  extensions: []
  extra_rdoc_files: []
- files: []
+ files:
+ - lib/jaccard.rb
+ - README.md
+ - LICENSE
+ - Gemfile
  homepage: http://github.com/francois/jaccard
  licenses: []
  post_install_message:
@@ -60,7 +64,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.6
+ rubygems_version: 1.8.17
  signing_key:
  specification_version: 3
  summary: A library to make calculating the Jaccard Coefficient Index a snap