jaccard 1.0.1 → 1.0.2

Files changed (5)
  1. data/Gemfile +9 -0
  2. data/LICENSE +20 -0
  3. data/README.md +71 -0
  4. data/lib/jaccard.rb +114 -0
  5. metadata +11 -7
data/Gemfile ADDED
@@ -0,0 +1,9 @@
+ source :rubygems
+
+ gem "rake"
+ gem "yard"
+ gem "bluecloth" # yard dependency for Markdown formatting
+ gem "rspec", "> 2"
+ gem "autotest"
+ gem "ruby-debug", :platform => :ruby_18
+ gem "ruby-debug19", :platform => :ruby_19
data/LICENSE ADDED
@@ -0,0 +1,20 @@
+ Copyright (c) 2010 François Beausoleil
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,71 @@
+ Jaccard
+ =======
+
+ The [Jaccard Coefficient Index][1] is a measure of how similar two sets are. This library makes calculating the coefficient very easy, and provides useful helpers.
+
+ Examples
+ ========
+
+ Calculate how similar two sets are:
+
+     a = ["likes:jeans", "likes:blue"]
+     b = ["likes:jeans", "likes:women", "likes:red"]
+     c = ["likes:women", "likes:red"]
+
+     # Determines how similar a pair of sets are
+     Jaccard.coefficient(a, b)
+     #=> 0.25
+
+     Jaccard.coefficient(a, c)
+     #=> 0.0
+
+     Jaccard.coefficient(b, c)
+     #=> 0.6666666666666666
+
+     # According to the input data, b and c have the most similar likes.
+
+ We can also extract the distance quite easily:
+
+     Jaccard.distance(a, b)
+     #=> 0.75
+
+ The Jaccard distance is the inverse relation of the coefficient: `1 - coefficient`.
+
+ Find out which set is closest to a given set of attributes (return a value where the distance is the minimum):
+
+     Jaccard.closest_to(a, [b, c])
+     #=> ["likes:jeans", "likes:women", "likes:red"]
+
+     Jaccard.closest_to(b, [a, c])
+     #=> ["likes:women", "likes:red"]
+
+ Finally, we can find the best pair in a set:
+
+     require "pp"
+     pp Jaccard.best_match([a, b, c])
+     # [["likes:jeans", "likes:women", "likes:red"],
+     # ["likes:women", "likes:red"]]
+     #=> nil
+
+ Notes on scalability
+ ====================
+
+ This library wasn't designed to handle millions of entries. You'll have to benchmark and see if this library meets your needs.
+
+ Note on Patches/Pull Requests
+ =============================
+
+ * Fork the project.
+ * Make your feature addition or bug fix.
+ * Add tests for it. This is important so I don't break it in a
+   future version unintentionally.
+ * Commit, do not mess with rakefile, version, or history.
+   (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)
+ * Send me a pull request. Bonus points for topic branches.
+
+ Copyright
+ =========
+
+ Copyright (c) 2010 François Beausoleil. See LICENSE for details.
+
+ [1]: http://en.wikipedia.org/wiki/Jaccard_index
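Reviewer's note: the coefficient and distance values quoted in the README hunk above can be checked by hand with Ruby's standard `Set`, without installing the gem. The sketch below is an illustration only and is not code from the package; `jaccard_coefficient` is a hypothetical helper name chosen for this note.

```ruby
require "set"

# Hypothetical stand-alone helper (not part of the gem): size of the
# intersection divided by the size of the union.
def jaccard_coefficient(a, b)
  a = a.to_set
  b = b.to_set
  (a & b).size.to_f / (a | b).size
end

a = ["likes:jeans", "likes:blue"]
b = ["likes:jeans", "likes:women", "likes:red"]
c = ["likes:women", "likes:red"]

puts jaccard_coefficient(a, b)        # 1 shared of 4 distinct => 0.25
puts jaccard_coefficient(a, c)        # 0 shared of 4 distinct => 0.0
puts jaccard_coefficient(b, c)        # 2 shared of 3 distinct => 0.6666666666666666
puts 1.0 - jaccard_coefficient(a, b)  # distance => 0.75
```

These match the README's figures, and they also agree with its `closest_to` and `best_match` examples, since `b` and `c` form the pair with the highest coefficient.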
data/lib/jaccard.rb ADDED
@@ -0,0 +1,114 @@
+ require "set"
+
+ # Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
+ #
+ # (from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined
+ # as the size of the intersection divided by the size of the union of the sample sets.
+ #
+ # The closer to 1.0 this number is, the more similar two items are.
+ module Jaccard
+   # Calculates the Jaccard Coefficient Index.
+   #
+   # +a+ must implement the set intersection and set union operators: <code>#&</code> and <code>#+</code>. Array and Set
+   # both implement these methods natively. It is expected that the results of <code>+</code> will either return a
+   # unique set or that it returns an object that responds to +#uniq!+. The results of +#coefficient+ will be
+   # wrong if the union contains duplicate elements.
+   #
+   # Also note that the individual items in +a+ and +b+ must implement a sane #eql? method.
+   # ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement
+   # a correct notion of equality. Other instances might have to be checked to ensure correct
+   # behavior.
+   #
+   # @param [#&, #+] a A set of items
+   # @param [#&, #+] b A second set of items
+   #
+   # @return [Float] The Jaccard Coefficient Index between +a+ and +b+.
+   #
+   # @example
+   #
+   # a = [1, 2, 3, 4]
+   # b = [1, 3, 4]
+   # Jaccard.coefficient(a, b) #=> 0.75
+   #
+   # @see http://en.wikipedia.org/wiki/Jaccard_index Jaccard Coefficient Index on Wikipedia.
+   def self.coefficient(a, b)
+     raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
+     raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)
+
+     intersection = a & b
+     union = a + b
+
+     # Set does not implement #uniq or #uniq! since elements are
+     # always guaranteed to be present only once. That's the only
+     # reason we need to guard against that here.
+     union.uniq! if union.respond_to?(:uniq!)
+
+     intersection.length.to_f / union.length.to_f
+   end
+
+   # Calculates the inverse of the Jaccard coefficient.
+   #
+   # The closer to 0.0 the distance is, the more similar two items are.
+   #
+   # @return [Float] <code>1.0 - #coefficient(a, b)</code>
+   #
+   # @see Jaccard#coefficient for parameter calling convention and caveats about Array vs Set vs other object types.
+   def self.distance(a, b)
+     1.0 - coefficient(a, b)
+   end
+
+   # Determines which member of +others+ has the smallest distance vs +a+.
+   #
+   # Because of the implementation, if multiple items from +others+ have
+   # the same distance, the last one will be returned. If this is undesirable,
+   # reverse +others+ before calling #closest_to.
+   #
+   # @param [#&, #+] a A set of attributes
+   # @param [#inject] others A collection of set of attributes
+   #
+   # @return The item from +others+ with the distance minimized to 0.0.
+   #
+   # @example
+   #
+   # a = [1, 2, 3]
+   # b = [1, 3]
+   # c = [1, 2, 3]
+   # Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
+   # # Note that the actual instance returned will be c
+   def self.closest_to(a, others)
+     others.inject([2.0, nil]) do |memo, other|
+       dist = distance(a, other)
+       next memo if memo.first < dist
+
+       [dist, other]
+     end.last
+   end
+
+   # Returns the pair of items whose distance is minimized.
+   #
+   # @param [#each] items A collection of attributes.
+   #
+   # @return [Array<a, b>] A pair of set of attributes whose Jaccard distance is the minimal, given the input set.
+   #
+   # @example
+   #
+   # a = [1, 2, 3]
+   # b = [1, 2]
+   # c = [1, 3]
+   # Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
+   def self.best_match(items)
+     seen = Set.new
+     matches = []
+
+     items.each do |row|
+       items.each do |col|
+         next if row == col
+         next if seen.include?([row, col]) || seen.include?([col, row])
+         seen << [row, col]
+         matches << [distance(row, col), [row, col]]
+       end
+     end
+
+     matches.sort.first.last
+   end
+ end
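Reviewer's note: two behaviours follow directly from the code in the hunk above and may be worth knowing before relying on it. The sketch below copies the arithmetic of `coefficient`, `distance` and `closest_to` into a module with the hypothetical name `JaccardSketch`, so it runs without the gem; it omits the `ArgumentError` guards and is not the packaged file.

```ruby
# Reviewer's sketch: mirrors the logic of the added lib/jaccard.rb under a
# hypothetical module name so it can run without the gem installed.
module JaccardSketch
  def self.coefficient(a, b)
    intersection = a & b
    union = a + b
    union.uniq! if union.respond_to?(:uniq!)
    intersection.length.to_f / union.length.to_f
  end

  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end

  def self.closest_to(a, others)
    others.inject([2.0, nil]) do |memo, other|
      dist = distance(a, other)
      next memo if memo.first < dist

      [dist, other]
    end.last
  end
end

# Two empty collections: 0.0 / 0.0 is NaN in Ruby float arithmetic.
empty = JaccardSketch.coefficient([], [])
puts empty.nan?              # true

# Ties: both candidates share 2 of 3 distinct items with [1, 3],
# so their distances are equal and the later candidate is returned.
first  = [1, 2, 3]
second = [1, 3, 4]
winner = JaccardSketch.closest_to([1, 3], [first, second])
puts winner.equal?(second)   # true
```

Callers that may pass empty collections would need to handle the `NaN` result themselves. The tie behaviour matches the note in the method's own documentation comment.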
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: jaccard
  version: !ruby/object:Gem::Version
-   version: 1.0.1
+   version: 1.0.2
  prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ date: 2012-02-24 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &2153795880 !ruby/object:Gem::Requirement
+   requirement: &2153317540 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
          version: 1.2.9
    type: :development
    prerelease: false
-   version_requirements: *2153795880
+   version_requirements: *2153317540
  - !ruby/object:Gem::Dependency
    name: yard
-   requirement: &2153795400 !ruby/object:Gem::Requirement
+   requirement: &2153317060 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -32,14 +32,18 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2153795400
+   version_requirements: *2153317060
  description: The Jaccard Coefficient Index is a measure of how similar two sets are.
    This library makes calculating the coefficient very easy, and provides useful helpers.
  email: francois@teksol.info
  executables: []
  extensions: []
  extra_rdoc_files: []
- files: []
+ files:
+ - lib/jaccard.rb
+ - README.md
+ - LICENSE
+ - Gemfile
  homepage: http://github.com/francois/jaccard
  licenses: []
  post_install_message:
@@ -60,7 +64,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      version: '0'
  requirements: []
  rubyforge_project:
- rubygems_version: 1.8.6
+ rubygems_version: 1.8.17
  signing_key:
  specification_version: 3
  summary: A library to make calculating the Jaccard Coefficient Index a snap