measurable 0.0.4 → 0.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 24f0ca4dbb60cda53bab68a614a171df7e337434
4
- data.tar.gz: 8675c8a2e723203f287ce4dac3a6e6237fe2b675
3
+ metadata.gz: 66337383a6c25685893bb39f7caf5c7f0b40bcff
4
+ data.tar.gz: 117a22bb28b1d36f14780d3a22bbad7211b279a0
5
5
  SHA512:
6
- metadata.gz: ff4de5c4fbbe64592a16e7980182a76fa7e3960931d401004f9cebd8e439ea17acf0e16b8a465131b80e832fd560fd97f5fd6e1054f43678ded44d730a4e90c3
7
- data.tar.gz: 62396f9fb4208745628848a5447872bb86e407d2b8ec34b15acaae6ac193f8a2c1a63aa68ef69046d5d6080a693c4cfc38bc9f419c697445eb251bb861cc9af4
6
+ metadata.gz: 0d7aff51213d2f0ca31472d1d5f43b3ce99d58255b9d3d53485b0983e953866a365927734c1852c7040f7c23149455712af6357679e03ff63bf247709ac7e124
7
+ data.tar.gz: 652066925d87d7d52656f87346c856d34296e4a98662e94a670d72e9612c568dd95bf758314c524501700b21b53484eaaf059e5b5fee315129fb1b0b90a170bd
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- measurable (0.0.4)
4
+ measurable (0.0.5)
5
5
 
6
6
  GEM
7
7
  remote: http://rubygems.org/
data/README.md CHANGED
@@ -1,25 +1,32 @@
1
1
  # Measurable
2
2
 
3
- This gem encompasses various distance measures. Besides the `Array` class, I also want to support [NMatrix](http://github.com/sciruby/nmatrix)'s `NVector`.
3
+ A gem to test what metric is best for certain kinds of datasets in machine learning.
4
4
 
5
- My objective is to be able to compare different metrics just by changing which method is called. Also, to show how to use NMatrix's C API. I'll create most of the things in pure Ruby first, then the most used operations (or the slowest ones) will be rewritten in C.
5
+ Besides the `Array` class, I also want to support `NVector` (from [NMatrix](http://github.com/sciruby/nmatrix)).
6
6
 
7
- This is a fork of the gem [Distance Measure](https://github.com/reddavis/Distance-Measures), which has a similar objective, but isn't actively maintained and doesn't support NMatrix. Thank you, [reddavis](https://github.com/reddavis). :)
7
+ The distance measures will be created in Ruby first. If I see that it's really too slow, I'll write some methods in C (or Java, for JRuby).
8
+
9
+ This is a fork of the gem [Distance Measure](https://github.com/reddavis/Distance-Measures), which has a similar objective, but isn't actively maintained and doesn't support NMatrix. Thank you, [@reddavis][reddavis]. :)
8
10
 
9
11
  ## Install
10
12
 
11
13
  `gem install measurable`
12
14
 
13
- It only works with Ruby MRI 1.9.3 or 2.0.0. I still want to test it on JRuby, but as its still pure Ruby, it should work correctly there.
15
+ I've only tested it with Ruby 2.0.0 (yes, yes, Travis, I'll do it eventually). I want to support JRuby as well.
16
+
17
+ ## Distance measures
18
+
19
+ I'm using the term "distance measure" without much concern for the strict mathematical definition of a metric. If the documentation for one of the methods isn't clear about whether it is a metric, please open an issue.
14
20
 
15
- ## Distance measures that I want to support for the moment
21
+ The following are the similarity measures supported at the moment:
16
22
 
17
23
  - Euclidean distance
18
24
  - Squared euclidean distance
19
25
  - Cosine distance
20
- - Max-min distance (["K-Means clustering using max-min distance measure"][1])
26
+ - Max-min distance (from ["K-Means clustering using max-min distance measure"][maxmin])
21
27
  - Jaccard distance
22
28
  - Tanimoto distance
29
+ - Haversine distance
23
30
 
24
31
  These still need to be implemented:
25
32
 
@@ -36,30 +43,42 @@ These still need to be implemented:
36
43
 
37
44
  ## How to use
38
45
 
39
- This list will be updated as I have time. I'll refactor the existing measures and add some that I'll need in a project.
40
-
41
46
  The API I intend to support is something like this:
42
47
 
43
48
  ```ruby
44
49
  require "measurable"
45
-
50
+
46
51
  u = NVector.ones(2)
47
52
  v = NVector.zeros(2)
48
53
  w = [1, 0]
49
54
  x = [2, 2]
50
55
 
51
- Measurable::euclidean(u, v) # => 1.41421
52
- Measurable::euclidean(w, v) # => 1.00000
53
- Measurable::euclidean(w, w) # => 0.00000
54
- Measurable::
56
+ # Calculate the distance between two points in space.
57
+ Measurable.euclidean(u, v) # => 1.41421
58
+ Measurable.euclidean(w, v) # => 1.00000
59
+ Measurable.cosine([1, 2], [2, 3]) # => 0.99228
60
+
61
+ # Calculate the squared norm of a vector, i.e. its squared distance from the origin.
62
+ Measurable.euclidean_squared([3, 4]) # => 25
55
63
  ```
56
64
 
57
- Maybe add support for (some of) NMatrix's dtypes, like `:float32`, `:float64`, `:complex64`, `:complex128`, etc. This will have to way until Measurable supports NMatrix C API.
65
+ ## Documentation
66
+
67
+ `RDoc` syntax is used to document the project. To build it locally, you'll need to install the [Fivefish generator](https://github.com/ged/rdoc-generator-fivefish) (`gem install rdoc-generator-fivefish`) and run the following command:
68
+
69
+ ```bash
70
+ rdoc -f fivefish -m README.md *.md LICENSE lib/
71
+ ```
72
+
73
+ I want to be able to use a Rake task to generate the documentation, thus allowing me to forget the specific command. However, there's a bug in `RDoc::Task` in which [custom generators (like Fivefish) can't be used](https://github.com/rdoc/rdoc/issues/246).
74
+
75
+ If there's something wrong with an explanation or if there's information missing, please open an issue or send a pull request.
58
76
 
59
77
  ## License
60
78
 
61
79
  See LICENSE for details.
62
80
 
63
- The original `Distance Measure` gem is copyrighted by @reddavis.
81
+ The original `distance_measures` gem is copyrighted by [@reddavis][reddavis].
64
82
 
65
- [1]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
83
+ [maxmin]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
84
+ [reddavis]: https://github.com/reddavis
data/Rakefile CHANGED
@@ -1,6 +1,7 @@
1
1
  require 'rake'
2
2
  require 'bundler/gem_tasks'
3
3
  require "rspec/core/rake_task"
4
+ # require 'rdoc/task' # See below.
4
5
 
5
6
  # Setup the necessary gems, specified in the gemspec.
6
7
  require 'bundler'
@@ -15,10 +16,22 @@ end
15
16
  # Run all the specs.
16
17
  RSpec::Core::RakeTask.new(:spec)
17
18
 
19
+ # RDoc task isn't working with custom generators, as can be seen in:
20
+ # https://github.com/rdoc/rdoc/issues/246
21
+ #
22
+ # Whenever this issue is fixed, I'll resume using this task.
23
+ #
24
+ # RDoc::Task.new do |rdoc|
25
+ # rdoc.main = "README.md"
26
+ # rdoc.rdoc_files.include("README.md", "LICENSE", "lib")
27
+ # rdoc.generator = "fivefish"
28
+ # rdoc.external = true
29
+ # end
30
+
18
31
  # Compile task.
19
32
  # Rake::ExtensionTask.new do |ext|
20
- # ext.name = 'measurable'
21
- # ext.ext_dir = 'ext/measurable'
33
+ # ext.name = 'measurable'
34
+ # ext.ext_dir = 'ext/measurable'
22
35
  # ext.lib_dir = 'lib/'
23
- # ext.source_pattern = "**/*.{c, cpp, h}"
36
+ # ext.source_pattern = "**/*.{c, cpp, h}"
24
37
  # end
@@ -1,47 +1,16 @@
1
- require 'measurable/version.rb'
1
+ require 'measurable/version'
2
2
 
3
- # Distance measures.
3
+ # Distance measures. The require order is important.
4
4
  require 'measurable/euclidean'
5
5
  require 'measurable/cosine'
6
- require 'measurable/tanimoto'
7
6
  require 'measurable/jaccard'
7
+ require 'measurable/tanimoto'
8
8
  require 'measurable/haversine'
9
9
  require 'measurable/maxmin'
10
10
 
11
11
  module Measurable
12
- # PI = 3.1415926535
13
- RAD_PER_DEG = 0.017453293 # PI/180
14
- class << self
15
- def binary_union(u, v)
16
- unions = []
17
- u.each_with_index do |n, index|
18
- if n == 1 || v[index] == 1
19
- unions << 1
20
- else
21
- unions << 0
22
- end
23
- end
24
-
25
- unions
26
- end
27
-
28
- def binary_intersection(u, v)
29
- intersects = []
30
- u.each_with_index do |n, index|
31
- if n == 1 && v[index] == 1
32
- intersects << 1
33
- else
34
- intersects << 0
35
- end
36
- end
37
-
38
- intersects
39
- end
12
+ # Radians per degree (PI / 180).
13
+ RAD_PER_DEG = Math::PI / 180
40
14
 
41
- # Checks if we"re dealing with NaN"s and will return 0.0 unless
42
- # handle NaN"s is set to false
43
- def handle_nan(result)
44
- result.nan? ? 0.0 : result
45
- end
46
- end
15
+ extend self # expose all instance methods as singleton methods.
47
16
  end
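
The `extend self` line in the hunk above is what allows calls such as `Measurable.euclidean(u, v)`. A minimal sketch of the idiom follows; the `ExtendSelfDemo` module and its method names are hypothetical and are not part of the gem.

```ruby
# Hypothetical module, used only to illustrate `extend self`.
module ExtendSelfDemo
  RAD_PER_DEG = Math::PI / 180

  def to_radians(degrees)
    degrees * RAD_PER_DEG
  end

  extend self # instance methods are now callable on the module itself
end

# Reopening the module later still works, because the module sits in the
# ancestor chain of its own singleton class.
module ExtendSelfDemo
  def to_degrees(radians)
    radians / RAD_PER_DEG
  end
end

puts ExtendSelfDemo.to_radians(180)      # approximately Math::PI
puts ExtendSelfDemo.to_degrees(Math::PI) # approximately 180.0
```

This may be why the other files can define plain instance methods inside `module Measurable` without a `class << self` block.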
@@ -1,10 +1,27 @@
1
1
  module Measurable
2
- class << self
3
- def cosine(u, v)
4
- dot_product = dot(u, v)
5
- normalization = self.euclidean_normalize * other.euclidean_normalize
6
2
 
7
- handle_nan(dot_product / normalization)
8
- end
3
+ # call-seq:
4
+ # cosine(u, v) -> Float
5
+ #
6
+ # Calculate the similarity between the orientation of two vectors.
7
+ #
8
+ # See: http://en.wikipedia.org/wiki/Cosine_similarity
9
+ #
10
+ # * *Arguments* :
11
+ # - +u+ -> An array of Numeric objects.
12
+ # - +v+ -> An array of Numeric objects.
13
+ # * *Returns* :
14
+ # - The normalized dot product of +u+ and +v+, that is, the cosine of the
15
+ # angle between them in n-dimensional space.
16
+ # * *Raises* :
17
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
18
+ #
19
+ def cosine(u, v)
20
+ # TODO: Change this to a more specific, custom-made exception.
21
+ raise ArgumentError if u.size != v.size
22
+
23
+ dot_product = u.zip(v).reduce(0.0) { |acc, ary| acc += ary[0] * ary[1] }
24
+
25
+ dot_product / (euclidean(u) * euclidean(v))
9
26
  end
10
27
  end
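
As a worked check of the cosine hunk above, here is a standalone sketch of the same computation. The name `cosine_sketch` is hypothetical, and `Array#sum` assumes Ruby 2.4 or newer (the gem itself uses `reduce`).

```ruby
# Hypothetical helper, not the gem's method: dot product divided by the
# product of the two Euclidean norms.
def cosine_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  dot    = u.zip(v).sum { |a, b| a * b }.to_f
  norm_u = Math.sqrt(u.sum { |a| a * a })
  norm_v = Math.sqrt(v.sum { |b| b * b })

  dot / (norm_u * norm_v)
end

# 8 / (sqrt(5) * sqrt(13)), the value the cosine spec expects.
puts cosine_sketch([1, 2], [2, 3]) # approximately 0.99228
```

Note that a zero vector makes the denominator zero, so the result is `NaN`. The new version of the method no longer routes through the removed `handle_nan` helper.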
@@ -1,40 +1,76 @@
1
1
  module Measurable
2
- class << self
3
- # Add documentation here!
4
- def euclidean(u, v = nil)
5
- # If the second argument is nil, the method should return the norm of
6
- # vector u. For this, we need the distance between u and the origin.
7
- if v.nil?
8
- v = Array.new(u.size, 0)
9
- end
10
-
11
- # We could make it work with vector of different sizes because of #zip
12
- # but it's unreliable. It's better to just throw an exception.
13
- # TODO: Change this to a more specific, custom-made exception.
14
- raise ArgumentError if u.size != v.size
15
-
16
- sum = u.zip(v).reduce(0.0) do |acc, ary|
17
- acc += (ary[0] - ary[-1])**2
18
- end
19
-
20
- Math.sqrt(sum)
2
+
3
+ # call-seq:
4
+ # euclidean(u) -> Float
5
+ # euclidean(u, v) -> Float
6
+ #
7
+ # Calculate the ordinary distance between arrays +u+ and +v+.
8
+ #
9
+ # If +v+ isn't given, calculate the Euclidean norm of +u+.
10
+ #
11
+ # See: http://en.wikipedia.org/wiki/Euclidean_distance#N_dimensions
12
+ #
13
+ # * *Arguments* :
14
+ # - +u+ -> An array of Numeric objects.
15
+ # - +v+ -> (Optional) An array of Numeric objects.
16
+ # * *Returns* :
17
+ # - The euclidean norm of +u+ or the euclidean distance between +u+ and
18
+ # +v+.
19
+ # * *Raises* :
20
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
21
+ #
22
+ def euclidean(u, v = nil)
23
+ # If the second argument is nil, the method should return the norm of
24
+ # vector u. For this, we need the distance between u and the origin.
25
+ if v.nil?
26
+ v = Array.new(u.size, 0)
21
27
  end
22
-
23
- def euclidean_squared(u, v = nil)
24
- # If the second argument is nil, the method should return the norm of
25
- # vector u. For this, we need the distance between u and the origin.
26
- if v.nil?
27
- v = Array.new(u.size, 0)
28
- end
29
-
30
- # We could make it work with vector of different sizes because of #zip
31
- # but it's unreliable. It's better to just throw an exception.
32
- # TODO: Change this to a more specific, custom-made exception.
33
- raise ArgumentError if u.size != v.size
34
-
35
- u.zip(v).reduce(0.0) do |acc, ary|
36
- acc += (ary[0] - ary[-1])**2
37
- end
28
+
29
+ # TODO: Change this to a more specific, custom-made exception.
30
+ raise ArgumentError if u.size != v.size
31
+
32
+ sum = u.zip(v).reduce(0.0) do |acc, ary|
33
+ acc += (ary[0] - ary[-1]) ** 2
34
+ end
35
+
36
+ Math.sqrt(sum)
37
+ end
38
+
39
+ # call-seq:
40
+ # euclidean_squared(u) -> Float
41
+ # euclidean_squared(u, v) -> Float
42
+ #
43
+ # Calculate the same value as euclidean(u, v), but don't take the square root
44
+ # of it.
45
+ #
46
+ # This isn't a metric in the strict sense, i.e. it doesn't respect the
47
+ # triangle inequality. However, the squared Euclidean distance is very useful
48
+ # whenever only the relative values of distances are important, for example
49
+ # in optimization problems.
50
+ #
51
+ # See: http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance
52
+ #
53
+ # * *Arguments* :
54
+ # - +u+ -> An array of Numeric objects.
55
+ # - +v+ -> (Optional) An array of Numeric objects.
56
+ # * *Returns* :
57
+ # - The squared value of the euclidean norm of +u+ or of the euclidean
58
+ # distance between +u+ and +v+.
59
+ # * *Raises* :
60
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
61
+ #
62
+ def euclidean_squared(u, v = nil)
63
+ # If the second argument is nil, the method should return the norm of
64
+ # vector u. For this, we need the distance between u and the origin.
65
+ if v.nil?
66
+ v = Array.new(u.size, 0)
67
+ end
68
+
69
+ # TODO: Change this to a more specific, custom-made exception.
70
+ raise ArgumentError if u.size != v.size
71
+
72
+ u.zip(v).reduce(0.0) do |acc, ary|
73
+ acc += (ary[0] - ary[-1]) ** 2
38
74
  end
39
75
  end
40
76
  end
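
The documentation above says the squared Euclidean distance breaks the triangle inequality but keeps the relative order of distances. A small sketch of both claims follows; the helper names are hypothetical, and `Array#sum` assumes Ruby 2.4 or newer.

```ruby
# Hypothetical helpers, not the gem's methods.
def squared_distance_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  u.zip(v).sum { |a, b| (a - b)**2 }
end

def distance_sketch(u, v)
  Math.sqrt(squared_distance_sketch(u, v))
end

# Three points on a line: 0, 1 and 2.
a, b, c = [0], [1], [2]

# Plain distance: d(a, c) = 2 <= d(a, b) + d(b, c) = 2, so the inequality holds.
puts distance_sketch(a, c) <= distance_sketch(a, b) + distance_sketch(b, c) # true

# Squared distance: 4 > 1 + 1, so the triangle inequality is violated.
puts squared_distance_sketch(a, c) <= squared_distance_sketch(a, b) + squared_distance_sketch(b, c) # false

# The ordering of candidates is the same under both measures.
points = [[4, 5, 6], [1, 4, 16], [1, 3, 15]]
origin = [1, 3, 16]
puts points.min_by { |p| squared_distance_sketch(origin, p) } ==
     points.min_by { |p| distance_sketch(origin, p) } # true
```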
@@ -1,46 +1,71 @@
1
- # Notes:
2
- #
3
- # translated into Ruby based on information contained in:
4
- # http://mathforum.org/library/drmath/view/51879.html
5
- # Dr. Rick and Dr. Peterson - 4/20/99
6
- #
7
- # http://www.movable-type.co.uk/scripts/latlong.html
8
- # http://en.wikipedia.org/wiki/Haversine_formula
9
- #
10
- # This formula can compute accurate distances between two points given latitude
11
- # and longitude, even for short distances.
12
-
13
1
  module Measurable
14
2
 
15
- R_MILES = 3956 # radius of the great circle in miles
16
- R_KM = 6371 # radius in kilometers...some algorithms use 6367
17
-
18
- # the great circle distance d will be in whatever units R is in
19
- R = {
20
- :miles => R_MILES,
21
- :km => R_KM,
22
- :feet => R_MILES * 5282,
23
- :meters => R_KM * 1000
3
+ # Earth radius in miles.
4
+ EARTH_RADIUS_IN_MILES = 3956
5
+
6
+ # Earth radius in kilometers. Some algorithms use 6367.
7
+ EARTH_RADIUS_IN_KILOMETERS = 6371
8
+
9
+ # The great circle distance returned will be in whatever unit the chosen
10
+ # radius is in. Maps each supported unit to the Earth's radius in that unit.
11
+ EARTH_RADIUS = {
12
+ :miles => EARTH_RADIUS_IN_MILES,
13
+ :km => EARTH_RADIUS_IN_KILOMETERS,
14
+ :feet => EARTH_RADIUS_IN_MILES * 5280,
15
+ :meters => EARTH_RADIUS_IN_KILOMETERS * 1000
24
16
  }
25
17
 
26
- class << self
27
- def haversine(u, v, um = :meters)
28
- dlon = u[1] - v[1]
29
- dlat = u[0] - v[0]
18
+ # call-seq:
19
+ # haversine(u, v, unit = :meters) -> Float
20
+ #
21
+ # Compute accurate distances between two points given their latitudes and
22
+ # longitudes, even for short distances. This isn't a distance measure in the
23
+ # same sense as the other methods in +Measurable+.
24
+ #
25
+ # The distance returned is the great circle (or orthodromic) distance between
26
+ # +u+ and +v+, which is the shortest distance between them on the surface of
27
+ # a sphere. Thus, this implementation considers the Earth to be a sphere.
28
+ #
29
+ # Remember that the input vectors are of the form [latitude, longitude] in
30
+ # degrees, so if you have the coordinates [23 32' S, 46 37' W] (from São
31
+ # Paulo), the corresponding vector is [-23.53333, -46.61667].
32
+ #
33
+ # References:
34
+ # - http://www.movable-type.co.uk/scripts/latlong.html
35
+ # - http://en.wikipedia.org/wiki/Haversine_formula
36
+ # - http://en.wikipedia.org/wiki/Great-circle_distance
37
+ #
38
+ # * *Arguments* :
39
+ # - +u+ -> An array of Numeric objects.
40
+ # - +v+ -> An array of Numeric objects.
41
+ # - +unit+ -> (Optional) A Symbol representing the unit of measure. Available
42
+ # options are +:miles+, +:feet+, +:km+ and +:meters+.
43
+ # * *Returns* :
44
+ # - The great circle distance between +u+ and +v+.
45
+ # * *Raises* :
46
+ # - +ArgumentError+ -> The size of +u+ and +v+ must be 2.
47
+ # - +ArgumentError+ -> +unit+ must be a Symbol.
48
+ #
49
+ def haversine(u, v, unit = :meters)
50
+ # TODO: Create better exceptions.
51
+ raise ArgumentError if u.size != 2 || v.size != 2
52
+ raise ArgumentError if unit.class != Symbol
53
+
54
+ dlat = u[0] - v[0]
55
+ dlon = u[1] - v[1]
30
56
 
31
- dlon_rad = dlon * RAD_PER_DEG
32
- dlat_rad = dlat * RAD_PER_DEG
57
+ dlon_rad = dlon * RAD_PER_DEG
58
+ dlat_rad = dlat * RAD_PER_DEG
33
59
 
34
- lat1_rad = v[0] * RAD_PER_DEG
35
- lon1_rad = v[1] * RAD_PER_DEG
60
+ lat1_rad = v[0] * RAD_PER_DEG
61
+ lon1_rad = v[1] * RAD_PER_DEG
36
62
 
37
- lat2_rad = u[0] * RAD_PER_DEG
38
- lon2_rad = u[1] * RAD_PER_DEG
63
+ lat2_rad = u[0] * RAD_PER_DEG
64
+ lon2_rad = u[1] * RAD_PER_DEG
39
65
 
40
- a = (Math.sin(dlat_rad/2))**2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * (Math.sin(dlon_rad/2))**2
41
- c = 2 * Math.atan2( Math.sqrt(a), Math.sqrt(1-a))
66
+ a = (Math.sin(dlat_rad / 2)) ** 2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * (Math.sin(dlon_rad / 2)) ** 2
67
+ c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
42
68
 
43
- R[um] * c
44
- end
69
+ EARTH_RADIUS[unit] * c
45
70
  end
46
71
  end
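
The haversine hunk above can be condensed into the following standalone sketch. The constant and method names are hypothetical, and the clamp on `a` is an addition of this sketch, not something the gem does.

```ruby
# Hypothetical names; a spherical Earth with a 6371 km radius is assumed,
# as in the hunk above.
SKETCH_RAD_PER_DEG     = Math::PI / 180
SKETCH_EARTH_RADIUS_KM = 6371

def great_circle_km_sketch(u, v)
  raise ArgumentError, "expected [latitude, longitude] pairs" if u.size != 2 || v.size != 2

  lat1, lon1 = u.map { |deg| deg * SKETCH_RAD_PER_DEG }
  lat2, lon2 = v.map { |deg| deg * SKETCH_RAD_PER_DEG }

  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  a = [a, 1.0].min # guard: rounding could push a slightly above 1

  2 * SKETCH_EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
end

tokyo     = [35.66667, 139.75]    # 35 40' N, 139 45' E
sao_paulo = [-23.53333, -46.61667] # 23 32' S, 46 37' W
puts great_circle_km_sketch(tokyo, sao_paulo).round
```

For these two cities the result should land close to the 18533 km that the haversine spec in this release expects. Two antipodal points on the equator give half the circumference, `6371 * Math::PI` km.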
@@ -1,26 +1,69 @@
1
- # http://en.wikipedia.org/wiki/Jaccard_coefficient
2
1
  module Measurable
3
- class << self
4
- def jaccard(u, v)
5
- 1 - jaccard_index(u, v)
2
+
3
+ # call-seq:
4
+ # jaccard_index(u, v) -> Float
5
+ #
6
+ # Give the similarity between two binary vectors +u+ and +v+. Calculated as:
7
+ # jaccard_index = |intersection| / |union|
8
+ #
9
+ # In which intersection and union refer to +u+ and +v+ and |x| is the
10
+ # cardinality of set x.
11
+ #
12
+ # For example:
13
+ # jaccard_index([1, 0, 1], [1, 1, 1]) == 0.666...
14
+ #
15
+ # Because |intersection| = |(1, 0, 1)| = 2 and |union| = |(1, 1, 1)| = 3.
16
+ #
17
+ # See: http://en.wikipedia.org/wiki/Jaccard_coefficient
18
+ #
19
+ # * *Arguments* :
20
+ # - +u+ -> Array of 1s and 0s.
21
+ # - +v+ -> Array of 1s and 0s.
22
+ # * *Returns* :
23
+ # - Float value representing the Jaccard similarity coefficient between
24
+ # +u+ and +v+.
25
+ # * *Raises* :
26
+ # - +ArgumentError+ -> The size of the input arrays doesn't match.
27
+ #
28
+ def jaccard_index(u, v)
29
+ # TODO: Change this to a more specific, custom-made exception.
30
+ raise ArgumentError if u.size != v.size
31
+
32
+ intersection = u.zip(v).reduce(0) do |acc, elem|
33
+ # Both u and v must have this element.
34
+ elem[0] + elem[1] == 2 ? (acc + 1) : acc
6
35
  end
7
-
8
- def jaccard_index(u, v)
9
- union = (u | v).size.to_f
10
- intersection = (u & v).size.to_f
11
-
12
- intersection / union
13
- end
14
-
15
- def binary_jaccard(u, v)
16
- 1 - binary_jaccard_index(u, v)
17
- end
18
-
19
- def binary_jaccard_index(u, v)
20
- intersection = binary_intersection(u, v).delete_if {|x| x == 0}.size.to_f
21
- union = binary_union(u, v).delete_if {|x| x == 0}.size.to_f
22
-
23
- intersection / union
36
+
37
+ union = u.zip(v).reduce(0) do |acc, elem|
38
+ # One of u and v must have this element.
39
+ elem[0] + elem[1] >= 1 ? (acc + 1) : acc
24
40
  end
41
+
42
+ intersection.to_f / union
43
+ end
44
+
45
+ # call-seq:
46
+ # jaccard(u, v) -> Float
47
+ #
48
+ # The jaccard distance is a measure of dissimilarity between two sets. It is
49
+ # calculated as:
50
+ # jaccard_distance = 1 - jaccard_index
51
+ #
52
+ # This is a proper metric, i.e. the following conditions hold:
53
+ # - Symmetry: jaccard(u, v) == jaccard(v, u)
54
+ # - Non-negative: jaccard(u, v) >= 0
55
+ # - Coincidence axiom: jaccard(u, v) == 0 if and only if u == v
56
+ # - Triangle inequality: jaccard(u, v) <= jaccard(u, w) + jaccard(w, v)
57
+ #
58
+ # * *Arguments* :
59
+ # - +u+ -> Array of 1s and 0s.
60
+ # - +v+ -> Array of 1s and 0s.
61
+ # * *Returns* :
62
+ # - Float value representing the dissimilarity between +u+ and +v+.
63
+ # * *Raises* :
64
+ # - +ArgumentError+ -> The size of the input arrays doesn't match.
65
+ #
66
+ def jaccard(u, v)
67
+ 1 - jaccard_index(u, v)
25
68
  end
26
69
  end
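
The worked example in the Jaccard documentation above (`[1, 0, 1]` against `[1, 1, 1]`) can be reproduced with a short sketch. The helper names are hypothetical, and the inputs are assumed to be arrays of 0s and 1s.

```ruby
# Hypothetical helpers, not the gem's methods.
def jaccard_index_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs        = u.zip(v)
  intersection = pairs.count { |a, b| a == 1 && b == 1 } # set in both
  union        = pairs.count { |a, b| a == 1 || b == 1 } # set in either

  intersection.to_f / union
end

def jaccard_distance_sketch(u, v)
  1 - jaccard_index_sketch(u, v)
end

u = [1, 0, 1]
v = [1, 1, 1]
puts jaccard_index_sketch(u, v)    # |intersection| = 2, |union| = 3
puts jaccard_distance_sketch(u, v) # one minus the index above
```

When both vectors are all zeros the union is empty, so the index is `0.0 / 0`, which Ruby evaluates to `NaN`. Callers may want to handle that case explicitly.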
@@ -1,13 +1,34 @@
1
1
  module Measurable
2
- class << self
3
- def maxmin(u, v)
4
- sum_min, sum_max = u.zip(v).reduce([0.0, 0.0]) do |acc, attributes|
5
- acc[0] += attributes.min
6
- acc[-1] += attributes.max
7
- acc
8
- end
9
-
10
- sum_min / sum_max
2
+
3
+ # call-seq:
4
+ # maxmin(u, v) -> Float
5
+ #
6
+ # The "Max-min distance" is used to measure similarity between two vectors.
7
+ #
8
+ # When used in k-means clustering, this similarity measure can give better
9
+ # results in some datasets, as pointed out in the paper "K-means clustering
10
+ # using Max-min distance measure" --- Visalakshi, N. K.; Suguna, J.
11
+ #
12
+ # See: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
13
+ #
14
+ # * *Arguments* :
15
+ # - +u+ -> An array of Numeric objects.
16
+ # - +v+ -> An array of Numeric objects.
17
+ # * *Returns* :
18
+ # - Similarity between +u+ and +v+.
19
+ # * *Raises* :
20
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
21
+ #
22
+ def maxmin(u, v)
23
+ # TODO: Change this to a more specific, custom-made exception.
24
+ raise ArgumentError if u.size != v.size
25
+
26
+ sum_min, sum_max = u.zip(v).reduce([0.0, 0.0]) do |acc, attributes|
27
+ acc[0] += attributes.min
28
+ acc[1] += attributes.max
29
+ acc
11
30
  end
31
+
32
+ sum_min / sum_max
12
33
  end
13
34
  end
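
To make the max-min computation above concrete, here is a sketch using the vectors from the spec. The name `maxmin_sketch` is hypothetical, and `Array#sum` assumes Ruby 2.4 or newer.

```ruby
# Hypothetical helper, not the gem's method: sum of the element-wise minima
# divided by the sum of the element-wise maxima.
def maxmin_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs = u.zip(v)
  pairs.sum { |pair| pair.min }.to_f / pairs.sum { |pair| pair.max }
end

# Minima: 1 + 3 + 16 = 20. Maxima: 1 + 4 + 16 = 21.
puts maxmin_sketch([1, 3, 16], [1, 4, 16]) # 20 / 21, approximately 0.95238
```

For non-negative inputs, identical vectors give exactly 1.0, so larger values mean more similar vectors. This is consistent with the documentation calling it a similarity measure.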
@@ -1,11 +1,32 @@
1
- # http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29
2
1
  module Measurable
3
- class << self
4
- def tanimoto(u, v)
5
- dot = dot(u, v).to_f
6
- result = dot / (u.sum_of_squares + v.sum_of_squares - dot).to_f
7
-
8
- handle_nan(result)
9
- end
2
+
3
+ # Tanimoto similarity is the same as Jaccard similarity.
4
+ alias :tanimoto_similarity :jaccard_index
5
+
6
+ # call-seq:
7
+ # tanimoto(u, v) -> Float
8
+ #
9
+ # Tanimoto distance is a coefficient explicitly chosen such as to allow for
10
+ # two dissimilar specimens to be similar to a third one. This breaks the
11
+ # triangle inequality, thus this isn't a metric.
12
+ #
13
+ # More information and references on this are needed. It's left here mostly
14
+ # as a piece of curiosity.
15
+ #
16
+ # See: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto.27s_Definitions_of_Similarity_and_Distance
17
+ #
18
+ # * *Arguments* :
19
+ # - +u+ -> An array of Numeric objects.
20
+ # - +v+ -> An array of Numeric objects.
21
+ # * *Returns* :
22
+ # - A measure of the similarity between +u+ and +v+.
23
+ # * *Raises* :
24
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
25
+ #
26
+ def tanimoto(u, v)
27
+ # TODO: Change this to a more specific, custom-made exception.
28
+ raise ArgumentError if u.size != v.size
29
+
30
+ -Math.log2(jaccard_index(u, v))
10
31
  end
11
32
  end
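
The Tanimoto hunk above computes the negative base-2 logarithm of the Jaccard index. A standalone sketch of that formula follows; the helper names are hypothetical, and the inputs are assumed to be arrays of 0s and 1s.

```ruby
# Hypothetical helpers, not the gem's methods.
def jaccard_index_for_tanimoto_sketch(u, v)
  raise ArgumentError, "size mismatch" if u.size != v.size

  pairs = u.zip(v)
  pairs.count { |a, b| a == 1 && b == 1 }.to_f / pairs.count { |a, b| a == 1 || b == 1 }
end

def tanimoto_sketch(u, v)
  -Math.log2(jaccard_index_for_tanimoto_sketch(u, v))
end

puts tanimoto_sketch([1, 0, 1], [1, 1, 1]) # -log2(2/3), approximately 0.58496
puts tanimoto_sketch([1, 0], [0, 1])       # Infinity: the index is 0
```

Identical vectors have an index of 1 and therefore a distance of zero, while vectors with no shared element are infinitely far apart under this definition.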
@@ -1,3 +1,3 @@
1
1
  module Measurable
2
- VERSION = "0.0.4"
2
+ VERSION = "0.0.5" # :nodoc:
3
3
  end
@@ -0,0 +1,29 @@
1
+ describe "Cosine distance" do
2
+
3
+ before :all do
4
+ @u = [1, 2]
5
+ @v = [2, 3]
6
+ @w = [4, 5]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.cosine(@u, @v) }.to_not raise_error
11
+ expect { Measurable.cosine(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.cosine(@u, @v)
16
+ y = Measurable.cosine(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.cosine(@u, @v)
23
+ x.should be_within(TOLERANCE).of(0.992277877)
24
+ end
25
+
26
+ it "shouldn't work with vectors of different length" do
27
+ expect { Measurable.cosine(@u, [1, 3, 5, 7]) }.to raise_error
28
+ end
29
+ end
@@ -0,0 +1,61 @@
1
+ describe "Euclidean" do
2
+
3
+ before :all do
4
+ @u = [1, 3, 16]
5
+ @v = [1, 4, 16]
6
+ @w = [4, 5, 6]
7
+ end
8
+
9
+ context "Distance" do
10
+ it "accepts two arguments" do
11
+ expect { Measurable.euclidean(@u, @v) }.to_not raise_error
12
+ expect { Measurable.euclidean(@u, @v, @w) }.to raise_error(ArgumentError)
13
+ end
14
+
15
+ it "accepts one argument and returns the vector's norm" do
16
+ # Remember that 3^2 + 4^2 = 5^2.
17
+ Measurable.euclidean([3, 4]).should == 5
18
+ end
19
+
20
+ it "should be symmetric" do
21
+ Measurable.euclidean(@u, @v).should == Measurable.euclidean(@v, @u)
22
+ end
23
+
24
+ it "should return the correct value" do
25
+ Measurable.euclidean(@u, @u).should == 0
26
+ Measurable.euclidean(@u, @v).should == 1
27
+ end
28
+
29
+ it "shouldn't work with vectors of different length" do
30
+ expect { Measurable.euclidean(@u, [2, 2, 2, 2]) }.to raise_error
31
+ end
32
+ end
33
+
34
+ context "Squared Distance" do
35
+ it "accepts two arguments" do
36
+ expect { Measurable.euclidean_squared(@u, @v) }.to_not raise_error
37
+ expect { Measurable.euclidean_squared(@u, @v, @w) }.to raise_error(ArgumentError)
38
+ end
39
+
40
+ it "accepts one argument and returns the vector's norm" do
41
+ # Remember that 3^2 + 4^2 = 5^2.
42
+ Measurable.euclidean_squared([3, 4]).should == 25
43
+ end
44
+
45
+ it "should be symmetric" do
46
+ x = Measurable.euclidean_squared(@u, @v)
47
+ y = Measurable.euclidean_squared(@v, @u)
48
+
49
+ x.should == y
50
+ end
51
+
52
+ it "should return the correct value" do
53
+ Measurable.euclidean_squared(@u, @u).should == 0
54
+ Measurable.euclidean_squared(@u, @v).should == 1
55
+ end
56
+
57
+ it "shouldn't work with vectors of different length" do
58
+ expect { Measurable.euclidean_squared(@u, [2, 2, 2, 2]) }.to raise_error
59
+ end
60
+ end
61
+ end
@@ -0,0 +1,37 @@
1
+ describe "Haversine distance" do
2
+
3
+ before :all do
4
+ # We have very big errors in this formula, due to:
5
+ # - The Earth is considered a sphere.
6
+ # - Earth's radius is considered constant (same as above).
7
+ #
8
+ # Given these conditions, I'll just assume the error to be less than 1.
9
+ # TODO: Calculate better error estimates.
10
+ @haversine_tolerance = 1
11
+
12
+ @u = [ 35.66667, 139.75] # Tokyo: 35 40' N, 139 45' E.
13
+ @v = [-23.53333, -46.61667] # São Paulo: 23 32' S, 46 37' W.
14
+ end
15
+
16
+ it "accepts two arguments" do
17
+ expect { Measurable.haversine(@u, @v) }.to_not raise_error
18
+ expect { Measurable.haversine(@u, @v, [-24.5, 40.23]) }.to raise_error(ArgumentError)
19
+ end
20
+
21
+ it "should be symmetric" do
22
+ x = Measurable.haversine(@u, @v)
23
+ y = Measurable.haversine(@v, @u)
24
+
25
+ x.should be_within(TOLERANCE).of(y)
26
+ end
27
+
28
+ it "should return the correct value" do
29
+ x = Measurable.haversine(@u, @v, :km)
30
+
31
+ x.should be_within(@haversine_tolerance).of(18533)
32
+ end
33
+
34
+ it "should only work with [lat, long] vectors" do
35
+ expect { Measurable.haversine([2, 4], [1, 3, 5, 7]) }.to raise_error
36
+ end
37
+ end
@@ -0,0 +1,62 @@
1
+ describe "Jaccard" do
2
+
3
+ context "Index" do
4
+ before :all do
5
+ @u = [1, 0, 1]
6
+ @v = [1, 1, 1]
7
+ @w = [0, 1, 0]
8
+ end
9
+
10
+ it "accepts two arguments" do
11
+ expect { Measurable.jaccard_index(@u, @v) }.to_not raise_error
12
+ expect { Measurable.jaccard_index(@u, @v, @w) }.to raise_error(ArgumentError)
13
+ end
14
+
15
+ it "should be symmetric" do
16
+ x = Measurable.jaccard_index(@u, @v)
17
+ y = Measurable.jaccard_index(@v, @u)
18
+
19
+ x.should be_within(TOLERANCE).of(y)
20
+ end
21
+
22
+ it "should return the correct value" do
23
+ x = Measurable.jaccard_index(@u, @v)
24
+
25
+ x.should be_within(TOLERANCE).of(2.0 / 3.0)
26
+ end
27
+
28
+ it "shouldn't work with vectors of different length" do
29
+ expect { Measurable.jaccard_index(@u, [1, 2, 3, 4]) }.to raise_error
30
+ end
31
+ end
32
+
33
+ context "Distance" do
34
+ before :all do
35
+ @u = [1, 0, 1]
36
+ @v = [1, 1, 1]
37
+ @w = [0, 1, 0]
38
+ end
39
+
40
+ it "accepts two arguments" do
41
+ expect { Measurable.jaccard(@u, @v) }.to_not raise_error
42
+ expect { Measurable.jaccard(@u, @v, @w) }.to raise_error(ArgumentError)
43
+ end
44
+
45
+ it "should be symmetric" do
46
+ x = Measurable.jaccard(@u, @v)
47
+ y = Measurable.jaccard(@v, @u)
48
+
49
+ x.should be_within(TOLERANCE).of(y)
50
+ end
51
+
52
+ it "should return the correct value" do
53
+ x = Measurable.jaccard(@u, @v)
54
+
55
+ x.should be_within(TOLERANCE).of(1.0 / 3.0)
56
+ end
57
+
58
+ it "shouldn't work with vectors of different length" do
59
+ expect { Measurable.jaccard(@u, [1, 2, 3, 4]) }.to raise_error
60
+ end
61
+ end
62
+ end
@@ -0,0 +1,30 @@
1
+ describe "Max-min distance" do
2
+
3
+ before :all do
4
+ @u = [1, 3, 16]
5
+ @v = [1, 4, 16]
6
+ @w = [4, 5, 6]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.maxmin(@u, @v) }.to_not raise_error
11
+ expect { Measurable.maxmin(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.maxmin(@u, @v)
16
+ y = Measurable.maxmin(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.maxmin(@u, @v)
23
+
24
+ x.should be_within(TOLERANCE).of(0.9523809523)
25
+ end
26
+
27
+ it "shouldn't work with vectors of different length" do
28
+ expect { Measurable.maxmin(@u, [1, 3, 5, 7]) }.to raise_error
29
+ end
30
+ end
@@ -2,3 +2,5 @@ $LOAD_PATH.unshift(File.dirname(__FILE__))
2
2
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
3
3
 
4
4
  require 'measurable'
5
+
6
+ TOLERANCE = 10e-9
@@ -0,0 +1,30 @@
1
+ describe "Tanimoto distance" do
2
+
3
+ before :all do
4
+ @u = [1, 0, 1]
5
+ @v = [1, 1, 1]
6
+ @w = [0, 1, 0]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.tanimoto(@u, @v) }.to_not raise_error
11
+ expect { Measurable.tanimoto(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.tanimoto(@u, @v)
16
+ y = Measurable.tanimoto(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.tanimoto(@u, @v)
23
+
24
+ x.should be_within(TOLERANCE).of(-Math.log2(2.0 / 3.0))
25
+ end
26
+
27
+ it "shouldn't work with vectors of different length" do
28
+ expect { Measurable.tanimoto(@u, [1, 3, 5, 7]) }.to raise_error
29
+ end
30
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: measurable
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.4
4
+ version: 0.0.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Carlos Agarie
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2013-03-24 00:00:00.000000000 Z
11
+ date: 2013-07-24 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -74,8 +74,13 @@ files:
74
74
  - lib/measurable/tanimoto.rb
75
75
  - lib/measurable/version.rb
76
76
  - measurable.gemspec
77
- - spec/measurable_spec.rb
77
+ - spec/cosine_spec.rb
78
+ - spec/euclidean_spec.rb
79
+ - spec/haversine_spec.rb
80
+ - spec/jaccard_spec.rb
81
+ - spec/maxmin_spec.rb
78
82
  - spec/spec_helper.rb
83
+ - spec/tanimoto_spec.rb
79
84
  homepage: http://github.com/agarie/measurable
80
85
  licenses: []
81
86
  metadata: {}
@@ -95,10 +100,15 @@ required_rubygems_version: !ruby/object:Gem::Requirement
95
100
  version: '0'
96
101
  requirements: []
97
102
  rubyforge_project:
98
- rubygems_version: 2.0.0
103
+ rubygems_version: 2.0.3
99
104
  signing_key:
100
105
  specification_version: 4
101
106
  summary: A Ruby gem with a lot of distance measures for your projects.
102
107
  test_files:
103
- - spec/measurable_spec.rb
108
+ - spec/cosine_spec.rb
109
+ - spec/euclidean_spec.rb
110
+ - spec/haversine_spec.rb
111
+ - spec/jaccard_spec.rb
112
+ - spec/maxmin_spec.rb
104
113
  - spec/spec_helper.rb
114
+ - spec/tanimoto_spec.rb
@@ -1,159 +0,0 @@
1
- describe Measurable do
2
-
3
- describe "Binary union" do
4
-
5
- end
6
-
7
- describe "Binary intersection" do
8
-
9
- end
10
-
11
- describe "Euclidean" do
12
-
13
- before :all do
14
- @u = [1, 3, 16]
15
- @v = [1, 4, 16]
16
- @w = [4, 5, 6]
17
- end
18
-
19
- context "Distance" do
20
- it "accepts two arguments" do
21
- expect { Measurable.euclidean(@u, @v) }.to_not raise_error
22
- expect { Measurable.euclidean(@u, @v, @w) }.to raise_error(ArgumentError)
23
- end
24
-
25
- it "accepts one argument and returns the vector's norm" do
26
- # Remember that 3^2 + 4^2 = 5^2.
27
- Measurable.euclidean([3, 4]).should == 5
28
- end
29
-
30
- it "should be symmetric" do
31
- Measurable.euclidean(@u, @v).should == Measurable.euclidean(@v, @u)
32
- end
33
-
34
- it "should return the correct value" do
35
- Measurable.euclidean(@u, @u).should == 0
36
- Measurable.euclidean(@u, @v).should == 1
37
- end
38
-
39
- it "shouldn't work with vectors of different length" do
40
- expect { Measurable.euclidean(@u, [2, 2, 2, 2]) }.to raise_error
41
- end
42
- end
43
-
44
- context "Squared Distance" do
45
- it "accepts two arguments" do
46
- expect { Measurable.euclidean_squared(@u, @v) }.to_not raise_error
47
- expect { Measurable.euclidean_squared(@u, @v, @w) }.to raise_error(ArgumentError)
48
- end
49
-
50
- it "accepts one argument and returns the vector's norm" do
51
- # Remember that 3^2 + 4^2 = 5^2.
52
- Measurable.euclidean_squared([3, 4]).should == 25
53
- end
54
-
55
- it "should be symmetric" do
56
- x = Measurable.euclidean_squared(@u, @v)
57
- y = Measurable.euclidean_squared(@v, @u)
58
-
59
- x.should == y
60
- end
61
-
62
- it "should return the correct value" do
63
- Measurable.euclidean_squared(@u, @u).should == 0
64
- Measurable.euclidean_squared(@u, @v).should == 1
65
- end
66
-
67
- it "shouldn't work with vectors of different length" do
68
- expect { Measurable.euclidean_squared(@u, [2, 2, 2, 2]) }.to raise_error
69
- end
70
- end
71
-
72
- end
73
-
74
- describe "Cosine distance" do
75
- it "accepts two arguments"
76
-
77
- it "accepts one argument and returns the vector's norm"
78
-
79
- it "should handle NaN's"
80
-
81
- it "should be symmetric"
82
-
83
- it "should return the correct value"
84
-
85
- it "shouldn't work with vectors of different length"
86
- end
87
-
88
- describe "Chebyshev distance" do
89
- it "accepts two arguments"
90
-
91
- it "accepts one argument and returns the vector's norm"
92
-
93
- it "should be symmetric"
94
-
95
- it "should return the correct value"
96
-
97
- it "shouldn't work with vectors of different length"
98
- end
99
-
100
- describe "Tanimoto distance" do
101
- it "accepts two arguments"
102
-
103
- it "accepts one argument and returns the vector's norm"
104
-
105
- it "should be symmetric"
106
-
107
- it "should return the correct value"
108
-
109
- it "shouldn't work with vectors of different length"
110
- end
111
-
112
- describe "Haversine distance" do
113
- it "accepts two arguments"
114
-
115
- it "accepts one argument and returns the vector's norm"
116
-
117
- it "should be symmetric"
118
-
119
- it "should return the correct value"
120
-
121
- it "shouldn't work with vectors of different length"
122
- end
123
-
124
- describe "Jaccard distance" do
125
- it "accepts two arguments"
126
-
127
- it "accepts one argument and returns the vector's norm"
128
-
129
- it "should be symmetric"
130
-
131
- it "should return the correct value"
132
-
133
- it "shouldn't work with vectors of different length"
134
- end
135
-
136
- describe "Binary Jaccard distance" do
137
- it "accepts two arguments"
138
-
139
- it "accepts one argument and returns the vector's norm"
140
-
141
- it "should be symmetric"
142
-
143
- it "should return the correct value"
144
-
145
- it "shouldn't work with vectors of different length"
146
- end
147
-
148
- describe "Max-min distance" do
149
- it "accepts two arguments"
150
-
151
- it "accepts one argument and returns the vector's norm"
152
-
153
- it "should be symmetric"
154
-
155
- it "should return the correct value"
156
-
157
- it "shouldn't work with vectors of different length"
158
- end
159
- end