measurable 0.0.4 → 0.0.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 24f0ca4dbb60cda53bab68a614a171df7e337434
4
- data.tar.gz: 8675c8a2e723203f287ce4dac3a6e6237fe2b675
3
+ metadata.gz: 66337383a6c25685893bb39f7caf5c7f0b40bcff
4
+ data.tar.gz: 117a22bb28b1d36f14780d3a22bbad7211b279a0
5
5
  SHA512:
6
- metadata.gz: ff4de5c4fbbe64592a16e7980182a76fa7e3960931d401004f9cebd8e439ea17acf0e16b8a465131b80e832fd560fd97f5fd6e1054f43678ded44d730a4e90c3
7
- data.tar.gz: 62396f9fb4208745628848a5447872bb86e407d2b8ec34b15acaae6ac193f8a2c1a63aa68ef69046d5d6080a693c4cfc38bc9f419c697445eb251bb861cc9af4
6
+ metadata.gz: 0d7aff51213d2f0ca31472d1d5f43b3ce99d58255b9d3d53485b0983e953866a365927734c1852c7040f7c23149455712af6357679e03ff63bf247709ac7e124
7
+ data.tar.gz: 652066925d87d7d52656f87346c856d34296e4a98662e94a670d72e9612c568dd95bf758314c524501700b21b53484eaaf059e5b5fee315129fb1b0b90a170bd
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- measurable (0.0.4)
4
+ measurable (0.0.5)
5
5
 
6
6
  GEM
7
7
  remote: http://rubygems.org/
data/README.md CHANGED
@@ -1,25 +1,32 @@
1
1
  # Measurable
2
2
 
3
- This gem encompasses various distance measures. Besides the `Array` class, I also want to support [NMatrix](http://github.com/sciruby/nmatrix)'s `NVector`.
3
+ A gem to test what metric is best for certain kinds of datasets in machine learning.
4
4
 
5
- My objective is to be able to compare different metrics just by changing which method is called. Also, to show how to use NMatrix's C API. I'll create most of the things in pure Ruby first, then the most used operations (or the slowest ones) will be rewritten in C.
5
+ Besides the `Array` class, I also want to support `NVector` (from [NMatrix](http://github.com/sciruby/nmatrix)).
6
6
 
7
- This is a fork of the gem [Distance Measure](https://github.com/reddavis/Distance-Measures), which has a similar objective, but isn't actively maintained and doesn't support NMatrix. Thank you, [reddavis](https://github.com/reddavis). :)
7
+ The distance measures will be created in Ruby first. If I see that it's really too slow, I'll write some methods in C (or Java, for JRuby).
8
+
9
+ This is a fork of the gem [Distance Measure](https://github.com/reddavis/Distance-Measures), which has a similar objective, but isn't actively maintained and doesn't support NMatrix. Thank you, [@reddavis][reddavis]. :)
8
10
 
9
11
  ## Install
10
12
 
11
13
  `gem install measurable`
12
14
 
13
- It only works with Ruby MRI 1.9.3 or 2.0.0. I still want to test it on JRuby, but as its still pure Ruby, it should work correctly there.
15
+ I have only tested it with Ruby 2.0.0 (yes, yes, travis, I'll do it eventually). I want to support JRuby as well.
16
+
17
+ ## Distance measures
18
+
19
+ I'm using the term "distance measure" without much concern for the strict mathematical definition of a metric. If the documentation for one of the methods isn't clear about whether it is a metric, please open an issue.
14
20
 
15
- ## Distance measures that I want to support for the moment
21
+ The following are the similarity measures supported at the moment:
16
22
 
17
23
  - Euclidean distance
18
24
  - Squared euclidean distance
19
25
  - Cosine distance
20
- - Max-min distance (["K-Means clustering using max-min distance measure"][1])
26
+ - Max-min distance (from ["K-Means clustering using max-min distance measure"][maxmin])
21
27
  - Jaccard distance
22
28
  - Tanimoto distance
29
+ - Haversine distance
23
30
 
24
31
  These still need to be implemented:
25
32
 
@@ -36,30 +43,42 @@ These still need to be implemented:
36
43
 
37
44
  ## How to use
38
45
 
39
- This list will be updated as I have time. I'll refactor the existing measures and add some that I'll need in a project.
40
-
41
46
  The API I intend to support is something like this:
42
47
 
43
48
  ```ruby
44
49
  require "measurable"
45
-
50
+
46
51
  u = NVector.ones(2)
47
52
  v = NVector.zeros(2)
48
53
  w = [1, 0]
49
54
  x = [2, 2]
50
55
 
51
- Measurable::euclidean(u, v) # => 1.41421
52
- Measurable::euclidean(w, v) # => 1.00000
53
- Measurable::euclidean(w, w) # => 0.00000
54
- Measurable::
56
+ # Calculate the distance between two points in space.
57
+ Measurable.euclidean(u, v) # => 1.41421
58
+ Measurable.euclidean(w, v) # => 1.00000
59
+ Measurable.cosine([1, 2], [2, 3]) # => 0.99228
60
+
61
+ # Calculate the norm of a vector, i.e. its distance from the origin.
62
+ Measurable.euclidean_squared([3, 4]) # => 25
55
63
  ```
56
64
 
57
- Maybe add support for (some of) NMatrix's dtypes, like `:float32`, `:float64`, `:complex64`, `:complex128`, etc. This will have to way until Measurable supports NMatrix C API.
65
+ ## Documentation
66
+
67
+ `RDoc` syntax is used to document the project. To build it locally, you'll need to install the [Fivefish generator](https://github.com/ged/rdoc-generator-fivefish) (`gem install rdoc-generator-fivefish`) and run the following command:
68
+
69
+ ```bash
70
+ rdoc -f fivefish -m README.md *.md LICENSE lib/
71
+ ```
72
+
73
+ I want to be able to use a Rake task to generate the documentation, thus allowing me to forget the specific command. However, there's a bug in `RDoc::Task` in which [custom generators (like Fivefish) can't be used](https://github.com/rdoc/rdoc/issues/246).
74
+
75
+ If there's something wrong with an explanation or if there's information missing, please open an issue or send a pull request.
58
76
 
59
77
  ## License
60
78
 
61
79
  See LICENSE for details.
62
80
 
63
- The original `Distance Measure` gem is copyrighted by @reddavis.
81
+ The original `distance_measures` gem is copyrighted by [@reddavis][reddavis].
64
82
 
65
- [1]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
83
+ [maxmin]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
84
+ [reddavis]: https://github.com/reddavis
data/Rakefile CHANGED
@@ -1,6 +1,7 @@
1
1
  require 'rake'
2
2
  require 'bundler/gem_tasks'
3
3
  require "rspec/core/rake_task"
4
+ # require 'rdoc/task' # See below.
4
5
 
5
6
  # Setup the necessary gems, specified in the gemspec.
6
7
  require 'bundler'
@@ -15,10 +16,22 @@ end
15
16
  # Run all the specs.
16
17
  RSpec::Core::RakeTask.new(:spec)
17
18
 
19
+ # RDoc task isn't working with custom generators, as can be seen in:
20
+ # https://github.com/rdoc/rdoc/issues/246
21
+ #
22
+ # Whenever this issue is fixed, I'll resume using this task.
23
+ #
24
+ # RDoc::Task.new do |rdoc|
25
+ # rdoc.main = "README.md"
26
+ # rdoc.rdoc_files.include("README.md", "LICENSE", "lib")
27
+ # rdoc.generator = "fivefish"
28
+ # rdoc.external = true
29
+ # end
30
+
18
31
  # Compile task.
19
32
  # Rake::ExtensionTask.new do |ext|
20
- # ext.name = 'measurable'
21
- # ext.ext_dir = 'ext/measurable'
33
+ # ext.name = 'measurable'
34
+ # ext.ext_dir = 'ext/measurable'
22
35
  # ext.lib_dir = 'lib/'
23
- # ext.source_pattern = "**/*.{c, cpp, h}"
36
+ # ext.source_pattern = "**/*.{c, cpp, h}"
24
37
  # end
@@ -1,47 +1,16 @@
1
- require 'measurable/version.rb'
1
+ require 'measurable/version'
2
2
 
3
- # Distance measures.
3
+ # Distance measures. The require order is important.
4
4
  require 'measurable/euclidean'
5
5
  require 'measurable/cosine'
6
- require 'measurable/tanimoto'
7
6
  require 'measurable/jaccard'
7
+ require 'measurable/tanimoto'
8
8
  require 'measurable/haversine'
9
9
  require 'measurable/maxmin'
10
10
 
11
11
  module Measurable
12
- # PI = 3.1415926535
13
- RAD_PER_DEG = 0.017453293 # PI/180
14
- class << self
15
- def binary_union(u, v)
16
- unions = []
17
- u.each_with_index do |n, index|
18
- if n == 1 || v[index] == 1
19
- unions << 1
20
- else
21
- unions << 0
22
- end
23
- end
24
-
25
- unions
26
- end
27
-
28
- def binary_intersection(u, v)
29
- intersects = []
30
- u.each_with_index do |n, index|
31
- if n == 1 && v[index] == 1
32
- intersects << 1
33
- else
34
- intersects << 0
35
- end
36
- end
37
-
38
- intersects
39
- end
12
+ # PI / 180 degrees.
13
+ RAD_PER_DEG = Math::PI / 180
40
14
 
41
- # Checks if we"re dealing with NaN"s and will return 0.0 unless
42
- # handle NaN"s is set to false
43
- def handle_nan(result)
44
- result.nan? ? 0.0 : result
45
- end
46
- end
15
+ extend self # expose all instance methods as singleton methods.
47
16
  end
@@ -1,10 +1,27 @@
1
1
  module Measurable
2
- class << self
3
- def cosine(u, v)
4
- dot_product = dot(u, v)
5
- normalization = self.euclidean_normalize * other.euclidean_normalize
6
2
 
7
- handle_nan(dot_product / normalization)
8
- end
3
+ # call-seq:
4
+ # cosine(u, v) -> Float
5
+ #
6
+ # Calculate the similarity between the orientation of two vectors.
7
+ #
8
+ # See: http://en.wikipedia.org/wiki/Cosine_similarity
9
+ #
10
+ # * *Arguments* :
11
+ # - +u+ -> An array of Numeric objects.
12
+ # - +v+ -> An array of Numeric objects.
13
+ # * *Returns* :
14
+ # - The normalized dot product of +u+ and +v+, that is, the cosine of the
15
+ # angle between them in the n-dimensional space.
16
+ # * *Raises* :
17
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
18
+ #
19
+ def cosine(u, v)
20
+ # TODO: Change this to a more specific, custom-made exception.
21
+ raise ArgumentError if u.size != v.size
22
+
23
+ dot_product = u.zip(v).reduce(0.0) { |acc, ary| acc += ary[0] * ary[1] }
24
+
25
+ dot_product / (euclidean(u) * euclidean(v))
9
26
  end
10
27
  end
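The new `cosine` returns the normalized dot product, which is a similarity (1 for parallel vectors), so a distance would be one minus that value. A minimal standalone sketch of both quantities, assuming hypothetical helper names (`euclidean_norm`, `cosine_similarity`, `cosine_distance`) that are not part of the gem:

```ruby
# Standalone sketch: these helpers are hypothetical and do not load the gem.

# Euclidean norm of a vector, i.e. its distance from the origin.
def euclidean_norm(u)
  Math.sqrt(u.reduce(0.0) { |acc, x| acc + x * x })
end

# Cosine of the angle between u and v (normalized dot product).
def cosine_similarity(u, v)
  raise ArgumentError, "vectors must have the same size" if u.size != v.size

  dot_product = u.zip(v).reduce(0.0) { |acc, (a, b)| acc + a * b }
  dot_product / (euclidean_norm(u) * euclidean_norm(v))
end

# One common way to turn the similarity into a dissimilarity.
def cosine_distance(u, v)
  1.0 - cosine_similarity(u, v)
end

puts cosine_similarity([1, 2], [2, 3]) # about 0.99228, the value checked in the cosine spec
puts cosine_distance([1, 2], [2, 3])   # about 0.00772
```

Whether the gem should expose the similarity or the distance under the name `cosine` is a design choice for the author; the sketch only shows how the two values relate.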
@@ -1,40 +1,76 @@
1
1
  module Measurable
2
- class << self
3
- # Add documentation here!
4
- def euclidean(u, v = nil)
5
- # If the second argument is nil, the method should return the norm of
6
- # vector u. For this, we need the distance between u and the origin.
7
- if v.nil?
8
- v = Array.new(u.size, 0)
9
- end
10
-
11
- # We could make it work with vector of different sizes because of #zip
12
- # but it's unreliable. It's better to just throw an exception.
13
- # TODO: Change this to a more specific, custom-made exception.
14
- raise ArgumentError if u.size != v.size
15
-
16
- sum = u.zip(v).reduce(0.0) do |acc, ary|
17
- acc += (ary[0] - ary[-1])**2
18
- end
19
-
20
- Math.sqrt(sum)
2
+
3
+ # call-seq:
4
+ # euclidean(u) -> Float
5
+ # euclidean(u, v) -> Float
6
+ #
7
+ # Calculate the ordinary distance between arrays +u+ and +v+.
8
+ #
9
+ # If +v+ isn't given, calculate the Euclidean norm of +u+.
10
+ #
11
+ # See: http://en.wikipedia.org/wiki/Euclidean_distance#N_dimensions
12
+ #
13
+ # * *Arguments* :
14
+ # - +u+ -> An array of Numeric objects.
15
+ # - +v+ -> (Optional) An array of Numeric objects.
16
+ # * *Returns* :
17
+ # - The euclidean norm of +u+ or the euclidean distance between +u+ and
18
+ # +v+.
19
+ # * *Raises* :
20
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
21
+ #
22
+ def euclidean(u, v = nil)
23
+ # If the second argument is nil, the method should return the norm of
24
+ # vector u. For this, we need the distance between u and the origin.
25
+ if v.nil?
26
+ v = Array.new(u.size, 0)
21
27
  end
22
-
23
- def euclidean_squared(u, v = nil)
24
- # If the second argument is nil, the method should return the norm of
25
- # vector u. For this, we need the distance between u and the origin.
26
- if v.nil?
27
- v = Array.new(u.size, 0)
28
- end
29
-
30
- # We could make it work with vector of different sizes because of #zip
31
- # but it's unreliable. It's better to just throw an exception.
32
- # TODO: Change this to a more specific, custom-made exception.
33
- raise ArgumentError if u.size != v.size
34
-
35
- u.zip(v).reduce(0.0) do |acc, ary|
36
- acc += (ary[0] - ary[-1])**2
37
- end
28
+
29
+ # TODO: Change this to a more specific, custom-made exception.
30
+ raise ArgumentError if u.size != v.size
31
+
32
+ sum = u.zip(v).reduce(0.0) do |acc, ary|
33
+ acc += (ary[0] - ary[-1]) ** 2
34
+ end
35
+
36
+ Math.sqrt(sum)
37
+ end
38
+
39
+ # call-seq:
40
+ # euclidean_squared(u) -> Float
41
+ # euclidean_squared(u, v) -> Float
42
+ #
43
+ # Calculate the same value as euclidean(u, v), but don't take the square root
44
+ # of it.
45
+ #
46
+ # This isn't a metric in the strict sense, i.e. it doesn't respect the
47
+ # triangle inequality. However, the squared Euclidean distance is very useful
48
+ # whenever only the relative values of distances are important, for example
49
+ # in optimization problems.
50
+ #
51
+ # See: http://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance
52
+ #
53
+ # * *Arguments* :
54
+ # - +u+ -> An array of Numeric objects.
55
+ # - +v+ -> (Optional) An array of Numeric objects.
56
+ # * *Returns* :
57
+ # - The squared value of the euclidean norm of +u+ or of the euclidean
58
+ # distance between +u+ and +v+.
59
+ # * *Raises* :
60
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
61
+ #
62
+ def euclidean_squared(u, v = nil)
63
+ # If the second argument is nil, the method should return the norm of
64
+ # vector u. For this, we need the distance between u and the origin.
65
+ if v.nil?
66
+ v = Array.new(u.size, 0)
67
+ end
68
+
69
+ # TODO: Change this to a more specific, custom-made exception.
70
+ raise ArgumentError if u.size != v.size
71
+
72
+ u.zip(v).reduce(0.0) do |acc, ary|
73
+ acc += (ary[0] - ary[-1]) ** 2
38
74
  end
39
75
  end
40
76
  end
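The documentation above states that the squared Euclidean distance does not respect the triangle inequality. A small standalone check with three collinear points illustrates this; `squared_euclidean` is a hypothetical helper written for this sketch, not the gem's method:

```ruby
# Standalone sketch: hypothetical helper, does not load the gem.
def squared_euclidean(u, v)
  raise ArgumentError, "vectors must have the same size" if u.size != v.size

  u.zip(v).reduce(0.0) { |acc, (a, b)| acc + (a - b)**2 }
end

a = [0]
b = [1]
c = [2]

direct = squared_euclidean(a, c)                           # 4.0
detour = squared_euclidean(a, b) + squared_euclidean(b, c) # 1.0 + 1.0 = 2.0

# A metric would require direct <= detour; here the direct value is larger.
puts direct > detour # true
```

The ordering of distances is still preserved (if `u` is closer to `v` than to `w`, the squared values agree), which is why skipping the square root is safe when only relative values matter.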
@@ -1,46 +1,71 @@
1
- # Notes:
2
- #
3
- # translated into Ruby based on information contained in:
4
- # http://mathforum.org/library/drmath/view/51879.html
5
- # Dr. Rick and Dr. Peterson - 4/20/99
6
- #
7
- # http://www.movable-type.co.uk/scripts/latlong.html
8
- # http://en.wikipedia.org/wiki/Haversine_formula
9
- #
10
- # This formula can compute accurate distances between two points given latitude
11
- # and longitude, even for short distances.
12
-
13
1
  module Measurable
14
2
 
15
- R_MILES = 3956 # radius of the great circle in miles
16
- R_KM = 6371 # radius in kilometers...some algorithms use 6367
17
-
18
- # the great circle distance d will be in whatever units R is in
19
- R = {
20
- :miles => R_MILES,
21
- :km => R_KM,
22
- :feet => R_MILES * 5282,
23
- :meters => R_KM * 1000
3
+ # Earth radius in miles.
4
+ EARTH_RADIUS_IN_MILES = 3956
5
+
6
+ # Earth radius in kilometers. Some algorithms use 6367.
7
+ EARTH_RADIUS_IN_KILOMETERS = 6371
8
+
9
+ # The great circle distance returned will be in the unit of the radius chosen from EARTH_RADIUS.
10
+ # Provides
11
+ EARTH_RADIUS = {
12
+ :miles => EARTH_RADIUS_IN_MILES,
13
+ :km => EARTH_RADIUS_IN_KILOMETERS,
14
+ :feet => EARTH_RADIUS_IN_MILES * 5280,
15
+ :meters => EARTH_RADIUS_IN_KILOMETERS * 1000
24
16
  }
25
17
 
26
- class << self
27
- def haversine(u, v, um = :meters)
28
- dlon = u[1] - v[1]
29
- dlat = u[0] - v[0]
18
+ # call-seq:
19
+ # haversine(u, v, unit = :meters) -> Float
20
+ #
21
+ # Compute accurate distances between two points given their latitudes and
22
+ # longitudes, even for short distances. This isn't a distance measure in the
23
+ # same sense as the other methods in +Measurable+.
24
+ #
25
+ # The distance returned is the great circle (or orthodromic) distance between
26
+ # +u+ and +v+, which is the shortest distance between them on the surface of
27
+ # a sphere. Thus, this implementation considers the Earth to be a sphere.
28
+ #
29
+ # Remember that the input vectors are of the form [latitude, longitude] in
30
+ # degrees, so if you have the coordinates [23 32' S, 46 37' W] (from São
31
+ # Paulo), the corresponding vector is [-23.53333, -46.61667].
32
+ #
33
+ # References:
34
+ # - http://www.movable-type.co.uk/scripts/latlong.html
35
+ # - http://en.wikipedia.org/wiki/Haversine_formula
36
+ # - http://en.wikipedia.org/wiki/Great-circle_distance
37
+ #
38
+ # * *Arguments* :
39
+ # - +u+ -> An array of Numeric objects.
40
+ # - +v+ -> An array of Numeric objects.
41
+ # - +unit+ -> (Optional) A Symbol representing the unit of measure. Available
42
+ # options are +:miles+, +:feet+, +:km+ and +:meters+.
43
+ # * *Returns* :
44
+ # - The great circle distance between +u+ and +v+.
45
+ # * *Raises* :
46
+ # - +ArgumentError+ -> The size of +u+ and +v+ must be 2.
47
+ # - +ArgumentError+ -> +unit+ must be a Symbol.
48
+ #
49
+ def haversine(u, v, unit = :meters)
50
+ # TODO: Create better exceptions.
51
+ raise ArgumentError if u.size != 2 || v.size != 2
52
+ raise ArgumentError if unit.class != Symbol
53
+
54
+ dlat = u[0] - v[0]
55
+ dlon = u[1] - v[1]
30
56
 
31
- dlon_rad = dlon * RAD_PER_DEG
32
- dlat_rad = dlat * RAD_PER_DEG
57
+ dlon_rad = dlon * RAD_PER_DEG
58
+ dlat_rad = dlat * RAD_PER_DEG
33
59
 
34
- lat1_rad = v[0] * RAD_PER_DEG
35
- lon1_rad = v[1] * RAD_PER_DEG
60
+ lat1_rad = v[0] * RAD_PER_DEG
61
+ lon1_rad = v[1] * RAD_PER_DEG
36
62
 
37
- lat2_rad = u[0] * RAD_PER_DEG
38
- lon2_rad = u[1] * RAD_PER_DEG
63
+ lat2_rad = u[0] * RAD_PER_DEG
64
+ lon2_rad = u[1] * RAD_PER_DEG
39
65
 
40
- a = (Math.sin(dlat_rad/2))**2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * (Math.sin(dlon_rad/2))**2
41
- c = 2 * Math.atan2( Math.sqrt(a), Math.sqrt(1-a))
66
+ a = (Math.sin(dlat_rad / 2)) ** 2 + Math.cos(lat1_rad) * Math.cos(lat2_rad) * (Math.sin(dlon_rad / 2)) ** 2
67
+ c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))
42
68
 
43
- R[um] * c
44
- end
69
+ EARTH_RADIUS[unit] * c
45
70
  end
46
71
  end
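The haversine computation in this hunk can be checked against a case with a known answer: two points on the equator 90 degrees of longitude apart are a quarter of a great circle away from each other, so the distance should be `radius * PI / 2`. A standalone sketch, assuming a spherical Earth with a radius of 6371 km and a hypothetical helper name `haversine_km`:

```ruby
# Standalone sketch: hypothetical helper, does not load the gem.
# Inputs are [latitude, longitude] pairs in degrees; the result is in kilometers.
def haversine_km(u, v)
  raise ArgumentError, "expected [latitude, longitude]" if u.size != 2 || v.size != 2

  radius_km = 6371.0                 # assumed spherical Earth radius
  radians_per_degree = Math::PI / 180

  lat1, lon1 = u.map { |deg| deg * radians_per_degree }
  lat2, lon2 = v.map { |deg| deg * radians_per_degree }

  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a))

  radius_km * c
end

quarter_circle = haversine_km([0, 0], [0, 90])
puts quarter_circle             # close to 6371 * PI / 2
puts 6371.0 * Math::PI / 2
```

A closed-form case like this one gives a tighter check than comparing two cities, where the expected value itself depends on the radius chosen.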
@@ -1,26 +1,69 @@
1
- # http://en.wikipedia.org/wiki/Jaccard_coefficient
2
1
  module Measurable
3
- class << self
4
- def jaccard(u, v)
5
- 1 - jaccard_index(u, v)
2
+
3
+ # call-seq:
4
+ # jaccard_index(u, v) -> Float
5
+ #
6
+ # Give the similarity between two binary vectors +u+ and +v+. Calculated as:
7
+ # jaccard_index = |intersection| / |union|
8
+ #
9
+ # In which intersection and union refer to +u+ and +v+ and |x| is the
10
+ # cardinality of set x.
11
+ #
12
+ # For example:
13
+ # jaccard_index([1, 0, 1], [1, 1, 1]) == 0.666...
14
+ #
15
+ # Because |intersection| = |(1, 0, 1)| = 2 and |union| = |(1, 1, 1)| = 3.
16
+ #
17
+ # See: http://en.wikipedia.org/wiki/Jaccard_coefficient
18
+ #
19
+ # * *Arguments* :
20
+ # - +u+ -> Array of 1s and 0s.
21
+ # - +v+ -> Array of 1s and 0s.
22
+ # * *Returns* :
23
+ # - Float value representing the Jaccard similarity coefficient between
24
+ # +u+ and +v+.
25
+ # * *Raises* :
26
+ # - +ArgumentError+ -> The sizes of the input arrays don't match.
27
+ #
28
+ def jaccard_index(u, v)
29
+ # TODO: Change this to a more specific, custom-made exception.
30
+ raise ArgumentError if u.size != v.size
31
+
32
+ intersection = u.zip(v).reduce(0) do |acc, elem|
33
+ # Both u and v must have this element.
34
+ elem[0] + elem[1] == 2 ? (acc + 1) : acc
6
35
  end
7
-
8
- def jaccard_index(u, v)
9
- union = (u | v).size.to_f
10
- intersection = (u & v).size.to_f
11
-
12
- intersection / union
13
- end
14
-
15
- def binary_jaccard(u, v)
16
- 1 - binary_jaccard_index(u, v)
17
- end
18
-
19
- def binary_jaccard_index(u, v)
20
- intersection = binary_intersection(u, v).delete_if {|x| x == 0}.size.to_f
21
- union = binary_union(u, v).delete_if {|x| x == 0}.size.to_f
22
-
23
- intersection / union
36
+
37
+ union = u.zip(v).reduce(0) do |acc, elem|
38
+ # One of u and v must have this element.
39
+ elem[0] + elem[1] >= 1 ? (acc + 1) : acc
24
40
  end
41
+
42
+ intersection.to_f / union
43
+ end
44
+
45
+ # call-seq:
46
+ # jaccard(u, v) -> Float
47
+ #
48
+ # The jaccard distance is a measure of dissimilarity between two sets. It is
49
+ # calculated as:
50
+ # jaccard_distance = 1 - jaccard_index
51
+ #
52
+ # This is a proper metric, i.e. the following conditions hold:
53
+ # - Symmetry: jaccard(u, v) == jaccard(v, u)
54
+ # - Non-negative: jaccard(u, v) >= 0
55
+ # - Coincidence axiom: jaccard(u, v) == 0 if u == v
56
+ # - Triangular inequality: jaccard(u, v) <= jaccard(u, w) + jaccard(w, v)
57
+ #
58
+ # * *Arguments* :
59
+ # - +u+ -> Array of 1s and 0s.
60
+ # - +v+ -> Array of 1s and 0s.
61
+ # * *Returns* :
62
+ # - Float value representing the dissimilarity between +u+ and +v+.
63
+ # * *Raises* :
64
+ # - +ArgumentError+ -> The sizes of the input arrays don't match.
65
+ #
66
+ def jaccard(u, v)
67
+ 1 - jaccard_index(u, v)
25
68
  end
26
69
  end
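The Jaccard index on binary vectors reduces to counting positions where both vectors have a 1 (intersection) and positions where at least one has a 1 (union). A standalone sketch of that counting, using a hypothetical name `jaccard_index_sketch`; it also shows what happens when both vectors are all zeros, a case that appears to be unguarded now that `handle_nan` is removed in this diff:

```ruby
# Standalone sketch: hypothetical helper, does not load the gem.
def jaccard_index_sketch(u, v)
  raise ArgumentError, "vectors must have the same size" if u.size != v.size

  pairs = u.zip(v)
  intersection = pairs.count { |a, b| a == 1 && b == 1 } # both have the element
  union = pairs.count { |a, b| a == 1 || b == 1 }        # at least one has it

  intersection.to_f / union
end

puts jaccard_index_sketch([1, 0, 1], [1, 1, 1])     # 2 / 3
puts 1 - jaccard_index_sketch([1, 0, 1], [1, 1, 1]) # Jaccard distance, 1 / 3

# With an empty union the Float division 0.0 / 0 yields NaN.
puts jaccard_index_sketch([0, 0], [0, 0]).nan?      # true
```

Callers that may pass all-zero vectors would need to decide how to treat that NaN; the sketch leaves it as is.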
@@ -1,13 +1,34 @@
1
1
  module Measurable
2
- class << self
3
- def maxmin(u, v)
4
- sum_min, sum_max = u.zip(v).reduce([0.0, 0.0]) do |acc, attributes|
5
- acc[0] += attributes.min
6
- acc[-1] += attributes.max
7
- acc
8
- end
9
-
10
- sum_min / sum_max
2
+
3
+ # call-seq:
4
+ # maxmin(u, v) -> Float
5
+ #
6
+ # The "Max-min distance" is used to measure similarity between two vectors.
7
+ #
8
+ # When used in k-means clustering, this similarity measure can give better
9
+ # results in some datasets, as pointed out in the paper "K-means clustering
10
+ # using Max-min distance measure" --- Visalakshi, N. K.; Suguna, J.
11
+ #
12
+ # See: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=05156398
13
+ #
14
+ # * *Arguments* :
15
+ # - +u+ -> An array of Numeric objects.
16
+ # - +v+ -> An array of Numeric objects.
17
+ # * *Returns* :
18
+ # - Similarity between +u+ and +v+.
19
+ # * *Raises* :
20
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
21
+ #
22
+ def maxmin(u, v)
23
+ # TODO: Change this to a more specific, custom-made exception.
24
+ raise ArgumentError if u.size != v.size
25
+
26
+ sum_min, sum_max = u.zip(v).reduce([0.0, 0.0]) do |acc, attributes|
27
+ acc[0] += attributes.min
28
+ acc[1] += attributes.max
29
+ acc
11
30
  end
31
+
32
+ sum_min / sum_max
12
33
  end
13
34
  end
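The max-min measure in this hunk divides the sum of the element-wise minimums by the sum of the element-wise maximums, so identical vectors score 1.0 and the value behaves as a similarity. A standalone sketch with a hypothetical name `maxmin_sketch`, using the vectors from the spec:

```ruby
# Standalone sketch: hypothetical helper, does not load the gem.
def maxmin_sketch(u, v)
  raise ArgumentError, "vectors must have the same size" if u.size != v.size

  pairs = u.zip(v)
  sum_min = pairs.reduce(0.0) { |acc, pair| acc + pair.min }
  sum_max = pairs.reduce(0.0) { |acc, pair| acc + pair.max }

  sum_min / sum_max
end

# Minimums: 1 + 3 + 16 = 20; maximums: 1 + 4 + 16 = 21.
puts maxmin_sketch([1, 3, 16], [1, 4, 16]) # 20 / 21
puts maxmin_sketch([1, 3, 16], [1, 3, 16]) # 1.0 for identical vectors
```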
@@ -1,11 +1,32 @@
1
- # http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29
2
1
  module Measurable
3
- class << self
4
- def tanimoto(u, v)
5
- dot = dot(u, v).to_f
6
- result = dot / (u.sum_of_squares + v.sum_of_squares - dot).to_f
7
-
8
- handle_nan(result)
9
- end
2
+
3
+ # Tanimoto similarity is the same as Jaccard similarity.
4
+ alias :tanimoto_similarity :jaccard_index
5
+
6
+ # call-seq:
7
+ # tanimoto(u, v) -> Float
8
+ #
9
+ # Tanimoto distance is a coefficient explicitly chosen so as to allow for
10
+ # two dissimilar specimens to be similar to a third one. This breaks the
11
+ # triangle inequality, thus this isn't a metric.
12
+ #
13
+ # More information and references on this are needed. It's left here mostly
14
+ # as a piece of curiosity.
15
+ #
16
+ # See: http://en.wikipedia.org/wiki/Jaccard_index#Tanimoto.27s_Definitions_of_Similarity_and_Distance
17
+ #
18
+ # * *Arguments* :
19
+ # - +u+ -> An array of Numeric objects.
20
+ # - +v+ -> An array of Numeric objects.
21
+ # * *Returns* :
22
+ # - A measure of the similarity between +u+ and +v+.
23
+ # * *Raises* :
24
+ # - +ArgumentError+ -> The sizes of +u+ and +v+ don't match.
25
+ #
26
+ def tanimoto(u, v)
27
+ # TODO: Change this to a more specific, custom-made exception.
28
+ raise ArgumentError if u.size != v.size
29
+
30
+ -Math.log2(jaccard_index(u, v))
10
31
  end
11
32
  end
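The Tanimoto distance in this hunk is the negative base-2 logarithm of the Jaccard index, so it is 0 for identical binary vectors and grows without bound as the overlap shrinks. A standalone sketch, with hypothetical names `jaccard_index_sketch` and `tanimoto_sketch` that only mirror the formula shown in the diff:

```ruby
# Standalone sketch: hypothetical helpers, do not load the gem.
def jaccard_index_sketch(u, v)
  raise ArgumentError, "vectors must have the same size" if u.size != v.size

  pairs = u.zip(v)
  intersection = pairs.count { |a, b| a == 1 && b == 1 }
  union = pairs.count { |a, b| a == 1 || b == 1 }

  intersection.to_f / union
end

def tanimoto_sketch(u, v)
  -Math.log2(jaccard_index_sketch(u, v))
end

puts tanimoto_sketch([1, 0, 1], [1, 1, 1]) # -log2(2 / 3), about 0.585
puts tanimoto_sketch([1, 1], [1, 1])       # zero for identical vectors
puts tanimoto_sketch([1, 0], [0, 1])       # Infinity: no overlap, index is 0.0
```

The unbounded result for disjoint vectors is worth keeping in mind when comparing this value with the bounded Jaccard distance.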
@@ -1,3 +1,3 @@
1
1
  module Measurable
2
- VERSION = "0.0.4"
2
+ VERSION = "0.0.5" # :nodoc:
3
3
  end
@@ -0,0 +1,29 @@
1
+ describe "Cosine distance" do
2
+
3
+ before :all do
4
+ @u = [1, 2]
5
+ @v = [2, 3]
6
+ @w = [4, 5]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.cosine(@u, @v) }.to_not raise_error
11
+ expect { Measurable.cosine(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.cosine(@u, @v)
16
+ y = Measurable.cosine(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.cosine(@u, @v)
23
+ x.should be_within(TOLERANCE).of(0.992277877)
24
+ end
25
+
26
+ it "shouldn't work with vectors of different length" do
27
+ expect { Measurable.cosine(@u, [1, 3, 5, 7]) }.to raise_error
28
+ end
29
+ end
@@ -0,0 +1,61 @@
1
+ describe "Euclidean" do
2
+
3
+ before :all do
4
+ @u = [1, 3, 16]
5
+ @v = [1, 4, 16]
6
+ @w = [4, 5, 6]
7
+ end
8
+
9
+ context "Distance" do
10
+ it "accepts two arguments" do
11
+ expect { Measurable.euclidean(@u, @v) }.to_not raise_error
12
+ expect { Measurable.euclidean(@u, @v, @w) }.to raise_error(ArgumentError)
13
+ end
14
+
15
+ it "accepts one argument and returns the vector's norm" do
16
+ # Remember that 3^2 + 4^2 = 5^2.
17
+ Measurable.euclidean([3, 4]).should == 5
18
+ end
19
+
20
+ it "should be symmetric" do
21
+ Measurable.euclidean(@u, @v).should == Measurable.euclidean(@v, @u)
22
+ end
23
+
24
+ it "should return the correct value" do
25
+ Measurable.euclidean(@u, @u).should == 0
26
+ Measurable.euclidean(@u, @v).should == 1
27
+ end
28
+
29
+ it "shouldn't work with vectors of different length" do
30
+ expect { Measurable.euclidean(@u, [2, 2, 2, 2]) }.to raise_error
31
+ end
32
+ end
33
+
34
+ context "Squared Distance" do
35
+ it "accepts two arguments" do
36
+ expect { Measurable.euclidean_squared(@u, @v) }.to_not raise_error
37
+ expect { Measurable.euclidean_squared(@u, @v, @w) }.to raise_error(ArgumentError)
38
+ end
39
+
40
+ it "accepts one argument and returns the vector's norm" do
41
+ # Remember that 3^2 + 4^2 = 5^2.
42
+ Measurable.euclidean_squared([3, 4]).should == 25
43
+ end
44
+
45
+ it "should be symmetric" do
46
+ x = Measurable.euclidean_squared(@u, @v)
47
+ y = Measurable.euclidean_squared(@v, @u)
48
+
49
+ x.should == y
50
+ end
51
+
52
+ it "should return the correct value" do
53
+ Measurable.euclidean_squared(@u, @u).should == 0
54
+ Measurable.euclidean_squared(@u, @v).should == 1
55
+ end
56
+
57
+ it "shouldn't work with vectors of different length" do
58
+ expect { Measurable.euclidean_squared(@u, [2, 2, 2, 2]) }.to raise_error
59
+ end
60
+ end
61
+ end
@@ -0,0 +1,37 @@
1
+ describe "Haversine distance" do
2
+
3
+ before :all do
4
+ # We have very big errors in this formula, due to:
5
+ # - The Earth is considered a sphere.
6
+ # - Earth's radius is considered constant (same as above).
7
+ #
8
+ # Given these conditions, I'll just assume the error to be less than 1.
9
+ # TODO: Calculate better error estimates.
10
+ @haversine_tolerance = 1
11
+
12
+ @u = [ 35.66667, 139.75] # Tokyo: 35 40' N, 139 45' E.
13
+ @v = [-23.53333, -46.61667] # São Paulo: 23 32' S, 46 37' W.
14
+ end
15
+
16
+ it "accepts two arguments" do
17
+ expect { Measurable.haversine(@u, @v) }.to_not raise_error
18
+ expect { Measurable.haversine(@u, @v, [-24.5, 40.23]) }.to raise_error(ArgumentError)
19
+ end
20
+
21
+ it "should be symmetric" do
22
+ x = Measurable.haversine(@u, @v)
23
+ y = Measurable.haversine(@v, @u)
24
+
25
+ x.should be_within(TOLERANCE).of(y)
26
+ end
27
+
28
+ it "should return the correct value" do
29
+ x = Measurable.haversine(@u, @v, :km)
30
+
31
+ x.should be_within(@haversine_tolerance).of(18533)
32
+ end
33
+
34
+ it "should only work with [lat, long] vectors" do
35
+ expect { Measurable.haversine([2, 4], [1, 3, 5, 7]) }.to raise_error
36
+ end
37
+ end
@@ -0,0 +1,62 @@
1
+ describe "Jaccard" do
2
+
3
+ context "Index" do
4
+ before :all do
5
+ @u = [1, 0, 1]
6
+ @v = [1, 1, 1]
7
+ @w = [0, 1, 0]
8
+ end
9
+
10
+ it "accepts two arguments" do
11
+ expect { Measurable.jaccard_index(@u, @v) }.to_not raise_error
12
+ expect { Measurable.jaccard_index(@u, @v, @w) }.to raise_error(ArgumentError)
13
+ end
14
+
15
+ it "should be symmetric" do
16
+ x = Measurable.jaccard_index(@u, @v)
17
+ y = Measurable.jaccard_index(@v, @u)
18
+
19
+ x.should be_within(TOLERANCE).of(y)
20
+ end
21
+
22
+ it "should return the correct value" do
23
+ x = Measurable.jaccard_index(@u, @v)
24
+
25
+ x.should be_within(TOLERANCE).of(2.0 / 3.0)
26
+ end
27
+
28
+ it "shouldn't work with vectors of different length" do
29
+ expect { Measurable.jaccard_index(@u, [1, 2, 3, 4]) }.to raise_error
30
+ end
31
+ end
32
+
33
+ context "Distance" do
34
+ before :all do
35
+ @u = [1, 0, 1]
36
+ @v = [1, 1, 1]
37
+ @w = [0, 1, 0]
38
+ end
39
+
40
+ it "accepts two arguments" do
41
+ expect { Measurable.jaccard(@u, @v) }.to_not raise_error
42
+ expect { Measurable.jaccard(@u, @v, @w) }.to raise_error(ArgumentError)
43
+ end
44
+
45
+ it "should be symmetric" do
46
+ x = Measurable.jaccard(@u, @v)
47
+ y = Measurable.jaccard(@v, @u)
48
+
49
+ x.should be_within(TOLERANCE).of(y)
50
+ end
51
+
52
+ it "should return the correct value" do
53
+ x = Measurable.jaccard(@u, @v)
54
+
55
+ x.should be_within(TOLERANCE).of(1.0 / 3.0)
56
+ end
57
+
58
+ it "shouldn't work with vectors of different length" do
59
+ expect { Measurable.jaccard(@u, [1, 2, 3, 4]) }.to raise_error
60
+ end
61
+ end
62
+ end
@@ -0,0 +1,30 @@
1
+ describe "Max-min distance" do
2
+
3
+ before :all do
4
+ @u = [1, 3, 16]
5
+ @v = [1, 4, 16]
6
+ @w = [4, 5, 6]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.maxmin(@u, @v) }.to_not raise_error
11
+ expect { Measurable.maxmin(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.maxmin(@u, @v)
16
+ y = Measurable.maxmin(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.maxmin(@u, @v)
23
+
24
+ x.should be_within(TOLERANCE).of(0.9523809523)
25
+ end
26
+
27
+ it "shouldn't work with vectors of different length" do
28
+ expect { Measurable.maxmin(@u, [1, 3, 5, 7]) }.to raise_error
29
+ end
30
+ end
@@ -2,3 +2,5 @@ $LOAD_PATH.unshift(File.dirname(__FILE__))
2
2
  $LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
3
3
 
4
4
  require 'measurable'
5
+
6
+ TOLERANCE = 10e-9
@@ -0,0 +1,30 @@
1
+ describe "Tanimoto distance" do
2
+
3
+ before :all do
4
+ @u = [1, 0, 1]
5
+ @v = [1, 1, 1]
6
+ @w = [0, 1, 0]
7
+ end
8
+
9
+ it "accepts two arguments" do
10
+ expect { Measurable.tanimoto(@u, @v) }.to_not raise_error
11
+ expect { Measurable.tanimoto(@u, @v, @w) }.to raise_error(ArgumentError)
12
+ end
13
+
14
+ it "should be symmetric" do
15
+ x = Measurable.tanimoto(@u, @v)
16
+ y = Measurable.tanimoto(@v, @u)
17
+
18
+ x.should be_within(TOLERANCE).of(y)
19
+ end
20
+
21
+ it "should return the correct value" do
22
+ x = Measurable.tanimoto(@u, @v)
23
+
24
+ x.should be_within(TOLERANCE).of(-Math.log2(2.0 / 3.0))
25
+ end
26
+
27
+ it "shouldn't work with vectors of different length" do
28
+ expect { Measurable.tanimoto(@u, [1, 3, 5, 7]) }.to raise_error
29
+ end
30
+ end
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: measurable
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.0.4
4
+ version: 0.0.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Carlos Agarie
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2013-03-24 00:00:00.000000000 Z
11
+ date: 2013-07-24 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: bundler
@@ -74,8 +74,13 @@ files:
74
74
  - lib/measurable/tanimoto.rb
75
75
  - lib/measurable/version.rb
76
76
  - measurable.gemspec
77
- - spec/measurable_spec.rb
77
+ - spec/cosine_spec.rb
78
+ - spec/euclidean_spec.rb
79
+ - spec/haversine_spec.rb
80
+ - spec/jaccard_spec.rb
81
+ - spec/maxmin_spec.rb
78
82
  - spec/spec_helper.rb
83
+ - spec/tanimoto_spec.rb
79
84
  homepage: http://github.com/agarie/measurable
80
85
  licenses: []
81
86
  metadata: {}
@@ -95,10 +100,15 @@ required_rubygems_version: !ruby/object:Gem::Requirement
95
100
  version: '0'
96
101
  requirements: []
97
102
  rubyforge_project:
98
- rubygems_version: 2.0.0
103
+ rubygems_version: 2.0.3
99
104
  signing_key:
100
105
  specification_version: 4
101
106
  summary: A Ruby gem with a lot of distance measures for your projects.
102
107
  test_files:
103
- - spec/measurable_spec.rb
108
+ - spec/cosine_spec.rb
109
+ - spec/euclidean_spec.rb
110
+ - spec/haversine_spec.rb
111
+ - spec/jaccard_spec.rb
112
+ - spec/maxmin_spec.rb
104
113
  - spec/spec_helper.rb
114
+ - spec/tanimoto_spec.rb
@@ -1,159 +0,0 @@
1
- describe Measurable do
2
-
3
- describe "Binary union" do
4
-
5
- end
6
-
7
- describe "Binary intersection" do
8
-
9
- end
10
-
11
- describe "Euclidean" do
12
-
13
- before :all do
14
- @u = [1, 3, 16]
15
- @v = [1, 4, 16]
16
- @w = [4, 5, 6]
17
- end
18
-
19
- context "Distance" do
20
- it "accepts two arguments" do
21
- expect { Measurable.euclidean(@u, @v) }.to_not raise_error
22
- expect { Measurable.euclidean(@u, @v, @w) }.to raise_error(ArgumentError)
23
- end
24
-
25
- it "accepts one argument and returns the vector's norm" do
26
- # Remember that 3^2 + 4^2 = 5^2.
27
- Measurable.euclidean([3, 4]).should == 5
28
- end
29
-
30
- it "should be symmetric" do
31
- Measurable.euclidean(@u, @v).should == Measurable.euclidean(@v, @u)
32
- end
33
-
34
- it "should return the correct value" do
35
- Measurable.euclidean(@u, @u).should == 0
36
- Measurable.euclidean(@u, @v).should == 1
37
- end
38
-
39
- it "shouldn't work with vectors of different length" do
40
- expect { Measurable.euclidean(@u, [2, 2, 2, 2]) }.to raise_error
41
- end
42
- end
43
-
44
- context "Squared Distance" do
45
- it "accepts two arguments" do
46
- expect { Measurable.euclidean_squared(@u, @v) }.to_not raise_error
47
- expect { Measurable.euclidean_squared(@u, @v, @w) }.to raise_error(ArgumentError)
48
- end
49
-
50
- it "accepts one argument and returns the vector's norm" do
51
- # Remember that 3^2 + 4^2 = 5^2.
52
- Measurable.euclidean_squared([3, 4]).should == 25
53
- end
54
-
55
- it "should be symmetric" do
56
- x = Measurable.euclidean_squared(@u, @v)
57
- y = Measurable.euclidean_squared(@v, @u)
58
-
59
- x.should == y
60
- end
61
-
62
- it "should return the correct value" do
63
- Measurable.euclidean_squared(@u, @u).should == 0
64
- Measurable.euclidean_squared(@u, @v).should == 1
65
- end
66
-
67
- it "shouldn't work with vectors of different length" do
68
- expect { Measurable.euclidean_squared(@u, [2, 2, 2, 2]) }.to raise_error
69
- end
70
- end
71
-
72
- end
73
-
74
- describe "Cosine distance" do
75
- it "accepts two arguments"
76
-
77
- it "accepts one argument and returns the vector's norm"
78
-
79
- it "should handle NaN's"
80
-
81
- it "should be symmetric"
82
-
83
- it "should return the correct value"
84
-
85
- it "shouldn't work with vectors of different length"
86
- end
87
-
88
- describe "Chebyshev distance" do
89
- it "accepts two arguments"
90
-
91
- it "accepts one argument and returns the vector's norm"
92
-
93
- it "should be symmetric"
94
-
95
- it "should return the correct value"
96
-
97
- it "shouldn't work with vectors of different length"
98
- end
99
-
100
- describe "Tanimoto distance" do
101
- it "accepts two arguments"
102
-
103
- it "accepts one argument and returns the vector's norm"
104
-
105
- it "should be symmetric"
106
-
107
- it "should return the correct value"
108
-
109
- it "shouldn't work with vectors of different length"
110
- end
111
-
112
- describe "Haversine distance" do
113
- it "accepts two arguments"
114
-
115
- it "accepts one argument and returns the vector's norm"
116
-
117
- it "should be symmetric"
118
-
119
- it "should return the correct value"
120
-
121
- it "shouldn't work with vectors of different length"
122
- end
123
-
124
- describe "Jaccard distance" do
125
- it "accepts two arguments"
126
-
127
- it "accepts one argument and returns the vector's norm"
128
-
129
- it "should be symmetric"
130
-
131
- it "should return the correct value"
132
-
133
- it "shouldn't work with vectors of different length"
134
- end
135
-
136
- describe "Binary Jaccard distance" do
137
- it "accepts two arguments"
138
-
139
- it "accepts one argument and returns the vector's norm"
140
-
141
- it "should be symmetric"
142
-
143
- it "should return the correct value"
144
-
145
- it "shouldn't work with vectors of different length"
146
- end
147
-
148
- describe "Max-min distance" do
149
- it "accepts two arguments"
150
-
151
- it "accepts one argument and returns the vector's norm"
152
-
153
- it "should be symmetric"
154
-
155
- it "should return the correct value"
156
-
157
- it "shouldn't work with vectors of different length"
158
- end
159
- end
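Reviewer note: the removed spec pinned down three behaviours of the Euclidean distance: a single argument yields the vector's norm, two arguments yield the distance, and mismatched lengths raise an error. A minimal sketch of a function with those behaviours follows. `euclidean_sketch` is a hypothetical name used only for illustration and may differ from how the gem implements `Measurable.euclidean`.

```ruby
# Hypothetical illustration (not the gem's code) of the behaviours the
# removed spec asserted for the Euclidean distance.
def euclidean_sketch(u, v = nil)
  # With one argument, measure the distance to the origin, i.e. the norm.
  v ||= Array.new(u.size, 0)
  raise ArgumentError, "vectors must have the same length" if u.size != v.size

  Math.sqrt(u.zip(v).sum { |a, b| (a - b)**2 })
end

puts euclidean_sketch([3, 4])                # norm of [3, 4]
puts euclidean_sketch([1, 3, 16], [1, 4, 16]) # distance between the spec's @u and @v
```

Passing a third positional argument raises `ArgumentError` through Ruby's own arity check, which matches the "accepts two arguments" example in the removed spec.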