anomaly 0.0.2 → 0.0.3

data/README.md CHANGED
@@ -16,7 +16,7 @@ And then execute:
  bundle install
  ```

- For max performance (about 3x faster), also install the NArray gem:
+ For max performance (~ 2x faster), also install the NArray gem:

  ```ruby
  gem "narray"
@@ -26,23 +26,32 @@ Anomaly will automatically detect it and use it.

  ## How to Use

- Train the detector with **only non-anomalies**. Each row is a sample.
+ Say we have weather data for sunny days and we're trying to detect days that aren't sunny. The data looks like:

  ```ruby
- train_data = [
-   [0.1, 100, 1.4],
-   [0.2, 101, 2.1],
-   [0.5, 102, 1.6]
+ # Each row is a different day.
+ # [temperature (°F), humidity (%), pressure (in)]
+ weather_data = [
+   [85, 68, 10.4],
+   [88, 62, 12.1],
+   [86, 64, 13.6],
+   ...
  ]
- ad = Anomaly::Detector.new(train_data)
+ ```
+
+ Train the detector with **only non-anomalies** (sunny days in our case).
+
+ ```ruby
+ ad = Anomaly::Detector.new(weather_data)
  ```

  That's it! Let's test for anomalies.

  ```ruby
- test_sample = [1.0, 100, 1.4]
+ # 79°F, 66% humidity, 12.3 in. pressure
+ test_sample = [79, 66, 12.3]
  ad.probability(test_sample)
- # => 0.0007328491480297603
+ # => 7.537174740907633e-08
  ```

  **Super-important:** You must select a threshold for anomalies (which we denote with ε - "epsilon")
@@ -52,15 +61,13 @@ Probabilities less than ε are considered anomalies. If ε is higher, more thing
  ``` ruby
  ad.anomaly?(test_sample, 1e-10)
  # => false
- ad.anomaly?(test_sample, 0.5)
+ ad.anomaly?(test_sample, 1e-5)
  # => true
  ```

- Here's sample to code to help you find the best ε for your application.
+ The wiki has [sample code](https://github.com/ankane/anomaly/wiki/Home) to help you find the best ε for your application.

- ```ruby
- # TODO
- ```
+ ### Persistence

  You can easily persist the detector to a file or database - it's very tiny.

@@ -76,6 +83,11 @@ File.open("anomaly_detector.dump", "w") {|f| f.write(serialized_ad) }
  ad2 = Marshal.load(File.open("anomaly_detector.dump", "r").read)
  ```

+ ## TODO
+
+ - Train in chunks (for very large datasets)
+ - Multivariate normal distribution (possibly)
+
  ## Contributing

  1. Fork it
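
The README change above defers to the wiki for choosing ε. As a rough illustration of one common approach (not necessarily what the wiki does), the sketch below scores a few candidate thresholds against a small labeled validation set and keeps the one with the best F1 score. `f1_score`, `best_epsilon`, and the validation numbers are hypothetical and made up for this example; in practice each probability would come from `ad.probability(sample)`.

```ruby
# Hypothetical helpers, not part of the anomaly gem.
# scored: array of [probability, is_anomaly] pairs from a labeled validation set.
def f1_score(scored, epsilon)
  tp = fp = fn = 0
  scored.each do |prob, anomaly|
    flagged = prob < epsilon # same rule as the README: below ε means anomaly
    if flagged && anomaly
      tp += 1
    elsif flagged
      fp += 1
    elsif anomaly
      fn += 1
    end
  end
  return 0.0 if tp == 0
  precision = tp.to_f / (tp + fp)
  recall = tp.to_f / (tp + fn)
  2 * precision * recall / (precision + recall)
end

# Returns the candidate threshold with the highest F1 score.
def best_epsilon(scored, candidates)
  candidates.max_by { |eps| f1_score(scored, eps) }
end

# Made-up validation data for illustration only.
validation = [
  [3.2e-4, false],
  [1.1e-4, false],
  [5.0e-5, false],
  [7.5e-8, true],
  [2.0e-9, true]
]
candidates = [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]
epsilon = best_epsilon(validation, candidates)
# epsilon => 1.0e-06
```

F1 is used here because anomalies are usually rare, so plain accuracy would reward a threshold that never flags anything.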
data/Rakefile CHANGED
@@ -2,3 +2,16 @@
  require "bundler/gem_tasks"
  require "rspec/core/rake_task"
  RSpec::Core::RakeTask.new("spec")
+
+ require "benchmark"
+ require "anomaly"
+
+ task :benchmark do
+   data = 1_000_000.times.map{ [rand, rand, rand, rand] }
+
+   Benchmark.bm do |x|
+     x.report { Anomaly::Detector.new(data) }
+     require "narray"
+     x.report { Anomaly::Detector.new(data) }
+   end
+ end
@@ -17,4 +17,5 @@ Gem::Specification.new do |gem|

    gem.add_development_dependency "rake"
    gem.add_development_dependency "rspec", ">= 2.0.0"
+   gem.add_development_dependency "narray"
  end
@@ -2,39 +2,46 @@ module Anomaly
    class Detector

      def initialize(data = nil)
-       @trained = false
+       @m = 0
        train(data) if data
      end

      def train(data)
        if defined?(NMatrix)
          d = NMatrix.to_na(data)
+         @n, @m = d.sizes
          # Convert these to an array for Marshal.dump
          @mean = d.mean(1).to_a
          @std = d.stddev(1).to_a
        else
          # Default to Array, since built-in Matrix does not give us a big performance advantage.
          d = data.to_a
-         cols = d.first.size.times.map{|i| d.map{|r| r[i]}}
+         @m = d.size
+         @n = d.first ? d.first.size : 0
+         cols = @n.times.map{|i| d.map{|r| r[i]}}
          @mean = cols.map{|c| mean(c)}
          @std = cols.each_with_index.map{|c,i| std(c, @mean[i])}
        end
-
        @std.map!{|std| (std == 0 or std.nan?) ? Float::MIN : std}
-
-       # raise "Standard deviation cannot be zero" if @std.find_index{|i| i == 0 or i.nan?}
-
-       @trained = true
      end

      def trained?
-       @trained
+       @m > 0
+     end
+
+     def samples
+       @m
      end

+     # Limit the probability of features to [0,1]
+     # to keep probabilities at same scale.
      def probability(x)
        raise "Train me first" unless trained?
-       raise ArgumentError, "x must have #{@mean.size} elements" if x.size != @mean.size
-       x.each_with_index.map{|a,i| normal_pdf(a, @mean[i], @std[i]) }.reduce(1, :*)
+       raise ArgumentError, "x must have #{@n} elements" if x.size != @n
+       @n.times.map do |i|
+         p = normal_pdf(x[i], @mean[i], @std[i])
+         (p.nan? or p > 1) ? 1 : p
+       end.reduce(1, :*)
      end

      def anomaly?(x, epsilon)
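
To make the pure-Ruby branch of `train` above more concrete, here is a minimal standalone sketch of the per-feature statistics it gathers. The gem's `mean` and `std` helpers are not shown in this diff, so `column_stats` is a hypothetical name, and the use of the sample standard deviation (dividing by m - 1) is an assumption made for illustration.

```ruby
# Standalone sketch; column_stats is a hypothetical name, not the gem's API.
# rows: array of m samples, each an array of n numeric features.
def column_stats(rows)
  m = rows.size
  return [[], []] if m == 0
  cols = rows.transpose # one array per feature; raises IndexError on ragged rows
  means = cols.map { |c| c.sum.to_f / m }
  stds = cols.each_with_index.map do |c, i|
    # Assumption: sample standard deviation (m - 1 in the denominator).
    var = c.sum { |v| (v - means[i])**2 } / (m - 1)
    std = Math.sqrt(var)
    # Mirror the diff: swap zero or NaN for Float::MIN so later division is safe.
    (std == 0 || std.nan?) ? Float::MIN : std
  end
  [means, stds]
end

means, stds = column_stats([[-1, -2], [0, 0], [1, 2]])
# means => [0.0, 0.0]
# stds  => [1.0, 2.0]
```

With a single training row the variance is 0.0 / 0, which is NaN in Ruby, so the `Float::MIN` substitution covers that case as well as constant features.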
@@ -45,10 +52,8 @@ module Anomaly

      SQRT2PI = Math.sqrt(2*Math::PI)

-     # Return 1 (exclude feature) if std ~ 0
      def normal_pdf(x, mean = 0, std = 1)
-       p = 1.0/(SQRT2PI*std)*Math.exp(-((x - mean)**2/(2.0*(std**2))))
-       p.nan? ? 1 : p
+       1/(SQRT2PI*std)*Math.exp(-((x - mean)**2/(2.0*(std**2))))
      end

      # Not used for NArray
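
The clamp that this release moves into `probability` matters because a normal density is not itself a probability: when the standard deviation is small, the density near the mean exceeds 1. The standalone sketch below reuses the formula from the diff to show this. `clamped_probability` is a hypothetical name, and the inputs are chosen only for illustration.

```ruby
# Same formula as normal_pdf in the diff, with the constant inlined.
def normal_pdf(x, mean = 0, std = 1)
  1 / (Math.sqrt(2 * Math::PI) * std) * Math.exp(-((x - mean)**2 / (2.0 * (std**2))))
end

# Hypothetical standalone version of the clamped per-feature product.
def clamped_probability(x, means, stds)
  x.each_index.map do |i|
    p = normal_pdf(x[i], means[i], stds[i])
    (p.nan? || p > 1) ? 1 : p
  end.reduce(1, :*)
end

peak = normal_pdf(0, 0, 0.1)                    # about 3.99, a density above 1
clamped = clamped_probability([0], [0], [0.1])  # => 1

# A constant feature has its std replaced by Float::MIN during training.
# std**2 then underflows to 0.0, so the density is NaN at the mean
# (clamped to 1) and 0.0 everywhere else.
at_mean = clamped_probability([5], [5.0], [Float::MIN])   # => 1
off_mean = clamped_probability([6], [5.0], [Float::MIN])  # => 0.0
```

One consequence worth noting: any deviation on a feature that was constant in training drives the whole product to zero, so such a sample is flagged for every positive ε.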
@@ -1,3 +1,3 @@
  module Anomaly
-   VERSION = "0.0.2"
+   VERSION = "0.0.3"
  end
@@ -25,15 +25,7 @@ describe Anomaly::Detector do
      end
    end

-   context "when one training example" do
-     let(:data) { [[0]] }
-
-     it "returns infinity" do
-       ad.probability([0]).should == 1
-     end
-   end
-
-   context "when data is a matrix" do
+   context "when data is an array" do
      let(:data) { [[-1,-2],[0,0],[1,2]] }
      let(:sample) { [rand, rand] }

@@ -43,4 +35,25 @@ describe Anomaly::Detector do
        prob.should == Anomaly::Detector.new(data).probability(sample)
      end
    end
+
+   context "when lots of samples" do
+     let(:data) { m.times.map{[0]} }
+     let(:m) { rand(100) + 1 }
+
+     it { ad.samples.should == m }
+     it { ad.trained?.should be_true }
+   end
+
+   context "when no samples" do
+     let(:data) { [] }
+
+     it { ad.samples.should == 0 }
+     it { ad.trained?.should be_false }
+   end
+
+   context "when pdf is greater than 1" do
+     let(:data) { 100.times.map{[0]}.push([1]) }
+
+     it { ad.probability([0]).should == 1 }
+   end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: anomaly
  version: !ruby/object:Gem::Version
-   version: 0.0.2
+   version: 0.0.3
  prerelease:
  platform: ruby
  authors:
@@ -13,7 +13,7 @@ date: 2011-12-12 00:00:00.000000000Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: rake
-   requirement: &2160640240 !ruby/object:Gem::Requirement
+   requirement: &2160380920 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
          version: '0'
    type: :development
    prerelease: false
-   version_requirements: *2160640240
+   version_requirements: *2160380920
  - !ruby/object:Gem::Dependency
    name: rspec
-   requirement: &2160639580 !ruby/object:Gem::Requirement
+   requirement: &2160379640 !ruby/object:Gem::Requirement
      none: false
      requirements:
      - - ! '>='
@@ -32,7 +32,18 @@ dependencies:
          version: 2.0.0
    type: :development
    prerelease: false
-   version_requirements: *2160639580
+   version_requirements: *2160379640
+ - !ruby/object:Gem::Dependency
+   name: narray
+   requirement: &2160378180 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ! '>='
+       - !ruby/object:Gem::Version
+         version: '0'
+   type: :development
+   prerelease: false
+   version_requirements: *2160378180
  description: Easy-to-use anomaly detection
  email:
  - andrew@getformidable.com