fathom 0.2.2 → 0.2.3
- data/Gemfile +1 -0
- data/Gemfile.lock +4 -0
- data/TODO.md +140 -0
- data/VERSION +1 -1
- data/lib/fathom.rb +13 -12
- data/lib/fathom/data_node.rb +3 -32
- data/lib/fathom/distributions.rb +8 -0
- data/lib/fathom/distributions/discrete_gaussian.rb +44 -0
- data/lib/fathom/distributions/discrete_uniform.rb +46 -0
- data/lib/fathom/distributions/gaussian.rb +46 -0
- data/lib/fathom/distributions/uniform.rb +35 -0
- data/lib/fathom/{basic_node.rb → enforced_name.rb} +5 -1
- data/lib/fathom/ext/string.rb +27 -0
- data/lib/fathom/inverter.rb +1 -2
- data/lib/fathom/monte_carlo_set.rb +26 -3
- data/lib/fathom/node.rb +90 -0
- data/lib/fathom/numeric_methods.rb +50 -0
- data/lib/fathom/plausible_range.rb +11 -15
- data/spec/fathom/data_node_spec.rb +45 -4
- data/spec/fathom/distributions/discrete_gaussian_spec.rb +64 -0
- data/spec/fathom/distributions/discrete_uniform_spec.rb +0 -0
- data/spec/fathom/distributions/gaussian_spec.rb +64 -0
- data/spec/fathom/distributions/uniform_spec.rb +0 -0
- data/spec/fathom/enforced_name_spec.rb +15 -0
- data/spec/fathom/monte_carlo_set_spec.rb +12 -1
- data/spec/fathom/node_spec.rb +129 -0
- data/spec/fathom/numeric_methods_spec.rb +82 -0
- data/spec/fathom/plausible_range_spec.rb +1 -2
- data/spec/support/dummy_numeric_node.rb +8 -0
- metadata +30 -7
- data/lib/fathom/combined_plausibilities.rb +0 -12
- data/lib/fathom/node_utilities.rb +0 -8
data/Gemfile
CHANGED
data/Gemfile.lock
CHANGED
@@ -5,6 +5,7 @@ GEM
     diff-lcs (1.1.2)
     fastercsv (1.5.3)
     linecache (0.43)
+    macaddr (1.0.0)
     rspec (2.0.1)
       rspec-core (~> 2.0.1)
       rspec-expectations (~> 2.0.1)
@@ -20,6 +21,8 @@ GEM
       ruby-debug-base (~> 0.10.3.0)
     ruby-debug-base (0.10.3)
       linecache (>= 0.3)
+    uuid (2.3.1)
+      macaddr (~> 1.0)
 
 PLATFORMS
   ruby
@@ -28,3 +31,4 @@ DEPENDENCIES
   fastercsv
   rspec
   ruby-debug
+  uuid
data/TODO.md
ADDED
@@ -0,0 +1,140 @@
+TODO
+====
+
+Reorganizing
+------------
+
+I've just made some big refactoring steps regarding the organization of the system and the distributions. To make sure we're there:
+
+* Go back and test the 4 distributions I decided on
+* Finish the discrete ideas, adding size to the node and automatically using that for stats
+* Create the idea of a labeled, multinomial node
+* Add SQLite3 for in-memory set operations for a labeled, multinomial node
+* Add and remove finder methods on nodes for their parents and children
+
+Also, the general organization of the system could be broken down better:
+
+* agent
+* distributions
+* node
+* import
+* causal_graph
+* belief_network
+* knowledge_base
+* apophenia
+* simulation
+
+MonteCarlo
+----------
+
+This needs to get a few new features:
+
+* combine with ValueDescription into one node
+* generate nodes for the return values
+* consider a more general simulation framework, in case it needs to be extended, or to use some of the tools that will be added to the ABM stuff
+
+Belief Networks
+---------------
+
+To get these delivered, I need to revisit the edge logic, to make sure it's easy to extend each edge with an object.
+
+Then:
+
+* CPM brought back from the archive
+* Network propagation
+* Network testing (polytree)
+
+Agent Based Modeling
+--------------------
+
+* Add parameter-passing standards for callbacks
+* Add EventMachine and async capabilities (including the cluster idea)
+
+Knowledge Base
+--------------
+
+Probably around here I'll be able to start looking at a persistent knowledge base. I am not sure which way I'll go, but things I'm considering:
+
+* RDF markup
+* Riak/Redis/Mongo/Couch backend
+* Backend adapters
+* tabular data in an RDBMS (think Apophenia)
+
+One of the key features needs to be search:
+
+* possibly a Xapian search index for full-text searching
+* still need a standard query language, depending on what I choose above
+
+Apophenia
+---------
+
+I'd like to get Apophenia integrated so that any data model generated there could be combined with the work done here. That means that most of the "hard" data crunching is using some fairly fast tools: C, SQLite3 in memory, Apophenia and the GSL.
+
+That would mean that you generally come to Fathom to:
+
+* coordinate the elements of a decision or information discovery project
+* run simulations across their collective knowledge
+* maintain consistent information between information nodes
+
+You would go to Apophenia to:
+
+* build data models from statistical methods
+* generate new sets from grouping, sorting, merging, and dealing with multiple imputation
+
+Fathom could feed the information to Apophenia data models. Given a fairly robust knowledge base, this makes a lot of sense.
+
+Import
+------
+
+* More robust support for CSV and YAML
+* OpenERP
+* RDF
+* Apophenia
+* Web Crawlers
+
+Publication
+-----------
+
+Turning Fathom into a better tool for publishing knowledge, there are a few major parts to add:
+
+Support for Reports:
+
+* Template-based system (possibly MVC as part of the Web Service)
+* LaTeX-enabled
+* Graphs and PDFs through Prawn
+* PDFs, CSVs through Ruport
+
+Web Service:
+
+* Basic CRUD for every node type
+* Traversal of the graph with good search semantics
+* Authentication and Authorization
+* HTML-based interface with strong input capabilities (mind map js, auto-build forms, dynamic forms, etc.)
+* Survey system
+
+Meta Data:
+
+* ontological support for systems approach to research (7 nodes / article)
+* ontological support for references (auto generate citations)
+* cleaner approach to the decision framework
+
+Causal Graphs
+-------------
+
+* Add the "DO" operator to a belief network
+* Add the tests for causality using the DO operator
+
+This stuff gets into things I haven't finished reading yet, but it would be very interesting/important to finish that work and bring it into Fathom. This is all Judea Pearl stuff.
+
+Information Service
+-------------------
+
+Using Seaside, Fathom (through the Web Service), and OpenERP, there are several products I could create using this framework:
+
+* Decision support could become custom integrated to a hosted OpenERP instance, for example.
+* General domain information could be culled and organized and verified so that others could tie in and build only their parts of their decision support framework
+* Think tank support would be a very interesting place to work
+
+All of this could also be coupled with consulting and hosting services.
+
+
data/VERSION
CHANGED
@@ -1 +1 @@
-0.2.
+0.2.3
data/lib/fathom.rb
CHANGED
@@ -11,6 +11,7 @@ require 'options_hash'
 
 require 'ext/open_struct'
 require 'ext/array'
+require 'ext/string'
 
 module Fathom
   lib = File.expand_path(File.dirname(__FILE__))
@@ -19,18 +20,12 @@ module Fathom
   # Autoload classes and modules so that we only load as much of the library as we're using.
   # This allows us to have a fairly large library without taking up a lot of memory unless we need it.
   autoload :Inverter, "inverter"
-  autoload :
+  autoload :Node, "node"
   autoload :PlausibleRange, "plausible_range"
-  autoload :R, "plausible_range"
-  # autoload :LowerBound, "lower_bound"
-  # autoload :UpperBound, "upper_bound"
-  # autoload :Distribution, "distribution"
-  # autoload :DependencyGraph, "dependency_graph"
   autoload :ValueDescription, "value_description"
   autoload :ValueAggregator, "value_aggregator"
   autoload :ValueMultiplier, "value_multiplier"
   autoload :MonteCarloSet, "monte_carlo_set"
-  autoload :CombinedPlausibilities, "combined_plausibilities"
   autoload :CausalGraph, "causal_graph"
   autoload :DataNode, "data_node"
   autoload :KnowledgeBase, "knowledge_base"
@@ -41,8 +36,6 @@ module Fathom
   autoload :RDFImport, 'import/rdf_import'
   autoload :SQLiteImport, 'import/sqlite_import'
 
-  autoload :NodeUtilities, 'node_utilities'
-
   autoload :Simulation, 'simulation'
   autoload :TickMethods, 'simulation/tick_methods'
   autoload :TickSimulation, 'simulation/tick_simulation'
@@ -50,6 +43,17 @@ module Fathom
   autoload :Agent, 'agent'
   autoload :Properties, 'agent/properties'
   autoload :AgentCluster, 'agent/agent_cluster'
+
+  autoload :NumericMethods, 'numeric_methods'
+  autoload :EnforcedName, 'enforced_name'
+
+  autoload :Distributions, 'distributions'
+  module Distributions
+    autoload :Gaussian, 'distributions/gaussian'
+    autoload :Uniform, 'distributions/uniform'
+    autoload :DiscreteGaussian, 'distributions/discrete_gaussian'
+    autoload :DiscreteUniform, 'distributions/discrete_uniform'
+  end
 
   def knowledge_base
     @knowledge_base ||= KnowledgeBase.new
@@ -59,6 +63,3 @@ end
 
 # Temporary
 include Fathom
-def r
-  @r ||= R.new(:min => 1, :max => 10)
-end
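The comment in this hunk says the library relies on `autoload` so that a file is only read when its constant is first used. A minimal sketch of that behavior follows; the `Sketch` module, the `Greeter` class, and the temporary directory are hypothetical names made up for the illustration and are not part of Fathom.

```ruby
require 'tmpdir'

# Hypothetical library directory holding one lazily loaded file.
LIB_DIR = Dir.mktmpdir('autoload_sketch')
File.write(File.join(LIB_DIR, 'greeter.rb'), <<~RUBY)
  module Sketch
    class Greeter
      def greet
        'hello'
      end
    end
  end
RUBY

module Sketch
  # Registers the constant; the file is not read at this point.
  autoload :Greeter, File.join(LIB_DIR, 'greeter.rb')

  # Module#autoload? returns the registered path while the constant is unloaded.
  PATH_BEFORE_USE = autoload?(:Greeter)

  # The first reference to the constant triggers the require.
  GREETING = Greeter.new.greet

  # Once the file has been loaded, autoload? returns nil.
  PATH_AFTER_USE = autoload?(:Greeter)
end
```

The same mechanism explains why the hunk can register the four distribution classes inside `module Distributions` without paying for them until they are referenced.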
data/lib/fathom/data_node.rb
CHANGED
@@ -4,44 +4,15 @@ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'fathom'))
 A DataNode is a node generated from data itself. It stores the data and reveals some statistical
 measurements for the data. It expects an array or vector of values and generates a vector on demand.
 =end
-class Fathom::DataNode
+class Fathom::DataNode < Node
 
-  include
-
-  attr_reader :values, :name, :distribution, :confidence_interval
+  include NumericMethods
 
   def initialize(opts={})
-
+    super(opts)
     raise ArgumentError, "Must provided values: DataNode.new(:values => [...])" unless self.values
-    @name = opts[:name]
-    @distribution = opts[:distribution]
-  end
-
-  alias :ci :confidence_interval
-
-  def vector
-    @vector ||= GSL::Vector.ary_to_gv(self.values)
-  end
-
-  def standard_deviation
-    @standard_deviation ||= vector.sd
-  end
-  alias :sd :standard_deviation
-  alias :std :standard_deviation
-
-  def mean
-    @mean ||= vector.mean
-  end
-
-  def rand
-    rng.gaussian(std) + mean
   end
 
-  protected
-    def rng
-      @rng ||= GSL::Rng.alloc(GSL::Rng::MT19937_1999, Kernel.rand(100_000))
-    end
-
 end
 
 if __FILE__ == $0
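This hunk moves the statistics out of `DataNode`: the class now inherits from `Node`, calls `super(opts)`, and mixes in `NumericMethods`. The contents of `node.rb` and `numeric_methods.rb` are not shown in this part of the diff, so the following is only a sketch of the pattern under stated assumptions. `BaseNodeSketch`, `StatsSketch`, and `DataNodeSketch` are hypothetical names, and plain Ruby arithmetic stands in for the GSL vector calls that were removed.

```ruby
# Hypothetical stand-in for a mixin carrying the memoized statistics.
module StatsSketch
  def mean
    @mean ||= values.sum.to_f / values.size
  end

  # Sample standard deviation (n - 1 denominator).
  def standard_deviation
    @standard_deviation ||= begin
      squared_error = values.sum { |v| (v - mean)**2 }
      Math.sqrt(squared_error / (values.size - 1))
    end
  end
  alias sd standard_deviation
  alias std standard_deviation
end

# Hypothetical base class that owns the shared option handling.
class BaseNodeSketch
  attr_reader :name, :values

  def initialize(opts = {})
    @name = opts[:name]
    @values = opts[:values]
  end
end

# The subclass keeps only its own validation, as in the new DataNode.
class DataNodeSketch < BaseNodeSketch
  include StatsSketch

  def initialize(opts = {})
    super(opts)
    raise ArgumentError, 'Must provide values' unless values
  end
end
```

With this split, any other node class that exposes `values` can include the same mixin and get `mean`, `sd`, and `std` without repeating them.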
data/lib/fathom/distributions/discrete_gaussian.rb
ADDED
@@ -0,0 +1,44 @@
+require File.expand_path(File.join(File.dirname(__FILE__), '..', '..', 'fathom'))
+class Fathom::Distributions::DiscreteGaussian
+  extend Fathom::Distributions::SharedMethods
+  class << self
+    def rng
+      @rng ||= GSL::Rng.alloc(GSL::Rng::MT19937_1999, Kernel.rand(100_000))
+    end
+
+    def rand(sd)
+      (rng.gaussian(sd) / size).floor + 1
+    end
+
+    def inverse_cdf(opts={})
+      mean = opts[:mean]
+      sd = opts[:sd]
+      sd ||= opts[:std]
+      sd ||= opts[:standard_deviation]
+      lower = opts.fetch(:lower, true)
+      lower = false if opts[:upper]
+      confidence_interval = opts.fetch(:confidence_interval, 0.05)
+      value = lower ? GSL::Cdf.gaussian_Pinv(confidence_interval, sd) : GSL::Cdf.gaussian_Qinv(confidence_interval, sd)
+      value + mean
+    end
+    alias :lower_bound :inverse_cdf
+
+    def upper_bound(opts={})
+      inverse_cdf(opts.merge(:lower => false))
+    end
+
+    def interval_values(opts={})
+      confidence_interval = opts.fetch(:confidence_interval, 0.9)
+      bound = (1 - confidence_interval) / 2.0
+      [lower_bound(opts.merge(:confidence_interval => bound)), upper_bound(opts.merge(:confidence_interval => bound))]
+    end
+
+    # If only I had the background to explain what this is....
+    # I want to know how many standard deviations are expressed by the confidence interval
+    # I can then divide the range by this number to get the standard deviation
+    def standard_deviations_under(confidence_interval)
+      GSL::Cdf.gaussian_Qinv((1 - confidence_interval) / 2) * 2
+    end
+  end
+end
+
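The comment above `standard_deviations_under` describes the idea in words: find how many standard deviations a central confidence interval spans, then divide a range by that number to estimate the standard deviation. The sketch below works through that arithmetic without GSL. It assumes a unit Gaussian and inverts the upper-tail probability by bisection on `Math.erfc`; `upper_tail`, `upper_tail_inverse`, and `sd_from_range` are hypothetical helper names, not part of the gem.

```ruby
# Upper-tail probability Q(z) of the unit Gaussian.
def upper_tail(z)
  0.5 * Math.erfc(z / Math.sqrt(2.0))
end

# Inverts Q by bisection; a pure-Ruby stand-in for an inverse upper-tail CDF.
def upper_tail_inverse(q)
  lo = -10.0
  hi = 10.0
  100.times do
    mid = (lo + hi) / 2.0
    # Q decreases as z grows, so a tail that is still too large means z is too small.
    if upper_tail(mid) > q
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Width of the central interval, measured in standard deviations.
def standard_deviations_under(confidence_interval)
  upper_tail_inverse((1 - confidence_interval) / 2.0) * 2
end

# Estimates a standard deviation from a range believed to cover the interval.
def sd_from_range(min, max, confidence_interval = 0.9)
  (max - min) / standard_deviations_under(confidence_interval)
end
```

For a 90% interval each tail holds 5%, the matching z-score is about 1.645, and the interval is therefore about 3.29 standard deviations wide. A range of 1 to 10 at 90% confidence then implies a standard deviation of roughly 9 / 3.29.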
data/lib/fathom/distributions/discrete_uniform.rb
ADDED
@@ -0,0 +1,46 @@
+require File.expand_path(File.join(File.dirname(__FILE__), '..', '..', 'fathom'))
+class Fathom::Distributions::DiscreteUniform
+  extend Fathom::Distributions::SharedMethods
+  class << self
+    def rng
+      @rng ||= GSL::Rng.alloc(GSL::Rng::MT19937_1999, Kernel.rand(100_000))
+    end
+
+    def rand
+      (rng.ugaussian / size).floor + 1
+    end
+
+    def inverse_cdf(opts={})
+      mean = opts[:mean]
+      sd = opts[:sd]
+      sd ||= opts[:std]
+      sd ||= opts[:standard_deviation]
+      lower = opts.fetch(:lower, true)
+      lower = false if opts[:upper]
+      confidence_interval = opts.fetch(:confidence_interval, 0.05)
+      value = lower ? GSL::Cdf.ugaussian_Pinv(confidence_interval) : GSL::Cdf.ugaussian_Qinv(confidence_interval)
+      value + mean
+    end
+    alias :lower_bound :inverse_cdf
+
+    def upper_bound(opts={})
+      inverse_cdf(opts.merge(:lower => false))
+    end
+
+    def interval_values(opts={})
+      confidence_interval = opts.fetch(:confidence_interval, 0.9)
+      bound = (1 - confidence_interval) / 2.0
+      [lower_bound(opts.merge(:confidence_interval => bound)), upper_bound(opts.merge(:confidence_interval => bound))]
+    end
+
+    # If only I had the background to explain what this is....
+    # I want to know how many standard deviations are expressed by the confidence interval
+    # I can then divide the range by this number to get the standard deviation
+    def standard_deviations_under(confidence_interval)
+      GSL::Cdf.ugaussian_Qinv((1 - confidence_interval) / 2) * 2
+    end
+
+
+  end
+end
+
data/lib/fathom/distributions/gaussian.rb
ADDED
@@ -0,0 +1,46 @@
+require File.expand_path(File.join(File.dirname(__FILE__), '..', '..', 'fathom'))
+class Fathom::Distributions::Gaussian
+  extend Fathom::Distributions::SharedMethods
+  class << self
+    def rng
+      @rng ||= GSL::Rng.alloc(GSL::Rng::MT19937_1999, Kernel.rand(100_000))
+    end
+
+    def rand(sd)
+      rng.gaussian(sd)
+    end
+
+    def inverse_cdf(opts={})
+      mean = opts[:mean]
+      sd = opts[:sd]
+      sd ||= opts[:std]
+      sd ||= opts[:standard_deviation]
+      lower = opts.fetch(:lower, true)
+      lower = false if opts[:upper]
+      confidence_interval = opts.fetch(:confidence_interval, 0.05)
+      value = lower ? GSL::Cdf.gaussian_Pinv(confidence_interval, sd) : GSL::Cdf.gaussian_Qinv(confidence_interval, sd)
+      value + mean
+    end
+    alias :lower_bound :inverse_cdf
+
+    def upper_bound(opts={})
+      inverse_cdf(opts.merge(:lower => false))
+    end
+
+    def interval_values(opts={})
+      confidence_interval = opts.fetch(:confidence_interval, 0.9)
+      bound = (1 - confidence_interval) / 2.0
+      [lower_bound(opts.merge(:confidence_interval => bound)), upper_bound(opts.merge(:confidence_interval => bound))]
+    end
+
+    # If only I had the background to explain what this is....
+    # I want to know how many standard deviations are expressed by the confidence interval
+    # I can then divide the range by this number to get the standard deviation
+    def standard_deviations_under(confidence_interval)
+      GSL::Cdf.gaussian_Qinv((1 - confidence_interval) / 2) * 2
+    end
+
+
+  end
+end
+
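`interval_values` in the Gaussian class splits the leftover probability evenly between the two tails and asks the inverse CDF for each bound, shifting the result by the mean. The following sketch reproduces that calculation in plain Ruby so it can be checked by hand. It is not the gem's implementation: the GSL inverse CDF is replaced by bisection on `Math.erf`, and `gaussian_p`, `gaussian_p_inverse`, and the keyword-argument signature are assumptions made for the example.

```ruby
# Lower-tail CDF of a zero-mean Gaussian with standard deviation sd.
def gaussian_p(x, sd)
  0.5 * (1.0 + Math.erf(x / (sd * Math.sqrt(2.0))))
end

# Inverts the lower-tail CDF by bisection; a stand-in for a library inverse CDF.
def gaussian_p_inverse(probability, sd)
  lo = -12.0 * sd
  hi = 12.0 * sd
  200.times do
    mid = (lo + hi) / 2.0
    # The CDF increases with x, so a value that is too small means x is too small.
    if gaussian_p(mid, sd) < probability
      lo = mid
    else
      hi = mid
    end
  end
  (lo + hi) / 2.0
end

# Central interval holding confidence_interval of the probability mass.
def interval_values(mean:, sd:, confidence_interval: 0.9)
  tail = (1 - confidence_interval) / 2.0
  lower = mean + gaussian_p_inverse(tail, sd)
  # The upper-tail inverse at `tail` equals the lower-tail inverse at 1 - tail.
  upper = mean + gaussian_p_inverse(1.0 - tail, sd)
  [lower, upper]
end
```

With a mean of 10 and a standard deviation of 2, the 90% interval is 10 plus or minus about 1.645 × 2, so roughly 6.71 to 13.29.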
data/lib/fathom/distributions/uniform.rb
ADDED
@@ -0,0 +1,35 @@
+require File.expand_path(File.join(File.dirname(__FILE__), '..', '..', 'fathom'))
+class Fathom::Distributions::Uniform
+  extend Fathom::Distributions::SharedMethods
+  class << self
+    def rng
+      @rng ||= GSL::Rng.alloc(GSL::Rng::MT19937_1999, Kernel.rand(100_000))
+    end
+
+    def rand
+      rng.ugaussian
+    end
+
+    def inverse_cdf(opts={})
+      mean = opts[:mean]
+      lower = opts.fetch(:lower, true)
+      lower = false if opts[:upper]
+      confidence_interval = opts.fetch(:confidence_interval, 0.05)
+      value = lower ? GSL::Cdf.ugaussian_Pinv(confidence_interval) : GSL::Cdf.ugaussian_Qinv(confidence_interval)
+      value + mean
+    end
+    alias :lower_bound :inverse_cdf
+
+    def upper_bound(opts={})
+      inverse_cdf(opts.merge(:lower => false))
+    end
+
+    def interval_values(opts={})
+      confidence_interval = opts.fetch(:confidence_interval, 0.9)
+      bound = (1 - confidence_interval) / 2.0
+      [lower_bound(opts.merge(:confidence_interval => bound)), upper_bound(opts.merge(:confidence_interval => bound))]
+    end
+
+  end
+end
+
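In GSL, the `ugaussian` functions refer to the unit Gaussian (mean 0, standard deviation 1), so the `Uniform` class as added here still samples and inverts a Gaussian. For comparison only, here is a sketch of what the same three operations look like for a continuous uniform distribution on a closed range. Everything in it is an assumption: the `min:` and `max:` parameters, the method names, and the use of Ruby's `Random` in place of a GSL generator do not come from the gem.

```ruby
# Inverse CDF of a continuous uniform distribution on [min, max].
# With lower: false the probability is measured from the upper end.
def uniform_inverse_cdf(min:, max:, probability:, lower: true)
  prob = lower ? probability : 1.0 - probability
  min + prob * (max - min)
end

# Central interval holding confidence_interval of the probability mass.
def uniform_interval_values(min:, max:, confidence_interval: 0.9)
  tail = (1 - confidence_interval) / 2.0
  [
    uniform_inverse_cdf(min: min, max: max, probability: tail),
    uniform_inverse_cdf(min: min, max: max, probability: tail, lower: false)
  ]
end

# Draws one sample; Random#rand with no argument returns a float in [0, 1).
def uniform_rand(min:, max:, rng: Random.new)
  min + rng.rand * (max - min)
end
```

Because the uniform CDF is a straight line, no numerical inversion is needed: the 90% interval on a range of 0 to 10 trims 5% of the width from each end.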