fathom 0.2.3 → 0.3.0

data/README.md CHANGED
@@ -1,292 +1,112 @@
  Fathom
  ======
 
- Introduction
- ------------
+ Welcome to Fathom, a library for building decision support tools. Fathom is the kind of tool you'd want to use when you:
 
- This is a library for decision support. It is useful for recording various types of information, and then combining it in useful ways. As of right now, it's not very useful, but I'm actively working on it again.
+ * want to build a reliable knowledge base for any kind of information
+ * need to simplify a complex problem
+ * have more data than a spreadsheet can comfortably handle
 
- The ideas for this gem are coming from a lot of places:
+ Stability Note
+ ==============
 
- * Judea Pearl's work on causal graphs and belief networks. See [Causality](http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X/ref=sr_1_1?s=books&ie=UTF8&qid=1288840948&sr=1-1) and [Probabilistic Reasoning in Intelligent Systems](http://www.amazon.com/Probabilistic-Reasoning-Intelligent-Systems-Plausible/dp/1558604790/ref=ntt_at_ep_dpi_2)
- * Douglas Hubbard's ideas on decision support. See [How to Measure Anything](http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/0470539399/ref=sr_1_1?ie=UTF8&qid=1288840870&sr=8-1)
- * Ben Klemens' ideas on data analysis. See [Modeling with Data](http://modelingwithdata.org/about_the_book.html)
+ Please note that Fathom is not ready for production at this time. I happen to be using it in production for a few modeling projects, but it is undergoing some major architectural changes that won't stabilize until I release version 0.4, which will contain the Belief Networks, the Knowledge Base API, and a more complete cleanup of the code. One major change that's already in the system is that the simulations and imports all build a knowledge base as a graph, which makes it natural to use the output of one simulation as the input of a different model (see the sketch below).
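A minimal sketch of that chaining, using the new MCNode from this release (the names are illustrative, and the seeding API may still shift before 0.4):

    require 'rubygems'
    require 'fathom'
    include Fathom

    # A tiny simulation; each result key becomes a DataNode child on the graph
    margin_model = MCNode.new(:name => "Margin Model") { |n| {:gross_margin => rand} }
    margin_model.process(1_000)

    # The generated child node can seed a different model, like any other node
    after_tax = ValueDescription.new(margin_model.gross_margin) do |sample|
      {:taxed_margin => sample.gross_margin * 0.7}
    end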
 
- To build useful decision support environments, there are three things that need to be in place:
 
- * Data needs to be gathered or referenced
- * Models need to be developed for the data
- * Data and models need to be presented in context
+ Inspiration for the Project
+ ---------------------------
 
- Setting up the data and models starts with a decoupled Ruby library. I'll give it a web service API so that a server could be setup for simple systems. The decoupled library can also be used as consumers on a message queue system for larger installations.
+ The ideas for this gem are coming from a lot of places:
 
- Keeping the data and models in context is more of a user interface question, which I'll build in another library. I'm considering hosting that solution myself and just making it available publicly. We'll see after all the core ideas are gathered.
+ * Judea Pearl's work on causal graphs and belief networks. See [Causality](http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X/ref=sr_1_1?s=books&ie=UTF8&qid=1288840948&sr=1-1) and [Probabilistic Reasoning in Intelligent Systems](http://www.amazon.com/Probabilistic-Reasoning-Intelligent-Systems-Plausible/dp/1558604790/ref=ntt_at_ep_dpi_2)
+ * Douglas Hubbard's ideas on decision support. See [How to Measure Anything](http://www.amazon.com/How-Measure-Anything-Intangibles-Business/dp/0470539399/ref=sr_1_1?ie=UTF8&qid=1288840870&sr=8-1)
+ * Ben Klemens' ideas on data analysis. See [Modeling with Data](http://modelingwithdata.org/about_the_book.html)
 
- Fathom Basics
- -------------
+ The goals of this project are:
 
- Enrico Fermi [said](http://www.lucidcafe.com/library/95sep/fermi.html):
- There are two possible outcomes: if the result confirms the hypothesis, then you've made a measurement.
- If the result is contrary to the hypothesis, then you've made a discovery.
+ * Build a decoupled library with Ruby and the GSL
+ * Make it easy to gather information of all types
+ * Add tools to analyze the integration of knowledge
 
- To put together a hypothesis, we gather what we know about our problem:
+ Decoupled Library with Ruby and the GSL
+ ---------------------------------------
 
- * What is the decision we are making?
- * What are the consequences of the decision?
- * What do we know now?
- * How do we order the data we have?
- * How can we express this in ranges?
+ I use Ruby because it's very flexible and fast to write. I use the GSL because it's very fast, robust, and well-tested. I decouple the library so that it's easy to use the parts that are worthwhile for your problem and ignore the rest.
 
- If we have a lot of clarity about what we're after, it's easier to gather data and build worthwhile models. It's probably a good idea to start with PlausibleRange:
+ Most of the library is about coordinating data nodes. So, an educated guess or the data that came out of a spreadsheet will sit in a particular kind of node. The system then allows us to relate this information. For example, a node that defines the income for a company would relate to the nodes that define the revenue and expenses of the company. All of this is done with some simple Ruby libraries to make things easy to coordinate.
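As a rough sketch (the company nodes here are made up), relating that information looks like this; linking two nodes also defines an accessor for each on the other:

    require 'fathom'
    include Fathom

    income   = Node.new(:name => "Income")
    revenue  = Node.new(:name => "Revenue")
    expenses = Node.new(:name => "Expenses")

    # Wire up the graph; each node gains a method named after the other
    income.add_child(revenue)
    income.add_child(expenses)

    income.revenue   # => the revenue node
    revenue.income   # => the income node, through the registered parent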
 
-     q1_sales = PlausibleRange.new(:min => 10, :max => 20, :hard_lower_bound => 0, :name => "First Quarter Sales")
-     q1_prices = PlausibleRange.new(:min => 10_000, :max => 12_000, :name => "First Quarter Prices")
-     q1_sales_commissions = PlausibleRange.new(:min => 0.2, :max => 0.2, :name => "Sales Commission Rate")
-
- We can combine these ranges in a ValueDescription:
+ The statistics that we run are usually run on a GSL::Vector, using various random number generators from the GSL. This is really the heart of what the GSL does for Fathom. A lot of the other tools may be used by some plugins for Fathom, and anyone doing open source data analysis on their computer probably already has a good installation of the GSL.
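For instance, with the Ruby/GSL bindings on their own (a sketch, nothing Fathom-specific here):

    require 'gsl'

    rng = GSL::Rng.alloc                  # the default Mersenne Twister generator
    samples = GSL::Vector.alloc(10_000)
    10_000.times { |i| samples[i] = rng.gaussian(1.0) }

    samples.mean    # => close to 0.0
    samples.sd      # => close to 1.0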
 
-     q1_gross_margins = ValueDescription.new(q1_sales, q1_prices, q1_sales_commissions) do |random_sample|
-       revenue = (random_sample.first_quarter_sales * random_sample.first_quarter_prices)
-       commissions_paid = random_sample.sales_commission_rate * revenue
-       gross_margins = revenue - commissions_paid
-       {:revenue => revenue, :commissions_paid => commissions_paid, :gross_margins => gross_margins}
-     end
-
- A ValueDescription can take the ranges and combine them with a block of code. Here, we sample sales, prices and commission rates to get revenues, commissions paid, and gross margins. We can then use Monte Carlo methods to model our system:
+ There are two bindings for Ruby and the GSL that I know of: [Ruby/GSL](http://rb-gsl.rubyforge.org/) and [ruby-gsl](https://github.com/codahale/ruby-gsl). Ruby/GSL is still maintained and uses the syntax that we're using here.
 
-     sales_model = MonteCarloSet.new(q1_gross_margins)
-     sales_model.process(10_000)
-     sales_model.revenue.mean
-     sales_model.revenue.sd
-     sales_model.gross_margins.mean
-     sales_model.gross_margins.sd
-
- Here, we are able to run 10,000 random samples to get an idea of how our system interacts. Notice how the methods get generated in the different objects:
+ The decoupling is based on making every file as stand-alone as possible. What that means is we use autoload quite a bit in the library. Each file references the main lib/fathom.rb file, so that it can autoload any dependencies it may have. Also, most files have a test at the bottom to see if they are being run from the command line. The goal is to make it fairly easy to run a single class in a message queue system, say, or as part of another script to get data into our knowledge base. In this sense, we're trying to follow some of the design principles of Unix: simple scripts that do one thing well and can combine with other scripts.
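The skeleton of a typical file looks something like this (a sketch of the pattern; SomeNode is a placeholder, not a real class):

    require File.expand_path(File.join(File.dirname(__FILE__), '..', 'fathom'))

    class Fathom::SomeNode < Node
      # The class body can reference other Fathom constants freely;
      # lib/fathom.rb autoloads them on first use.
    end

    if __FILE__ == $0
      # Run directly, the file can do one small job on its own
      Fathom::SomeNode.new
    end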
 
- * The ValueDescription converts the name to a lower case, underscore-joined name (E.g. Sales Commission Rate becomes sales_commission_rate).
- * The MonteCarloSet uses the keys from the return value in the ValueDescription block to generate method names
+ Gathering Information
+ ---------------------
 
- At this point, everything is using a normal Gaussian distribution. Since Fathom uses the GNU Scientific Library, there are many other distributions we will incorporate into our library.
+ Decision support is about making rational choices from the information available. This means that we do several things (see the sketch after this list):
 
- If you start with data instead of data ranges, you can use a DataNode instead:
+ * Make it fairly easy to load spreadsheet data in a CSV format
+ * Add support for listing assumptions in a YAML file
+ * Make it possible to link to RDF data and all of the external context that can be useful
+ * Create links to richer data, such as ERP installations, databases, and web crawlers
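The CSV and YAML items on that list already work. A quick sketch (the file paths are made up):

    require 'fathom'
    include Fathom

    # Each column becomes a DataNode, attached to an ImportNode and the knowledge base
    import = CSVImport.import(:content => "path/to/sales_data.csv")
    import.children    # => the imported DataNodes

    # YAML entries with min and max become PlausibleRanges; arrays become DataNodes
    YAMLImport.import(:content => "path/to/assumptions.yml")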
 
-     q1_sales = DataNode.new(:name => "First Quarter Sales", :values => [10,11,15,9])
-
- A DataNode can also be used in a ValueDescription.
+ The data we gather is organized as a bunch of nodes in a graph. We also try to create ranges or probability distributions for everything, so that quantities are comparable and can be combined fairly easily. We also always know explicitly what our uncertainty is, so that we're not misleading ourselves. Importing conflicting data should be manageable if you're careful to document what you're loading.
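For example (carried over from the 0.2.x README), an educated guess is captured as a PlausibleRange with explicit bounds, while hard data sits in a DataNode:

    q1_sales = PlausibleRange.new(:min => 10, :max => 20,
      :hard_lower_bound => 0, :name => "First Quarter Sales")

    q1_history = DataNode.new(:name => "Past Quarter Sales", :values => [10, 11, 15, 9])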
 
- Sometimes it's easier to load data from other sources, such as a spreadsheet:
+ Also, we keep track of metadata. So, specific decisions are described in the knowledge base, as are references to other material, or other ways to describe what we're doing. In this way, we should be able to work on convincing the appropriate people (employees, executives, review boards) that we've made sound decisions and that we should commit resources to execute the plan.
 
-     sales_data = CSVImport.new(:content => "path/to/sales_data.csv")
-     sales_data.import
-
- This reads the sales_data file and imports a DataNode for each column. The spreadsheet is expected to look something like this:
+ Tools for Analysis and Integration
+ ----------------------------------
 
-     First Quarter Sales,First Quarter Prices
-     10,12000
-     11,11500
-     15,10000
-     9,12000
+ The integration tools in Fathom are:
 
- The nodes are then generated and stored in the knowledge base. Right now, this is just an in-memory hash stored in Fathom.knowledge_base
+ * Monte Carlo Simulations
+ * Agent Based Models
+ * Belief Networks
+ * Causal Graphs
 
- You can also use YAML files to import data. Given the following YAML data:
+ These tools allow us to combine information in our knowledge base and run simulations or update beliefs to be able to see the larger perspective. These tools also offer insight into areas where more information should have the most return on investment. So, given a fairly limited amount of information, we can draw conclusions about what's going on, and pinpoint areas where we can refine our models and get more certain results.
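The Monte Carlo piece, for example, combines ranges through a ValueDescription and samples the result. This example is carried over from the 0.2.x README (the new MCNode wraps the same idea in a node):

    q1_sales = PlausibleRange.new(:min => 10, :max => 20, :hard_lower_bound => 0, :name => "First Quarter Sales")
    q1_prices = PlausibleRange.new(:min => 10_000, :max => 12_000, :name => "First Quarter Prices")
    q1_commissions = PlausibleRange.new(:min => 0.2, :max => 0.2, :name => "Sales Commission Rate")

    q1_gross_margins = ValueDescription.new(q1_sales, q1_prices, q1_commissions) do |random_sample|
      revenue = random_sample.first_quarter_sales * random_sample.first_quarter_prices
      commissions_paid = random_sample.sales_commission_rate * revenue
      {:revenue => revenue, :commissions_paid => commissions_paid,
       :gross_margins => revenue - commissions_paid}
    end

    sales_model = MonteCarloSet.new(q1_gross_margins)
    sales_model.process(10_000)
    sales_model.gross_margins.mean
    sales_model.print_summary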
 
-     CO2 Emissions:
-       min: 1_000_000
-       max: 1_000_000_000
+ We are also adding Apophenia to the library. We'll be able to build data analysis tools outside of Fathom and then bring their results into our knowledge base and integrate them with the bigger picture. In this way, we'll be able to do all sorts of statistical analysis, machine learning, data mining, and other information-generating tasks without making Fathom too complicated.
 
-     CO2 Readings:
-       - 10
-       - 20
-       - 30
+ Apophenia is the pragmatic analyst's dream. It's a C-based library that uses the GSL to do its analysis very quickly. It uses SQLite for data management, which makes in-memory set operations fast. It stores the models in consistent ways, so it won't be hard to use this information inside of Fathom.
 
- You can load the nodes with:
+ There are also other kinds of external libraries whose analysis could be brought into Fathom through the import tools.
 
-     yaml_nodes = YAMLImport.new('path/to/yaml/file')
-     yaml_nodes.import
-
- This will create a PlausibleRange for CO2 Emissions and a DataNode for CO2 Readings.
+ The ultimate integration tool for Fathom is the web service. We are exposing access to all of the tools through a RESTful, JSON interface so that Fathom can be part of any sort of application. We also expect to publish basic HTML support for these same functions so that users can input and read their knowledge base without too much trouble.
 
- To use imported data in a ValueDescription, just reference this knowledge base:
+ Further Information
+ -------------------
 
-     ValueDescription.new(Fathom.knowledge_base['First Quarter Sales'], Fathom.knowledge_base['First Quarter Prices']) do
-       ...
-     end
+ * You can use our [Wiki](https://github.com/davidrichards/fathom/wiki) to get code examples and see how things are coming along.
+ * You can go to the [TODO](https://github.com/davidrichards/fathom/blob/master/TODO.md) page to see how current development is mapped out.
+ * You can go to the [Fleet Ventures Blog](http://fleetventures.com) to get more in-depth tutorials and commentary about how to use these types of tools in business, as well as a broader perspective on other technologies we use to solve these kinds of problems.
 
- Serial Agent Based Modeling
+ Dependencies and Extensions
  ---------------------------
 
- I have added some basic support for Agent Based Modeling (ABM). Right now, this only supports serial simulations. I will be adding an Agent Cluster, which will allow us to run large simulations asynchronously using EventMachine. Until then, here's a really simple example of how to do things.
-
- First, let's create a couple agents, a Cola and a Consumer:
-
-     class Cola < Agent
-       property :sweetness
-       property :number_sold
-
-       def on_purchase(consumer)
-         self.number_sold += 1
-         log_purchase
-       end
-
-       def on_tick(simulation)
-         self.sweetness = suggest_sweetness
-       end
-
-       def inspect
-         "Cola: sweetness: #{self.sweetness}, sales: #{self.number_sold}"
-       end
-
-       protected
-
-       # This is where the fun is as well. This is an admittedly poor suggestion engine.
-       def suggest_sweetness
-         case purchases.length
-         when *(0..10).to_a
-           self.node_for_sweetness.rand
-         when *(10..50).to_a
-           (self.node_for_sweetness.rand * 0.4) +
-             (average_purchase_sweetness * 0.6)
-         when *(50..250).to_a
-           (self.node_for_sweetness.rand * 0.2) +
-             (average_purchase_sweetness * 0.8)
-         else
-           (self.node_for_sweetness.rand * 0.05) +
-             (average_purchase_sweetness * 0.95)
-         end
-       end
-
-       def average_purchase_sweetness
-         purchases.inject(0.0) {|s, e| s += e} / purchases.length
-       end
-
-       def log_purchase
-         purchases << sweetness
-       end
-
-       def purchases
-         @purchases ||= []
-       end
-     end
-
-     class Consumer < Agent
-       property :sweetness_preference
-
-       attr_reader :simulation
-
-       def on_tick(simulation)
-         @simulation ||= simulation
-         purchase_cola
-       end
-
-       def inspect
-         "Consumer: preferred sweetness: #{self.sweetness_preference}"
-       end
-
-       protected
-       def agents_using_purchase
-         @agents_using_purchase ||= simulation.agents_using_purchase
-       end
-
-       # This is where all the fun happens.
-       def purchase_cola
-         if rand < 0.1
-           agents_using_purchase.rand.on_purchase(self)
-         else
-           distances = agents_using_purchase.map {|agent| [agent, (self.sweetness_preference - agent.sweetness).abs] }
-           sorted_distances = distances.sort {|a, b| a.last <=> b.last }
-           purchased = sorted_distances.first.first
-           purchased.on_purchase(self)
-         end
-       end
-     end
-
- Agents need to do just a few things:
-
- * define their properties
- * define which events they listen to
- * define the behavior we're after for each event
-
- Properties can be whatever you're after. Usually, these are seeded with some knowledge that we're working on in the knowledge base. Declaring a property gives us a getter and a setter for that property, as well as access to the seed objects we use when setting up the agent.
-
- Events are setup by defining a method starting with on_. A consumer responds to on_tick, and the cola responds to on_tick and on_purchase. We setup events with this convention so that it's a little easier to coordinate the traffic amongst the agents and between the agents and the simulation. When we start using EventMachine for agent clusters, it will be more important to have this interface explicitly defined like this so that things don't get confused.
-
- The underlying behavior is where we can have a lot of fun. We can start adopting reinforcement learning techniques, or mimic real-world interactions. For this example, I had the consumer purchase some cola at every tick. Right now, it optimizes for the cola that's nearest its preference for sweetness. You may imagine how fun this would get to introduce different types of consumers, or start mimicking a satisficing algorithm (allow the consumers to make a choice that's good enough, rather than optimal). We could start adding budgets, ages, and proximity to the cola. Once the behaviors and properties are setup, models can be iterated over extensively until the system dynamics are thoroughly explored, or even some prognostic value begins to emerge from the experiments.
-
- To show the whole example, let me give you some configuration data I stored in a YAML file:
-
-     :american_consumer_sweetness_preference:
-       hard_lower_bound: 0
-       hard_upper_bound: 1
-       min: 0.2
-       max: 0.3
-       name: American Consumer Sweetness Preference
-
-     :cola_sweetness_range:
-       hard_lower_bound: 0
-       hard_upper_bound: 1
-
- Also, here is the actual simulation:
-
-     require 'rubygems'
-     require 'fathom'
-     require 'cola'
-     require 'consumer'
-
-     YAMLImport.import(File.expand_path('nodes.yml'))
-
-     @rb_cola = Cola.new(:sweetness => Fathom.kb[:cola_sweetness_range], :number_sold => 0)
-     @ruby_cola = Cola.new(:sweetness => Fathom.kb[:cola_sweetness_range], :number_sold => 0)
-     @american_consumer = Consumer.new(
-       :sweetness_preference => Fathom.kb[:american_consumer_sweetness_preference],
-       :budget => Fathom.kb[:american_cola_budget]
-     )
-
-     @simulation = TickSimulation.new(@rb_cola, @ruby_cola, @american_consumer)
-     @simulation.process(1_000)
-     puts @american_consumer.inspect, @rb_cola.inspect, @ruby_cola.inspect
-
- The output from this experiment looks like this:
-
-     demo_abm : ruby sim.rb
-     Consumer: preferred sweetness: 0.258095065252885
-     Cola: sweetness: 0.362263199218971, sales: 626
-     Cola: sweetness: 0.377573124603715, sales: 374
-
- You can see that our single consumer wanted sweetness rated around 0.25, and ended up purchasing more soda that ended up looking like 0.36. With better goal-seeking behavior, the agents could actually optimize to the consumer's preferences. With some verification of the seed nodes against market data, the simulations could look more and more like the real world.
-
- I've written up an article on our company blog to give a better background to Agent Based Models, which can be [found here](http://fleetventures.com/2010/11/07/agent-based-modeling/).
-
- Future Development
- ------------------
+ This project relies on the [GNU Scientific Library](http://www.gnu.org/software/gsl/) and the [ruby/gsl](http://rb-gsl.rubyforge.org/) bindings for the GSL.
 
- This code is certainly not production ready. There are many things I'll want to add just to have basic Monte Carlo methods up to snuff:
-
- * More distributions to choose from
- * More import methods (RDF, relational databases, no SQL data stores)
- * A persisted knowledge base
- * Configuration on the knowledge base and databases
- * Better visualization with plotutils support and possibly other graphics support
- * Project organization: decision descriptions, owners, sharing
- * Measurement values: use Shannon's entropy and some value calculations to point out which measurements have the highest potential ROI
- * EventMachine to drive agent clusters, as well as possibly other parts of the system
-
- On a bigger level, I still haven't implemented other major ideas:
-
- * System dynamics
- * Belief updating in Causal Graphs
- * Fathom as a Web service
-
- Dependencies
- ------------
-
- This project relies on the GNU Scientific Library and the ruby/gsl bindings for the GSL. It has only minimal extensions to external libraries:
+ Fathom has only minimal extensions to external libraries (a quick sketch follows the list):
 
  * Array responds to rand (so [1,2,3].rand returns a random value from that array)
  * OpenStruct exposes its underlying table, keys, and values
  * FasterCSV has a :strip header converter now
+ * String has the constantize method added to it from the ActiveSupport library
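In practice, those extensions look like this (the method names follow the bullets above; a sketch, assuming the extensions are loaded with the library):

    require 'fathom'

    [1, 2, 3].rand                    # => 1, 2, or 3, chosen at random

    o = OpenStruct.new(:a => 1, :b => 2)
    o.table                           # => {:a => 1, :b => 2}
    o.keys                            # => [:a, :b]
    o.values                          # => [1, 2]

    "Fathom::DataNode".constantize    # => Fathom::DataNode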
+
+ In the future, more optional dependencies will be introduced for parts of the library:
+
+ * EventMachine is one that I'm sure will be added.
+ * RDF.rb and related gems will be used for some of the KnowledgeBase
+ * SQLite will be available for some set operations
+ * One of the key/value data stores will be used for the KnowledgeBase (Riak, CouchDB, MongoDB, Redis, or similar)
 
- In the future, more dependencies will be introduced for parts of the library: EventMachine is one that I'm sure will be added. The goal of this project is to allow a reasonable number of dependencies to make the project performant and useful, but without making it a headache to setup or use with other projects.
+ It should be easy to avoid the parts of the library that use dependencies you don't want to have. The goals for dependencies are:
+
+ * Use the best tool available for the job
+ * Make it easy to avoid the parts of the library that use those dependencies
+
+ For example, the in-memory version of the Knowledge Base will remain available for quick and dirty analysis.
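Which means lookups like these keep working (the names come from the 0.2.x examples):

    Fathom.knowledge_base['First Quarter Sales']   # fetch a node by name
    Fathom.kb[:cola_sweetness_range]               # kb is the shorthand used in the old ABM example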
 
  Note on Patches/Pull Requests
  -----------------------------
@@ -303,6 +123,11 @@ Note on Patches/Pull Requests
  Copyright
  ---------
 
+ * The GSL is released under the [GNU General Public License](http://www.gnu.org/copyleft/gpl.html).
+ * Ruby/GSL is released under the [GNU General Public License](http://www.gnu.org/copyleft/gpl.html).
+ * FasterCSV is released under the [GPL Version 2](http://www.gnu.org/licenses/old-licenses/gpl-2.0.html) license.
+ * Ruby is released under [this license](http://www.ruby-lang.org/en/LICENSE.txt).
+
  Copyright (c) 2010 David Richards
 
  Permission is hereby granted, free of charge, to any person obtaining
data/TODO.md CHANGED
@@ -1,7 +1,7 @@
  TODO
  ====
 
- Reorganizing
+ Reorganizing (0.2.5)
  ------------
 
  I've just made some big refactoring steps regarding the organization of the system and the distributions. To make sure we're there:
@@ -10,7 +10,7 @@ I've just made some big refactoring steps regarding the organization of the syst
  * Finish the discrete ideas, adding size to the node and automatically using that for stats
  * Create the idea of a labeled, multinomial node
  * Add SQLite3 for in-memory set operations for a labeled, multinomial node
- * Add and remove finder methods on nodes for their parents and children
+ * Make sure we are not defining methods on all objects in a class when they should only be set for a single object (see the sketch below).
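The node accessors added in 0.3.0 already work this way by defining the method on a single object's singleton class; the distilled technique from node.rb:

    # Defining the method on the singleton class touches only this object,
    # not every instance of the class.
    def add_accessor_for_node(node)
      (class << self; self; end).module_eval do
        define_method(node.name_sym) { node }
      end
    end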
 
  Also, the general organization of the system could be broken down better:
 
@@ -24,16 +24,7 @@ Also, the general organization of the system could be broken down better:
  * apophenia
  * simulation
 
- MonteCarlo
- ----------
-
- This needs to get a few new features:
-
- * combine with ValueDescription into one node
- * generate nodes for the return values
- * consider a more general simulation framework, in case it needs to be extended, or to use some of the tools that will be added to the ABM stuff
-
- Belief Networks
+ Belief Networks (0.3)
  ---------------
 
  To get these delivered, I need to revisit the edge logic, to make sure it's easy to extend each edge with an object.
@@ -44,13 +35,13 @@ Then:
  * Network propagation
  * Network testing (polytree)
 
- Agent Based Modeling
+ Agent Based Modeling (0.3.1)
  --------------------
 
  * Add parameter-passing standards for callbacks
- * Add EventMachine and async capabilities (Inncluding the cluster idea)
+ * Add EventMachine and async capabilities (Including the cluster idea)
 
- Knowledge Base
+ Knowledge Base (0.4)
  --------------
 
  Probably around here I'll be able to start looking at a persistent knowledge base. I am not sure which way I'll go, but things I'm considering:
@@ -65,7 +56,7 @@ One of the key features needs to be search:
  * possibly a Xapian search index for full-text searching
  * still need a standard query language, depending on what I choose above
 
- Apophenia
+ Apophenia (0.5)
  ---------
 
  I'd like to get Apophenia integrated so that any data model generated there could be combined with the work done here. That means that most of the "hard" data crunching is using some fairly fast tools: C, SQLite3 in memory, Apophenia and the GSL.
@@ -83,7 +74,7 @@ You would go to Apophenia to:
 
  Fathom could feed the information to Apophenia data models. Given a fairly robust knowledge base, this makes a lot of sense.
 
- Import
+ Import (0.6)
  ------
 
  * More robust support for CSV and YAML
@@ -92,7 +83,16 @@ Import
  * Apophenia
  * Web Crawlers
 
- Publication
+ Value of Information (0.7)
+ --------------------------
+
+ * Shannon's Entropy
+ * Integration of nodes with the decisions they support
+ * Economic value of the decision
+ * Economic value of the measurement
+ * Guidance for areas where reduced uncertainty would have the highest ROI
+
+ Publication (1.0)
  -----------
 
  Turning Fathom into a better tool for publishing knowledge, there are a few major parts to add:
data/VERSION CHANGED
@@ -1 +1 @@
- 0.2.3
+ 0.3.0
data/lib/fathom.rb CHANGED
@@ -6,6 +6,8 @@
  $:.unshift(File.dirname(__FILE__))
  $:.unshift(File.expand_path(File.join(File.dirname(__FILE__), 'fathom')))
 
+ require 'rubygems'
+
  require "gsl"
  require 'options_hash'
@@ -26,11 +28,13 @@ module Fathom
    autoload :ValueAggregator, "value_aggregator"
    autoload :ValueMultiplier, "value_multiplier"
    autoload :MonteCarloSet, "monte_carlo_set"
+   autoload :MCNode, "mc_node"
    autoload :CausalGraph, "causal_graph"
    autoload :DataNode, "data_node"
    autoload :KnowledgeBase, "knowledge_base"
 
    autoload :Import, "import"
+   autoload :ImportNode, "import/import_node"
    autoload :YAMLImport, 'import/yaml_import'
    autoload :CSVImport, 'import/csv_import'
    autoload :RDFImport, 'import/rdf_import'
data/lib/fathom/import.rb CHANGED
@@ -42,23 +42,22 @@ class Fathom::Import
      end
    end
 
-   attr_reader :content, :options
+   attr_reader :content, :options, :import_node
 
    def initialize(opts={})
      @options = OptionsHash.new(opts)
      @content = @options[:content]
+     @import_node = ImportNode.new(opts)
    end
 
    def import
-     results = []
      import_methods.each do |method|
        klass, initialization_data = self.send(method.to_sym)
        initialization_data.each do |values|
-         node = extract_nodes(klass, values)
-         results << node if node
+         extract_nodes(klass, values)
        end
      end
-     results
+     self.import_node
    end
 
    protected
@@ -67,8 +66,10 @@ class Fathom::Import
    def extract_nodes(klass, values)
      begin
        node = klass.new(values)
-       Fathom.knowledge_base[node.name] = node
-       node
+       if node
+         self.import_node.add_child(node)
+         Fathom.knowledge_base[node.name] = node
+       end
      rescue
        nil
      end
data/lib/fathom/import/csv_import.rb CHANGED
@@ -55,7 +55,5 @@ module Fathom
  end
 
  if __FILE__ == $0
-   include Fathom
-   # TODO: Is there anything you want to do to run this file on its own?
-   # CSV.new
+   Fathom::CSVImport.import(:content => ARGV.first)
  end
data/lib/fathom/import/import_node.rb ADDED
@@ -0,0 +1,17 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', '..', 'fathom'))
+ class Fathom::ImportNode < Node
+
+   attr_reader :imported_at
+
+   def initialize(opts={})
+     super(opts)
+     @imported_at = Time.now
+   end
+
+ end
+
+ if __FILE__ == $0
+   include Fathom
+   # TODO: Is there anything you want to do to run this file on its own?
+   # ImportNode.new
+ end
data/lib/fathom/import/yaml_import.rb CHANGED
@@ -49,7 +49,5 @@ class Fathom::YAMLImport < Import
  end
 
  if __FILE__ == $0
-   include Fathom
-   # TODO: Is there anything you want to do to run this file on its own?
-   # YAMLImport.new
+   Fathom::YAMLImport.import(:content => ARGV.first)
  end
data/lib/fathom/mc_node.rb ADDED
@@ -0,0 +1,69 @@
+ require File.expand_path(File.join(File.dirname(__FILE__), '..', 'fathom'))
+ class Fathom::MCNode < Node
+
+   attr_reader :value_description, :samples_taken
+
+   def initialize(opts={}, &block)
+     super(opts)
+     @value_description = opts[:value_description]
+     @value_description ||= block if block_given?
+     raise ArgumentError, "Must provide a value_description from either a parameter or by passing in a block" unless
+       @value_description
+   end
+
+   def process(n=10_000)
+     @samples_taken, @samples = n, {}
+     @samples_taken.times do
+       result = value_description.call(self)
+       store(result)
+     end
+     assert_nodes
+   end
+
+   def reset!
+     @samples_taken, @samples = nil, {}
+     @samples_asserted = false
+   end
+
+   def fields
+     self.children.map {|c| c.name_sym}.compact
+   end
+
+   protected
+
+   def store(result)
+     result = assert_result_hash(result)
+     assert_samples(result)
+     result.each do |key, value|
+       @samples[key.to_sym] << value
+     end
+   end
+
+   def assert_samples(result)
+     return true if @samples_asserted
+     result.each do |k, v|
+       @samples[k.to_sym] ||= []
+     end
+     @samples_asserted = true
+   end
+
+   def assert_result_hash(result)
+     result.is_a?(Hash) ? result : {:result => result}
+   end
+
+   # Assumes the same value description for all samples taken
+   def assert_nodes
+     @samples.each do |key, values|
+       node = DataNode.new(:name => key, :values => values)
+       add_child(node)
+       # self.class.define_summary_method(key)
+     end
+   end
+
+ end
+
+ if __FILE__ == $0
+   include Fathom
+   # TODO: Is there anything you want to do to run this file on its own?
+   # MCNode.new
+ end
data/lib/fathom/monte_carlo_set.rb CHANGED
@@ -48,21 +48,42 @@ class Fathom::MonteCarloSet
      end
    end
 
+   def print_summary
+     print_hash(self.summary)
+   end
+
    protected
 
+   def print_hash(hash, indent=0)
+     hash.each do |key, value|
+       if value.is_a?(Hash)
+         puts "#{' ' * indent}#{key} => {"
+         print_hash(value, indent + 2)
+         puts "#{' ' * indent}}"
+       else
+         puts "#{' ' * indent}#{key} => #{value}"
+       end
+     end
+   end
+
    def summarize_field(field)
      raise "No fields are defined. Have you processed this model yet?" if fields.empty?
      raise ArgumentError, "#{field} is not a field in this set." unless fields.include?(field)
      vector = self.send(field)
      return vector unless vector.is_a?(GSL::Vector)
+     lb = lower_bound(:mean => vector.mean, :sd => vector.sd)
+     lb = vector.min if vector.min > lb
+     ub = upper_bound(:mean => vector.mean, :sd => vector.sd)
+     ub = vector.max if vector.max < ub
+
      {
        :coefficient_of_variation => (vector.sd / vector.mean),
        :max => vector.max,
        :mean => vector.mean,
        :min => vector.min,
        :sd => vector.sd,
-       :upper_bound => upper_bound(:mean => vector.mean, :sd => vector.sd),
-       :lower_bound => lower_bound(:mean => vector.mean, :sd => vector.sd)
+       :upper_bound => ub,
+       :lower_bound => lb
      }
    end
data/lib/fathom/node.rb CHANGED
@@ -1,6 +1,6 @@
  require File.expand_path(File.join(File.dirname(__FILE__), '..', 'fathom'))
  class Fathom::Node
-
+
    attr_reader :name, :distribution, :description, :values
 
    def initialize(opts={})
@@ -22,13 +22,17 @@ class Fathom::Node
 
    def add_parent(parent)
      self.parents << parent
+     self.add_accessor_for_node(parent)
      parent.register_child(self)
    end
 
    def register_child(child)
      raise "Cannot register a child if this node is not a parent already. Use add_parent to the other node or add_child to this node." unless
        child.parents.include?(self)
-     children << child unless children.include?(child)
+     unless children.include?(child)
+       self.add_accessor_for_node(child)
+       children << child
+     end
      true
    end
@@ -38,18 +42,32 @@ class Fathom::Node
    def add_child(child)
      self.children << child
+     self.add_accessor_for_node(child)
      child.register_parent(self)
    end
 
    def register_parent(parent)
      raise "Cannot register a parent if this node is not a child already. Use add_child to the other node or add_parent to this node." unless
        parent.children.include?(self)
-     parents << parent unless parents.include?(parent)
+     unless parents.include?(parent)
+       self.add_accessor_for_node(parent)
+       parents << parent
+     end
      true
    end
 
    protected
 
+   def add_accessor_for_node(node)
+     return false unless node.is_a?(Node) and node.name_sym
+     return false if self.respond_to?(node.name_sym)
+     (class << self; self; end).module_eval do
+       define_method node.name_sym do
+         node
+       end
+     end
+   end
+
    def assert_links(opts)
      found = opts[:parents]
      found ||= opts[:parent]
data/spec/fathom/import/csv_import_spec.rb CHANGED
@@ -22,15 +22,19 @@ describe CSVImport
      lambda{CSVImport.new(@opts)}.should_not raise_error
    end
 
+   it "should return the ImportNode as the result" do
+     @result.should be_a(ImportNode)
+   end
+
    it "should create as many data nodes as there are columns" do
-     @result.size.should eql(3)
-     @result.each {|dn| dn.should be_a(DataNode)}
+     @result.children.size.should eql(3)
+     @result.children.each {|dn| dn.should be_a(DataNode)}
    end
 
    it "should import the values from each column into each data node" do
-     @result[0].values.should eql([1,4,7])
-     @result[1].values.should eql([2,5,8])
-     @result[2].values.should eql([3,6,9])
+     @result.this.values.should eql([1,4,7])
+     @result.and.values.should eql([2,5,8])
+     @result.that.values.should eql([3,6,9])
    end
 
    it "should store the imported values in the knowledge base" do
data/spec/fathom/import/import_node_spec.rb ADDED
@@ -0,0 +1,10 @@
+ require File.expand_path(File.dirname(__FILE__) + '/../../spec_helper')
+
+ include Fathom
+
+ describe ImportNode do
+   it "should record the time of the import" do
+     i = ImportNode.new
+     i.imported_at.should be_close(Time.now, 0.01)
+   end
+ end
data/spec/fathom/import/yaml_import_spec.rb CHANGED
@@ -17,24 +17,26 @@ describe YAMLImport
      lambda{YAMLImport.new(@opts)}.should_not raise_error
    end
 
+   it "should create an ImportNode as a return value" do
+     @result.should be_an(ImportNode)
+   end
+
    it "should create PlausibleRange nodes for any hashes with at least a min and max key in it" do
-     @result.find {|r| r.name == "CO2 Emissions"}.should_not be_nil
+     @result.co2_emissions.should_not be_nil
    end
 
    it "should not create a PlausibleRange for entries missing min and max" do
-     @result.find {|r| r.name == "Invalid Hash"}.should be_nil
+     @result.should_not respond_to(:invalid_hash)
    end
 
    it "should be able to create a PlausibleRange with more complete information" do
-     more_complete_range = @result.find {|r| r.name == "More Complete Range"}
-     more_complete_range.ci.should eql(0.6)
-     more_complete_range.description.should eql('Some good description')
+     @result.more_complete_range.ci.should eql(0.6)
+     @result.more_complete_range.description.should eql('Some good description')
    end
 
    it "should create DataNodes for entries that have an array of information" do
-     data_node = @result.find {|r| r.name == 'CO2 Readings'}
-     data_node.should be_a(DataNode)
-     data_node.values.should eql([10,20,30])
+     @result.co2_readings.should be_a(DataNode)
+     @result.co2_readings.values.should eql([10,20,30])
    end
 
    it "should store the imported values in the knowledge base" do
data/spec/fathom/import_spec.rb CHANGED
@@ -23,4 +23,14 @@ describe Import
      Import.should be_respond_to(:import)
    end
 
+   it "should create an import node to attach its imports to in the knowledge base" do
+     @i.import_node.should be_a(ImportNode)
+   end
+
+   it "should pass its options to the import node for that node to record the parts it is interested in." do
+     i = Import.new(:name => "New Import", :content => @content, :description => "This gets passed along too.")
+     i.import_node.name.should eql("New Import")
+     i.import_node.description.should eql("This gets passed along too.")
+   end
+
  end
data/spec/fathom/mc_node_spec.rb ADDED
@@ -0,0 +1,66 @@
+ require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+ include Fathom
+
+ describe MCNode do
+
+   before(:all) do
+     @fields = [:value_result]
+   end
+
+   before do
+     @vd = lambda{|n| {:value_result => 1}}
+     @mcn = MCNode.new(:value_description => @vd)
+   end
+
+   it "should be a type of Node" do
+     MCNode.ancestors.should be_include(Fathom::Node)
+   end
+
+   it "should use a value_description from the command-line arguments" do
+     mcn = MCNode.new(:value_description => @vd)
+     mcn.value_description.should eql(@vd)
+   end
+
+   it "should be able to take a block instead of a named lambda for the value description" do
+     mcn = MCNode.new {|n| {:value_result => 1}}
+     mcn.value_description.should be_a(Proc)
+   end
+
+   it "should require a value_description from either a parameter or a block passed in" do
+     lambda{MCNode.new}.should raise_error(/value_description/)
+   end
+
+   it "should process with the default number of runs at 10,000", :slow => true do
+     lambda{@mcn.process}.should_not raise_error
+     @mcn.samples_taken.should eql(10_000)
+   end
+
+   it "should call the value_description block each time it is processed" do
+     @vd.should_receive(:call).exactly(3).times.with(@mcn).and_return({:value_result => 1})
+     @mcn.process(3)
+   end
+
+   it "should define children nodes for all keys in the result set" do
+     @mcn.process(1)
+     @mcn.value_result.should be_a(Node)
+     @mcn.value_result.values.should eql([1])
+     @mcn.value_result.vector.should be_a(GSL::Vector)
+   end
+
+   it "should be resetable" do
+     @mcn.process(1)
+     @mcn.reset!
+     lambda{@mcn.process(1)}.should_not raise_error
+   end
+
+   it "should expose the fields from the samples" do
+     @mcn.process(1)
+     sort_array_of_symbols(@mcn.fields).should eql(@fields)
+   end
+
+
+ end
+ def sort_array_of_symbols(array)
+   array.map {|e| e.to_s}.sort.map {|e| e.to_sym}
+ end
data/spec/fathom/monte_carlo_set_spec.rb CHANGED
@@ -101,6 +101,47 @@ describe MonteCarloSet
      lambda{mcs.process(2)}.should_not raise_error
      lambda{mcs.summary}.should_not raise_error
    end
+
+   it "should set the lower bound in the summary to be no less than the minimum (when hard_lower_bound truncates the curve)" do
+     @q1_sales = PlausibleRange.new(:min => 0, :max => 2, :hard_lower_bound => 0, :name => "First Quarter Sales")
+     @q1_prices = PlausibleRange.new(:min => 1, :max => 1, :name => "First Quarter Prices")
+     @q1_sales_commissions = PlausibleRange.new(:min => 0.2, :max => 0.2, :name => "Sales Commission Rate")
+
+     @q1_gross_margins = ValueDescription.new(@q1_sales, @q1_prices, @q1_sales_commissions) do |random_sample|
+       revenue = (random_sample.first_quarter_sales * random_sample.first_quarter_prices)
+       commissions_paid = random_sample.sales_commission_rate * revenue
+       gross_margins = revenue - commissions_paid
+       {:revenue => revenue, :commissions_paid => commissions_paid, :gross_margins => gross_margins}
+     end
+     @mcs = MonteCarloSet.new(@q1_gross_margins)
+     @mcs.process(5)
+     # This is an environment where the lower bound would usually be below the minimum.
+     # So, the minimum adheres to the hard_lower_bound constraints in the plausible range (tested elsewhere)
+     # and now we're expecting the lower bound here to reflect an actual minimum, or a 5% confidence interval,
+     # whichever is higher.
+     (@mcs.summary[:revenue][:lower_bound] >= @mcs.revenue.min).should be_true
+   end
+
+   it "should set the upper bound in the summary to be no more than the maximum (when hard_upper_bound truncates the curve)" do
+     @q1_sales = PlausibleRange.new(:min => 0, :max => 2, :hard_upper_bound => 2, :name => "First Quarter Sales")
+     @q1_prices = PlausibleRange.new(:min => 1, :max => 1, :name => "First Quarter Prices")
+     @q1_sales_commissions = PlausibleRange.new(:min => 0.2, :max => 0.2, :name => "Sales Commission Rate")
+
+     @q1_gross_margins = ValueDescription.new(@q1_sales, @q1_prices, @q1_sales_commissions) do |random_sample|
+       revenue = (random_sample.first_quarter_sales * random_sample.first_quarter_prices)
+       commissions_paid = random_sample.sales_commission_rate * revenue
+       gross_margins = revenue - commissions_paid
+       {:revenue => revenue, :commissions_paid => commissions_paid, :gross_margins => gross_margins}
+     end
+     @mcs = MonteCarloSet.new(@q1_gross_margins)
+     @mcs.process(5)
+     # This is an environment where the upper bound would usually be above the maximum.
+     # So, the maximum adheres to the hard_upper_bound constraints in the plausible range (tested elsewhere)
+     # and now we're expecting the upper bound here to reflect an actual maximum, or a 95% confidence interval,
+     # whichever is lower.
+     (@mcs.summary[:revenue][:upper_bound] <= @mcs.revenue.max).should be_true
+   end
+
  end
 
  def sort_array_of_symbols(array)
data/spec/fathom/node_spec.rb CHANGED
@@ -125,5 +125,31 @@ describe Node
      n2 = Node.new
      lambda{n2.register_child(n1)}.should raise_error
    end
+
+   it "should define an accessor method for the child node when added" do
+     n1 = Node.new(:name => 'n1')
+     n2 = Node.new :name => 'n2', :child => n1
+     n2.n1.should eql(n1)
+     n1.should_not respond_to(:n1)
+   end
+
+   it "should define an accessor method for the parent node when added" do
+     n1 = Node.new(:name => 'n1')
+     n2 = Node.new :parent => n1
+     n2.n1.should eql(n1)
+     n1.should_not respond_to(:n1)
+   end
+
+   it "should have defined an accessor method for the added child to the parent" do
+     n1 = Node.new(:name => 'n1')
+     n2 = Node.new :name => 'n2', :child => n1
+     n1.n2.should eql(n2)
+   end
+
+   it "should have defined an accessor method for the added parent to the child" do
+     n1 = Node.new(:name => 'n1')
+     n2 = Node.new :name => 'n2', :parent => n1
+     n1.n2.should eql(n2)
+   end
 
  end
metadata CHANGED
@@ -1,13 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: fathom
  version: !ruby/object:Gem::Version
- hash: 17
+ hash: 19
  prerelease: false
  segments:
  - 0
- - 2
  - 3
- version: 0.2.3
+ - 0
+ version: 0.3.0
  platform: ruby
  authors:
  - David
@@ -15,7 +15,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2010-11-10 00:00:00 -07:00
+ date: 2010-11-15 00:00:00 -07:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
@@ -79,9 +79,11 @@ files:
  - lib/fathom/ext/string.rb
  - lib/fathom/import.rb
  - lib/fathom/import/csv_import.rb
+ - lib/fathom/import/import_node.rb
  - lib/fathom/import/yaml_import.rb
  - lib/fathom/inverter.rb
  - lib/fathom/knowledge_base.rb
+ - lib/fathom/mc_node.rb
  - lib/fathom/monte_carlo_set.rb
  - lib/fathom/node.rb
  - lib/fathom/numeric_methods.rb
@@ -102,9 +104,11 @@ files:
  - spec/fathom/distributions/uniform_spec.rb
  - spec/fathom/enforced_name_spec.rb
  - spec/fathom/import/csv_import_spec.rb
+ - spec/fathom/import/import_node_spec.rb
  - spec/fathom/import/yaml_import_spec.rb
  - spec/fathom/import_spec.rb
  - spec/fathom/knowledge_base_spec.rb
+ - spec/fathom/mc_node_spec.rb
  - spec/fathom/monte_carlo_set_spec.rb
  - spec/fathom/node_spec.rb
  - spec/fathom/numeric_methods_spec.rb
@@ -161,9 +165,11 @@ test_files:
  - spec/fathom/distributions/uniform_spec.rb
  - spec/fathom/enforced_name_spec.rb
  - spec/fathom/import/csv_import_spec.rb
+ - spec/fathom/import/import_node_spec.rb
  - spec/fathom/import/yaml_import_spec.rb
  - spec/fathom/import_spec.rb
  - spec/fathom/knowledge_base_spec.rb
+ - spec/fathom/mc_node_spec.rb
  - spec/fathom/monte_carlo_set_spec.rb
  - spec/fathom/node_spec.rb
  - spec/fathom/numeric_methods_spec.rb