cubicle 0.1.1 → 0.1.2

@@ -1,173 +1,175 @@
1
- == Overview
2
- Cubicle is a Ruby library and DSL for automating the generation, execution and caching of common aggregations of MongoDB documents. Cubicle was born from the need to easily extract simple, processed statistical views of raw, real time business data being collected from a variety of systems.
3
-
4
- == Motivation
5
- Aggregating data in MongoDB, unlike in relational or multidimensional (OLAP) databases, requires writing custom reduce functions in JavaScript for the simplest cases and full map reduce functions in more complex ones, even for common aggregations like sums or averages.
6
-
7
- While writing such map reduce functions isn't particularly difficult, it can be tedious and error prone, and it requires switching from Ruby to JavaScript. Cubicle presents a simplified Ruby DSL for generating the JavaScript required for most common aggregation tasks and also handles processing, caching and presenting the results. JavaScript is still required in some cases, but is limited to constructing simple data transformation expressions.
8
-
9
-
10
- == Approach
11
- Cubicle breaks the task of defining and executing aggregation queries into two pieces. The first is the Cubicle, an analysis-friendly 'view' of the underlying collection which defines the attributes that will be used for grouping (dimensions), the numerical fields that will be aggregated (measures), and the kind of aggregation that will be applied to each measure. The second piece of the Cubicle puzzle is a Query, which specifies which particular dimensions or measures will be selected from the Cubicle for a given data request, along with how the resulting data will be filtered, ordered, paginated and organized.
12
-
13
- == Install
14
-
15
- Install the gem with:
16
-
17
- gem install cubicle
18
- or
19
- sudo gem install cubicle
20
-
21
-
22
- == An Example
23
- Given a document with the following structure (I'm using MongoMapper here as the ORM, but MongoMapper, or any other ORM, is not required by Cubicle; it works directly with the Mongo-Ruby Driver):
24
-
25
- class PokerHand
26
- include MongoMapper::Document
27
-
28
- key :match_date, String #we use iso8601 strings for dates, but Time works too
29
- key :table, String
30
- key :winner, Person # {:person=>{:name=>'Jim', :address=>{...}...}}
31
- key :winning_hand, Symbol #:two_of_a_kind, :full_house, etc...
32
- key :amount_won, Float
33
- end
34
-
35
- == The Cubicle
36
- Here's how a Cubicle designed to analyze these poker hands might look:
37
-
38
- class PokerHandCubicle
39
- extend Cubicle
40
-
41
- date :date, :field_name=>'match_date'
42
- dimension :month, :expression=>'this.match_date.substring(0,7)'
43
- dimension :year, :expression=>'this.match_date.substring(0,4)'
44
-
45
- dimensions :table, :winning_hand
46
- dimension :winner, :field_name=>'winner.name'
47
-
48
- count :total_hands, :expression=>'true'
49
- count :total_draws, :expression=>'this.winning_hand=="draw"'
50
- sum :total_winnings, :field_name=>'amount_won'
51
- avg :avg_winnings, :field_name=>'amount_won'
52
-
53
- ratio :draw_pct, :total_draws, :total_hands
54
- end
55
-
56
- == The Queries
58
- And here's how you would use this cubicle to query the underlying data:
59
-
60
- aggregated_data = PokerHandCubicle.query
61
-
62
-
63
- Issuing an empty query to the cubicle like the one above will return a list of measures aggregated according to type for each combination of dimensions. However, once a Cubicle has been defined, you can query it in many different ways. For instance, if you wanted to see the total number of hands by type, you could do this:
64
-
65
- hands_by_type = PokerHandCubicle.query { select :winning_hand, :total_hands }
66
-
67
- Or, if you wanted to see the total amount won with a full house, by player, sorted by amount won, you could do this:
68
-
69
- full_houses_by_player = PokerHandCubicle.query do
70
- select :winner
71
- where :winning_hand=>'full_house'
72
- order_by :total_winnings
73
- end
74
-
75
- Cubicle can return your data in a hierarchy (tree) too, if you want. If you wanted to see the percent of hands resulting in a draw by table by day, you could do this:
76
-
77
- draw_pct_by_table_by_day = PokerHandCubicle.query do
78
- select :draw_pct
79
- by :date, :table
80
- end
81
-
82
- In addition to the basic query primitives such as select, where, by and order_by, Cubicle has a basic understanding of time. As long as you have a dimension in your cubicle defined using 'date', and that dimension is either an iso8601 string or an instance of Time, you can easily perform some handy date filtering in the DSL:
83
-
84
- winnings_last_30_days_by_player = PokerHandCubicle.query do
85
- select :winner, :total_winnings
86
- for_the_last 30.days
87
- end
88
-
89
- or
90
-
91
- winnings_ytd_by_player = PokerHandCubicle.query do
92
- select :winner, :all_measures
93
- year_to_date
94
- order_by [:total_winnings, :desc]
95
- end
96
-
97
- == The Results
98
- Cubicle data is returned either as an array of hashes, for a two dimensional query, or as a hash-based tree whose leaves are arrays of hashes, for hierarchical data (via queries using the 'by' keyword).
99
-
100
- Flat data:
101
- [{:dimension1=>'d1', :dimension2=>'d1', :measure1=>'1.0'},{:dimension1=>'d2'...
102
-
103
- Hierarchical data 2 levels deep:
104
- {'dimension1'=>{'dimension2'=>[{:measure1=>'1.0'}],'dimension2b'=>[{:measure1=>'2.0'}],...
105
-
106
- When you request two dimensional data (i.e. you do not use the 'by' keyword), you can transform your two dimensional data set into hierarchical data at any time using the 'hierarchize' method, specifying the dimensions you want to use in your hierarchy:
107
-
108
- data = MyCubicle.query {select :date, :name, :all_measures}
109
- hierarchized_data = data.hierarchize :name, :date
110
-
111
- This will result in a hash containing each unique value for :name in your source collection, and for each unique :name, a hash containing each unique :date with that :name, and for each :date, an array of hashes keyed by the measures in your Cubicle.
112
-
113
- == Caching & Processing
114
- Map reduce operations, especially over large or very large data sets, can take time to complete. Sometimes a long time. However, very often what you want to do is present a graph or a table of numbers to an interactive user on your website, and you probably don't want to make them wait for all your bazillion rows of raw data to be reduced down to the handful of numbers they are actually interested in seeing. For this reason, Cubicle has two modes of operation: the default mode, in which aggregations are automatically cached until YourCubicle.expire! or YourCubicle.process is called, and transient mode, which bypasses the caching mechanisms and executes real time queries against the raw source data.
115
-
116
- == Preprocessed Aggregations
117
- The expected normal mode of operation, however, is cached mode. While far from anything actually resembling an OLAP cube, Cubicle was designed to process data on some periodic schedule and provide quick access to stored, aggregated data in between each processing, much like a real OLAP cube. Also reminiscent of an OLAP cube, Cubicle will cache aggregations at various levels of resolution, depending on the aggregations that were set up when defining a cubicle and on what queries are executed. For example, if a given Cubicle has three dimensions, Name, City and Date, then when the Cubicle is processed, it will calculate aggregated measures for each combination of values of those three fields. If a query is executed that only requires Name and Date, then Cubicle will aggregate and cache measures by just Name and Date. If a third query asks for just Name, then Cubicle will create an aggregation based on Name alone, but rather than using the original data source with its many rows, it will execute its map reduce against the previously cached Name-Date aggregation, which by definition will have fewer rows and should therefore perform faster. If you know ahead of time which aggregations your queries will need, you can specify them in the Cubicle definition, like this:
118
- class MyCubicle
119
- extend Cubicle
120
- dimension :name
121
- dimension :date
122
- ...
123
- avg :my_measure
124
- ...
125
- aggregate :name, :date
126
- aggregate :name
127
- aggregate :date
128
- end
129
-
130
- When aggregations are specified in this way, then Cubicle will pre-aggregate your data for each of the specified combinations of dimensions whenever MyCubicle.process is called, eliminating the first-hit penalty that would otherwise be incurred when Cubicle encountered a given aggregation for the first time.
131
-
132
- == Transient (Real Time) Queries
133
- Sometimes you may not want to query cached data. In our application, we are using Cubicle to provide data for our performance management Key Performance Indicators (KPIs), which consist of both a historical trend of a particular metric and the current, real time value of the same metric for, say, the current month or a rolling 30 day period. For performance reasons, we fetch our trend, which is usually 12 months, from cached data, but we want up-to-the-minute freshness for our real time KPI values, so we need to query the living source data. To accomplish this using Cubicle, you simply insert 'transient!' into your query definition, like so:
134
-
135
- MyCubicle.query do
136
- transient!
137
- select :this, :that, :the_other
138
- end
139
-
140
- This will bypass cached aggregations and execute a map reduce query directly against the cubicle source collection.
141
-
142
- == Mongo Mapper plugin
143
- If MongoMapper is detected, Cubicle will use its connection to MongoDB. Additionally, Cubicle will install a simple MongoMapper plugin for doing ad-hoc, non-cached aggregations on the fly from a MongoMapper document, like this:
144
- MyMongoMapperModel.aggregate do
145
- dimension :my_dimension
146
- count :measure1
147
- avg :measure2
148
- end.query {order_by [:measure2, :desc]; limit 10;}
149
-
150
- == Limitations
151
- * Cubicle cannot currently cause child documents to be emitted in the map reduce. This is a pretty big limitation, and will be resolved shortly.
152
- * Documentation is non-existent. This is being worked on (heard that one before?)
153
- * Test coverage is OK, but the tests could be better organized
154
- * Code needs to be modularized a bit, main classes are a bit hairy at the moment
155
-
156
-
157
- == Credits
158
-
159
- * Alex Wang, Patrick Gannon for features, fixes & testing
160
-
161
- == Bugs/Issues
162
- Please report them {on github}[http://github.com/plasticlizard/cubicle/issues].
163
-
164
- == Links
165
-
166
- == Todo
167
- * Support for emitting child / descendant documents
168
- * Work with native Date type, instead of just iso strings
169
- * Hirb support
170
- * Member format strings
171
- * Auto gen of a cubicle definition based on existing keys/key types in the MongoMapper plugin
172
- * DSL support for topcount and bottomcount queries
173
- * Support for 'duration' aggregation that will calculate durations between timestamps
1
+ == Overview
2
+ Cubicle is a Ruby library and DSL for automating the generation, execution and caching of common aggregations of MongoDB documents. Cubicle was born from the need to easily extract simple, processed statistical views of raw, real time business data being collected from a variety of systems.
3
+
4
+ == Motivation
5
+ Aggregating data in MongoDB, unlike in relational or multidimensional (OLAP) databases, requires writing custom reduce functions in JavaScript for the simplest cases and full map reduce functions in more complex ones, even for common aggregations like sums or averages.
6
+
7
+ While writing such map reduce functions isn't particularly difficult, it can be tedious and error prone, and it requires switching from Ruby to JavaScript. Cubicle presents a simplified Ruby DSL for generating the JavaScript required for most common aggregation tasks and also handles processing, caching and presenting the results. JavaScript is still required in some cases, but is limited to constructing simple data transformation expressions.
8
+
9
+
10
+ == Approach
11
+ Cubicle breaks the task of defining and executing aggregation queries into two pieces. The first is the Cubicle, an analysis-friendly 'view' of the underlying collection which defines the attributes that will be used for grouping (dimensions), the numerical fields that will be aggregated (measures), and the kind of aggregation that will be applied to each measure. The second piece of the Cubicle puzzle is a Query, which specifies which particular dimensions or measures will be selected from the Cubicle for a given data request, along with how the resulting data will be filtered, ordered, paginated and organized.
12
+
13
+ == Install
14
+
15
+ Install the gem with:
16
+
17
+ gem install cubicle
18
+ or
19
+ sudo gem install cubicle
20
+
21
+
22
+ == An Example
23
+ Given a document with the following structure (I'm using MongoMapper here as the ORM, but MongoMapper, or any other ORM, is not required by Cubicle; it works directly with the Mongo-Ruby Driver):
24
+
25
+ class PokerHand
26
+ include MongoMapper::Document
27
+
28
+ key :match_date, String #we use iso8601 strings for dates, but Time works too
29
+ key :table, String
30
+ key :winner, Person # {:person=>{:name=>'Jim', :address=>{...}...}}
31
+ key :winning_hand, Symbol #:two_of_a_kind, :full_house, etc...
32
+ key :amount_won, Float
33
+ end
34
+
35
+ == The Cubicle
36
+ Here's how a Cubicle designed to analyze these poker hands might look:
37
+
38
+ class PokerHandCubicle
39
+ extend Cubicle
40
+
41
+ date :date, :field_name=>'match_date'
42
+ dimension :month, :expression=>'this.match_date.substring(0,7)'
43
+ dimension :year, :expression=>'this.match_date.substring(0,4)'
44
+
45
+ dimensions :table,
46
+ :winning_hand
47
+ dimension :winner, :field_name=>'winner.name'
48
+
49
+ count :total_hands, :expression=>'true'
50
+ count :total_draws, :expression=>'this.winning_hand=="draw"'
51
+ sum :total_winnings, :field_name=>'amount_won'
52
+ avg :avg_winnings, :field_name=>'amount_won'
53
+
54
+ ratio :draw_pct, :total_draws, :total_hands
55
+ end
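The :month and :year dimensions above are derived by slicing the iso8601 match_date string in JavaScript. The equivalent extraction in plain Ruby, using a hypothetical sample date, looks like this:

```ruby
# Hypothetical iso8601 match_date value, as stored on a PokerHand document
match_date = "2010-03-14T20:15:00Z"

# Ruby equivalents of the JavaScript dimension expressions
# 'this.match_date.substring(0,7)' and 'this.match_date.substring(0,4)'
month = match_date[0, 7]  # => "2010-03"
year  = match_date[0, 4]  # => "2010"
```

Because iso8601 strings sort lexicographically in date order, simple substring slices like these are enough to group by month or year.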
56
+
57
+ == The Queries
59
+ And here's how you would use this cubicle to query the underlying data:
60
+
61
+ aggregated_data = PokerHandCubicle.query
62
+
63
+
64
+ Issuing an empty query to the cubicle like the one above will return a list of measures aggregated according to type for each combination of dimensions. However, once a Cubicle has been defined, you can query it in many different ways. For instance, if you wanted to see the total number of hands by type, you could do this:
65
+
66
+ hands_by_type = PokerHandCubicle.query { select :winning_hand, :total_hands }
67
+
68
+ Or, if you wanted to see the total amount won with a full house, by player, sorted by amount won, you could do this:
69
+
70
+ full_houses_by_player = PokerHandCubicle.query do
71
+ select :winner
72
+ where :winning_hand=>'full_house'
73
+ order_by :total_winnings
74
+ end
75
+
76
+ Cubicle can return your data in a hierarchy (tree) too, if you want. If you wanted to see the percent of hands resulting in a draw by table by day, you could do this:
77
+
78
+ draw_pct_by_table_by_day = PokerHandCubicle.query do
79
+ select :draw_pct
80
+ by :date, :table
81
+ end
82
+
83
+ In addition to the basic query primitives such as select, where, by and order_by, Cubicle has a basic understanding of time. As long as you have a dimension in your cubicle defined using 'date', and that dimension is either an iso8601 string or an instance of Time, you can easily perform some handy date filtering in the DSL:
84
+
85
+ winnings_last_30_days_by_player = PokerHandCubicle.query do
86
+ select :winner, :total_winnings
87
+ for_the_last 30.days
88
+ end
89
+
90
+ or
91
+
92
+ winnings_ytd_by_player = PokerHandCubicle.query do
93
+ select :winner, :all_measures
94
+ year_to_date
95
+ order_by [:total_winnings, :desc]
96
+ end
97
+
98
+ == The Results
99
+ Cubicle data is returned either as an array of hashes, for a two dimensional query, or as a hash-based tree whose leaves are arrays of hashes, for hierarchical data (via queries using the 'by' keyword).
100
+
101
+ Flat data:
102
+ [{:dimension1=>'d1', :dimension2=>'d1', :measure1=>'1.0'},{:dimension1=>'d2'...
103
+
104
+ Hierarchical data 2 levels deep:
105
+ {'dimension1'=>{'dimension2'=>[{:measure1=>'1.0'}],'dimension2b'=>[{:measure1=>'2.0'}],...
106
+
107
+ When you request two dimensional data (i.e. you do not use the 'by' keyword), you can transform your two dimensional data set into hierarchical data at any time using the 'hierarchize' method, specifying the dimensions you want to use in your hierarchy:
108
+
109
+ data = MyCubicle.query {select :date, :name, :all_measures}
110
+ hierarchized_data = data.hierarchize :name, :date
111
+
112
+ This will result in a hash containing each unique value for :name in your source collection, and for each unique :name, a hash containing each unique :date with that :name, and for each :date, an array of hashes keyed by the measures in your Cubicle.
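To make the shape of the hierarchized result concrete, here is a rough pure-Ruby sketch of the grouping; the row data is hypothetical and Cubicle's actual implementation may differ:

```ruby
# Hypothetical flat rows, as returned by a two dimensional query
rows = [
  {:name => "Jim", :date => "2010-03-01", :total_winnings => 5.0},
  {:name => "Jim", :date => "2010-03-02", :total_winnings => 2.5},
  {:name => "Bob", :date => "2010-03-01", :total_winnings => 1.0}
]

# Roughly what hierarchize(:name, :date) does: nest rows by :name,
# then by :date, leaving arrays of measure hashes at the leaves
tree = {}
rows.each do |row|
  measures = row.reject { |k, _| k == :name || k == :date }
  ((tree[row[:name]] ||= {})[row[:date]] ||= []) << measures
end

tree["Jim"]["2010-03-02"]  # => [{:total_winnings=>2.5}]
```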
113
+
114
+ == Caching & Processing
115
+ Map reduce operations, especially over large or very large data sets, can take time to complete. Sometimes a long time. However, very often what you want to do is present a graph or a table of numbers to an interactive user on your website, and you probably don't want to make them wait for all your bazillion rows of raw data to be reduced down to the handful of numbers they are actually interested in seeing. For this reason, Cubicle has two modes of operation: the default mode, in which aggregations are automatically cached until YourCubicle.expire! or YourCubicle.process is called, and transient mode, which bypasses the caching mechanisms and executes real time queries against the raw source data.
116
+
117
+ == Preprocessed Aggregations
118
+ The expected normal mode of operation, however, is cached mode. While far from anything actually resembling an OLAP cube, Cubicle was designed to process data on some periodic schedule and provide quick access to stored, aggregated data in between each processing, much like a real OLAP cube. Also reminiscent of an OLAP cube, Cubicle will cache aggregations at various levels of resolution, depending on the aggregations that were set up when defining a cubicle and on what queries are executed. For example, if a given Cubicle has three dimensions, Name, City and Date, then when the Cubicle is processed, it will calculate aggregated measures for each combination of values of those three fields. If a query is executed that only requires Name and Date, then Cubicle will aggregate and cache measures by just Name and Date. If a third query asks for just Name, then Cubicle will create an aggregation based on Name alone, but rather than using the original data source with its many rows, it will execute its map reduce against the previously cached Name-Date aggregation, which by definition will have fewer rows and should therefore perform faster. If you know ahead of time which aggregations your queries will need, you can specify them in the Cubicle definition, like this:
119
+ class MyCubicle
120
+ extend Cubicle
121
+ dimension :name
122
+ dimension :date
123
+ ...
124
+ avg :my_measure
125
+ ...
126
+ aggregate :name, :date
127
+ aggregate :name
128
+ aggregate :date
129
+ end
130
+
131
+ When aggregations are specified in this way, then Cubicle will pre-aggregate your data for each of the specified combinations of dimensions whenever MyCubicle.process is called, eliminating the first-hit penalty that would otherwise be incurred when Cubicle encountered a given aggregation for the first time.
132
+
133
+ == Transient (Real Time) Queries
134
+ Sometimes you may not want to query cached data. In our application, we are using Cubicle to provide data for our performance management Key Performance Indicators (KPIs), which consist of both a historical trend of a particular metric and the current, real time value of the same metric for, say, the current month or a rolling 30 day period. For performance reasons, we fetch our trend, which is usually 12 months, from cached data, but we want up-to-the-minute freshness for our real time KPI values, so we need to query the living source data. To accomplish this using Cubicle, you simply insert 'transient!' into your query definition, like so:
135
+
136
+ MyCubicle.query do
137
+ transient!
138
+ select :this, :that, :the_other
139
+ end
140
+
141
+ This will bypass cached aggregations and execute a map reduce query directly against the cubicle source collection.
142
+
143
+ == Mongo Mapper plugin
144
+ If MongoMapper is detected, Cubicle will use its connection to MongoDB. Additionally, Cubicle will install a simple MongoMapper plugin for doing ad-hoc, non-cached aggregations on the fly from a MongoMapper document, like this:
145
+ MyMongoMapperModel.aggregate do
146
+ dimension :my_dimension
147
+ count :measure1
148
+ avg :measure2
149
+ end.query {order_by [:measure2, :desc]; limit 10;}
150
+
151
+ == Limitations
152
+ * Cubicle cannot currently cause child documents to be emitted in the map reduce. This is a pretty big limitation, and will be resolved shortly.
153
+ * Documentation is non-existent. This is being worked on (heard that one before?)
154
+ * Test coverage is OK, but the tests could be better organized
155
+ * Code needs to be modularized a bit, main classes are a bit hairy at the moment
156
+
157
+
158
+ == Credits
159
+
160
+ * Alex Wang, Patrick Gannon for features, fixes & testing
161
+
162
+ == Bugs/Issues
163
+ Please report them {on github}[http://github.com/plasticlizard/cubicle/issues].
164
+
165
+ == Links
166
+
167
+ == Todo
168
+ * Support for emitting child / descendant documents
169
+ * Work with native Date type, instead of just iso strings
170
+ * Hirb support
171
+ * Member format strings
172
+ * Auto gen of a cubicle definition based on existing keys/key types in the MongoMapper plugin
173
+ * DSL support for topcount and bottomcount queries
174
+ * Support for 'duration' aggregation that will calculate durations between timestamps
175
+ * Metadata collection to track when cubicles have been processed, perhaps how big they are, how many aggregations, etc.
@@ -5,11 +5,11 @@
5
5
 
6
6
  Gem::Specification.new do |s|
7
7
  s.name = %q{cubicle}
8
- s.version = "0.1.1"
8
+ s.version = "0.1.2"
9
9
 
10
10
  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
11
11
  s.authors = ["Nathan Stults"]
12
- s.date = %q{2010-03-14}
12
+ s.date = %q{2010-03-20}
13
13
  s.description = %q{Cubicle provides a dsl and aggregation caching framework for automating the generation, execution and caching of map reduce queries when using MongoDB in Ruby. Cubicle also includes a MongoMapper plugin for quickly performing ad-hoc, multi-level group-by queries against a MongoMapper model.}
14
14
  s.email = %q{hereiam@sonic.net}
15
15
  s.extra_rdoc_files = [
@@ -22,6 +22,7 @@ Gem::Specification.new do |s|
22
22
  "README.rdoc",
23
23
  "Rakefile",
24
24
  "cubicle.gemspec",
25
+ "cubicle.log",
25
26
  "lib/cubicle.rb",
26
27
  "lib/cubicle/aggregation.rb",
27
28
  "lib/cubicle/calculated_measure.rb",
@@ -29,6 +30,7 @@ Gem::Specification.new do |s|
29
30
  "lib/cubicle/data_level.rb",
30
31
  "lib/cubicle/date_time.rb",
31
32
  "lib/cubicle/dimension.rb",
33
+ "lib/cubicle/duration.rb",
32
34
  "lib/cubicle/measure.rb",
33
35
  "lib/cubicle/member.rb",
34
36
  "lib/cubicle/member_list.rb",
@@ -44,6 +46,7 @@ Gem::Specification.new do |s|
44
46
  "test/cubicle/cubicle_data_test.rb",
45
47
  "test/cubicle/cubicle_query_test.rb",
46
48
  "test/cubicle/cubicle_test.rb",
49
+ "test/cubicle/duration_test.rb",
47
50
  "test/cubicle/mongo_mapper/aggregate_plugin_test.rb",
48
51
  "test/cubicles/defect_cubicle.rb",
49
52
  "test/log/test.log",
@@ -61,6 +64,7 @@ Gem::Specification.new do |s|
61
64
  "test/cubicle/cubicle_data_test.rb",
62
65
  "test/cubicle/cubicle_query_test.rb",
63
66
  "test/cubicle/cubicle_test.rb",
67
+ "test/cubicle/duration_test.rb",
64
68
  "test/cubicle/mongo_mapper/aggregate_plugin_test.rb",
65
69
  "test/cubicles/defect_cubicle.rb",
66
70
  "test/models/defect.rb",
@@ -0,0 +1,3 @@
1
+ # Logfile created on Sun Mar 14 00:53:11 -0800 2010 by logger.rb/22285
2
+ Processing HirbyCubicle @ Sun Mar 14 00:53:11 -0800 2010
3
+ HirbyCubicle processed @ Sun Mar 14 00:53:11 -0800 2010in 0.160016 seconds.
@@ -1,390 +1,423 @@
1
- require "rubygems"
2
- require "active_support"
3
- require "mongo"
4
- require "logger"
5
-
6
- dir = File.dirname(__FILE__)
7
- ["mongo_environment",
8
- "member",
9
- "member_list",
10
- "measure",
11
- "calculated_measure",
12
- "dimension",
13
- "ratio",
14
- "query",
15
- "data_level",
16
- "data",
17
- "aggregation",
18
- "date_time",
19
- "support"].each {|lib|require File.join(dir,'cubicle',lib)}
20
-
21
- require File.join(dir,"cubicle","mongo_mapper","aggregate_plugin") if defined?(MongoMapper::Document)
22
-
23
- module Cubicle
24
-
25
- def self.register_cubicle_directory(directory_path, recursive=true)
26
- searcher = "#{recursive ? "**/*" : "*"}.rb" #recurse into subdirectories when requested
27
- Dir[File.join(directory_path,searcher)].each {|cubicle| require cubicle}
28
- end
29
-
30
- def self.mongo
31
- @mongo ||= defined?(::MongoMapper::Document) ? ::MongoMapper : MongoEnvironment
32
- end
33
-
34
- def self.logger
35
- Cubicle.mongo.logger || Logger.new("cubicle.log")
36
- end
37
-
38
- def database
39
- Cubicle.mongo.database
40
- end
41
-
42
- def collection
43
- database[target_collection_name]
44
- end
45
-
46
- def transient?
47
- @transient ||= false
48
- end
49
-
50
- def transient!
51
- @transient = true
52
- end
53
-
54
- def expire!
55
- collection.drop
56
- expire_aggregations!
57
- end
58
-
59
- def aggregations
60
- return (@aggregations ||= [])
61
- end
62
-
63
- #DSL
64
- def source_collection_name(collection_name = nil)
65
- return @source_collection = collection_name if collection_name
66
- @source_collection ||= name.chomp("Cubicle").chomp("Cube").underscore.pluralize
67
- end
68
- alias source_collection_name= source_collection_name
69
-
70
- def target_collection_name(collection_name = nil)
71
- return nil if transient?
72
- return @target_name = collection_name if collection_name
73
- @target_name ||= "#{name.blank? ? source_collection_name : name.underscore.pluralize}_cubicle"
74
- end
75
- alias target_collection_name= target_collection_name
76
-
77
- def dimension(*args)
78
- dimensions << Cubicle::Dimension.new(*args)
79
- dimensions[-1]
80
- end
81
-
82
- def dimension_names
83
- return @dimensions.map{|dim|dim.name.to_s}
84
- end
85
-
86
- def dimensions(*args)
87
- return (@dimensions ||= Cubicle::MemberList.new) if args.length < 1
88
- args = args[0] if args.length == 1 && args[0].is_a?(Array)
89
- args.each {|dim| dimension dim }
90
- @dimensions
91
- end
92
-
93
- def measure(*args)
94
- measures << Measure.new(*args)
95
- measures[-1]
96
- end
97
-
98
- def measures(*args)
99
- return (@measures ||= Cubicle::MemberList.new) if args.length < 1
100
- args = args[0] if args.length == 1 && args[0].is_a?(Array)
101
- args.each {|m| measure m}
102
- @measures
103
- end
104
-
105
- def count(*args)
106
- options = args.extract_options!
107
- options[:aggregation_method] = :count
108
- measure(*(args << options))
109
- end
110
-
111
- def average(*args)
112
- options = args.extract_options!
113
- options[:aggregation_method] = :average
114
- measure(*(args << options))
115
- #Averaged fields need a count of non-null values to properly calculate the average
116
- args[0] = "#{args[0]}_count".to_sym
117
- count *args
118
- end
119
- alias avg average
120
-
121
- def sum(*args)
122
- options = args.extract_options!
123
- options[:aggregation_method] = :sum
124
- measure(*(args << options))
125
- end
126
-
127
- def ratio(member_name, numerator, denominator)
128
- measures << Ratio.new(member_name, numerator, denominator)
129
- end
130
-
131
- def aggregation(*member_list)
132
- member_list = member_list[0] if member_list[0].is_a?(Array)
133
- aggregations << member_list
134
- end
135
-
136
- def time_dimension(*args)
137
- return (@time_dimension ||= nil) unless args.length > 0
138
- @time_dimension = dimension(*args)
139
- end
140
- alias time_dimension= time_dimension
141
- alias date time_dimension
142
- alias time time_dimension
143
-
144
- def find_member(member_name)
145
- @dimensions[member_name] ||
146
- @measures[member_name]
147
- end
148
-
149
- def query(*args,&block)
150
- options = args.extract_options!
151
- query = Cubicle::Query.new(self)
152
- query.source_collection_name = options.delete(:source_collection) if options[:source_collection]
153
- query.select(*args) if args.length > 0
154
- if block_given?
155
- block.arity == 1 ? (yield query) : (query.instance_eval(&block))
156
- end
157
- query.select_all unless query.selected?
158
- return query if options[:defer]
159
- results = execute_query(query,options)
160
- #If the 'by' clause was used in the the query,
161
- #we'll hierarchize by the members indicated,
162
- #as the next step would otherwise almost certainly
163
- #need to be a call to hierarchize anyway.
164
- query.respond_to?(:by) && query.by.length > 0 ? results.hierarchize(*query.by) : results
165
- end
166
-
167
- def execute_query(query,options={})
168
- count = 0
169
-
170
- find_options = {
171
- :limit=>query.limit || 0,
172
- :skip=>query.offset || 0
173
- }
174
-
175
- find_options[:sort] = prepare_order_by(query)
176
- filter = {}
177
- if query == self || query.transient?
178
- aggregation = aggregate(query,options)
179
- else
180
- process_if_required
181
- aggregation = aggregation_for(query)
182
- #if the query exactly matches the aggregation in terms of requested members, we can issue a simple find
183
- #otherwise, a second map reduce is required to reduce the data set one last time
184
- if ((aggregation.name.split("_")[-1].split(".")) - query.member_names - [:all_measures]).blank?
185
- filter = prepare_filter(query,options[:where] || {})
186
- else
187
- aggregation = aggregate(query,:source_collection=>collection.name)
188
- end
189
- end
190
- count = aggregation.count
191
- #noinspection RubyArgCount
192
- data = aggregation.find(filter,find_options).to_a
193
- #noinspection RubyArgCount
194
- aggregation.drop if aggregation.name =~ /^tmp.mr.*/
195
- Cubicle::Data.new(query, data, count)
196
- end
197
-
198
- def process(options={})
199
- Cubicle.logger.info "Processing #{self.name} @ #{Time.now}"
200
- start = Time.now
201
- expire!
202
- aggregate(self,options)
203
- #Sort desc by length of array, so that larget
204
- #aggregations are processed first, hopefully increasing efficiency
205
- #of the processing step
206
- aggregations.sort!{|a,b|b.length<=>a.length}
207
- aggregations.each do |member_list|
208
- agg_start = Time.now
209
- aggregation_for(query(:defer=>true){select member_list})
210
- Cubicle.logger.info "#{self.name} aggregation #{member_list.inspect} processed in #{Time.now-agg_start} seconds"
211
- end
212
- duration = Time.now - start
213
- Cubicle.logger.info "#{self.name} processed @ #{Time.now}in #{duration} seconds."
214
- end
215
-
216
- protected
217
-
218
- def aggregation_collection_names
219
- database.collection_names.select {|col_name|col_name=~/#{target_collection_name}_aggregation_(.*)/}
220
- end
221
-
222
- def expire_aggregations!
223
- aggregation_collection_names.each{|agg_col|database[agg_col].drop}
224
- end
225
-
226
- def find_best_source_collection(dimension_names, existing_aggregations=self.aggregation_collection_names)
227
- #format of aggregation collection names is source_cubicle_collection_aggregation_dim1.dim2.dim3.dimn
228
- #this next ugly bit of algebra will create 2d array containing a list of the dimension names in each existing aggregation
229
- existing = existing_aggregations.map do |agg_col_name|
230
- agg_col_name.gsub("#{target_collection_name}_aggregation_","").split(".")
231
- end
232
-
233
- #This will select all the aggregations that contain ALL of the desired dimension names
234
- #we are sorting by length because the aggregation with the least number of members
235
- #is likely to be the most efficient data source as it will likely contain the smallest number of rows.
236
- #this will not always be true, and situations may exist where it is rarely true, however the alternative
237
- #is to actually count rows of candidates, which seems a bit wasteful. Of course only the profiler knows,
238
- #but until there is some reason to believe the aggregation caching process needs be highly performant,
239
- #this should do for now.
240
- candidates = existing.select {|candidate|(dimension_names - candidate).blank?}.sort {|a,b|a.length <=> b.length}
241
-
242
- #If no suitable aggregation exists to base this one off of,
243
- #we'll just use the base cubes aggregation collection
244
- return target_collection_name if candidates.blank?
245
- "#{target_collection_name}_aggregation_#{candidates[0].join('.')}"
246
-
247
- end
248
-
249
- def aggregation_for(query)
250
- return collection if query.all_dimensions?
251
-
252
- aggregation_query = query.clone
253
- #If the query needs to filter on a field, it had better be in the aggregation...if it isn't a $where filter...
254
- filter = (query.where if query.respond_to?(:where))
255
- filter.keys.each {|filter_key|aggregation_query.select(filter_key) unless filter_key=~/^\$.*/} unless filter.blank?
256
-
257
- dimension_names = aggregation_query.dimension_names.sort
258
- agg_col_name = "#{target_collection_name}_aggregation_#{dimension_names.join('.')}"
259
-
260
- unless database.collection_names.include?(agg_col_name)
261
- source_col_name = find_best_source_collection(dimension_names)
262
- exec_query = query(dimension_names + [:all_measures], :source_collection=>source_col_name, :defer=>true)
263
- aggregate(exec_query, :target_collection=>agg_col_name)
264
- end
265
-
266
- database[agg_col_name]
267
- end
268
-
269
- def ensure_indexes(collection_name,dimension_names)
270
- #an index for each dimension
271
- dimension_names.each {|dim|database[collection_name].create_index([dim,Mongo::ASCENDING])}
272
- #and a composite
273
- database[collection_name].create_index(dimension_names)
274
- end
275
-
276
- def aggregate(query,options={})
277
- map, reduce = generate_map_function(query), generate_reduce_function
278
- options[:finalize] = generate_finalize_function(query)
279
- options["query"] = prepare_filter(query,options[:where] || {})
280
-
281
- query.source_collection_name ||= source_collection_name
282
-
283
- target_collection = options.delete(:target_collection)
284
- target_collection ||= query.target_collection_name if query.respond_to?(:target_collection_name)
285
-
286
- options[:out] = target_collection unless target_collection.blank? || query.transient?
287
-
288
- #This is defensive - some tests run without ever initializing any collections
289
- return [] unless database.collection_names.include?(query.source_collection_name)
290
-
291
- result = database[query.source_collection_name].map_reduce(map,reduce,options)
292
-
293
- ensure_indexes(target_collection,query.dimension_names) if target_collection
294
-
295
- result
296
- end
297
-
298
- def prepare_filter(query,filter={})
299
- filter.merge!(query.where) if query.respond_to?(:where) && query.where
300
- filter.stringify_keys!
301
- transient = (query.transient? || query == self)
302
- filter.keys.each do |key|
303
- next if key=~/^\$.*/
304
- prefix = nil
305
- prefix = "_id" if (member = self.dimensions[key])
306
- prefix = "value" if (member = self.measures[key]) unless member
307
-
308
- raise "You supplied a filter that does not appear to be a member of this cubicle:#{key}" unless member
309
-
310
- filter_value = filter.delete(key)
311
- if transient
312
- if (member.expression_type == :javascript)
313
- filter_name = "$where"
314
- filter_value = "'#{filter_value}'" if filter_value.is_a?(String) || filter_value.is_a?(Symbol)
315
- filter_value = "(#{member.expression})==#{filter_value}"
316
- else
317
- filter_name = member.expression
318
- end
319
- else
320
- filter_name = "#{prefix}.#{member.name}"
321
- end
322
- filter[filter_name] = filter_value
323
- end
324
- filter
325
- end
326
-
327
- def prepare_order_by(query)
328
- order_by = []
329
- query.order_by.each do |order|
330
- prefix = "_id" if (member = self.dimensions[order[0]])
331
- prefix = "value" if (member = self.measures[order[0]]) unless member
332
- raise "You supplied a field to order_by that does not appear to be a member of this cubicle:#{key}" unless member
333
- order_by << ["#{prefix}.#{order[0]}",order[1]]
334
- end
335
- order_by
336
- end
337
-
338
- def process_if_required
339
- return if database.collection_names.include?(target_collection_name)
340
- process
341
- end
342
-
343
-
344
- def generate_keys_string(query)
345
- "{#{query.dimensions.map{|dim|dim.to_js_keys}.flatten.join(", ")}}"
346
- end
347
-
348
- def generate_values_string(query = self)
349
- "{#{query.measures.map{|measure|measure.to_js_keys}.flatten.join(", ")}}"
350
- end
351
-
352
- def generate_map_function(query = self)
353
- <<MAP
354
- function(){emit(#{generate_keys_string(query)},#{generate_values_string(query)});}
355
- MAP
356
- end
357
-
358
- def generate_reduce_function()
359
- <<REDUCE
360
- function(key,values){
361
- var output = {};
362
- values.forEach(function(doc){
363
- for(var key in doc){
364
- if (doc[key] != null){
365
- output[key] = output[key] || 0;
366
- output[key] += doc[key];
367
- }
368
- }
369
- });
370
- return output;
371
- }
372
- REDUCE
373
- end
374
-
375
- def generate_finalize_function(query = self)
376
- <<FINALIZE
377
- function(key,value)
378
- {
379
-
380
- #{ (query.measures.select{|m|m.aggregation_method == :average}).map do |m|
381
- "value.#{m.name}=value.#{m.name}/value.#{m.name}_count;"
382
- end.join("\n")}
383
- #{ (query.measures.select{|m|m.aggregation_method == :calculation}).map do|m|
384
- "value.#{m.name}=#{m.expression};";
385
- end.join("\n")}
386
- return value;
387
- }
388
- FINALIZE
389
- end
1
+ require "rubygems"
+ require "active_support"
+ require "mongo"
+ require "logger"
+
+ dir = File.dirname(__FILE__)
+ ["mongo_environment",
+  "member",
+  "member_list",
+  "measure",
+  "calculated_measure",
+  "dimension",
+  "ratio",
+  "duration",
+  "query",
+  "data_level",
+  "data",
+  "aggregation",
+  "date_time",
+  "support"].each {|lib|require File.join(dir,'cubicle',lib)}
+
+ require File.join(dir,"cubicle","mongo_mapper","aggregate_plugin") if defined?(MongoMapper::Document)
+
+ module Cubicle
+
+   def self.register_cubicle_directory(directory_path, recursive=true)
+     #a recursive search needs the double glob so subdirectories are included
+     searcher = "#{recursive ? "**/*" : "*"}.rb"
+     Dir[File.join(directory_path,searcher)].each {|cubicle| require cubicle}
+   end
+
+   def self.mongo
+     @mongo ||= defined?(::MongoMapper::Document) ? ::MongoMapper : MongoEnvironment
+   end
+
+   def self.logger
+     Cubicle.mongo.logger || Logger.new("cubicle.log")
+   end
+
+   def database
+     Cubicle.mongo.database
+   end
+
+   def collection
+     database[target_collection_name]
+   end
+
+   def transient?
+     @transient ||= false
+   end
+
+   def transient!
+     @transient = true
+   end
+
+   def expire!
+     collection.drop
+     expire_aggregations!
+   end
+
+   def aggregations
+     return (@aggregations ||= [])
+   end
+
+   #DSL
+   def source_collection_name(collection_name = nil)
+     return @source_collection = collection_name if collection_name
+     @source_collection ||= name.chomp("Cubicle").chomp("Cube").underscore.pluralize
+   end
+   alias source_collection_name= source_collection_name
+
+   def target_collection_name(collection_name = nil)
+     return nil if transient?
+     return @target_name = collection_name if collection_name
+     @target_name ||= "#{name.blank? ? source_collection_name : name.underscore.pluralize}_cubicle"
+   end
+   alias target_collection_name= target_collection_name
+
+   def dimension(*args)
+     dimensions << Cubicle::Dimension.new(*args)
+     dimensions[-1]
+   end
+
+   def dimension_names
+     return @dimensions.map{|dim|dim.name.to_s}
+   end
+
+   def dimensions(*args)
+     return (@dimensions ||= Cubicle::MemberList.new) if args.length < 1
+     args = args[0] if args.length == 1 && args[0].is_a?(Array)
+     args.each {|dim| dimension dim }
+     @dimensions
+   end
+
+   def measure(*args)
+     measures << Measure.new(*args)
+     measures[-1]
+   end
+
+   def measures(*args)
+     return (@measures ||= Cubicle::MemberList.new) if args.length < 1
+     args = args[0] if args.length == 1 && args[0].is_a?(Array)
+     args.each {|m| measure m}
+     @measures
+   end
+
+   def count(*args)
+     options = args.extract_options!
+     options[:aggregation_method] = :count
+     measure(*(args << options))
+   end
+
+   def average(*args)
+     options = args.extract_options!
+     options[:aggregation_method] = :average
+     measure(*(args << options))
+     #Averaged fields need a count of non-null values to properly calculate the average
+     args[0] = "#{args[0]}_count".to_sym
+     count *args
+   end
+   alias avg average
+
+   def sum(*args)
+     options = args.extract_options!
+     options[:aggregation_method] = :sum
+     measure(*(args << options))
+   end
+
+   def duration(*args)
+     options = args.extract_options!
+     options[:in] ||= durations_in
+     args << options
+     measures << (dur = Duration.new(*args))
+     count("#{dur.name}_count".to_sym, :expression=>dur.expression) if dur.aggregation_method == :average
+   end
+
+   def average_duration(*args)
+     duration(*args)
+   end
+   alias avg_duration average_duration
+
+   def total_duration(*args)
+     options = args.extract_options!
+     options[:aggregation_method] = :sum
+     duration(*(args<<options))
+   end
+
+   def durations_in(unit_of_time = nil)
+     return (@duration_unit ||= :seconds) unless unit_of_time
+     @duration_unit = unit_of_time.to_s.pluralize.to_sym
+   end
+   alias :duration_unit :durations_in
+
+   def ratio(member_name, numerator, denominator)
+     measures << Ratio.new(member_name, numerator, denominator)
+   end
+
+   def aggregation(*member_list)
+     member_list = member_list[0] if member_list[0].is_a?(Array)
+     aggregations << member_list
+   end
+
+   def time_dimension(*args)
+     return (@time_dimension ||= nil) unless args.length > 0
+     @time_dimension = dimension(*args)
+   end
+   alias time_dimension= time_dimension
+   alias date time_dimension
+   alias time time_dimension
+
+   def find_member(member_name)
+     @dimensions[member_name] ||
+       @measures[member_name]
+   end
+
+   def query(*args,&block)
+     options = args.extract_options!
+     query = Cubicle::Query.new(self)
+     query.source_collection_name = options.delete(:source_collection) if options[:source_collection]
+     query.select(*args) if args.length > 0
+     if block_given?
+       block.arity == 1 ? (yield query) : (query.instance_eval(&block))
+     end
+     query.select_all unless query.selected?
+     return query if options[:defer]
+     results = execute_query(query,options)
+     #return results if results.blank?
+     #If the 'by' clause was used in the query,
+     #we'll hierarchize by the members indicated,
+     #as the next step would otherwise almost certainly
+     #need to be a call to hierarchize anyway.
+     query.respond_to?(:by) && query.by.length > 0 ? results.hierarchize(*query.by) : results
+   end
+
+   #noinspection RubyArgCount
+   def execute_query(query,options={})
+     count = 0
+
+     find_options = {
+       :limit=>query.limit || 0,
+       :skip=>query.offset || 0
+     }
+
+     find_options[:sort] = prepare_order_by(query)
+     filter = {}
+     if query == self || query.transient?
+       aggregation = aggregate(query,options)
+     else
+       process_if_required
+       aggregation = aggregation_for(query)
+       #if the query exactly matches the aggregation in terms of requested members, we can issue a simple find
+       #otherwise, a second map reduce is required to reduce the data set one last time
+       if ((aggregation.name.split("_")[-1].split(".")) - query.member_names - [:all_measures]).blank?
+         filter = prepare_filter(query,options[:where] || {})
+       else
+         aggregation = aggregate(query,:source_collection=>collection.name)
+       end
+     end
+
+     if aggregation.blank?
+       Cubicle::Data.new(query,[],0) if aggregation == []
+     else
+       count = aggregation.count
+       results = aggregation.find(filter,find_options).to_a
+       aggregation.drop if aggregation.name =~ /^tmp.mr.*/
+       Cubicle::Data.new(query, results, count)
+     end
+   end
+
+   def process(options={})
+     Cubicle.logger.info "Processing #{self.name} @ #{Time.now}"
+     start = Time.now
+     expire!
+     aggregate(self,options)
+     #Sort desc by length of array, so that largest
+     #aggregations are processed first, hopefully increasing efficiency
+     #of the processing step
+     aggregations.sort!{|a,b|b.length<=>a.length}
+     aggregations.each do |member_list|
+       agg_start = Time.now
+       aggregation_for(query(:defer=>true){select member_list})
+       Cubicle.logger.info "#{self.name} aggregation #{member_list.inspect} processed in #{Time.now-agg_start} seconds"
+     end
+     duration = Time.now - start
+     Cubicle.logger.info "#{self.name} processed @ #{Time.now} in #{duration} seconds."
+   end
+
+   protected
+
+   def aggregation_collection_names
+     database.collection_names.select {|col_name|col_name=~/#{target_collection_name}_aggregation_(.*)/}
+   end
+
+   def expire_aggregations!
+     aggregation_collection_names.each{|agg_col|database[agg_col].drop}
+   end
+
+   def find_best_source_collection(dimension_names, existing_aggregations=self.aggregation_collection_names)
+     #format of aggregation collection names is source_cubicle_collection_aggregation_dim1.dim2.dim3.dimn
+     #this next ugly bit of algebra will create a 2D array containing a list of the dimension names in each existing aggregation
+     existing = existing_aggregations.map do |agg_col_name|
+       agg_col_name.gsub("#{target_collection_name}_aggregation_","").split(".")
+     end
+
+     #This will select all the aggregations that contain ALL of the desired dimension names.
+     #We are sorting by length because the aggregation with the least number of members
+     #is likely to be the most efficient data source, as it will likely contain the smallest number of rows.
+     #This will not always be true, and situations may exist where it is rarely true; however, the alternative
+     #is to actually count rows of candidates, which seems a bit wasteful. Of course only the profiler knows,
+     #but until there is some reason to believe the aggregation caching process needs to be highly performant,
+     #this should do for now.
+     candidates = existing.select {|candidate|(dimension_names - candidate).blank?}.sort {|a,b|a.length <=> b.length}
+
+     #If no suitable aggregation exists to base this one off of,
+     #we'll just use the base cube's aggregation collection
+     return target_collection_name if candidates.blank?
+     "#{target_collection_name}_aggregation_#{candidates[0].join('.')}"
+   end
+
+   def aggregation_for(query)
+     return collection if query.all_dimensions?
+
+     aggregation_query = query.clone
+     #If the query needs to filter on a field, it had better be in the aggregation...if it isn't a $where filter...
+     filter = (query.where if query.respond_to?(:where))
+     filter.keys.each {|filter_key|aggregation_query.select(filter_key) unless filter_key=~/^\$.*/} unless filter.blank?
+
+     dimension_names = aggregation_query.dimension_names.sort
+     agg_col_name = "#{target_collection_name}_aggregation_#{dimension_names.join('.')}"
+
+     unless database.collection_names.include?(agg_col_name)
+       source_col_name = find_best_source_collection(dimension_names)
+       exec_query = query(dimension_names + [:all_measures], :source_collection=>source_col_name, :defer=>true)
+       aggregate(exec_query, :target_collection=>agg_col_name)
+     end
+
+     database[agg_col_name]
+   end
+
+   def ensure_indexes(collection_name,dimension_names)
+     #an index for each dimension
+     dimension_names.each {|dim|database[collection_name].create_index([dim,Mongo::ASCENDING])}
+     #and a composite
+     database[collection_name].create_index(dimension_names)
+   end
+
+   def aggregate(query,options={})
+     map, reduce = generate_map_function(query), generate_reduce_function
+     options[:finalize] = generate_finalize_function(query)
+     options["query"] = prepare_filter(query,options[:where] || {})
+
+     query.source_collection_name ||= source_collection_name
+
+     target_collection = options.delete(:target_collection)
+     target_collection ||= query.target_collection_name if query.respond_to?(:target_collection_name)
+
+     options[:out] = target_collection unless target_collection.blank? || query.transient?
+
+     #This is defensive - some tests run without ever initializing any collections
+     return [] unless database.collection_names.include?(query.source_collection_name)
+
+     result = database[query.source_collection_name].map_reduce(map,reduce,options)
+
+     ensure_indexes(target_collection,query.dimension_names) if target_collection
+
+     result
+   end
+
+   def prepare_filter(query,filter={})
+     filter.merge!(query.where) if query.respond_to?(:where) && query.where
+     filter.stringify_keys!
+     transient = (query.transient? || query == self)
+     filter.keys.each do |key|
+       next if key=~/^\$.*/
+       prefix = nil
+       prefix = "_id" if (member = self.dimensions[key])
+       prefix = "value" if (member = self.measures[key]) unless member
+
+       raise "You supplied a filter that does not appear to be a member of this cubicle: #{key}" unless member
+
+       filter_value = filter.delete(key)
+       if transient
+         if (member.expression_type == :javascript)
+           filter_name = "$where"
+           filter_value = "'#{filter_value}'" if filter_value.is_a?(String) || filter_value.is_a?(Symbol)
+           filter_value = "(#{member.expression})==#{filter_value}"
+         else
+           filter_name = member.expression
+         end
+       else
+         filter_name = "#{prefix}.#{member.name}"
+       end
+       filter[filter_name] = filter_value
+     end
+     filter
+   end
+
+   def prepare_order_by(query)
+     order_by = []
+     query.order_by.each do |order|
+       prefix = "_id" if (member = self.dimensions[order[0]])
+       prefix = "value" if (member = self.measures[order[0]]) unless member
+       raise "You supplied a field to order_by that does not appear to be a member of this cubicle: #{order[0]}" unless member
+       order_by << ["#{prefix}.#{order[0]}",order[1]]
+     end
+     order_by
+   end
+
+   def process_if_required
+     return if database.collection_names.include?(target_collection_name)
+     process
+   end
+
+   def generate_keys_string(query)
+     "{#{query.dimensions.map{|dim|dim.to_js_keys}.flatten.join(", ")}}"
+   end
+
+   def generate_values_string(query = self)
+     "{#{query.measures.map{|measure|measure.to_js_keys}.flatten.join(", ")}}"
+   end
+
+   def generate_map_function(query = self)
+     <<MAP
+     function(){emit(#{generate_keys_string(query)},#{generate_values_string(query)});}
+ MAP
+   end
+
+   def generate_reduce_function()
+     <<REDUCE
+     function(key,values){
+       var output = {};
+       values.forEach(function(doc){
+         for(var key in doc){
+           if (doc[key] || doc[key] == 0){
+             output[key] = output[key] || 0;
+             output[key] += doc[key];
+           }
+         }
+       });
+       return output;
+     }
+ REDUCE
+   end
+
+   def generate_finalize_function(query = self)
+     <<FINALIZE
+     function(key,value)
+     {
+       #{ (query.measures.select{|m|m.aggregation_method == :average}).map do |m|
+         "value.#{m.name}=value.#{m.name}/value.#{m.name}_count;"
+       end.join("\n")}
+       #{ (query.measures.select{|m|m.aggregation_method == :calculation}).map do|m|
+         "value.#{m.name}=#{m.expression};"
+       end.join("\n")}
+       return value;
+     }
+ FINALIZE
+   end
  end
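The candidate-selection step in `find_best_source_collection` above reduces to a small set operation: a cached aggregation can feed a query when it contains ALL of the requested dimensions, and the shortest such candidate is assumed to hold the fewest rows. A minimal standalone sketch of that logic (method name, the `"base"` fallback, and the sample collection names are hypothetical; ActiveSupport's `blank?` is swapped for plain `empty?` and the collection-name prefix handling is omitted so it runs without the gem):

```ruby
# Pick the smallest existing aggregation whose dimension list is a
# superset of the requested dimensions; fall back to the base collection.
def find_best_source(dimension_names, existing_aggregations)
  # Each aggregation name is a dot-separated list of its dimensions.
  existing = existing_aggregations.map { |name| name.split(".") }
  # Keep only supersets of the requested dimensions, smallest first.
  candidates = existing.select { |candidate| (dimension_names - candidate).empty? }
                       .sort { |a, b| a.length <=> b.length }
  return "base" if candidates.empty?
  candidates[0].join(".")
end

puts find_best_source(["date"], ["date.product.region", "date.product", "product"])
# => "date.product"
```

Note that `["date"] - ["date", "product"]` is empty, so `date.product` qualifies and wins over the three-dimension candidate by length.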