data_hut 0.0.7 → 0.0.8
- data/.gitignore +3 -1
- data/CHANGELOG.md +16 -0
- data/README.md +51 -112
- data/Rakefile +6 -0
- data/lib/data_hut/data_warehouse.rb +56 -17
- data/lib/data_hut/version.rb +1 -1
- data/samples/basic.rb +1 -1
- data/samples/common/D3JS_LICENSE +26 -0
- data/samples/common/d3.v3.min.js +4 -0
- data/samples/common/report.html.haml +16 -0
- data/samples/{sample_helper.rb → common/sample_helper.rb} +1 -1
- data/samples/common/samples.gemfile +8 -0
- data/samples/league_of_legends.rb +21 -7
- data/samples/reddit_science.rb +77 -0
- data/samples/weather_files/screenshot.png +0 -0
- data/samples/weather_files/weather.css +24 -0
- data/samples/weather_files/weather.js +89 -0
- data/samples/weather_station.rb +62 -0
- data/test/spec/basic_test.rb +42 -0
- data/test/unit/data_warehouse_test.rb +18 -1
- metadata +12 -3
data/.gitignore
CHANGED
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,21 @@
 # Changelog
 
+## 0.0.8
+
+* handle unsanitized nil values properly - If your input data has occasional nil values during extract or transform, you may have seen:
+        DataHut: Ruby type 'NilClass' not supported by Sequel...
+  DataHut now handles nil values instead of raising this exception so that it is easier to work with unsanitized datasets.
+
+* added `DataHut::DataWarehouse#not_unique` which allows you to specify any test of uniqueness for early skipping during transform or extract phases. DataHut has duplicate detection built in, i.e. it doesn't allow identical records to be inserted. However, in the past you had to wait for all the fields to be added or transformed before this detection was done. `not_unique` allows you to define more specific uniqueness parameters for early skipping without going through all that, i.e. if you have a feed where you know a dup shares some kind of GUID... simply test whether the GUID is unique *before* going any further...
+
+        dh.extract(data) do |r, d|
+          next if dh.not_unique(guid: d[:guid])
+          r.guid = d[:guid]
+          r.name = d[:name]
+          r.age = d[:age]
+          ...
+        end
+
 ## 0.0.7
 
 * added capability to store and fetch arbitrary metadata from the DataHut.
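As a quick illustration of the new nil handling (a minimal sketch; the database name and fields below are made up):

    require 'data_hut'

    dh = DataHut.connect("people")
    data = [{name: "fred", age: 44},
            {name: "phil", age: nil}]   # unsanitized: age is missing

    dh.extract(data) do |r, d|
      r.name = d[:name]
      r.age  = d[:age]   # in 0.0.7 this nil ended in "DataHut: Ruby type 'NilClass' not supported by Sequel..."
    end                  # in 0.0.8 the nil field is simply skipped for this record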
data/README.md
CHANGED
@@ -6,6 +6,9 @@ DataHut has basic features for small one-off analytics like parsing error logs a
 
 *Extract* your data from anywhere, *transform* it however you like and *analyze* it for insights!
 
+<img src="https://raw.github.com/coldnebo/data_hut/master/samples/weather_files/screenshot.png" width="70%"/>
+*from [samples/weather_station.rb](https://github.com/coldnebo/data_hut/blob/master/samples/weather_station.rb)*
+
 
 ## Installation
 
@@ -50,9 +53,14 @@ Setting up a datahut is easy...
 
     binding.pry
 
-
+DataHut provides access to the underlying [Sequel::Dataset](http://sequel.rubyforge.org/rdoc/classes/Sequel/Dataset.html) using
+a Sequel::Model binding. This allows you to query individual fields and stats from the dataset, but also returns rows as objects that are accessed with the same uniform object syntax you used for extracting and transforming... i.e.:
+
+    [1] pry(main)> person = ds.first
+    [2] pry(main)> [person.name, person.age]
+    => ["barney", 27]
 
-And here's the kinds of powerful things you can do:
+And here's some of the other powerful things you can do with a Sequel::Dataset:
 
     [2] pry(main)> ds.where(eligible: false).count
     => 2
@@ -63,7 +71,7 @@ And here's the kinds of powerful things you can do:
     [5] pry(main)> ds.min(:age)
     => 27
 
-But wait, you can name these collections:
+But you can also name subsets of data and work with those instead:
 
     [6] pry(main)> ineligible = ds.where(eligible: false)
     => #<Sequel::SQLite::Dataset: "SELECT * FROM `data_warehouse` WHERE (`eligible` = 'f')">
@@ -74,7 +82,7 @@ But wait, you can name these collections:
     => [#< @values={:dw_id=>3, :name=>"fred", :age=>44, :eligible=>false}>,
     #< @values={:dw_id=>2, :name=>"phil", :age=>31, :eligible=>false}>]
 
-The results are Sequel::Model objects, so you can treat them as such:
+And results remain Sequel::Model objects, so you can access fields with object notation:
 
     [32] pry(main)> record = ineligible.order(Sequel.desc(:age)).first
     => #< @values={:dw_id=>3, :name=>"fred", :age=>44, :eligible=>false}>
@@ -84,113 +92,18 @@ The results are Sequel::Model objects, so you can treat them as such:
     => 44
 
 
-Read more about the [Sequel gem](http://sequel.rubyforge.org/
+Read more about the [Sequel gem](http://sequel.rubyforge.org/) to determine what operations you can perform on a DataHut dataset.
 
 ## A More Ambitious Example...
 
-Taking a popular game like League of Legends and hand-rolling some simple analysis of the champions
-
-    require 'data_hut'
-    require 'nokogiri'
-    require 'open-uri'
-    require 'pry'
-
-    root = 'http://na.leagueoflegends.com'
-
-    # load the data once... (manually delete it to refresh)
-    unless File.exists?("lolstats.db")
-      dh = DataHut.connect("lolstats")
-
-      champions_page = Nokogiri::HTML(open("#{root}/champions"))
-
-      urls = champions_page.css('table.champion_item td.description span a').collect{|e| e.attribute('href').value}
-
-      # keep the powers for later since they are on different pages.
-      powers = {}
-      champions_page.css('table.champion_item').each do |c|
-        name = c.css('td.description span.highlight a').text
-        attack = c.css('td.graphing td.filled_attack').count
-        health = c.css('td.graphing td.filled_health').count
-        spells = c.css('td.graphing td.filled_spells').count
-        difficulty = c.css('td.graphing td.filled_difficulty').count
-        powers.store(name, {attack_power: attack, defense_power: health, ability_power: spells, difficulty: difficulty})
-      end
-
-      puts "loading champion data"
-      dh.extract(urls) do |r, url|
-        champion_page = Nokogiri::HTML(open("#{root}#{url}"))
-        r.name = champion_page.css('div.page_header_text').text
-
-        st = champion_page.css('table.stats_table')
-        names = st.css('td.stats_name').collect{|e| e.text.strip.downcase.gsub(/ /,'_')}
-        values = st.css('td.stats_value').collect{|e| e.text.strip}
-        modifiers = st.css('td.stats_modifier').collect{|e| e.text.strip}
-
-        dh.store_meta(:stats, names)
-
-        (0..names.count-1).collect do |i|
-          stat = (names[i] + "=").to_sym
-          r.send(stat, values[i].to_f)
-          stat_per_level = (names[i].downcase.gsub(/ /,'_') << "_per_level=").to_sym
-          per_level_value = modifiers[i].match(/\+([\d\.]+)/)[1].to_f rescue 0
-          r.send(stat_per_level, per_level_value)
-        end
-
-        # add the powers for this champion...
-        power = powers[r.name]
-        r.attack_power = power[:attack_power]
-        r.defense_power = power[:defense_power]
-        r.ability_power = power[:ability_power]
-        r.difficulty = power[:difficulty]
-
-        print "."
-      end
-      puts "done."
-    end
-
-    # connect again in case extract was skipped because the core data already exists:
-    dh = DataHut.connect("lolstats")
-
-    # instead of writing out each stat line manually, we can use some metaprogramming along with some metadata to automate this.
-    def total_stat(r,stat)
-      total_stat = ("total_" + stat + "=").to_sym
-      stat_per_level = r.send((stat + "_per_level").to_sym)
-      base = r.send(stat.to_sym)
-      total = base + (stat_per_level * 18.0)
-      r.send(total_stat, total)
-    end
-    # we need to fetch metadata that was written during extract (potentially in a previous process run)
-    stats = dh.fetch_meta(:stats)
-
-    puts "first transform"
-    dh.transform do |r|
-      stats.each do |stat|
-        total_stat(r,stat)
-      end
-      print '.'
-    end
-
-    puts "second transform"
-    # there's no need to do transforms all in one batch either... you can layer them...
-    dh.transform(true) do |r|
-      # this index combines the tank dimensions above for best combination (simple Euclidean metric)
-      r.nuke_index = r.total_damage * r.total_move_speed * r.total_mana * (r.ability_power)
-      r.easy_nuke_index = r.total_damage * r.total_move_speed * r.total_mana * (r.ability_power) * (1.0/r.difficulty)
-      r.tenacious_index = r.total_armor * r.total_health * r.total_spell_block * r.total_health_regen * (r.defense_power)
-      r.support_index = r.total_mana * r.total_armor * r.total_spell_block * r.total_health * r.total_health_regen * r.total_mana_regen * (r.ability_power * r.defense_power)
-      print '.'
-    end
-
-    # use once at the end to mark records processed.
-    dh.transform_complete
-    puts "transforms complete"
-
-    ds = dh.dataset
-
-    binding.pry
+Let's take a popular game like League of Legends and hand-roll some simple analysis of the champions. Look at the following sample
+code:
 
+* [samples/league_of_legends.rb](https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb)
 
-
+Running this sample scrapes some game statistics from an official website and then transforms this base data with
+extra fields containing different totals and indices that we can construct however we like.
+Now that we have some data extracted and some initial transforms defined, let's play with the results...
 
 * who has the most base damage?
 
@@ -202,7 +115,7 @@ Now that we have some data, lets play...
     {"Poppy"=>56.3}]
 
 
-* but wait a minute... what about at level 18? Fortunately, we've transformed our data to add some extra fields for each stat...
+* but wait a minute... what about at level 18? Fortunately, we've transformed our data to add some extra "total" fields for each stat...
 
     [2] pry(main)> ds.order(Sequel.desc(:total_damage)).limit(5).collect{|c| {c.name => c.total_damage}}
     => [{"Skarner"=>129.70000000000002},
@@ -211,8 +124,7 @@ Now that we have some data, lets play...
     {"Taric"=>121.0},
     {"Alistar"=>120.19}]
 
-* how about using some of the indices we defined?... for instance, if we want to know which champions produce the greatest damage we could try sorting by our 'nuke_index', (notice that the assumptions on what make a good '
-nuke are subjective, but that's the fun of it; we can model our assumptions and see how the data changes in response.)
+* how about using some of the indices we defined?... for instance, if we want to know which champions produce the greatest damage we could try sorting by our 'nuke_index', (notice that the assumptions on what make a good 'nuke' are subjective, but that's the fun of it; we can model our assumptions and see how the data changes in response.)
 
     [3] pry(main)> ds.order(Sequel.desc(:nuke_index)).limit(5).collect{|c| {c.name => [c.total_damage, c.total_move_speed, c.total_mana, c.ability_power]}}
     => [{"Karthus"=>[100.7, 335.0, 1368.0, 10]},
@@ -221,14 +133,14 @@ nuke are subjective, but that's the fun of it; we can model our assumptions and
     {"Karma"=>[109.4, 335.0, 1320.0, 9]},
     {"Lux"=>[109.4, 340.0, 1150.0, 10]}]
 
-
+From my experience in the game, these champions are certainly heavy hitters. What do you think?
 
-* and (now I risk becoming addicted to
+* and (now I risk becoming addicted to DataHut myself), here's some further guesses with an 'easy_nuke' index (champions that have a lot of damage, but are also less difficult to play):
 
     [4] pry(main)> ds.order(Sequel.desc(:easy_nuke_index)).limit(5).collect{|c| c.name}
     => ["Sona", "Ryze", "Nasus", "Soraka", "Heimerdinger"]
 
-* makes sense, but is still fascinating... what about my crack at a support_index?
+* makes sense, but is still fascinating... what about my crack at a support_index (champions that have a lot of regen, staying power, etc.)?
 
     [5] pry(main)> ds.order(Sequel.desc(:support_index)).limit(5).collect{|c| c.name}
     => ["Sion", "Diana", "Nunu", "Nautilus", "Amumu"]
@@ -240,6 +152,33 @@ You get the idea now! *Extract* your data from anywhere, *transform* it however
 Have fun!
 
 
+## Metadata Object Store
+
+DataHut also supports a basic Ruby object store for storing persistent metadata that might be useful during extract and transform passes.
+
+*Examples:*
+
+* [samples/league_of_legends.rb](https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb):
+
+        dh.extract(urls) do |r, url|
+          ...
+          # names => ["damage", "health", "mana", "move_speed", "armor", "spell_block", "health_regen", "mana_regen"]
+
+          # DataHut also allows you to store metadata for the data warehouse during any processing phase for later retrieval.
+          # Since we extract the data only once, but may need stats names for subsequent transforms, we can store the
+          # stats names in the metadata for later use:
+          dh.store_meta(:stats, names)
+          ...
+        end
+        ...
+        # later... we can fetch the metadata that was written during the extract phase and use it...
+        stats = dh.fetch_meta(:stats)
+        # stats => ["damage", "health", "mana", "move_speed", "armor", "spell_block", "health_regen", "mana_regen"]
+
+**Caveats:** Because the datastore can support any Ruby object (including custom ones) it is up to the caller to make sure that custom classes are in context before storage and fetch. i.e. if you store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error. For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
+
+See {DataHut::DataWarehouse#store_meta} and {DataHut::DataWarehouse#fetch_meta} for details.
+
 ## TODOS
 
 * further optimizations
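To make the metadata caveat above concrete, here is a minimal sketch (the `scratch` database name and the `Point` struct are illustrative, not from the gem):

    require 'data_hut'

    dh = DataHut.connect("scratch")

    # safe: standard Ruby types marshal and load in any process
    dh.store_meta(:stats, ["damage", "health"])
    dh.fetch_meta(:stats)    # => ["damage", "health"]

    # risky: a custom class must be defined in any process that fetches it later
    Point = Struct.new(:x, :y)
    dh.store_meta(:origin, Point.new(0, 0))
    # a separate script that never defines Point will get an error from fetch_meta(:origin)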
data/Rakefile
CHANGED
@@ -12,4 +12,10 @@ task :default => :test
 desc "clean up"
 task :clean do
   FileUtils.rm(FileList["samples/**/*.db"], force: true, verbose: true)
+  FileUtils.rm(FileList["samples/*.html"], force: true, verbose: true)
+end
+
+desc "install gems for running samples"
+task :samples do
+  system('bundle install --gemfile=samples/common/samples.gemfile')
 end
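In practice: `rake samples` installs the gems the samples need (via the bundled `samples/common/samples.gemfile`), and `rake clean` now also removes any generated sample `.html` output along with the `.db` files.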
data/lib/data_hut/data_warehouse.rb
CHANGED
@@ -82,6 +82,12 @@ module DataHut
     #   more information about supported ruby data types you can use.
     # @yieldparam element an element from your data.
     # @raise [ArgumentError] if you don't provide a block
+    # @return [void]
+    # @note Duplicate records (all fields and values must match) are automatically not inserted at the end of an extract iteration. You may
+    #   also skip duplicate extracts early in the iteration by using {#not_unique}.
+    # @note Fields with nil values in records are skipped because the underlying database defaults these to
+    #   nil already. However you must have at least one non-nil value in order for the field to be automatically created,
+    #   otherwise subsequent transform layers may report errors on trying to access the field.
     def extract(data)
       raise(ArgumentError, "a block is required for extract.", caller) unless block_given?
 
@@ -102,7 +108,7 @@ module DataHut
     #
     # @param forced if set to 'true', this transform will iterate over records already marked processed. This can be useful for
     #   layers of transforms that deal with analytics where the analytical model may need to rapidly change as you explore the data.
-    #   See the second transform in {
+    #   See the second transform in {https://github.com/coldnebo/data_hut/blob/master/samples/league_of_legends.rb#L102 samples/league_of_legends.rb:102}.
     # @yield [record] lets you modify the DataHut record
     # @yieldparam record an OpenStruct that fronts the DataHut record. You may access existing fields on this record or create new
     #   fields to store synthetic data from a transform pass.
@@ -110,6 +116,7 @@ module DataHut
     # See {http://sequel.rubyforge.org/rdoc/files/doc/schema_modification_rdoc.html Sequel Schema Modification Methods} for
     #   more information about supported ruby data types you can use.
     # @raise [ArgumentError] if you don't provide a block
+    # @return [void]
     def transform(forced=false)
       raise(ArgumentError, "a block is required for transform.", caller) unless block_given?
 
@@ -126,10 +133,10 @@ module DataHut
         r = OpenStruct.new(h)
         # and let the transformer modify it...
         yield r
-        # now add any new transformation fields to the schema...
-        adapt_schema(r)
         # get the update hash from the openstruct
-        h = r.marshal_dump
+        h = ostruct_to_hash(r)
+        # now add any new transformation fields to the schema...
+        adapt_schema(h)
         # and use it to update the record
         @db[:data_warehouse].where(dw_id: dw_id).update(h)
       end
@@ -144,6 +151,8 @@ module DataHut
     #   transform_complete (marks the update complete)
     #   dh.dataset is used to visualize graphs with d3.js
     # end
+    #
+    # @return [void]
     def transform_complete
       @db[:data_warehouse].update(:dw_processed => true)
     end
@@ -156,17 +165,21 @@ module DataHut
     #
     # @param logger [Logger] a logger for the underlying Sequel actions.
     # @raise [ArgumentError] if passed a logger that is not a kind of {http://www.ruby-doc.org/stdlib-1.9.3//libdoc/logger/rdoc/Logger.html Logger}.
+    # @return [void]
     def logger=(logger)
       raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
       @db.logger = logger
     end
 
-
-
-    # stores metadata
+    # stores any Ruby object as metadata in the datahut.
     #
-    # @param key [Symbol] to
-    # @param value [Object] ruby object to store
+    # @param key [Symbol] to reference the metadata by
+    # @param value [Object] ruby object to store in metadata
+    # @return [void]
+    # @note Because the datastore can support any Ruby object (including custom ones) it is up to
+    #   the caller to make sure that custom classes are in context before storage and fetch. i.e. if you
+    #   store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error.
+    #   For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
     def store_meta(key, value)
       key = key.to_s if key.instance_of?(Symbol)
       begin
@@ -177,14 +190,18 @@ module DataHut
           @db[:data_warehouse_meta].insert(key: key, value: value)
         end
       rescue Exception => e
-        raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}.", caller)
+        raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}: #{e.message}", caller)
       end
     end
 
-    # retrieves
+    # retrieves any Ruby object stored as metadata.
     #
     # @param key [Symbol] to lookup the metadata by
-    # @return [Object] ruby object that was fetched
+    # @return [Object] ruby object that was fetched from metadata
+    # @note Because the datastore can support any Ruby object (including custom ones) it is up to
+    #   the caller to make sure that custom classes are in context before storage and fetch. i.e. if you
+    #   store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error.
+    #   For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
     def fetch_meta(key)
       key = key.to_s if key.instance_of?(Symbol)
       begin
@@ -192,11 +209,29 @@ module DataHut
         value = r[:value] unless r.nil?
         value = Marshal.load(value) unless value.nil?
       rescue Exception => e
-        raise(
+        raise(RuntimeError, "DataHut: unable to fetch metadata key #{key}: #{e.message}", caller)
       end
       value
     end
 
+    # used to determine if the specified fields and values are unique in the datahut.
+    #
+    # @example
+    #   dh.extract(data) do |r, d|
+    #     next if dh.not_unique(name: d[:name])
+    #     r.name = d[:name]
+    #     r.age = d[:age]
+    #     ...
+    #   end
+    #
+    # @note exactly duplicate records are automatically skipped at the end of an extract iteration (see {#extract}). This
+    #   method is useful if an extract iteration takes a long time and you want to skip duplicates early in the iteration.
+    # @param hash [Hash] of the key, value pairs specifying a partial record by which to consider records unique.
+    # @return [Boolean] true if the {field: value} already exists, false otherwise (including if the column doesn't yet exist.)
+    def not_unique(hash)
+      @db[:data_warehouse].where(hash).count > 0 rescue false
+    end
+
     private
 
     def initialize(name)
@@ -221,16 +256,20 @@ module DataHut
     end
 
     def store(r)
-      adapt_schema(r)
-      h = r.marshal_dump
+      h = ostruct_to_hash(r)
+      adapt_schema(h)
       # don't insert dups
-      unless @db[:data_warehouse].where(h).count > 0
+      unless not_unique(h)
         @db[:data_warehouse].insert(h)
       end
     end
 
-    def adapt_schema(r)
+    def ostruct_to_hash(r)
       h = r.marshal_dump
+      h.reject{|k,v| v.nil?} # you can't define a column type "NilClass", so strip these before adapting the schema
+    end
+
+    def adapt_schema(h)
       h.keys.each do |key|
         type = h[key].class
         unless Sequel::Schema::CreateTableGenerator::GENERIC_TYPES.include?(type)
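The nil handling in 0.0.8 hinges on this small `ostruct_to_hash` helper; here is a minimal sketch of the OpenStruct behavior it relies on:

    require 'ostruct'

    r = OpenStruct.new(name: "fred", age: nil)
    h = r.marshal_dump            # => {:name=>"fred", :age=>nil}
    h.reject { |k, v| v.nil? }    # => {:name=>"fred"}
    # nil-valued fields never reach adapt_schema, so Sequel is never asked
    # to create a column of type NilClass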
data/lib/data_hut/version.rb
CHANGED
data/samples/basic.rb
CHANGED
data/samples/common/D3JS_LICENSE
ADDED
@@ -0,0 +1,26 @@
+Copyright (c) 2013, Michael Bostock
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* The name Michael Bostock may not be used to endorse or promote products
+  derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT,
+INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
+BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.