timescaledb 0.2.1 → 0.2.2

Files changed (42)
  1. checksums.yaml +4 -4
  2. data/README.md +41 -9
  3. data/bin/console +1 -1
  4. data/bin/tsdb +2 -2
  5. data/docs/command_line.md +178 -0
  6. data/docs/img/lttb_example.png +0 -0
  7. data/docs/img/lttb_sql_vs_ruby.gif +0 -0
  8. data/docs/img/lttb_zoom.gif +0 -0
  9. data/docs/index.md +61 -0
  10. data/docs/migrations.md +69 -0
  11. data/docs/models.md +78 -0
  12. data/docs/toolkit.md +394 -0
  13. data/docs/toolkit_lttb_tutorial.md +557 -0
  14. data/docs/toolkit_lttb_zoom.md +357 -0
  15. data/docs/videos.md +16 -0
  16. data/examples/all_in_one/all_in_one.rb +39 -5
  17. data/examples/all_in_one/benchmark_comparison.rb +108 -0
  18. data/examples/all_in_one/caggs.rb +93 -0
  19. data/examples/all_in_one/query_data.rb +78 -0
  20. data/examples/toolkit-demo/compare_volatility.rb +64 -0
  21. data/examples/toolkit-demo/lttb/README.md +15 -0
  22. data/examples/toolkit-demo/lttb/lttb.rb +92 -0
  23. data/examples/toolkit-demo/lttb/lttb_sinatra.rb +139 -0
  24. data/examples/toolkit-demo/lttb/lttb_test.rb +21 -0
  25. data/examples/toolkit-demo/lttb/views/index.erb +27 -0
  26. data/examples/toolkit-demo/lttb-zoom/README.md +13 -0
  27. data/examples/toolkit-demo/lttb-zoom/lttb_zoomable.rb +90 -0
  28. data/examples/toolkit-demo/lttb-zoom/views/index.erb +33 -0
  29. data/lib/timescaledb/acts_as_time_vector.rb +18 -0
  30. data/lib/timescaledb/dimensions.rb +1 -0
  31. data/lib/timescaledb/hypertable.rb +5 -1
  32. data/lib/timescaledb/migration_helpers.rb +11 -0
  33. data/lib/timescaledb/stats_report.rb +1 -1
  34. data/lib/timescaledb/toolkit/helpers.rb +20 -0
  35. data/lib/timescaledb/toolkit/time_vector.rb +66 -0
  36. data/lib/timescaledb/toolkit.rb +3 -0
  37. data/lib/timescaledb/version.rb +1 -1
  38. data/lib/timescaledb.rb +1 -0
  39. data/mkdocs.yml +33 -0
  40. metadata +30 -4
  41. data/examples/all_in_one/Gemfile +0 -11
  42. data/examples/all_in_one/Gemfile.lock +0 -51
data/docs/toolkit.md ADDED
# The TimescaleDB Toolkit

The [TimescaleDB Toolkit][1] is an extension brought to you by [Timescale][2],
adding more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL.

The toolkit functions have almost no dependency on hypertables, but they play
very well in the hypertable ecosystem. The mission of the toolkit team is to
ease all things analytics when using TimescaleDB, with a particular focus on
developer ergonomics and performance.

Here, we're going to walk through some of the toolkit functions and the
helpers that can simplify the generation of some complex queries.

!!!warning

    Note that we're just starting the toolkit integration in the gem and
    several functions are still experimental.

## The `add_toolkit_to_search_path!` helper

Several functions in the toolkit are still in an experimental phase, and for
that reason they don't live in the public schema, but in the
`toolkit_experimental` schema.

To use them without worrying about the schema, or prefixing it everywhere, you
can add the schema to the [search_path][3].

To make this easy on the Ruby side, you can call the method directly from the
ActiveRecord connection:

```ruby
ActiveRecord::Base.connection.add_toolkit_to_search_path!
```

This statement adds the [toolkit_experimental][4] schema to the search path,
alongside `public` and the `$user` variable path.

The statement can be placed right before your usage of the toolkit. For example,
if a single controller in your Rails app will be using it, you can create a
[filter][5] in the controller to set it up before your action runs.

```ruby
class StatisticsController < ActionController::Base
  before_action :add_timescale_toolkit, only: [:complex_query]

  def complex_query
    # some code that uses the toolkit functions
  end

  protected

  def add_timescale_toolkit
    ActiveRecord::Base.connection.add_toolkit_to_search_path!
  end
end
```

## Example from scratch using the Toolkit functions

Let's start by working through an example with the [volatility][6] algorithm.
This example is inspired by the [function pipelines][7] blog post, which shows
how to calculate volatility and then how to do the same thing with the
toolkit's function pipelines.

!!!success

    Reading the [blog post][7] before trying this is highly recommended,
    and will give you more insight into how to apply and use time vectors,
    which are our next topic.

Let's start by creating the `measurements` hypertable using a regular migration:

```ruby
class CreateMeasurements < ActiveRecord::Migration[7.0]
  def change
    hypertable_options = {
      time_column: 'ts',
      chunk_time_interval: '1 day',
    }
    create_table :measurements, hypertable: hypertable_options, id: false do |t|
      t.integer :device_id
      t.decimal :val
      t.timestamp :ts
    end
  end
end
```

In this example, we have a hypertable with no compression options. Every
`1 day`, a new child table, aka [chunk][8], will be generated.

Now, let's add the model `app/models/measurement.rb`:

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"
end
```

At this moment, you can jump into the Rails console and start testing the model.

## Seeding some data

Before we build a very complex example, let's build something that is easy to
follow and comprehend. Let's create 3 records for the same device, representing
an hourly measurement from some sensor.

```ruby
yesterday = 1.day.ago
[1, 2, 3].each_with_index do |v, i|
  Measurement.create(device_id: 1, ts: yesterday + i.hour, val: v)
end
```

Every value is a progression from 1 to 3. Now we can build a query to fetch the
values, and then build the example using plain Ruby.

```ruby
values = Measurement.order(:ts).pluck(:val) # => [1, 2, 3]
```

Using plain Ruby, we can build this example with a few lines of code:

```ruby
previous = nil
volatilities = values.map do |value|
  if previous
    delta = (value - previous).abs
    volatility = delta
  end
  previous = value
  volatility
end
# volatilities => [nil, 1, 1]
volatility = volatilities.compact.sum # => 2
```

The `compact` step can be skipped, and we can also build the sum in the same
loop. So a refactored version would be:

```ruby
previous = nil
volatility = 0
values.each do |value|
  if previous
    delta = (value - previous).abs
    volatility += delta
  end
  previous = value
end
volatility # => 2
```
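
The same loop can be written more idiomatically with Ruby's `each_cons`, which
yields consecutive pairs. This is a small aside, not part of the original
example:

```ruby
values = [1, 2, 3]

# each_cons(2) yields [1, 2] and [2, 3]; sum the absolute deltas
volatility = values.each_cons(2).sum { |a, b| (b - a).abs }
volatility # => 2
```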

Now it's time to move the calculation to the database level, computing the
volatility with plain PostgreSQL. A subquery is required to build the
calculated delta, so it looks a bit more confusing:

```ruby
delta = Measurement.select("device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta")
Measurement
  .select("device_id, sum(abs_delta) as volatility")
  .from("(#{delta.to_sql}) as calc_delta")
  .group("device_id")
```

The final query for the example above looks like this:

```sql
SELECT device_id, sum(abs_delta) as volatility
FROM (
  SELECT device_id,
    abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
  FROM "measurements"
) as calc_delta
GROUP BY device_id
```

Going with plain SQL makes the example much harder to understand. Now let's
reproduce the same example using the toolkit pipelines:

```ruby
Measurement
  .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
  .group("device_id")
```
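
The arrow pipeline reads much like Ruby's own function composition. As a loose
analogy only (pure Ruby, not toolkit code, and assuming the values are already
sorted by time), each step maps to a lambda composed with `>>`:

```ruby
# Mirrors delta() -> abs() -> sum() from the toolkit pipeline
deltas = ->(vals) { vals.each_cons(2).map { |a, b| b - a } }
abs    = ->(vals) { vals.map(&:abs) }
sum    = ->(vals) { vals.sum }

pipeline = deltas >> abs >> sum
pipeline.call([1, 3, 2]) # => 3
```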

As you can see, it's much easier to read and digest. Now let's take a look at
how we can generate the queries using the scopes injected by the
`acts_as_time_vector` macro.

## Adding the `acts_as_time_vector` macro

Let's start by changing the model to add `acts_as_time_vector`, which allows
us to avoid repeating the parameters of the `timevector(ts, val)` call.

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"
end
```

If you skip the `time_column` option in `acts_as_time_vector`, it will inherit
the value from `acts_as_hypertable`. I'm making it explicit here for the sake
of keeping the macros independent.

Now that we have it, let's create a scope for it:

```ruby
class Measurement < ActiveRecord::Base
  acts_as_hypertable time_column: "ts"
  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> do
    select("device_id, timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility")
      .group("device_id")
  end
end
```

Now we have created the volatility scope, always grouping by `device_id`.

The Toolkit helpers include a similar version, which also adds a default
segmentation based on the `segment_by` configuration given to the
`acts_as_time_vector` macro. A `segment_by_column` method is added to access
this configuration, so we can make a small change that makes the volatility
scope easier to understand.

```ruby
class Measurement < ActiveRecord::Base
  # ... Skipping previous code to focus on the example

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> (columns = segment_by_column) do
    _scope = select([*columns,
      "timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility"
    ].join(", "))
    _scope = _scope.group(columns) if columns
    _scope
  end
end
```

Testing the method:

```ruby
Measurement.volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.6ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id"
# => [{"device_id"=>1, "volatility"=>8.0}]
```

Let's add a few more records with random values:

```ruby
yesterday = 1.day.ago
(2..6).each do |d|
  (1..10).each do |j|
    Measurement.create(device_id: d, ts: yesterday + j.hour, val: rand(10))
  end
end
```

Testing all the values:

```ruby
Measurement.order("device_id").volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.3ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id" ORDER BY device_id
# => [{"device_id"=>1, "volatility"=>8.0},
#  {"device_id"=>2, "volatility"=>24.0},
#  {"device_id"=>3, "volatility"=>30.0},
#  {"device_id"=>4, "volatility"=>32.0},
#  {"device_id"=>5, "volatility"=>44.0},
#  {"device_id"=>6, "volatility"=>23.0}]
```

If the parameter is explicitly `nil`, it will not group:

```ruby
Measurement.volatility(nil).map(&:attributes)
# DEBUG -- : Measurement Load (5.4ms)  SELECT timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements"
# => [{"volatility"=>186.0, "device_id"=>nil}]
```
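
Note that the ungrouped result is not simply the sum of the per-device
volatilities: without segmentation, the deltas are taken between neighboring
points from *different* devices after sorting by time. A plain-Ruby sketch
(with made-up values, not the seeded data) shows why:

```ruby
# volatility = sum of absolute deltas between consecutive values
vol = ->(vals) { vals.each_cons(2).sum { |a, b| (b - a).abs } }

device1 = [1, 2, 3]    # per-device volatility: 2
device2 = [10, 20, 30] # per-device volatility: 20
vol.call(device1) + vol.call(device2) # => 22

# Interleaved by timestamp, the two series cross each other constantly,
# so the ungrouped volatility is much larger than the sum above:
merged = [1, 10, 2, 20, 3, 30]
vol.call(merged) # => 79
```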

## Comparing with the Ruby version

Now it's time to benchmark and compare the Ruby and PostgreSQL solutions, to
verify which is faster:

```ruby
class Measurement < ActiveRecord::Base
  # code you already know

  scope :volatility_by_device_id, -> {
    volatility = Hash.new(0)
    previous = Hash.new
    find_all do |measurement|
      device_id = measurement.device_id
      if previous[device_id]
        delta = (measurement.val - previous[device_id]).abs
        volatility[device_id] += delta
      end
      previous[device_id] = measurement.val
    end
    volatility
  }
end
```
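
The core of this scope is plain Ruby, so it can be exercised without a
database. A self-contained sketch over hypothetical `[device_id, val]` pairs
(already ordered by timestamp; not the seeded data):

```ruby
# Hypothetical rows, ordered by timestamp: [device_id, val]
rows = [[1, 1], [2, 10], [1, 2], [2, 20], [1, 3], [2, 30]]

volatility = Hash.new(0)
previous = {}
rows.each do |device_id, val|
  # accumulate the absolute delta against the previous value of the same device
  volatility[device_id] += (val - previous[device_id]).abs if previous[device_id]
  previous[device_id] = val
end
volatility # => {1=>2, 2=>20}
```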

Now, benchmark the wall-clock time to compute it in Ruby, in milliseconds:

```ruby
Benchmark.measure { Measurement.volatility_by_device_id }.real * 1000
# => 3.021999917924404
```

## Seeding massive data

Now let's use `generate_series` to quickly insert a lot of records directly
into the database.

Let's agree on some numbers for a good start: data for 6 devices (ids 0
through 5) emitting values every 5 minutes for a month, which will generate
around 50k records.

Let's use some plain SQL to insert the records now:

```ruby
sql = "INSERT INTO measurements (ts, device_id, val)
  SELECT ts, device_id, random() * 80
  FROM generate_series(TIMESTAMP '2022-01-01 00:00:00',
                       TIMESTAMP '2022-02-01 00:00:00',
                       INTERVAL '5 minutes') AS g1(ts),
       generate_series(0, 5) AS g2(device_id);"
ActiveRecord::Base.connection.execute(sql)
```
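
A quick sanity check on the expected row count: one month of 5-minute
timestamps (`generate_series` includes both endpoints) times 6 devices:

```ruby
# 31 days from 2022-01-01 00:00 to 2022-02-01 00:00, every 5 minutes,
# plus 1 because both endpoints are included
timestamps = 31 * 24 * 12 + 1 # => 8929
devices = 6                   # generate_series(0, 5)
timestamps * devices          # => 53574
```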

On my Mac with an M1 processor, it took less than a second to insert the 53k
records:

```ruby
# DEBUG (177.5ms)  INSERT INTO measurements (ts, device_id, val) ..
# => #<PG::Result:0x00007f8152034168 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=53574>
```

Now let's compare the time to process the volatility:

```ruby
Benchmark.bm do |x|
  x.report("ruby") { pp Measurement.volatility_by_device_id }
  x.report("sql")  { pp Measurement.volatility("device_id").map(&:attributes) }
end
#           user     system      total        real
# ruby  0.612439   0.061890   0.674329 (  0.727590)
# sql   0.001142   0.000301   0.001443 (  0.060301)
```

Calculating the performance ratio, `0.72 / 0.06` shows that SQL is about 12
times faster than Ruby at processing the volatility 🎉

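The ratio can be checked directly from the `real` column of the benchmark
output above:

```ruby
local_ruby = 0.727590 # seconds, Ruby on localhost
local_sql  = 0.060301 # seconds, SQL on localhost
(local_ruby / local_sql).round # => 12
```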

Keep in mind this was all on localhost, with no network involved in shipping
the records over the wire. Moving to a remote host, look at the numbers:

!!!warning

    Note that the previous numbers were using localhost.
    Using a remote connection between different regions, Ruby gets even
    slower, while SQL barely changes: here Ruby is around 40 times slower
    than SQL.

                  user     system      total        real
        ruby  0.716321   0.041640   0.757961 (  6.388881)
        sql   0.001156   0.000177   0.001333 (  0.161270)

[1]: https://github.com/timescale/timescaledb-toolkit
[2]: https://timescale.com
[3]: https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-SEARCH-PATH
[4]: https://github.com/timescale/timescaledb-toolkit/blob/main/docs/README.md#a-note-on-tags-
[5]: https://guides.rubyonrails.org/action_controller_overview.html#filters
[6]: https://en.wikipedia.org/wiki/Volatility_(finance)
[7]: https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/
[8]: https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks