timescaledb 0.2.1 → 0.2.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. checksums.yaml +4 -4
  2. data/README.md +41 -9
  3. data/bin/console +1 -1
  4. data/bin/tsdb +2 -2
  5. data/docs/command_line.md +178 -0
  6. data/docs/img/lttb_example.png +0 -0
  7. data/docs/img/lttb_sql_vs_ruby.gif +0 -0
  8. data/docs/img/lttb_zoom.gif +0 -0
  9. data/docs/index.md +61 -0
  10. data/docs/migrations.md +69 -0
  11. data/docs/models.md +78 -0
  12. data/docs/toolkit.md +394 -0
  13. data/docs/toolkit_lttb_tutorial.md +557 -0
  14. data/docs/toolkit_lttb_zoom.md +357 -0
  15. data/docs/videos.md +16 -0
  16. data/examples/all_in_one/all_in_one.rb +39 -5
  17. data/examples/all_in_one/benchmark_comparison.rb +108 -0
  18. data/examples/all_in_one/caggs.rb +93 -0
  19. data/examples/all_in_one/query_data.rb +78 -0
  20. data/examples/toolkit-demo/compare_volatility.rb +64 -0
  21. data/examples/toolkit-demo/lttb/README.md +15 -0
  22. data/examples/toolkit-demo/lttb/lttb.rb +92 -0
  23. data/examples/toolkit-demo/lttb/lttb_sinatra.rb +139 -0
  24. data/examples/toolkit-demo/lttb/lttb_test.rb +21 -0
  25. data/examples/toolkit-demo/lttb/views/index.erb +27 -0
  26. data/examples/toolkit-demo/lttb-zoom/README.md +13 -0
  27. data/examples/toolkit-demo/lttb-zoom/lttb_zoomable.rb +90 -0
  28. data/examples/toolkit-demo/lttb-zoom/views/index.erb +33 -0
  29. data/lib/timescaledb/acts_as_time_vector.rb +18 -0
  30. data/lib/timescaledb/dimensions.rb +1 -0
  31. data/lib/timescaledb/hypertable.rb +5 -1
  32. data/lib/timescaledb/migration_helpers.rb +11 -0
  33. data/lib/timescaledb/stats_report.rb +1 -1
  34. data/lib/timescaledb/toolkit/helpers.rb +20 -0
  35. data/lib/timescaledb/toolkit/time_vector.rb +66 -0
  36. data/lib/timescaledb/toolkit.rb +3 -0
  37. data/lib/timescaledb/version.rb +1 -1
  38. data/lib/timescaledb.rb +1 -0
  39. data/mkdocs.yml +33 -0
  40. metadata +30 -4
  41. data/examples/all_in_one/Gemfile +0 -11
  42. data/examples/all_in_one/Gemfile.lock +0 -51
data/docs/toolkit.md ADDED
@@ -0,0 +1,394 @@
+ # The TimescaleDB Toolkit
+
+ The [TimescaleDB Toolkit][1] is an extension from [Timescale][2] that provides
+ extra hyperfunctions, fully compatible with TimescaleDB and PostgreSQL.
+
+ Its functions have almost no dependency on hypertables, but they play very well
+ in the hypertable ecosystem. The mission of the toolkit team is to ease all
+ things analytics when using TimescaleDB, with a particular focus on developer
+ ergonomics and performance.
+
+ Here, we'll walk through some of the toolkit functions and the helpers that can
+ simplify the generation of some complex queries.
+
+ !!!warning
+
+     Note that we're just starting the toolkit integration in the gem and several
+     functions are still experimental.
+
+ ## The `add_toolkit_to_search_path!` helper
+
+ Several functions in the toolkit are still in an experimental phase, and for
+ that reason they don't live in the public schema, but in the
+ `toolkit_experimental` schema.
+
+ To use them without worrying about the schema, or prefixing it everywhere, you
+ can add the schema to the [search_path][3].
+
+ To make this easy on the Ruby side, you can call the method directly from the
+ ActiveRecord connection:
+
+ ```ruby
+ ActiveRecord::Base.connection.add_toolkit_to_search_path!
+ ```
+
+ This statement adds the [toolkit_experimental][4] schema to the search path,
+ alongside the `public` schema and the `$user` variable path.
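+
+ If you want to double-check the effect, a quick sketch (plain ActiveRecord,
+ nothing toolkit-specific) is to print the resulting `search_path`:
+
+ ```ruby
+ conn = ActiveRecord::Base.connection
+ conn.add_toolkit_to_search_path!
+ # The output should now include toolkit_experimental
+ # alongside "$user" and public.
+ puts conn.select_value("SHOW search_path")
+ ```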
+
+ The statement can be placed right before your usage of the toolkit. For example,
+ if a single controller in your Rails app will be using it, you can create a
+ [filter][5] in the controller to set it up before your action runs.
+
+ ```ruby
+ class StatisticsController < ActionController::Base
+   before_action :add_timescale_toolkit, only: [:complex_query]
+
+   def complex_query
+     # some code that uses the toolkit functions
+   end
+
+   protected
+
+   def add_timescale_toolkit
+     ActiveRecord::Base.connection.add_toolkit_to_search_path!
+   end
+ end
+ ```
+
+ ## Example from scratch to use the Toolkit functions
+
+ Let's work through an example based on the [volatility][6] algorithm. This
+ example is inspired by the [function pipelines][7] blog post, which shows how
+ to calculate volatility and then how to do the same with the toolkit's
+ function pipelines.
+
+ !!!success
+
+     Reading the [blog post][7] before trying this is highly recommended; it
+     will give you more insight into how to apply and use time vectors, which
+     are our next topic.
+
+ Let's start by creating the `measurements` hypertable using a regular migration:
+
+ ```ruby
+ class CreateMeasurements < ActiveRecord::Migration[7.0]
+   def change
+     hypertable_options = {
+       time_column: 'ts',
+       chunk_time_interval: '1 day',
+     }
+     create_table :measurements, hypertable: hypertable_options, id: false do |t|
+       t.integer :device_id
+       t.decimal :val
+       t.timestamp :ts
+     end
+   end
+ end
+ ```
+
+ In this example, we just have a hypertable with no compression options. Every
+ `1 day`, a new child table, aka [chunk][8], will be generated. No compression
+ options for now.
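+
+ As a sanity check (not required for the rest of the walkthrough), you can list
+ the chunks generated so far with TimescaleDB's standard `show_chunks` function:
+
+ ```ruby
+ # Each row is the qualified name of a child table (chunk),
+ # e.g. "_timescaledb_internal._hyper_1_1_chunk".
+ ActiveRecord::Base.connection.select_values("SELECT show_chunks('measurements')")
+ ```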
+
+ Now, let's add the model `app/models/measurement.rb`:
+
+ ```ruby
+ class Measurement < ActiveRecord::Base
+   self.primary_key = 'device_id'
+
+   acts_as_hypertable time_column: "ts"
+ end
+ ```
+
+ At this moment, you can jump into the Rails console and start testing the model.
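+
+ For example, a quick read-only smoke test (illustrative only):
+
+ ```ruby
+ Measurement.table_name # => "measurements"
+ Measurement.count      # => 0
+ ```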
+
+ ## Seeding some data
+
+ Before we build a very complex example, let's build something that is easy to
+ follow and comprehend. Let's create 3 records for the same device, representing
+ an hourly measurement from some sensor.
+
+ ```ruby
+ yesterday = 1.day.ago
+ [1, 2, 3].each_with_index do |v, i|
+   Measurement.create(device_id: 1, ts: yesterday + i.hour, val: v)
+ end
+ ```
+
+ The values progress from 1 to 3. Now we can build a query to fetch the values
+ and then compute the volatility in plain Ruby.
+
+ ```ruby
+ values = Measurement.order(:ts).pluck(:val) # => [1, 2, 3]
+ ```
+
+ Using plain Ruby, we can build this example with a few lines of code:
+
+ ```ruby
+ previous = nil
+ volatilities = values.map do |value|
+   if previous
+     delta = (value - previous).abs
+     volatility = delta
+   end
+   previous = value
+   volatility # nil on the first iteration, the absolute delta afterwards
+ end
+ # volatilities => [nil, 1, 1]
+ volatility = volatilities.compact.sum # => 2
+ ```
+
+ The `compact` step can be skipped, and we can also accumulate the sum in the
+ same loop. So, a refactored version would be:
+
+ ```ruby
+ previous = nil
+ volatility = 0
+ values.each do |value|
+   if previous
+     delta = (value - previous).abs
+     volatility += delta
+   end
+   previous = value
+ end
+ volatility # => 2
+ ```
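+
+ As a side note, the same computation fits on one line of idiomatic Ruby with
+ `each_cons` (not from the original post, just a compact equivalent):
+
+ ```ruby
+ # Pair each value with its successor and sum the absolute deltas.
+ volatility = values.each_cons(2).sum { |a, b| (b - a).abs } # => 2
+ ```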
+
+ Now it's time to move the calculation to the database, computing the volatility
+ in plain PostgreSQL. A subquery is required to build the calculated delta, so
+ it looks a bit more involved:
+
+ ```ruby
+ delta = Measurement.select("device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta")
+ Measurement
+   .select("device_id, sum(abs_delta) as volatility")
+   .from("(#{delta.to_sql}) as calc_delta")
+   .group('device_id')
+ ```
+
+ The final query for the example above looks like this:
+
+ ```sql
+ SELECT device_id, sum(abs_delta) as volatility
+ FROM (
+   SELECT device_id,
+     abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
+   FROM "measurements"
+ ) as calc_delta
+ GROUP BY device_id
+ ```
+
+ The raw SQL is much harder to understand than the Ruby version, so now let's
+ reproduce the same example using the toolkit pipelines:
+
+ ```ruby
+ Measurement
+   .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
+   .group("device_id")
+ ```
+
+ As you can see, it's much easier to read and digest. Now, let's take a look at
+ how we can generate the queries using the scopes injected by the
+ `acts_as_time_vector` macro.
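+
+ If you're curious about what this relation sends to the server, `to_sql` shows
+ the statement (the same one that appears in the debug logs later on):
+
+ ```ruby
+ Measurement
+   .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
+   .group("device_id")
+   .to_sql
+ # => SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
+ #    FROM "measurements" GROUP BY "measurements"."device_id"
+ ```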
+
+ ## Adding the `acts_as_time_vector` macro
+
+ Let's change the model to add the `acts_as_time_vector` macro, which saves us
+ from repeating the parameters of the `timevector(ts, val)` call.
+
+ ```ruby
+ class Measurement < ActiveRecord::Base
+   self.primary_key = 'device_id'
+
+   acts_as_hypertable time_column: "ts"
+
+   acts_as_time_vector segment_by: "device_id",
+                       value_column: "val",
+                       time_column: "ts"
+ end
+ ```
+
+ If you skip the `time_column` option in `acts_as_time_vector`, it will inherit
+ the value from `acts_as_hypertable`. It's explicit here for the sake of keeping
+ the macros independent.
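+
+ So, in this particular model, a shorter form relying on that inheritance would
+ behave the same:
+
+ ```ruby
+ # time_column is omitted and inherited from acts_as_hypertable ("ts").
+ acts_as_time_vector segment_by: "device_id", value_column: "val"
+ ```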
+
+ Now that we have it, let's create a scope using it:
+
+ ```ruby
+ class Measurement < ActiveRecord::Base
+   acts_as_hypertable time_column: "ts"
+   acts_as_time_vector segment_by: "device_id",
+                       value_column: "val",
+                       time_column: "ts"
+
+   scope :volatility, -> do
+     select("device_id, timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility")
+       .group("device_id")
+   end
+ end
+ ```
+
+ Now we have created the `volatility` scope, always grouping by `device_id`.
+
+ The Toolkit helpers ship a similar version, which also applies a default
+ segmentation based on the `segment_by` configuration given to the
+ `acts_as_time_vector` macro. A `segment_by_column` method is added to access
+ this configuration, so a small change makes the volatility scope fully
+ configurable:
+
+ ```ruby
+ class Measurement < ActiveRecord::Base
+   # ... skipping previous code to focus on the example
+
+   acts_as_time_vector segment_by: "device_id",
+                       value_column: "val",
+                       time_column: "ts"
+
+   scope :volatility, -> (columns = segment_by_column) do
+     _scope = select([*columns,
+       "timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility"
+     ].join(", "))
+     _scope = _scope.group(columns) if columns
+     _scope
+   end
+ end
+ ```
+
+ Testing the method:
+
+ ```ruby
+ Measurement.volatility.map(&:attributes)
+ # DEBUG -- : Measurement Load (1.6ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id"
+ # => [{"device_id"=>1, "volatility"=>8.0}]
+ ```
+
+ Let's add a few more records with random values:
+
+ ```ruby
+ yesterday = 1.day.ago
+ (2..6).each do |d|
+   (1..10).each do |j|
+     Measurement.create(device_id: d, ts: yesterday + j.hour, val: rand(10))
+   end
+ end
+ ```
+
+ Testing with all the values:
+
+ ```ruby
+ Measurement.order("device_id").volatility.map(&:attributes)
+ # DEBUG -- : Measurement Load (1.3ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id" ORDER BY device_id
+ # => [{"device_id"=>1, "volatility"=>8.0},
+ #     {"device_id"=>2, "volatility"=>24.0},
+ #     {"device_id"=>3, "volatility"=>30.0},
+ #     {"device_id"=>4, "volatility"=>32.0},
+ #     {"device_id"=>5, "volatility"=>44.0},
+ #     {"device_id"=>6, "volatility"=>23.0}]
+ ```
+
+ If the parameter is explicitly `nil`, it will not group:
+
+ ```ruby
+ Measurement.volatility(nil).map(&:attributes)
+ # DEBUG -- : Measurement Load (5.4ms)  SELECT timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements"
+ # => [{"volatility"=>186.0, "device_id"=>nil}]
+ ```
+
+ ## Comparing with the Ruby version
+
+ Now it's time to benchmark and compare the Ruby and PostgreSQL solutions,
+ verifying which is faster:
+
+ ```ruby
+ class Measurement < ActiveRecord::Base
+   # code you already know
+
+   # Computes the volatility per device in Ruby, walking over all the records.
+   scope :volatility_by_device_id, -> {
+     volatility = Hash.new(0)
+     previous = Hash.new
+     find_all do |measurement|
+       device_id = measurement.device_id
+       if previous[device_id]
+         delta = (measurement.val - previous[device_id]).abs
+         volatility[device_id] += delta
+       end
+       previous[device_id] = measurement.val
+     end
+     volatility
+   }
+ end
+ ```
+
+ Now, benchmarking the real time (in milliseconds) it takes to compute it in Ruby:
+
+ ```ruby
+ Benchmark.measure { Measurement.volatility_by_device_id }.real * 1000
+ # => 3.021999917924404
+ ```
+
+ ## Seeding massive data
+
+ Now, let's use `generate_series` to quickly insert a lot of records directly
+ into the database.
+
+ Let's agree on some numbers for a good start: data for six devices emitting
+ values every 5 minutes over a month, which will generate around 50k records.
+
+ Let's use some plain SQL to insert the records now:
+
+ ```ruby
+ sql = "INSERT INTO measurements (ts, device_id, val)
+   SELECT ts, device_id, random()*80
+   FROM generate_series(TIMESTAMP '2022-01-01 00:00:00',
+                        TIMESTAMP '2022-02-01 00:00:00',
+                        INTERVAL '5 minutes') AS g1(ts),
+        generate_series(0, 5) AS g2(device_id);
+ "
+ ActiveRecord::Base.connection.execute(sql)
+ ```
+
+ On my macOS M1 machine, it took less than a second to insert the 53k records:
+
+ ```ruby
+ # DEBUG (177.5ms)  INSERT INTO measurements (ts, device_id, val) ..
+ # => #<PG::Result:0x00007f8152034168 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=53574>
+ ```
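+
+ A quick way to confirm from Ruby that everything landed: the count over the
+ insert window matches the `cmd_tuples` above.
+
+ ```ruby
+ Measurement.where(ts: Time.utc(2022, 1, 1)..Time.utc(2022, 2, 1)).count
+ # => 53574
+ ```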
+
+ Now, let's compare the time it takes to process the volatility:
+
+ ```ruby
+ Benchmark.bm do |x|
+   x.report("ruby") { pp Measurement.volatility_by_device_id }
+   x.report("sql")  { pp Measurement.volatility("device_id").map(&:attributes) }
+ end
+ #        user     system      total        real
+ # ruby   0.612439   0.061890   0.674329 (  0.727590)
+ # sql    0.001142   0.000301   0.001443 (  0.060301)
+ ```
+
+ Calculating the performance ratio, `0.72 / 0.06` shows that SQL is about 12
+ times faster than Ruby at processing the volatility 🎉
+
+ And consider that this was against localhost, with no network in between to
+ ship all the records over the wire. Moving to a remote host, look at the
+ numbers:
+
+ !!!warning
+
+     Note that the previous numbers were measured against localhost. Using a
+     remote connection between different regions, and comparing the `total` CPU
+     time below, the Ruby version looks ~500 times more expensive than SQL.
+
+ ```
+        user     system      total        real
+ ruby   0.716321   0.041640   0.757961 (  6.388881)
+ sql    0.001156   0.000177   0.001333 (  0.161270)
+ ```
+
+ [1]: https://github.com/timescale/timescaledb-toolkit
+ [2]: https://timescale.com
+ [3]: https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-SEARCH-PATH
+ [4]: https://github.com/timescale/timescaledb-toolkit/blob/main/docs/README.md#a-note-on-tags-
+ [5]: https://guides.rubyonrails.org/action_controller_overview.html#filters
+ [6]: https://en.wikipedia.org/wiki/Volatility_(finance)
+ [7]: https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/
+ [8]: https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks