timescaledb 0.2.1 → 0.2.2
- checksums.yaml +4 -4
- data/README.md +41 -9
- data/bin/console +1 -1
- data/bin/tsdb +2 -2
- data/docs/command_line.md +178 -0
- data/docs/img/lttb_example.png +0 -0
- data/docs/img/lttb_sql_vs_ruby.gif +0 -0
- data/docs/img/lttb_zoom.gif +0 -0
- data/docs/index.md +61 -0
- data/docs/migrations.md +69 -0
- data/docs/models.md +78 -0
- data/docs/toolkit.md +394 -0
- data/docs/toolkit_lttb_tutorial.md +557 -0
- data/docs/toolkit_lttb_zoom.md +357 -0
- data/docs/videos.md +16 -0
- data/examples/all_in_one/all_in_one.rb +39 -5
- data/examples/all_in_one/benchmark_comparison.rb +108 -0
- data/examples/all_in_one/caggs.rb +93 -0
- data/examples/all_in_one/query_data.rb +78 -0
- data/examples/toolkit-demo/compare_volatility.rb +64 -0
- data/examples/toolkit-demo/lttb/README.md +15 -0
- data/examples/toolkit-demo/lttb/lttb.rb +92 -0
- data/examples/toolkit-demo/lttb/lttb_sinatra.rb +139 -0
- data/examples/toolkit-demo/lttb/lttb_test.rb +21 -0
- data/examples/toolkit-demo/lttb/views/index.erb +27 -0
- data/examples/toolkit-demo/lttb-zoom/README.md +13 -0
- data/examples/toolkit-demo/lttb-zoom/lttb_zoomable.rb +90 -0
- data/examples/toolkit-demo/lttb-zoom/views/index.erb +33 -0
- data/lib/timescaledb/acts_as_time_vector.rb +18 -0
- data/lib/timescaledb/dimensions.rb +1 -0
- data/lib/timescaledb/hypertable.rb +5 -1
- data/lib/timescaledb/migration_helpers.rb +11 -0
- data/lib/timescaledb/stats_report.rb +1 -1
- data/lib/timescaledb/toolkit/helpers.rb +20 -0
- data/lib/timescaledb/toolkit/time_vector.rb +66 -0
- data/lib/timescaledb/toolkit.rb +3 -0
- data/lib/timescaledb/version.rb +1 -1
- data/lib/timescaledb.rb +1 -0
- data/mkdocs.yml +33 -0
- metadata +30 -4
- data/examples/all_in_one/Gemfile +0 -11
- data/examples/all_in_one/Gemfile.lock +0 -51
data/docs/toolkit.md
ADDED
@@ -0,0 +1,394 @@

# The TimescaleDB Toolkit

The [TimescaleDB Toolkit][1] is an extension brought by [Timescale][2] for more
hyperfunctions, fully compatible with TimescaleDB and PostgreSQL.

The hyperfunctions have almost no dependency on hypertables, but they play very
well in the hypertable ecosystem. The mission of the toolkit team is to ease all
things analytics when using TimescaleDB, with a particular focus on developer
ergonomics and performance.

Here, we're going to walk through some of the toolkit functions and the helpers
that can simplify the generation of some complex queries.

!!!warning

    Note that we're just starting the toolkit integration in the gem and several
    functions are still experimental.

## The `add_toolkit_to_search_path!` helper

Several functions of the toolkit are still in the experimental phase, and for
that reason they don't live in the `public` schema, but in the
`toolkit_experimental` schema.

To use them without worrying about the schema, or prefixing every call with it,
you can add the schema to the [search_path][3].

To make this easy on the Ruby side, you can call the method directly on the
ActiveRecord connection:

```ruby
ActiveRecord::Base.connection.add_toolkit_to_search_path!
```

This statement adds the [toolkit_experimental][4] schema to the search path,
alongside `public` and the `$user` variable path.
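
You can double-check the effective setting right from the console. A minimal
sketch, assuming a configured ActiveRecord connection (the exact ordering may
differ in your environment):

```ruby
ActiveRecord::Base.connection.add_toolkit_to_search_path!
# SHOW is plain PostgreSQL; it reports the session's current search_path
ActiveRecord::Base.connection.select_value("SHOW search_path")
# => "\"$user\", public, toolkit_experimental"
```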

The statement can be placed right before your usage of the toolkit. For example,
if a single controller in your Rails app will be using it, you can create a
[filter][5] in the controller to set it up before your action runs.

```ruby
class StatisticsController < ActionController::Base
  before_action :add_timescale_toolkit, only: [:complex_query]

  def complex_query
    # some code that uses the toolkit functions
  end

  protected

  def add_timescale_toolkit
    ActiveRecord::Base.connection.add_toolkit_to_search_path!
  end
end
```

## Example from scratch to use the Toolkit functions

Let's work through an example based on the [volatility][6] algorithm.
This example is inspired by the [function pipelines][7] blog post, which shows
how to calculate volatility and then how the function pipelines accomplish the
same with the toolkit.

!!!success

    Reading the [blog post][7] before trying this is highly recommended,
    and it will give you more insights on how to apply and use time vectors,
    which are our next topic.

Let's start by creating the `measurements` hypertable using a regular migration:

```ruby
class CreateMeasurements < ActiveRecord::Migration
  def change
    hypertable_options = {
      time_column: 'ts',
      chunk_time_interval: '1 day',
    }
    create_table :measurements, hypertable: hypertable_options, id: false do |t|
      t.integer :device_id
      t.decimal :val
      t.timestamp :ts
    end
  end
end
```

In this example, we just have a hypertable with no compression options. Every
`1 day`, a new child table, aka [chunk][8], will be generated. No compression
options for now.

Now, let's add the model in `app/models/measurement.rb`:

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"
end
```

At this moment, you can jump into the Rails console and start testing the model.
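
For example, a quick smoke test might look like this (a sketch; `show_chunks`
is a standard TimescaleDB function, and chunks only appear after data is
inserted):

```ruby
Measurement.count # => 0

# No chunks exist yet; the first insert into a given 1-day range creates one:
ActiveRecord::Base.connection.select_values("SELECT show_chunks('measurements')")
# => []
```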

## Seeding some data

Before we build a very complex example, let's build something that is easy to
follow and comprehend. Let's create 3 records for the same device, representing
an hourly measurement from some sensor.

```ruby
yesterday = 1.day.ago
[1, 2, 3].each_with_index do |v, i|
  Measurement.create(device_id: 1, ts: yesterday + i.hour, val: v)
end
```

Every value is a progression from 1 to 3. Now, we can build a query to fetch the
values and then build the example in plain Ruby.

```ruby
values = Measurement.order(:ts).pluck(:val) # => [1, 2, 3]
```

Using plain Ruby, we can compute the volatility with a few lines of code:

```ruby
previous = nil
volatilities = values.map do |value|
  if previous
    delta = (value - previous).abs
    volatility = delta
  end
  previous = value
  volatility
end
# volatilities => [nil, 1, 1]
volatility = volatilities.compact.sum # => 2
```

The `compact` step can be skipped if we accumulate the sum in the same loop. A
refactored version would be:

```ruby
previous = nil
volatility = 0
values.each do |value|
  if previous
    delta = (value - previous).abs
    volatility += delta
  end
  previous = value
end
volatility # => 2
```

Now, it's time to move the calculation to the database level, computing the
volatility in plain PostgreSQL. A subquery is required to build the calculated
delta, so it looks a bit more convoluted:

```ruby
delta = Measurement.select("device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta")
Measurement
  .select("device_id, sum(abs_delta) as volatility")
  .from("(#{delta.to_sql}) as calc_delta")
  .group('device_id')
```

The final query for the example above looks like this:

```sql
SELECT device_id, sum(abs_delta) as volatility
FROM (
  SELECT device_id,
         abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
  FROM "measurements"
) as calc_delta
GROUP BY device_id
```

The plain SQL version is much harder to understand than the Ruby one. Now,
let's reproduce the same example using the toolkit pipelines:

```ruby
Measurement
  .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
  .group("device_id")
```

As you can see, it's much easier to read and digest this version. Now, let's
take a look at how we can generate such queries using the scopes injected by the
`acts_as_time_vector` macro.

## Adding the `acts_as_time_vector` macro

Let's start by changing the model to add the `acts_as_time_vector` macro, which
is here to allow us to not repeat the parameters of the `timevector(ts, val)`
call.

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"
end
```

If you skip the `time_column` option in the `acts_as_time_vector` call, it will
inherit the value from `acts_as_hypertable`. I'm making it explicit here for the
sake of keeping the macros independent.
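
So, relying on that fallback, the following declaration should be equivalent (a
sketch of the inheritance described above):

```ruby
# time_column is omitted and falls back to the acts_as_hypertable setting ("ts")
acts_as_time_vector segment_by: "device_id",
                    value_column: "val"
```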

Now that we have it, let's create a scope that uses this configuration:

```ruby
class Measurement < ActiveRecord::Base
  acts_as_hypertable time_column: "ts"
  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> do
    select("device_id, timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility")
      .group("device_id")
  end
end
```
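
A quick way to confirm what the scope builds is to inspect the generated SQL (a
sketch of a console session; the exact quoting may differ, and the same query
shows up in the DEBUG output below):

```ruby
puts Measurement.volatility.to_sql
# SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
# FROM "measurements" GROUP BY device_id
```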

Now we have created the volatility scope, which always groups by `device_id`.

In the Toolkit helpers, we have a similar version which also applies a default
segmentation based on the `segment_by` configuration given to the
`acts_as_time_vector` macro. A method `segment_by_column` is added to access
this configuration, so with a small change we can generalize the scope:

```ruby
class Measurement < ActiveRecord::Base
  # ... skipping previous code to focus on the example

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> (columns = segment_by_column) do
    _scope = select([*columns,
      "timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility"
    ].join(", "))
    _scope = _scope.group(columns) if columns
    _scope
  end
end
```

Testing the method:

```ruby
Measurement.volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.6ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id"
# => [{"device_id"=>1, "volatility"=>8.0}]
```

Let's add a few more records with random values:

```ruby
yesterday = 1.day.ago
(2..6).each do |d|
  (1..10).each do |j|
    Measurement.create(device_id: d, ts: yesterday + j.hour, val: rand(10))
  end
end
```

Testing all the values:

```ruby
Measurement.order("device_id").volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.3ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id" ORDER BY device_id
# => [{"device_id"=>1, "volatility"=>8.0},
#     {"device_id"=>2, "volatility"=>24.0},
#     {"device_id"=>3, "volatility"=>30.0},
#     {"device_id"=>4, "volatility"=>32.0},
#     {"device_id"=>5, "volatility"=>44.0},
#     {"device_id"=>6, "volatility"=>23.0}]
```

If the parameter is explicitly `nil`, it will not group at all:

```ruby
Measurement.volatility(nil).map(&:attributes)
# DEBUG -- : Measurement Load (5.4ms)  SELECT timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements"
# => [{"volatility"=>186.0, "device_id"=>nil}]
```

## Comparing with the Ruby version

Now, it's time to benchmark and compare the Ruby and the PostgreSQL solutions,
verifying which is faster:

```ruby
class Measurement < ActiveRecord::Base
  # code you already know

  scope :volatility_by_device_id, -> {
    volatility = Hash.new(0)
    previous = {}
    find_all do |measurement|
      device_id = measurement.device_id
      if previous[device_id]
        delta = (measurement.val - previous[device_id]).abs
        volatility[device_id] += delta
      end
      previous[device_id] = measurement.val
    end
    volatility
  }
end
```

Now, let's benchmark the wall-clock time, in milliseconds, to compute it in Ruby:

```ruby
Benchmark.measure { Measurement.volatility_by_device_id }.real * 1000
# => 3.021999917924404
```

## Seeding massive data

Now, let's use `generate_series` to quickly insert a lot of records directly
into the database.

Let's agree on some numbers to have a good start: six devices (ids 0 through 5)
emitting values every 5 minutes for a month, which will generate around 50k
records.

Let's use some plain SQL to insert the records now:

```ruby
sql = "INSERT INTO measurements (ts, device_id, val)
SELECT ts, device_id, random()*80
FROM generate_series(TIMESTAMP '2022-01-01 00:00:00',
                     TIMESTAMP '2022-02-01 00:00:00',
                     INTERVAL '5 minutes') AS g1(ts),
     generate_series(0, 5) AS g2(device_id);
"
ActiveRecord::Base.connection.execute(sql)
```

On my macOS machine with an M1 processor, it took less than a second to insert
the 53k records:

```ruby
# DEBUG (177.5ms)  INSERT INTO measurements (ts, device_id, val) ..
# => #<PG::Result:0x00007f8152034168 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=53574>
```
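
That `cmd_tuples=53574` matches what the series bounds predict. A quick sanity
check on the expected row count:

```ruby
# One month at a 5-minute step, with inclusive series bounds:
timestamps = 31 * 24 * 60 / 5 + 1 # => 8929
devices = 6                       # generate_series(0, 5)
timestamps * devices              # => 53574
```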

Now, let's compare the time to process the volatility:

```ruby
Benchmark.bm do |x|
  x.report("ruby") { pp Measurement.volatility_by_device_id }
  x.report("sql") { pp Measurement.volatility("device_id").map(&:attributes) }
end
#            user     system      total        real
# ruby   0.612439   0.061890   0.674329 (  0.727590)
# sql    0.001142   0.000301   0.001443 (  0.060301)
```

Calculating the performance ratio from the `real` column, `0.72 / 0.06` shows
that SQL is about 12 times faster than Ruby at processing the volatility 🎉
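
Using the full `real` timings from the run above to make the ratio concrete:

```ruby
0.727590 / 0.060301 # => 12.066... (about 12x)
```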

Keep in mind this was all on localhost, so the records didn't need to cross the
network. Moving to a remote host, look at the numbers:

!!!warning

    Note that the previous numbers were measured against localhost.
    Using a remote connection between different regions, the Ruby version
    burns roughly 500 times more CPU (`total`) and is about 40 times slower
    in wall-clock time (`real`) than SQL:

                 user     system      total        real
        ruby 0.716321   0.041640   0.757961 (  6.388881)
        sql  0.001156   0.000177   0.001333 (  0.161270)

[1]: https://github.com/timescale/timescaledb-toolkit
[2]: https://timescale.com
[3]: https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-SEARCH-PATH
[4]: https://github.com/timescale/timescaledb-toolkit/blob/main/docs/README.md#a-note-on-tags-
[5]: https://guides.rubyonrails.org/action_controller_overview.html#filters
[6]: https://en.wikipedia.org/wiki/Volatility_(finance)
[7]: https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/
[8]: https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks