timescaledb 0.2.1 → 0.2.3
- checksums.yaml +4 -4
- data/Fastfile +17 -0
- data/Gemfile.lock +2 -2
- data/README.md +41 -9
- data/bin/console +1 -1
- data/bin/tsdb +2 -2
- data/docs/command_line.md +178 -0
- data/docs/img/lttb_example.png +0 -0
- data/docs/img/lttb_sql_vs_ruby.gif +0 -0
- data/docs/img/lttb_zoom.gif +0 -0
- data/docs/index.md +61 -0
- data/docs/migrations.md +69 -0
- data/docs/models.md +78 -0
- data/docs/toolkit.md +394 -0
- data/docs/toolkit_lttb_tutorial.md +557 -0
- data/docs/toolkit_lttb_zoom.md +357 -0
- data/docs/videos.md +16 -0
- data/examples/all_in_one/all_in_one.rb +39 -5
- data/examples/all_in_one/benchmark_comparison.rb +108 -0
- data/examples/all_in_one/caggs.rb +93 -0
- data/examples/all_in_one/query_data.rb +78 -0
- data/examples/toolkit-demo/compare_volatility.rb +64 -0
- data/examples/toolkit-demo/lttb/README.md +15 -0
- data/examples/toolkit-demo/lttb/lttb.rb +92 -0
- data/examples/toolkit-demo/lttb/lttb_sinatra.rb +139 -0
- data/examples/toolkit-demo/lttb/lttb_test.rb +21 -0
- data/examples/toolkit-demo/lttb/views/index.erb +27 -0
- data/examples/toolkit-demo/lttb-zoom/README.md +13 -0
- data/examples/toolkit-demo/lttb-zoom/lttb_zoomable.rb +90 -0
- data/examples/toolkit-demo/lttb-zoom/views/index.erb +33 -0
- data/lib/timescaledb/acts_as_time_vector.rb +18 -0
- data/lib/timescaledb/dimensions.rb +1 -0
- data/lib/timescaledb/hypertable.rb +5 -1
- data/lib/timescaledb/job_stats.rb +1 -0
- data/lib/timescaledb/migration_helpers.rb +11 -0
- data/lib/timescaledb/schema_dumper.rb +1 -1
- data/lib/timescaledb/stats_report.rb +1 -1
- data/lib/timescaledb/toolkit/helpers.rb +20 -0
- data/lib/timescaledb/toolkit/time_vector.rb +66 -0
- data/lib/timescaledb/toolkit.rb +3 -0
- data/lib/timescaledb/version.rb +1 -1
- data/lib/timescaledb.rb +1 -0
- data/mkdocs.yml +33 -0
- metadata +31 -4
- data/examples/all_in_one/Gemfile +0 -11
- data/examples/all_in_one/Gemfile.lock +0 -51
data/docs/toolkit.md
ADDED
@@ -0,0 +1,394 @@
# The TimescaleDB Toolkit

The [TimescaleDB Toolkit][1] is an extension from [Timescale][2] that brings
more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL.

These functions have almost no dependency on hypertables, but they play very
well in the hypertable ecosystem. The mission of the toolkit team is to ease
all things analytics when using TimescaleDB, with a particular focus on
developer ergonomics and performance.

Here, we'll walk through some of the toolkit functions and the helpers that
can simplify the generation of some complex queries.

!!!warning

    Note that we're just starting the toolkit integration in the gem and
    several functions are still experimental.

## The `add_toolkit_to_search_path!` helper

Several toolkit functions are still in the experimental phase, and for that
reason they don't live in the `public` schema but in the
`toolkit_experimental` schema.

To use them without worrying about the schema, or prefixing it everywhere,
you can add the schema to the [search_path][3].

To make this easy on the Ruby side, you can call the method directly on the
ActiveRecord connection:

```ruby
ActiveRecord::Base.connection.add_toolkit_to_search_path!
```

This statement adds the [toolkit_experimental][4] schema to the search path,
alongside `public` and the `$user` schema.
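
Under the hood this boils down to a `SET search_path` statement. A minimal
sketch of the roughly equivalent raw SQL, if you prefer to set it yourself
(the schema order shown is an assumption, not the helper's exact output):

```ruby
# Roughly what the helper does: add the experimental schema to the
# connection's search_path (order is an assumption for illustration).
ActiveRecord::Base.connection.execute(
  %(SET search_path TO "$user", public, toolkit_experimental)
)
```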

The statement can be placed right before your usage of the toolkit. For
example, if a single controller in your Rails app will be using it, you can
create a [filter][5] in the controller to set it up before your action runs:

```ruby
class StatisticsController < ActionController::Base
  before_action :add_timescale_toolkit, only: [:complex_query]

  def complex_query
    # some code that uses the toolkit functions
  end

  protected

  def add_timescale_toolkit
    ActiveRecord::Base.connection.add_toolkit_to_search_path!
  end
end
```

## Example from scratch to use the Toolkit functions

Let's work through an example based on the [volatility][6] algorithm. This
example is inspired by the [function pipelines][7] blog post, which shows how
to calculate volatility in plain SQL and then how to do the same with the
toolkit's function pipelines.

!!!success

    Reading the [blog post][7] before trying this is highly recommended; it
    will give you more insights into how to apply and use time vectors, which
    are our next topic.


Let's start by creating the `measurements` hypertable using a regular
migration:

```ruby
class CreateMeasurements < ActiveRecord::Migration
  def change
    hypertable_options = {
      time_column: 'ts',
      chunk_time_interval: '1 day',
    }
    create_table :measurements, hypertable: hypertable_options, id: false do |t|
      t.integer :device_id
      t.decimal :val
      t.timestamp :ts
    end
  end
end
```

In this example, we just have a hypertable with no compression options. Every
`1 day`, a new child table, aka [chunk][8], will be generated.

Now, let's add the model `app/models/measurement.rb`:

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"
end
```

At this moment, you can jump into the Rails console and start testing the
model.

## Seeding some data

Before we build a very complex example, let's build something that is easy to
follow and comprehend. Let's create 3 records for the same device, representing
an hourly measurement from some sensor:

```ruby
yesterday = 1.day.ago
[1, 2, 3].each_with_index do |v, i|
  Measurement.create(device_id: 1, ts: yesterday + i.hour, val: v)
end
```
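
A quick sanity check in the console confirms the seed. Note that `val` comes
back as a `BigDecimal` because the column is declared as `decimal`, hence the
`to_i` in this sketch:

```ruby
Measurement.count                              # => 3
Measurement.order(:ts).pluck(:val).map(&:to_i) # => [1, 2, 3]
```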

Every value is a step in a progression from 1 to 3. Now, we can build a query
to fetch the values:

```ruby
values = Measurement.order(:ts).pluck(:val) # => [1, 2, 3]
```

Using plain Ruby, we can compute the volatility with a few lines of code:

```ruby
previous = nil
volatilities = values.map do |value|
  if previous
    # absolute difference from the previous value
    delta = (value - previous).abs
    volatility = delta
  end
  previous = value
  volatility
end
# volatilities => [nil, 1, 1]
volatility = volatilities.compact.sum # => 2
```

The `compact` step can be skipped, and we can also accumulate the sum in the
same loop. So, a refactored version would be:

```ruby
previous = nil
volatility = 0
values.each do |value|
  if previous
    delta = (value - previous).abs
    volatility += delta
  end
  previous = value
end
volatility # => 2
```
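
For reference, a more idiomatic one-liner (my own sketch, not part of the gem
or the original post) walks the consecutive pairs with `each_cons`:

```ruby
# Sum of absolute deltas between each consecutive pair of values.
volatility = values.each_cons(2).sum { |a, b| (b - a).abs } # => 2
```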

Now, it's time to move the calculation to the database, computing the
volatility in plain PostgreSQL. A subquery is required to compute the delta,
so it looks a bit more convoluted:

```ruby
delta = Measurement.select("device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta")
Measurement
  .select("device_id, sum(abs_delta) as volatility")
  .from("(#{delta.to_sql}) as calc_delta")
  .group('device_id')
```

The final query for the example above looks like this:

```sql
SELECT device_id, sum(abs_delta) as volatility
FROM (
  SELECT device_id,
         abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
  FROM "measurements"
) as calc_delta
GROUP BY device_id
```

The plain SQL version is much harder to follow than the Ruby one. Now, let's
reproduce the same example using the toolkit pipelines:

```ruby
Measurement
  .select("device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility")
  .group("device_id")
```

As you can see, it's much easier to read and digest the example. Now, let's
take a look at how we can generate the queries using the scopes injected by
the `acts_as_time_vector` macro.


## Adding the `acts_as_time_vector` macro

Let's start by changing the model to add the `acts_as_time_vector` macro,
which saves us from repeating the parameters of the `timevector(ts, val)`
call:

```ruby
class Measurement < ActiveRecord::Base
  self.primary_key = 'device_id'

  acts_as_hypertable time_column: "ts"

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"
end
```

If you skip the `time_column` option in `acts_as_time_vector`, it will inherit
the value from `acts_as_hypertable`. I'm making it explicit here for the sake
of keeping the macros independent.


Now that we have it, let's create a scope for it:

```ruby
class Measurement < ActiveRecord::Base
  acts_as_hypertable time_column: "ts"
  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> do
    select("device_id, timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility")
      .group("device_id")
  end
end
```
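
You can inspect the SQL this scope generates with `to_sql` (output reflowed
here for readability):

```ruby
puts Measurement.volatility.to_sql
# SELECT device_id,
#   timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility
# FROM "measurements" GROUP BY device_id
```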

Now we have created the volatility scope, which always groups by `device_id`.

The Toolkit helpers bring a similar version, which also contains a default
segmentation based on the `segment_by` configuration done through the
`acts_as_time_vector` macro. A `segment_by_column` method is added to access
this configuration, so we can make a small change that lets you fully
understand the volatility helper:

```ruby
class Measurement < ActiveRecord::Base
  # ... Skipping previous code to focus on the example

  acts_as_time_vector segment_by: "device_id",
                      value_column: "val",
                      time_column: "ts"

  scope :volatility, -> (columns=segment_by_column) do
    _scope = select([*columns,
      "timevector(#{time_column}, #{value_column}) -> sort() -> delta() -> abs() -> sum() as volatility"
    ].join(", "))
    _scope = _scope.group(columns) if columns
    _scope
  end
end
```

Testing the method:

```ruby
Measurement.volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.6ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id"
# => [{"device_id"=>1, "volatility"=>8.0}]
```

Let's add a few more records with random values:

```ruby
yesterday = 1.day.ago
(2..6).each do |d|
  (1..10).each do |j|
    Measurement.create(device_id: d, ts: yesterday + j.hour, val: rand(10))
  end
end
```

Testing all the values:

```ruby
Measurement.order("device_id").volatility.map(&:attributes)
# DEBUG -- : Measurement Load (1.3ms)  SELECT device_id, timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements" GROUP BY "measurements"."device_id" ORDER BY device_id
# => [{"device_id"=>1, "volatility"=>8.0},
#     {"device_id"=>2, "volatility"=>24.0},
#     {"device_id"=>3, "volatility"=>30.0},
#     {"device_id"=>4, "volatility"=>32.0},
#     {"device_id"=>5, "volatility"=>44.0},
#     {"device_id"=>6, "volatility"=>23.0}]
```

If the parameter is explicitly `nil`, it will not group:

```ruby
Measurement.volatility(nil).map(&:attributes)
# DEBUG -- : Measurement Load (5.4ms)  SELECT timevector(ts, val) -> sort() -> delta() -> abs() -> sum() as volatility FROM "measurements"
# => [{"volatility"=>186.0, "device_id"=>nil}]
```

## Comparing with Ruby version

Now, it's time to benchmark the Ruby and PostgreSQL solutions and verify which
is faster:

```ruby
class Measurement < ActiveRecord::Base
  # code you already know

  # Pure-Ruby version: loads every record and accumulates the
  # absolute deltas per device in a Hash.
  scope :volatility_by_device_id, -> {
    volatility = Hash.new(0)
    previous = Hash.new
    find_all do |measurement|
      device_id = measurement.device_id
      if previous[device_id]
        delta = (measurement.val - previous[device_id]).abs
        volatility[device_id] += delta
      end
      previous[device_id] = measurement.val
    end
    volatility
  }
end
```
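
Calling it returns a plain Hash keyed by device id, matching the values the
SQL version produced earlier (values shown as floats for readability; they
come back as `BigDecimal` since `val` is a decimal column, and key order may
differ):

```ruby
Measurement.volatility_by_device_id
# => {1=>8.0, 2=>24.0, 3=>30.0, 4=>32.0, 5=>44.0, 6=>23.0}
```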

Now, benchmarking the real time, in milliseconds, to compute it in Ruby:

```ruby
Benchmark.measure { Measurement.volatility_by_device_id }.real * 1000
# => 3.021999917924404
```

## Seeding massive data

Now, let's use `generate_series` to insert a lot of records directly into the
database, fast.

Let's just agree on some numbers to have a good start: a month of data for six
devices (ids 0 through 5) emitting values every 5 minutes, which will generate
around 50k records.

Let's use some plain SQL to insert the records now:

```ruby
sql = "INSERT INTO measurements (ts, device_id, val)
SELECT ts, device_id, random()*80
FROM generate_series(TIMESTAMP '2022-01-01 00:00:00',
                     TIMESTAMP '2022-02-01 00:00:00',
                     INTERVAL '5 minutes') AS g1(ts),
     generate_series(0, 5) AS g2(device_id);
"
ActiveRecord::Base.connection.execute(sql)
```

On my M1 Mac, it took less than a second to insert the ~53k records:

```ruby
# DEBUG (177.5ms)  INSERT INTO measurements (ts, device_id, val) ..
# => #<PG::Result:0x00007f8152034168 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=53574>
```
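
You can double-check the row count from the console; the exact total depends
on what you seeded before (the figure below assumes the 53,574 generated rows
plus the 53 created in the earlier seeding steps):

```ruby
Measurement.count # => 53627
```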

Now, let's compare the time to process the volatility:

```ruby
Benchmark.bm do |x|
  x.report("ruby") { pp Measurement.volatility_by_device_id }
  x.report("sql") { pp Measurement.volatility("device_id").map(&:attributes) }
end
#        user     system      total        real
# ruby  0.612439   0.061890   0.674329 (  0.727590)
# sql   0.001142   0.000301   0.001443 (  0.060301)
```

Calculating the performance ratio, `0.72 / 0.06` means SQL is 12 times faster
than Ruby at processing the volatility 🎉

And this was on localhost, so the records didn't even have to travel over the
network. Moving to a remote host, look at the numbers:

!!!warning

    Note that the previous numbers were using localhost.
    Now, using a remote connection between different regions, Ruby gets
    ~40 times slower than SQL in wall-clock time (and burns ~500 times
    more CPU time).

            user     system      total        real
    ruby  0.716321   0.041640   0.757961 (  6.388881)
    sql   0.001156   0.000177   0.001333 (  0.161270)

[1]: https://github.com/timescale/timescaledb-toolkit
[2]: https://timescale.com
[3]: https://www.postgresql.org/docs/14/runtime-config-client.html#GUC-SEARCH-PATH
[4]: https://github.com/timescale/timescaledb-toolkit/blob/main/docs/README.md#a-note-on-tags-
[5]: https://guides.rubyonrails.org/action_controller_overview.html#filters
[6]: https://en.wikipedia.org/wiki/Volatility_(finance)
[7]: https://www.timescale.com/blog/function-pipelines-building-functional-programming-into-postgresql-using-custom-operators/
[8]: https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/#partitioning-in-hypertables-with-chunks