active_hll 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: a9014f39ae62f6db1066d4933bc28485ecc57e788922d70437c4744edfb0c033
4
+ data.tar.gz: 91320d88909c41e425966993e145b9ff8aa5b57c6c3ac0556dc3661519691968
5
+ SHA512:
6
+ metadata.gz: c55980654230a259a249a24b959c18b054d8e41773653bcc5411dbbbe3200fdd35c65dcfe729ed042943a8f2e81e26843c696d4989840fa43384622e3d56614f
7
+ data.tar.gz: 5285ca0005787997dec664512b0bc6796201a8a4bd719e0ce0b9f6efa50096f3a3e7ade096db6ba943e9f84630b2a92c5d9b3a53a4a0b744cc4162e87d6c2990
data/CHANGELOG.md ADDED
@@ -0,0 +1,3 @@
1
+ ## 0.1.0 (2023-01-24)
2
+
3
+ - First release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2023 Andrew Kane
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,157 @@
1
+ # Active HLL
2
+
3
+ :fire: HyperLogLog for Rails and Postgres
4
+
5
+ For fast, approximate count-distinct queries
6
+
7
+ [![Build Status](https://github.com/ankane/active_hll/workflows/build/badge.svg?branch=master)](https://github.com/ankane/active_hll/actions)
8
+
9
+ ## Installation
10
+
11
+ First, install the [hll extension](https://github.com/citusdata/postgresql-hll) on your database server:
12
+
13
+ ```sh
14
+ cd /tmp
15
+ curl -L https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.17.tar.gz | tar xz
16
+ cd postgresql-hll-2.17
17
+ make
18
+ make install # may need sudo
19
+ ```
20
+
21
+ Then add this line to your application’s Gemfile:
22
+
23
+ ```ruby
24
+ gem "active_hll"
25
+ ```
26
+
27
+ And run:
28
+
29
+ ```sh
30
+ bundle install
31
+ rails generate active_hll:install
32
+ rails db:migrate
33
+ ```
34
+
35
+ ## Getting Started
36
+
37
+ HLLs provide an approximate count of unique values (like unique visitors). By rolling up data by day, you can quickly get an approximate count over any date range.
38
+
39
+ Create a table with an `hll` column
40
+
41
+ ```ruby
42
+ class CreateEventRollups < ActiveRecord::Migration[7.0]
43
+ def change
44
+ create_table :event_rollups do |t|
45
+ t.date :time_bucket, index: {unique: true}
46
+ t.hll :visitor_ids
47
+ end
48
+ end
49
+ end
50
+ ```
51
+
52
+ You can use [batch](#batch) and [stream](#stream) approaches to build HLLs
53
+
54
+ ### Batch
55
+
56
+ Use the `hll_agg` method to generate HLLs from existing data
57
+
58
+ ```ruby
59
+ hlls = Event.group_by_day(:created_at).hll_agg(:visitor_id)
60
+ ```
61
+
62
+ > Install [Groupdate](https://github.com/ankane/groupdate) to use the `group_by_day` method
63
+
64
+ And store the result
65
+
66
+ ```ruby
67
+ EventRollup.upsert_all(
68
+ hlls.map { |k, v| {time_bucket: k, visitor_ids: v} },
69
+ unique_by: [:time_bucket]
70
+ )
71
+ ```
72
+
73
+ For a large number of HLLs, use SQL to generate and upsert in a single statement
74
+
75
+ ### Stream
76
+
77
+ Use the `hll_add` method to add new data to HLLs
78
+
79
+ ```ruby
80
+ EventRollup.where(time_bucket: Date.current).hll_add(visitor_ids: ["visitor1", "visitor2"])
81
+ ```
82
+
83
+ ## Querying
84
+
85
+ Get approximate unique values for a time range
86
+
87
+ ```ruby
88
+ EventRollup.where(time_bucket: 30.days.ago.to_date..Date.current).hll_count(:visitor_ids)
89
+ ```
90
+
91
+ Get approximate unique values by time bucket
92
+
93
+ ```ruby
94
+ EventRollup.group(:time_bucket).hll_count(:visitor_ids)
95
+ ```
96
+
97
+ Get approximate unique values by month
98
+
99
+ ```ruby
100
+ EventRollup.group_by_month(:time_bucket, time_zone: false).hll_count(:visitor_ids)
101
+ ```
102
+
103
+ Get the union of multiple HLLs
104
+
105
+ ```ruby
106
+ EventRollup.hll_union(:visitor_ids)
107
+ ```
108
+
109
+ ## Data Protection
110
+
111
+ Cardinality estimators like HyperLogLog do not [preserve privacy](https://arxiv.org/pdf/1808.05879.pdf), so protect `hll` columns the same as you would the raw data.
112
+
113
+ For instance, you can check whether a value is likely a member of an HLL with:
114
+
115
+ ```sql
116
+ SELECT
117
+ time_bucket,
118
+ visitor_ids = visitor_ids || hll_hash_text('visitor1') AS likely_member
119
+ FROM
120
+ event_rollups;
121
+ ```
122
+
123
+ ## Data Retention
124
+
125
+ Data should only be retained for as long as it’s needed. Delete older data with:
126
+
127
+ ```ruby
128
+ EventRollup.where("time_bucket < ?", 2.years.ago).delete_all
129
+ ```
130
+
131
+ There’s no way to remove data from an HLL, so to delete data for a specific user, delete the underlying data and recalculate the rollup.
132
+
133
+ ## Hosted Postgres
134
+
135
+ The `hll` extension is available on Amazon RDS, Google Cloud SQL, and DigitalOcean Managed Databases.
136
+
137
+ ## History
138
+
139
+ View the [changelog](CHANGELOG.md)
140
+
141
+ ## Contributing
142
+
143
+ Everyone is encouraged to help improve this project. Here are a few ways you can help:
144
+
145
+ - [Report bugs](https://github.com/ankane/active_hll/issues)
146
+ - Fix bugs and [submit pull requests](https://github.com/ankane/active_hll/pulls)
147
+ - Write, clarify, or fix documentation
148
+ - Suggest or add new features
149
+
150
+ To get started with development:
151
+
152
+ ```sh
153
+ git clone https://github.com/ankane/active_hll.git
154
+ cd active_hll
155
+ bundle install
156
+ bundle exec rake test
157
+ ```
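Review note on the README's Batch section: it recommends SQL for generating and upserting a large number of HLLs in one statement but does not show one. A minimal sketch of what that could look like follows. The `rollup_upsert_sql` helper is hypothetical (not part of the gem), the table and column names are taken from the README's `Event` and `EventRollup` examples, and `created_at::date` assumes the session time zone gives the day boundaries you want.

```ruby
# Hypothetical helper (not part of active_hll): builds a single
# INSERT ... SELECT that aggregates events per day and upserts the rollups,
# so the HLL values never round-trip through Ruby.
# Table names are interpolated, so pass trusted constants only.
def rollup_upsert_sql(source_table: "events", rollup_table: "event_rollups")
  <<~SQL
    INSERT INTO #{rollup_table} (time_bucket, visitor_ids)
    SELECT created_at::date, hll_add_agg(hll_hash_any(visitor_id))
    FROM #{source_table}
    GROUP BY 1
    ON CONFLICT (time_bucket) DO UPDATE SET visitor_ids = EXCLUDED.visitor_ids
  SQL
end

puts rollup_upsert_sql
```

The `ON CONFLICT (time_bucket)` clause relies on the unique index created in the README's migration. In an application the string could be passed to `EventRollup.connection.execute`.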
data/lib/active_hll/hll.rb ADDED
@@ -0,0 +1,51 @@
1
+ # format of value
2
+ # https://github.com/aggregateknowledge/hll-storage-spec/blob/v1.0.0/STORAGE.md
3
+ module ActiveHll
4
+ class Hll
5
+ attr_reader :value
6
+
7
+ def initialize(value)
8
+ unless value.is_a?(String) && value.encoding == Encoding::BINARY
9
+ raise ArgumentError, "Expected binary string"
10
+ end
11
+
12
+ @value = value
13
+ end
14
+
15
+ def inspect
16
+ "(hll)"
17
+ end
18
+
19
+ def schema_version
20
+ value[0].unpack1("C") >> 4
21
+ end
22
+
23
+ def type
24
+ value[0].unpack1("C") & 0b00001111
25
+ end
26
+
27
+ def regwidth
28
+ (value[1].unpack1("C") >> 5) + 1
29
+ end
30
+
31
+ def log2m
32
+ value[1].unpack1("C") & 0b00011111
33
+ end
34
+
35
+ def sparseon
36
+ (value[2].unpack1("C") & 0b01000000) >> 6
37
+ end
38
+
39
+ def expthresh
40
+ t = value[2].unpack1("C") & 0b00111111
41
+       t == 63 ? -1 : (t.zero? ? 0 : 2**(t - 1)) # 0 is taken to mean explicit mode is off
42
+ end
43
+
44
+ def data
45
+ case type
46
+ when 2
47
+ value[3..-1].unpack("q>*")
48
+ end
49
+ end
50
+ end
51
+ end
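Review note: the accessors in `ActiveHll::Hll` read three header bytes. A standalone sketch of the same bit layout may help when checking them by hand. The `decode_hll_header` function is hypothetical, and the sample bytes `\x11\x8b\x7f` are assumed to be what `hll_empty()` returns with default parameters.

```ruby
# Standalone sketch: mirrors the bit layout read by ActiveHll::Hll,
# without depending on the gem.
def decode_hll_header(bytes)
  version_type, params, cutoff = bytes.unpack("C3")
  {
    schema_version: version_type >> 4,   # high nibble of byte 0
    type: version_type & 0b1111,         # low nibble of byte 0
    regwidth: (params >> 5) + 1,         # top 3 bits of byte 1, stored minus one
    log2m: params & 0b11111,             # low 5 bits of byte 1
    sparseon: (cutoff >> 6) & 1,         # bit 6 of byte 2
    expthresh_code: cutoff & 0b111111    # low 6 bits of byte 2 (63 maps to -1 in the class)
  }
end

# Assumed default empty HLL (an assumption, see the note above)
header = decode_hll_header("\x11\x8b\x7f".b)
p header
```

For these bytes the sketch yields schema version 1, type 1, `regwidth` 5, `log2m` 11, `sparseon` 1 and the raw threshold code 63.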
data/lib/active_hll/model.rb ADDED
@@ -0,0 +1,66 @@
1
+ require "active_support/concern"
2
+
3
+ module ActiveHll
4
+ module Model
5
+ extend ActiveSupport::Concern
6
+
7
+ class_methods do
8
+ def hll_agg(column)
9
+ Utils.hll_calculate(self, "hll_add_agg(hll_hash_any(%s)) AS hll_agg", column, default_value: nil)
10
+ end
11
+
12
+ def hll_union(column)
13
+ Utils.hll_calculate(self, "hll_union_agg(%s) AS hll_union", column, default_value: nil)
14
+ end
15
+
16
+ def hll_count(column)
17
+ Utils.hll_calculate(self, "hll_cardinality(hll_union_agg(%s)) AS hll_count", column, default_value: 0.0)
18
+ end
19
+
20
+ # experimental
21
+ # doesn't work with non-default parameters
22
+ def hll_generate(values)
23
+ parts = ["hll_empty()"]
24
+
25
+ values.each do |value|
26
+ parts << Utils.hll_hash_sql(self, value)
27
+ end
28
+
29
+ result = connection.select_all("SELECT #{parts.join(" || ")}").rows[0][0]
30
+ ActiveHll::Type.new.deserialize(result)
31
+ end
32
+
33
+ def hll_add(attributes)
34
+ set_clauses =
35
+ attributes.map do |attribute, values|
36
+ values = [values] unless values.is_a?(Array)
37
+ return 0 if values.empty?
38
+
39
+ quoted_column = connection.quote_column_name(attribute)
40
+ # possibly fetch parameters for the column in the future
41
+ # for now, users should set a default value on the column
42
+ parts = ["COALESCE(#{quoted_column}, hll_empty())"]
43
+
44
+ values.each do |value|
45
+ parts << Utils.hll_hash_sql(self, value)
46
+ end
47
+
48
+ "#{quoted_column} = #{parts.join(" || ")}"
49
+ end
50
+
51
+ update_all(set_clauses.join(", "))
52
+ end
53
+ end
54
+
55
+ # doesn't update in-memory record attribute for performance
56
+ def hll_add(attributes)
57
+ self.class.where(id: id).hll_add(attributes)
58
+ nil
59
+ end
60
+
61
+ def hll_count(attribute)
62
+ quoted_column = self.class.connection.quote_column_name(attribute)
63
+ self.class.where(id: id).pluck("hll_cardinality(#{quoted_column})").first || 0.0
64
+ end
65
+ end
66
+ end
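Review note: the class-level `hll_add` builds one `SET` clause per attribute and hands it to `update_all`. The sketch below reproduces that string building outside Rails so the generated SQL can be inspected. `quote_value` and `quote_column` are simplified stand-ins (assumptions for illustration) for the connection's `quote` and `quote_column_name`, and `hash_sql` and `set_clause` are hypothetical names.

```ruby
# Simplified stand-ins for connection.quote / connection.quote_column_name.
# Illustration only; real quoting belongs to the database adapter.
def quote_value(value)
  value.is_a?(String) ? "'" + value.gsub("'", "''") + "'" : value.to_s
end

def quote_column(name)
  '"' + name.to_s.gsub('"', '""') + '"'
end

# Same type dispatch as Utils.hll_hash_sql: the Ruby class picks the hash function.
def hash_sql(value)
  function =
    case value
    when true, false then "hll_hash_boolean"
    when Integer then "hll_hash_bigint"
    when String then "hll_hash_text"
    else raise ArgumentError, "Unexpected type: #{value.class.name}"
    end
  "#{function}(#{quote_value(value)})"
end

# Same shape as the clause built inside hll_add: start from the current
# column (or an empty HLL when NULL) and append each hashed value with ||.
def set_clause(attribute, values)
  values = [values] unless values.is_a?(Array)
  column = quote_column(attribute)
  parts = ["COALESCE(#{column}, hll_empty())"] + values.map { |v| hash_sql(v) }
  "#{column} = #{parts.join(" || ")}"
end

puts set_clause(:visitor_ids, ["visitor1", "visitor2"])
```

Because the clause starts from `hll_empty()` when the column is `NULL`, the comment in the source about setting a column default matters for tables created with non-default HLL parameters.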
data/lib/active_hll/type.rb ADDED
@@ -0,0 +1,21 @@
1
+ module ActiveHll
2
+ class Type < ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Bytea
3
+ def type
4
+ :hll
5
+ end
6
+
7
+ def serialize(value)
8
+ if value.is_a?(Hll)
9
+ value = value.value
10
+ elsif !value.nil?
11
+ raise ArgumentError, "can't cast #{value.class.name} to hll"
12
+ end
13
+ super(value)
14
+ end
15
+
16
+ def deserialize(value)
17
+ value = super
18
+ value.nil? ? value : Hll.new(value)
19
+ end
20
+ end
21
+ end
data/lib/active_hll/utils.rb ADDED
@@ -0,0 +1,70 @@
1
+ module ActiveHll
2
+ module Utils
3
+ class << self
4
+ def hll_hash_sql(klass, value)
5
+ hash_function =
6
+ case value
7
+ when true, false
8
+ "hll_hash_boolean"
9
+ when Integer
10
+ "hll_hash_bigint"
11
+ when String
12
+ "hll_hash_text"
13
+ else
14
+ raise ArgumentError, "Unexpected type: #{value.class.name}"
15
+ end
16
+ quoted_value = klass.connection.quote(value)
17
+ "#{hash_function}(#{quoted_value})"
18
+ end
19
+
20
+ def hll_calculate(relation, operation, column, default_value:)
21
+ sql, relation, group_values = hll_calculate_sql(relation, operation, column)
22
+ result = relation.connection.select_all(sql)
23
+
24
+ # typecast
25
+ rows = []
26
+ columns = result.columns
27
+ result.rows.each do |untyped_row|
28
+ rows << (result.column_types.empty? ? untyped_row : columns.each_with_index.map { |c, i| untyped_row[i] && result.column_types[c] ? result.column_types[c].deserialize(untyped_row[i]) : untyped_row[i] })
29
+ end
30
+
31
+ result =
32
+ if group_values.any?
33
+ Hash[rows.map { |r| [r.size == 2 ? r[0] : r[0..-2], r[-1]] }]
34
+ else
35
+ rows[0] && rows[0][0]
36
+ end
37
+
38
+ result = Groupdate.process_result(relation, result, default_value: default_value) if defined?(Groupdate.process_result)
39
+
40
+ result
41
+ end
42
+
43
+ def hll_calculate_sql(relation, operation, column)
44
+ # basic version of Active Record disallow_raw_sql!
45
+ # symbol = column (safe), Arel node = SQL (safe), other = untrusted
46
+ # matches table.column and column
47
+ unless column.is_a?(Symbol) || column.is_a?(Arel::Nodes::SqlLiteral)
48
+ column = column.to_s
49
+ unless /\A\w+(\.\w+)?\z/i.match(column)
50
+ raise ActiveRecord::UnknownAttributeReference, "Query method called with non-attribute argument(s): #{column.inspect}. Use Arel.sql() for known-safe values."
51
+ end
52
+ end
53
+
54
+ # column resolution
55
+ node = relation.all.send(:arel_columns, [column]).first
56
+ node = Arel::Nodes::SqlLiteral.new(node) if node.is_a?(String)
57
+ column = relation.connection.visitor.accept(node, Arel::Collectors::SQLString.new).value
58
+
59
+ group_values = relation.all.group_values
60
+
61
+ relation = relation.unscope(:select).select(*group_values, operation % [column])
62
+
63
+ # same as average
64
+ relation = relation.unscope(:order).distinct!(false) if group_values.empty?
65
+
66
+ [relation.to_sql, relation, group_values]
67
+ end
68
+ end
69
+ end
70
+ end
data/lib/active_hll/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module ActiveHll
2
+ VERSION = "0.1.0"
3
+ end
data/lib/active_hll.rb ADDED
@@ -0,0 +1,41 @@
1
+ # dependencies
2
+ require "active_support"
3
+
4
+ # modules
5
+ require_relative "active_hll/hll"
6
+ require_relative "active_hll/utils"
7
+ require_relative "active_hll/version"
8
+
9
+ module ActiveHll
10
+ class Error < StandardError; end
11
+
12
+ autoload :Type, "active_hll/type"
13
+
14
+ module RegisterType
15
+ def initialize_type_map(m = type_map)
16
+ super
17
+ m.register_type "hll", ActiveHll::Type.new
18
+ end
19
+ end
20
+ end
21
+
22
+ ActiveSupport.on_load(:active_record) do
23
+ require_relative "active_hll/model"
24
+
25
+ include ActiveHll::Model
26
+
27
+ require "active_record/connection_adapters/postgresql_adapter"
28
+
29
+ # ensure schema can be dumped
30
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:hll] = {name: "hll"}
31
+
32
+ # ensure schema can be loaded
33
+ ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :hll)
34
+
35
+ # prevent unknown OID warning
36
+ if ActiveRecord::VERSION::MAJOR >= 7
37
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.singleton_class.prepend(ActiveHll::RegisterType)
38
+ else
39
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(ActiveHll::RegisterType)
40
+ end
41
+ end
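Review note: `RegisterType` is prepended to the adapter's singleton class on Active Record 7 and to the adapter itself on earlier versions, which suggests `initialize_type_map` is a class method in 7 and an instance method before. The mechanism can be sketched with plain Ruby. `FakeAdapter` and `RegisterHll` are hypothetical stand-ins, and a `Hash` stands in for the real type map.

```ruby
# Stand-in for an adapter whose type map is built by a class method.
class FakeAdapter
  def self.initialize_type_map(m = {})
    m["bytea"] = :bytea
    m
  end
end

# Stand-in for ActiveHll::RegisterType: run the original method first,
# then register one more type on the same map.
module RegisterHll
  def initialize_type_map(m = {})
    super # bare super forwards the current value of m
    m["hll"] = :hll
    m
  end
end

# Prepending to the singleton class wraps the class-level method.
FakeAdapter.singleton_class.prepend(RegisterHll)

type_map = FakeAdapter.initialize_type_map
p type_map
```

Prepending to the class itself instead of its singleton class would wrap an instance method of the same name, which is the branch the diff takes for Active Record 6.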
data/lib/generators/active_hll/install_generator.rb ADDED
@@ -0,0 +1,18 @@
1
+ require "rails/generators/active_record"
2
+
3
+ module ActiveHll
4
+ module Generators
5
+ class InstallGenerator < Rails::Generators::Base
6
+ include ActiveRecord::Generators::Migration
7
+ source_root File.join(__dir__, "templates")
8
+
9
+ def copy_migration
10
+ migration_template "migration.rb", "db/migrate/install_active_hll.rb", migration_version: migration_version
11
+ end
12
+
13
+ def migration_version
14
+ "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
15
+ end
16
+ end
17
+ end
18
+ end
data/lib/generators/active_hll/templates/migration.rb.tt ADDED
@@ -0,0 +1,5 @@
1
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
2
+ def change
3
+ enable_extension "hll"
4
+ end
5
+ end
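Review note: to see what the generator writes, the template can be rendered with plain ERB. This is only a sketch. The class name `InstallActiveHll` and the version string `[7.0]` are assumed values for an Active Record 7.0 application, and Rails' own template rendering may differ in details such as whitespace trimming.

```ruby
require "erb"

# Template body copied from migration.rb.tt
template = <<~TEMPLATE
  class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
    def change
      enable_extension "hll"
    end
  end
TEMPLATE

# Assumed values; the generator derives these from the file name and
# from ActiveRecord::VERSION.
migration_class_name = "InstallActiveHll"
migration_version = "[7.0]"

rendered = ERB.new(template).result(binding)
puts rendered
```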
metadata ADDED
@@ -0,0 +1,67 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: active_hll
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Andrew Kane
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2023-01-24 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: activerecord
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '6'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '6'
27
+ description:
28
+ email: andrew@ankane.org
29
+ executables: []
30
+ extensions: []
31
+ extra_rdoc_files: []
32
+ files:
33
+ - CHANGELOG.md
34
+ - LICENSE.txt
35
+ - README.md
36
+ - lib/active_hll.rb
37
+ - lib/active_hll/hll.rb
38
+ - lib/active_hll/model.rb
39
+ - lib/active_hll/type.rb
40
+ - lib/active_hll/utils.rb
41
+ - lib/active_hll/version.rb
42
+ - lib/generators/active_hll/install_generator.rb
43
+ - lib/generators/active_hll/templates/migration.rb.tt
44
+ homepage: https://github.com/ankane/active_hll
45
+ licenses:
46
+ - MIT
47
+ metadata: {}
48
+ post_install_message:
49
+ rdoc_options: []
50
+ require_paths:
51
+ - lib
52
+ required_ruby_version: !ruby/object:Gem::Requirement
53
+ requirements:
54
+ - - ">="
55
+ - !ruby/object:Gem::Version
56
+ version: '2.7'
57
+ required_rubygems_version: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ requirements: []
63
+ rubygems_version: 3.4.1
64
+ signing_key:
65
+ specification_version: 4
66
+ summary: HyperLogLog for Rails and Postgres
67
+ test_files: []