active_hll 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: a9014f39ae62f6db1066d4933bc28485ecc57e788922d70437c4744edfb0c033
4
+ data.tar.gz: 91320d88909c41e425966993e145b9ff8aa5b57c6c3ac0556dc3661519691968
5
+ SHA512:
6
+ metadata.gz: c55980654230a259a249a24b959c18b054d8e41773653bcc5411dbbbe3200fdd35c65dcfe729ed042943a8f2e81e26843c696d4989840fa43384622e3d56614f
7
+ data.tar.gz: 5285ca0005787997dec664512b0bc6796201a8a4bd719e0ce0b9f6efa50096f3a3e7ade096db6ba943e9f84630b2a92c5d9b3a53a4a0b744cc4162e87d6c2990
data/CHANGELOG.md ADDED
@@ -0,0 +1,3 @@
1
+ ## 0.1.0 (2023-01-24)
2
+
3
+ - First release
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
1
+ The MIT License (MIT)
2
+
3
+ Copyright (c) 2023 Andrew Kane
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in
13
+ all copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,157 @@
1
+ # Active HLL
2
+
3
+ :fire: HyperLogLog for Rails and Postgres
4
+
5
+ For fast, approximate count-distinct queries
6
+
7
+ [![Build Status](https://github.com/ankane/active_hll/workflows/build/badge.svg?branch=master)](https://github.com/ankane/active_hll/actions)
8
+
9
+ ## Installation
10
+
11
+ First, install the [hll extension](https://github.com/citusdata/postgresql-hll) on your database server:
12
+
13
+ ```sh
14
+ cd /tmp
15
+ curl -L https://github.com/citusdata/postgresql-hll/archive/refs/tags/v2.17.tar.gz | tar xz
16
+ cd postgresql-hll-2.17
17
+ make
18
+ make install # may need sudo
19
+ ```
20
+
21
+ Then add this line to your application’s Gemfile:
22
+
23
+ ```ruby
24
+ gem "active_hll"
25
+ ```
26
+
27
+ And run:
28
+
29
+ ```sh
30
+ bundle install
31
+ rails generate active_hll:install
32
+ rails db:migrate
33
+ ```
34
+
35
+ ## Getting Started
36
+
37
+ HLLs provide an approximate count of unique values (like unique visitors). By rolling up data by day, you can quickly get an approximate count over any date range.
38
+
39
+ Create a table with an `hll` column
40
+
41
+ ```ruby
42
+ class CreateEventRollups < ActiveRecord::Migration[7.0]
43
+ def change
44
+ create_table :event_rollups do |t|
45
+ t.date :time_bucket, index: {unique: true}
46
+ t.hll :visitor_ids
47
+ end
48
+ end
49
+ end
50
+ ```
51
+
52
+ You can use [batch](#batch) and [stream](#stream) approaches to build HLLs
53
+
54
+ ### Batch
55
+
56
+ Use the `hll_agg` method to generate HLLs from existing data
57
+
58
+ ```ruby
59
+ hlls = Event.group_by_day(:created_at).hll_agg(:visitor_id)
60
+ ```
61
+
62
+ > Install [Groupdate](https://github.com/ankane/groupdate) to use the `group_by_day` method
63
+
64
+ And store the result
65
+
66
+ ```ruby
67
+ EventRollup.upsert_all(
68
+ hlls.map { |k, v| {time_bucket: k, visitor_ids: v} },
69
+ unique_by: [:time_bucket]
70
+ )
71
+ ```
72
+
73
+ For a large number of HLLs, use SQL to generate and upsert in a single statement
74
+
75
+ ### Stream
76
+
77
+ Use the `hll_add` method to add new data to HLLs
78
+
79
+ ```ruby
80
+ EventRollup.where(time_bucket: Date.current).hll_add(visitor_ids: ["visitor1", "visitor2"])
81
+ ```
82
+
83
+ ## Querying
84
+
85
+ Get approximate unique values for a time range
86
+
87
+ ```ruby
88
+ EventRollup.where(time_bucket: 30.days.ago.to_date..Date.current).hll_count(:visitor_ids)
89
+ ```
90
+
91
+ Get approximate unique values by time bucket
92
+
93
+ ```ruby
94
+ EventRollup.group(:time_bucket).hll_count(:visitor_ids)
95
+ ```
96
+
97
+ Get approximate unique values by month
98
+
99
+ ```ruby
100
+ EventRollup.group_by_month(:time_bucket, time_zone: false).hll_count(:visitor_ids)
101
+ ```
102
+
103
+ Get the union of multiple HLLs
104
+
105
+ ```ruby
106
+ EventRollup.hll_union(:visitor_ids)
107
+ ```
108
+
109
+ ## Data Protection
110
+
111
+ Cardinality estimators like HyperLogLog do not [preserve privacy](https://arxiv.org/pdf/1808.05879.pdf), so protect `hll` columns the same as you would the raw data.
112
+
113
+ For instance, you can check membership with good probability using:
114
+
115
+ ```sql
116
+ SELECT
117
+ time_bucket,
118
+ visitor_ids = visitor_ids || hll_hash_text('visitor1') AS likely_member
119
+ FROM
120
+ event_rollups;
121
+ ```
122
+
123
+ ## Data Retention
124
+
125
+ Data should only be retained for as long as it’s needed. Delete older data with:
126
+
127
+ ```ruby
128
+ EventRollup.where("time_bucket < ?", 2.years.ago).delete_all
129
+ ```
130
+
131
+ There’s no way to remove data from an HLL, so to delete data for a specific user, delete the underlying data and recalculate the rollup.
132
+
133
+ ## Hosted Postgres
134
+
135
+ The `hll` extension is available on Amazon RDS, Google Cloud SQL, and DigitalOcean Managed Databases.
136
+
137
+ ## History
138
+
139
+ View the [changelog](CHANGELOG.md)
140
+
141
+ ## Contributing
142
+
143
+ Everyone is encouraged to help improve this project. Here are a few ways you can help:
144
+
145
+ - [Report bugs](https://github.com/ankane/active_hll/issues)
146
+ - Fix bugs and [submit pull requests](https://github.com/ankane/active_hll/pulls)
147
+ - Write, clarify, or fix documentation
148
+ - Suggest or add new features
149
+
150
+ To get started with development:
151
+
152
+ ```sh
153
+ git clone https://github.com/ankane/active_hll.git
154
+ cd active_hll
155
+ bundle install
156
+ bundle exec rake test
157
+ ```
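The README's Batch section mentions generating and upserting many HLLs in a single SQL statement but does not show one. The following is a minimal sketch of what such a statement might look like. The `events` table and its `created_at` and `visitor_id` columns are assumptions based on the README's `Event` examples, and casting `created_at` to a date is a simplification that ignores time zone handling. Note that this overwrites an existing rollup for a day rather than merging into it.

```ruby
# Sketch only: builds the SQL string without touching a database.
# Assumed names: an `events` table with `created_at` and `visitor_id`,
# plus the `event_rollups` table from the README migration.
rollup_sql = <<~SQL
  INSERT INTO event_rollups (time_bucket, visitor_ids)
  SELECT created_at::date, hll_add_agg(hll_hash_any(visitor_id))
  FROM events
  GROUP BY 1
  ON CONFLICT (time_bucket)
  DO UPDATE SET visitor_ids = EXCLUDED.visitor_ids
SQL

# In a Rails app, the statement could then be run with something like:
#   ActiveRecord::Base.connection.execute(rollup_sql)
```

The `ON CONFLICT (time_bucket)` clause relies on the unique index that the README migration creates on `time_bucket`.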
data/lib/active_hll/hll.rb ADDED
@@ -0,0 +1,51 @@
1
+ # format of value
2
+ # https://github.com/aggregateknowledge/hll-storage-spec/blob/v1.0.0/STORAGE.md
3
+ module ActiveHll
4
+ class Hll
5
+ attr_reader :value
6
+
7
+ def initialize(value)
8
+ unless value.is_a?(String) && value.encoding == Encoding::BINARY
9
+ raise ArgumentError, "Expected binary string"
10
+ end
11
+
12
+ @value = value
13
+ end
14
+
15
+ def inspect
16
+ "(hll)"
17
+ end
18
+
19
+ def schema_version
20
+ value[0].unpack1("C") >> 4
21
+ end
22
+
23
+ def type
24
+ value[0].unpack1("C") & 0b00001111
25
+ end
26
+
27
+ def regwidth
28
+ (value[1].unpack1("C") >> 5) + 1
29
+ end
30
+
31
+ def log2m
32
+ value[1].unpack1("C") & 0b00011111
33
+ end
34
+
35
+ def sparseon
36
+ (value[2].unpack1("C") & 0b01000000) >> 6
37
+ end
38
+
39
+ def expthresh
40
+ t = value[2].unpack1("C") & 0b00111111
41
+       t == 63 ? -1 : (t == 0 ? 0 : 2**(t - 1))
42
+ end
43
+
44
+ def data
45
+ case type
46
+ when 2
47
+ value[3..-1].unpack("q>*")
48
+ end
49
+ end
50
+ end
51
+ end
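To illustrate the header layout that `ActiveHll::Hll` reads, the standalone sketch below decodes the first three bytes of a serialized value without loading the gem. The helper name `decode_hll_header` and the sample bytes are assumptions chosen for illustration; the bit positions mirror the accessors in the file above. Treating a stored threshold of 0 as "explicit mode disabled" is my reading of the linked storage spec.

```ruby
# Standalone sketch: decode the 3-byte header of a serialized HLL value.
# `decode_hll_header` is a hypothetical helper, not part of the gem.
def decode_hll_header(bytes)
  version_byte, param_byte, cutoff_byte = bytes.unpack("C3")
  threshold = cutoff_byte & 0b00111111 # low 6 bits of byte 2

  {
    schema_version: version_byte >> 4,  # high nibble of byte 0
    type: version_byte & 0b00001111,    # low nibble of byte 0
    regwidth: (param_byte >> 5) + 1,    # top 3 bits of byte 1, stored minus one
    log2m: param_byte & 0b00011111,     # low 5 bits of byte 1
    sparseon: (cutoff_byte >> 6) & 1,   # bit 6 of byte 2
    # 63 means "auto"; 0 is assumed to mean "disabled"; otherwise 2**(n - 1)
    expthresh: threshold == 63 ? -1 : (threshold.zero? ? 0 : 2**(threshold - 1))
  }
end

# Sample value (made up for illustration): a type-2 (explicit) HLL whose
# payload holds two signed 64-bit big-endian hashes after the header.
sample = [0x12, 0x8b, 0x7f].pack("C3") + [1, -2].pack("q>*")

header = decode_hll_header(sample)
hashes = sample[3..-1].unpack("q>*")
```

Here `header` reports schema version 1, type 2, `regwidth` 5, `log2m` 11, `sparseon` 1, and an automatic `expthresh` of -1, while `hashes` recovers the two payload values.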
data/lib/active_hll/model.rb ADDED
@@ -0,0 +1,66 @@
1
+ require "active_support/concern"
2
+
3
+ module ActiveHll
4
+ module Model
5
+ extend ActiveSupport::Concern
6
+
7
+ class_methods do
8
+ def hll_agg(column)
9
+ Utils.hll_calculate(self, "hll_add_agg(hll_hash_any(%s)) AS hll_agg", column, default_value: nil)
10
+ end
11
+
12
+ def hll_union(column)
13
+ Utils.hll_calculate(self, "hll_union_agg(%s) AS hll_union", column, default_value: nil)
14
+ end
15
+
16
+ def hll_count(column)
17
+ Utils.hll_calculate(self, "hll_cardinality(hll_union_agg(%s)) AS hll_count", column, default_value: 0.0)
18
+ end
19
+
20
+ # experimental
21
+ # doesn't work with non-default parameters
22
+ def hll_generate(values)
23
+ parts = ["hll_empty()"]
24
+
25
+ values.each do |value|
26
+ parts << Utils.hll_hash_sql(self, value)
27
+ end
28
+
29
+ result = connection.select_all("SELECT #{parts.join(" || ")}").rows[0][0]
30
+ ActiveHll::Type.new.deserialize(result)
31
+ end
32
+
33
+ def hll_add(attributes)
34
+ set_clauses =
35
+ attributes.map do |attribute, values|
36
+ values = [values] unless values.is_a?(Array)
37
+ return 0 if values.empty?
38
+
39
+ quoted_column = connection.quote_column_name(attribute)
40
+ # possibly fetch parameters for the column in the future
41
+ # for now, users should set a default value on the column
42
+ parts = ["COALESCE(#{quoted_column}, hll_empty())"]
43
+
44
+ values.each do |value|
45
+ parts << Utils.hll_hash_sql(self, value)
46
+ end
47
+
48
+ "#{quoted_column} = #{parts.join(" || ")}"
49
+ end
50
+
51
+ update_all(set_clauses.join(", "))
52
+ end
53
+ end
54
+
55
+ # doesn't update in-memory record attribute for performance
56
+ def hll_add(attributes)
57
+ self.class.where(id: id).hll_add(attributes)
58
+ nil
59
+ end
60
+
61
+ def hll_count(attribute)
62
+ quoted_column = self.class.connection.quote_column_name(attribute)
63
+ self.class.where(id: id).pluck("hll_cardinality(#{quoted_column})").first || 0.0
64
+ end
65
+ end
66
+ end
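The class-level `hll_add` above builds one `SET` clause per attribute by chaining hash calls with the `||` operator. The sketch below reproduces that string construction without a database. The helpers `quote_column` and `quote_value` are simplified, hypothetical stand-ins for the connection's `quote_column_name` and `quote`, so the exact quoting in a real application may differ.

```ruby
# Hypothetical, simplified stand-ins for the connection's quoting methods,
# so the sketch runs without a database.
def quote_column(name)
  '"' + name.to_s.gsub('"', '""') + '"'
end

def quote_value(value)
  case value
  when true then "TRUE"
  when false then "FALSE"
  when Integer then value.to_s
  when String then "'" + value.gsub("'", "''") + "'"
  else raise ArgumentError, "Unexpected type: #{value.class.name}"
  end
end

# Same type dispatch as Utils.hll_hash_sql in the diff
def hash_function_for(value)
  case value
  when true, false then "hll_hash_boolean"
  when Integer then "hll_hash_bigint"
  when String then "hll_hash_text"
  else raise ArgumentError, "Unexpected type: #{value.class.name}"
  end
end

# Builds the SET clause for one attribute, starting from the existing
# column value (or an empty HLL when the column is NULL).
def hll_add_set_clause(attribute, values)
  values = [values] unless values.is_a?(Array)
  column = quote_column(attribute)
  parts = ["COALESCE(#{column}, hll_empty())"]
  values.each do |value|
    parts << "#{hash_function_for(value)}(#{quote_value(value)})"
  end
  "#{column} = #{parts.join(" || ")}"
end

clause = hll_add_set_clause(:visitor_ids, ["visitor1", 42])
```

With these stand-ins, `clause` is `"visitor_ids" = COALESCE("visitor_ids", hll_empty()) || hll_hash_text('visitor1') || hll_hash_bigint(42)`, which is the kind of fragment passed to `update_all`.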
data/lib/active_hll/type.rb ADDED
@@ -0,0 +1,21 @@
1
+ module ActiveHll
2
+ class Type < ActiveRecord::ConnectionAdapters::PostgreSQL::OID::Bytea
3
+ def type
4
+ :hll
5
+ end
6
+
7
+ def serialize(value)
8
+ if value.is_a?(Hll)
9
+ value = value.value
10
+ elsif !value.nil?
11
+ raise ArgumentError, "can't cast #{value.class.name} to hll"
12
+ end
13
+ super(value)
14
+ end
15
+
16
+ def deserialize(value)
17
+ value = super
18
+ value.nil? ? value : Hll.new(value)
19
+ end
20
+ end
21
+ end
data/lib/active_hll/utils.rb ADDED
@@ -0,0 +1,70 @@
1
+ module ActiveHll
2
+ module Utils
3
+ class << self
4
+ def hll_hash_sql(klass, value)
5
+ hash_function =
6
+ case value
7
+ when true, false
8
+ "hll_hash_boolean"
9
+ when Integer
10
+ "hll_hash_bigint"
11
+ when String
12
+ "hll_hash_text"
13
+ else
14
+ raise ArgumentError, "Unexpected type: #{value.class.name}"
15
+ end
16
+ quoted_value = klass.connection.quote(value)
17
+ "#{hash_function}(#{quoted_value})"
18
+ end
19
+
20
+ def hll_calculate(relation, operation, column, default_value:)
21
+ sql, relation, group_values = hll_calculate_sql(relation, operation, column)
22
+ result = relation.connection.select_all(sql)
23
+
24
+ # typecast
25
+ rows = []
26
+ columns = result.columns
27
+ result.rows.each do |untyped_row|
28
+ rows << (result.column_types.empty? ? untyped_row : columns.each_with_index.map { |c, i| untyped_row[i] && result.column_types[c] ? result.column_types[c].deserialize(untyped_row[i]) : untyped_row[i] })
29
+ end
30
+
31
+ result =
32
+ if group_values.any?
33
+ Hash[rows.map { |r| [r.size == 2 ? r[0] : r[0..-2], r[-1]] }]
34
+ else
35
+ rows[0] && rows[0][0]
36
+ end
37
+
38
+ result = Groupdate.process_result(relation, result, default_value: default_value) if defined?(Groupdate.process_result)
39
+
40
+ result
41
+ end
42
+
43
+ def hll_calculate_sql(relation, operation, column)
44
+ # basic version of Active Record disallow_raw_sql!
45
+ # symbol = column (safe), Arel node = SQL (safe), other = untrusted
46
+ # matches table.column and column
47
+ unless column.is_a?(Symbol) || column.is_a?(Arel::Nodes::SqlLiteral)
48
+ column = column.to_s
49
+ unless /\A\w+(\.\w+)?\z/i.match(column)
50
+ raise ActiveRecord::UnknownAttributeReference, "Query method called with non-attribute argument(s): #{column.inspect}. Use Arel.sql() for known-safe values."
51
+ end
52
+ end
53
+
54
+ # column resolution
55
+ node = relation.all.send(:arel_columns, [column]).first
56
+ node = Arel::Nodes::SqlLiteral.new(node) if node.is_a?(String)
57
+ column = relation.connection.visitor.accept(node, Arel::Collectors::SQLString.new).value
58
+
59
+ group_values = relation.all.group_values
60
+
61
+ relation = relation.unscope(:select).select(*group_values, operation % [column])
62
+
63
+ # same as average
64
+ relation = relation.unscope(:order).distinct!(false) if group_values.empty?
65
+
66
+ [relation.to_sql, relation, group_values]
67
+ end
68
+ end
69
+ end
70
+ end
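Two small pieces of `Utils` can be tried in isolation: the allowlist check applied to non-symbol column arguments, and the folding of grouped result rows into a hash. The sketch below is a simplified restatement under assumed helper names (`safe_column_reference?` and `fold_grouped_rows` are not part of the gem), and it leaves out the `Arel::Nodes::SqlLiteral` escape hatch that the original also accepts.

```ruby
# Same pattern as in hll_calculate_sql: matches `column` and `table.column`
SAFE_COLUMN_PATTERN = /\A\w+(\.\w+)?\z/

# Symbols are trusted as column names; anything else must match the pattern.
def safe_column_reference?(column)
  column.is_a?(Symbol) || SAFE_COLUMN_PATTERN.match?(column.to_s)
end

# The last element of each row is the aggregate. A single group column
# becomes a scalar key; several group columns become an array key.
def fold_grouped_rows(rows)
  rows.to_h { |row| [row.size == 2 ? row[0] : row[0..-2], row[-1]] }
end

single = fold_grouped_rows([["2023-01-01", 3.0], ["2023-01-02", 5.0]])
multi = fold_grouped_rows([["2023-01-01", "web", 2.0]])
```

Anchoring the pattern with `\A` and `\z` rather than `^` and `$` matters here, because the latter would accept a multi-line string whose first line looks like a column name.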
data/lib/active_hll/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ module ActiveHll
2
+ VERSION = "0.1.0"
3
+ end
data/lib/active_hll.rb ADDED
@@ -0,0 +1,41 @@
1
+ # dependencies
2
+ require "active_support"
3
+
4
+ # modules
5
+ require_relative "active_hll/hll"
6
+ require_relative "active_hll/utils"
7
+ require_relative "active_hll/version"
8
+
9
+ module ActiveHll
10
+ class Error < StandardError; end
11
+
12
+ autoload :Type, "active_hll/type"
13
+
14
+ module RegisterType
15
+ def initialize_type_map(m = type_map)
16
+ super
17
+ m.register_type "hll", ActiveHll::Type.new
18
+ end
19
+ end
20
+ end
21
+
22
+ ActiveSupport.on_load(:active_record) do
23
+ require_relative "active_hll/model"
24
+
25
+ include ActiveHll::Model
26
+
27
+ require "active_record/connection_adapters/postgresql_adapter"
28
+
29
+ # ensure schema can be dumped
30
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:hll] = {name: "hll"}
31
+
32
+ # ensure schema can be loaded
33
+ ActiveRecord::ConnectionAdapters::TableDefinition.send(:define_column_methods, :hll)
34
+
35
+ # prevent unknown OID warning
36
+ if ActiveRecord::VERSION::MAJOR >= 7
37
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.singleton_class.prepend(ActiveHll::RegisterType)
38
+ else
39
+ ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(ActiveHll::RegisterType)
40
+ end
41
+ end
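The version check above prepends `RegisterType` to the adapter's singleton class on Active Record 7 and to the adapter class itself on earlier versions, which suggests the hook moved from an instance method to a class-level method. The sketch below shows only the class-level case, using a made-up `FakeAdapter` and a plain hash in place of the real type map; none of these names come from Active Record.

```ruby
# Hypothetical stand-in for an adapter whose type-map hook is a class method
class FakeAdapter
  def self.initialize_type_map(map)
    map["bytea"] = :bytea
  end
end

# Mirrors the shape of ActiveHll::RegisterType: call the original hook,
# then register one extra type.
module RegisterHll
  def initialize_type_map(map)
    super
    map["hll"] = :hll
  end
end

# Prepending to the singleton class places the module ahead of the class's
# own singleton methods, so `super` reaches FakeAdapter.initialize_type_map.
FakeAdapter.singleton_class.prepend(RegisterHll)

type_map = {}
FakeAdapter.initialize_type_map(type_map)
```

After the call, `type_map` contains both the adapter's own `"bytea"` entry and the added `"hll"` entry. If the hook were an instance method, `FakeAdapter.prepend(RegisterHll)` would be the equivalent call.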
data/lib/generators/active_hll/install_generator.rb ADDED
@@ -0,0 +1,18 @@
1
+ require "rails/generators/active_record"
2
+
3
+ module ActiveHll
4
+ module Generators
5
+ class InstallGenerator < Rails::Generators::Base
6
+ include ActiveRecord::Generators::Migration
7
+ source_root File.join(__dir__, "templates")
8
+
9
+ def copy_migration
10
+ migration_template "migration.rb", "db/migrate/install_active_hll.rb", migration_version: migration_version
11
+ end
12
+
13
+ def migration_version
14
+ "[#{ActiveRecord::VERSION::MAJOR}.#{ActiveRecord::VERSION::MINOR}]"
15
+ end
16
+ end
17
+ end
18
+ end
data/lib/generators/active_hll/templates/migration.rb.tt ADDED
@@ -0,0 +1,5 @@
1
+ class <%= migration_class_name %> < ActiveRecord::Migration<%= migration_version %>
2
+ def change
3
+ enable_extension "hll"
4
+ end
5
+ end
metadata ADDED
@@ -0,0 +1,67 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: active_hll
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ platform: ruby
6
+ authors:
7
+ - Andrew Kane
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+ date: 2023-01-24 00:00:00.000000000 Z
12
+ dependencies:
13
+ - !ruby/object:Gem::Dependency
14
+ name: activerecord
15
+ requirement: !ruby/object:Gem::Requirement
16
+ requirements:
17
+ - - ">="
18
+ - !ruby/object:Gem::Version
19
+ version: '6'
20
+ type: :runtime
21
+ prerelease: false
22
+ version_requirements: !ruby/object:Gem::Requirement
23
+ requirements:
24
+ - - ">="
25
+ - !ruby/object:Gem::Version
26
+ version: '6'
27
+ description:
28
+ email: andrew@ankane.org
29
+ executables: []
30
+ extensions: []
31
+ extra_rdoc_files: []
32
+ files:
33
+ - CHANGELOG.md
34
+ - LICENSE.txt
35
+ - README.md
36
+ - lib/active_hll.rb
37
+ - lib/active_hll/hll.rb
38
+ - lib/active_hll/model.rb
39
+ - lib/active_hll/type.rb
40
+ - lib/active_hll/utils.rb
41
+ - lib/active_hll/version.rb
42
+ - lib/generators/active_hll/install_generator.rb
43
+ - lib/generators/active_hll/templates/migration.rb.tt
44
+ homepage: https://github.com/ankane/active_hll
45
+ licenses:
46
+ - MIT
47
+ metadata: {}
48
+ post_install_message:
49
+ rdoc_options: []
50
+ require_paths:
51
+ - lib
52
+ required_ruby_version: !ruby/object:Gem::Requirement
53
+ requirements:
54
+ - - ">="
55
+ - !ruby/object:Gem::Version
56
+ version: '2.7'
57
+ required_rubygems_version: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - ">="
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ requirements: []
63
+ rubygems_version: 3.4.1
64
+ signing_key:
65
+ specification_version: 4
66
+ summary: HyperLogLog for Rails and Postgres
67
+ test_files: []