fast_count 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 3432010a9c6d6f616b7341df1dffbf3ea02b9fb22e846ab59c6562609ff109ff
-   data.tar.gz: 42b6a79b370de8a2c919bdc9744bee718436b7ecba75adf6c95cb82a3bafbe69
+   metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
+   data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
  SHA512:
-   metadata.gz: 0b61215c4ce6eb05626baba644ff34ba11e4c2fe08deec601f5313bdc4c6805594a4ee9dc1637345fdc1891d9bb17e12a251bfe57a17c96f807913ecfb64c83b
-   data.tar.gz: 9172798b85f35b0b82b1e91917feba59fc317697cea6786cf94549e23aa2625b284c482f7c7d7411ba21111f83c59c9af976d2afe10d4c7fdff0fd3cc5966f3c
+   metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
+   data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75
data/CHANGELOG.md CHANGED
@@ -1,5 +1,20 @@
  ## master (unreleased)
  
+ ## 0.2.0 (2023-07-24)
+
+ - Support for quickly getting an exact number of distinct values in a column
+
+   It is suited for cases, when there is a small amount of distinct values in a column compared to a total number
+   of values (for example, 10M rows total and 200 distinct values).
+   Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+   Example:
+   ```ruby
+   User.fast_distinct_count(column: :company_id)
+   ```
+
+ - Support PostgreSQL schemas for `#fast_count`
+
  ## 0.1.0 (2023-04-26)
  
  - First release
data/README.md CHANGED
@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
  
  | SQL | Result | Accuracy | Time |
  | --- | --- | --- | --- |
- | `SELECT count(*) FROM small_table;` | `2037104` | `100.000%` | `4.900s` |
- | `SELECT fast_count('small_table');` | `2036407` | `99.965%` | `0.050s` |
- | `SELECT count(*) FROM medium_table;` | `81716243` | `100.000%` | `257.5s` |
- | `SELECT fast_count('medium_table');` | `81600513` | `99.858%` | `0.048s` |
- | `SELECT count(*) FROM large_table;` | `455270802` | `100.000%` | `310.6s` |
- | `SELECT fast_count('large_table');` | `454448393` | `99.819%` | `0.046s` |
+ | `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
+ | `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
+ | `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
+ | `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
+ | `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
+ | `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
  
  *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
  
@@ -51,6 +51,12 @@ $ gem install fast_count
  
  If you are using PostgreSQL, you need to create a database function, used internally:
  
+ ```sh
+ $ rails generate migration install_fast_count
+ ```
+
+ with the content:
+
  ```ruby
  class InstallFastCount < ActiveRecord::Migration[7.0]
    def up
@@ -65,12 +71,16 @@ end
  
  ## Usage
  
- To get an estimated count of the rows in a table:
+ ### Estimated table count
+
+ To quickly get an estimated count of the rows in a table:
  
  ```ruby
  User.fast_count # => 1_254_312_219
  ```
  
+ ### Result set size estimation
+
  If you want to quickly get an estimation of how many rows will the query return, without actually executing it, yo can run:
  
  ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
  
  **Note**: `estimated_count` relies on the database query planner estimations (basically on the output of `EXPLAIN`) to get its results and can be very imprecise. It is better be used to get an idea of the order of magnitude of the future result.
  
+ ### Exact distinct values count
+
+ To quickly get an exact number of distinct values in a column, you can run:
+
+ ```ruby
+ User.fast_distinct_count(column: :company_id) # => 243
+ ```
+
+ It is suited for cases when there is a small amount of distinct values in a column compared to a total number
+ of values (for example, 10M rows total and 200 distinct values).
+
+ Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+ **Note**: You need to have an index starting with the specified column for this to work.
+
+ Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
+
  ## Configuration
  
  You can override the following default options:
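
The "Loose Index Scan" idea named in the README change above can be illustrated without a database. This is only a minimal sketch, assuming a sorted Ruby array as a stand-in for a B-tree index on the column; `loose_distinct_count` is a hypothetical helper and is not part of fast_count. Instead of visiting every entry, it jumps from each distinct value straight to the first entry that is greater, so the work grows with the number of distinct values rather than with the number of rows.

```ruby
# Hypothetical illustration: a sorted array plays the role of a B-tree index.
def loose_distinct_count(sorted_values)
  count = 0
  index = sorted_values.empty? ? nil : 0

  while index
    current = sorted_values[index]
    count += 1
    # Binary-search for the first entry greater than the current value,
    # skipping all duplicates of it in one step (nil when none is left).
    index = sorted_values.bsearch_index { |value| value > current }
  end

  count
end

company_ids = [1, 1, 1, 2, 2, 5, 5, 5, 9]
puts loose_distinct_count(company_ids) # prints 4
```

With 10M rows and 200 distinct values, a loop like this would perform about 200 index lookups instead of reading 10M entries, which is the situation the README describes as a good fit.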
@@ -21,6 +21,28 @@ module FastCount
          query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
          query_plan.match(/rows=(\d+)/)[1].to_i
        end
+
+       # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
+       # so we can just directly run the query.
+       def fast_distinct_count(table_name, column_name)
+         unless index_exists?(table_name, column_name)
+           raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+         end
+
+         @connection.select_value(<<~SQL)
+           SELECT COUNT(*) FROM (
+             SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+           ) AS tmp
+         SQL
+       end
+
+       private
+       def index_exists?(table_name, column_name)
+         indexes = @connection.schema_cache.indexes(table_name)
+         indexes.find do |index|
+           index.using == :btree && Array(index.columns).first == column_name.to_s
+         end
+       end
      end
    end
  end
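
The `index_exists?` guard added above can be exercised in isolation. The following is a rough sketch, assuming a hypothetical `IndexDefinition` struct in place of the index metadata objects returned by the schema cache, and a hypothetical `usable_index?` helper. The `where` check mirrors the PostgreSQL variant that appears later in this diff; the MySQL variant above omits it. `Array(...)` is kept because the `columns` value may not always be an array.

```ruby
# Hypothetical stand-in for the index metadata the adapter inspects.
IndexDefinition = Struct.new(:columns, :using, :where, keyword_init: true)

# True when some full (non-partial) B-tree index has the column in first position.
def usable_index?(indexes, column_name)
  indexes.any? do |index|
    index.using == :btree &&
      index.where.nil? &&
      Array(index.columns).first == column_name.to_s
  end
end

indexes = [
  IndexDefinition.new(columns: ["company_id", "created_at"], using: :btree),
  IndexDefinition.new(columns: ["email"], using: :btree, where: "deleted_at IS NULL"),
  IndexDefinition.new(columns: ["created_at", "role"], using: :btree)
]

puts usable_index?(indexes, :company_id) # prints true: leading column of a full B-tree index
puts usable_index?(indexes, :email)      # prints false: the only matching index is partial
puts usable_index?(indexes, :role)       # prints false: the column is not in leading position
```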
@@ -6,9 +6,23 @@ module FastCount
      class PostgresqlAdapter < BaseAdapter
        def install
          @connection.execute(<<~SQL)
-           CREATE FUNCTION fast_count(table_name text, threshold bigint) RETURNS bigint AS $$
-           DECLARE count bigint;
+           CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
+           DECLARE
+             count bigint;
+             table_parts text[];
+             schema_name text;
+             table_name text;
            BEGIN
+             SELECT PARSE_IDENT(identifier) INTO table_parts;
+
+             IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
+               schema_name := ''''|| table_parts[1] ||'''';
+               table_name := ''''|| table_parts[2] ||'''';
+             ELSE
+               schema_name := 'ANY (current_schemas(false))';
+               table_name := ''''|| table_parts[1] ||'''';
+             END IF;
+
              EXECUTE '
                WITH tables_counts AS (
                  -- inherited and partitioned tables counts
@@ -17,22 +31,26 @@ module FastCount
                    (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
                  FROM pg_inherits
                  INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
-                 INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
-                 WHERE parent.relname = ''' || table_name || '''
+                 LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
+                 INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
+                 WHERE n.nspname = '|| schema_name ||' AND
+                   parent.relname = '|| table_name ||'
  
                  UNION ALL
  
                  -- table count
                  SELECT
                    (reltuples::float / greatest(relpages, 1)) *
-                     (pg_relation_size(pg_class.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
-                 FROM pg_class
-                 WHERE relname = '''|| table_name ||'''
+                     (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
+                 FROM pg_class c
+                 LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+                 WHERE n.nspname = '|| schema_name ||' AND
+                   c.relname = '|| table_name ||'
                )
  
                SELECT
                  CASE
-                 WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM "'|| table_name ||'")
+                 WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
                  ELSE SUM(estimate)
                  END AS count
                FROM tables_counts' INTO count;
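
The "table count" branch of the SQL above scales the row density recorded in `pg_class` by the table's current size in pages, then falls back to an exact `COUNT(*)` when the estimate is below the threshold. The arithmetic can be sketched in plain Ruby. This is only an illustration: `estimated_rows` and `count_strategy` are hypothetical helpers, the sample numbers are invented, and 8192 bytes is assumed as the block size (the PostgreSQL default).

```ruby
# Hypothetical helper mirroring the estimate expression from the SQL above.
def estimated_rows(reltuples:, relpages:, relation_size_bytes:, block_size: 8192)
  # Rows per page as of the last statistics update; greatest(relpages, 1) avoids division by zero.
  tuples_per_page = reltuples.to_f / [relpages, 1].max
  # Number of pages the table occupies right now.
  current_pages = (relation_size_bytes.to_f / block_size).round
  (tuples_per_page * current_pages).round
end

# Hypothetical helper mirroring the CASE expression: small tables are counted exactly.
def count_strategy(estimate:, threshold:)
  estimate < threshold ? :exact_count : :estimate
end

# Invented numbers: statistics recorded 1M rows in 10,000 pages,
# and the table has since grown to 12,000 pages.
estimate = estimated_rows(
  reltuples: 1_000_000,
  relpages: 10_000,
  relation_size_bytes: 12_000 * 8192
)

puts estimate                                              # prints 1200000
puts count_strategy(estimate: estimate, threshold: 50_000) # prints estimate
```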
@@ -56,6 +74,43 @@ module FastCount
          query_plan = @connection.select_value("EXPLAIN #{sql}")
          query_plan.match(/rows=(\d+)/)[1].to_i
        end
+
+       def fast_distinct_count(table_name, column_name)
+         unless index_exists?(table_name, column_name)
+           raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+         end
+
+         table = @connection.quote_table_name(table_name)
+         column = @connection.quote_column_name(column_name)
+
+         @connection.select_value(<<~SQL)
+           WITH RECURSIVE t AS (
+             (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
+             UNION
+             SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
+             FROM t
+             WHERE t.#{column} IS NOT NULL
+           ),
+
+           distinct_values AS (
+             SELECT #{column} FROM t WHERE #{column} IS NOT NULL
+             UNION
+             SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
+           )
+
+           SELECT COUNT(*) FROM distinct_values
+         SQL
+       end
+
+       private
+       def index_exists?(table_name, column_name)
+         indexes = @connection.schema_cache.indexes(table_name)
+         indexes.find do |index|
+           index.using == :btree &&
+             index.where.nil? &&
+             Array(index.columns).first == column_name.to_s
+         end
+       end
      end
    end
  end
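
The two CTEs in the query above can be traced step by step in plain Ruby. This is a sketch under stated assumptions: an in-memory array stands in for the table column, and `next_value` and `distinct_count_like_cte` are hypothetical names. It also makes one behavior visible: the query adds a `NULL` row when the column contains any `NULL`, so `NULL` is counted as one distinct value, whereas `COUNT(DISTINCT column)` ignores `NULL`s.

```ruby
# Hypothetical stand-in for:
#   SELECT column FROM table WHERE column > ? ORDER BY column LIMIT 1
# Returns nil when no greater value exists, like the scalar subquery yielding NULL.
def next_value(rows, after)
  rows.compact.select { |value| value > after }.min
end

def distinct_count_like_cte(rows)
  # Recursive CTE "t": start at the smallest value, then keep asking for the next greater one.
  t = []
  current = rows.compact.min
  until current.nil?
    t << current
    current = next_value(rows, current)
  end

  # CTE "distinct_values": the non-NULL values from t, plus one NULL row if any NULL exists.
  distinct_values = t.dup
  distinct_values << nil if rows.any?(&:nil?)

  distinct_values.size
end

company_ids = [3, 1, nil, 3, 2, nil, 1]
puts distinct_count_like_cte(company_ids) # prints 4 (1, 2, 3 and NULL)
puts company_ids.compact.uniq.size        # prints 3, what COUNT(DISTINCT column) would report
```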
@@ -15,6 +15,14 @@ module FastCount
        def estimated_count(sql)
          @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
        end
+
+       def fast_distinct_count(table_name, column_name)
+         @connection.select_value(<<~SQL)
+           SELECT COUNT(*) FROM (
+             SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+           ) AS tmp
+         SQL
+       end
      end
    end
  end
@@ -3,6 +3,9 @@
  module FastCount
    module Extensions
      module ModelExtension
+       # Returns an estimated number of rows in the table.
+       # Runs in milliseconds.
+       #
        # @example
        #   User.fast_count
        #   User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
          adapter = Adapters.for_connection(connection)
          adapter.fast_count(table_name, threshold)
        end
+
+       # Returns an exact number of distinct values in a column.
+       # It is suited for cases, when there is a small amount
+       # of distinct values in a column compared to a total number
+       # of values (for example, 10M rows total and 500 distinct values).
+       #
+       # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
+       #
+       # Note: You need to have an index starting with the specified column
+       # for this to work.
+       #
+       # Uses an "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
+       #
+       # @example
+       #   User.fast_distinct_count(column: :company_id)
+       #
+       def fast_distinct_count(column:)
+         adapter = Adapters.for_connection(connection)
+         adapter.fast_distinct_count(table_name, column)
+       end
      end
  
      module RelationExtension
+       # Returns an estimated number of rows that the query will return
+       # (without actually executing it).
+       #
        # @example
        #   User.where.missing(:avatar).estimated_count
        #
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
  
  module FastCount
-   VERSION = "0.1.0"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: fast_count
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.0
  platform: ruby
  authors:
  - fatkodima
@@ -9,7 +9,7 @@ authors:
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2023-04-26 00:00:00.000000000 Z
+ date: 2023-07-24 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.4.12
+ rubygems_version: 3.4.6
  signing_key:
  specification_version: 4
  summary: Quickly get a count estimation for large tables.