fast_count 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 3432010a9c6d6f616b7341df1dffbf3ea02b9fb22e846ab59c6562609ff109ff
4
- data.tar.gz: 42b6a79b370de8a2c919bdc9744bee718436b7ecba75adf6c95cb82a3bafbe69
3
+ metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
4
+ data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
5
5
  SHA512:
6
- metadata.gz: 0b61215c4ce6eb05626baba644ff34ba11e4c2fe08deec601f5313bdc4c6805594a4ee9dc1637345fdc1891d9bb17e12a251bfe57a17c96f807913ecfb64c83b
7
- data.tar.gz: 9172798b85f35b0b82b1e91917feba59fc317697cea6786cf94549e23aa2625b284c482f7c7d7411ba21111f83c59c9af976d2afe10d4c7fdff0fd3cc5966f3c
6
+ metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
7
+ data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75
data/CHANGELOG.md CHANGED
@@ -1,5 +1,20 @@
1
1
  ## master (unreleased)
2
2
 
3
+ ## 0.2.0 (2023-07-24)
4
+
5
+ - Support for quickly getting an exact number of distinct values in a column
6
+
7
+ It is suited for cases when there is a small number of distinct values in a column compared to the total number
8
+ of values (for example, 10M rows total and 200 distinct values).
9
+ Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
10
+
11
+ Example:
12
+ ```ruby
13
+ User.fast_distinct_count(column: :company_id)
14
+ ```
15
+
16
+ - Support PostgreSQL schemas for `#fast_count`
17
+
3
18
  ## 0.1.0 (2023-04-26)
4
19
 
5
20
  - First release
data/README.md CHANGED
@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
8
8
 
9
9
  | SQL | Result | Accuracy | Time |
10
10
  | --- | --- | --- | --- |
11
- | `SELECT count(*) FROM small_table;` | `2037104` | `100.000%` | `4.900s` |
12
- | `SELECT fast_count('small_table');` | `2036407` | `99.965%` | `0.050s` |
13
- | `SELECT count(*) FROM medium_table;` | `81716243` | `100.000%` | `257.5s` |
14
- | `SELECT fast_count('medium_table');` | `81600513` | `99.858%` | `0.048s` |
15
- | `SELECT count(*) FROM large_table;` | `455270802` | `100.000%` | `310.6s` |
16
- | `SELECT fast_count('large_table');` | `454448393` | `99.819%` | `0.046s` |
11
+ | `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
12
+ | `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
13
+ | `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
14
+ | `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
15
+ | `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
16
+ | `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
17
17
 
18
18
  *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
19
19
 
@@ -51,6 +51,12 @@ $ gem install fast_count
51
51
 
52
52
  If you are using PostgreSQL, you need to create a database function, used internally:
53
53
 
54
+ ```sh
55
+ $ rails generate migration install_fast_count
56
+ ```
57
+
58
+ with the content:
59
+
54
60
  ```ruby
55
61
  class InstallFastCount < ActiveRecord::Migration[7.0]
56
62
  def up
@@ -65,12 +71,16 @@ end
65
71
 
66
72
  ## Usage
67
73
 
68
- To get an estimated count of the rows in a table:
74
+ ### Estimated table count
75
+
76
+ To quickly get an estimated count of the rows in a table:
69
77
 
70
78
  ```ruby
71
79
  User.fast_count # => 1_254_312_219
72
80
  ```
73
81
 
82
+ ### Result set size estimation
83
+
74
84
  If you want to quickly get an estimate of how many rows a query will return, without actually executing it, you can run:
75
85
 
76
86
  ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
79
89
 
80
90
  **Note**: `estimated_count` relies on the database query planner estimations (basically on the output of `EXPLAIN`) to get its results and can be very imprecise. It is best used to get an idea of the order of magnitude of the future result.
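
As a rough illustration of how a planner estimate can be read out of `EXPLAIN` output, here is a minimal sketch. The helper name `rows_from_plan` and the sample plan line are assumptions made for this example and are not part of the gem. Unlike a bare `match(...)[1]`, the sketch returns `nil` when no `rows=` figure is present.

```ruby
# Hypothetical helper (not part of fast_count): extracts the first
# planner row estimate found in a piece of EXPLAIN output.
def rows_from_plan(query_plan)
  match = query_plan.match(/rows=(\d+)/)
  match && match[1].to_i
end

# Sample plan line in the style PostgreSQL prints; the numbers are made up.
plan = "Seq Scan on users  (cost=0.00..458.00 rows=10000 width=244)"

rows_from_plan(plan)          # => 10000
rows_from_plan("no estimate") # => nil
```

Only the first `rows=` occurrence is read, which in a typical plan is the estimate for the top-level node.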
81
91
 
92
+ ### Exact distinct values count
93
+
94
+ To quickly get an exact number of distinct values in a column, you can run:
95
+
96
+ ```ruby
97
+ User.fast_distinct_count(column: :company_id) # => 243
98
+ ```
99
+
100
+ It is suited for cases when there is a small number of distinct values in a column compared to the total number
101
+ of values (for example, 10M rows total and 200 distinct values).
102
+
103
+ Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
104
+
105
+ **Note**: You need to have an index starting with the specified column for this to work.
106
+
107
+ Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
108
+
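
The idea behind a loose index scan can be sketched in plain Ruby, with a sorted array standing in for a B-tree index on the column. This is an illustration under that assumption, not the gem's implementation: the method name `loose_distinct_count` and the `has_nulls:` flag are hypothetical. Each step seeks directly to the first entry greater than the current value, so the work grows with the number of distinct values rather than the number of rows.

```ruby
# A sorted array stands in for a B-tree index on the column.
# NULLs are kept out of the "index" and signalled by a flag for simplicity.
def loose_distinct_count(sorted_values, has_nulls: false)
  count = 0
  index = 0
  while index < sorted_values.length
    current = sorted_values[index]
    count += 1
    # Seek to the first entry greater than the current value
    # instead of stepping over every duplicate.
    index = sorted_values.bsearch_index { |value| value > current } || sorted_values.length
  end
  # Mirrors the PostgreSQL query in this release, which adds NULL
  # as one more distinct value when any NULL row exists.
  has_nulls ? count + 1 : count
end

company_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]

loose_distinct_count(company_ids)                  # => 4
loose_distinct_count(company_ids, has_nulls: true) # => 5
```

Note that counting NULL as a value differs from `COUNT(DISTINCT column)`, which ignores NULLs.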
82
109
  ## Configuration
83
110
 
84
111
  You can override the following default options:
@@ -21,6 +21,28 @@ module FastCount
21
21
  query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
22
22
  query_plan.match(/rows=(\d+)/)[1].to_i
23
23
  end
24
+
25
+ # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
26
+ # so we can just directly run the query.
27
+ def fast_distinct_count(table_name, column_name)
28
+ unless index_exists?(table_name, column_name)
29
+ raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
30
+ end
31
+
32
+ @connection.select_value(<<~SQL)
33
+ SELECT COUNT(*) FROM (
34
+ SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
35
+ ) AS tmp
36
+ SQL
37
+ end
38
+
39
+ private
40
+ def index_exists?(table_name, column_name)
41
+ indexes = @connection.schema_cache.indexes(table_name)
42
+ indexes.find do |index|
43
+ index.using == :btree && Array(index.columns).first == column_name.to_s
44
+ end
45
+ end
24
46
  end
25
47
  end
26
48
  end
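
The index precondition can be illustrated without a database. In this sketch `IndexInfo` is a hypothetical stand-in for an index definition, and `leading_index?` is an assumed name; the partial-index (`where`) check follows the PostgreSQL variant of `index_exists?` in this release rather than the MySQL one above.

```ruby
# Hypothetical stand-in for an index definition; only the fields used below.
IndexInfo = Struct.new(:name, :using, :columns, :where, keyword_init: true)

# True when some full (non-partial) B-tree index starts with the column.
def leading_index?(indexes, column_name)
  indexes.any? do |index|
    index.using == :btree &&
      index.where.nil? &&
      Array(index.columns).first == column_name.to_s
  end
end

indexes = [
  IndexInfo.new(name: "idx_users_on_company_id_and_role", using: :btree, columns: %w[company_id role]),
  IndexInfo.new(name: "idx_users_on_email", using: :btree, columns: ["email"], where: "deleted_at IS NULL")
]

leading_index?(indexes, :company_id) # => true
leading_index?(indexes, :role)       # => false, not the leading column
leading_index?(indexes, :email)      # => false, partial index
```

Using `any?` returns a strict boolean, whereas `find` returns the matching index or `nil`; both behave the same inside an `unless` condition.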
@@ -6,9 +6,23 @@ module FastCount
6
6
  class PostgresqlAdapter < BaseAdapter
7
7
  def install
8
8
  @connection.execute(<<~SQL)
9
- CREATE FUNCTION fast_count(table_name text, threshold bigint) RETURNS bigint AS $$
10
- DECLARE count bigint;
9
+ CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
10
+ DECLARE
11
+ count bigint;
12
+ table_parts text[];
13
+ schema_name text;
14
+ table_name text;
11
15
  BEGIN
16
+ SELECT PARSE_IDENT(identifier) INTO table_parts;
17
+
18
+ IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
19
+ schema_name := ''''|| table_parts[1] ||'''';
20
+ table_name := ''''|| table_parts[2] ||'''';
21
+ ELSE
22
+ schema_name := 'ANY (current_schemas(false))';
23
+ table_name := ''''|| table_parts[1] ||'''';
24
+ END IF;
25
+
12
26
  EXECUTE '
13
27
  WITH tables_counts AS (
14
28
  -- inherited and partitioned tables counts
@@ -17,22 +31,26 @@ module FastCount
17
31
  (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
18
32
  FROM pg_inherits
19
33
  INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
20
- INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
21
- WHERE parent.relname = ''' || table_name || '''
34
+ LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
35
+ INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
36
+ WHERE n.nspname = '|| schema_name ||' AND
37
+ parent.relname = '|| table_name ||'
22
38
 
23
39
  UNION ALL
24
40
 
25
41
  -- table count
26
42
  SELECT
27
43
  (reltuples::float / greatest(relpages, 1)) *
28
- (pg_relation_size(pg_class.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
29
- FROM pg_class
30
- WHERE relname = '''|| table_name ||'''
44
+ (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
45
+ FROM pg_class c
46
+ LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
47
+ WHERE n.nspname = '|| schema_name ||' AND
48
+ c.relname = '|| table_name ||'
31
49
  )
32
50
 
33
51
  SELECT
34
52
  CASE
35
- WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM "'|| table_name ||'")
53
+ WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
36
54
  ELSE SUM(estimate)
37
55
  END AS count
38
56
  FROM tables_counts' INTO count;
@@ -56,6 +74,43 @@ module FastCount
56
74
  query_plan = @connection.select_value("EXPLAIN #{sql}")
57
75
  query_plan.match(/rows=(\d+)/)[1].to_i
58
76
  end
77
+
78
+ def fast_distinct_count(table_name, column_name)
79
+ unless index_exists?(table_name, column_name)
80
+ raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
81
+ end
82
+
83
+ table = @connection.quote_table_name(table_name)
84
+ column = @connection.quote_column_name(column_name)
85
+
86
+ @connection.select_value(<<~SQL)
87
+ WITH RECURSIVE t AS (
88
+ (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
89
+ UNION
90
+ SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
91
+ FROM t
92
+ WHERE t.#{column} IS NOT NULL
93
+ ),
94
+
95
+ distinct_values AS (
96
+ SELECT #{column} FROM t WHERE #{column} IS NOT NULL
97
+ UNION
98
+ SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
99
+ )
100
+
101
+ SELECT COUNT(*) FROM distinct_values
102
+ SQL
103
+ end
104
+
105
+ private
106
+ def index_exists?(table_name, column_name)
107
+ indexes = @connection.schema_cache.indexes(table_name)
108
+ indexes.find do |index|
109
+ index.using == :btree &&
110
+ index.where.nil? &&
111
+ Array(index.columns).first == column_name.to_s
112
+ end
113
+ end
59
114
  end
60
115
  end
61
116
  end
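
The arithmetic behind the estimate in the `fast_count` SQL function can be sketched in Ruby: the rows-per-page density recorded in `pg_class` is scaled by the number of pages the relation currently occupies, and an exact count is used only when the estimate falls below the threshold. The method names, the sample figures, and the 8192-byte default block size are assumptions for illustration, and the sketch truncates the final product, which simplifies the casts in the SQL.

```ruby
# Density (rows per page, as recorded in pg_class) multiplied by the
# number of pages the relation occupies on disk right now.
def estimated_rows(reltuples:, relpages:, relation_size:, block_size: 8192)
  density = reltuples.to_f / [relpages, 1].max
  current_pages = relation_size.to_f / block_size
  (density * current_pages).to_i
end

# Hypothetical wrapper: runs the exact count (given as a block) only
# when the estimate is below the threshold.
def count_with_threshold(estimate, threshold)
  estimate < threshold ? yield : estimate
end

# Made-up statistics: 1M rows over 10,000 pages at the last analyze,
# and the table has since grown to 12,000 pages.
estimate = estimated_rows(reltuples: 1_000_000, relpages: 10_000, relation_size: 12_000 * 8192)

estimate                                                               # => 1200000
count_with_threshold(estimate, 50_000) { raise "exact count skipped" } # => 1200000
count_with_threshold(40, 50_000) { 42 }                                # => 42
```

The `[relpages, 1].max` guard corresponds to `greatest(relpages, 1)` in the SQL and avoids dividing by zero for tables that have no recorded pages.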
@@ -15,6 +15,14 @@ module FastCount
15
15
  def estimated_count(sql)
16
16
  @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
17
17
  end
18
+
19
+ def fast_distinct_count(table_name, column_name)
20
+ @connection.select_value(<<~SQL)
21
+ SELECT COUNT(*) FROM (
22
+ SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
23
+ ) AS tmp
24
+ SQL
25
+ end
18
26
  end
19
27
  end
20
28
  end
@@ -3,6 +3,9 @@
3
3
  module FastCount
4
4
  module Extensions
5
5
  module ModelExtension
6
+ # Returns an estimated number of rows in the table.
7
+ # Runs in milliseconds.
8
+ #
6
9
  # @example
7
10
  # User.fast_count
8
11
  # User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
11
14
  adapter = Adapters.for_connection(connection)
12
15
  adapter.fast_count(table_name, threshold)
13
16
  end
17
+
18
+ # Returns an exact number of distinct values in a column.
19
+ # It is suited for cases when there is a small number
20
+ # of distinct values in a column compared to a total number
21
+ # of values (for example, 10M rows total and 500 distinct values).
22
+ #
23
+ # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
24
+ #
25
+ # Note: You need to have an index starting with the specified column
26
+ # for this to work.
27
+ #
28
+ # Uses a "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
29
+ #
30
+ # @example
31
+ # User.fast_distinct_count(column: :company_id)
32
+ #
33
+ def fast_distinct_count(column:)
34
+ adapter = Adapters.for_connection(connection)
35
+ adapter.fast_distinct_count(table_name, column)
36
+ end
14
37
  end
15
38
 
16
39
  module RelationExtension
40
+ # Returns an estimated number of rows that the query will return
41
+ # (without actually executing it).
42
+ #
17
43
  # @example
18
44
  # User.where.missing(:avatar).estimated_count
19
45
  #
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module FastCount
4
- VERSION = "0.1.0"
4
+ VERSION = "0.2.0"
5
5
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: fast_count
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - fatkodima
@@ -9,7 +9,7 @@ authors:
9
9
  autorequire:
10
10
  bindir: bin
11
11
  cert_chain: []
12
- date: 2023-04-26 00:00:00.000000000 Z
12
+ date: 2023-07-24 00:00:00.000000000 Z
13
13
  dependencies:
14
14
  - !ruby/object:Gem::Dependency
15
15
  name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
66
66
  - !ruby/object:Gem::Version
67
67
  version: '0'
68
68
  requirements: []
69
- rubygems_version: 3.4.12
69
+ rubygems_version: 3.4.6
70
70
  signing_key:
71
71
  specification_version: 4
72
72
  summary: Quickly get a count estimation for large tables.