fast_count 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -0
- data/README.md +34 -7
- data/lib/fast_count/adapters/mysql_adapter.rb +22 -0
- data/lib/fast_count/adapters/postgresql_adapter.rb +63 -8
- data/lib/fast_count/adapters/sqlite_adapter.rb +8 -0
- data/lib/fast_count/extensions.rb +26 -0
- data/lib/fast_count/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
+  data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
+  data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,20 @@
 ## master (unreleased)
 
+## 0.2.0 (2023-07-24)
+
+- Support for quickly getting an exact number of distinct values in a column
+
+  It is suited for cases when there is a small number of distinct values in a column compared to the total number
+  of values (for example, 10M rows total and 200 distinct values).
+  Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+  Example:
+  ```ruby
+  User.fast_distinct_count(column: :company_id)
+  ```
+
+- Support PostgreSQL schemas for `#fast_count`
+
 ## 0.1.0 (2023-04-26)
 
 - First release
data/README.md
CHANGED
@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
 
 | SQL | Result | Accuracy | Time |
 | --- | --- | --- | --- |
-| `SELECT count(*) FROM small_table
-| `SELECT fast_count('small_table')
-| `SELECT count(*) FROM medium_table
-| `SELECT fast_count('medium_table')
-| `SELECT count(*) FROM large_table
-| `SELECT fast_count('large_table')
+| `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
+| `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
+| `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
+| `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
+| `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
+| `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
 
 *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
 
@@ -51,6 +51,12 @@ $ gem install fast_count
 
 If you are using PostgreSQL, you need to create a database function, used internally:
 
+```sh
+$ rails generate migration install_fast_count
+```
+
+with the content:
+
 ```ruby
 class InstallFastCount < ActiveRecord::Migration[7.0]
   def up
@@ -65,12 +71,16 @@ end
 
 ## Usage
 
-
+### Estimated table count
+
+To quickly get an estimated count of the rows in a table:
 
 ```ruby
 User.fast_count # => 1_254_312_219
 ```
 
+### Result set size estimation
+
 If you want to quickly get an estimation of how many rows the query will return, without actually executing it, you can run:
 
 ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
 
 **Note**: `estimated_count` relies on the database query planner estimations (basically on the output of `EXPLAIN`) to get its results and can be very imprecise. It is best used to get an idea of the order of magnitude of the future result.
 
+### Exact distinct values count
+
+To quickly get an exact number of distinct values in a column, you can run:
+
+```ruby
+User.fast_distinct_count(column: :company_id) # => 243
+```
+
+It is suited for cases when there is a small number of distinct values in a column compared to the total number
+of values (for example, 10M rows total and 200 distinct values).
+
+Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+**Note**: You need to have an index starting with the specified column for this to work.
+
+Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
+
 ## Configuration
 
 You can override the following default options:
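The README note on `estimated_count` says it relies on the planner's `EXPLAIN` output, and the adapters in this diff extract the figure with `match(/rows=(\d+)/)`. A minimal sketch of that extraction step follows, assuming a plan line shaped like PostgreSQL's text format. The helper name `rows_from_plan` and the sample plan string are hypothetical and are not part of the gem.

```ruby
# Illustrative only: `rows_from_plan` is a hypothetical helper, and the plan
# text below is a made-up sample in the shape PostgreSQL's EXPLAIN prints.
def rows_from_plan(query_plan)
  # The first "rows=N" in the plan belongs to the top-level node,
  # which is the planner's estimate for the whole query.
  match = query_plan.match(/rows=(\d+)/)
  match && match[1].to_i # nil when the plan has no row estimate
end

sample_plan = "Seq Scan on users  (cost=0.00..5412.00 rows=324200 width=8)"
rows_from_plan(sample_plan) # => 324200
```

Unlike the adapter code, this sketch returns `nil` instead of raising when no `rows=` figure is present; that is a choice made for the example, not the gem's behavior.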
data/lib/fast_count/adapters/mysql_adapter.rb
CHANGED
@@ -21,6 +21,28 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+
+      # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
+      # so we can just directly run the query.
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
+
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree && Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
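The `index_exists?` guards in this diff accept only a B-tree index whose leading column is the requested one, and the PostgreSQL variant additionally rejects partial indexes. A self-contained sketch of that predicate is below. `IndexStub` is a hypothetical stand-in for ActiveRecord's index definition objects, and `usable_index?` is an illustrative name, not a gem method.

```ruby
# Hypothetical stand-in for the index objects returned by the schema cache.
IndexStub = Struct.new(:using, :where, :columns, keyword_init: true)

# True when some index is a non-partial B-tree starting with column_name.
def usable_index?(indexes, column_name)
  indexes.any? do |index|
    index.using == :btree &&
      index.where.nil? && # partial indexes do not cover every row
      Array(index.columns).first == column_name.to_s
  end
end

indexes = [
  IndexStub.new(using: :btree, where: nil, columns: ["company_id", "id"]),
  IndexStub.new(using: :btree, where: "deleted_at IS NULL", columns: ["email"]),
  IndexStub.new(using: :btree, where: nil, columns: ["id", "name"])
]

usable_index?(indexes, :company_id) # => true  (leading column of a full index)
usable_index?(indexes, :email)      # => false (index is partial)
usable_index?(indexes, :name)       # => false (not the leading column)
```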
data/lib/fast_count/adapters/postgresql_adapter.rb
CHANGED
@@ -6,9 +6,23 @@ module FastCount
     class PostgresqlAdapter < BaseAdapter
       def install
         @connection.execute(<<~SQL)
-          CREATE FUNCTION fast_count(
-          DECLARE
+          CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
+          DECLARE
+            count bigint;
+            table_parts text[];
+            schema_name text;
+            table_name text;
           BEGIN
+            SELECT PARSE_IDENT(identifier) INTO table_parts;
+
+            IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
+              schema_name := ''''|| table_parts[1] ||'''';
+              table_name := ''''|| table_parts[2] ||'''';
+            ELSE
+              schema_name := 'ANY (current_schemas(false))';
+              table_name := ''''|| table_parts[1] ||'''';
+            END IF;
+
             EXECUTE '
               WITH tables_counts AS (
                 -- inherited and partitioned tables counts
@@ -17,22 +31,26 @@ module FastCount
                     (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
                 FROM pg_inherits
                   INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
-
-
+                  LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
+                  INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
+                WHERE n.nspname = '|| schema_name ||' AND
+                  parent.relname = '|| table_name ||'
 
                 UNION ALL
 
                 -- table count
                 SELECT
                   (reltuples::float / greatest(relpages, 1)) *
-                    (pg_relation_size(
-                FROM pg_class
-
+                    (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
+                FROM pg_class c
+                  LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+                WHERE n.nspname = '|| schema_name ||' AND
+                  c.relname = '|| table_name ||'
               )
 
               SELECT
                 CASE
-                WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM
+                WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
                 ELSE SUM(estimate)
                 END AS count
               FROM tables_counts' INTO count;
@@ -56,6 +74,43 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+
+        table = @connection.quote_table_name(table_name)
+        column = @connection.quote_column_name(column_name)
+
+        @connection.select_value(<<~SQL)
+          WITH RECURSIVE t AS (
+            (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
+            UNION
+            SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
+            FROM t
+            WHERE t.#{column} IS NOT NULL
+          ),
+
+          distinct_values AS (
+            SELECT #{column} FROM t WHERE #{column} IS NOT NULL
+            UNION
+            SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
+          )
+
+          SELECT COUNT(*) FROM distinct_values
+        SQL
+      end
+
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree &&
+              index.where.nil? &&
+              Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
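The recursive CTE in the PostgreSQL adapter emulates a loose index scan: it takes the smallest value, then repeatedly asks the index for the first value strictly greater than the current one, and finally adds one row if any NULL exists. The sketch below imitates that stepping in plain Ruby, using a sorted array as a stand-in for the B-tree index and a binary search for each jump. `loose_distinct_count` is a hypothetical name for illustration only and is not part of the gem.

```ruby
# Illustrative simulation of the loose index scan idea; not gem code.
def loose_distinct_count(values)
  sorted = values.compact.sort # stand-in for a B-tree index on the column
  count = 0
  index = sorted.empty? ? nil : 0 # like "ORDER BY column LIMIT 1"

  while index
    count += 1
    current = sorted[index]
    # Like "WHERE column > current ORDER BY column LIMIT 1":
    # binary-search the first position holding a larger value.
    index = (index...sorted.length).bsearch { |i| sorted[i] > current }
  end

  # The SQL adds NULL as one more distinct value when any NULL row exists.
  count += 1 if values.include?(nil)
  count
end

loose_distinct_count([3, 1, 1, nil, 2, 3, 3]) # => 4
```

Each jump costs one logarithmic lookup, so the work grows with the number of distinct values rather than the number of rows, which is why the approach fits columns with few distinct values.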
data/lib/fast_count/adapters/sqlite_adapter.rb
CHANGED
@@ -15,6 +15,14 @@ module FastCount
       def estimated_count(sql)
         @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
       end
+
+      def fast_distinct_count(table_name, column_name)
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
     end
   end
 end
data/lib/fast_count/extensions.rb
CHANGED
@@ -3,6 +3,9 @@
 module FastCount
   module Extensions
     module ModelExtension
+      # Returns an estimated number of rows in the table.
+      # Runs in milliseconds.
+      #
       # @example
       #   User.fast_count
       #   User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
         adapter = Adapters.for_connection(connection)
         adapter.fast_count(table_name, threshold)
       end
+
+      # Returns an exact number of distinct values in a column.
+      # It is suited for cases when there is a small number
+      # of distinct values in a column compared to the total number
+      # of values (for example, 10M rows total and 500 distinct values).
+      #
+      # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
+      #
+      # Note: You need to have an index starting with the specified column
+      # for this to work.
+      #
+      # Uses a "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
+      #
+      # @example
+      #   User.fast_distinct_count(column: :company_id)
+      #
+      def fast_distinct_count(column:)
+        adapter = Adapters.for_connection(connection)
+        adapter.fast_distinct_count(table_name, column)
+      end
     end
 
     module RelationExtension
+      # Returns an estimated number of rows that the query will return
+      # (without actually executing it).
+      #
       # @example
       #   User.where.missing(:avatar).estimated_count
       #
data/lib/fast_count/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fast_count
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - fatkodima
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-07-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.4.
+rubygems_version: 3.4.6
 signing_key:
 specification_version: 4
 summary: Quickly get a count estimation for large tables.