fast_count 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -0
- data/README.md +34 -7
- data/lib/fast_count/adapters/mysql_adapter.rb +22 -0
- data/lib/fast_count/adapters/postgresql_adapter.rb +63 -8
- data/lib/fast_count/adapters/sqlite_adapter.rb +8 -0
- data/lib/fast_count/extensions.rb +26 -0
- data/lib/fast_count/version.rb +1 -1
- metadata +3 -3
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
+  data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
+  data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,20 @@
 ## master (unreleased)
 
+## 0.2.0 (2023-07-24)
+
+- Support for quickly getting an exact number of distinct values in a column
+
+  It is suited for cases, when there is a small amount of distinct values in a column compared to a total number
+  of values (for example, 10M rows total and 200 distinct values).
+  Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+  Example:
+  ```ruby
+  User.fast_distinct_count(column: :company_id)
+  ```
+
+- Support PostgreSQL schemas for `#fast_count`
+
 ## 0.1.0 (2023-04-26)
 
 - First release
data/README.md
CHANGED
@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
 
 | SQL | Result | Accuracy | Time |
 | --- | --- | --- | --- |
-| `SELECT count(*) FROM small_table
-| `SELECT fast_count('small_table')
-| `SELECT count(*) FROM medium_table
-| `SELECT fast_count('medium_table')
-| `SELECT count(*) FROM large_table
-| `SELECT fast_count('large_table')
+| `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
+| `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
+| `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
+| `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
+| `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
+| `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
 
 *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
 
@@ -51,6 +51,12 @@ $ gem install fast_count
 
 If you are using PostgreSQL, you need to create a database function, used internally:
 
+```sh
+$ rails generate migration install_fast_count
+```
+
+with the content:
+
 ```ruby
 class InstallFastCount < ActiveRecord::Migration[7.0]
   def up
@@ -65,12 +71,16 @@ end
 
 ## Usage
 
-
+### Estimated table count
+
+To quickly get an estimated count of the rows in a table:
 
 ```ruby
 User.fast_count # => 1_254_312_219
 ```
 
+### Result set size estimation
+
 If you want to quickly get an estimation of how many rows will the query return, without actually executing it, yo can run:
 
 ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
 
 **Note**: `estimated_count` relies on the database query planner estimations (basically on the output of `EXPLAIN`) to get its results and can be very imprecise. It is better be used to get an idea of the order of magnitude of the future result.
 
+### Exact distinct values count
+
+To quickly get an exact number of distinct values in a column, you can run:
+
+```ruby
+User.fast_distinct_count(column: :company_id) # => 243
+```
+
+It is suited for cases when there is a small amount of distinct values in a column compared to a total number
+of values (for example, 10M rows total and 200 distinct values).
+
+Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+
+**Note**: You need to have an index starting with the specified column for this to work.
+
+Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
+
 ## Configuration
 
 You can override the following default options:
data/lib/fast_count/adapters/mysql_adapter.rb
CHANGED
@@ -21,6 +21,28 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+
+      # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
+      # so we can just directly run the query.
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
+
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree && Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
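The `index_exists?` guard in the adapters accepts an index only when it is a B-tree whose first column is the requested one; the PostgreSQL adapter additionally rejects partial indexes. A minimal sketch of that rule, using plain hashes in place of ActiveRecord's index definition objects (the helper name `leading_index?` and the sample data are assumptions for illustration, not part of the gem):

```ruby
# True when some full (non-partial) B-tree index starts with `column_name`.
# Each index is a hash standing in for an ActiveRecord index definition.
def leading_index?(indexes, column_name)
  indexes.any? do |index|
    index[:using] == :btree &&
      index[:where].nil? &&
      Array(index[:columns]).first == column_name.to_s
  end
end

indexes = [
  { columns: ["company_id", "created_at"], using: :btree, where: nil },
  { columns: ["email"], using: :btree, where: "deleted_at IS NULL" },
  { columns: ["created_at", "role"], using: :btree, where: nil }
]

puts leading_index?(indexes, :company_id) # composite index leads with company_id
puts leading_index?(indexes, :role)       # role is only a trailing column
puts leading_index?(indexes, :email)      # only a partial index covers email
```

Only the first call prints `true`: a trailing column or a partial index cannot be walked value by value over the whole table.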
data/lib/fast_count/adapters/postgresql_adapter.rb
CHANGED
@@ -6,9 +6,23 @@ module FastCount
     class PostgresqlAdapter < BaseAdapter
       def install
         @connection.execute(<<~SQL)
-          CREATE FUNCTION fast_count(
-          DECLARE
+          CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
+          DECLARE
+            count bigint;
+            table_parts text[];
+            schema_name text;
+            table_name text;
           BEGIN
+            SELECT PARSE_IDENT(identifier) INTO table_parts;
+
+            IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
+              schema_name := ''''|| table_parts[1] ||'''';
+              table_name := ''''|| table_parts[2] ||'''';
+            ELSE
+              schema_name := 'ANY (current_schemas(false))';
+              table_name := ''''|| table_parts[1] ||'''';
+            END IF;
+
             EXECUTE '
               WITH tables_counts AS (
                 -- inherited and partitioned tables counts
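The new function body relies on PostgreSQL's `PARSE_IDENT` to split an optionally schema-qualified name before choosing the schema filter. As a rough Ruby illustration of that split (the helper names `split_identifier` and `schema_and_table` are hypothetical, and this sketch handles only plain and double-quoted identifiers, not every case `PARSE_IDENT` covers):

```ruby
# Splits "schema.table" into its parts, roughly like PARSE_IDENT:
# unquoted parts are lowercased, double-quoted parts keep their case.
def split_identifier(identifier)
  identifier.scan(/"((?:[^"]|"")*)"|([^."]+)/).map do |quoted, bare|
    quoted ? quoted.gsub('""', '"') : bare.downcase
  end
end

# Returns [schema, table]; schema is nil when the name is unqualified,
# which corresponds to the `ANY (current_schemas(false))` branch.
def schema_and_table(identifier)
  parts = split_identifier(identifier)
  parts.length == 2 ? parts : [nil, parts.first]
end

p schema_and_table("analytics.Events")
p schema_and_table('"My Schema".users')
p schema_and_table("users")
```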
@@ -17,22 +31,26 @@ module FastCount
                   (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
                 FROM pg_inherits
                 INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
-
-
+                LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
+                INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
+                WHERE n.nspname = '|| schema_name ||' AND
+                  parent.relname = '|| table_name ||'
 
                 UNION ALL
 
                 -- table count
                 SELECT
                   (reltuples::float / greatest(relpages, 1)) *
-                  (pg_relation_size(
-                FROM pg_class
-
+                  (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
+                FROM pg_class c
+                LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+                WHERE n.nspname = '|| schema_name ||' AND
+                  c.relname = '|| table_name ||'
               )
 
               SELECT
                 CASE
-                WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM
+                WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
                 ELSE SUM(estimate)
                 END AS count
               FROM tables_counts' INTO count;
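The "table count" branch scales the rows-per-page density recorded in `pg_class` by the number of pages the relation occupies now, and the final `CASE` falls back to an exact `COUNT(*)` below the threshold. The arithmetic can be sketched in Ruby as follows; the method names, the sample numbers, and the 8192-byte default block size are assumptions for illustration:

```ruby
# Estimated rows = average rows per page (reltuples / relpages)
# multiplied by the number of pages the table occupies right now.
def estimated_rows(reltuples:, relpages:, relation_size_bytes:, block_size: 8192)
  rows_per_page = reltuples.to_f / [relpages, 1].max
  current_pages = (relation_size_bytes.to_f / block_size).round
  (rows_per_page * current_pages).round
end

# Below the threshold the SQL runs a real COUNT(*); the block stands in
# for that exact count here.
def fast_count_sketch(estimate, threshold)
  estimate < threshold ? yield : estimate
end

estimate = estimated_rows(reltuples: 1_000_000, relpages: 10_000,
                          relation_size_bytes: 12_000 * 8192)
puts estimate
puts fast_count_sketch(estimate, 100_000) { raise "exact count not needed" }
puts fast_count_sketch(500, 100_000) { 512 }
```

With these inputs the table has grown from 10,000 to 12,000 pages at 100 rows per page, so the estimate is 1,200,000; the last call is under the threshold and returns the block's value instead.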
@@ -56,6 +74,43 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+
+        table = @connection.quote_table_name(table_name)
+        column = @connection.quote_column_name(column_name)
+
+        @connection.select_value(<<~SQL)
+          WITH RECURSIVE t AS (
+            (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
+            UNION
+            SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
+            FROM t
+            WHERE t.#{column} IS NOT NULL
+          ),
+
+          distinct_values AS (
+            SELECT #{column} FROM t WHERE #{column} IS NOT NULL
+            UNION
+            SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
+          )
+
+          SELECT COUNT(*) FROM distinct_values
+        SQL
+      end
+
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree &&
+              index.where.nil? &&
+              Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
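The recursive CTE emulates a loose index scan: it repeatedly seeks to the smallest value greater than the previous one, so the work grows with the number of distinct values rather than the number of rows. A small Ruby simulation of the same walk, with a sorted array standing in for the B-tree index (the method name and sample data are assumptions for illustration):

```ruby
# Simulates a loose index scan over a sorted array standing in for a
# B-tree index: each step seeks straight to the first entry greater than
# the current value instead of visiting every row.
def loose_distinct_count(sorted_values, has_null: false)
  count = 0
  current = sorted_values.first
  until current.nil?
    count += 1
    # Binary search for the next larger value, like the
    # `WHERE column > t.column ORDER BY column LIMIT 1` subquery.
    current = sorted_values.bsearch { |value| value > current }
  end
  # NULL is counted as one extra distinct value when any row has it,
  # mirroring the `distinct_values` CTE.
  has_null ? count + 1 : count
end

company_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]
puts loose_distinct_count(company_ids)
puts loose_distinct_count(company_ids, has_null: true)
```

The two calls print 4 and 5. Each distinct value costs one seek, which is why the approach pays off when distinct values are few relative to total rows.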
data/lib/fast_count/adapters/sqlite_adapter.rb
CHANGED
@@ -15,6 +15,14 @@ module FastCount
       def estimated_count(sql)
         @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
       end
+
+      def fast_distinct_count(table_name, column_name)
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
     end
   end
 end
data/lib/fast_count/extensions.rb
CHANGED
@@ -3,6 +3,9 @@
 module FastCount
   module Extensions
     module ModelExtension
+      # Returns an estimated number of rows in the table.
+      # Runs in milliseconds.
+      #
       # @example
       #   User.fast_count
       #   User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
         adapter = Adapters.for_connection(connection)
         adapter.fast_count(table_name, threshold)
       end
+
+      # Returns an exact number of distinct values in a column.
+      # It is suited for cases, when there is a small amount
+      # of distinct values in a column compared to a total number
+      # of values (for example, 10M rows total and 500 distinct values).
+      #
+      # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
+      #
+      # Note: You need to have an index starting with the specified column
+      # for this to work.
+      #
+      # Uses an "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
+      #
+      # @example
+      #   User.fast_distinct_count(column: :company_id)
+      #
+      def fast_distinct_count(column:)
+        adapter = Adapters.for_connection(connection)
+        adapter.fast_distinct_count(table_name, column)
+      end
     end
 
     module RelationExtension
+      # Returns an estimated number of rows that the query will return
+      # (without actually executing it).
+      #
       # @example
       #   User.where.missing(:avatar).estimated_count
       #
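The MySQL and PostgreSQL adapters in this diff obtain that estimate by running `EXPLAIN` and extracting the first `rows=` figure with `query_plan.match(/rows=(\d+)/)[1].to_i`. A minimal sketch of the parsing step; the helper name `rows_from_plan`, the nil-safe handling, and the sample plan line are assumptions for illustration:

```ruby
# Pulls the planner's row estimate out of the plan text.
# Returns nil instead of raising when no estimate is present.
def rows_from_plan(query_plan)
  match = query_plan.match(/rows=(\d+)/)
  match && match[1].to_i
end

# Sample plan text in the shape PostgreSQL's EXPLAIN prints (made-up numbers).
plan = "Seq Scan on users  (cost=0.00..18334.00 rows=324200 width=8)"
puts rows_from_plan(plan)
puts rows_from_plan("no estimate here").inspect
```

Because the number comes from planner statistics rather than from executing the query, it is only as accurate as the planner's estimate.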
data/lib/fast_count/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fast_count
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - fatkodima
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-
+date: 2023-07-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.4.
+rubygems_version: 3.4.6
 signing_key:
 specification_version: 4
 summary: Quickly get a count estimation for large tables.