fast_count 0.1.0 → 0.2.0

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -0
data/README.md +34 -7
data/lib/fast_count/adapters/mysql_adapter.rb +22 -0
data/lib/fast_count/adapters/postgresql_adapter.rb +63 -8
data/lib/fast_count/adapters/sqlite_adapter.rb +8 -0
data/lib/fast_count/extensions.rb +26 -0
data/lib/fast_count/version.rb +1 -1
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3432010a9c6d6f616b7341df1dffbf3ea02b9fb22e846ab59c6562609ff109ff
-  data.tar.gz: 42b6a79b370de8a2c919bdc9744bee718436b7ecba75adf6c95cb82a3bafbe69
+  metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
+  data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
 SHA512:
-  metadata.gz: 0b61215c4ce6eb05626baba644ff34ba11e4c2fe08deec601f5313bdc4c6805594a4ee9dc1637345fdc1891d9bb17e12a251bfe57a17c96f807913ecfb64c83b
-  data.tar.gz: 9172798b85f35b0b82b1e91917feba59fc317697cea6786cf94549e23aa2625b284c482f7c7d7411ba21111f83c59c9af976d2afe10d4c7fdff0fd3cc5966f3c
+  metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
+  data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75

data/CHANGELOG.md CHANGED

@@ -1,5 +1,20 @@
 ## master (unreleased)
+## 0.2.0 (2023-07-24)
+- Support for quickly getting an exact number of distinct values in a column
+    It is suited for cases when there is a small number of distinct values in a column compared to the total number
+    of values (for example, 10M rows total and 200 distinct values).
+    Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+    Example:
+    ```ruby
+    User.fast_distinct_count(column: :company_id)
+    ```
+- Support PostgreSQL schemas for `#fast_count`
 ## 0.1.0 (2023-04-26)
 - First release
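
The schema support listed above depends on splitting an identifier such as `analytics.events` into schema and table parts, which the new SQL function delegates to PostgreSQL's `PARSE_IDENT`. The following is a minimal Ruby sketch of that branching, assuming unquoted identifiers only. The helper name `split_identifier` is hypothetical and is not part of the gem.

```ruby
# Hypothetical helper mirroring the two branches of the SQL function:
# a two-part identifier yields [schema, table]; a bare name yields [nil, table],
# where nil stands for "resolve the table through the current schemas".
# Simplifications (assumptions): quoted identifiers such as "my.schema"."t" are
# not handled, and names are folded to lower case as PostgreSQL does for
# unquoted identifiers.
def split_identifier(identifier)
  parts = identifier.to_s.downcase.split(".")
  case parts.length
  when 1 then [nil, parts[0]]
  when 2 then parts
  else raise ArgumentError, "unsupported identifier: #{identifier.inspect}"
  end
end

split_identifier("users")            # => [nil, "users"]
split_identifier("analytics.events") # => ["analytics", "events"]
```

In the gem itself this parsing happens inside the database, so the sketch is only meant to show which names end up in the `nspname` and `relname` filters.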

data/README.md CHANGED

@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
 | SQL | Result | Accuracy | Time |
 | --- | --- | --- | --- |
-| `SELECT count(*) FROM small_table;` | `2037104` | `100.000%` | `4.900s` |
-| `SELECT fast_count('small_table');` | `2036407` | `99.965%` | `0.050s` |
-| `SELECT count(*) FROM medium_table;` | `81716243` | `100.000%` | `257.5s` |
-| `SELECT fast_count('medium_table');` | `81600513` | `99.858%` | `0.048s` |
-| `SELECT count(*) FROM large_table;` | `455270802` | `100.000%` | `310.6s` |
-| `SELECT fast_count('large_table');` | `454448393` | `99.819%` | `0.046s` |
+| `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
+| `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
+| `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
+| `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
+| `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
+| `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
 *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
@@ -51,6 +51,12 @@ $ gem install fast_count
 If you are using PostgreSQL, you need to create a database function, used internally:
+```sh
+$ rails generate migration install_fast_count
+```
+with the content:
 ```ruby
 class InstallFastCount < ActiveRecord::Migration[7.0]
   def up
@@ -65,12 +71,16 @@ end
 ## Usage
-To get an estimated count of the rows in a table:
+### Estimated table count
+To quickly get an estimated count of the rows in a table:
 ```ruby
 User.fast_count # => 1_254_312_219
 ```
+### Result set size estimation
 If you want to quickly estimate how many rows a query will return, without actually executing it, you can run:
 ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
 **Note**: `estimated_count` relies on the database query planner estimations (basically on the output of `EXPLAIN`) to get its results and can be very imprecise. It is best used to get an idea of the order of magnitude of the result.
+### Exact distinct values count
+To quickly get an exact number of distinct values in a column, you can run:
+```ruby
+User.fast_distinct_count(column: :company_id) # => 243
+```
+It is suited for cases when there is a small number of distinct values in a column compared to the total number
+of values (for example, 10M rows total and 200 distinct values).
+Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+**Note**: You need to have an index starting with the specified column for this to work.
+Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
 ## Configuration
 You can override the following default options:

data/lib/fast_count/adapters/mysql_adapter.rb CHANGED

@@ -21,6 +21,28 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+      # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
+      # so we can just directly run the query.
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree && Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
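
The `index_exists?` guard above only accepts a B-tree index whose first column is the requested one. The sketch below replays that check outside ActiveRecord, using a `Struct` as a stand-in for the index definition objects (an assumption: only the attributes the adapters read are modelled). It also includes the `where.nil?` condition that the PostgreSQL adapter adds to reject partial indexes; the MySQL version shown above omits it. The name `leading_index?` is hypothetical.

```ruby
# Stand-in for ActiveRecord's index definition objects (illustrative only).
Index = Struct.new(:columns, :using, :where, keyword_init: true)

# True when some non-partial B-tree index has `column_name` as its first column.
def leading_index?(indexes, column_name)
  indexes.any? do |index|
    index.using == :btree &&
      index.where.nil? &&
      Array(index.columns).first == column_name.to_s
  end
end

indexes = [
  Index.new(columns: ["company_id", "created_at"], using: :btree),
  Index.new(columns: ["email"], using: :btree, where: "deleted_at IS NULL"),
  Index.new(columns: ["created_at", "role"], using: :btree)
]

leading_index?(indexes, :company_id) # => true  (first column of a composite index)
leading_index?(indexes, :role)       # => false (not the leading column)
leading_index?(indexes, :email)      # => false (only a partial index covers it)
```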

data/lib/fast_count/adapters/postgresql_adapter.rb CHANGED

@@ -6,9 +6,23 @@ module FastCount
     class PostgresqlAdapter < BaseAdapter
       def install
         @connection.execute(<<~SQL)
-          CREATE FUNCTION fast_count(table_name text, threshold bigint) RETURNS bigint AS $$
-          DECLARE count bigint;
+          CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
+          DECLARE
+            count bigint;
+            table_parts text[];
+            schema_name text;
+            table_name text;
             BEGIN
+              SELECT PARSE_IDENT(identifier) INTO table_parts;
+              IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
+                schema_name := ''''|| table_parts[1] ||'''';
+                table_name := ''''|| table_parts[2] ||'''';
+              ELSE
+                schema_name := 'ANY (current_schemas(false))';
+                table_name := ''''|| table_parts[1] ||'''';
+              END IF;
               EXECUTE '
                 WITH tables_counts AS (
                   -- inherited and partitioned tables counts
@@ -17,22 +31,26 @@ module FastCount
                       (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
                   FROM pg_inherits
                     INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
-                    INNER JOIN pg_class child  ON pg_inherits.inhrelid  = child.oid
-                  WHERE parent.relname = ''' || table_name || '''
+                    LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
+                    INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
+                  WHERE n.nspname = '|| schema_name ||' AND
+                    parent.relname = '|| table_name ||'
                   UNION ALL
                   -- table count
                   SELECT
                     (reltuples::float / greatest(relpages, 1)) *
-                      (pg_relation_size(pg_class.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
-                  FROM pg_class
-                  WHERE relname = '''|| table_name ||'''
+                      (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
+                  FROM pg_class c
+                    LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+                  WHERE n.nspname = '|| schema_name ||' AND
+                    c.relname = '|| table_name ||'
                 )
                 SELECT
                   CASE
-                  WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM "'|| table_name ||'")
+                  WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
                   ELSE SUM(estimate)
                   END AS count
                 FROM tables_counts' INTO count;
@@ -56,6 +74,43 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+        table = @connection.quote_table_name(table_name)
+        column = @connection.quote_column_name(column_name)
+        @connection.select_value(<<~SQL)
+          WITH RECURSIVE t AS (
+            (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
+            UNION
+            SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
+            FROM t
+            WHERE t.#{column} IS NOT NULL
+          ),
+          distinct_values AS (
+            SELECT #{column} FROM t WHERE #{column} IS NOT NULL
+            UNION
+            SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
+          )
+          SELECT COUNT(*) FROM distinct_values
+        SQL
+      end
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree &&
+              index.where.nil? &&
+              Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
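
The recursive CTE in `fast_distinct_count` repeatedly jumps to the next value greater than the current one instead of reading every row. A rough in-memory analogy, assuming a sorted Ruby array stands in for the B-tree index and a binary search stands in for one index seek (the method name `loose_distinct_count` is hypothetical):

```ruby
# Simulates a loose index scan over a sorted array. Each iteration performs
# one "seek" to the first value strictly greater than the previous one, so the
# number of seeks grows with the number of distinct values, not with the
# number of rows.
def loose_distinct_count(values)
  non_null = values.compact # NULLs are handled separately, as in the CTE
  count = 0
  current = non_null.first
  until current.nil?
    count += 1
    # Find-minimum binary search: first element greater than `current`, or nil.
    current = non_null.bsearch { |value| value > current }
  end
  # The CTE adds one row for NULL when any NULL exists in the column.
  # (A linear scan here; the database would answer this from the index.)
  count += 1 if values.include?(nil)
  count
end

loose_distinct_count([nil, 1, 1, 1, 2, 2, 7, 7, 7, 7]) # => 4
```

With three distinct non-NULL values the loop runs three times regardless of how many duplicates the array holds, which is why the approach pays off when distinct values are few relative to the row count.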

data/lib/fast_count/adapters/sqlite_adapter.rb CHANGED

@@ -15,6 +15,14 @@ module FastCount
       def estimated_count(sql)
         @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
       end
+      def fast_distinct_count(table_name, column_name)
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
     end
   end
 end

data/lib/fast_count/extensions.rb CHANGED

@@ -3,6 +3,9 @@
 module FastCount
   module Extensions
     module ModelExtension
+      # Returns an estimated number of rows in the table.
+      # Runs in milliseconds.
+      #
       # @example
       #   User.fast_count
       #   User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
         adapter = Adapters.for_connection(connection)
         adapter.fast_count(table_name, threshold)
       end
+      # Returns an exact number of distinct values in a column.
+      # It is suited for cases when there is a small number
+      # of distinct values in a column compared to the total number
+      # of values (for example, 10M rows total and 500 distinct values).
+      #
+      # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
+      #
+      # Note: You need to have an index starting with the specified column
+      # for this to work.
+      #
+      # Uses a "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
+      #
+      # @example
+      #   User.fast_distinct_count(column: :company_id)
+      #
+      def fast_distinct_count(column:)
+        adapter = Adapters.for_connection(connection)
+        adapter.fast_distinct_count(table_name, column)
+      end
     end
     module RelationExtension
+      # Returns an estimated number of rows that the query will return
+      # (without actually executing it).
+      #
       # @example
       #   User.where.missing(:avatar).estimated_count
       #
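
For PostgreSQL, the estimate behind `fast_count` comes from the SQL function in the adapter diff: the row density recorded in `pg_class` is multiplied by the number of pages the relation currently occupies, and an exact `COUNT(*)` is used when the estimate falls below the threshold. The arithmetic can be sketched in plain Ruby as follows. All method and parameter names are illustrative, and the 8192-byte block size is an assumed default rather than something the gem hardcodes.

```ruby
# Row estimate = rows per page as last recorded in pg_class * pages on disk now.
# In the database the inputs would be reltuples, relpages, pg_relation_size
# and the block_size setting.
def estimate_rows(reltuples:, relpages:, relation_size_bytes:, block_size: 8192)
  rows_per_page = reltuples.to_f / [relpages, 1].max # guard against relpages = 0
  # Relation sizes are whole multiples of the block size, so this is exact.
  current_pages = (relation_size_bytes.to_f / block_size).round
  (rows_per_page * current_pages).round
end

# Small tables fall back to an exact count, supplied here as a block.
def count_with_threshold(estimate, threshold)
  estimate < threshold ? yield : estimate
end

estimate = estimate_rows(
  reltuples: 1_000_000,
  relpages: 10_000,
  relation_size_bytes: 12_000 * 8192 # the table has grown since the stats were taken
)
estimate                                       # => 1200000
count_with_threshold(estimate, 50_000) { 0 }   # => 1200000 (block not called)
count_with_threshold(120, 50_000) { 118 }      # => 118 (exact count wins)
```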

data/lib/fast_count/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module FastCount
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fast_count
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - fatkodima
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-04-26 00:00:00.000000000 Z
+date: 2023-07-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.4.12
+rubygems_version: 3.4.6
 signing_key:
 specification_version: 4
 summary: Quickly get a count estimation for large tables.