RubyGems - fast_count - Versions diffs - 0.1.0 → 0.2.0 - Mend

fast_count 0.1.0 → 0.2.0

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -0
data/README.md +34 -7
data/lib/fast_count/adapters/mysql_adapter.rb +22 -0
data/lib/fast_count/adapters/postgresql_adapter.rb +63 -8
data/lib/fast_count/adapters/sqlite_adapter.rb +8 -0
data/lib/fast_count/extensions.rb +26 -0
data/lib/fast_count/version.rb +1 -1
metadata +3 -3

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3432010a9c6d6f616b7341df1dffbf3ea02b9fb22e846ab59c6562609ff109ff
-  data.tar.gz: 42b6a79b370de8a2c919bdc9744bee718436b7ecba75adf6c95cb82a3bafbe69
+  metadata.gz: 25bb1bbb54bacd1439f8a649e5927fa8968e9cb9c5076d56d3da8a6bdb74b086
+  data.tar.gz: d565c1a47ca831b9013b99a6a0feea5dacb569380686721b5c62b6c12f6cbac3
 SHA512:
-  metadata.gz: 0b61215c4ce6eb05626baba644ff34ba11e4c2fe08deec601f5313bdc4c6805594a4ee9dc1637345fdc1891d9bb17e12a251bfe57a17c96f807913ecfb64c83b
-  data.tar.gz: 9172798b85f35b0b82b1e91917feba59fc317697cea6786cf94549e23aa2625b284c482f7c7d7411ba21111f83c59c9af976d2afe10d4c7fdff0fd3cc5966f3c
+  metadata.gz: 34189623d034c4a8433ced800bd9c3f83d9f3f2925ec5d2d646c783e1921e68640e23de1636cecabf49d7d5f7ffa9d42d813e500e1ba8899d5cc913e24042087
+  data.tar.gz: 8fbd2df556dc2d09f56fa55fdd5df0693861ff9daa2e651aca6ec35591e77af25744e4e4bb0da3cdb5cd33e1a7247875528f2b725d458cd599d3df67cd485a75

data/CHANGELOG.md CHANGED

@@ -1,5 +1,20 @@
 ## master (unreleased)
+## 0.2.0 (2023-07-24)
+- Support for quickly getting an exact number of distinct values in a column
+    It is suited for cases when there is a small number of distinct values in a column compared to the total number
+    of values (for example, 10M rows total and 200 distinct values).
+    Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+    Example:
+    ```ruby
+    User.fast_distinct_count(column: :company_id)
+    ```
+- Support PostgreSQL schemas for `#fast_count`
 ## 0.1.0 (2023-04-26)
 - First release

data/README.md CHANGED

@@ -8,12 +8,12 @@ Luckily, there are [some tricks](https://www.citusdata.com/blog/2016/10/12/count
 | SQL | Result | Accuracy | Time |
 | --- | --- | --- | --- |
-| `SELECT count(*) FROM small_table;` | `2037104` | `100.000%` | `4.900s` |
-| `SELECT fast_count('small_table');` | `2036407` | `99.965%` | `0.050s` |
-| `SELECT count(*) FROM medium_table;` | `81716243` | `100.000%` | `257.5s` |
-| `SELECT fast_count('medium_table');` | `81600513` | `99.858%` | `0.048s` |
-| `SELECT count(*) FROM large_table;` | `455270802` | `100.000%` | `310.6s` |
-| `SELECT fast_count('large_table');` | `454448393` | `99.819%` | `0.046s` |
+| `SELECT count(*) FROM small_table` | `2037104` | `100.000%` | `4.900s` |
+| `SELECT fast_count('small_table')` | `2036407` | `99.965%` | `0.050s` |
+| `SELECT count(*) FROM medium_table` | `81716243` | `100.000%` | `257.5s` |
+| `SELECT fast_count('medium_table')` | `81600513` | `99.858%` | `0.048s` |
+| `SELECT count(*) FROM large_table` | `455270802` | `100.000%` | `310.6s` |
+| `SELECT fast_count('large_table')` | `454448393` | `99.819%` | `0.046s` |
 *These metrics were pulled from real PostgreSQL databases being used in a production environment.*
@@ -51,6 +51,12 @@ $ gem install fast_count
 If you are using PostgreSQL, you need to create a database function, used internally:
+```sh
+$ rails generate migration install_fast_count
+```
+with the content:
 ```ruby
 class InstallFastCount < ActiveRecord::Migration[7.0]
   def up
@@ -65,12 +71,16 @@ end
 ## Usage
-To get an estimated count of the rows in a table:
+### Estimated table count
+To quickly get an estimated count of the rows in a table:
 ```ruby
 User.fast_count # => 1_254_312_219
 ```
+### Result set size estimation
 If you want to quickly get an estimate of how many rows a query will return, without actually executing it, you can run:
 ```ruby
@@ -79,6 +89,23 @@ User.where.missing(:avatar).estimated_count # => 324_200
 **Note**: `estimated_count` relies on the database query planner's estimates (basically the output of `EXPLAIN`) and can be very imprecise. It is best used to get an idea of the order of magnitude of the result.
+### Exact distinct values count
+To quickly get an exact number of distinct values in a column, you can run:
+```ruby
+User.fast_distinct_count(column: :company_id) # => 243
+```
+It is suited for cases when there is a small amount of distinct values in a column compared to a total number
+of values (for example, 10M rows total and 200 distinct values).
+Runs orders of magnitude faster than `SELECT COUNT(DISTINCT column) FROM table`.
+**Note**: You need to have an index starting with the specified column for this to work.
+Uses a ["Loose Index Scan" technique](https://wiki.postgresql.org/wiki/Loose_indexscan).
 ## Configuration
 You can override the following default options:

data/lib/fast_count/adapters/mysql_adapter.rb CHANGED

@@ -21,6 +21,28 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN format=tree #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+      # MySQL already supports "Loose Index Scan" (see https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html),
+      # so we can just directly run the query.
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree && Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
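
Both the MySQL and PostgreSQL adapters in this diff guard `fast_distinct_count` with a check that some B-tree index starts with the requested column. The following is a minimal standalone sketch of that check, not the gem's code: `IndexInfo` is a hypothetical stand-in for the index definitions returned by the connection's schema cache, and `leading_btree_index?` with its `allow_partial:` flag are illustrative names introduced here.

```ruby
# Hypothetical stand-in for an index definition (not the gem's or ActiveRecord's class).
IndexInfo = Struct.new(:columns, :using, :where, keyword_init: true)

# Returns true when some B-tree index has `column_name` as its first column.
# Partial indexes (those with a WHERE clause) are skipped unless allow_partial
# is set, because they do not cover every row of the table.
def leading_btree_index?(indexes, column_name, allow_partial: false)
  indexes.any? do |index|
    index.using == :btree &&
      (allow_partial || index.where.nil?) &&
      Array(index.columns).first == column_name.to_s
  end
end

indexes = [
  IndexInfo.new(columns: ["company_id", "created_at"], using: :btree, where: nil),
  IndexInfo.new(columns: ["email"], using: :btree, where: "deleted_at IS NULL"),
  IndexInfo.new(columns: ["created_at", "email"], using: :btree, where: nil)
]

puts leading_btree_index?(indexes, :company_id) # a composite index leads with company_id
puts leading_btree_index?(indexes, :email)      # only a partial index leads with email
```

In the diff itself, the MySQL version omits the `where` condition while the PostgreSQL version includes `index.where.nil?`; the `allow_partial:` flag in the sketch only exists to show both behaviours side by side.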

data/lib/fast_count/adapters/postgresql_adapter.rb CHANGED

@@ -6,9 +6,23 @@ module FastCount
     class PostgresqlAdapter < BaseAdapter
       def install
         @connection.execute(<<~SQL)
-          CREATE FUNCTION fast_count(table_name text, threshold bigint) RETURNS bigint AS $$
-          DECLARE count bigint;
+          CREATE FUNCTION fast_count(identifier text, threshold bigint) RETURNS bigint AS $$
+          DECLARE
+            count bigint;
+            table_parts text[];
+            schema_name text;
+            table_name text;
             BEGIN
+              SELECT PARSE_IDENT(identifier) INTO table_parts;
+              IF ARRAY_LENGTH(table_parts, 1) = 2 THEN
+                schema_name := ''''|| table_parts[1] ||'''';
+                table_name := ''''|| table_parts[2] ||'''';
+              ELSE
+                schema_name := 'ANY (current_schemas(false))';
+                table_name := ''''|| table_parts[1] ||'''';
+              END IF;
               EXECUTE '
                 WITH tables_counts AS (
                   -- inherited and partitioned tables counts
@@ -17,22 +31,26 @@ module FastCount
                       (SUM(pg_relation_size(child.oid))::float / (current_setting(''block_size'')::float))::integer AS estimate
                   FROM pg_inherits
                     INNER JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
-                    INNER JOIN pg_class child  ON pg_inherits.inhrelid  = child.oid
-                  WHERE parent.relname = ''' || table_name || '''
+                    LEFT JOIN pg_namespace n ON n.oid = parent.relnamespace
+                    INNER JOIN pg_class child ON pg_inherits.inhrelid = child.oid
+                  WHERE n.nspname = '|| schema_name ||' AND
+                    parent.relname = '|| table_name ||'
                   UNION ALL
                   -- table count
                   SELECT
                     (reltuples::float / greatest(relpages, 1)) *
-                      (pg_relation_size(pg_class.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
-                  FROM pg_class
-                  WHERE relname = '''|| table_name ||'''
+                      (pg_relation_size(c.oid)::float / (current_setting(''block_size'')::float))::integer AS estimate
+                  FROM pg_class c
+                    LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
+                  WHERE n.nspname = '|| schema_name ||' AND
+                    c.relname = '|| table_name ||'
                 )
                 SELECT
                   CASE
-                  WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM "'|| table_name ||'")
+                  WHEN SUM(estimate) < '|| threshold ||' THEN (SELECT COUNT(*) FROM '|| identifier ||')
                   ELSE SUM(estimate)
                   END AS count
                 FROM tables_counts' INTO count;
@@ -56,6 +74,43 @@ module FastCount
         query_plan = @connection.select_value("EXPLAIN #{sql}")
         query_plan.match(/rows=(\d+)/)[1].to_i
       end
+      def fast_distinct_count(table_name, column_name)
+        unless index_exists?(table_name, column_name)
+          raise "Index starting with '#{column_name}' must exist on '#{table_name}' table"
+        end
+        table = @connection.quote_table_name(table_name)
+        column = @connection.quote_column_name(column_name)
+        @connection.select_value(<<~SQL)
+          WITH RECURSIVE t AS (
+            (SELECT #{column} FROM #{table} ORDER BY #{column} LIMIT 1)
+            UNION
+            SELECT (SELECT #{column} FROM #{table} WHERE #{column} > t.#{column} ORDER BY #{column} LIMIT 1)
+            FROM t
+            WHERE t.#{column} IS NOT NULL
+          ),
+          distinct_values AS (
+            SELECT #{column} FROM t WHERE #{column} IS NOT NULL
+            UNION
+            SELECT NULL WHERE EXISTS (SELECT 1 FROM #{table} WHERE #{column} IS NULL)
+          )
+          SELECT COUNT(*) FROM distinct_values
+        SQL
+      end
+      private
+        def index_exists?(table_name, column_name)
+          indexes = @connection.schema_cache.indexes(table_name)
+          indexes.find do |index|
+            index.using == :btree &&
+              index.where.nil? &&
+              Array(index.columns).first == column_name.to_s
+          end
+        end
     end
   end
 end
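
The recursive CTE in `fast_distinct_count` above walks the index one distinct value at a time instead of reading every row. The same idea can be sketched in plain Ruby, under the assumption that a sorted array stands in for the B-tree index and a binary search stands in for each index probe. `loose_distinct_count` and its `has_null:` flag are illustrative names, not part of the gem.

```ruby
# Sketch of a "loose index scan" over a sorted array of non-NULL values.
# The sorted array is a hypothetical stand-in for a B-tree index.
def loose_distinct_count(sorted_values, has_null: false)
  count = 0
  position = 0 # start at the smallest value, like ORDER BY column LIMIT 1

  while position && position < sorted_values.length
    current = sorted_values[position]
    count += 1
    # Jump straight to the first entry greater than the current value,
    # mirroring "WHERE column > t.column ORDER BY column LIMIT 1".
    # bsearch_index returns nil when no such entry exists, which ends the loop.
    position = sorted_values.bsearch_index { |value| value > current }
  end

  # The SQL adds one more distinct value when any NULL row exists.
  count + (has_null ? 1 : 0)
end

company_ids = [1, 1, 1, 2, 2, 5, 5, 5, 5, 9]
puts loose_distinct_count(company_ids)                 # counts the values 1, 2, 5, 9
puts loose_distinct_count(company_ids, has_null: true) # NULL counted as one more value
```

Each step in the sketch costs one binary search, so the work grows with the number of distinct values rather than the number of rows, which is why the technique pays off only when distinct values are few relative to the table size.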

data/lib/fast_count/adapters/sqlite_adapter.rb CHANGED

@@ -15,6 +15,14 @@ module FastCount
       def estimated_count(sql)
         @connection.select_value("SELECT COUNT(*) FROM (#{sql})")
       end
+      def fast_distinct_count(table_name, column_name)
+        @connection.select_value(<<~SQL)
+          SELECT COUNT(*) FROM (
+            SELECT DISTINCT #{@connection.quote_column_name(column_name)} FROM #{@connection.quote_table_name(table_name)}
+          ) AS tmp
+        SQL
+      end
     end
   end
 end

data/lib/fast_count/extensions.rb CHANGED

@@ -3,6 +3,9 @@
 module FastCount
   module Extensions
     module ModelExtension
+      # Returns an estimated number of rows in the table.
+      # Runs in milliseconds.
+      #
       # @example
       #   User.fast_count
       #   User.fast_count(threshold: 50_000)
@@ -11,9 +14,32 @@ module FastCount
         adapter = Adapters.for_connection(connection)
         adapter.fast_count(table_name, threshold)
       end
+      # Returns an exact number of distinct values in a column.
+      # It is suited for cases when there is a small number
+      # of distinct values in a column compared to a total number
+      # of values (for example, 10M rows total and 500 distinct values).
+      #
+      # Runs orders of magnitude faster than 'SELECT COUNT(DISTINCT column) ...'.
+      #
+      # Note: You need to have an index starting with the specified column
+      # for this to work.
+      #
+      # Uses a "Loose Index Scan" technique (see https://wiki.postgresql.org/wiki/Loose_indexscan).
+      #
+      # @example
+      #   User.fast_distinct_count(column: :company_id)
+      #
+      def fast_distinct_count(column:)
+        adapter = Adapters.for_connection(connection)
+        adapter.fast_distinct_count(table_name, column)
+      end
     end
     module RelationExtension
+      # Returns an estimated number of rows that the query will return
+      # (without actually executing it).
+      #
       # @example
       #   User.where.missing(:avatar).estimated_count
       #

data/lib/fast_count/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module FastCount
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: fast_count
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - fatkodima
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2023-04-26 00:00:00.000000000 Z
+date: 2023-07-24 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: activerecord
@@ -66,7 +66,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.4.12
+rubygems_version: 3.4.6
 signing_key:
 specification_version: 4
 summary: Quickly get a count estimation for large tables.