
rails-paradedb 0.6.0 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +8 -1
data/README.md +16 -306
data/lib/parade_db/arel/builder.rb +3 -4
data/lib/parade_db/index.rb +17 -114
data/lib/parade_db/migration_helpers.rb +15 -8
data/lib/parade_db/tokenizer.rb +95 -0
data/lib/parade_db/version.rb +1 -1
data/lib/parade_db.rb +1 -0
metadata +2 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 1bcba07642cf33fe747ed6f810374adee776ee014533e64f5b38d490335dd672
-  data.tar.gz: a470ac63132444ae93ee893eee869d6dac4641774fe965f6519592a7e5c50d35
+  metadata.gz: e28ebb1ec7734474a1ed321766b2e652f4aa1d56aadcc16d48667ac4ddbfc5bc
+  data.tar.gz: 51c25b12045eed272ad0e71475c22f9a670c4da6ff46fcad6d99ce32e3bc1f6e
 SHA512:
-  metadata.gz: aa7193be8a40ed3714b3d17cf1c1b53980b63245a70c878e6aaa164c272df072217eb9c71a808cc29c0dc2611056f7d323f3d6aa9e507b77648beea83a734b6e
-  data.tar.gz: 6870426a910699fd8b1b12f7a03402d8157138d2b2aa4c210339356a2ff60299b29301a1a5cc28f64656f4e80479719e0324270dd0f5a9f7668e09ee9654a262
+  metadata.gz: 4fced3ee50bef2ef4c59567246e9c5382a7e2e3f60293a785d33fe4e9e857abea685740fcf2c2da38c98dce4371d13010f9eacbbbe1c580df760208a20542f3d
+  data.tar.gz: 2e3efbf965bb99866dffe45de2963528ccf9ebfaa4a947abe5b72211306409d5b006eb1e90621cc6591e72f473d91593348b0d1bcaaa1e96cbc08cbe252e8615

data/CHANGELOG.md CHANGED

@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file. The format
 ## [Unreleased]
+## [0.7.0] - 2026-04-21
+### Changed
+- **BREAKING**: Use function based approach for specifying tokenizers: `Tokenizer.simple(options: {alias: "description_simple"})`
 ## [0.6.0] - 2026-04-14
 ### Added
@@ -126,7 +132,8 @@ All notable changes to this project will be documented in this file. The format
 - Schema dump/load round-trip for tokenizer configuration and index options
   (including `target_segment_count`)
-[Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.6.0...HEAD
+[Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.7.0...HEAD
+[0.7.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.7.0
 [0.6.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.6.0
 [0.5.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.5.0
 [0.4.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.4.0
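Note on the breaking change: the 0.7.0 changelog entry replaces hash-based tokenizer configuration with `Tokenizer` objects. The sketch below shows one way an existing 0.6.0-style config hash could be mapped onto the new shape. It is illustrative only: `upgrade_tokenizer_config` is a hypothetical helper that does not exist in rails-paradedb, the `Tokenizer` struct is a stand-in for the class added in `lib/parade_db/tokenizer.rb`, and only the `:tokenizer`, `:args`, `:named_args`, and `:alias` keys from the removed parser are handled.

```ruby
# Stand-in for the Tokenizer class added in 0.7.0; only its three readers
# (name, positional_args, options) are modelled here.
Tokenizer = Struct.new(:name, :positional_args, :options)

# Hypothetical helper, not part of rails-paradedb: converts a 0.6.0-style
# structured tokenizer hash into a 0.7.0-style Tokenizer object.
def upgrade_tokenizer_config(config)
  options = {}
  (config[:named_args] || {}).each { |key, value| options[key.to_sym] = value }
  options[:alias] = config[:alias].to_s if config[:alias]

  Tokenizer.new(
    config.fetch(:tokenizer).to_s,   # tokenizer name, e.g. "ngram"
    config[:args],                   # positional arguments, or nil
    options.empty? ? nil : options   # nil when no options were given
  )
end

# 0.6.0-style field config (shape taken from the removed parser code).
old_field = { tokenizer: :ngram, args: [2, 5], alias: "description_ngram" }

# 0.7.0-style field config: the :tokenizer value is now a Tokenizer object.
new_field = { tokenizer: upgrade_tokenizer_config(old_field) }
```

With the real gem, the equivalent hand-written form would presumably be `Tokenizer.ngram(2, 5, options: { alias: "description_ngram" })`, following the factory methods in the new `tokenizer.rb`.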

data/README.md CHANGED

@@ -16,31 +16,21 @@
   <a href="https://docs.paradedb.com/changelog/">Changelog</a>
 </h3>
----
-# rails-paradedb
-[![Gem Version](https://img.shields.io/gem/v/rails-paradedb)](https://rubygems.org/gems/rails-paradedb)
-[![Ruby Requirement](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Frubygems.org%2Fapi%2Fv1%2Fversions%2Frails-paradedb.json&query=%24%5B0%5D.ruby_version&label=ruby&logo=ruby)](https://rubygems.org/gems/rails-paradedb)
-[![Gem Downloads](https://img.shields.io/gem/dt/rails-paradedb)](https://rubygems.org/gems/rails-paradedb)
-[![Codecov](https://codecov.io/gh/paradedb/rails-paradedb/graph/badge.svg)](https://codecov.io/gh/paradedb/rails-paradedb)
-[![License](https://img.shields.io/github/license/paradedb/rails-paradedb?color=blue)](https://github.com/paradedb/rails-paradedb?tab=MIT-1-ov-file#readme)
-[![Slack URL](https://img.shields.io/badge/Join%20Slack-purple?logo=slack&link=https%3A%2F%2Fparadedb.com%2Fslack)](https://paradedb.com/slack)
-[![X URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb)](https://x.com/paradedb)
+<p align="center">
+  <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/v/rails-paradedb" alt="Gem Version"></a>&nbsp;
+  <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Frubygems.org%2Fapi%2Fv1%2Fversions%2Frails-paradedb.json&query=%24%5B0%5D.ruby_version&label=ruby&logo=ruby" alt="Ruby Requirement"></a>&nbsp;
+  <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/dt/rails-paradedb" alt="Gem Downloads"></a>&nbsp;
+  <a href="https://codecov.io/gh/paradedb/rails-paradedb"><img src="https://codecov.io/gh/paradedb/rails-paradedb/graph/badge.svg" alt="Codecov"></a>&nbsp;
+  <a href="https://github.com/paradedb/rails-paradedb?tab=MIT-1-ov-file#readme"><img src="https://img.shields.io/github/license/paradedb/rails-paradedb?color=blue" alt="License"></a>&nbsp;
+  <a href="https://paradedb.com/slack"><img src="https://img.shields.io/badge/Join%20Slack-purple?logo=slack" alt="Community"></a>&nbsp;
+  <a href="https://x.com/paradedb"><img src="https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb" alt="Follow @paradedb"></a>
+</p>
-The official Ruby client for [ParadeDB](https://paradedb.com), built for ActiveRecord.
-Use Elastic-quality full-text search, scoring, snippets, facets, and aggregations directly from Rails.
+---
-## Features
+## ParadeDB for Rails
-- BM25 index management in Rails migrations (`create_paradedb_index`, `remove_bm25_index`, `reindex_bm25`)
-- Chainable ActiveRecord search API (`matching_all`, `matching_any`, `term`, `phrase`, `regex`, `near`, `parse`, and more)
-- Relevance and highlighting (`with_score`, `with_snippet`, `with_snippets`, `with_snippet_positions`)
-- Facets and aggregations (`with_facets`, `facets`, `with_agg`, `facets_agg`, `aggregate_by`)
-- More Like This similarity search (`more_like_this`)
-- Arel integration for advanced query composition with native ParadeDB operators
-- Diagnostics helpers and rake tasks for index health and verification
-- Optional runtime index validation to detect missing/drifted BM25 indexes
+The official ActiveRecord integration for [ParadeDB](https://paradedb.com), including first-class support for managing BM25 indexes and running queries using the full ParadeDB API. Follow the [getting started guide](https://docs.paradedb.com/documentation/getting-started/environment#rails) to begin.
 ## Requirements & Compatibility
@@ -51,294 +41,14 @@ Use Elastic-quality full-text search, scoring, snippets, facets, and aggregation
 | ParadeDB   | 0.22.0+                                          |
 | PostgreSQL | 15+ (PostgreSQL adapter with ParadeDB extension) |
-Notes:
-- CI runs Ruby `3.2` through `4.0` across Rails `7.2` and `8.1` on PostgreSQL `18`.
-- Schema compatibility is checked against every ParadeDB release.
-- The maintained minimum ParadeDB version is `0.22.0`; update `README.md`, `RELEASE.md`, and CI in the same PR whenever that floor changes.
-## Installation
-```ruby
-gem "rails-paradedb"
-```
-```bash
-bundle install
-```
-## Quick Start
-### Prerequisites
-Make sure your Rails app uses PostgreSQL and that `pg_search` is installed in the target database:
-```sql
-CREATE EXTENSION IF NOT EXISTS pg_search;
-```
-### 1. Define Your Model and Index
-```ruby
-class MockItem < ActiveRecord::Base
-  include ParadeDB::Model
-  self.table_name = "mock_items"
-  self.primary_key = "id"
-end
-class MockItemIndex < ParadeDB::Index
-  self.table_name = :mock_items
-  self.key_field = :id
-  self.index_name = :search_idx
-  self.fields = {
-    id: nil,
-    description: nil,
-    category: nil,
-    rating: nil,
-    in_stock: nil,
-    created_at: nil,
-    metadata: nil,
-    weight_range: nil
-  }
-end
-```
-### 2. Create the BM25 Index in a Migration
-```ruby
-class AddMockItemBm25Index < ActiveRecord::Migration[7.2] # use your app's migration version
-  def up
-    create_paradedb_index(MockItemIndex, if_not_exists: true)
-  end
-  def down
-    remove_bm25_index :mock_items, name: :search_idx, if_exists: true
-  end
-end
-```
-### 3. Search
-```ruby
-MockItem.search(:description).matching_all("running shoes")
-MockItem.search(:description).matching_any("wireless", "bluetooth")
-MockItem.search(:description).term("electronics")
-```
-## Query API
-```ruby
-# Full text
-MockItem.search(:description).matching_all("running shoes")
-MockItem.search(:description).matching_any("wireless bluetooth")
-# Query-time tokenizer override
-MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace")
-MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace('lowercase=false')")
-# Fuzzy options on match/term
-# Note: tokenizer overrides are mutually exclusive with fuzzy options.
-MockItem.search(:description).matching_any("runing shose", distance: 1)
-MockItem.search(:description).matching_all("runing", distance: 1, prefix: true)
-MockItem.search(:description).term("shose", distance: 1, transposition_cost_one: true)
-# Other query types
-MockItem.search(:description).phrase("running shoes", slop: 2)
-MockItem.search(:description).phrase("running shoes", tokenizer: "whitespace")
-MockItem.search(:description).phrase(%w[running shoes])
-MockItem.search(:description).regex("run.*")
-MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"))
-MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes", ordered: true))
-MockItem.search(:description).near(ParadeDB.proximity("hiking", "running").within(2, "shoes"))
-MockItem.search(:description).near(ParadeDB.proximity("running").within(2, "shoes", "sneakers", ordered: true))
-MockItem.search(:description).near(ParadeDB.regex_term("run.*").within(3, "shoes"))
-MockItem.search(:description).near(ParadeDB.proximity("trail").within(1, "running").within(1, "shoes"))
-MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), boost: 2.0)
-MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), const: 1.0)
-MockItem.search(:description).regex_phrase("run.*", "shoes")
-MockItem.search(:description).phrase_prefix("run", "sh", max_expansion: 100)
-MockItem.search(:description).parse("running AND shoes", lenient: true)
-# Match-all / exists / ranges
-MockItem.search(:id).match_all
-MockItem.search(:id).exists
-MockItem.search(:rating).range(gte: 3, lt: 5)
-MockItem.search(:weight_range).range_term("(10, 12]", relation: "Intersects")
-# Similarity
-MockItem.more_like_this(42, fields: [:description])
-```
-## Scoring and Highlighting
-```ruby
-results = MockItem.search(:description)
-                 .matching_all("shoes")
-                 .with_score
-                 .order(search_score: :desc)
-MockItem.search(:description)
-       .matching_all("shoes")
-       .with_snippet(:description, start_tag: "<b>", end_tag: "</b>", max_chars: 80)
-MockItem.search(:description)
-       .matching_all("running")
-       .with_snippets(:description, max_chars: 15, limit: 2, offset: 0, sort_by: :position)
-MockItem.search(:description)
-       .matching_all("running")
-       .with_snippet_positions(:description)
-```
-## Facets and Aggregations
-```ruby
-# Rows + facets (requires order + limit)
-relation = MockItem.search(:description)
-                  .matching_all("shoes")
-                  .with_facets(:category, size: 10)
-                  .order(:id)
-                  .limit(10)
-rows = relation.to_a
-facets = relation.facets
-# Facets-only aggregate
-MockItem.search(:description).matching_all("shoes").facets(:category)
-# Named aggregations
-MockItem.search(:description).matching_all("shoes").facets_agg(
-  docs: ParadeDB::Aggregations.value_count(:id),
-  avg_rating: ParadeDB::Aggregations.avg(:rating)
-)
-# Window aggregations + rows
-MockItem.search(:description).matching_all("shoes").with_agg(
-  exact: false,
-  docs: ParadeDB::Aggregations.value_count(:id),
-  stats: ParadeDB::Aggregations.stats(:rating)
-).order(:id).limit(10)
-# Grouped aggregations
-MockItem.search(:id).match_all.aggregate_by(
-  :category,
-  docs: ParadeDB::Aggregations.value_count(:id)
-)
-```
-If you group by text/JSON fields, index those fields using `:literal` or `:literal_normalized`.
-## ActiveRecord and Arel Composition
-Use ParadeDB conditions with normal ActiveRecord scopes:
-```ruby
-MockItem.search(:description)
-        .matching_all("shoes")
-        .where(in_stock: true)
-        .where(MockItem.arel_table[:rating].gteq(4))
-        .order(created_at: :desc)
-```
-For advanced SQL composition, ParadeDB operators are also available through Arel predications:
-```ruby
-t = MockItem.arel_table
-MockItem.where(t[:description].pdb_match("running shoes"))
-```
-## Diagnostics Helpers
-Ruby helpers:
-```ruby
-ParadeDB.paradedb_indexes
-ParadeDB.paradedb_index_segments("search_idx")
-ParadeDB.paradedb_verify_index("search_idx", sample_rate: 0.1)
-ParadeDB.paradedb_verify_all_indexes(index_pattern: "search_idx")
-```
-Availability depends on the installed `pg_search` version.
-Repository development tasks (from this repo's `Rakefile`):
-```bash
-rake paradedb:diagnostics:indexes
-rake "paradedb:diagnostics:index_segments[search_idx]"
-rake "paradedb:diagnostics:verify_index[search_idx]" SAMPLE_RATE=0.1
-rake paradedb:diagnostics:verify_all_indexes INDEX_PATTERN=search_idx
-```
-## Index Validation
-By default, index validation is disabled. You can enable runtime checks globally:
-```ruby
-# config/initializers/paradedb.rb
-ParadeDB.index_validation_mode = :warn  # :warn, :raise, or :off
-```
-When enabled, `rails-paradedb` validates that the expected BM25 index exists and can raise
-`ParadeDB::IndexDriftError` or `ParadeDB::IndexClassNotFoundError` depending on mode.
-## Common Errors
-### "No search field set. Call .search(column) first."
-```ruby
-# ❌ Missing .search(...)
-MockItem.matching_all("shoes")
-# ✅ Start with .search(column)
-MockItem.search(:description).matching_all("shoes")
-```
-### "with_facets requires ORDER BY and LIMIT"
-```ruby
-# ❌ Missing order/limit
-MockItem.search(:description).matching_all("shoes").with_facets(:category).to_a
-# ✅ Include both
-relation = MockItem.search(:description)
-                   .matching_all("shoes")
-                   .with_facets(:category)
-                   .order(:id)
-                   .limit(10)
-relation.to_a
-relation.facets
-```
-### "search(:field) is not indexed"
-```ruby
-# ❌ Field not in your ParadeDB::Index fields hash
-MockItem.search(:title).matching_all("shoes")
-# ✅ Add :title to the index definition, then migrate
-```
-## Security
-`rails-paradedb` builds SQL through Arel nodes and quoted literals (`Arel::Nodes.build_quoted`)
-rather than manual string interpolation. Tokenizer expressions are validated and search operators are
-rendered through typed nodes, with unit and integration coverage for quoting and edge cases.
 ## Examples
-- [Quick Start](examples/quickstart/quickstart.rb)
+- [Quickstart](examples/quickstart/quickstart.rb)
 - [Faceted Search](examples/faceted_search/faceted_search.rb)
 - [Autocomplete](examples/autocomplete/autocomplete.rb)
 - [More Like This](examples/more_like_this/more_like_this.rb)
-- [Hybrid RRF](examples/hybrid_rrf/hybrid_rrf.rb)
+- [Hybrid Search (RRF)](examples/hybrid_rrf/hybrid_rrf.rb)
 - [RAG](examples/rag/rag.rb)
-- [Examples README](examples/README.md)
-## Documentation
-- **ParadeDB Official Docs**: <https://docs.paradedb.com>
-- **ParadeDB Website**: <https://paradedb.com>
 ## Contributing
@@ -346,7 +56,7 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, test commands, lin
 ## Support
-If you're missing a feature or found a bug, open a
+If you're missing a feature or have found a bug, open a
 [GitHub Issue](https://github.com/paradedb/rails-paradedb/issues/new/choose).
 For community support:
@@ -366,4 +76,4 @@ We would like to thank the following members of the community for their valuable
 ## License
-rails-paradedb is licensed under the [MIT License](LICENSE).
+ParadeDB for Rails is licensed under the [MIT License](LICENSE).

data/lib/parade_db/arel/builder.rb CHANGED

@@ -302,12 +302,11 @@ module ParadeDB
       def apply_tokenizer(node, tokenizer)
         return node if tokenizer.nil?
-        unless tokenizer.is_a?(String)
-          raise ArgumentError, "tokenizer must be a string"
+        unless tokenizer.is_a?(Tokenizer)
+          raise ArgumentError, "tokenizer must be a Tokenizer"
         end
-        normalized = normalize_tokenizer(tokenizer)
-        Nodes::TokenizerCast.new(node, normalized)
+        return Nodes::TokenizerCast.new(node, tokenizer.render())
       end
       def apply_slop(node, slop)
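
Note on the `apply_tokenizer` change: the type check moves from `String` to `Tokenizer`, so a plain string such as `"whitespace"` is now rejected. The following is a minimal, self-contained sketch of that behaviour. `Tokenizer` and `TokenizerCast` here are simplified stand-ins (assumptions) for the gem's own classes, not the real implementations.

```ruby
# Simplified stand-in for the gem's Tokenizer class (assumption).
class Tokenizer
  def initialize(name)
    @name = name
  end

  # The real class also renders arguments; this stand-in renders the name only.
  def render
    "pdb.#{@name}"
  end
end

# Simplified stand-in for ParadeDB's Nodes::TokenizerCast (assumption).
TokenizerCast = Struct.new(:node, :tokenizer_sql)

# Mirrors the 0.7.0 logic: nil passes through, anything that is not a
# Tokenizer raises, and a Tokenizer is rendered before being wrapped.
def apply_tokenizer(node, tokenizer)
  return node if tokenizer.nil?
  raise ArgumentError, "tokenizer must be a Tokenizer" unless tokenizer.is_a?(Tokenizer)

  TokenizerCast.new(node, tokenizer.render)
end

cast = apply_tokenizer("running shoes", Tokenizer.new("whitespace"))

# A 0.6.0-style string argument is rejected under the new check.
rejected =
  begin
    apply_tokenizer("running shoes", "whitespace")
    false
  rescue ArgumentError
    true
  end
```

Assuming query-time `tokenizer:` values reach `apply_tokenizer` unchanged, call sites that passed a string in 0.6.0 would need to pass a `Tokenizer` object instead.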

data/lib/parade_db/index.rb CHANGED

@@ -1,5 +1,7 @@
 # frozen_string_literal: true
+require_relative "tokenizer"
 module ParadeDB
   class Index
     class << self
@@ -47,117 +49,29 @@ module ParadeDB
     class TokenizerParser
       TOKENIZER_EXPRESSION = /\A[a-zA-Z_][a-zA-Z0-9_]*(?:(?:::|\.)[a-zA-Z_][a-zA-Z0-9_]*)*(?:\(\s*[a-zA-Z0-9_'".,=\s:-]*\s*\))?\z/.freeze
-      TOKENIZER_SINGLE_KEYS = %i[tokenizer args named_args filters stemmer alias].freeze
+      TOKENIZER_SINGLE_KEYS = %i[tokenizer alias].freeze
       class << self
-        def parse(source_name, tokenizer_spec)
-          case tokenizer_spec
-          when Symbol, String
-            [build_tokenized_entry(source_name, tokenizer_spec.to_s, {})]
-          when Hash
-            tokenizer_spec.map do |tokenizer, opts|
-              case opts
-              when Hash
-                build_tokenized_entry(source_name, tokenizer.to_s, normalize_options(opts))
-              when Symbol, String
-                build_tokenized_entry(source_name, tokenizer.to_s, normalize_positional_option(opts))
-              else
-                raise InvalidIndexDefinition,
-                      "tokenizer options for #{source_name}.#{tokenizer} must be a Hash, Symbol, or String"
-              end
-            end
-          else
-            raise InvalidIndexDefinition,
-                  "invalid tokenizer definition for #{source_name}: #{tokenizer_spec.inspect}"
-          end
-        end
-        private
-        def parse_structured_tokenizer_config(source_name, config, context:)
-          unless config.is_a?(Hash)
-            raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Hash"
+        def parse(source_name, tokenizer, context:)
+          unless tokenizer.is_a?(Tokenizer)
+            raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Tokenizer"
           end
-          tokenizer = config[:tokenizer] || config["tokenizer"]
-          if tokenizer.nil?
-            raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} requires :tokenizer"
-          end
-          tokenizer_name = tokenizer.to_s
-          validate_tokenizer_name!(source_name, tokenizer_name)
-          args = config[:args] || config["args"]
-          named_args = config[:named_args] || config["named_args"]
-          filters = config[:filters] || config["filters"]
-          stemmer = config[:stemmer] || config["stemmer"]
-          alias_name = config[:alias] || config["alias"]
           options = {}
-          if args
-            unless args.respond_to?(:to_ary)
-              raise InvalidIndexDefinition, "args for #{source_name.inspect} must be an Array"
-            end
-            options[:__positional] = args.to_ary
-          end
-          if named_args
-            unless named_args.is_a?(Hash)
-              raise InvalidIndexDefinition, "named_args for #{source_name.inspect} must be a Hash"
-            end
-            named_args.each { |key, value| options[key.to_sym] = value }
-          end
-          if filters
-            unless filters.respond_to?(:to_ary)
-              raise InvalidIndexDefinition, "filters for #{source_name.inspect} must be an Array"
-            end
-            filters.to_ary.each do |name|
-              filter_key = name.to_s
-              if filter_key == "stemmer" && stemmer
-                options[:stemmer] = stemmer
-              else
-                key = filter_key.to_sym
-                options[key] = true unless options.key?(key)
-              end
-            end
-          end
-          options[:stemmer] = stemmer if stemmer && !options.key?(:stemmer)
-          options[:alias] = alias_name.to_s if alias_name
-          build_tokenized_entry(source_name, tokenizer_name, options)
-        end
-        def normalize_options(opts)
-          opts.each_with_object({}) do |(key, value), memo|
-            memo[key.to_sym] = value
-          end
-        end
-        def normalize_positional_option(option)
-          { __positional: [option.to_s] }
-        end
+          options[:__positional] = tokenizer.positional_args.dup unless tokenizer.positional_args.nil?
+          tokenizer.options&.each { |key, value| options[key.to_sym] = value }
-        def build_tokenized_entry(source_name, tokenizer, options)
-          validate_tokenizer_name!(source_name, tokenizer) unless tokenizer.nil?
           key = options[:alias]&.to_s || source_name
           DefinitionCompiler::Entry.new(
             source: source_name,
             expression: expression?(source_name),
-            tokenizer: tokenizer,
+            tokenizer: tokenizer.name,
             options: options,
             query_key: key
           )
         end
-        def validate_tokenizer_name!(source_name, tokenizer)
-          return if TOKENIZER_EXPRESSION.match?(tokenizer)
-          raise InvalidIndexDefinition,
-                "invalid tokenizer name #{tokenizer.inspect} for #{source_name}. " \
-                "Expected identifier form like simple, pdb::simple, or pdb::ngram(2, 5, alias=field_alias)."
-        end
+        private
         def expression?(value)
           value.match?(/[^a-zA-Z0-9_]/)
@@ -260,33 +174,22 @@ module ParadeDB
             elsif tokenizers
               if single_tokenizer_keys_present
                 raise InvalidIndexDefinition,
-                      "field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:args/:named_args/:filters/:stemmer/:alias"
+                      "field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:alias"
               end
               unless tokenizers.respond_to?(:to_ary) && !tokenizers.to_ary.empty?
                 raise InvalidIndexDefinition, "field #{source_name.inspect} :tokenizers must be a non-empty Array"
               end
               tokenizers.to_ary.each_with_index do |tokenizer_config, idx|
-                entry = TokenizerParser.send(
-                  :parse_structured_tokenizer_config,
-                  source_name,
-                  tokenizer_config,
-                  context: "tokenizers[#{idx}]"
-                )
+                entry = TokenizerParser.parse(source_name, tokenizer_config, context: "tokenizers[#{idx}]")
                 entries << entry
               end
-            elsif single_tokenizer_keys_present
-              unless normalized[:tokenizer]
-                raise InvalidIndexDefinition,
-                      "field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
-              end
-              entry = TokenizerParser.send(
-                :parse_structured_tokenizer_config,
-                source_name,
-                select_keys(normalized, TokenizerParser::TOKENIZER_SINGLE_KEYS),
-                context: "tokenizer config"
-              )
+            elsif normalized.key?(:tokenizer)
+              entry = TokenizerParser.parse(source_name, normalized[:tokenizer], context: "tokenizer")
               entries << entry
+            elsif single_tokenizer_keys_present
+              raise InvalidIndexDefinition,
+                    "field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
             else
               entries << Entry.new(
                 source: source_name,

data/lib/parade_db/migration_helpers.rb CHANGED

@@ -336,7 +336,7 @@ module ParadeDB
           elsif entries.length == 1
             "#{source_ruby} #{bm25_tokenizer_config_ruby(entries.first)}"
           else
-            configs = entries.map { |e| bm25_tokenizer_config_ruby(e) }
+            configs = entries.map { |e| bm25_tokenizer_ruby_from_entry(e) }
             "#{source_ruby} { tokenizers: [#{configs.join(', ')}] }"
           end
         end
@@ -678,6 +678,10 @@ module ParadeDB
     end
     def bm25_tokenizer_config_ruby(entry)
+      "{ tokenizer: #{bm25_tokenizer_ruby_from_entry(entry)} }"
+    end
+    def bm25_tokenizer_ruby_from_entry(entry)
       opts = entry[:options].dup
       positional_args = Array(opts.delete(:__positional))
       alias_val = opts.delete(:alias)
@@ -685,16 +689,19 @@ module ParadeDB
       max_val = opts.delete(:max)
       positional_args = [min_val, max_val] + positional_args if min_val && max_val
-      parts = ["tokenizer: #{entry[:tokenizer].to_sym.inspect}"]
-      parts << "args: #{positional_args.inspect}" unless positional_args.empty?
-      parts << "alias: #{alias_val.inspect}" if alias_val
+      opts[:alias] = alias_val if alias_val
+      bm25_tokenizer_ruby(entry[:tokenizer], positional_args, opts)
+    end
-      unless opts.empty?
-        named_pairs = opts.map { |k, v| "#{k.inspect} => #{v.inspect}" }.join(", ")
-        parts << "named_args: { #{named_pairs} }"
+    def bm25_tokenizer_ruby(name, positional_args, options)
+      if name.match?(/\A[a-z_][a-z0-9_]*\z/) && Tokenizer.respond_to?(name)
+        args = positional_args.map { |arg| ruby_literal(arg) }
+        args << "options: #{ruby_hash_literal(options)}" unless options.empty?
+        return "Tokenizer.#{name}(#{args.join(', ')})"
       end
-      "{ #{parts.join(', ')} }"
+      "Tokenizer.new(#{name.inspect}, #{ruby_literal(positional_args.empty? ? nil : positional_args)}, #{ruby_literal(options.empty? ? nil : options)})"
     end
     def split_sql_arguments(args_sql)

data/lib/parade_db/tokenizer.rb ADDED

@@ -0,0 +1,95 @@
+class Tokenizer
+  attr_reader :name, :positional_args, :options
+  def initialize(name, positional_args, options)
+    @name = name
+    @positional_args = positional_args
+    @options = options
+  end
+  def render()
+    if options.nil? && positional_args.nil?
+      return "pdb.#{name}"
+    end
+    args = []
+    if !positional_args.nil?
+      args.concat(positional_args.map { |x| render_positional_arg(x) })
+    end
+    if !options.nil?
+      args.concat(options.map {|k, v| quote_term("#{k}=#{v}")})
+    end
+    return "pdb.#{name}(#{args.join(",")})"
+  end
+  def self.whitespace(options: nil)
+    new("whitespace", nil, options)
+  end
+  def self.unicode_words(options: nil)
+    new("unicode_words", nil, options)
+  end
+  def self.ngram(min_gram, max_gram, options: nil)
+    new("ngram", [min_gram, max_gram], options)
+  end
+  def self.simple(options: nil)
+    new("simple", nil, options)
+  end
+  def self.literal(options: nil)
+    new("literal", nil, options)
+  end
+  def self.literal_normalized(options: nil)
+    new("literal_normalized", nil, options)
+  end
+  def self.edge_ngram(min_gram, max_gram, options: nil)
+    new("edge_ngram", [min_gram, max_gram], options)
+  end
+  def self.regex_pattern(pattern, options: nil)
+    new("regex_pattern", [pattern], options)
+  end
+  def self.chinese_compatible(options: nil)
+    new("chinese_compatible", nil, options)
+  end
+  def self.lindera(dictionary, options: nil)
+    new("lindera", [dictionary], options)
+  end
+  def self.icu(options: nil)
+    new("icu", nil, options)
+  end
+  def self.jieba(options: nil)
+    new("jieba", nil, options)
+  end
+  def self.source_code(options: nil)
+    new("source_code", nil, options)
+  end
+  private
+  def quote_term(value)
+    escaped = value.gsub("'", "''")
+    "'#{escaped}'"
+  end
+  def render_positional_arg(value)
+    case value
+    when true, false, Numeric
+      value.to_s
+    when String
+      quote_term(value)
+    else
+      raise InvalidArgumentError, "Unsupported tokenizer arg type: #{value.class}"
+    end
+  end
+end
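
Note on `Tokenizer#render`: the sketch below is a condensed copy of the rendering logic with three of the factory methods, meant only to show what SQL fragments the new class produces. It is not the released file. One deliberate difference: the released `render_positional_arg` raises `InvalidArgumentError`, a constant that is not defined in the file shown, so unless it is defined elsewhere that branch would likely surface as a `NameError`. The sketch uses Ruby's built-in `ArgumentError` instead.

```ruby
# Condensed sketch of the 0.7.0 Tokenizer (three factories only).
class Tokenizer
  attr_reader :name, :positional_args, :options

  def initialize(name, positional_args, options)
    @name = name
    @positional_args = positional_args
    @options = options
  end

  # Renders "pdb.<name>" or "pdb.<name>(arg,...,'key=value',...)".
  def render
    return "pdb.#{name}" if options.nil? && positional_args.nil?

    args = []
    args.concat(positional_args.map { |arg| render_positional_arg(arg) }) unless positional_args.nil?
    args.concat(options.map { |key, value| quote_term("#{key}=#{value}") }) unless options.nil?
    "pdb.#{name}(#{args.join(',')})"
  end

  def self.simple(options: nil)
    new("simple", nil, options)
  end

  def self.ngram(min_gram, max_gram, options: nil)
    new("ngram", [min_gram, max_gram], options)
  end

  def self.regex_pattern(pattern, options: nil)
    new("regex_pattern", [pattern], options)
  end

  private

  # Single quotes are doubled, as in a SQL string literal.
  def quote_term(value)
    "'#{value.gsub("'", "''")}'"
  end

  def render_positional_arg(value)
    case value
    when true, false, Numeric then value.to_s
    when String then quote_term(value)
    else raise ArgumentError, "Unsupported tokenizer arg type: #{value.class}"
    end
  end
end

plain = Tokenizer.simple.render
aliased = Tokenizer.simple(options: { alias: "description_simple" }).render
ngram = Tokenizer.ngram(2, 5, options: { alias: "description_ngram" }).render
quoted = Tokenizer.regex_pattern("it's").render
```

In this sketch `plain` is `pdb.simple`, `aliased` is `pdb.simple('alias=description_simple')`, `ngram` is `pdb.ngram(2,5,'alias=description_ngram')`, and `quoted` is `pdb.regex_pattern('it''s')`.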

data/lib/parade_db/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module ParadeDB
-  VERSION = "0.6.0"
+  VERSION = "0.7.0"
 end

data/lib/parade_db.rb CHANGED

@@ -12,6 +12,7 @@ require_relative "parade_db/migration_helpers"
 require_relative "parade_db/model"
 require_relative "parade_db/search_methods"
 require_relative "parade_db/railtie"
+require_relative "parade_db/tokenizer"
 module ParadeDB
   FacetQueryError = Errors::FacetQueryError

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: rails-paradedb
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 0.7.0
 platform: ruby
 authors:
 - ParadeDB
@@ -113,6 +113,7 @@ files:
 - lib/parade_db/query.rb
 - lib/parade_db/railtie.rb
 - lib/parade_db/search_methods.rb
+- lib/parade_db/tokenizer.rb
 - lib/parade_db/tokenizer_sql.rb
 - lib/parade_db/version.rb
 homepage: https://github.com/paradedb/rails-paradedb