rails-paradedb 0.6.0 → 0.7.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 1bcba07642cf33fe747ed6f810374adee776ee014533e64f5b38d490335dd672
- data.tar.gz: a470ac63132444ae93ee893eee869d6dac4641774fe965f6519592a7e5c50d35
+ metadata.gz: e28ebb1ec7734474a1ed321766b2e652f4aa1d56aadcc16d48667ac4ddbfc5bc
+ data.tar.gz: 51c25b12045eed272ad0e71475c22f9a670c4da6ff46fcad6d99ce32e3bc1f6e
  SHA512:
- metadata.gz: aa7193be8a40ed3714b3d17cf1c1b53980b63245a70c878e6aaa164c272df072217eb9c71a808cc29c0dc2611056f7d323f3d6aa9e507b77648beea83a734b6e
- data.tar.gz: 6870426a910699fd8b1b12f7a03402d8157138d2b2aa4c210339356a2ff60299b29301a1a5cc28f64656f4e80479719e0324270dd0f5a9f7668e09ee9654a262
+ metadata.gz: 4fced3ee50bef2ef4c59567246e9c5382a7e2e3f60293a785d33fe4e9e857abea685740fcf2c2da38c98dce4371d13010f9eacbbbe1c580df760208a20542f3d
+ data.tar.gz: 2e3efbf965bb99866dffe45de2963528ccf9ebfaa4a947abe5b72211306409d5b006eb1e90621cc6591e72f473d91593348b0d1bcaaa1e96cbc08cbe252e8615
data/CHANGELOG.md CHANGED
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file. The format

  ## [Unreleased]

+ ## [0.7.0] - 2026-04-21
+
+ ### Changed
+
+ - **BREAKING**: Use function based approach for specifying tokenizers: `Tokenizer.simple(options: {alias: "description_simple"})`
+
  ## [0.6.0] - 2026-04-14

  ### Added
@@ -126,7 +132,8 @@ All notable changes to this project will be documented in this file. The format
  - Schema dump/load round-trip for tokenizer configuration and index options
  (including `target_segment_count`)

- [Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.6.0...HEAD
+ [Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.7.0...HEAD
+ [0.7.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.7.0
  [0.6.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.6.0
  [0.5.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.5.0
  [0.4.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.4.0
data/README.md CHANGED
@@ -16,31 +16,21 @@
  <a href="https://docs.paradedb.com/changelog/">Changelog</a>
  </h3>

- ---
-
- # rails-paradedb
-
- [![Gem Version](https://img.shields.io/gem/v/rails-paradedb)](https://rubygems.org/gems/rails-paradedb)
- [![Ruby Requirement](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Frubygems.org%2Fapi%2Fv1%2Fversions%2Frails-paradedb.json&query=%24%5B0%5D.ruby_version&label=ruby&logo=ruby)](https://rubygems.org/gems/rails-paradedb)
- [![Gem Downloads](https://img.shields.io/gem/dt/rails-paradedb)](https://rubygems.org/gems/rails-paradedb)
- [![Codecov](https://codecov.io/gh/paradedb/rails-paradedb/graph/badge.svg)](https://codecov.io/gh/paradedb/rails-paradedb)
- [![License](https://img.shields.io/github/license/paradedb/rails-paradedb?color=blue)](https://github.com/paradedb/rails-paradedb?tab=MIT-1-ov-file#readme)
- [![Slack URL](https://img.shields.io/badge/Join%20Slack-purple?logo=slack&link=https%3A%2F%2Fparadedb.com%2Fslack)](https://paradedb.com/slack)
- [![X URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb)](https://x.com/paradedb)
+ <p align="center">
+ <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/v/rails-paradedb" alt="Gem Version"></a>&nbsp;
+ <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Frubygems.org%2Fapi%2Fv1%2Fversions%2Frails-paradedb.json&query=%24%5B0%5D.ruby_version&label=ruby&logo=ruby" alt="Ruby Requirement"></a>&nbsp;
+ <a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/dt/rails-paradedb" alt="Gem Downloads"></a>&nbsp;
+ <a href="https://codecov.io/gh/paradedb/rails-paradedb"><img src="https://codecov.io/gh/paradedb/rails-paradedb/graph/badge.svg" alt="Codecov"></a>&nbsp;
+ <a href="https://github.com/paradedb/rails-paradedb?tab=MIT-1-ov-file#readme"><img src="https://img.shields.io/github/license/paradedb/rails-paradedb?color=blue" alt="License"></a>&nbsp;
+ <a href="https://paradedb.com/slack"><img src="https://img.shields.io/badge/Join%20Slack-purple?logo=slack" alt="Community"></a>&nbsp;
+ <a href="https://x.com/paradedb"><img src="https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb" alt="Follow @paradedb"></a>
+ </p>

- The official Ruby client for [ParadeDB](https://paradedb.com), built for ActiveRecord.
- Use Elastic-quality full-text search, scoring, snippets, facets, and aggregations directly from Rails.
+ ---

- ## Features
+ ## ParadeDB for Rails

- - BM25 index management in Rails migrations (`create_paradedb_index`, `remove_bm25_index`, `reindex_bm25`)
- - Chainable ActiveRecord search API (`matching_all`, `matching_any`, `term`, `phrase`, `regex`, `near`, `parse`, and more)
- - Relevance and highlighting (`with_score`, `with_snippet`, `with_snippets`, `with_snippet_positions`)
- - Facets and aggregations (`with_facets`, `facets`, `with_agg`, `facets_agg`, `aggregate_by`)
- - More Like This similarity search (`more_like_this`)
- - Arel integration for advanced query composition with native ParadeDB operators
- - Diagnostics helpers and rake tasks for index health and verification
- - Optional runtime index validation to detect missing/drifted BM25 indexes
+ The official ActiveRecord integration for [ParadeDB](https://paradedb.com), including first-class support for managing BM25 indexes and running queries using the full ParadeDB API. Follow the [getting started guide](https://docs.paradedb.com/documentation/getting-started/environment#rails) to begin.

  ## Requirements & Compatibility

@@ -51,294 +41,14 @@ Use Elastic-quality full-text search, scoring, snippets, facets, and aggregation
  | ParadeDB | 0.22.0+ |
  | PostgreSQL | 15+ (PostgreSQL adapter with ParadeDB extension) |

- Notes:
-
- - CI runs Ruby `3.2` through `4.0` across Rails `7.2` and `8.1` on PostgreSQL `18`.
- - Schema compatibility is checked against every ParadeDB release.
- - The maintained minimum ParadeDB version is `0.22.0`; update `README.md`, `RELEASE.md`, and CI in the same PR whenever that floor changes.
-
- ## Installation
-
- ```ruby
- gem "rails-paradedb"
- ```
-
- ```bash
- bundle install
- ```
-
- ## Quick Start
-
- ### Prerequisites
-
- Make sure your Rails app uses PostgreSQL and that `pg_search` is installed in the target database:
-
- ```sql
- CREATE EXTENSION IF NOT EXISTS pg_search;
- ```
-
- ### 1. Define Your Model and Index
-
- ```ruby
- class MockItem < ActiveRecord::Base
- include ParadeDB::Model
-
- self.table_name = "mock_items"
- self.primary_key = "id"
- end
-
- class MockItemIndex < ParadeDB::Index
- self.table_name = :mock_items
- self.key_field = :id
- self.index_name = :search_idx
- self.fields = {
- id: nil,
- description: nil,
- category: nil,
- rating: nil,
- in_stock: nil,
- created_at: nil,
- metadata: nil,
- weight_range: nil
- }
- end
- ```
-
- ### 2. Create the BM25 Index in a Migration
-
- ```ruby
- class AddMockItemBm25Index < ActiveRecord::Migration[7.2] # use your app's migration version
- def up
- create_paradedb_index(MockItemIndex, if_not_exists: true)
- end
-
- def down
- remove_bm25_index :mock_items, name: :search_idx, if_exists: true
- end
- end
- ```
-
- ### 3. Search
-
- ```ruby
- MockItem.search(:description).matching_all("running shoes")
- MockItem.search(:description).matching_any("wireless", "bluetooth")
- MockItem.search(:description).term("electronics")
- ```
-
- ## Query API
-
- ```ruby
- # Full text
- MockItem.search(:description).matching_all("running shoes")
- MockItem.search(:description).matching_any("wireless bluetooth")
-
- # Query-time tokenizer override
- MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace")
- MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace('lowercase=false')")
-
- # Fuzzy options on match/term
- # Note: tokenizer overrides are mutually exclusive with fuzzy options.
- MockItem.search(:description).matching_any("runing shose", distance: 1)
- MockItem.search(:description).matching_all("runing", distance: 1, prefix: true)
- MockItem.search(:description).term("shose", distance: 1, transposition_cost_one: true)
-
- # Other query types
- MockItem.search(:description).phrase("running shoes", slop: 2)
- MockItem.search(:description).phrase("running shoes", tokenizer: "whitespace")
- MockItem.search(:description).phrase(%w[running shoes])
- MockItem.search(:description).regex("run.*")
- MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"))
- MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes", ordered: true))
- MockItem.search(:description).near(ParadeDB.proximity("hiking", "running").within(2, "shoes"))
- MockItem.search(:description).near(ParadeDB.proximity("running").within(2, "shoes", "sneakers", ordered: true))
- MockItem.search(:description).near(ParadeDB.regex_term("run.*").within(3, "shoes"))
- MockItem.search(:description).near(ParadeDB.proximity("trail").within(1, "running").within(1, "shoes"))
- MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), boost: 2.0)
- MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), const: 1.0)
- MockItem.search(:description).regex_phrase("run.*", "shoes")
- MockItem.search(:description).phrase_prefix("run", "sh", max_expansion: 100)
- MockItem.search(:description).parse("running AND shoes", lenient: true)
-
- # Match-all / exists / ranges
- MockItem.search(:id).match_all
- MockItem.search(:id).exists
- MockItem.search(:rating).range(gte: 3, lt: 5)
- MockItem.search(:weight_range).range_term("(10, 12]", relation: "Intersects")
-
- # Similarity
- MockItem.more_like_this(42, fields: [:description])
- ```
-
173
- ## Scoring and Highlighting
174
-
175
- ```ruby
176
- results = MockItem.search(:description)
177
- .matching_all("shoes")
178
- .with_score
179
- .order(search_score: :desc)
180
-
181
- MockItem.search(:description)
182
- .matching_all("shoes")
183
- .with_snippet(:description, start_tag: "<b>", end_tag: "</b>", max_chars: 80)
184
-
185
- MockItem.search(:description)
186
- .matching_all("running")
187
- .with_snippets(:description, max_chars: 15, limit: 2, offset: 0, sort_by: :position)
188
-
189
- MockItem.search(:description)
190
- .matching_all("running")
191
- .with_snippet_positions(:description)
192
- ```
193
-
194
- ## Facets and Aggregations
195
-
196
- ```ruby
197
- # Rows + facets (requires order + limit)
198
- relation = MockItem.search(:description)
199
- .matching_all("shoes")
200
- .with_facets(:category, size: 10)
201
- .order(:id)
202
- .limit(10)
203
-
204
- rows = relation.to_a
205
- facets = relation.facets
206
-
207
- # Facets-only aggregate
208
- MockItem.search(:description).matching_all("shoes").facets(:category)
209
-
210
- # Named aggregations
211
- MockItem.search(:description).matching_all("shoes").facets_agg(
212
- docs: ParadeDB::Aggregations.value_count(:id),
213
- avg_rating: ParadeDB::Aggregations.avg(:rating)
214
- )
215
-
216
- # Window aggregations + rows
217
- MockItem.search(:description).matching_all("shoes").with_agg(
218
- exact: false,
219
- docs: ParadeDB::Aggregations.value_count(:id),
220
- stats: ParadeDB::Aggregations.stats(:rating)
221
- ).order(:id).limit(10)
222
-
223
- # Grouped aggregations
224
- MockItem.search(:id).match_all.aggregate_by(
225
- :category,
226
- docs: ParadeDB::Aggregations.value_count(:id)
227
- )
228
- ```
229
-
230
- If you group by text/JSON fields, index those fields using `:literal` or `:literal_normalized`.
231
-
232
- ## ActiveRecord and Arel Composition
233
-
234
- Use ParadeDB conditions with normal ActiveRecord scopes:
235
-
236
- ```ruby
237
- MockItem.search(:description)
238
- .matching_all("shoes")
239
- .where(in_stock: true)
240
- .where(MockItem.arel_table[:rating].gteq(4))
241
- .order(created_at: :desc)
242
- ```
243
-
244
- For advanced SQL composition, ParadeDB operators are also available through Arel predications:
245
-
246
- ```ruby
247
- t = MockItem.arel_table
248
- MockItem.where(t[:description].pdb_match("running shoes"))
249
- ```
250
-
251
- ## Diagnostics Helpers
252
-
253
- Ruby helpers:
254
-
255
- ```ruby
256
- ParadeDB.paradedb_indexes
257
- ParadeDB.paradedb_index_segments("search_idx")
258
- ParadeDB.paradedb_verify_index("search_idx", sample_rate: 0.1)
259
- ParadeDB.paradedb_verify_all_indexes(index_pattern: "search_idx")
260
- ```
261
-
262
- Availability depends on the installed `pg_search` version.
263
-
264
- Repository development tasks (from this repo's `Rakefile`):
265
-
266
- ```bash
267
- rake paradedb:diagnostics:indexes
268
- rake "paradedb:diagnostics:index_segments[search_idx]"
269
- rake "paradedb:diagnostics:verify_index[search_idx]" SAMPLE_RATE=0.1
270
- rake paradedb:diagnostics:verify_all_indexes INDEX_PATTERN=search_idx
271
- ```
272
-
273
- ## Index Validation
274
-
275
- By default, index validation is disabled. You can enable runtime checks globally:
276
-
277
- ```ruby
278
- # config/initializers/paradedb.rb
279
- ParadeDB.index_validation_mode = :warn # :warn, :raise, or :off
280
- ```
281
-
282
- When enabled, `rails-paradedb` validates that the expected BM25 index exists and can raise
283
- `ParadeDB::IndexDriftError` or `ParadeDB::IndexClassNotFoundError` depending on mode.
284
-
- ## Common Errors
-
- ### "No search field set. Call .search(column) first."
-
- ```ruby
- # ❌ Missing .search(...)
- MockItem.matching_all("shoes")
-
- # ✅ Start with .search(column)
- MockItem.search(:description).matching_all("shoes")
- ```
-
- ### "with_facets requires ORDER BY and LIMIT"
-
- ```ruby
- # ❌ Missing order/limit
- MockItem.search(:description).matching_all("shoes").with_facets(:category).to_a
-
- # ✅ Include both
- relation = MockItem.search(:description)
- .matching_all("shoes")
- .with_facets(:category)
- .order(:id)
- .limit(10)
- relation.to_a
- relation.facets
- ```
-
- ### "search(:field) is not indexed"
-
- ```ruby
- # ❌ Field not in your ParadeDB::Index fields hash
- MockItem.search(:title).matching_all("shoes")
-
- # ✅ Add :title to the index definition, then migrate
- ```
-
- ## Security
-
- `rails-paradedb` builds SQL through Arel nodes and quoted literals (`Arel::Nodes.build_quoted`)
- rather than manual string interpolation. Tokenizer expressions are validated and search operators are
- rendered through typed nodes, with unit and integration coverage for quoting and edge cases.
-
  ## Examples

- - [Quick Start](examples/quickstart/quickstart.rb)
+ - [Quickstart](examples/quickstart/quickstart.rb)
  - [Faceted Search](examples/faceted_search/faceted_search.rb)
  - [Autocomplete](examples/autocomplete/autocomplete.rb)
  - [More Like This](examples/more_like_this/more_like_this.rb)
- - [Hybrid RRF](examples/hybrid_rrf/hybrid_rrf.rb)
+ - [Hybrid Search (RRF)](examples/hybrid_rrf/hybrid_rrf.rb)
  - [RAG](examples/rag/rag.rb)
- - [Examples README](examples/README.md)
-
- ## Documentation
-
- - **ParadeDB Official Docs**: <https://docs.paradedb.com>
- - **ParadeDB Website**: <https://paradedb.com>

  ## Contributing

@@ -346,7 +56,7 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, test commands, lin

  ## Support

- If you're missing a feature or found a bug, open a
+ If you're missing a feature or have found a bug, open a
  [GitHub Issue](https://github.com/paradedb/rails-paradedb/issues/new/choose).

  For community support:
@@ -366,4 +76,4 @@ We would like to thank the following members of the community for their valuable

  ## License

- rails-paradedb is licensed under the [MIT License](LICENSE).
+ ParadeDB for Rails is licensed under the [MIT License](LICENSE).
@@ -302,12 +302,11 @@ module ParadeDB
  def apply_tokenizer(node, tokenizer)
    return node if tokenizer.nil?

-   unless tokenizer.is_a?(String)
-     raise ArgumentError, "tokenizer must be a string"
+   unless tokenizer.is_a?(Tokenizer)
+     raise ArgumentError, "tokenizer must be a Tokenizer"
    end

-   normalized = normalize_tokenizer(tokenizer)
-   Nodes::TokenizerCast.new(node, normalized)
+   return Nodes::TokenizerCast.new(node, tokenizer.render())
  end

  def apply_slop(node, slop)
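
Note on the hunk above: the query-time `tokenizer:` argument no longer accepts a string, so the `tokenizer: "whitespace"` form shown in the removed README examples would now raise `ArgumentError`. The following is a minimal sketch of that guard, not the gem's code. `StubTokenizer` and `apply_tokenizer_sketch` are hypothetical names, and a plain array stands in for the `Nodes::TokenizerCast` node.

```ruby
# Hypothetical stand-in for the gem's Tokenizer; only #render matters here.
StubTokenizer = Struct.new(:name) do
  def render
    "pdb.#{name}"
  end
end

# Mirrors the 0.7.0 guard: nil passes the node through untouched, and
# anything that is not a tokenizer object is rejected. The returned array
# is an assumption standing in for Nodes::TokenizerCast.new(node, sql).
def apply_tokenizer_sketch(node, tokenizer)
  return node if tokenizer.nil?
  raise ArgumentError, "tokenizer must be a Tokenizer" unless tokenizer.is_a?(StubTokenizer)

  [:tokenizer_cast, node, tokenizer.render]
end

p apply_tokenizer_sketch(:description, StubTokenizer.new("whitespace"))

begin
  apply_tokenizer_sketch(:description, "whitespace") # pre-0.7.0 string form
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Callers upgrading from 0.6.0 would presumably replace string arguments with a builder call such as `Tokenizer.whitespace`, as the changelog entry suggests.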
@@ -1,5 +1,7 @@
  # frozen_string_literal: true

+ require_relative "tokenizer"
+
  module ParadeDB
    class Index
      class << self
@@ -47,117 +49,29 @@ module ParadeDB

  class TokenizerParser
    TOKENIZER_EXPRESSION = /\A[a-zA-Z_][a-zA-Z0-9_]*(?:(?:::|\.)[a-zA-Z_][a-zA-Z0-9_]*)*(?:\(\s*[a-zA-Z0-9_'".,=\s:-]*\s*\))?\z/.freeze
-   TOKENIZER_SINGLE_KEYS = %i[tokenizer args named_args filters stemmer alias].freeze
+   TOKENIZER_SINGLE_KEYS = %i[tokenizer alias].freeze

    class << self
-     def parse(source_name, tokenizer_spec)
-       case tokenizer_spec
-       when Symbol, String
-         [build_tokenized_entry(source_name, tokenizer_spec.to_s, {})]
-       when Hash
-         tokenizer_spec.map do |tokenizer, opts|
-           case opts
-           when Hash
-             build_tokenized_entry(source_name, tokenizer.to_s, normalize_options(opts))
-           when Symbol, String
-             build_tokenized_entry(source_name, tokenizer.to_s, normalize_positional_option(opts))
-           else
-             raise InvalidIndexDefinition,
-                   "tokenizer options for #{source_name}.#{tokenizer} must be a Hash, Symbol, or String"
-           end
-         end
-       else
-         raise InvalidIndexDefinition,
-               "invalid tokenizer definition for #{source_name}: #{tokenizer_spec.inspect}"
-       end
-     end
-
-     private
-
-     def parse_structured_tokenizer_config(source_name, config, context:)
-       unless config.is_a?(Hash)
-         raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Hash"
+     def parse(source_name, tokenizer, context:)
+       unless tokenizer.is_a?(Tokenizer)
+         raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Tokenizer"
        end

-       tokenizer = config[:tokenizer] || config["tokenizer"]
-       if tokenizer.nil?
-         raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} requires :tokenizer"
-       end
-
-       tokenizer_name = tokenizer.to_s
-       validate_tokenizer_name!(source_name, tokenizer_name)
-
-       args = config[:args] || config["args"]
-       named_args = config[:named_args] || config["named_args"]
-       filters = config[:filters] || config["filters"]
-       stemmer = config[:stemmer] || config["stemmer"]
-       alias_name = config[:alias] || config["alias"]
-
        options = {}
-       if args
-         unless args.respond_to?(:to_ary)
-           raise InvalidIndexDefinition, "args for #{source_name.inspect} must be an Array"
-         end
-         options[:__positional] = args.to_ary
-       end
-
-       if named_args
-         unless named_args.is_a?(Hash)
-           raise InvalidIndexDefinition, "named_args for #{source_name.inspect} must be a Hash"
-         end
-         named_args.each { |key, value| options[key.to_sym] = value }
-       end
-
-       if filters
-         unless filters.respond_to?(:to_ary)
-           raise InvalidIndexDefinition, "filters for #{source_name.inspect} must be an Array"
-         end
-         filters.to_ary.each do |name|
-           filter_key = name.to_s
-           if filter_key == "stemmer" && stemmer
-             options[:stemmer] = stemmer
-           else
-             key = filter_key.to_sym
-             options[key] = true unless options.key?(key)
-           end
-         end
-       end
-
-       options[:stemmer] = stemmer if stemmer && !options.key?(:stemmer)
-       options[:alias] = alias_name.to_s if alias_name
-
-       build_tokenized_entry(source_name, tokenizer_name, options)
-     end
-
-     def normalize_options(opts)
-       opts.each_with_object({}) do |(key, value), memo|
-         memo[key.to_sym] = value
-       end
-     end
-
-     def normalize_positional_option(option)
-       { __positional: [option.to_s] }
-     end
+       options[:__positional] = tokenizer.positional_args.dup unless tokenizer.positional_args.nil?
+       tokenizer.options&.each { |key, value| options[key.to_sym] = value }

-     def build_tokenized_entry(source_name, tokenizer, options)
-       validate_tokenizer_name!(source_name, tokenizer) unless tokenizer.nil?
        key = options[:alias]&.to_s || source_name
        DefinitionCompiler::Entry.new(
          source: source_name,
          expression: expression?(source_name),
-         tokenizer: tokenizer,
+         tokenizer: tokenizer.name,
          options: options,
          query_key: key
        )
      end

-     def validate_tokenizer_name!(source_name, tokenizer)
-       return if TOKENIZER_EXPRESSION.match?(tokenizer)
-
-       raise InvalidIndexDefinition,
-             "invalid tokenizer name #{tokenizer.inspect} for #{source_name}. " \
-             "Expected identifier form like simple, pdb::simple, or pdb::ngram(2, 5, alias=field_alias)."
-     end
+     private

      def expression?(value)
        value.match?(/[^a-zA-Z0-9_]/)
@@ -260,33 +174,22 @@ module ParadeDB
  elsif tokenizers
    if single_tokenizer_keys_present
      raise InvalidIndexDefinition,
-           "field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:args/:named_args/:filters/:stemmer/:alias"
+           "field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:alias"
    end
    unless tokenizers.respond_to?(:to_ary) && !tokenizers.to_ary.empty?
      raise InvalidIndexDefinition, "field #{source_name.inspect} :tokenizers must be a non-empty Array"
    end

    tokenizers.to_ary.each_with_index do |tokenizer_config, idx|
-     entry = TokenizerParser.send(
-       :parse_structured_tokenizer_config,
-       source_name,
-       tokenizer_config,
-       context: "tokenizers[#{idx}]"
-     )
+     entry = TokenizerParser.parse(source_name, tokenizer_config, context: "tokenizers[#{idx}]")
      entries << entry
    end
- elsif single_tokenizer_keys_present
-   unless normalized[:tokenizer]
-     raise InvalidIndexDefinition,
-           "field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
-   end
-   entry = TokenizerParser.send(
-     :parse_structured_tokenizer_config,
-     source_name,
-     select_keys(normalized, TokenizerParser::TOKENIZER_SINGLE_KEYS),
-     context: "tokenizer config"
-   )
+ elsif normalized.key?(:tokenizer)
+   entry = TokenizerParser.parse(source_name, normalized[:tokenizer], context: "tokenizer")
    entries << entry
+ elsif single_tokenizer_keys_present
+   raise InvalidIndexDefinition,
+         "field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
  else
    entries << Entry.new(
      source: source_name,
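
Note on the two index hunks above: `TokenizerParser.parse` now takes a tokenizer object and flattens it into an entry whose `query_key` is the `alias` option when present, otherwise the source field name. A minimal sketch of that flattening, assuming hypothetical stand-ins (`FieldTokenizer` for `Tokenizer`, `ParsedEntry` for `DefinitionCompiler::Entry`, `parse_sketch` for the parser method), with the `expression:` field omitted:

```ruby
# Hypothetical stand-ins for the gem's Tokenizer and DefinitionCompiler::Entry.
FieldTokenizer = Struct.new(:name, :positional_args, :options)
ParsedEntry = Struct.new(:source, :tokenizer, :options, :query_key, keyword_init: true)

def parse_sketch(source_name, tokenizer)
  unless tokenizer.is_a?(FieldTokenizer)
    raise ArgumentError, "tokenizer for #{source_name.inspect} must be a tokenizer object"
  end

  options = {}
  # Positional arguments travel under a reserved key, as in the diff.
  options[:__positional] = tokenizer.positional_args.dup unless tokenizer.positional_args.nil?
  # Named options are copied with symbolized keys; nil options are skipped.
  tokenizer.options&.each { |key, value| options[key.to_sym] = value }

  ParsedEntry.new(
    source: source_name,
    tokenizer: tokenizer.name,
    options: options,
    query_key: options[:alias]&.to_s || source_name
  )
end

aliased = parse_sketch("description", FieldTokenizer.new("ngram", [2, 5], { "alias" => "description_ngram" }))
plain = parse_sketch("description", FieldTokenizer.new("simple", nil, nil))
puts aliased.query_key
puts plain.query_key
```

Because the key falls back to the source name, two tokenizers on the same field would need distinct aliases to be addressable separately; that reading is an inference from the diff rather than documented behavior.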
@@ -336,7 +336,7 @@ module ParadeDB
    elsif entries.length == 1
      "#{source_ruby} #{bm25_tokenizer_config_ruby(entries.first)}"
    else
-     configs = entries.map { |e| bm25_tokenizer_config_ruby(e) }
+     configs = entries.map { |e| bm25_tokenizer_ruby_from_entry(e) }
      "#{source_ruby} { tokenizers: [#{configs.join(', ')}] }"
    end
  end
@@ -678,6 +678,10 @@ module ParadeDB
  end

  def bm25_tokenizer_config_ruby(entry)
+   "{ tokenizer: #{bm25_tokenizer_ruby_from_entry(entry)} }"
+ end
+
+ def bm25_tokenizer_ruby_from_entry(entry)
    opts = entry[:options].dup
    positional_args = Array(opts.delete(:__positional))
    alias_val = opts.delete(:alias)
@@ -685,16 +689,19 @@ module ParadeDB
    max_val = opts.delete(:max)
    positional_args = [min_val, max_val] + positional_args if min_val && max_val

-   parts = ["tokenizer: #{entry[:tokenizer].to_sym.inspect}"]
-   parts << "args: #{positional_args.inspect}" unless positional_args.empty?
-   parts << "alias: #{alias_val.inspect}" if alias_val
+   opts[:alias] = alias_val if alias_val
+
+   bm25_tokenizer_ruby(entry[:tokenizer], positional_args, opts)
+ end

-   unless opts.empty?
-     named_pairs = opts.map { |k, v| "#{k.inspect} => #{v.inspect}" }.join(", ")
-     parts << "named_args: { #{named_pairs} }"
+ def bm25_tokenizer_ruby(name, positional_args, options)
+   if name.match?(/\A[a-z_][a-z0-9_]*\z/) && Tokenizer.respond_to?(name)
+     args = positional_args.map { |arg| ruby_literal(arg) }
+     args << "options: #{ruby_hash_literal(options)}" unless options.empty?
+     return "Tokenizer.#{name}(#{args.join(', ')})"
    end

-   "{ #{parts.join(', ')} }"
+   "Tokenizer.new(#{name.inspect}, #{ruby_literal(positional_args.empty? ? nil : positional_args)}, #{ruby_literal(options.empty? ? nil : options)})"
  end

  def split_sql_arguments(args_sql)
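
Note on the three schema-dump hunks above: the dumper now emits a Ruby call expression (`Tokenizer.<name>(...)`) when a builder method exists for the tokenizer name, and falls back to `Tokenizer.new(...)` otherwise. The sketch below illustrates that branching only. `ruby_literal` and `ruby_hash_literal` are not shown in this diff, so `inspect` and a hand-rolled `hash_literal` are assumed stand-ins, and `KNOWN_BUILDERS` is a hypothetical replacement for the `Tokenizer.respond_to?(name)` check.

```ruby
# Hypothetical subset of builder names, standing in for Tokenizer.respond_to?(name).
KNOWN_BUILDERS = %w[simple whitespace ngram edge_ngram literal].freeze

# Assumed stand-in for the gem's ruby_hash_literal helper (symbol-like keys only).
def hash_literal(hash)
  pairs = hash.map { |key, value| "#{key}: #{value.inspect}" }
  "{ #{pairs.join(', ')} }"
end

def tokenizer_source(name, positional_args, options)
  if name.match?(/\A[a-z_][a-z0-9_]*\z/) && KNOWN_BUILDERS.include?(name)
    args = positional_args.map(&:inspect)
    args << "options: #{hash_literal(options)}" unless options.empty?
    return "Tokenizer.#{name}(#{args.join(', ')})"
  end

  # Fallback: spell out the generic constructor, using nil for empty parts.
  positional = positional_args.empty? ? "nil" : positional_args.inspect
  named = options.empty? ? "nil" : hash_literal(options)
  "Tokenizer.new(#{name.inspect}, #{positional}, #{named})"
end

puts tokenizer_source("ngram", [2, 5], { alias: "description_ngram" })
puts tokenizer_source("custom_tok", [], {})
```

The exact literal formatting in a real `schema.rb` depends on the gem's own helpers and may differ from this sketch.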
@@ -0,0 +1,95 @@
+ class Tokenizer
+   attr_reader :name, :positional_args, :options
+
+   def initialize(name, positional_args, options)
+     @name = name
+     @positional_args = positional_args
+     @options = options
+   end
+
+   def render()
+     if options.nil? && positional_args.nil?
+       return "pdb.#{name}"
+     end
+
+     args = []
+     if !positional_args.nil?
+       args.concat(positional_args.map { |x| render_positional_arg(x) })
+     end
+     if !options.nil?
+       args.concat(options.map {|k, v| quote_term("#{k}=#{v}")})
+     end
+
+     return "pdb.#{name}(#{args.join(",")})"
+   end
+
+   def self.whitespace(options: nil)
+     new("whitespace", nil, options)
+   end
+
+   def self.unicode_words(options: nil)
+     new("unicode_words", nil, options)
+   end
+
+   def self.ngram(min_gram, max_gram, options: nil)
+     new("ngram", [min_gram, max_gram], options)
+   end
+
+   def self.simple(options: nil)
+     new("simple", nil, options)
+   end
+
+   def self.literal(options: nil)
+     new("literal", nil, options)
+   end
+
+   def self.literal_normalized(options: nil)
+     new("literal_normalized", nil, options)
+   end
+
+   def self.edge_ngram(min_gram, max_gram, options: nil)
+     new("edge_ngram", [min_gram, max_gram], options)
+   end
+
+   def self.regex_pattern(pattern, options: nil)
+     new("regex_pattern", [pattern], options)
+   end
+
+   def self.chinese_compatible(options: nil)
+     new("chinese_compatible", nil, options)
+   end
+
+   def self.lindera(dictionary, options: nil)
+     new("lindera", [dictionary], options)
+   end
+
+   def self.icu(options: nil)
+     new("icu", nil, options)
+   end
+
+   def self.jieba(options: nil)
+     new("jieba", nil, options)
+   end
+
+   def self.source_code(options: nil)
+     new("source_code", nil, options)
+   end
+
+   private
+
+   def quote_term(value)
+     escaped = value.gsub("'", "''")
+     "'#{escaped}'"
+   end
+
+   def render_positional_arg(value)
+     case value
+     when true, false, Numeric
+       value.to_s
+     when String
+       quote_term(value)
+     else
+       raise InvalidArgumentError, "Unsupported tokenizer arg type: #{value.class}"
+     end
+   end
+ end
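
Note on the added class above: `render` turns a tokenizer object into a `pdb.<name>(...)` SQL fragment, emitting positional arguments first and then each option as a single-quoted `key=value` term with embedded quotes doubled. The sketch below mirrors that logic in a stand-in class so it can be run without the gem. `SketchTokenizer` is a hypothetical name, and it raises Ruby's built-in `ArgumentError` where the diff references `InvalidArgumentError`, a constant that is not part of Ruby core and is not defined in the hunks shown here.

```ruby
# Stand-in mirroring the render logic of the added Tokenizer class.
class SketchTokenizer
  attr_reader :name, :positional_args, :options

  def initialize(name, positional_args = nil, options = nil)
    @name = name
    @positional_args = positional_args
    @options = options
  end

  def render
    # A tokenizer with no arguments at all renders as a bare name.
    return "pdb.#{name}" if positional_args.nil? && options.nil?

    args = []
    args.concat(positional_args.map { |arg| render_positional_arg(arg) }) unless positional_args.nil?
    args.concat(options.map { |key, value| quote_term("#{key}=#{value}") }) unless options.nil?
    "pdb.#{name}(#{args.join(',')})"
  end

  private

  # Wraps a value in single quotes, doubling any embedded single quote.
  def quote_term(value)
    "'#{value.gsub("'", "''")}'"
  end

  def render_positional_arg(value)
    case value
    when true, false, Numeric then value.to_s
    when String then quote_term(value)
    else raise ArgumentError, "Unsupported tokenizer arg type: #{value.class}"
    end
  end
end

puts SketchTokenizer.new("simple").render
puts SketchTokenizer.new("ngram", [2, 5]).render
puts SketchTokenizer.new("simple", nil, { alias: "description_simple" }).render
```

The last call corresponds to the changelog's `Tokenizer.simple(options: {alias: "description_simple"})` example and renders as `pdb.simple('alias=description_simple')` under this logic.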
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module ParadeDB
-   VERSION = "0.6.0"
+   VERSION = "0.7.0"
  end
data/lib/parade_db.rb CHANGED
@@ -12,6 +12,7 @@ require_relative "parade_db/migration_helpers"
  require_relative "parade_db/model"
  require_relative "parade_db/search_methods"
  require_relative "parade_db/railtie"
+ require_relative "parade_db/tokenizer"

  module ParadeDB
    FacetQueryError = Errors::FacetQueryError
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: rails-paradedb
  version: !ruby/object:Gem::Version
-   version: 0.6.0
+   version: 0.7.0
  platform: ruby
  authors:
  - ParadeDB
@@ -113,6 +113,7 @@ files:
  - lib/parade_db/query.rb
  - lib/parade_db/railtie.rb
  - lib/parade_db/search_methods.rb
+ - lib/parade_db/tokenizer.rb
  - lib/parade_db/tokenizer_sql.rb
  - lib/parade_db/version.rb
  homepage: https://github.com/paradedb/rails-paradedb