rails-paradedb 0.6.0 → 0.7.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +8 -1
- data/README.md +16 -306
- data/lib/parade_db/arel/builder.rb +3 -4
- data/lib/parade_db/index.rb +17 -114
- data/lib/parade_db/migration_helpers.rb +15 -8
- data/lib/parade_db/tokenizer.rb +95 -0
- data/lib/parade_db/version.rb +1 -1
- data/lib/parade_db.rb +1 -0
- metadata +2 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: e28ebb1ec7734474a1ed321766b2e652f4aa1d56aadcc16d48667ac4ddbfc5bc
|
|
4
|
+
data.tar.gz: 51c25b12045eed272ad0e71475c22f9a670c4da6ff46fcad6d99ce32e3bc1f6e
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 4fced3ee50bef2ef4c59567246e9c5382a7e2e3f60293a785d33fe4e9e857abea685740fcf2c2da38c98dce4371d13010f9eacbbbe1c580df760208a20542f3d
|
|
7
|
+
data.tar.gz: 2e3efbf965bb99866dffe45de2963528ccf9ebfaa4a947abe5b72211306409d5b006eb1e90621cc6591e72f473d91593348b0d1bcaaa1e96cbc08cbe252e8615
|
data/CHANGELOG.md
CHANGED
|
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file. The format
|
|
|
4
4
|
|
|
5
5
|
## [Unreleased]
|
|
6
6
|
|
|
7
|
+
## [0.7.0] - 2026-04-21
|
|
8
|
+
|
|
9
|
+
### Changed
|
|
10
|
+
|
|
11
|
+
- **BREAKING**: Use function based approach for specifying tokenizers: `Tokenizer.simple(options: {alias: "description_simple"})`
|
|
12
|
+
|
|
7
13
|
## [0.6.0] - 2026-04-14
|
|
8
14
|
|
|
9
15
|
### Added
|
|
@@ -126,7 +132,8 @@ All notable changes to this project will be documented in this file. The format
|
|
|
126
132
|
- Schema dump/load round-trip for tokenizer configuration and index options
|
|
127
133
|
(including `target_segment_count`)
|
|
128
134
|
|
|
129
|
-
[Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.
|
|
135
|
+
[Unreleased]: https://github.com/paradedb/rails-paradedb/compare/v0.7.0...HEAD
|
|
136
|
+
[0.7.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.7.0
|
|
130
137
|
[0.6.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.6.0
|
|
131
138
|
[0.5.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.5.0
|
|
132
139
|
[0.4.0]: https://github.com/paradedb/rails-paradedb/releases/tag/v0.4.0
|
data/README.md
CHANGED
|
@@ -16,31 +16,21 @@
|
|
|
16
16
|
<a href="https://docs.paradedb.com/changelog/">Changelog</a>
|
|
17
17
|
</h3>
|
|
18
18
|
|
|
19
|
-
|
|
20
|
-
|
|
21
|
-
|
|
22
|
-
|
|
23
|
-
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
[](https://paradedb.com/slack)
|
|
29
|
-
[](https://x.com/paradedb)
|
|
19
|
+
<p align="center">
|
|
20
|
+
<a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/v/rails-paradedb" alt="Gem Version"></a>
|
|
21
|
+
<a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Frubygems.org%2Fapi%2Fv1%2Fversions%2Frails-paradedb.json&query=%24%5B0%5D.ruby_version&label=ruby&logo=ruby" alt="Ruby Requirement"></a>
|
|
22
|
+
<a href="https://rubygems.org/gems/rails-paradedb"><img src="https://img.shields.io/gem/dt/rails-paradedb" alt="Gem Downloads"></a>
|
|
23
|
+
<a href="https://codecov.io/gh/paradedb/rails-paradedb"><img src="https://codecov.io/gh/paradedb/rails-paradedb/graph/badge.svg" alt="Codecov"></a>
|
|
24
|
+
<a href="https://github.com/paradedb/rails-paradedb?tab=MIT-1-ov-file#readme"><img src="https://img.shields.io/github/license/paradedb/rails-paradedb?color=blue" alt="License"></a>
|
|
25
|
+
<a href="https://paradedb.com/slack"><img src="https://img.shields.io/badge/Join%20Slack-purple?logo=slack" alt="Community"></a>
|
|
26
|
+
<a href="https://x.com/paradedb"><img src="https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb" alt="Follow @paradedb"></a>
|
|
27
|
+
</p>
|
|
30
28
|
|
|
31
|
-
|
|
32
|
-
Use Elastic-quality full-text search, scoring, snippets, facets, and aggregations directly from Rails.
|
|
29
|
+
---
|
|
33
30
|
|
|
34
|
-
##
|
|
31
|
+
## ParadeDB for Rails
|
|
35
32
|
|
|
36
|
-
- BM25
|
|
37
|
-
- Chainable ActiveRecord search API (`matching_all`, `matching_any`, `term`, `phrase`, `regex`, `near`, `parse`, and more)
|
|
38
|
-
- Relevance and highlighting (`with_score`, `with_snippet`, `with_snippets`, `with_snippet_positions`)
|
|
39
|
-
- Facets and aggregations (`with_facets`, `facets`, `with_agg`, `facets_agg`, `aggregate_by`)
|
|
40
|
-
- More Like This similarity search (`more_like_this`)
|
|
41
|
-
- Arel integration for advanced query composition with native ParadeDB operators
|
|
42
|
-
- Diagnostics helpers and rake tasks for index health and verification
|
|
43
|
-
- Optional runtime index validation to detect missing/drifted BM25 indexes
|
|
33
|
+
The official ActiveRecord integration for [ParadeDB](https://paradedb.com), including first-class support for managing BM25 indexes and running queries using the full ParadeDB API. Follow the [getting started guide](https://docs.paradedb.com/documentation/getting-started/environment#rails) to begin.
|
|
44
34
|
|
|
45
35
|
## Requirements & Compatibility
|
|
46
36
|
|
|
@@ -51,294 +41,14 @@ Use Elastic-quality full-text search, scoring, snippets, facets, and aggregation
|
|
|
51
41
|
| ParadeDB | 0.22.0+ |
|
|
52
42
|
| PostgreSQL | 15+ (PostgreSQL adapter with ParadeDB extension) |
|
|
53
43
|
|
|
54
|
-
Notes:
|
|
55
|
-
|
|
56
|
-
- CI runs Ruby `3.2` through `4.0` across Rails `7.2` and `8.1` on PostgreSQL `18`.
|
|
57
|
-
- Schema compatibility is checked against every ParadeDB release.
|
|
58
|
-
- The maintained minimum ParadeDB version is `0.22.0`; update `README.md`, `RELEASE.md`, and CI in the same PR whenever that floor changes.
|
|
59
|
-
|
|
60
|
-
## Installation
|
|
61
|
-
|
|
62
|
-
```ruby
|
|
63
|
-
gem "rails-paradedb"
|
|
64
|
-
```
|
|
65
|
-
|
|
66
|
-
```bash
|
|
67
|
-
bundle install
|
|
68
|
-
```
|
|
69
|
-
|
|
70
|
-
## Quick Start
|
|
71
|
-
|
|
72
|
-
### Prerequisites
|
|
73
|
-
|
|
74
|
-
Make sure your Rails app uses PostgreSQL and that `pg_search` is installed in the target database:
|
|
75
|
-
|
|
76
|
-
```sql
|
|
77
|
-
CREATE EXTENSION IF NOT EXISTS pg_search;
|
|
78
|
-
```
|
|
79
|
-
|
|
80
|
-
### 1. Define Your Model and Index
|
|
81
|
-
|
|
82
|
-
```ruby
|
|
83
|
-
class MockItem < ActiveRecord::Base
|
|
84
|
-
include ParadeDB::Model
|
|
85
|
-
|
|
86
|
-
self.table_name = "mock_items"
|
|
87
|
-
self.primary_key = "id"
|
|
88
|
-
end
|
|
89
|
-
|
|
90
|
-
class MockItemIndex < ParadeDB::Index
|
|
91
|
-
self.table_name = :mock_items
|
|
92
|
-
self.key_field = :id
|
|
93
|
-
self.index_name = :search_idx
|
|
94
|
-
self.fields = {
|
|
95
|
-
id: nil,
|
|
96
|
-
description: nil,
|
|
97
|
-
category: nil,
|
|
98
|
-
rating: nil,
|
|
99
|
-
in_stock: nil,
|
|
100
|
-
created_at: nil,
|
|
101
|
-
metadata: nil,
|
|
102
|
-
weight_range: nil
|
|
103
|
-
}
|
|
104
|
-
end
|
|
105
|
-
```
|
|
106
|
-
|
|
107
|
-
### 2. Create the BM25 Index in a Migration
|
|
108
|
-
|
|
109
|
-
```ruby
|
|
110
|
-
class AddMockItemBm25Index < ActiveRecord::Migration[7.2] # use your app's migration version
|
|
111
|
-
def up
|
|
112
|
-
create_paradedb_index(MockItemIndex, if_not_exists: true)
|
|
113
|
-
end
|
|
114
|
-
|
|
115
|
-
def down
|
|
116
|
-
remove_bm25_index :mock_items, name: :search_idx, if_exists: true
|
|
117
|
-
end
|
|
118
|
-
end
|
|
119
|
-
```
|
|
120
|
-
|
|
121
|
-
### 3. Search
|
|
122
|
-
|
|
123
|
-
```ruby
|
|
124
|
-
MockItem.search(:description).matching_all("running shoes")
|
|
125
|
-
MockItem.search(:description).matching_any("wireless", "bluetooth")
|
|
126
|
-
MockItem.search(:description).term("electronics")
|
|
127
|
-
```
|
|
128
|
-
|
|
129
|
-
## Query API
|
|
130
|
-
|
|
131
|
-
```ruby
|
|
132
|
-
# Full text
|
|
133
|
-
MockItem.search(:description).matching_all("running shoes")
|
|
134
|
-
MockItem.search(:description).matching_any("wireless bluetooth")
|
|
135
|
-
|
|
136
|
-
# Query-time tokenizer override
|
|
137
|
-
MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace")
|
|
138
|
-
MockItem.search(:description).matching_any("running shoes", tokenizer: "whitespace('lowercase=false')")
|
|
139
|
-
|
|
140
|
-
# Fuzzy options on match/term
|
|
141
|
-
# Note: tokenizer overrides are mutually exclusive with fuzzy options.
|
|
142
|
-
MockItem.search(:description).matching_any("runing shose", distance: 1)
|
|
143
|
-
MockItem.search(:description).matching_all("runing", distance: 1, prefix: true)
|
|
144
|
-
MockItem.search(:description).term("shose", distance: 1, transposition_cost_one: true)
|
|
145
|
-
|
|
146
|
-
# Other query types
|
|
147
|
-
MockItem.search(:description).phrase("running shoes", slop: 2)
|
|
148
|
-
MockItem.search(:description).phrase("running shoes", tokenizer: "whitespace")
|
|
149
|
-
MockItem.search(:description).phrase(%w[running shoes])
|
|
150
|
-
MockItem.search(:description).regex("run.*")
|
|
151
|
-
MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"))
|
|
152
|
-
MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes", ordered: true))
|
|
153
|
-
MockItem.search(:description).near(ParadeDB.proximity("hiking", "running").within(2, "shoes"))
|
|
154
|
-
MockItem.search(:description).near(ParadeDB.proximity("running").within(2, "shoes", "sneakers", ordered: true))
|
|
155
|
-
MockItem.search(:description).near(ParadeDB.regex_term("run.*").within(3, "shoes"))
|
|
156
|
-
MockItem.search(:description).near(ParadeDB.proximity("trail").within(1, "running").within(1, "shoes"))
|
|
157
|
-
MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), boost: 2.0)
|
|
158
|
-
MockItem.search(:description).near(ParadeDB.proximity("running").within(3, "shoes"), const: 1.0)
|
|
159
|
-
MockItem.search(:description).regex_phrase("run.*", "shoes")
|
|
160
|
-
MockItem.search(:description).phrase_prefix("run", "sh", max_expansion: 100)
|
|
161
|
-
MockItem.search(:description).parse("running AND shoes", lenient: true)
|
|
162
|
-
|
|
163
|
-
# Match-all / exists / ranges
|
|
164
|
-
MockItem.search(:id).match_all
|
|
165
|
-
MockItem.search(:id).exists
|
|
166
|
-
MockItem.search(:rating).range(gte: 3, lt: 5)
|
|
167
|
-
MockItem.search(:weight_range).range_term("(10, 12]", relation: "Intersects")
|
|
168
|
-
|
|
169
|
-
# Similarity
|
|
170
|
-
MockItem.more_like_this(42, fields: [:description])
|
|
171
|
-
```
|
|
172
|
-
|
|
173
|
-
## Scoring and Highlighting
|
|
174
|
-
|
|
175
|
-
```ruby
|
|
176
|
-
results = MockItem.search(:description)
|
|
177
|
-
.matching_all("shoes")
|
|
178
|
-
.with_score
|
|
179
|
-
.order(search_score: :desc)
|
|
180
|
-
|
|
181
|
-
MockItem.search(:description)
|
|
182
|
-
.matching_all("shoes")
|
|
183
|
-
.with_snippet(:description, start_tag: "<b>", end_tag: "</b>", max_chars: 80)
|
|
184
|
-
|
|
185
|
-
MockItem.search(:description)
|
|
186
|
-
.matching_all("running")
|
|
187
|
-
.with_snippets(:description, max_chars: 15, limit: 2, offset: 0, sort_by: :position)
|
|
188
|
-
|
|
189
|
-
MockItem.search(:description)
|
|
190
|
-
.matching_all("running")
|
|
191
|
-
.with_snippet_positions(:description)
|
|
192
|
-
```
|
|
193
|
-
|
|
194
|
-
## Facets and Aggregations
|
|
195
|
-
|
|
196
|
-
```ruby
|
|
197
|
-
# Rows + facets (requires order + limit)
|
|
198
|
-
relation = MockItem.search(:description)
|
|
199
|
-
.matching_all("shoes")
|
|
200
|
-
.with_facets(:category, size: 10)
|
|
201
|
-
.order(:id)
|
|
202
|
-
.limit(10)
|
|
203
|
-
|
|
204
|
-
rows = relation.to_a
|
|
205
|
-
facets = relation.facets
|
|
206
|
-
|
|
207
|
-
# Facets-only aggregate
|
|
208
|
-
MockItem.search(:description).matching_all("shoes").facets(:category)
|
|
209
|
-
|
|
210
|
-
# Named aggregations
|
|
211
|
-
MockItem.search(:description).matching_all("shoes").facets_agg(
|
|
212
|
-
docs: ParadeDB::Aggregations.value_count(:id),
|
|
213
|
-
avg_rating: ParadeDB::Aggregations.avg(:rating)
|
|
214
|
-
)
|
|
215
|
-
|
|
216
|
-
# Window aggregations + rows
|
|
217
|
-
MockItem.search(:description).matching_all("shoes").with_agg(
|
|
218
|
-
exact: false,
|
|
219
|
-
docs: ParadeDB::Aggregations.value_count(:id),
|
|
220
|
-
stats: ParadeDB::Aggregations.stats(:rating)
|
|
221
|
-
).order(:id).limit(10)
|
|
222
|
-
|
|
223
|
-
# Grouped aggregations
|
|
224
|
-
MockItem.search(:id).match_all.aggregate_by(
|
|
225
|
-
:category,
|
|
226
|
-
docs: ParadeDB::Aggregations.value_count(:id)
|
|
227
|
-
)
|
|
228
|
-
```
|
|
229
|
-
|
|
230
|
-
If you group by text/JSON fields, index those fields using `:literal` or `:literal_normalized`.
|
|
231
|
-
|
|
232
|
-
## ActiveRecord and Arel Composition
|
|
233
|
-
|
|
234
|
-
Use ParadeDB conditions with normal ActiveRecord scopes:
|
|
235
|
-
|
|
236
|
-
```ruby
|
|
237
|
-
MockItem.search(:description)
|
|
238
|
-
.matching_all("shoes")
|
|
239
|
-
.where(in_stock: true)
|
|
240
|
-
.where(MockItem.arel_table[:rating].gteq(4))
|
|
241
|
-
.order(created_at: :desc)
|
|
242
|
-
```
|
|
243
|
-
|
|
244
|
-
For advanced SQL composition, ParadeDB operators are also available through Arel predications:
|
|
245
|
-
|
|
246
|
-
```ruby
|
|
247
|
-
t = MockItem.arel_table
|
|
248
|
-
MockItem.where(t[:description].pdb_match("running shoes"))
|
|
249
|
-
```
|
|
250
|
-
|
|
251
|
-
## Diagnostics Helpers
|
|
252
|
-
|
|
253
|
-
Ruby helpers:
|
|
254
|
-
|
|
255
|
-
```ruby
|
|
256
|
-
ParadeDB.paradedb_indexes
|
|
257
|
-
ParadeDB.paradedb_index_segments("search_idx")
|
|
258
|
-
ParadeDB.paradedb_verify_index("search_idx", sample_rate: 0.1)
|
|
259
|
-
ParadeDB.paradedb_verify_all_indexes(index_pattern: "search_idx")
|
|
260
|
-
```
|
|
261
|
-
|
|
262
|
-
Availability depends on the installed `pg_search` version.
|
|
263
|
-
|
|
264
|
-
Repository development tasks (from this repo's `Rakefile`):
|
|
265
|
-
|
|
266
|
-
```bash
|
|
267
|
-
rake paradedb:diagnostics:indexes
|
|
268
|
-
rake "paradedb:diagnostics:index_segments[search_idx]"
|
|
269
|
-
rake "paradedb:diagnostics:verify_index[search_idx]" SAMPLE_RATE=0.1
|
|
270
|
-
rake paradedb:diagnostics:verify_all_indexes INDEX_PATTERN=search_idx
|
|
271
|
-
```
|
|
272
|
-
|
|
273
|
-
## Index Validation
|
|
274
|
-
|
|
275
|
-
By default, index validation is disabled. You can enable runtime checks globally:
|
|
276
|
-
|
|
277
|
-
```ruby
|
|
278
|
-
# config/initializers/paradedb.rb
|
|
279
|
-
ParadeDB.index_validation_mode = :warn # :warn, :raise, or :off
|
|
280
|
-
```
|
|
281
|
-
|
|
282
|
-
When enabled, `rails-paradedb` validates that the expected BM25 index exists and can raise
|
|
283
|
-
`ParadeDB::IndexDriftError` or `ParadeDB::IndexClassNotFoundError` depending on mode.
|
|
284
|
-
|
|
285
|
-
## Common Errors
|
|
286
|
-
|
|
287
|
-
### "No search field set. Call .search(column) first."
|
|
288
|
-
|
|
289
|
-
```ruby
|
|
290
|
-
# ❌ Missing .search(...)
|
|
291
|
-
MockItem.matching_all("shoes")
|
|
292
|
-
|
|
293
|
-
# ✅ Start with .search(column)
|
|
294
|
-
MockItem.search(:description).matching_all("shoes")
|
|
295
|
-
```
|
|
296
|
-
|
|
297
|
-
### "with_facets requires ORDER BY and LIMIT"
|
|
298
|
-
|
|
299
|
-
```ruby
|
|
300
|
-
# ❌ Missing order/limit
|
|
301
|
-
MockItem.search(:description).matching_all("shoes").with_facets(:category).to_a
|
|
302
|
-
|
|
303
|
-
# ✅ Include both
|
|
304
|
-
relation = MockItem.search(:description)
|
|
305
|
-
.matching_all("shoes")
|
|
306
|
-
.with_facets(:category)
|
|
307
|
-
.order(:id)
|
|
308
|
-
.limit(10)
|
|
309
|
-
relation.to_a
|
|
310
|
-
relation.facets
|
|
311
|
-
```
|
|
312
|
-
|
|
313
|
-
### "search(:field) is not indexed"
|
|
314
|
-
|
|
315
|
-
```ruby
|
|
316
|
-
# ❌ Field not in your ParadeDB::Index fields hash
|
|
317
|
-
MockItem.search(:title).matching_all("shoes")
|
|
318
|
-
|
|
319
|
-
# ✅ Add :title to the index definition, then migrate
|
|
320
|
-
```
|
|
321
|
-
|
|
322
|
-
## Security
|
|
323
|
-
|
|
324
|
-
`rails-paradedb` builds SQL through Arel nodes and quoted literals (`Arel::Nodes.build_quoted`)
|
|
325
|
-
rather than manual string interpolation. Tokenizer expressions are validated and search operators are
|
|
326
|
-
rendered through typed nodes, with unit and integration coverage for quoting and edge cases.
|
|
327
|
-
|
|
328
44
|
## Examples
|
|
329
45
|
|
|
330
|
-
- [
|
|
46
|
+
- [Quickstart](examples/quickstart/quickstart.rb)
|
|
331
47
|
- [Faceted Search](examples/faceted_search/faceted_search.rb)
|
|
332
48
|
- [Autocomplete](examples/autocomplete/autocomplete.rb)
|
|
333
49
|
- [More Like This](examples/more_like_this/more_like_this.rb)
|
|
334
|
-
- [Hybrid RRF](examples/hybrid_rrf/hybrid_rrf.rb)
|
|
50
|
+
- [Hybrid Search (RRF)](examples/hybrid_rrf/hybrid_rrf.rb)
|
|
335
51
|
- [RAG](examples/rag/rag.rb)
|
|
336
|
-
- [Examples README](examples/README.md)
|
|
337
|
-
|
|
338
|
-
## Documentation
|
|
339
|
-
|
|
340
|
-
- **ParadeDB Official Docs**: <https://docs.paradedb.com>
|
|
341
|
-
- **ParadeDB Website**: <https://paradedb.com>
|
|
342
52
|
|
|
343
53
|
## Contributing
|
|
344
54
|
|
|
@@ -346,7 +56,7 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, test commands, lin
|
|
|
346
56
|
|
|
347
57
|
## Support
|
|
348
58
|
|
|
349
|
-
If you're missing a feature or found a bug, open a
|
|
59
|
+
If you're missing a feature or have found a bug, open a
|
|
350
60
|
[GitHub Issue](https://github.com/paradedb/rails-paradedb/issues/new/choose).
|
|
351
61
|
|
|
352
62
|
For community support:
|
|
@@ -366,4 +76,4 @@ We would like to thank the following members of the community for their valuable
|
|
|
366
76
|
|
|
367
77
|
## License
|
|
368
78
|
|
|
369
|
-
|
|
79
|
+
ParadeDB for Rails is licensed under the [MIT License](LICENSE).
|
data/lib/parade_db/arel/builder.rb
CHANGED
|
@@ -302,12 +302,11 @@ module ParadeDB
|
|
|
302
302
|
def apply_tokenizer(node, tokenizer)
|
|
303
303
|
return node if tokenizer.nil?
|
|
304
304
|
|
|
305
|
-
unless tokenizer.is_a?(
|
|
306
|
-
raise ArgumentError, "tokenizer must be a
|
|
305
|
+
unless tokenizer.is_a?(Tokenizer)
|
|
306
|
+
raise ArgumentError, "tokenizer must be a Tokenizer"
|
|
307
307
|
end
|
|
308
308
|
|
|
309
|
-
|
|
310
|
-
Nodes::TokenizerCast.new(node, normalized)
|
|
309
|
+
return Nodes::TokenizerCast.new(node, tokenizer.render())
|
|
311
310
|
end
|
|
312
311
|
|
|
313
312
|
def apply_slop(node, slop)
|
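The `apply_tokenizer` hunk above changes what a query-time tokenizer override may be: `nil` is passed through, a `Tokenizer` instance is rendered, and anything else raises `ArgumentError`. The following is a minimal standalone sketch of that guard, not the gem's code. `StubTokenizer` and `sketch_apply_tokenizer` are hypothetical names, and the returned pair stands in for `Nodes::TokenizerCast`, whose SQL output is not shown in this diff.

```ruby
# Hypothetical stand-in for the gem's Tokenizer class; it only knows how to
# render a bare tokenizer name in the "pdb.<name>" form seen in tokenizer.rb.
StubTokenizer = Struct.new(:name) do
  def render
    "pdb.#{name}"
  end
end

# Mirrors the guard in apply_tokenizer. The real method wraps the node in
# Nodes::TokenizerCast; this sketch returns a [node, rendered] pair instead.
def sketch_apply_tokenizer(node, tokenizer)
  return node if tokenizer.nil?

  unless tokenizer.is_a?(StubTokenizer)
    raise ArgumentError, "tokenizer must be a Tokenizer"
  end

  [node, tokenizer.render]
end

# A tokenizer object is accepted.
p sketch_apply_tokenizer("description", StubTokenizer.new("whitespace"))

# A plain String, as used in the removed 0.6.0 README examples, is rejected.
begin
  sketch_apply_tokenizer("description", "whitespace")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

If the gem behaves as the hunk suggests, callers that passed strings such as `tokenizer: "whitespace"` would need to switch to a tokenizer object when upgrading.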
data/lib/parade_db/index.rb
CHANGED
|
@@ -1,5 +1,7 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
|
+
require_relative "tokenizer"
|
|
4
|
+
|
|
3
5
|
module ParadeDB
|
|
4
6
|
class Index
|
|
5
7
|
class << self
|
|
@@ -47,117 +49,29 @@ module ParadeDB
|
|
|
47
49
|
|
|
48
50
|
class TokenizerParser
|
|
49
51
|
TOKENIZER_EXPRESSION = /\A[a-zA-Z_][a-zA-Z0-9_]*(?:(?:::|\.)[a-zA-Z_][a-zA-Z0-9_]*)*(?:\(\s*[a-zA-Z0-9_'".,=\s:-]*\s*\))?\z/.freeze
|
|
50
|
-
TOKENIZER_SINGLE_KEYS = %i[tokenizer
|
|
52
|
+
TOKENIZER_SINGLE_KEYS = %i[tokenizer alias].freeze
|
|
51
53
|
|
|
52
54
|
class << self
|
|
53
|
-
def parse(source_name,
|
|
54
|
-
|
|
55
|
-
|
|
56
|
-
[build_tokenized_entry(source_name, tokenizer_spec.to_s, {})]
|
|
57
|
-
when Hash
|
|
58
|
-
tokenizer_spec.map do |tokenizer, opts|
|
|
59
|
-
case opts
|
|
60
|
-
when Hash
|
|
61
|
-
build_tokenized_entry(source_name, tokenizer.to_s, normalize_options(opts))
|
|
62
|
-
when Symbol, String
|
|
63
|
-
build_tokenized_entry(source_name, tokenizer.to_s, normalize_positional_option(opts))
|
|
64
|
-
else
|
|
65
|
-
raise InvalidIndexDefinition,
|
|
66
|
-
"tokenizer options for #{source_name}.#{tokenizer} must be a Hash, Symbol, or String"
|
|
67
|
-
end
|
|
68
|
-
end
|
|
69
|
-
else
|
|
70
|
-
raise InvalidIndexDefinition,
|
|
71
|
-
"invalid tokenizer definition for #{source_name}: #{tokenizer_spec.inspect}"
|
|
72
|
-
end
|
|
73
|
-
end
|
|
74
|
-
|
|
75
|
-
private
|
|
76
|
-
|
|
77
|
-
def parse_structured_tokenizer_config(source_name, config, context:)
|
|
78
|
-
unless config.is_a?(Hash)
|
|
79
|
-
raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Hash"
|
|
55
|
+
def parse(source_name, tokenizer, context:)
|
|
56
|
+
unless tokenizer.is_a?(Tokenizer)
|
|
57
|
+
raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} must be a Tokenizer"
|
|
80
58
|
end
|
|
81
59
|
|
|
82
|
-
tokenizer = config[:tokenizer] || config["tokenizer"]
|
|
83
|
-
if tokenizer.nil?
|
|
84
|
-
raise InvalidIndexDefinition, "#{context} for #{source_name.inspect} requires :tokenizer"
|
|
85
|
-
end
|
|
86
|
-
|
|
87
|
-
tokenizer_name = tokenizer.to_s
|
|
88
|
-
validate_tokenizer_name!(source_name, tokenizer_name)
|
|
89
|
-
|
|
90
|
-
args = config[:args] || config["args"]
|
|
91
|
-
named_args = config[:named_args] || config["named_args"]
|
|
92
|
-
filters = config[:filters] || config["filters"]
|
|
93
|
-
stemmer = config[:stemmer] || config["stemmer"]
|
|
94
|
-
alias_name = config[:alias] || config["alias"]
|
|
95
|
-
|
|
96
60
|
options = {}
|
|
97
|
-
|
|
98
|
-
|
|
99
|
-
raise InvalidIndexDefinition, "args for #{source_name.inspect} must be an Array"
|
|
100
|
-
end
|
|
101
|
-
options[:__positional] = args.to_ary
|
|
102
|
-
end
|
|
103
|
-
|
|
104
|
-
if named_args
|
|
105
|
-
unless named_args.is_a?(Hash)
|
|
106
|
-
raise InvalidIndexDefinition, "named_args for #{source_name.inspect} must be a Hash"
|
|
107
|
-
end
|
|
108
|
-
named_args.each { |key, value| options[key.to_sym] = value }
|
|
109
|
-
end
|
|
110
|
-
|
|
111
|
-
if filters
|
|
112
|
-
unless filters.respond_to?(:to_ary)
|
|
113
|
-
raise InvalidIndexDefinition, "filters for #{source_name.inspect} must be an Array"
|
|
114
|
-
end
|
|
115
|
-
filters.to_ary.each do |name|
|
|
116
|
-
filter_key = name.to_s
|
|
117
|
-
if filter_key == "stemmer" && stemmer
|
|
118
|
-
options[:stemmer] = stemmer
|
|
119
|
-
else
|
|
120
|
-
key = filter_key.to_sym
|
|
121
|
-
options[key] = true unless options.key?(key)
|
|
122
|
-
end
|
|
123
|
-
end
|
|
124
|
-
end
|
|
125
|
-
|
|
126
|
-
options[:stemmer] = stemmer if stemmer && !options.key?(:stemmer)
|
|
127
|
-
options[:alias] = alias_name.to_s if alias_name
|
|
128
|
-
|
|
129
|
-
build_tokenized_entry(source_name, tokenizer_name, options)
|
|
130
|
-
end
|
|
131
|
-
|
|
132
|
-
def normalize_options(opts)
|
|
133
|
-
opts.each_with_object({}) do |(key, value), memo|
|
|
134
|
-
memo[key.to_sym] = value
|
|
135
|
-
end
|
|
136
|
-
end
|
|
137
|
-
|
|
138
|
-
def normalize_positional_option(option)
|
|
139
|
-
{ __positional: [option.to_s] }
|
|
140
|
-
end
|
|
61
|
+
options[:__positional] = tokenizer.positional_args.dup unless tokenizer.positional_args.nil?
|
|
62
|
+
tokenizer.options&.each { |key, value| options[key.to_sym] = value }
|
|
141
63
|
|
|
142
|
-
def build_tokenized_entry(source_name, tokenizer, options)
|
|
143
|
-
validate_tokenizer_name!(source_name, tokenizer) unless tokenizer.nil?
|
|
144
64
|
key = options[:alias]&.to_s || source_name
|
|
145
65
|
DefinitionCompiler::Entry.new(
|
|
146
66
|
source: source_name,
|
|
147
67
|
expression: expression?(source_name),
|
|
148
|
-
tokenizer: tokenizer,
|
|
68
|
+
tokenizer: tokenizer.name,
|
|
149
69
|
options: options,
|
|
150
70
|
query_key: key
|
|
151
71
|
)
|
|
152
72
|
end
|
|
153
73
|
|
|
154
|
-
|
|
155
|
-
return if TOKENIZER_EXPRESSION.match?(tokenizer)
|
|
156
|
-
|
|
157
|
-
raise InvalidIndexDefinition,
|
|
158
|
-
"invalid tokenizer name #{tokenizer.inspect} for #{source_name}. " \
|
|
159
|
-
"Expected identifier form like simple, pdb::simple, or pdb::ngram(2, 5, alias=field_alias)."
|
|
160
|
-
end
|
|
74
|
+
private
|
|
161
75
|
|
|
162
76
|
def expression?(value)
|
|
163
77
|
value.match?(/[^a-zA-Z0-9_]/)
|
|
@@ -260,33 +174,22 @@ module ParadeDB
|
|
|
260
174
|
elsif tokenizers
|
|
261
175
|
if single_tokenizer_keys_present
|
|
262
176
|
raise InvalidIndexDefinition,
|
|
263
|
-
"field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:
|
|
177
|
+
"field #{source_name.inspect} cannot mix :tokenizers with :tokenizer/:alias"
|
|
264
178
|
end
|
|
265
179
|
unless tokenizers.respond_to?(:to_ary) && !tokenizers.to_ary.empty?
|
|
266
180
|
raise InvalidIndexDefinition, "field #{source_name.inspect} :tokenizers must be a non-empty Array"
|
|
267
181
|
end
|
|
268
182
|
|
|
269
183
|
tokenizers.to_ary.each_with_index do |tokenizer_config, idx|
|
|
270
|
-
entry = TokenizerParser.
|
|
271
|
-
:parse_structured_tokenizer_config,
|
|
272
|
-
source_name,
|
|
273
|
-
tokenizer_config,
|
|
274
|
-
context: "tokenizers[#{idx}]"
|
|
275
|
-
)
|
|
184
|
+
entry = TokenizerParser.parse(source_name, tokenizer_config, context: "tokenizers[#{idx}]")
|
|
276
185
|
entries << entry
|
|
277
186
|
end
|
|
278
|
-
elsif
|
|
279
|
-
|
|
280
|
-
raise InvalidIndexDefinition,
|
|
281
|
-
"field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
|
|
282
|
-
end
|
|
283
|
-
entry = TokenizerParser.send(
|
|
284
|
-
:parse_structured_tokenizer_config,
|
|
285
|
-
source_name,
|
|
286
|
-
select_keys(normalized, TokenizerParser::TOKENIZER_SINGLE_KEYS),
|
|
287
|
-
context: "tokenizer config"
|
|
288
|
-
)
|
|
187
|
+
elsif normalized.key?(:tokenizer)
|
|
188
|
+
entry = TokenizerParser.parse(source_name, normalized[:tokenizer], context: "tokenizer")
|
|
289
189
|
entries << entry
|
|
190
|
+
elsif single_tokenizer_keys_present
|
|
191
|
+
raise InvalidIndexDefinition,
|
|
192
|
+
"field #{source_name.inspect} specifies tokenizer configuration but no :tokenizer"
|
|
290
193
|
else
|
|
291
194
|
entries << Entry.new(
|
|
292
195
|
source: source_name,
|
data/lib/parade_db/migration_helpers.rb
CHANGED
|
@@ -336,7 +336,7 @@ module ParadeDB
|
|
|
336
336
|
elsif entries.length == 1
|
|
337
337
|
"#{source_ruby} #{bm25_tokenizer_config_ruby(entries.first)}"
|
|
338
338
|
else
|
|
339
|
-
configs = entries.map { |e|
|
|
339
|
+
configs = entries.map { |e| bm25_tokenizer_ruby_from_entry(e) }
|
|
340
340
|
"#{source_ruby} { tokenizers: [#{configs.join(', ')}] }"
|
|
341
341
|
end
|
|
342
342
|
end
|
|
@@ -678,6 +678,10 @@ module ParadeDB
|
|
|
678
678
|
end
|
|
679
679
|
|
|
680
680
|
def bm25_tokenizer_config_ruby(entry)
|
|
681
|
+
"{ tokenizer: #{bm25_tokenizer_ruby_from_entry(entry)} }"
|
|
682
|
+
end
|
|
683
|
+
|
|
684
|
+
def bm25_tokenizer_ruby_from_entry(entry)
|
|
681
685
|
opts = entry[:options].dup
|
|
682
686
|
positional_args = Array(opts.delete(:__positional))
|
|
683
687
|
alias_val = opts.delete(:alias)
|
|
@@ -685,16 +689,19 @@ module ParadeDB
|
|
|
685
689
|
max_val = opts.delete(:max)
|
|
686
690
|
positional_args = [min_val, max_val] + positional_args if min_val && max_val
|
|
687
691
|
|
|
688
|
-
|
|
689
|
-
|
|
690
|
-
|
|
692
|
+
opts[:alias] = alias_val if alias_val
|
|
693
|
+
|
|
694
|
+
bm25_tokenizer_ruby(entry[:tokenizer], positional_args, opts)
|
|
695
|
+
end
|
|
691
696
|
|
|
692
|
-
|
|
693
|
-
|
|
694
|
-
|
|
697
|
+
def bm25_tokenizer_ruby(name, positional_args, options)
|
|
698
|
+
if name.match?(/\A[a-z_][a-z0-9_]*\z/) && Tokenizer.respond_to?(name)
|
|
699
|
+
args = positional_args.map { |arg| ruby_literal(arg) }
|
|
700
|
+
args << "options: #{ruby_hash_literal(options)}" unless options.empty?
|
|
701
|
+
return "Tokenizer.#{name}(#{args.join(', ')})"
|
|
695
702
|
end
|
|
696
703
|
|
|
697
|
-
"{ #{
|
|
704
|
+
"Tokenizer.new(#{name.inspect}, #{ruby_literal(positional_args.empty? ? nil : positional_args)}, #{ruby_literal(options.empty? ? nil : options)})"
|
|
698
705
|
end
|
|
699
706
|
|
|
700
707
|
def split_sql_arguments(args_sql)
|
data/lib/parade_db/tokenizer.rb
ADDED
|
@@ -0,0 +1,95 @@
|
|
|
1
|
+
class Tokenizer
|
|
2
|
+
attr_reader :name, :positional_args, :options
|
|
3
|
+
|
|
4
|
+
def initialize(name, positional_args, options)
|
|
5
|
+
@name = name
|
|
6
|
+
@positional_args = positional_args
|
|
7
|
+
@options = options
|
|
8
|
+
end
|
|
9
|
+
|
|
10
|
+
def render()
|
|
11
|
+
if options.nil? && positional_args.nil?
|
|
12
|
+
return "pdb.#{name}"
|
|
13
|
+
end
|
|
14
|
+
|
|
15
|
+
args = []
|
|
16
|
+
if !positional_args.nil?
|
|
17
|
+
args.concat(positional_args.map { |x| render_positional_arg(x) })
|
|
18
|
+
end
|
|
19
|
+
if !options.nil?
|
|
20
|
+
args.concat(options.map {|k, v| quote_term("#{k}=#{v}")})
|
|
21
|
+
end
|
|
22
|
+
|
|
23
|
+
return "pdb.#{name}(#{args.join(",")})"
|
|
24
|
+
end
|
|
25
|
+
|
|
26
|
+
def self.whitespace(options: nil)
|
|
27
|
+
new("whitespace", nil, options)
|
|
28
|
+
end
|
|
29
|
+
|
|
30
|
+
def self.unicode_words(options: nil)
|
|
31
|
+
new("unicode_words", nil, options)
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
def self.ngram(min_gram, max_gram, options: nil)
|
|
35
|
+
new("ngram", [min_gram, max_gram], options)
|
|
36
|
+
end
|
|
37
|
+
|
|
38
|
+
def self.simple(options: nil)
|
|
39
|
+
new("simple", nil, options)
|
|
40
|
+
end
|
|
41
|
+
|
|
42
|
+
def self.literal(options: nil)
|
|
43
|
+
new("literal", nil, options)
|
|
44
|
+
end
|
|
45
|
+
|
|
46
|
+
def self.literal_normalized(options: nil)
|
|
47
|
+
new("literal_normalized", nil, options)
|
|
48
|
+
end
|
|
49
|
+
|
|
50
|
+
def self.edge_ngram(min_gram, max_gram, options: nil)
|
|
51
|
+
new("edge_ngram", [min_gram, max_gram], options)
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
def self.regex_pattern(pattern, options: nil)
|
|
55
|
+
new("regex_pattern", [pattern], options)
|
|
56
|
+
end
|
|
57
|
+
|
|
58
|
+
def self.chinese_compatible(options: nil)
|
|
59
|
+
new("chinese_compatible", nil, options)
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def self.lindera(dictionary, options: nil)
|
|
63
|
+
new("lindera", [dictionary], options)
|
|
64
|
+
end
|
|
65
|
+
|
|
66
|
+
def self.icu(options: nil)
|
|
67
|
+
new("icu", nil, options)
|
|
68
|
+
end
|
|
69
|
+
|
|
70
|
+
def self.jieba(options: nil)
|
|
71
|
+
new("jieba", nil, options)
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
def self.source_code(options: nil)
|
|
75
|
+
new("source_code", nil, options)
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
private
|
|
79
|
+
|
|
80
|
+
def quote_term(value)
|
|
81
|
+
escaped = value.gsub("'", "''")
|
|
82
|
+
"'#{escaped}'"
|
|
83
|
+
end
|
|
84
|
+
|
|
85
|
+
def render_positional_arg(value)
|
|
86
|
+
case value
|
|
87
|
+
when true, false, Numeric
|
|
88
|
+
value.to_s
|
|
89
|
+
when String
|
|
90
|
+
quote_term(value)
|
|
91
|
+
else
|
|
92
|
+
raise InvalidArgumentError, "Unsupported tokenizer arg type: #{value.class}"
|
|
93
|
+
end
|
|
94
|
+
end
|
|
95
|
+
end
|
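To make the effect of the new function-based tokenizer API concrete, here is a small standalone sketch that mirrors the `render` logic from the tokenizer.rb hunk above. It does not load the rails-paradedb gem. `SketchTokenizer` is a hypothetical name, only two of the factory methods are reproduced, and the sketch raises the standard `ArgumentError` where the diff names `InvalidArgumentError`.

```ruby
# Standalone sketch modelled on the Tokenizer class added in 0.7.0.
class SketchTokenizer
  attr_reader :name, :positional_args, :options

  def initialize(name, positional_args = nil, options = nil)
    @name = name
    @positional_args = positional_args
    @options = options
  end

  # Factory helpers modelled on Tokenizer.simple and Tokenizer.ngram.
  def self.simple(options: nil)
    new("simple", nil, options)
  end

  def self.ngram(min_gram, max_gram, options: nil)
    new("ngram", [min_gram, max_gram], options)
  end

  # Renders "pdb.<name>" when there are no arguments at all, otherwise
  # "pdb.<name>(<positional args>,<quoted key=value options>)".
  def render
    return "pdb.#{name}" if positional_args.nil? && options.nil?

    args = []
    unless positional_args.nil?
      args.concat(positional_args.map { |arg| render_positional_arg(arg) })
    end
    unless options.nil?
      args.concat(options.map { |key, value| quote_term("#{key}=#{value}") })
    end
    "pdb.#{name}(#{args.join(',')})"
  end

  private

  # Wraps a value in single quotes, doubling any embedded single quote.
  def quote_term(value)
    escaped = value.gsub("'", "''")
    "'#{escaped}'"
  end

  # Booleans and numbers are emitted bare; strings are quoted.
  def render_positional_arg(value)
    case value
    when true, false, Numeric then value.to_s
    when String then quote_term(value)
    else raise ArgumentError, "Unsupported tokenizer arg type: #{value.class}"
    end
  end
end

puts SketchTokenizer.simple.render
puts SketchTokenizer.simple(options: { alias: "description_simple" }).render
puts SketchTokenizer.ngram(2, 5, options: { alias: "description_ngram" }).render
```

Under these assumptions, each option becomes one quoted `key=value` argument, which matches the `Tokenizer.simple(options: {alias: "description_simple"})` form named in the CHANGELOG entry.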
data/lib/parade_db/version.rb
CHANGED
data/lib/parade_db.rb
CHANGED
|
@@ -12,6 +12,7 @@ require_relative "parade_db/migration_helpers"
|
|
|
12
12
|
require_relative "parade_db/model"
|
|
13
13
|
require_relative "parade_db/search_methods"
|
|
14
14
|
require_relative "parade_db/railtie"
|
|
15
|
+
require_relative "parade_db/tokenizer"
|
|
15
16
|
|
|
16
17
|
module ParadeDB
|
|
17
18
|
FacetQueryError = Errors::FacetQueryError
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: rails-paradedb
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.
|
|
4
|
+
version: 0.7.0
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- ParadeDB
|
|
@@ -113,6 +113,7 @@ files:
|
|
|
113
113
|
- lib/parade_db/query.rb
|
|
114
114
|
- lib/parade_db/railtie.rb
|
|
115
115
|
- lib/parade_db/search_methods.rb
|
|
116
|
+
- lib/parade_db/tokenizer.rb
|
|
116
117
|
- lib/parade_db/tokenizer_sql.rb
|
|
117
118
|
- lib/parade_db/version.rb
|
|
118
119
|
homepage: https://github.com/paradedb/rails-paradedb
|