chewy 7.6.0 → 8.0.0

Files changed (73)
  1. checksums.yaml +4 -4
  2. data/.github/CODEOWNERS +1 -1
  3. data/.github/dependabot.yml +2 -2
  4. data/.github/workflows/ruby.yml +11 -10
  5. data/.rubocop.yml +1 -1
  6. data/.rubocop_todo.yml +132 -39
  7. data/CHANGELOG.md +18 -1
  8. data/CONTRIBUTING.md +1 -1
  9. data/LICENSE.txt +1 -1
  10. data/README.md +50 -1125
  11. data/chewy.gemspec +3 -2
  12. data/docker-compose.yml +14 -0
  13. data/docs/README.md +16 -0
  14. data/docs/configuration.md +440 -0
  15. data/docs/import.md +122 -0
  16. data/docs/indexing.md +329 -0
  17. data/docs/querying.md +72 -0
  18. data/docs/rake_tasks.md +108 -0
  19. data/docs/testing.md +41 -0
  20. data/docs/troubleshooting.md +101 -0
  21. data/gemfiles/base.gemfile +3 -3
  22. data/gemfiles/{rails.6.1.activerecord.gemfile → rails.7.2.activerecord.gemfile} +3 -3
  23. data/gemfiles/{rails.7.0.activerecord.gemfile → rails.8.0.activerecord.gemfile} +3 -3
  24. data/lib/chewy/config.rb +2 -2
  25. data/lib/chewy/errors.rb +3 -0
  26. data/lib/chewy/fields/root.rb +1 -1
  27. data/lib/chewy/index/actions.rb +5 -5
  28. data/lib/chewy/index/aliases.rb +1 -1
  29. data/lib/chewy/index/syncer.rb +5 -5
  30. data/lib/chewy/minitest/helpers.rb +1 -1
  31. data/lib/chewy/search/request.rb +4 -4
  32. data/lib/chewy/search/response.rb +7 -0
  33. data/lib/chewy/search/scrolling.rb +2 -1
  34. data/lib/chewy/strategy/delayed_sidekiq/worker.rb +1 -1
  35. data/lib/chewy/version.rb +1 -1
  36. data/lib/chewy.rb +4 -0
  37. data/migration_guide.md +1 -1
  38. data/spec/chewy/config_spec.rb +13 -14
  39. data/spec/chewy/elastic_client_spec.rb +1 -1
  40. data/spec/chewy/fields/base_spec.rb +2 -2
  41. data/spec/chewy/fields/time_fields_spec.rb +1 -1
  42. data/spec/chewy/index/actions_spec.rb +9 -70
  43. data/spec/chewy/index/aliases_spec.rb +1 -1
  44. data/spec/chewy/index/import/bulk_builder_spec.rb +2 -2
  45. data/spec/chewy/index/import/bulk_request_spec.rb +1 -1
  46. data/spec/chewy/index/import/routine_spec.rb +1 -1
  47. data/spec/chewy/index/import_spec.rb +15 -15
  48. data/spec/chewy/index/observe/callback_spec.rb +1 -1
  49. data/spec/chewy/index/specification_spec.rb +1 -4
  50. data/spec/chewy/index/syncer_spec.rb +1 -1
  51. data/spec/chewy/index_spec.rb +1 -1
  52. data/spec/chewy/journal_spec.rb +2 -2
  53. data/spec/chewy/minitest/helpers_spec.rb +2 -6
  54. data/spec/chewy/multi_search_spec.rb +1 -1
  55. data/spec/chewy/rake_helper_spec.rb +1 -1
  56. data/spec/chewy/repository_spec.rb +4 -4
  57. data/spec/chewy/rspec/update_index_spec.rb +2 -2
  58. data/spec/chewy/runtime_spec.rb +2 -2
  59. data/spec/chewy/search/loader_spec.rb +1 -1
  60. data/spec/chewy/search/pagination/kaminari_examples.rb +1 -1
  61. data/spec/chewy/search/query_proxy_spec.rb +0 -24
  62. data/spec/chewy/search/request_spec.rb +7 -3
  63. data/spec/chewy/search/response_spec.rb +2 -24
  64. data/spec/chewy/search/scrolling_spec.rb +1 -1
  65. data/spec/chewy/search_spec.rb +1 -1
  66. data/spec/chewy/stash_spec.rb +1 -1
  67. data/spec/chewy/strategy/delayed_sidekiq_spec.rb +27 -10
  68. data/spec/chewy/strategy_spec.rb +1 -1
  69. data/spec/chewy_spec.rb +5 -22
  70. data/spec/spec_helper.rb +26 -0
  71. data/spec/support/active_record.rb +35 -4
  72. metadata +22 -17
  73. data/gemfiles/rails.7.1.activerecord.gemfile +0 -14
data/chewy.gemspec CHANGED
@@ -11,13 +11,14 @@ Gem::Specification.new do |spec|
11
11
  spec.description = 'Chewy provides functionality for Elasticsearch index handling, documents import mappings and chainable query DSL'
12
12
  spec.homepage = 'https://github.com/toptal/chewy'
13
13
  spec.license = 'MIT'
14
+ spec.required_ruby_version = '~> 3.2'
14
15
 
15
16
  spec.files = `git ls-files`.split($RS)
16
17
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
17
18
  spec.require_paths = ['lib']
18
19
 
19
- spec.add_dependency 'activesupport', '>= 5.2' # Remove with major version bump, 8.x
20
- spec.add_dependency 'elasticsearch', '>= 7.14.0', '< 8'
20
+ spec.add_dependency 'activesupport', '>= 7.2'
21
+ spec.add_dependency 'elasticsearch', '>= 8.14', '< 9.0'
21
22
  spec.add_dependency 'elasticsearch-dsl'
22
23
  spec.metadata['rubygems_mfa_required'] = 'true'
23
24
  end
data/docker-compose.yml ADDED
@@ -0,0 +1,14 @@
1
+ services:
2
+   elasticsearch_test:
3
+     image: "elasticsearch:8.15.0"
4
+     environment:
5
+       - bootstrap.memory_lock=${ES_MEMORY_LOCK:-false}
6
+       - "ES_JAVA_OPTS=-Xms${TEST_ES_HEAP_SIZE:-500m} -Xmx${TEST_ES_HEAP_SIZE:-500m}"
7
+       - discovery.type=single-node
8
+       - xpack.security.enabled=false
9
+     ports:
10
+       - "127.0.0.1:9250:9200"
11
+     ulimits:
12
+       nofile:
13
+         soft: 65536
14
+         hard: 65536
data/docs/README.md ADDED
@@ -0,0 +1,16 @@
1
+ # Chewy Documentation
2
+
3
+ Getting started — see the main [README](../README.md).
4
+
5
+ ## Reference
6
+
7
+ - [Configuration](configuration.md) — client settings, update strategies, notifications, integrations
8
+ - [Indexing](indexing.md) — index definition, field types, crutches, witchcraft, index manipulation
9
+ - [Import](import.md) — import options, raw import, journaling
10
+ - [Querying](querying.md) — search requests, pagination, scopes, scroll, loading
11
+
12
+ ## Operations
13
+
14
+ - [Rake Tasks](rake_tasks.md) — reindexing, syncing, journal management, parallelization
15
+ - [Testing](testing.md) — RSpec, Minitest, DatabaseCleaner
16
+ - [Troubleshooting](troubleshooting.md) — common errors, debugging imports, ES 8 gotchas
data/docs/configuration.md ADDED
@@ -0,0 +1,440 @@
1
+ # Configuration
2
+
3
+ ## Client settings
4
+
5
+ To configure the Chewy client, add a `chewy.rb` initializer that sets the `Chewy.settings` hash:
6
+
7
+ ```ruby
8
+ # config/initializers/chewy.rb
9
+ Chewy.settings = {host: 'localhost:9250'} # do not use environments
10
+ ```
11
+
12
+ Then add a `chewy.yml` configuration file.
13
+
14
+ You can create `chewy.yml` manually or run `rails g chewy:install` to generate it:
15
+
16
+ ```yaml
17
+ # config/chewy.yml
18
+ # separate environment configs
19
+ test:
20
+   host: 'localhost:9250'
21
+   prefix: 'test'
22
+ development:
23
+   host: 'localhost:9200'
24
+ ```
25
+
26
+ The resulting config merges both hashes. Client options are passed as is to `Elasticsearch::Client`, except for `:prefix`, which Chewy uses internally to create prefixed index names:
27
+
28
+ ```ruby
29
+ Chewy.settings = {prefix: 'test'}
30
+ UsersIndex.index_name # => 'test_users'
31
+ ```
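The merge and the prefix handling described above can be pictured with plain hashes. This is only a sketch: `effective_settings` and `prefixed_index_name` are hypothetical helpers, not Chewy API, and the sketch assumes that values from the initializer win when both sources define the same key.

```ruby
# Hypothetical helper (not part of Chewy): combine the environment section of
# chewy.yml with the hash assigned to Chewy.settings in the initializer.
# Assumption: initializer values take precedence on conflicting keys.
def effective_settings(yaml_settings, initializer_settings)
  yaml_settings.merge(initializer_settings)
end

# Hypothetical helper: build an index name the way the prefix example shows.
# Without a prefix the base name is returned unchanged.
def prefixed_index_name(base_name, settings)
  [settings[:prefix], base_name].compact.join('_')
end

yaml_settings = {host: 'localhost:9250', prefix: 'test'}
initializer_settings = {host: 'localhost:9200'}

settings = effective_settings(yaml_settings, initializer_settings)
puts settings.inspect
puts prefixed_index_name('users', settings)
```

Keeping the prefix in the per-environment YAML section is one way to make test indices distinguishable from development ones on a shared cluster.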
32
+
33
+ The logger may be set explicitly:
34
+
35
+ ```ruby
36
+ Chewy.logger = Logger.new(STDOUT)
37
+ ```
38
+
39
+ See [config.rb](../lib/chewy/config.rb) for more details.
40
+
41
+ ### AWS Elasticsearch
42
+
43
+ To use AWS's Elasticsearch with an IAM user policy, you need to sign your requests for the `es:*` action. Do this by passing a proc to `transport_options` that injects the appropriate headers.
44
+ You'll need an additional gem for Faraday middleware: add `gem 'faraday_middleware-aws-sigv4'` to your Gemfile.
45
+
46
+ ```ruby
47
+ require 'faraday_middleware/aws_sigv4'
48
+
49
+ Chewy.settings = {
50
+ host: 'http://my-es-instance-on-aws.us-east-1.es.amazonaws.com:80',
51
+ port: 80, # 443 for https host
52
+ transport_options: {
53
+ headers: { content_type: 'application/json' },
54
+ proc: -> (f) do
55
+ f.request :aws_sigv4,
56
+ service: 'es',
57
+ region: 'us-east-1',
58
+ access_key_id: ENV['AWS_ACCESS_KEY'],
59
+ secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
60
+ end
61
+ }
62
+ }
63
+ ```
64
+
65
+ ## Elasticsearch client options
66
+
67
+ All connection options, except `:prefix`, are passed to `Elasticsearch::Client.new` ([chewy/lib/chewy.rb](https://github.com/toptal/chewy/blob/f5bad9f83c21416ac10590f6f34009c645062e89/lib/chewy.rb#L153-L160)).
68
+
69
+ Here's the relevant Elasticsearch documentation on the subject: https://rubydoc.info/gems/elasticsearch-transport#setting-hosts
70
+
71
+ ## Index update strategies
72
+
73
+ Assume you've got the following code (see [indexing.md](indexing.md#index-definition) for the full `update_index` DSL):
74
+
75
+ ```ruby
76
+ class City < ActiveRecord::Base
77
+ update_index 'cities', :self
78
+ end
79
+
80
+ class CitiesIndex < Chewy::Index
81
+ index_scope City
82
+ field :name
83
+ end
84
+ ```
85
+
86
+ If you do something like `City.first.save!`, you'll get an `UndefinedUpdateStrategy` exception instead of the object being saved and the index being updated. This exception forces you to choose an appropriate update strategy for the current context.
87
+
88
+ If you want to return to the pre-0.7.0 behavior, set `Chewy.root_strategy = :bypass`.
89
+
90
+ ### `:atomic`
91
+
92
+ The main strategy here is `:atomic`. Assume you have to update a lot of records in the db.
93
+
94
+ ```ruby
95
+ Chewy.strategy(:atomic) do
96
+ City.popular.map(&:do_some_update_action!)
97
+ end
98
+ ```
99
+
100
+ Using this strategy delays the index update request until the end of the block. Updated records are aggregated and the index is updated through the bulk API, which keeps the number of requests low.
101
+
102
+ ### `:sidekiq`
103
+
104
+ This does the same thing as `:atomic`, but asynchronously using Sidekiq. Patch `Chewy::Strategy::Sidekiq::Worker` if you need to improve how index updates are performed.
105
+
106
+ ```ruby
107
+ Chewy.strategy(:sidekiq) do
108
+ City.popular.map(&:do_some_update_action!)
109
+ end
110
+ ```
111
+
112
+ The default queue name is `chewy`. You can customize it with the `queue` key of the `sidekiq` settings:
113
+ ```ruby
114
+ Chewy.settings[:sidekiq] = {queue: :low}
115
+ ```
116
+
117
+ ### `:lazy_sidekiq`
118
+
119
+ This does the same thing as `:sidekiq`, but with lazy evaluation. Beware that it does not allow you to use any non-persistent record state for indices and conditions, because the record is re-fetched from the database asynchronously by Sidekiq. When records are destroyed, however, the strategy falls back to `:sidekiq`, because deleted records cannot be re-fetched from the database.
120
+
121
+ The purpose of this strategy is to improve the response time of code that updates indexes: it defers to a background job not only the actual ES calls but also the evaluation of `update_index` callbacks (for created and updated objects). As with `:sidekiq`, the index update is asynchronous, so this strategy cannot be used when data and index must stay synchronized.
122
+
123
+ ```ruby
124
+ Chewy.strategy(:lazy_sidekiq) do
125
+ City.popular.map(&:do_some_update_action!)
126
+ end
127
+ ```
128
+
129
+ The default queue name is `chewy`. You can customize it with the `queue` key of the `sidekiq` settings:
130
+ ```ruby
131
+ Chewy.settings[:sidekiq] = {queue: :low}
132
+ ```
133
+
134
+ ### `:delayed_sidekiq`
135
+
136
+ This strategy accumulates the IDs of records to be reindexed in Redis during the latency window and then reindexes all accumulated records at once.
137
+ This strategy is very useful in the case of frequently mutated records.
138
+ It supports the `update_fields` option, so it will attempt to select just enough data from the database.
139
+
140
+ Keep in mind, this strategy does not guarantee reindexing in the event of Sidekiq worker termination or an error during the reindexing phase.
141
+ This behavior is intentional to prevent continuous growth of Redis db.
142
+
143
+ There are four options that can be defined in the index:
144
+ ```ruby
145
+ class CitiesIndex...
146
+ strategy_config delayed_sidekiq: {
147
+ latency: 3,
148
+ margin: 2,
149
+ ttl: 60 * 60 * 24,
150
+ reindex_wrapper: ->(&reindex) {
151
+ ActiveRecord::Base.connected_to(role: :reading) { reindex.call }
152
+ }
153
+ # latency - will prevent scheduling identical jobs
154
+ # margin - main purpose is to cover db replication lag by the margin
155
+ # ttl - a chunk expiration time (in seconds)
156
+ # reindex_wrapper - lambda that accepts a block; use it to wrap the reindex process, e.g. in an AR connection block
157
+ }
158
+
159
+ ...
160
+ end
161
+ ```
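The interplay of `latency` and `margin` can be illustrated with a small in-memory model. This is a simplified sketch, not the gem's implementation: `WindowAccumulator` and its methods are hypothetical names, a plain Hash stands in for Redis, and the sketch assumes that window boundaries are aligned to multiples of `latency`.

```ruby
# Simplified, in-memory illustration of a latency window (hypothetical class;
# the real strategy keeps accumulated IDs in Redis and runs jobs in Sidekiq).
class WindowAccumulator
  def initialize(latency:, margin:)
    @latency = latency
    @margin = margin
    @chunks = Hash.new { |hash, key| hash[key] = [] }
  end

  # Assumption: window boundaries are aligned to multiples of `latency`.
  def window_end(now)
    target = now + @latency
    target - (target % @latency)
  end

  # Returns the time at which a job should run if this call opened a new
  # window, or nil if a job for that window has already been scheduled.
  def enqueue(ids, now:)
    key = window_end(now)
    first_in_window = !@chunks.key?(key)
    @chunks[key] |= ids
    first_in_window ? key + @margin : nil
  end

  # IDs accumulated for the window; deleting them mimics a one-shot reindex.
  def drain(key)
    @chunks.delete(key) || []
  end
end

acc = WindowAccumulator.new(latency: 3, margin: 2)
first = acc.enqueue([1, 2], now: 100)  # opens a new window and schedules a job
second = acc.enqueue([2, 3], now: 101) # falls into the same window, no new job
```

In this model the two calls share one window, so only one job is scheduled, and it runs `margin` seconds after the window closes to leave room for replication lag.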
162
+
163
+ You can also define defaults in `initializers/chewy.rb`:
164
+ ```ruby
165
+ Chewy.settings = {
166
+ strategy_config: {
167
+ delayed_sidekiq: {
168
+ latency: 3,
169
+ margin: 2,
170
+ ttl: 60 * 60 * 24,
171
+ reindex_wrapper: ->(&reindex) {
172
+ ActiveRecord::Base.connected_to(role: :reading) { reindex.call }
173
+ }
174
+ }
175
+ }
176
+ }
177
+
178
+ ```
179
+ or in `config/chewy.yml`
180
+ ```yaml
181
+ strategy_config:
182
+   delayed_sidekiq:
183
+     latency: 3
184
+     margin: 2
185
+     ttl: <%= 60 * 60 * 24 %>
186
+     # reindex_wrapper setting is not possible here!!! use the initializer instead
187
+ ```
188
+
189
+ You can use the strategy identically to other strategies
190
+ ```ruby
191
+ Chewy.strategy(:delayed_sidekiq) do
192
+ City.popular.map(&:do_some_update_action!)
193
+ end
194
+ ```
195
+
196
+ The default queue name is `chewy`. You can customize it with the `queue` key of the `sidekiq` settings:
197
+ ```ruby
198
+ Chewy.settings[:sidekiq] = {queue: :low}
199
+ ```
200
+
201
+ Explicit call of the reindex using the `:delayed_sidekiq` strategy
202
+ ```ruby
203
+ CitiesIndex.import([1, 2, 3], strategy: :delayed_sidekiq)
204
+ ```
205
+
206
+ Explicit call of the reindex using `:delayed_sidekiq` strategy with `:update_fields` support
207
+ ```ruby
208
+ CitiesIndex.import([1, 2, 3], update_fields: [:name], strategy: :delayed_sidekiq)
209
+ ```
210
+
211
+ If you run tests with the `:delayed_sidekiq` strategy while Sidekiq uses a real Redis instance that is NOT cleaned up between tests (e.g. via `Sidekiq.redis(&:flushdb)`), you'll want to clean up some Redis keys between tests to avoid leaking state and flaky tests. Chewy provides a convenience method for that:
212
+ ```ruby
213
+ # it might be a good idea to also add this to your testing setup, e.g. an RSpec `before` hook
214
+ Chewy::Strategy::DelayedSidekiq.clear_timechunks!
215
+ ```
216
+
217
+ ### `:active_job`
218
+
219
+ This does the same thing as `:atomic`, but using ActiveJob. It inherits the ActiveJob configuration settings, including the `active_job.queue_adapter` setting for the environment. Patch `Chewy::Strategy::ActiveJob::Worker` if you need to improve how index updates are performed.
220
+
221
+ ```ruby
222
+ Chewy.strategy(:active_job) do
223
+ City.popular.map(&:do_some_update_action!)
224
+ end
225
+ ```
226
+
227
+ The default queue name is `chewy`. You can customize it with the `queue` key of the `active_job` settings:
228
+ ```ruby
229
+ Chewy.settings[:active_job] = {queue: :low}
230
+ ```
231
+
232
+ ### `:urgent`
233
+
234
+ The following strategy is convenient if you are going to update documents in your index one by one.
235
+
236
+ ```ruby
237
+ Chewy.strategy(:urgent) do
238
+ City.popular.map(&:do_some_update_action!)
239
+ end
240
+ ```
241
+
242
+ This code will perform `City.popular.count` requests to update ES documents.
243
+
244
+ It is convenient for use in e.g. the Rails console with non-block notation:
245
+
246
+ ```ruby
247
+ > Chewy.strategy(:urgent)
248
+ > City.popular.map(&:do_some_update_action!)
249
+ ```
250
+
251
+ ### `:bypass`
252
+
253
+ When the bypass strategy is active the index will not be automatically updated on object save.
254
+
255
+ For example, on `City.first.save!` the cities index would not be updated.
256
+
257
+ ### Nesting
258
+
259
+ Strategies are designed to allow nesting, so it is possible to redefine the strategy for nested contexts.
260
+
261
+ ```ruby
262
+ Chewy.strategy(:atomic) do
263
+ city1.do_update!
264
+ Chewy.strategy(:urgent) do
265
+ city2.do_update!
266
+ city3.do_update!
267
+ # there will be 2 update index requests for city2 and city3
268
+ end
269
+ city4.do_update!
270
+ # city1 and city4 will be grouped in one index update request
271
+ end
272
+ ```
273
+
274
+ ### Non-block notation
275
+
276
+ It is possible to nest strategies without blocks:
277
+
278
+ ```ruby
279
+ Chewy.strategy(:urgent)
280
+ city1.do_update! # index updated
281
+ Chewy.strategy(:bypass)
282
+ city2.do_update! # update bypassed
283
+ Chewy.strategy.pop
284
+ city3.do_update! # index updated again
285
+ ```
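The block and non-block notations above behave like pushes and pops on a stack. The following toy model illustrates that idea only; it is not Chewy's implementation, `StrategyStack` and its methods are hypothetical names, and an "index update" is reduced to a recorded pair.

```ruby
# Toy model of nested strategies (hypothetical, not Chewy's implementation).
class StrategyStack
  attr_reader :log

  def initialize(root)
    @stack = [root]
    @log = []
  end

  def current
    @stack.last
  end

  # Block form restores the previous strategy when the block exits;
  # non-block form leaves the new strategy active until #pop is called.
  def push(name)
    @stack.push(name)
    return unless block_given?

    begin
      yield
    ensure
      pop
    end
  end

  def pop
    raise 'cannot pop the root strategy' if @stack.size == 1

    @stack.pop
  end

  # Record which strategy handled the update of the given record.
  def update(record)
    @log << [record, current]
  end
end

stack = StrategyStack.new(:bypass)
stack.push(:urgent)
stack.update(:city1)
stack.push(:bypass)
stack.update(:city2)
stack.pop
stack.update(:city3)
```

The `ensure` in the block form is the important design choice: the previous strategy comes back even if the block raises.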
286
+
287
+ ### Designing your own strategies
288
+
289
+ See [strategy/base.rb](../lib/chewy/strategy/base.rb) for more details. See [strategy/atomic.rb](../lib/chewy/strategy/atomic.rb) for an example.
290
+
291
+ ## Rails application strategies integration
292
+
293
+ There are a couple of predefined strategies for your Rails application. Initially, the Rails console uses the `:urgent` strategy by default, except in the sandbox case. When you are running sandbox it switches to the `:bypass` strategy to avoid polluting the index.
294
+
295
+ Migrations are wrapped with the `:bypass` strategy. Because the main behavior implies that indices are reset after migration, there is no need for extra index updates. Also indexing might be broken during migrations because of the outdated schema.
296
+
297
+ Controller actions are wrapped with the strategy configured in `Chewy.request_strategy`, which defaults to `:atomic`. This is done at the middleware level to reduce the number of index update requests inside actions.
298
+
299
+ It is also a good idea to set up the `:bypass` strategy inside your test suite, import objects manually only when needed, and use `Chewy.massacre` to flush test ES indices before every example. This minimizes unnecessary ES requests and reduces overhead.
300
+
301
+ Deprecation note: since Elasticsearch 8, wildcard deletion of indices is disabled by default. You can enable it for a cluster by setting `action.destructive_requires_name` to `false`.
302
+
303
+ ```ruby
304
+ RSpec.configure do |config|
305
+ config.before(:suite) do
306
+ Chewy.strategy(:bypass)
307
+ end
308
+ end
309
+ ```
310
+
311
+ See [testing.md](testing.md) for more on RSpec/Minitest integration and the DatabaseCleaner caveat.
312
+
313
+ ## `ActiveSupport::Notifications` support
314
+
315
+ Chewy emits the following events:
316
+
317
+ ### `search_query.chewy` payload
318
+
319
+ * `payload[:index]`: requested index class
320
+ * `payload[:request]`: request hash
321
+
322
+ ### `import_objects.chewy` payload
323
+
324
+ * `payload[:index]`: currently imported index name
325
+ * `payload[:import]`: import stats, the total counts of imported and deleted objects:
326
+
327
+ ```ruby
328
+ {index: 30, delete: 5}
329
+ ```
330
+
331
+ * `payload[:errors]`: may be absent. Contains errors grouped by message, each with a list of object IDs:
332
+
333
+ ```ruby
334
+ {index: {
335
+ 'error 1 text' => ['1', '2', '3'],
336
+ 'error 2 text' => ['4']
337
+ }, delete: {
338
+ 'delete error text' => ['10', '12']
339
+ }}
340
+ ```
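A subscriber usually needs to condense such a payload into something loggable. The sketch below assumes the payload shapes shown above; `summarize_import` is a hypothetical helper, not part of Chewy, and the subscription itself is shown only as a comment because it requires ActiveSupport to be loaded.

```ruby
# Hypothetical helper: turn an `import_objects.chewy` payload into a log line.
def summarize_import(payload)
  stats = payload.fetch(:import, {})
  errors = payload.fetch(:errors, {}) # the :errors key may be absent
  # Count every object ID listed under every error message.
  failed = errors.values.sum { |by_message| by_message.values.sum(&:size) }

  "#{payload[:index]}: indexed=#{stats.fetch(:index, 0)} " \
    "deleted=#{stats.fetch(:delete, 0)} failed=#{failed}"
end

payload = {
  index: 'CitiesIndex',
  import: {index: 30, delete: 5},
  errors: {
    index: {'error 1 text' => %w[1 2 3]},
    delete: {'delete error text' => %w[10 12]}
  }
}
puts summarize_import(payload)

# With ActiveSupport loaded, the helper could be attached like this:
#   ActiveSupport::Notifications.subscribe('import_objects.chewy') do |_name, _start, _finish, _id, payload|
#     Chewy.logger.info(summarize_import(payload))
#   end
```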
341
+
342
+ ## NewRelic integration
343
+
344
+ **Note:** this example was written for an older version of the NewRelic APM agent and may need adaptation for current versions. The general pattern of subscribing to Chewy's `ActiveSupport::Notifications` events remains valid.
345
+
346
+ To integrate with NewRelic you may use the following example source (config/initializers/chewy.rb):
347
+
348
+ ```ruby
349
+ require 'new_relic/agent/instrumentation/evented_subscriber'
350
+
351
+ class ChewySubscriber < NewRelic::Agent::Instrumentation::EventedSubscriber
352
+ def start(name, id, payload)
353
+ event = ChewyEvent.new(name, Time.current, nil, id, payload)
354
+ push_event(event)
355
+ end
356
+
357
+ def finish(_name, id, _payload)
358
+ pop_event(id).finish
359
+ end
360
+
361
+ class ChewyEvent < NewRelic::Agent::Instrumentation::Event
362
+ OPERATIONS = {
363
+ 'import_objects.chewy' => 'import',
364
+ 'search_query.chewy' => 'search',
365
+ 'delete_query.chewy' => 'delete'
366
+ }.freeze
367
+
368
+ def initialize(*args)
369
+ super
370
+ @segment = start_segment
371
+ end
372
+
373
+ def start_segment
374
+ segment = NewRelic::Agent::Transaction::DatastoreSegment.new product, operation, collection, host, port
375
+ if (txn = state.current_transaction)
376
+ segment.transaction = txn
377
+ end
378
+ segment.notice_sql @payload[:request].to_s
379
+ segment.start
380
+ segment
381
+ end
382
+
383
+ def finish
384
+ if (txn = state.current_transaction)
385
+ txn.add_segment @segment
386
+ end
387
+ @segment.finish
388
+ end
389
+
390
+ private
391
+
392
+ def state
393
+ @state ||= NewRelic::Agent::TransactionState.tl_get
394
+ end
395
+
396
+ def product
397
+ 'Elasticsearch'
398
+ end
399
+
400
+ def operation
401
+ OPERATIONS[name]
402
+ end
403
+
404
+ def collection
405
+ payload.values_at(:type, :index)
406
+ .reject { |value| value.try(:empty?) }
407
+ .first
408
+ .to_s
409
+ end
410
+
411
+ def host
412
+ Chewy.client.transport.hosts.first[:host]
413
+ end
414
+
415
+ def port
416
+ Chewy.client.transport.hosts.first[:port]
417
+ end
418
+ end
419
+ end
420
+
421
+ ActiveSupport::Notifications.subscribe(/\.chewy$/, ChewySubscriber.new)
422
+ ```
423
+
424
+ ## Import scope clean-up behavior
425
+
426
+ Whenever you set the `import_scope` for an index backed by ActiveRecord,
427
+ the order, offset and limit options are removed from the scope. You can
428
+ configure how Chewy behaves before it performs this clean-up.
429
+
430
+ The default behavior is a warning sent to the Chewy logger (`:warn`). Another more
431
+ restrictive option is raising an exception (`:raise`). Both options have a
432
+ negative impact on performance since verifying whether the code uses any of
433
+ these options requires building an Arel query.
434
+
435
+ To avoid the loading time impact, you can ignore the check (`:ignore`) before
436
+ the clean-up.
437
+
438
+ ```ruby
439
+ Chewy.import_scope_cleanup_behavior = :ignore
440
+ ```
data/docs/import.md ADDED
@@ -0,0 +1,122 @@
1
+ # Import
2
+
3
+ ## Default import options
4
+
5
+ Every index has a `default_import_options` setting that specifies its default import options:
6
+
7
+ ```ruby
8
+ class ProductsIndex < Chewy::Index
9
+ index_scope Post.includes(:tags)
10
+ default_import_options batch_size: 100, bulk_size: 10.megabytes, refresh: false
11
+
12
+ field :name
13
+ field :tags, value: -> { tags.map(&:name) }
14
+ end
15
+ ```
16
+
17
+ See [import.rb](../lib/chewy/index/import.rb) for available options. For field definitions (`field`, `index_scope`, etc.), see [indexing.md](indexing.md#index-definition).
18
+
19
+ ## Raw import
20
+
21
+ Another way to speed up imports is raw import. This technique is only available with the ActiveRecord adapter. Very often, ActiveRecord model instantiation is what consumes most of the CPU and RAM resources. Precious time is wasted on converting, say, timestamps from strings and then serializing them back to strings. Chewy can operate on raw hashes of data obtained directly from the database. All you need is to provide a way to convert that hash to a lightweight object that mimics the behaviour of the normal ActiveRecord object.
22
+
23
+ ```ruby
24
+ class LightweightProduct
25
+ def initialize(attributes)
26
+ @attributes = attributes
27
+ end
28
+
29
+ # Depending on the database, `created_at` might
30
+ # be in different formats. In PostgreSQL, for example,
31
+ # you might see the following format:
32
+ # "2016-03-22 16:23:22"
33
+ #
34
+ # Taking into account that Elastic expects something different,
35
+ # one might do something like the following, just to avoid
36
+ # unnecessary String -> DateTime -> String conversion.
37
+ #
38
+ # "2016-03-22 16:23:22" -> "2016-03-22T16:23:22Z"
39
+ def created_at
40
+ @attributes['created_at'].tr(' ', 'T') << 'Z'
41
+ end
42
+ end
43
+
44
+ index_scope Product
45
+ default_import_options raw_import: ->(hash) {
46
+ LightweightProduct.new(hash)
47
+ }
48
+
49
+ field :created_at, 'datetime'
50
+ ```
51
+
52
+ Also, you can pass `:raw_import` option to the `import` method explicitly.
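When many columns are needed, writing one reader per column gets tedious. A generic wrapper is one possible alternative; the sketch below is an assumption-laden illustration, with `RawRow` being a hypothetical class (not part of Chewy) that expects the raw hash to use String column names as keys.

```ruby
# Hypothetical generic wrapper: exposes every column of a raw database row
# (a Hash with String keys) as a reader method, without ActiveRecord.
class RawRow
  def initialize(attributes)
    @attributes = attributes
  end

  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end

  # Only zero-argument calls that match a column name are answered;
  # everything else raises NoMethodError as usual.
  def method_missing(name, *args)
    key = name.to_s
    return super unless args.empty? && @attributes.key?(key)

    @attributes[key]
  end
end

row = RawRow.new('id' => 1, 'name' => 'London')
puts row.name
```

Such a wrapper trades a little dispatch overhead for not having to list columns by hand; columns that need conversion, like `created_at` above, would still get explicit methods.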
53
+
54
+ ## Index creation during import
55
+
56
+ By default, when you perform an import, Chewy checks whether the index exists and creates it if it's absent.
57
+ You can turn off this feature to reduce the number of requests to Elasticsearch.
58
+ To do so, set the `skip_index_creation_on_import` parameter to `true` in your `config/chewy.yml`.
59
+
60
+ ## Skip record fields during import
61
+
62
+ You can use `ignore_blank: true` to skip fields that return `true` for the `.blank?` method:
63
+
64
+ ```ruby
65
+ index_scope Country
66
+ field :id
67
+ field :cities, ignore_blank: true do
68
+ field :id
69
+ field :name
70
+ field :surname, ignore_blank: true
71
+ field :description
72
+ end
73
+ ```
74
+
75
+ ### Default values for different types
76
+
77
+ By default `ignore_blank` is false on every type except `geo_point`.
78
+
79
+ ## Journaling
80
+
81
+ You can record all actions in a separate journal index in Elasticsearch.
82
+ When you create, update or destroy documents, the action is saved in this special index.
83
+ If you do something with a batch of documents (e.g. during an index reset), it is saved as one record that includes the primary keys of every affected document.
84
+ A typical journal record looks like this:
85
+
86
+ ```json
87
+ {
88
+ "action": "index",
89
+ "object_id": [1, 2, 3],
90
+ "index_name": "...",
91
+ "created_at": "<timestamp>"
92
+ }
93
+ ```
94
+
95
+ This feature is turned off by default.
96
+ You can turn it on by setting `journal` option to `true` in `config/chewy.yml`.
97
+
98
+ Also, you can provide this option while you're importing some index:
99
+
100
+ ```ruby
101
+ CityIndex.import journal: true
102
+ ```
103
+
104
+ Or as a default import option for an index:
105
+
106
+ ```ruby
107
+ class CityIndex < Chewy::Index
108
+ index_scope City
109
+ default_import_options journal: true
110
+ end
111
+ ```
112
+
113
+ You may be wondering why you need it. The answer is simple: so as not to lose data.
114
+
115
+ Imagine that you reset your index in a zero-downtime manner (into a separate index),
116
+ and in the meantime somebody keeps updating the data frequently (in the old
117
+ index). All these actions are written to the journal index, and you'll be
118
+ able to apply them after the index reset using the `Chewy::Journal` interface. To subscribe to journal events via `ActiveSupport::Notifications`, see [configuration.md](configuration.md#activesupportnotifications-support).
119
+
120
+ When enabled, the journal can grow to an enormous size. Consider setting up a cron job
121
+ that cleans it occasionally using the [`chewy:journal:clean` rake
122
+ task](rake_tasks.md#chewyjournal).