hekenga 1.1.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 94824a7ff42ed31e87c54b5093ad9b8a03b27b08ec5cc7c4f45491b71b3dcc79
- data.tar.gz: ca0ffaded81560273fa6421de3dfdb9a0efce0fd62f95c410bd31004cb17a816
+ metadata.gz: d5a372f23cc2fe1751eb08e07e40705b0361fc609b072e14226b72819cfe9647
+ data.tar.gz: 99228c0f076660abe23c92f37c55509b156810fc0a741764b7872d4a0c597263
  SHA512:
- metadata.gz: 8677a4cd8023f511971b54ac4458e7b230ee693246473d77b9e3237895ef95670aa157d0117f13223a0df4dde71430de9c940bfc89315422203e23ec3434218e
- data.tar.gz: 69b879e26c4d7771377881d9e83d411fa3b0a2674ce851b5b9b74f323c1c2dec749218257766b27e467eb409708f221d9125a8451bd9d394b103e654e86785da
+ metadata.gz: e2dfea01ce0c1c17fc51ab550e1805653a255d7c65a8b8424c7a1d6e646d59b9bc70fc45cfcf558d1fd7a6df81e5a2a8230a240e8c75d578736cdf19f4e2e290
+ data.tar.gz: 0af8e27a9f1b4c7ae490788f64457a67fae9a3fb9459821b8066bca61671ae30463971e9ec277d01ba358817992260281e5dceaaab676f9093dc05a941a520ce
data/CHANGELOG.md CHANGED
@@ -1,5 +1,25 @@
  # Changelog
 
+ ## v2.1.0
+
+ - `per_document` task `scope` will no longer let you specify `.only` or
+   `.without`, as these could potentially cause data loss.
+ - `per_document` task `scope` now works correctly with `.includes`, even
+   in parallel execution mode.
+
+ ## v2.0.0
+
+ - (breaking) `Hekenga::Iterator` has been replaced by `Hekenga::IdIterator`. If any
+   selector or sort is set on a document task migration scope, Hekenga no longer forces an
+   ascending ID sort. This should help prevent index misses, with the
+   tradeoff that documents being concurrently updated may be skipped or
+   processed multiple times. Hekenga tries to guard against processing documents
+   multiple times. Manually specifying `asc(:_id)` on your scope will continue to
+   process documents in ID order.
+ - Document tasks now support a new option, `cursor_timeout`: the maximum
+   time within which a document task's `scope` can be iterated and jobs queued. The
+   default is one day.
+
  ## v1.1.0
 
  - `setup` is now passed the current batch of documents so it can be used to
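The duplicate-processing guard mentioned in the v2.0.0 notes can be illustrated in plain Ruby (no Hekenga APIs; the variable names below are hypothetical): IDs already claimed by existing task records are stripped from freshly generated ID slices so no document is enqueued twice.

```ruby
require "set"

# already_claimed stands in for IDs found on existing task records;
# incoming_slices stands in for newly yielded ID batches.
already_claimed = [[1, 2], [3, 4]].flatten.to_set
incoming_slices = [[2, 5], [6, 1]]

# Strip any ID that has already been claimed.
incoming_slices.each do |slice|
  slice.reject! { |id| already_claimed.include?(id) }
end

incoming_slices # => [[5], [6]]
```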
data/CLAUDE.md ADDED
@@ -0,0 +1,60 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Hekenga is a Ruby gem providing a migration framework for MongoDB (via Mongoid). It supports sequential and parallel document processing via ActiveJob, with error recovery, validation tracking, and a Thor-based CLI.
+
+ ## Common Commands
+
+ ```bash
+ # Run full test suite (requires MongoDB - see docker-compose.yml)
+ rake spec
+
+ # Run a single spec file
+ rake spec SPEC=spec/hekenga/document_task_spec.rb
+
+ # Install gem locally
+ bundle exec rake install
+
+ # Interactive console with gem loaded
+ bin/console
+ ```
+
+ ## Architecture
+
+ ### Migration Flow
+
+ ```
+ Migration.perform! → MasterProcess.run! → launches tasks in threads
+   SimpleTask: executes up/down blocks directly
+   DocumentTask: iterates documents → batch → execute → write (sequential)
+   ParallelTask: splits into ID batches → enqueues ParallelJob per batch (via ActiveJob)
+ ```
+
+ ### Key Components
+
+ - **`Hekenga::Migration`** — main migration class, orchestrates tasks
+ - **`Hekenga::MasterProcess`** — launches tasks, manages execution/recovery/progress
+ - **`Hekenga::DSL::*`** — fluent DSL for defining migrations (`DSL::Migration`, `DSL::SimpleTask`, `DSL::DocumentTask`)
+ - **`Hekenga::DocumentTaskExecutor`** — core document processing: filter → up block → validate → write
+ - **`Hekenga::ParallelTask`** / **`Hekenga::ParallelJob`** — parallel execution via ActiveJob
+ - **`Hekenga::DocumentTaskRecord`** — Mongoid doc tracking parallel task progress
+ - **`Hekenga::Log`** — Mongoid doc tracking migration/task status (`:naught`, `:running`, `:complete`, `:failed`, `:skipped`)
+ - **`Hekenga::Failure::*`** — error/validation/write/cancelled failure tracking subclasses
+ - **`Hekenga::IdIterator`** / **`Hekenga::MongoidIterator`** — efficient document iteration for parallel vs sequential paths
+
+ ### Task Types
+
+ - **SimpleTask** — one-off up/down blocks, no document iteration
+ - **DocumentTask** — per-document processing with scope, filter, setup, up, down, after blocks; supports `parallel!`, `timeless!`, `always_write!`, `use_transaction!`, configurable write strategies (`:update` vs `:delete_then_insert`)
+
+ ### Configuration
+
+ Via `Hekenga.configure` block — sets migration directory and report frequency. Thread-safe registry tracks all migrations.
+
+ ## Dependencies
+
+ - **mongoid** (>= 6), **activejob** (>= 5), **thor** (1.2.1)
+ - Test: **rspec** (~> 3.0), **database_cleaner-mongoid** (~> 2.0), **pry**
data/README.md CHANGED
@@ -1,10 +1,7 @@
  # Hekenga
 
- An attempt at a migration framework for MongoDB that supports parallel document
- processing via ActiveJob, chained jobs and error recovery.
-
- **Note that this gem is currently in pre-alpha - assume most things have a high
- chance of being broken.**
+ A migration framework for MongoDB (via Mongoid) that supports parallel document
+ processing via ActiveJob, chained jobs, and error recovery.
 
  ## Installation
 
@@ -22,13 +19,135 @@ Or install it yourself as:
 
      $ gem install hekenga
 
+ ## Configuration
+
+ ```ruby
+ Hekenga.configure do |config|
+   config.dir = ["db", "hekenga"] # where migration files live (relative to root)
+   config.root = Dir.pwd          # application root
+ end
+ ```
+
+ Migrations are stored as Ruby files in the configured directory (default: `db/hekenga/`).
+
  ## Usage
 
- CLI instructions:
+ ### CLI
+
+ ```
+ $ hekenga help                         # Show all available commands
+ $ hekenga generate <description>       # Generate a new migration scaffold
+ $ hekenga status                       # Show status of all migrations
+ $ hekenga run_all!                     # Run all pending migrations in date order
+ $ hekenga run! <path_or_pkey>          # Run a specific migration
+ $ hekenga run! <path_or_pkey> --test   # Dry run (no writes persisted)
+ $ hekenga run! <path_or_pkey> --clear  # Clear logs before running
+ $ hekenga recover! <path_or_pkey>      # Re-process failed/invalid records
+ $ hekenga cancel                       # Cancel all active migrations
+ $ hekenga skip <path_or_pkey>          # Mark a migration as skipped
+ $ hekenga clear! <path_or_pkey>        # Remove all logs/failures for a migration
+ $ hekenga cleanup                      # Remove all failure logs
+ ```
+
+ ### Writing Migrations
+
+ Generate a migration scaffold:
+
+     $ hekenga generate "Add default role to users"
+
+ #### Simple Tasks
+
+ Simple tasks run arbitrary code once. Use `actual?` and `test?` to check execution mode.
+
+ ```ruby
+ Hekenga.migration do
+   description "Backfill analytics collection"
+   created "2024-01-15 10:00"
+
+   task "Create indexes" do
+     up do
+       Analytics.create_indexes if actual?
+     end
+   end
+ end
+ ```
+
+ #### Document Tasks
+
+ Document tasks iterate over a Mongoid scope and process each document in batches.
+
+ ```ruby
+ Hekenga.migration do
+   description "Normalize user emails"
+   created "2024-01-15 10:00"
+   batch_size 100 # default batch size for all tasks in this migration
+
+   per_document "Downcase emails" do
+     scope User.all
+
+     # Called once per batch; instance variables are shared with filter/up/after
+     setup do |docs|
+       @domain_map = ExternalService.load_domains
+     end
+
+     # Return false to skip a document
+     filter do |doc|
+       doc.email.present?
+     end
+
+     # Mutate the document in place — Hekenga handles persistence
+     up do |doc|
+       doc.email = doc.email.downcase
+     end
+
+     # Called once per batch with the successfully written documents
+     after do |docs|
+       AuditLog.record(docs.map(&:id))
+     end
+   end
+ end
+ ```
+
+ #### Document Task Options
+
+ ```ruby
+ per_document "Process records" do
+   scope MyModel.where(active: true)
+
+   parallel!               # Process batches in parallel via ActiveJob
+   timeless!               # Don't update Mongoid timestamps
+   always_write!           # Write even if the document didn't change
+   skip_prepare!           # Skip Mongoid callbacks on load
+   use_transaction!        # Wrap each batch in a MongoDB transaction
+   batch_size 50           # Override migration-level batch size
+   write_strategy :update  # :update (default) or :delete_then_insert
+   cursor_timeout 86_400   # Max cursor lifetime in seconds (default: 1 day)
+
+   up do |doc|
+     doc.status = "migrated"
+   end
+ end
+ ```
+
+ ### Test Mode
+
+ Run a migration without persisting changes:
+
+ ```ruby
+ migration = Hekenga.find_migration("2024-01-15-add-default-role-to-users")
+ migration.test_mode!
+ migration.perform!
+ ```
+
+ Or via the CLI:
+
+     $ hekenga run! <path_or_pkey> --test
+
+ ### Recovery
 
- $ hekenga help
+ When a migration fails (due to errors, invalid records, or write failures), Hekenga logs the failures and marks the migration as failed. You can re-process only the failed records:
 
- Migration DSL documentation TBD, for now please look at spec/
+     $ hekenga recover! <path_or_pkey>
 
  ## Development
 
data/docker-compose.yml CHANGED
@@ -8,7 +8,7 @@ networks:
 
  services:
    mongo:
-     image: mongo:5
+     image: mongo:6
      command: ["--replSet", "rs0", "--bind_ip", "localhost,mongo"]
      volumes:
        - mongo:/data/db
@@ -18,7 +18,7 @@ services:
        - hekenga-net
 
    mongosetup:
-     image: mongo:5
+     image: mongo:6
      depends_on:
        - mongo
      restart: "no"
data/lib/hekenga/base_iterator.rb ADDED
@@ -0,0 +1,24 @@
+ module Hekenga
+   class BaseIterator
+     include Enumerable
+     DEFAULT_TIMEOUT = 86_400 # 1 day in seconds
+
+     attr_reader :cursor_timeout
+
+     def initialize(scope:, cursor_timeout: DEFAULT_TIMEOUT)
+       @scope = scope
+       @cursor_timeout = cursor_timeout
+     end
+
+     private
+
+     def iteration_scope
+       if @scope.selector.blank? && @scope.options.blank?
+         # Apply a default _id sort, it works the best
+         @scope.asc(:_id)
+       else
+         @scope
+       end.max_time_ms(cursor_timeout * 1000) # convert to ms
+     end
+   end
+ end
data/lib/hekenga/document_task.rb CHANGED
@@ -1,8 +1,9 @@
  require 'hekenga/irreversible'
+ require 'hekenga/base_iterator'
  module Hekenga
    class DocumentTask
      attr_reader :ups, :downs, :setups, :filters, :after_callbacks
-     attr_accessor :parallel, :scope, :timeless, :batch_size
+     attr_accessor :parallel, :scope, :timeless, :batch_size, :cursor_timeout
      attr_accessor :description, :invalid_strategy, :skip_prepare, :write_strategy
      attr_accessor :always_write, :use_transaction
 
@@ -18,10 +19,14 @@ module Hekenga
      @batch_size = nil
      @always_write = false
      @use_transaction = false
+     @cursor_timeout = Hekenga::BaseIterator::DEFAULT_TIMEOUT
    end
 
    def validate!
      raise Hekenga::Invalid.new(self, :ups, "missing") unless ups.any?
+     if scope&.options&.key?(:fields)
+       raise Hekenga::Invalid.new(self, :scope, "uses .only() or .without() which would cause data loss with replace_one")
+     end
    end
 
    def up!(context, document)
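The `.only`/`.without` guard above exists because a projected document written back wholesale loses the fields the projection excluded. A plain-Ruby sketch of the failure mode (hashes standing in for MongoDB documents; not Hekenga's actual write path):

```ruby
# What the database holds:
stored = { "_id" => 1, "email" => "A@B.C", "name" => "Ada" }

# What a scope with .only(:email) would load: "name" is never fetched.
projected = stored.slice("_id", "email")
projected["email"] = projected["email"].downcase

# A replace-style write persists the whole in-memory document,
# so the unfetched field is silently dropped:
replaced = projected
replaced.key?("name") # => false
```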
data/lib/hekenga/document_task_executor.rb CHANGED
@@ -59,7 +59,11 @@ module Hekenga
    end
 
    def record_scope
-     task.scope.klass.unscoped.in(_id: task_record.ids)
+     scope = task.scope.klass.unscoped.in(_id: task_record.ids)
+     if task.scope.inclusions.any?
+       scope = scope.includes(*task.scope.inclusions.map(&:name))
+     end
+     scope
    end
 
    def records
data/lib/hekenga/dsl/document_task.rb CHANGED
@@ -28,6 +28,10 @@ module Hekenga
      @object.write_strategy = strategy
    end
 
+   def cursor_timeout(timeout)
+     @object.cursor_timeout = timeout.to_i
+   end
+
    def scope(scope)
      @object.scope = scope
    end
data/lib/hekenga/id_iterator.rb ADDED
@@ -0,0 +1,34 @@
+ require "hekenga/base_iterator"
+ module Hekenga
+   class IdIterator < BaseIterator
+     DEFAULT_ID = "_id".freeze
+
+     attr_reader :id_property
+
+     def initialize(id_property: DEFAULT_ID, **kwargs)
+       super(**kwargs)
+       @id_property = id_property
+     end
+
+     def each
+       with_view do |view|
+         view.each do |doc|
+           yield doc[id_property]
+         end
+       end
+     end
+
+     private
+
+     def with_view
+       view = iteration_scope.view
+       yield view
+     ensure
+       view.close_query
+     end
+
+     def iteration_scope
+       super.only(id_property)
+     end
+   end
+ end
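Because `BaseIterator` includes `Enumerable`, subclasses like the one above only need to define `#each` to get the full batching toolkit. A stub illustrating the mechanism (`StubIdIterator` is hypothetical; the real `IdIterator` walks a MongoDB view rather than an array):

```ruby
class StubIdIterator
  include Enumerable

  def initialize(ids)
    @ids = ids
  end

  # Defining #each is all Enumerable needs...
  def each(&block)
    @ids.each(&block)
  end
end

# ...to provide each_slice, map, and friends on top of it:
slices = StubIdIterator.new((1..5).to_a).each_slice(2).to_a
# slices => [[1, 2], [3, 4], [5]]
```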
data/lib/hekenga/migration.rb CHANGED
@@ -2,6 +2,7 @@ require 'hekenga/invalid'
  require 'hekenga/context'
  require 'hekenga/parallel_job'
  require 'hekenga/parallel_task'
+ require 'hekenga/mongoid_iterator'
  require 'hekenga/master_process'
  require 'hekenga/document_task_record'
  require 'hekenga/document_task_executor'
@@ -132,18 +133,18 @@ module Hekenga
      records = []
      task_records(task_idx).delete_all unless recover
      executor_key = BSON::ObjectId.new
-     task.scope.asc(:_id).no_timeout.each do |record|
+     Hekenga::MongoidIterator.new(scope: task.scope, cursor_timeout: task.cursor_timeout).each do |record|
        records.push(record)
        next unless records.length == (task.batch_size || batch_size)
 
-       records = filter_out_processed(task, task_idx, records) if recover
+       records = filter_out_processed(task, task_idx, records)
        next unless records.length == (task.batch_size || batch_size)
 
        execute_document_task(task_idx, executor_key, records)
        records = []
        return if log.cancel
      end
-     records = filter_out_processed(task, task_idx, records) if recover
+     records = filter_out_processed(task, task_idx, records)
      execute_document_task(task_idx, executor_key, records) if records.any?
      return if log.cancel
      log_done!
data/lib/hekenga/mongoid_iterator.rb ADDED
@@ -0,0 +1,8 @@
+ require "hekenga/base_iterator"
+ module Hekenga
+   class MongoidIterator < BaseIterator
+     def each(&block)
+       iteration_scope.each(&block)
+     end
+   end
+ end
data/lib/hekenga/parallel_task.rb CHANGED
@@ -1,4 +1,4 @@
- require 'hekenga/iterator'
+ require 'hekenga/id_iterator'
  require 'hekenga/document_task_executor'
  require 'hekenga/task_splitter'
 
@@ -15,13 +15,13 @@ module Hekenga
 
    def start!
      clear_task_records!
-     @executor_key = BSON::ObjectId.new
+     regenerate_executor_key
      generate_for_scope(task.scope)
      check_for_completion!
    end
 
    def resume!
-     @executor_key = BSON::ObjectId.new
+     regenerate_executor_key
      task_records.set(executor_key: @executor_key)
      queue_jobs!(task_records.incomplete)
      generate_new_records!
@@ -41,16 +41,43 @@ module Hekenga
 
    private
 
+   def regenerate_executor_key
+     @executor_key = BSON::ObjectId.new
+   end
+
    def generate_for_scope(scope)
-     Hekenga::Iterator.new(scope, size: 100_000).each do |id_block|
-       task_records = id_block.each_slice(batch_size).map do |id_slice|
-         generate_task_records!(id_slice)
-       end
+     Hekenga::IdIterator.new(
+       scope: scope,
+       cursor_timeout: task.cursor_timeout
+       # Batch batches of IDs
+     ).each_slice(batch_size).each_slice(enqueue_size) do |id_block|
+       sanitize_id_block!(id_block)
+       task_records = id_block.reject(&:empty?).map(&method(:generate_task_record!))
        write_task_records!(task_records)
        queue_jobs!(task_records)
      end
    end
 
+   def enqueue_size
+     500 # task records written + enqueued at a time
+   end
+
+   def sanitize_id_block!(id_block)
+     return if task.scope.options.blank? && task.scope.selector.blank?
+
+     # Custom ordering on cursor with parallel updates may result in the same
+     # ID getting yielded into the migration multiple times. Detect this +
+     # remove
+     doubleups = task_records.in(ids: id_block.flatten).pluck(:ids).flatten.to_set
+     return if doubleups.empty?
+
+     id_block.each do |id_slice|
+       id_slice.reject! do |id|
+         doubleups.include?(id)
+       end
+     end
+   end
+
    def generate_new_records!
      last_record = task_records.desc(:_id).first
      last_id = last_record&.ids&.last
@@ -83,7 +110,7 @@ module Hekenga
      migration.task_records(task_idx)
    end
 
-   def generate_task_records!(id_slice)
+   def generate_task_record!(id_slice)
      Hekenga::DocumentTaskRecord.new(
        migration_key: migration.to_key,
        task_idx: task_idx,
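The chained `each_slice` calls in `generate_for_scope` can be sketched in plain Ruby: the first slice groups IDs into job-sized batches (`batch_size`), and the second groups those batches into chunks that are written and enqueued together (`enqueue_size`, 500 in the gem; the small numbers below are illustrative).

```ruby
ids = (1..12).to_a
batch_size = 3    # documents per job
enqueue_size = 2  # batches written + enqueued at a time

# Outer array: enqueue chunks; inner arrays: per-job ID batches.
id_blocks = ids.each_slice(batch_size).each_slice(enqueue_size).to_a
# id_blocks => [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
```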
data/lib/hekenga/scaffold.rb CHANGED
@@ -48,6 +48,7 @@ module Hekenga
  # #skip_prepare!
  # #batch_size 25
  # #write_strategy :update # :delete_then_insert
+ # #cursor_timeout 86_400 # max allowed time for the cursor to survive, in seconds
  #
  # # Called once per batch, instance variables will be accessible
  # # in the filter, up and after blocks
data/lib/hekenga/version.rb CHANGED
@@ -1,3 +1,3 @@
  module Hekenga
-   VERSION = "1.1.0"
+   VERSION = "2.1.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: hekenga
  version: !ruby/object:Gem::Version
-   version: 1.1.0
+   version: 2.1.0
  platform: ruby
  authors:
  - Tapio Saarinen
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2024-02-20 00:00:00.000000000 Z
+ date: 2026-04-23 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: bundler
@@ -148,6 +148,7 @@ files:
  - ".rspec"
  - ".travis.yml"
  - CHANGELOG.md
+ - CLAUDE.md
  - Gemfile
  - README.md
  - Rakefile
@@ -159,6 +160,7 @@ files:
  - hekenga.gemspec
  - lib/hekenga.rb
  - lib/hekenga/base_error.rb
+ - lib/hekenga/base_iterator.rb
  - lib/hekenga/config.rb
  - lib/hekenga/context.rb
  - lib/hekenga/document_task.rb
@@ -173,12 +175,13 @@ files:
  - lib/hekenga/failure/error.rb
  - lib/hekenga/failure/validation.rb
  - lib/hekenga/failure/write.rb
+ - lib/hekenga/id_iterator.rb
  - lib/hekenga/invalid.rb
  - lib/hekenga/irreversible.rb
- - lib/hekenga/iterator.rb
  - lib/hekenga/log.rb
  - lib/hekenga/master_process.rb
  - lib/hekenga/migration.rb
+ - lib/hekenga/mongoid_iterator.rb
  - lib/hekenga/parallel_job.rb
  - lib/hekenga/parallel_task.rb
  - lib/hekenga/scaffold.rb
data/lib/hekenga/iterator.rb DELETED
@@ -1,26 +0,0 @@
- module Hekenga
-   class Iterator
-     include Enumerable
-
-     SMALLEST_ID = BSON::ObjectId.from_string('0'*24)
-
-     attr_reader :scope, :size
-
-     def initialize(scope, size:)
-       @scope = scope
-       @size = size
-     end
-
-     def each(&block)
-       current_id = SMALLEST_ID
-       base_scope = scope.asc(:_id).limit(size)
-
-       loop do
-         ids = base_scope.and(_id: {'$gt': current_id}).pluck(:_id)
-         break if ids.empty?
-         yield ids
-         current_id = ids.sort.last
-       end
-     end
-   end
- end
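The deleted `Iterator` used keyset pagination: repeatedly fetch the next `size` IDs greater than the last one seen. A plain-Ruby simulation of that loop over an in-memory array (a sketch of the pattern, not the Mongoid query it replaced):

```ruby
def each_id_block(ids, size)
  current = nil
  loop do
    # Equivalent of: scope.asc(:_id).limit(size).and(_id: { '$gt': current })
    block = ids.select { |id| current.nil? || id > current }.sort.first(size)
    break if block.empty?
    yield block
    current = block.last # advance the keyset past the last yielded ID
  end
end

blocks = []
each_id_block([5, 1, 4, 2, 3], 2) { |b| blocks << b }
# blocks => [[1, 2], [3, 4], [5]]
```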