exwiw 0.8.4 → 0.8.5

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ee150e9fd830a7829be8c0eebbeca19f0e2d1b1dbcce50fe78e63d62e878997c
-   data.tar.gz: 31377340d8737ece209e1c0755eeac49ac40930562503e2c4d04f6362938e9d9
+   metadata.gz: 3934deb015b3ede7c6a8caa25b857d1f7c8957b4efa3284cd26f21ff2620904d
+   data.tar.gz: e44fef95c3c273aa96b1dde37736212687da88d8d9b98abb5c44fb305d45b7a3
  SHA512:
-   metadata.gz: 26151e0bf763b6f48e993d2025b4bac804a373fe2e3cfd71bb70d91b86b453826838ec683ebfd3f4f4336bb83b8a12d567d7dbda7f8a4ef48fc7f50bbe98145a
-   data.tar.gz: b29a2da42750dde93cd53eb56a11a592478134b690ccba31b59cd312f3685dad2713e63897d09cb77246a8be37dfe55890f1a8505c1aa8b5fd020b7bc7760a03
+   metadata.gz: 1bf410fb503b270aca3f96191f34cb8ee67730ac49ea6a57d34550d2b603fed7404206d405c61f6dd3b67808f8865d73c80121d303c628dda513ebcf9c418b17
+   data.tar.gz: d1cbb8bd0ed7bd10f53d3d3169985844d4e5e509041c95a18d083dff458a06f26c09e325fea44bc5c1e14bf65b53fdc79e4dbe351f3e4e5f916e6c9135cfa71e
data/CHANGELOG.md CHANGED
@@ -2,6 +2,10 @@
  
  ## [Unreleased]
  
+ ## [0.8.5] - 2026-06-25
+
+ - **`--parallel-workers=N` parallelizes the MongoDB dump across forked processes (opt-in, byte-identical).** When set with `N≥2` on the mongodb adapter's `export`, exwiw runs an inter-collection fork schedule that decodes whole collections in parallel while preserving each collection's natural row order, so the output files are byte-identical to a serial run (same filenames, same content). Collections are classified into three dependency groups — reference data dumped in full (no `belongs_to`), the scoped DAG reachable to the dump target, and non-reachable reference data — which lets the heavy full-dump collections run concurrently with the scoped pass; only a handful of small `@state` hand-offs cross process boundaries. The win needs real cores: it clears ~2× from 4 workers and saturates there. Requires a dump target and a `fork`-capable runtime (CRuby on POSIX); it falls back to the serial path on JRuby/TruffleRuby/Windows or when no target is given. Also settable as `parallel_workers:` in the config file. The default remains the serial dump.
+
  ## [0.8.4] - 2026-06-24
  
  ### Fixed
data/README.md CHANGED
@@ -769,6 +769,7 @@ The MongoDB adapter is experimental. To use it:
  - `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
  - `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only** (the SQL adapters have no equivalent).
  - Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
+ - `--parallel-workers=N` (opt-in, `export` only) forks `N` worker processes that decode whole collections in parallel — the dominant cost on a large dump is the driver's BSON→Ruby decode, and each worker decodes its own collections in their natural order, so the output stays **byte-identical** to a serial run (same filenames and content). It needs a dump target (the schedule is built around the scoped DAG) and a `fork`-capable runtime (CRuby on POSIX), falling back to the serial path otherwise; it also accepts `parallel_workers:` in the config file. The speedup needs real cores to spend — it reaches ~2× from 4 workers and saturates there. The default is serial. See [`docs/mongodb-dump-parallelism-2x-notes.md`](docs/mongodb-dump-parallelism-2x-notes.md) for the schedule and measurements.
  - Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
  ```bash
  mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
@@ -134,13 +134,24 @@ bounded by chunk size, and `N×` that stays well under the 7 GiB container limit
  
  ## Status
  
- This is a **measured, byte-identical proof** (bench prototype, not yet integrated
- into the Runner/CLI). Integrating it means re-introducing process orchestration
- and `@state` sidecar IPC the machinery `optimization-notes.md` deliberately
- removed as over-engineered for a flag. It is recorded here because, unlike that
- removed work, this schedule is (a) byte-identical by construction, (b) measured
- past the target on a real extraction, and (c) the lever the task explicitly
- invited (scale the task to go faster). A production version would gate it behind a
- worker-count option, fall back to serial where `fork` is unavailable
- (Windows/JRuby), and reuse the existing genuine-anchor classification to derive
- the three groups.
+ **Integrated and shipped behind an opt-in flag.** The schedule lives in
+ `Exwiw::MongodbParallelPlan` (the static, DB-free classification) and
+ `Exwiw::MongodbParallelDumper` (the fork orchestrator: per-group pools, LPT
+ bin-packing, the `@state` Marshal-sidecar IPC, and the Phase-2 cascade). The
+ `Runner` delegates the whole schema+inserts pass to the dumper when the mongodb
+ adapter is used with `--parallel-workers=N` (N≥2), a genuine-anchor dump target is
+ present, and the runtime can `fork`; otherwise it runs the serial loop unchanged.
+ The CLI exposes `--parallel-workers` / config-file `parallel_workers` (mongodb +
+ `export` only). The after-insert hook runs identically on both paths.
+
+ End-to-end verification through the real `exwiw export` CLI on the same staging
+ restore: **189/189 output files byte-identical** to the serial CLI run (same
+ filenames, 0 content mismatches), at **2.19× wall-clock** (serial 7.13 s → N=4
+ 3.25 s; N=2 3.99 s = 1.81×; N=6 saturates at 3.25 s). Per the curve above the win
+ materializes from ~4 real cores.
+
+ This was the machinery `optimization-notes.md` deliberately removed as
+ over-engineered for a flag — re-introduced here because, unlike that removed work,
+ this schedule is byte-identical by construction, measured past the 2× target on a
+ real extraction, and the lever the task explicitly invited (scale the task to go
+ faster). It is **strictly opt-in**: the default remains the serial path.
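The delegation rule described in the Status section above can be read as one predicate. The sketch below is illustrative only: `parallel_dump?` and its keyword arguments are hypothetical names, not exwiw's Runner API, and the "genuine-anchor dump target" condition is simplified here to "a target is given". Only the `Process.respond_to?(:fork)` check matches what the shipped `MongodbParallelDumper.available?` does.

```ruby
# Hypothetical helper (not part of exwiw): decides whether a run would take
# the fork-parallel path or fall back to the serial loop.
def parallel_dump?(adapter:, workers:, target:, can_fork: Process.respond_to?(:fork))
  return false unless adapter == "mongodb"     # mongodb adapter only
  return false if workers.nil? || workers < 2  # N>=2 enables the schedule
  return false if target.nil?                  # simplified: a dump target must be given
  can_fork                                     # false on runtimes without fork
end

puts parallel_dump?(adapter: "mongodb", workers: 4, target: "users", can_fork: true)
puts parallel_dump?(adapter: "mongodb", workers: 1, target: "users", can_fork: true)
```

Passing `can_fork` explicitly keeps the rule testable on any runtime; every `false` branch corresponds to the serial fallback.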
@@ -70,6 +70,25 @@ module Exwiw
    @state = {}
  end
  
+ # Propagation @state accessor, used ONLY by MongodbParallelDumper to seed a
+ # forked worker with the slice of parent ids its collections reference and to
+ # harvest the ids downstream collections will `$in`-match against (handed
+ # between processes as Marshal sidecars). The serial Runner never touches
+ # these — it relies on the in-process capture during #execute.
+ attr_accessor :state
+
+ # Cheap, metadata-only document-count estimate for `collection_name`, used by
+ # the parallel dumper to weight collections for LPT bin-packing. This only
+ # influences which worker processes a collection (never the output bytes), so
+ # an imprecise estimate is harmless. Reads collection metadata rather than
+ # running a COLLSCAN; returns 0 on any error (e.g. a collection absent from
+ # this database just sorts to the lowest weight).
+ def estimated_count(collection_name)
+   db[collection_name].estimated_document_count
+ rescue StandardError
+   0
+ end
+
  def dumpable?(config)
    !config.embedded?
  end
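The "Marshal sidecar" hand-off mentioned in the accessor comment above can be sketched as a plain write-then-read round trip. This is a minimal illustration, not exwiw's code: the `captured` hash and the collection names are made up, and both halves run in one process for brevity, whereas the gem writes from forked workers and reads in the parent.

```ruby
require "tmpdir"

# Hypothetical stand-in for the ids a worker captured per collection.
captured = { "countries" => [1, 2, 3] }

restored = {}
Dir.mktmpdir("sidecar-demo-") do |dir|
  # Writer side: one binary sidecar file per collection.
  captured.each do |name, ids|
    File.binwrite(File.join(dir, "#{name}.marshal"), Marshal.dump(ids))
  end

  # Reader side: load only the sidecars that exist; a collection that
  # produced nothing simply has no file and is skipped.
  %w[countries currencies].each do |name|
    path = File.join(dir, "#{name}.marshal")
    restored[name] = Marshal.load(File.binread(path)) if File.exist?(path)
  end
end

p restored
```

`Marshal.load` should only be used on data the program itself wrote, as here, because loading untrusted Marshal data is unsafe.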
data/lib/exwiw/cli.rb CHANGED
@@ -35,6 +35,7 @@ module Exwiw
    ids
    ids_field
    scope_column
+   parallel_workers
  ].freeze
  
  # Database connection settings are environment-specific (and sometimes
@@ -44,7 +45,7 @@ module Exwiw
  
  # Keys that only make sense for `export`. They are skipped when merging config
  # for `explain` so a shared config file does not trip validate_explain_only!.
- EXPORT_ONLY_CONFIG_KEYS = %w[output_dir output_format insert_only after_insert_hook].freeze
+ EXPORT_ONLY_CONFIG_KEYS = %w[output_dir output_format insert_only after_insert_hook parallel_workers].freeze
  
  def self.start(argv)
    new(argv).run
@@ -80,6 +81,7 @@ module Exwiw
  @output_format = nil
  @insert_only = nil
  @after_insert_hook_path = nil
+ @parallel_workers = nil
  # nil (not :info) so we can tell "user passed --log-level" from the default,
  # letting a config-file value fill in; the :info default is applied later.
  @log_level = nil
@@ -125,6 +127,7 @@ module Exwiw
    output_format: @output_format,
    insert_only: @insert_only,
    after_insert_hook_path: @after_insert_hook_path,
+   parallel_workers: @parallel_workers,
    cli_options: build_cli_options_hash,
    logger: logger,
  ).run
@@ -165,6 +168,7 @@ module Exwiw
  resolve_scope_column!
  resolve_ids_field!
  resolve_uri_option!
+ resolve_parallel_workers!
  
  if @subcommand == "explain"
    validate_explain_only!
@@ -316,6 +320,7 @@ module Exwiw
    end
    @ids_field ||= config["ids_field"]
    @scope_column ||= config["scope_column"]
+   @parallel_workers ||= parse_parallel_workers(config["parallel_workers"]) if config.key?("parallel_workers")
  end
  
  # Strip a trailing slash (like the CLI's dir options) and expand relative to
@@ -409,6 +414,38 @@ module Exwiw
    end
  end
  
+ # `--parallel-workers` opts into the MongoDB fork-parallel dump schedule
+ # (docs/mongodb-dump-parallelism-2x-notes.md). It is mongodb-only (the SQL
+ # adapters shell out to their own dumpers) and must be a positive integer;
+ # N<2 is accepted but runs serially. Runs after the adapter name is normalized
+ # so the family check is reliable. `explain` rejection is handled separately
+ # by validate_explain_only!.
+ private def resolve_parallel_workers!
+   return if @parallel_workers.nil?
+
+   if @database_adapter != "mongodb"
+     $stderr.puts "--parallel-workers is only supported by the mongodb adapter"
+     exit 1
+   end
+
+   if @parallel_workers < 1
+     $stderr.puts "--parallel-workers must be a positive integer (got #{@parallel_workers})"
+     exit 1
+   end
+ end
+
+ # Coerce a config-file `parallel_workers` (YAML scalar) to Integer, matching
+ # the CLI flag's Integer coercion. A non-integer value is a config typo, so
+ # fail fast rather than silently dropping it.
+ private def parse_parallel_workers(value)
+   return nil if value.nil?
+
+   Integer(value)
+ rescue ArgumentError, TypeError
+   $stderr.puts "config 'parallel_workers' must be an integer (got #{value.inspect})"
+   exit 1
+ end
+
  private def validate_explain_only!
    if @database_adapter == "mongodb"
      $stderr.puts "mongodb adapter is not yet supported by 'explain' subcommand"
@@ -420,6 +457,7 @@ module Exwiw
  rejected << "--output-format" unless @output_format.nil?
  rejected << "--insert-only" unless @insert_only.nil?
  rejected << "--after-insert-hook" unless @after_insert_hook_path.nil?
+ rejected << "--parallel-workers" unless @parallel_workers.nil?
  
  unless rejected.empty?
    $stderr.puts "The following options are not applicable in 'explain' subcommand: #{rejected.join(', ')}"
@@ -526,6 +564,7 @@ module Exwiw
  opts.on("--after-insert-hook=PATH", "Path to a .rb or .sh post-processing hook executed after all insert/delete files are written (export subcommand only)") do |v|
    @after_insert_hook_path = File.expand_path(v)
  end
+ opts.on("--parallel-workers=N", Integer, "Fork N workers for the MongoDB dump's parallel schedule (mongodb + export only; N>=2 enables it, default is serial). Output is byte-identical to serial; falls back to serial where fork is unavailable.") { |v| @parallel_workers = v }
  opts.on("--log-level=LEVEL", "Log level (debug, info). default is info") { |v| @log_level = v.to_sym }
  
  opts.on("--help", "Print this help") do
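The config-file coercion in `parse_parallel_workers` relies on Ruby's `Kernel#Integer`. The sketch below shows how that conversion treats a few representative inputs. `coerce_workers` is a hypothetical stand-in that returns `:invalid` where the CLI would print an error and exit.

```ruby
# Hypothetical helper mirroring the coercion style of parse_parallel_workers,
# but returning :invalid instead of exiting the process.
def coerce_workers(value)
  return nil if value.nil?

  Integer(value)
rescue ArgumentError, TypeError
  :invalid
end

p coerce_workers("4")    # numeric String: converted
p coerce_workers(4)      # Integer: passed through
p coerce_workers("four") # non-numeric String: ArgumentError, so :invalid
p coerce_workers([])     # unsupported type: TypeError, so :invalid
p coerce_workers(2.5)    # Float: truncated to 2, no error raised
```

One consequence worth knowing: `Integer` truncates a Float rather than raising, so with this style of coercion a YAML value such as `2.5` would be accepted as `2`.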
@@ -0,0 +1,290 @@
+ # frozen_string_literal: true
+
+ require "set"
+ require "fileutils"
+ require "tmpdir"
+
+ module Exwiw
+   # Runs the inter-collection fork schedule from
+   # docs/mongodb-dump-parallelism-2x-notes.md, producing output **byte-identical**
+   # to the serial Runner while parallelizing the dominant cost (the Mongo driver's
+   # BSON->Ruby decode) across processes — each worker decodes its own collections
+   # in their natural order, so order is preserved and the result still matches a
+   # serial dump.
+   #
+   # It consumes the static, config-derived classification from MongodbParallelPlan
+   # (the three groups + cascade adjacency + ref_bt components) and adds the live
+   # orchestration the plan deliberately leaves out: a fork pool per group, LPT
+   # bin-packing on a per-collection cost weight, @state Marshal sidecar IPC for the
+   # handful of referenced leaves, and the Phase-2 cascade reprocess.
+   #
+   # The schedule (one parent process + a pool of `workers` forks):
+   #
+   #   Phase 1 (concurrent): fork the leaf pool; the parent meanwhile dumps the
+   #     schema and processes the WHOLE genuine DAG optimistically (no leaf @state
+   #     yet), recording each genuine collection's row count.
+   #   Barrier: wait for the leaf pool; load the Marshal sidecars the consumed
+   #     leaves wrote into the parent's @state.
+   #   Phase 2 (cascade): reprocess only the genuine collections whose output can
+   #     change now that leaf @state is present (the direct-leaf referencers),
+   #     cascading to genuine children of any whose row count actually changed.
+   #   Phase 3: fork the ref_bt collections as dependency-closed components, each
+   #     worker owning whole components (processed in topological order) seeded with
+   #     the leaf @state its members reference.
+   #
+   # Output bytes are independent of the schedule: every collection writes its own
+   # insert-NNN-<name>.<ext> file (the index taken over the plan's full ordering,
+   # exactly as the serial Runner numbers them) and the per-collection write is the
+   # same build_query -> execute -> write_inserts pass the Runner performs. The
+   # bin-packing only decides which worker runs which collection, never the bytes.
+   #
+   # fork is required; callers must check {.available?} and fall back to the serial
+   # Runner on JRuby/TruffleRuby/Windows.
+   class MongodbParallelDumper
+     # True when the runtime can `fork` (CRuby on a POSIX OS). On JRuby/TruffleRuby
+     # and Windows it cannot — the caller must run the serial Runner instead.
+     def self.available?
+       Process.respond_to?(:fork)
+     end
+
+     # Longest-Processing-Time bin-packing: assign `items` to `bins` bins, heaviest
+     # first onto the currently least-loaded bin. Returns an Array of `bins` arrays
+     # (some may be empty when items < bins). `weight` is called exactly once per
+     # item (it may be DB-backed, so it must not be invoked repeatedly). Pure — no
+     # DB, no IO — so it is unit-tested directly.
+     def self.bin_pack(items, bins, &weight)
+       raise ArgumentError, "bins must be >= 1 (got #{bins})" if bins < 1
+
+       weighted = items.map { |item| [item, weight.call(item)] }.sort_by { |(_, w)| -w }
+       groups = Array.new(bins) { [] }
+       loads = Array.new(bins, 0)
+       weighted.each do |(item, w)|
+         i = (0...bins).min_by { |j| loads[j] }
+         groups[i] << item
+         loads[i] += w
+       end
+       groups
+     end
+
+     # @param connection_config [ConnectionConfig] used to build a FRESH adapter in
+     #   the parent and in every fork (a Mongo client cannot be shared across fork)
+     # @param plan [MongodbParallelPlan] the static classification for this dump
+     # @param dump_target [DumpTarget]
+     # @param table_by_name [Hash{String=>config}] ALL configs (embedded included),
+     #   exactly as Runner builds it
+     # @param output_dir [String]
+     # @param workers [Integer] fork pool size (>= 1)
+     # @param logger [Logger]
+     # @param weight_for [#call, nil] optional name -> numeric cost weight for LPT;
+     #   defaults to the adapter's metadata-only estimated document count
+     def initialize(connection_config:, plan:, dump_target:, table_by_name:, output_dir:, workers:, logger:, weight_for: nil)
+       raise ArgumentError, "workers must be >= 1 (got #{workers})" if workers < 1
+
+       @connection_config = connection_config
+       @plan = plan
+       @dump_target = dump_target
+       @table_by_name = table_by_name
+       @output_dir = output_dir
+       @workers = workers
+       @logger = logger
+       @weight_for = weight_for
+     end
+
+     # Execute the full schedule. Assumes the caller has already cleaned the output
+     # directory (the Runner does this before handing off), mirroring the serial
+     # path which dumps the schema into a freshly-cleaned dir. Returns a small stats
+     # Hash. Raises if any worker pool reports a non-zero exit.
+     def run
+       raise "fork is unavailable on this runtime; run the serial Runner instead" unless self.class.available?
+
+       FileUtils.mkdir_p(@output_dir)
+       parent = build_adapter
+
+       Dir.mktmpdir("exwiw-mongo-parallel-") do |sidecar_dir|
+         phase1_leaf_and_genuine(parent, sidecar_dir)
+         phase2_cascade(parent, sidecar_dir)
+         phase3_ref_components(parent, sidecar_dir)
+       end
+
+       {
+         workers: @workers,
+         genuine: @plan.genuine.size,
+         leaves: @plan.leaves.size,
+         ref_bt: @plan.ref_bt.size,
+         components: @plan.reference_components.map(&:size).sort.reverse,
+       }
+     end
+
+     private
+
+     # Phase 1: fork the leaf pool to run concurrently while the parent dumps the
+     # schema (parent-only, needs no @state) and processes the whole genuine DAG
+     # optimistically. The genuine row counts captured here seed the Phase-2 cascade.
+     def phase1_leaf_and_genuine(parent, sidecar_dir)
+       leaf_master = fork do
+         ok = run_leaf_pool(sidecar_dir)
+         exit!(ok ? 0 : 1)
+       end
+
+       schema_path = File.join(@output_dir, "insert-000-schema.#{parent.schema_output_extension}")
+       ordered_tables = @plan.ordered_all.map { |name| @table_by_name.fetch(name) }
+       @logger.info("Writing schema to #{schema_path}...")
+       parent.dump_schema(ordered_tables, schema_path)
+
+       @logger.info("Processing #{@plan.genuine.size} genuine collection(s) (parent, optimistic pass)...")
+       @row_counts = {}
+       @plan.genuine.each { |name| @row_counts[name] = process_collection(parent, name) }
+
+       Process.wait(leaf_master)
+       raise "exwiw parallel leaf pool failed (exit #{$?.exitstatus})" unless $?.exitstatus&.zero?
+     end
+
+     # Barrier + Phase 2: load the consumed-leaf @state the leaf workers handed back,
+     # then reprocess only the genuine collections whose output can change now that
+     # leaf @state is present, cascading to genuine children of any that changed.
+     def phase2_cascade(parent, sidecar_dir)
+       load_sidecars(parent, @plan.consumed_leaves, sidecar_dir)
+
+       queue = @plan.direct_leaf_genuine.dup
+       seen = Set.new
+       until queue.empty?
+         name = queue.shift
+         next if seen.include?(name)
+
+         seen << name
+         new_count = process_collection(parent, name)
+         next if new_count == @row_counts[name]
+
+         @row_counts[name] = new_count
+         @plan.genuine_children[name].each { |child| queue << child }
+       end
+       @logger.info("Cascade reprocessed #{seen.size} genuine collection(s) with leaf @state.") unless seen.empty?
+     end
+
+     # Phase 3: fork the ref_bt collections as dependency-closed weakly-connected
+     # components in a single pool (no level barriers, no cross-worker IPC). Each
+     # worker owns whole components and processes their members in topological order,
+     # seeded only with the leaf @state those members reference.
+     def phase3_ref_components(parent, sidecar_dir)
+       components = @plan.reference_components
+       return if components.empty?
+
+       leaf_state = parent.state
+       groups = self.class.bin_pack(components, @workers) { |component| component.sum { |name| weight_of(parent, name) } }
+
+       pids = groups.reject(&:empty?).map do |group|
+         members = group.flatten
+         seed = leaf_state.slice(*parents_of(members))
+         fork { run_component_worker(group, seed) }
+       end
+       ok = pids.map { |pid| Process.wait(pid); $?.exitstatus&.zero? }.all?
+       raise "exwiw parallel ref_bt pool failed" unless ok
+
+       @logger.info("Processed #{@plan.ref_bt.size} ref_bt collection(s) in #{groups.reject(&:empty?).size} worker(s).")
+     end
+
+     # Fork `@workers` leaf workers (LPT-packed on cost weight so the single heaviest
+     # leaf sits alone) and wait for them. Each worker writes a Marshal sidecar for
+     # the consumed leaves it produced. Runs inside the leaf_master fork, so its own
+     # weight adapter and the worker connections never touch the parent's.
+     def run_leaf_pool(sidecar_dir)
+       return true if @plan.leaves.empty?
+
+       weight_adapter = build_adapter
+       groups = self.class.bin_pack(@plan.leaves, @workers) { |name| weight_of(weight_adapter, name) }
+
+       pids = groups.reject(&:empty?).map do |group|
+         fork { run_leaf_worker(group, sidecar_dir) }
+       end
+       pids.map { |pid| Process.wait(pid); $?.exitstatus&.zero? }.all?
+     rescue StandardError => e
+       @logger.error("exwiw parallel leaf master error: #{e.class}: #{e.message}")
+       false
+     end
+
+     def run_leaf_worker(group, sidecar_dir)
+       adapter = build_adapter
+       group.each { |name| process_collection(adapter, name) }
+       group.each do |name|
+         next unless @plan.consumed_leaves.include?(name)
+         next unless adapter.state.key?(name)
+
+         File.binwrite(File.join(sidecar_dir, "#{name}.marshal"), Marshal.dump(adapter.state[name]))
+       end
+       exit!(0)
+     rescue StandardError => e
+       @logger.error("exwiw parallel leaf worker error (#{group.first}..): #{e.class}: #{e.message}")
+       exit!(1)
+     end
+
+     def run_component_worker(group, seed)
+       adapter = build_adapter
+       adapter.state = seed unless seed.empty?
+       # Each component is already topologically ordered (parent before child) and
+       # dependency-closed over intra-ref_bt edges, so a plain serial walk suffices.
+       group.each { |component| component.each { |name| process_collection(adapter, name) } }
+       exit!(0)
+     rescue StandardError => e
+       @logger.error("exwiw parallel ref_bt worker error (#{group.first&.first}..): #{e.class}: #{e.message}")
+       exit!(1)
+     end
+
+     # Extract one collection to its insert-NNN-<name>.<ext> file. This mirrors the
+     # serial Runner's non-COPY insert path exactly — same filename (index taken over
+     # the plan's full ordering), same pre/post hooks (nil for MongoDB), same
+     # streaming write_inserts + trailing "\n", and the same empty-result handling
+     # (delete the just-opened file) — so the bytes are identical regardless of which
+     # process writes them. Returns the row count.
+     def process_collection(adapter, name)
+       table = @table_by_name.fetch(name)
+       query = adapter.build_query(table, @dump_target, @table_by_name)
+       results = adapter.execute(query)
+
+       insert_idx = (@plan.index_of.fetch(name) + 1).to_s.rjust(3, "0")
+       path = File.join(@output_dir, "insert-#{insert_idx}-#{name}.#{adapter.output_extension}")
+       chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
+
+       record_num = 0
+       File.open(path, "w") do |file|
+         pre = adapter.pre_insert_sql(table)
+         file.puts(pre) if pre
+         _statement_count, record_num = adapter.write_inserts(file, results, table, chunk_size)
+         file.print("\n")
+         post = adapter.post_insert_sql(table)
+         file.puts(post) if post
+       end
+       File.delete(path) if record_num.zero?
+       record_num
+     end
+
+     # Merge the Marshal sidecars the leaf workers wrote (one per consumed leaf that
+     # actually produced rows) into `adapter`'s @state, so the cascade reprocess and
+     # the ref_bt workers can constrain on those leaf ids.
+     def load_sidecars(adapter, names, sidecar_dir)
+       state = adapter.state
+       names.each do |name|
+         path = File.join(sidecar_dir, "#{name}.marshal")
+         state[name] = Marshal.load(File.binread(path)) if File.exist?(path)
+       end
+     end
+
+     # The distinct belongs_to parent names of `names`, used to slice the leaf @state
+     # a worker is seeded with down to only the keys its collections reference.
+     def parents_of(names)
+       names.flat_map { |name| @table_by_name.fetch(name).belongs_tos.map(&:table_name) }.uniq
+     end
+
+     def weight_of(adapter, name)
+       return @weight_for.call(name) if @weight_for
+
+       adapter.estimated_count(name)
+     end
+
+     # A fresh adapter (and thus a fresh, lazily-opened Mongo connection). Built per
+     # process — the parent and every fork get their own; a Mongo client must never
+     # be shared across a fork boundary.
+     def build_adapter
+       Adapter.build(@connection_config, @logger)
+     end
+   end
+ end
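To see what the Longest-Processing-Time packing in `bin_pack` does with skewed weights, here is a standalone sketch of the same greedy rule. `lpt_pack`, the collection names, and the weights are all made up for illustration, and the method takes its weight through `yield` rather than an explicit block argument.

```ruby
# Standalone LPT sketch: heaviest item first, each onto the least-loaded bin.
def lpt_pack(items, bins)
  weighted = items.map { |item| [item, yield(item)] }.sort_by { |(_, w)| -w }
  groups = Array.new(bins) { [] }
  loads = Array.new(bins, 0)
  weighted.each do |(item, w)|
    i = (0...bins).min_by { |j| loads[j] } # first bin with the smallest load
    groups[i] << item
    loads[i] += w
  end
  groups
end

# Hypothetical document-count estimates per collection.
weights = { "events" => 900, "users" => 300, "orders" => 250, "tags" => 40, "roles" => 10 }
groups = lpt_pack(weights.keys, 2) { |name| weights.fetch(name) }
p groups # → [["events"], ["users", "orders", "tags", "roles"]]
```

With two bins the single heavy collection ends up alone (load 900) while the four lighter ones share the other bin (load 600), which is the behaviour the `run_leaf_pool` comment describes.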
@@ -0,0 +1,271 @@
+ # frozen_string_literal: true
+
+ require "set"
+
+ module Exwiw
+   # Classifies a MongoDB dump's collections into the three dependency groups the
+   # inter-collection fork schedule needs, plus the derived adjacency that
+   # schedule consumes. See docs/mongodb-dump-parallelism-2x-notes.md for the why;
+   # this class is the static, config-derived half of that plan.
+   #
+   # It is a pure function of the loaded configs and the dump target — no DB
+   # access — so it can be computed once up front and unit-tested without a live
+   # MongoDB. The fork orchestration (worker pools, LPT bin-packing on output-size
+   # weights, @state Marshal sidecars, the Phase-2 cascade loop) lives elsewhere
+   # and consumes the structures produced here.
+   #
+   # Input contract: `configs` are MongodbCollectionConfig already passed through
+   # `#reject_ignored_members!` (exactly as Runner#load_table_config produces
+   # them), so every surviving belongs_to has a non-nil `table_name`. ignore:true
+   # *collections* are still present in `configs` — they contribute to the schema
+   # and to the file-index ordering, but their data extraction is skipped — and
+   # are therefore excluded from the three processing groups.
+   #
+   # The three groups partition the extractable collections exactly:
+   #
+   # - **genuine** — reachable to the dump target by following belongs_to edges
+   #   (the scoped DAG). Includes the target itself.
+   # - **leaf** — no belongs_to at all: reference/master data dumped in full,
+   #   with no input dependencies (embarrassingly parallel).
+   # - **ref_bt** — has belongs_to but is NOT reachable to the target: reference
+   #   data scoped by the adapter's strict-AND fallback. Its
+   #   internal edges form shallow components.
+   #
+   # `reachable` mirrors MongodbAdapter#genuine_scope_set exactly (fixpoint over
+   # all non-embedded configs, including ignore:true ones), so the genuine set
+   # here matches the adapter's runtime scoping classification.
+   class MongodbParallelPlan
+     EMPTY_NAMES = [].freeze
+     private_constant :EMPTY_NAMES
+
+     # @param configs [Array<MongodbCollectionConfig>] reject_ignored_members!'d
+     # @param target_table_name [String] the dump target collection
+     # @param logger [Logger, nil] forwarded to DetermineTableProcessingOrder
+     def initialize(configs:, target_table_name:, logger: nil)
+       @by = configs.each_with_object({}) { |c, h| h[c.name] = c }
+       @target_table_name = target_table_name
+
+       dumpable = configs.reject(&:embedded?)
+       # The file index (insert-NNN-) is taken over the FULL processing order,
+       # including ignore:true collections, so the orchestrated run's filenames
+       # are byte-identical to the serial Runner's (which numbers files the same
+       # way). Data extraction, however, skips ignore:true — see #extractable.
+       @ordered_all = DetermineTableProcessingOrder.run(dumpable, logger: logger).freeze
+       @index_of = @ordered_all.each_with_index.to_h.freeze
+       @extractable = @ordered_all.reject { |n| @by[n].ignore }.freeze
+
+       @reachable = compute_reachable
+       classify
+       derive_consumed_leaves
+       derive_cascade_adjacency
+       @reference_components = compute_reference_components.freeze
+     end
+
+     # Full processing order, INCLUDING ignore:true collections — the sequence the
+     # file index (insert-NNN-) is numbered over.
+     attr_reader :ordered_all
+
+     # name => 0-based position in #ordered_all (the file index is position + 1).
+     attr_reader :index_of
+
+     # #ordered_all minus ignore:true collections — the collections whose data is
+     # actually extracted. Union of the three groups below.
+     attr_reader :extractable
+
+     # The three groups (each a subset of #extractable, in #ordered_all order):
+
+     # genuine — reachable to the dump target (includes the target).
+     attr_reader :genuine
+
+     # leaf — no belongs_to; reference/master data with no input dependencies.
+     attr_reader :leaves
+
+     # ref_bt — has belongs_to but not reachable to the target.
+     attr_reader :ref_bt
+
+     # ref_bt collections as dependency-closed weakly-connected components over
+     # intra-ref_bt belongs_to edges, each returned in a valid topological order
+     # (a parent before its child). A whole component can be processed serially by
+     # one worker with no cross-worker @state IPC and no level barriers, seeded
+     # only with the leaf @state its members reference.
+     attr_reader :reference_components
+
+     # Leaf collections referenced (via belongs_to) by some non-leaf extractable
+     # collection (genuine OR ref_bt). These are the only leaves whose captured
+     # @state a downstream collection can need, so they are the ones a leaf worker
+     # must hand back (e.g. as a Marshal sidecar). Set<String>.
+     attr_reader :consumed_leaves
+
+     # genuine collections that directly reference a leaf — the only genuine
+     # collections whose output can change once leaf @state is present (and only
+     # at runtime, when their genuine anchor turns out empty and they fall back to
+     # the leaf clause). These seed the Phase-2 cascade reprocess.
+     attr_reader :direct_leaf_genuine
+
+     # name => genuine children (genuine collections that belongs_to it), keyed
+     # only by reachable parents. Drives the Phase-2 cascade: when a reprocessed
+     # collection's row count changes, its genuine children are re-enqueued.
+     attr_reader :genuine_children
+
+     # The set of collection names genuinely scoped by the target (the target plus
+     # everything that can reach it through belongs_to). Exposed for inspection.
+     attr_reader :reachable
+
+     def summary
+       {
+         extractable: @extractable.size,
+         genuine: @genuine.size,
+         leaves: @leaves.size,
+         ref_bt: @ref_bt.size,
+         consumed_leaves: @consumed_leaves.size,
+         direct_leaf_genuine: @direct_leaf_genuine.size,
+         reference_components: @reference_components.map(&:size).sort.reverse,
+       }
+     end
+
+     private
+
128
+ # Fixpoint over non-embedded configs: the target, plus every collection that
129
+ # can reach it by following belongs_to (child -> parent) transitively.
130
+ # Mirrors MongodbAdapter#genuine_scope_set (same traversal, same inclusion of
131
+ # ignore:true collections) so the genuine set matches the adapter's runtime
132
+ # scoping decision.
133
+ def compute_reachable
134
+ reachable = Set.new([@target_table_name])
135
+ loop do
136
+ added = false
137
+ @by.each_value do |cfg|
138
+ next if cfg.embedded? || reachable.include?(cfg.name)
139
+ next unless cfg.belongs_tos.any? { |rel| reachable.include?(rel.table_name) }
140
+
141
+ reachable << cfg.name
142
+ added = true
143
+ end
144
+ break unless added
145
+ end
146
+ reachable
147
+ end
148
+
149
+ def classify
150
+ # The three groups partition #extractable: reachable -> genuine; otherwise
151
+ # leaf (no belongs_to) -> leaves; otherwise -> ref_bt. The target is
152
+ # reachable (it seeds the set), so it lands in genuine and is never
153
+ # mis-grouped as a leaf even when it has no belongs_to of its own — which
154
+ # would otherwise double-process it (leaf pool AND parent).
155
+ @genuine = []
156
+ @leaves = []
157
+ @ref_bt = []
158
+ @extractable.each do |name|
159
+ if @reachable.include?(name)
160
+ @genuine << name
161
+ elsif leaf?(name)
162
+ @leaves << name
163
+ else
164
+ @ref_bt << name
165
+ end
166
+ end
167
+ @genuine.freeze
168
+ @leaves.freeze
169
+ @ref_bt.freeze
170
+ # Membership against the leaf *group* (which excludes the target), not the
171
+ # raw structural #leaf? predicate. A target with no belongs_to of its own is
172
+ # structurally leaf-like, but it is genuine — processed by the parent, not a
173
+ # leaf worker — so a belongs_to to the target must not count as referencing
174
+ # a leaf (it would wrongly demand a sidecar / seed the cascade).
175
+ @leaf_set = @leaves.to_set
176
+ end
177
+
178
+ def derive_consumed_leaves
179
+ consumed = Set.new
180
+ (@genuine + @ref_bt).each do |name|
181
+ @by[name].belongs_tos.each do |rel|
182
+ consumed << rel.table_name if @leaf_set.include?(rel.table_name)
183
+ end
184
+ end
185
+ @consumed_leaves = consumed.freeze
186
+ end
187
+
188
+ def derive_cascade_adjacency
189
+ @direct_leaf_genuine = @genuine.select do |name|
190
+ @by[name].belongs_tos.any? { |rel| @leaf_set.include?(rel.table_name) }
191
+ end.freeze
192
+
193
+ children = Hash.new { |h, k| h[k] = [] }
194
+ @genuine.each do |name|
195
+ @by[name].belongs_tos.each do |rel|
196
+ children[rel.table_name] << name if @reachable.include?(rel.table_name)
197
+ end
198
+ end
199
+ # Freeze with a non-mutating default so a lookup of a parent with no genuine
200
+ # children returns [] without trying to write into the frozen hash.
201
+ children.default_proc = nil
202
+ children.default = EMPTY_NAMES
203
+ @genuine_children = children.freeze
204
+ end
205
+
206
+ # ref_bt as dependency-closed weakly-connected components over intra-ref_bt
207
+ # belongs_to edges, each topo-ordered. Ported from the bench prototype: build
208
+ # the directed (child indegree) and undirected (component) views of the
209
+ # intra-ref_bt edges, find weakly-connected components, then Kahn-order each.
210
+ def compute_reference_components
211
+ ref_set = @ref_bt.to_set
212
+ children = Hash.new { |h, k| h[k] = [] }
213
+ adjacency = Hash.new { |h, k| h[k] = [] }
214
+ @ref_bt.each do |name|
215
+ @by[name].belongs_tos.each do |rel|
216
+ next unless ref_set.include?(rel.table_name)
217
+
218
+ children[rel.table_name] << name
219
+ adjacency[rel.table_name] << name
220
+ adjacency[name] << rel.table_name
221
+ end
222
+ end
223
+
224
+ seen = Set.new
225
+ components = []
226
+ @ref_bt.each do |start|
227
+ next if seen.include?(start)
228
+
229
+ stack = [start]
230
+ members = []
231
+ until stack.empty?
232
+ node = stack.pop
233
+ next if seen.include?(node)
234
+
235
+ seen << node
236
+ members << node
237
+ adjacency[node].each { |neighbor| stack << neighbor unless seen.include?(neighbor) }
238
+ end
239
+ components << members
240
+ end
241
+
242
+ components.map { |members| topo_order(members, children) }
243
+ end
244
+
245
+ # Kahn topological order of `members` over intra-component belongs_to edges
246
+ # (parent before child). `children` is the directed intra-ref_bt adjacency.
247
+ def topo_order(members, children)
248
+ member_set = members.to_set
249
+ indegree = members.to_h do |name|
250
+ [name, @by[name].belongs_tos.count { |rel| member_set.include?(rel.table_name) }]
251
+ end
252
+ queue = members.select { |name| indegree[name].zero? }
253
+ ordered = []
254
+ until queue.empty?
255
+ node = queue.shift
256
+ ordered << node
257
+ children[node].each do |child|
258
+ next unless member_set.include?(child)
259
+
260
+ indegree[child] -= 1
261
+ queue << child if indegree[child].zero?
262
+ end
263
+ end
264
+ ordered
265
+ end
266
+
267
+ def leaf?(name)
268
+ (cfg = @by[name]) && !cfg.embedded? && cfg.belongs_tos.empty?
269
+ end
270
+ end
271
+ end
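The grouping that the plan's comments describe can be tried on a toy schema. What follows is a minimal sketch, not the gem's code: `Cfg`, `toy_configs`, `reachable_from`, `classify` and `consumed_leaves` are hypothetical names introduced here, and the collection names are invented. It assumes only what the diff shows: a fixpoint over `belongs_to` edges starting at the target, then a three-way partition into genuine, leaves and ref_bt, then the subset of leaves that some non-leaf collection references.

```ruby
require "set"

# Hypothetical stand-in for a collection config: a name plus the names of the
# collections it belongs_to. The real config objects carry much more.
Cfg = Struct.new(:name, :parents)

# Toy schema (illustrative names only), listed in processing order.
def toy_configs
  [
    Cfg.new("shops", []),                     # dump target
    Cfg.new("orders", %w[shops currencies]),  # reaches the target directly
    Cfg.new("order_items", %w[orders]),       # reaches it transitively
    Cfg.new("currencies", []),                # reference data used by orders
    Cfg.new("countries", []),                 # reference data used by regions
    Cfg.new("timezones", []),                 # reference data used by nobody
    Cfg.new("regions", %w[countries]),        # has belongs_to, cannot reach target
  ]
end

# Fixpoint: start from the target and keep adding any collection that
# belongs_to something already in the set, until a full pass adds nothing.
def reachable_from(target, configs)
  reachable = Set.new([target])
  loop do
    added = false
    configs.each do |cfg|
      next if reachable.include?(cfg.name)
      next unless cfg.parents.any? { |parent| reachable.include?(parent) }

      reachable << cfg.name
      added = true
    end
    break unless added
  end
  reachable
end

# Partition: reachable -> genuine; otherwise no parents -> leaves;
# otherwise -> ref_bt. Input order is preserved inside each group.
def classify(target, configs)
  reachable = reachable_from(target, configs)
  groups = { genuine: [], leaves: [], ref_bt: [] }
  configs.each do |cfg|
    key =
      if reachable.include?(cfg.name)
        :genuine
      elsif cfg.parents.empty?
        :leaves
      else
        :ref_bt
      end
    groups[key] << cfg.name
  end
  groups
end

# Leaves referenced by at least one genuine or ref_bt collection.
def consumed_leaves(groups, configs)
  leaf_set = groups[:leaves].to_set
  by_name = configs.to_h { |cfg| [cfg.name, cfg] }
  (groups[:genuine] + groups[:ref_bt])
    .flat_map { |name| by_name.fetch(name).parents }
    .select { |parent| leaf_set.include?(parent) }
    .to_set
end

groups = classify("shops", toy_configs)
p groups
p consumed_leaves(groups, toy_configs)
```

In this toy schema `shops` has no `belongs_to`, yet it lands in the genuine group because it seeds the reachable set, which is the case the `classify` comment calls out. `timezones` is a leaf that nothing references, so it is dumped but is not in the consumed set.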
data/lib/exwiw/runner.rb CHANGED
@@ -13,6 +13,7 @@ module Exwiw
13
13
  output_format: 'insert',
14
14
  insert_only: false,
15
15
  after_insert_hook_path: nil,
16
+ parallel_workers: nil,
16
17
  cli_options: {}
17
18
  )
18
19
  @connection_config = connection_config
@@ -22,6 +23,7 @@ module Exwiw
22
23
  @output_format = output_format
23
24
  @insert_only = insert_only
24
25
  @after_insert_hook_path = after_insert_hook_path
26
+ @parallel_workers = parallel_workers
25
27
  @cli_options = cli_options
26
28
  @logger = logger
27
29
  end
@@ -49,6 +51,19 @@ module Exwiw
49
51
 
50
52
  clean_output_dir!
51
53
 
54
+ # Opt-in MongoDB inter-collection fork parallelism (see
55
+ # docs/mongodb-dump-parallelism-2x-notes.md). It is byte-identical to the
56
+ # serial loop below — same filenames (the file index is taken over the same
57
+ # full processing order) and same per-collection bytes — so it is a drop-in
58
+ # replacement for the whole schema+inserts pass, after which the common
59
+ # after-insert hook still runs. Everything before this point (validation,
60
+ # scope check, ordering, output-dir clean) applies to both paths.
61
+ if use_mongodb_parallel?(adapter)
62
+ dump_mongodb_parallel(configs, table_by_name)
63
+ run_after_insert_hook(adapter, ordered_table_names.size)
64
+ return
65
+ end
66
+
52
67
  ordered_tables = ordered_table_names.map { |n| table_by_name.fetch(n) }
53
68
  schema_path = File.join(@output_dir, "insert-000-schema.#{adapter.schema_output_extension}")
54
69
  @logger.info("Writing schema to #{schema_path}...")
@@ -161,17 +176,71 @@ module Exwiw
161
176
  end
162
177
  end
163
178
 
164
- if @after_insert_hook_path
165
- @logger.info("Running after-insert hook: #{@after_insert_hook_path}")
166
- AfterInsertHook.run(
167
- path: @after_insert_hook_path,
168
- cli_options: @cli_options,
169
- output_dir: @output_dir,
170
- next_idx: total_size + 1,
171
- output_extension: adapter.output_extension,
172
- logger: @logger,
173
- )
179
+ run_after_insert_hook(adapter, total_size)
180
+ end
181
+
182
+ # Run the post-processing hook (no-op when none configured). `total_size` is
183
+ # the count of processed tables/collections; the hook's first output file is
184
+ # numbered just past them. Shared by the serial and parallel dump paths.
185
+ private def run_after_insert_hook(adapter, total_size)
186
+ return unless @after_insert_hook_path
187
+
188
+ @logger.info("Running after-insert hook: #{@after_insert_hook_path}")
189
+ AfterInsertHook.run(
190
+ path: @after_insert_hook_path,
191
+ cli_options: @cli_options,
192
+ output_dir: @output_dir,
193
+ next_idx: total_size + 1,
194
+ output_extension: adapter.output_extension,
195
+ logger: @logger,
196
+ )
197
+ end
198
+
199
+ # True when the opt-in MongoDB fork-parallel dump should run instead of the
200
+ # serial loop: the mongodb adapter, a worker count > 1, a genuine-anchor dump
201
+ # target (the schedule is built around the scoped DAG), and a runtime that can
202
+ # fork. Anything else falls back to the serial path (warning when the user
203
+ # explicitly asked for parallelism but it cannot apply).
204
+ private def use_mongodb_parallel?(adapter)
205
+ return false unless adapter.is_a?(Adapter::MongodbAdapter)
206
+ return false unless @parallel_workers && @parallel_workers > 1
207
+
208
+ if @dump_target.table_name.nil?
209
+ @logger.warn("--parallel-workers ignored: MongoDB parallelism needs a --target-collection; running serially.")
210
+ return false
174
211
  end
212
+
213
+ unless MongodbParallelDumper.available?
214
+ @logger.warn("--parallel-workers ignored: fork is unavailable on this runtime; running serially.")
215
+ return false
216
+ end
217
+
218
+ true
219
+ end
220
+
221
+ # Build the static plan and hand the whole schema+inserts pass to the fork
222
+ # orchestrator. `configs` have already been through reject_ignored_members!; the
223
+ # plan drops embedded configs and orders the rest itself, as the serial path does.
224
+ private def dump_mongodb_parallel(configs, table_by_name)
225
+ plan = MongodbParallelPlan.new(
226
+ configs: configs,
227
+ target_table_name: @dump_target.table_name,
228
+ logger: @logger,
229
+ )
230
+ @logger.info(
231
+ "MongoDB parallel dump with #{@parallel_workers} worker(s): " \
232
+ "genuine=#{plan.genuine.size}, leaves=#{plan.leaves.size}, ref_bt=#{plan.ref_bt.size}."
233
+ )
234
+ stats = MongodbParallelDumper.new(
235
+ connection_config: @connection_config,
236
+ plan: plan,
237
+ dump_target: @dump_target,
238
+ table_by_name: table_by_name,
239
+ output_dir: @output_dir,
240
+ workers: @parallel_workers,
241
+ logger: @logger,
242
+ ).run
243
+ @logger.info("MongoDB parallel dump complete: #{stats.inspect}")
175
244
  end
176
245
 
177
246
  # Empty the output dir before writing so each export starts from a clean
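The guard order in `use_mongodb_parallel?` above can be exercised in isolation. The sketch below is an illustration under stated assumptions, not the gem's implementation: `fork_available?` and `parallel_dump?` are hypothetical names, the boolean `mongodb:` argument stands in for the adapter type check, and the warning texts are shortened. The body of `MongodbParallelDumper.available?` is not part of this diff; the sketch assumes a check along the lines of `Process.respond_to?(:fork)`, which Ruby documents as returning false on platforms where `fork` is not implemented.

```ruby
require "logger"
require "stringio"

# Hypothetical fork check (assumption: the real available? is not shown).
def fork_available?
  Process.respond_to?(:fork)
end

# Same guard order as the diff: stay silent when parallelism was not
# requested, warn when it was requested but cannot apply.
def parallel_dump?(mongodb:, workers:, target:, logger:, can_fork: fork_available?)
  return false unless mongodb
  return false unless workers && workers > 1

  if target.nil?
    logger.warn("parallel workers ignored: no dump target; running serially.")
    return false
  end

  unless can_fork
    logger.warn("parallel workers ignored: fork unavailable; running serially.")
    return false
  end

  true
end

log = StringIO.new
logger = Logger.new(log)

# Requested and applicable.
puts parallel_dump?(mongodb: true, workers: 4, target: "shops", logger: logger, can_fork: true)
# Not requested (one worker): serial, no warning.
puts parallel_dump?(mongodb: true, workers: 1, target: "shops", logger: logger, can_fork: true)
# Requested but no target: serial, with a warning in the log.
puts parallel_dump?(mongodb: true, workers: 4, target: nil, logger: logger, can_fork: true)
puts log.string
```

Passing `can_fork:` explicitly keeps the example deterministic across runtimes; leaving it out falls back to the runtime check.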
data/lib/exwiw/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Exwiw
4
- VERSION = "0.8.4"
4
+ VERSION = "0.8.5"
5
5
  end
data/lib/exwiw.rb CHANGED
@@ -23,6 +23,8 @@ require_relative "exwiw/adapter/mysql_adapter"
23
23
  require_relative "exwiw/adapter/postgresql_adapter"
24
24
  require_relative "exwiw/adapter/mongodb_adapter"
25
25
  require_relative "exwiw/determine_table_processing_order"
26
+ require_relative "exwiw/mongodb_parallel_plan"
27
+ require_relative "exwiw/mongodb_parallel_dumper"
26
28
  require_relative "exwiw/mongo_query"
27
29
  require_relative "exwiw/query_ast"
28
30
  require_relative "exwiw/query_ast_builder"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: exwiw
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.8.4
4
+ version: 0.8.5
5
5
  platform: ruby
6
6
  authors:
7
7
  - Shia
@@ -72,6 +72,8 @@ files:
72
72
  - lib/exwiw/mongo_query.rb
73
73
  - lib/exwiw/mongodb_collection_config.rb
74
74
  - lib/exwiw/mongodb_field.rb
75
+ - lib/exwiw/mongodb_parallel_dumper.rb
76
+ - lib/exwiw/mongodb_parallel_plan.rb
75
77
  - lib/exwiw/mongoid_schema_generator.rb
76
78
  - lib/exwiw/query_ast.rb
77
79
  - lib/exwiw/query_ast_builder.rb