exwiw 0.8.3 → 0.8.5
- checksums.yaml +4 -4
- data/CHANGELOG.md +10 -0
- data/README.md +49 -24
- data/docs/mongodb-dump-parallelism-2x-notes.md +21 -10
- data/lib/exwiw/adapter/mongodb_adapter.rb +19 -0
- data/lib/exwiw/adapter/mysql_adapter.rb +39 -3
- data/lib/exwiw/adapter/postgresql_adapter.rb +41 -3
- data/lib/exwiw/adapter/sqlite_adapter.rb +32 -3
- data/lib/exwiw/adapter.rb +37 -0
- data/lib/exwiw/cli.rb +40 -1
- data/lib/exwiw/mongodb_parallel_dumper.rb +290 -0
- data/lib/exwiw/mongodb_parallel_plan.rb +271 -0
- data/lib/exwiw/runner.rb +79 -10
- data/lib/exwiw/version.rb +1 -1
- data/lib/exwiw.rb +2 -0
- metadata +3 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 3934deb015b3ede7c6a8caa25b857d1f7c8957b4efa3284cd26f21ff2620904d
|
|
4
|
+
data.tar.gz: e44fef95c3c273aa96b1dde37736212687da88d8d9b98abb5c44fb305d45b7a3
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 1bf410fb503b270aca3f96191f34cb8ee67730ac49ea6a57d34550d2b603fed7404206d405c61f6dd3b67808f8865d73c80121d303c628dda513ebcf9c418b17
|
|
7
|
+
data.tar.gz: d1cbb8bd0ed7bd10f53d3d3169985844d4e5e509041c95a18d083dff458a06f26c09e325fea44bc5c1e14bf65b53fdc79e4dbe351f3e4e5f916e6c9135cfa71e
|
data/CHANGELOG.md
CHANGED
|
@@ -2,6 +2,16 @@
|
|
|
2
2
|
|
|
3
3
|
## [Unreleased]
|
|
4
4
|
|
|
5
|
+
## [0.8.5] - 2026-06-25
|
|
6
|
+
|
|
7
|
+
- **`--parallel-workers=N` parallelizes the MongoDB dump across forked processes (opt-in, byte-identical).** When set with `N≥2` on the mongodb adapter's `export`, exwiw runs an inter-collection fork schedule that decodes whole collections in parallel while preserving each collection's natural row order, so the output files are byte-identical to a serial run (same filenames, same content). Collections are classified into three dependency groups — reference data dumped in full (no `belongs_to`), the scoped DAG reachable to the dump target, and non-reachable reference data — which lets the heavy full-dump collections run concurrently with the scoped pass; only a handful of small `@state` hand-offs cross process boundaries. The win needs real cores: it clears ~2× from 4 workers and saturates there. Requires a dump target and a `fork`-capable runtime (CRuby on POSIX); it falls back to the serial path on JRuby/TruffleRuby/Windows or when no target is given. Also settable as `parallel_workers:` in the config file. The default remains the serial dump.
|
|
8
|
+
|
|
9
|
+
## [0.8.4] - 2026-06-24
|
|
10
|
+
|
|
11
|
+
### Fixed
|
|
12
|
+
|
|
13
|
+
- **Scope id-sets are materialized and probed by JOIN instead of `<col> IN (subquery)`, removing a correlated full-scan on large tables.** The three id-set scope shapes — the multi-referencer `reverse_scope` `UNION`, the single-referencer reverse/`referenced_by` extraction, and the multi-hop forward (`via_scoped_parent`) cascade — were emitted as `<col> IN (<subquery>)`. On a large global-identity table (e.g. `users`) MySQL cannot turn a `UNION` subquery into a materialized semi-join and falls back to its IN-to-`EXISTS` rewrite: a correlated `DEPENDENT SUBQUERY`/`DEPENDENT UNION` re-evaluated **per outer row**, so the driving table is full-scanned and the union re-run for every row (the plan that ran for minutes and timed out on a production-scale identity table, even for an empty tenant). These clauses are now lifted into a `JOIN` against a materialized derived table — `JOIN (SELECT DISTINCT src.<id> AS exwiw_scope_id FROM (<id-set subquery>) AS src) AS ids ON <table>.<col> = ids.exwiw_scope_id` — so the engine evaluates the id set **once** (the `DISTINCT` makes the derived table non-mergeable, hence materialized) and probes the outer table by primary key. The `DISTINCT` also dedups, so the result set is identical to the old `IN` form; the cascade nests the same way (each level materialized once); NULL exclusion, the forward-path cycle guard, single-parent/polymorphic skips, and PostgreSQL's `uuid`/`varchar` `::text` reconciliation are all preserved. All three SQL adapters (mysql / postgresql / sqlite). See the [README](README.md#why-a-join-not-in-subquery).
|
|
14
|
+
|
|
5
15
|
## [0.8.3] - 2026-06-24
|
|
6
16
|
|
|
7
17
|
### Fixed
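The fallback rule in the 0.8.5 entry above (parallel only with the mongodb adapter, `N≥2`, a dump target, and a runtime that can `fork`) can be sketched as one predicate. This is a minimal illustration, not the gem's code: `parallel_dump?` and its keyword arguments are hypothetical names. The only real API used is `Process.respond_to?(:fork)`, which Ruby documents as returning false where `fork` is not usable.

```ruby
# Hypothetical helper (not part of exwiw): decides whether a dump may take the
# fork-parallel path or must fall back to the serial one.
# `can_fork` defaults to Ruby's own capability check and is a parameter only
# so the rule can be exercised on any platform.
def parallel_dump?(adapter:, workers:, target:, can_fork: Process.respond_to?(:fork))
  adapter == "mongodb" && workers.to_i >= 2 && !target.nil? && can_fork
end

# All four conditions hold, so the parallel path is allowed.
puts parallel_dump?(adapter: "mongodb", workers: 4, target: "users", can_fork: true)

# No dump target: serial fallback.
puts parallel_dump?(adapter: "mongodb", workers: 4, target: nil, can_fork: true)
```

Keeping the check in one place means every unmet condition leads to the same serial path, which matches the entry's statement that the default remains the serial dump.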
|
data/README.md
CHANGED
|
@@ -179,15 +179,18 @@ Each table is resolved as follows:
|
|
|
179
179
|
nearest such table and applies the scope filter there (the same join machinery
|
|
180
180
|
the single-target mode uses).
|
|
181
181
|
- **`belongs_to` a parent that is itself scoped but carries no scope column of its
|
|
182
|
-
own** → exwiw constrains this table to the parent's in-scope ids
|
|
183
|
-
|
|
184
|
-
|
|
185
|
-
|
|
186
|
-
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
182
|
+
own** → exwiw constrains this table to the parent's in-scope ids by joining it to
|
|
183
|
+
the parent's scoped query, materialized as a derived table
|
|
184
|
+
(`JOIN (SELECT DISTINCT parent.pk … FROM <parent's scoped query>) … ON fk = …`).
|
|
185
|
+
This covers a *hub* table that has no scope column and is scoped only because an
|
|
186
|
+
extractable child references it (see referenced-by below): the hub's other
|
|
187
|
+
`belongs_to` children ride along to just the in-scope rows instead of being dumped
|
|
188
|
+
in full. The parent itself may be scoped the same way, so this **cascades across
|
|
189
|
+
multiple hops** (each a single unambiguous scopable parent) and the derived-table
|
|
190
|
+
JOINs nest correspondingly; the recursion terminates on a genuine `belongs_to`
|
|
191
|
+
cycle (a table already on the path is left `:unscopable` rather than looped on).
|
|
192
|
+
(See [Why a JOIN, not `IN (subquery)`](#why-a-join-not-in-subquery) for the
|
|
193
|
+
materialization rationale.)
|
|
191
194
|
- **Cannot be scoped at all** (no scope column and no path to one) → exwiw
|
|
192
195
|
**aborts** and lists the offending tables, so an unscoped table is never silently
|
|
193
196
|
dumped in full. For each, either declare a `scope_column`, add a `belongs_to`
|
|
@@ -568,15 +571,19 @@ The same type filter is applied on the join path — and in the matching `delete
|
|
|
568
571
|
ActiveStorage is handled automatically — no ActiveStorage-specific configuration is required. The `has_one_attached` / `has_many_attached` macros don't add a column to the owning model; they generate ordinary associations that exwiw already understands:
|
|
569
572
|
|
|
570
573
|
- **`active_storage_attachments`** is the polymorphic join row (`belongs_to :record, polymorphic: true` + `belongs_to :blob`). `exwiw:schema:generate` expands the polymorphic `record` into one `belongs_to` per model that declared `has_*_attached` (found via the generated `has_* ..., as: :record` reflections), exactly like any other [polymorphic `belongs_to`](#polymorphic-belongs_to). So only the attachments whose owner is among the dumped rows are extracted.
|
|
571
|
-
- **`active_storage_blobs`** has no `belongs_to` of its own (attachments point *at* it), so it has no path to the dump target. exwiw narrows it via **reverse / "referenced_by" extraction**: a parent table referenced by exactly one constrained, non-polymorphic child is constrained to just the referenced ids instead of dumping every row:
|
|
574
|
+
- **`active_storage_blobs`** has no `belongs_to` of its own (attachments point *at* it), so it has no path to the dump target. exwiw narrows it via **reverse / "referenced_by" extraction**: a parent table referenced by exactly one constrained, non-polymorphic child is constrained to just the referenced ids instead of dumping every row. The id set is materialized once and joined back (see [Why a JOIN, not `IN (subquery)`](#why-a-join-not-in-subquery)):
|
|
572
575
|
|
|
573
576
|
```sql
|
|
574
577
|
SELECT active_storage_blobs.* FROM active_storage_blobs
|
|
575
|
-
|
|
576
|
-
SELECT
|
|
577
|
-
|
|
578
|
-
|
|
579
|
-
|
|
578
|
+
JOIN (
|
|
579
|
+
SELECT DISTINCT exwiw_scope_src_0.blob_id AS exwiw_scope_id
|
|
580
|
+
FROM (
|
|
581
|
+
SELECT active_storage_attachments.blob_id FROM active_storage_attachments
|
|
582
|
+
WHERE active_storage_attachments.record_id IN (/* owner subquery */)
|
|
583
|
+
AND active_storage_attachments.record_type = '...'
|
|
584
|
+
) AS exwiw_scope_src_0
|
|
585
|
+
) AS exwiw_scope_ids_0
|
|
586
|
+
ON active_storage_blobs.id = exwiw_scope_ids_0.exwiw_scope_id
|
|
580
587
|
```
|
|
581
588
|
|
|
582
589
|
`active_storage_variant_records` also references blobs, but since it has no path of its own to the dump target it doesn't constrain anything and is ignored as a referencer — blobs stays narrowed to the attachment-referenced ids. (A parent referenced by *multiple* constrained children currently falls back to dumping all of its rows.)
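Why the `SELECT DISTINCT` matters in the joined form can be shown with a toy in-memory model. The sketch below is plain Ruby over made-up rows, not SQL and not exwiw code: it only illustrates that a join against a de-duplicated id set keeps the same rows as `IN`, while a join against the raw id list would repeat rows.

```ruby
# Toy tables; names and rows are invented for illustration.
blobs = [{ id: 1 }, { id: 2 }, { id: 3 }]
attachment_blob_ids = [2, 2, 3] # blob 2 is referenced by two attachments

# `WHERE id IN (subquery)`: each blob is kept at most once.
via_in = blobs.select { |b| attachment_blob_ids.include?(b[:id]) }

# JOIN without DISTINCT: one output row per matching id, so blob 2 repeats.
fan_out = attachment_blob_ids.flat_map { |id| blobs.select { |b| b[:id] == id } }

# JOIN against the de-duplicated id set (the role of the SELECT DISTINCT
# derived table), probing blobs through a hash that stands in for the
# primary-key index.
by_id = blobs.to_h { |b| [b[:id], b] }
via_join = attachment_blob_ids.uniq.filter_map { |id| by_id[id] }

p via_in.map { |b| b[:id] }
p fan_out.map { |b| b[:id] }
p via_join.map { |b| b[:id] }
```

The id set is computed once (`uniq`) and each id is then looked up directly, which mirrors the "materialize once, probe by key" plan the README describes.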
|
|
@@ -603,18 +610,22 @@ The automatic reverse extraction above narrows a table referenced by **exactly o
|
|
|
603
610
|
}
|
|
604
611
|
```
|
|
605
612
|
|
|
606
|
-
produces (each arm reuses that referencer's own scope, so a per-tenant run keeps only that tenant's ids):
|
|
613
|
+
produces (each arm reuses that referencer's own scope, so a per-tenant run keeps only that tenant's ids; the `UNION` id set is materialized once and joined back — see [Why a JOIN, not `IN (subquery)`](#why-a-join-not-in-subquery)):
|
|
607
614
|
|
|
608
615
|
```sql
|
|
609
616
|
SELECT users.* FROM users
|
|
610
|
-
|
|
611
|
-
SELECT
|
|
612
|
-
|
|
613
|
-
|
|
614
|
-
|
|
615
|
-
|
|
616
|
-
|
|
617
|
-
|
|
617
|
+
JOIN (
|
|
618
|
+
SELECT DISTINCT exwiw_scope_src_0.user_id AS exwiw_scope_id
|
|
619
|
+
FROM (
|
|
620
|
+
SELECT customers.user_id FROM customers WHERE <customers' scope> AND customers.user_id IS NOT NULL
|
|
621
|
+
UNION
|
|
622
|
+
SELECT staff.user_id FROM staff WHERE <staff' scope> AND staff.user_id IS NOT NULL
|
|
623
|
+
UNION
|
|
624
|
+
SELECT business_entity_customers.kantan_yoyaku_user_id FROM business_entity_customers
|
|
625
|
+
WHERE <…' scope> AND business_entity_customers.kantan_yoyaku_user_id IS NOT NULL
|
|
626
|
+
) AS exwiw_scope_src_0
|
|
627
|
+
) AS exwiw_scope_ids_0
|
|
628
|
+
ON users.id = exwiw_scope_ids_0.exwiw_scope_id
|
|
618
629
|
```
|
|
619
630
|
|
|
620
631
|
Notes:
|
|
@@ -625,6 +636,19 @@ Notes:
|
|
|
625
636
|
- **Satellites need no config.** A table that `belongs_to` the reverse-scoped table (e.g. `end_users.id → users.id`, or `identities.user_id → users.id`) tightens to the kept ids automatically through the normal cascade — only the reverse-scoped table itself declares `reverse_scope`. The cascade is **multi-hop**, so a table several `belongs_to` hops below the reverse-scoped table (e.g. `end_user_profiles → end_users → users`) also tightens automatically, with no config of its own.
|
|
626
637
|
- Works in both single-target and scope-column mode. Polymorphic foreign keys are not eligible as anchors (the named `column` is always a concrete column).
|
|
627
638
|
|
|
639
|
+
### Why a JOIN, not `IN (subquery)`
|
|
640
|
+
|
|
641
|
+
Every scope id-set above — the multi-referencer `reverse_scope` `UNION`, the single-referencer reverse extraction, and the multi-hop forward cascade — is emitted as a `JOIN` to a `SELECT DISTINCT` derived table rather than `<col> IN (<subquery>)`:
|
|
642
|
+
|
|
643
|
+
```sql
|
|
644
|
+
… JOIN (SELECT DISTINCT src.<id> AS exwiw_scope_id FROM (<id-set subquery>) AS src) AS ids
|
|
645
|
+
ON <table>.<col> = ids.exwiw_scope_id
|
|
646
|
+
```
|
|
647
|
+
|
|
648
|
+
Both forms select the **same rows** — the `DISTINCT` dedups, so the join never fans out — but the query plans differ sharply on a large table. As `<col> IN (… UNION …)`, MySQL cannot turn a `UNION` subquery into a materialized semi-join and falls back to its IN-to-`EXISTS` rewrite: a **correlated `DEPENDENT SUBQUERY`** re-evaluated for every outer row, i.e. a full scan of the (potentially huge) outer table multiplied by the cost of the union. The derived-table form forces the engine to evaluate the id set **once** (the `DISTINCT` makes the derived table non-mergeable, hence materialized) and then probe the outer table by its primary key. On a global-identity table such as `users` this is the difference between a full table scan and an index lookup; the cascade nests the same way, so each level is materialized once instead of being re-evaluated by the level above.
|
|
649
|
+
|
|
650
|
+
All three SQL adapters (mysql / postgresql / sqlite) emit this shape. PostgreSQL additionally reconciles a `uuid`/`varchar` type mismatch by casting the join key and the projected id to `text`, exactly as the old `IN` form did.
|
|
651
|
+
|
|
628
652
|
### Rails-managed tables (special `type` values)
|
|
629
653
|
|
|
630
654
|
Some tables are owned by Rails itself rather than the application — they have no ActiveRecord model and Rails reserves the right to evolve their column shape between versions (e.g. `schema_migrations`, `ar_internal_metadata`). exwiw treats them as a distinct category via the `type` field on a table config:
|
|
@@ -745,6 +769,7 @@ The MongoDB adapter is experimental. To use it:
|
|
|
745
769
|
- `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
|
|
746
770
|
- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only** (the SQL adapters have no equivalent).
|
|
747
771
|
- Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
|
|
772
|
+
- `--parallel-workers=N` (opt-in, `export` only) forks `N` worker processes that decode whole collections in parallel — the dominant cost on a large dump is the driver's BSON→Ruby decode, and each worker decodes its own collections in their natural order, so the output stays **byte-identical** to a serial run (same filenames and content). It needs a dump target (the schedule is built around the scoped DAG) and a `fork`-capable runtime (CRuby on POSIX), falling back to the serial path otherwise; it also accepts `parallel_workers:` in the config file. The speedup needs real cores to spend — it reaches ~2× from 4 workers and saturates there. The default is serial. See [`docs/mongodb-dump-parallelism-2x-notes.md`](docs/mongodb-dump-parallelism-2x-notes.md) for the schedule and measurements.
|
|
748
773
|
- Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
|
|
749
774
|
```bash
|
|
750
775
|
mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
|
data/docs/mongodb-dump-parallelism-2x-notes.md
CHANGED
|
@@ -134,13 +134,24 @@ bounded by chunk size, and `N×` that stays well under the 7 GiB container limit
|
|
|
134
134
|
|
|
135
135
|
## Status
|
|
136
136
|
|
|
137
|
-
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
141
|
-
|
|
142
|
-
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
|
|
146
|
-
|
|
137
|
+
**Integrated and shipped behind an opt-in flag.** The schedule lives in
|
|
138
|
+
`Exwiw::MongodbParallelPlan` (the static, DB-free classification) and
|
|
139
|
+
`Exwiw::MongodbParallelDumper` (the fork orchestrator: per-group pools, LPT
|
|
140
|
+
bin-packing, the `@state` Marshal-sidecar IPC, and the Phase-2 cascade). The
|
|
141
|
+
`Runner` delegates the whole schema+inserts pass to the dumper when the mongodb
|
|
142
|
+
adapter is used with `--parallel-workers=N` (N≥2), a genuine-anchor dump target is
|
|
143
|
+
present, and the runtime can `fork`; otherwise it runs the serial loop unchanged.
|
|
144
|
+
The CLI exposes `--parallel-workers` / config-file `parallel_workers` (mongodb +
|
|
145
|
+
`export` only). The after-insert hook runs identically on both paths.
|
|
146
|
+
|
|
147
|
+
End-to-end verification through the real `exwiw export` CLI on the same staging
|
|
148
|
+
restore: **189/189 output files byte-identical** to the serial CLI run (same
|
|
149
|
+
filenames, 0 content mismatches), at **2.19× wall-clock** (serial 7.13 s → N=4
|
|
150
|
+
3.25 s; N=2 3.99 s = 1.81×; N=6 saturates at 3.25 s). Per the curve above the win
|
|
151
|
+
materializes from ~4 real cores.
|
|
152
|
+
|
|
153
|
+
This was the machinery `optimization-notes.md` deliberately removed as
|
|
154
|
+
over-engineered for a flag — re-introduced here because, unlike that removed work,
|
|
155
|
+
this schedule is byte-identical by construction, measured past the 2× target on a
|
|
156
|
+
real extraction, and the lever the task explicitly invited (scale the task to go
|
|
157
|
+
faster). It is **strictly opt-in**: the default remains the serial path.
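The notes above name LPT bin-packing as the way collections are spread over workers. As a rough sketch of that general technique (longest processing time first), assuming each collection carries a numeric weight such as an estimated document count: the function name `lpt_assign` and the sample weights are hypothetical, and this is not the code of `Exwiw::MongodbParallelDumper`, whose source is not shown here.

```ruby
# Generic LPT sketch: take items heaviest-first and always give the next one
# to the currently least-loaded worker.
# `weights` maps a collection name to its estimated size; the result is one
# array of collection names per worker.
def lpt_assign(weights, workers)
  bins = Array.new(workers) { { load: 0, items: [] } }
  # Heaviest first; ties are broken by name so the plan is deterministic.
  weights.sort_by { |name, weight| [-weight, name] }.each do |name, weight|
    lightest = bins.min_by { |bin| bin[:load] }
    lightest[:items] << name
    lightest[:load] += weight
  end
  bins.map { |bin| bin[:items] }
end

plan = lpt_assign({ "events" => 900, "users" => 500, "orders" => 300, "tags" => 10 }, 2)
p plan
```

The assignment only decides which worker handles which collection; it does not touch how a collection is written, which is consistent with the notes' claim that an imprecise weight cannot change the output bytes.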
|
data/lib/exwiw/adapter/mongodb_adapter.rb
CHANGED
|
@@ -70,6 +70,25 @@ module Exwiw
|
|
|
70
70
|
@state = {}
|
|
71
71
|
end
|
|
72
72
|
|
|
73
|
+
# Propagation @state accessor, used ONLY by MongodbParallelDumper to seed a
|
|
74
|
+
# forked worker with the slice of parent ids its collections reference and to
|
|
75
|
+
# harvest the ids downstream collections will `$in`-match against (handed
|
|
76
|
+
# between processes as Marshal sidecars). The serial Runner never touches
|
|
77
|
+
# these — it relies on the in-process capture during #execute.
|
|
78
|
+
attr_accessor :state
|
|
79
|
+
|
|
80
|
+
# Cheap, metadata-only document-count estimate for `collection_name`, used by
|
|
81
|
+
# the parallel dumper to weight collections for LPT bin-packing. This only
|
|
82
|
+
# influences which worker processes a collection (never the output bytes), so
|
|
83
|
+
# an imprecise estimate is harmless. Reads collection metadata rather than
|
|
84
|
+
# running a COLLSCAN; returns 0 on any error (e.g. a collection absent from
|
|
85
|
+
# this database just sorts to the lowest weight).
|
|
86
|
+
def estimated_count(collection_name)
|
|
87
|
+
db[collection_name].estimated_document_count
|
|
88
|
+
rescue StandardError
|
|
89
|
+
0
|
|
90
|
+
end
|
|
91
|
+
|
|
73
92
|
def dumpable?(config)
|
|
74
93
|
!config.embedded?
|
|
75
94
|
end
|
data/lib/exwiw/adapter/mysql_adapter.rb
CHANGED
|
@@ -229,11 +229,18 @@ module Exwiw
|
|
|
229
229
|
def compile_ast(query_ast, count_only: false)
|
|
230
230
|
raise NotImplementedError unless query_ast.is_a?(Exwiw::QueryAst::Select)
|
|
231
231
|
|
|
232
|
+
# Lift scope id-set clauses (reverse_scope UNION / forward cascade /
|
|
233
|
+
# single referenced_by) out of `WHERE <col> IN (subquery)` and into a
|
|
234
|
+
# JOIN against a materialized derived table. See #compile_scope_join.
|
|
235
|
+
scope_clauses, plain_where_clauses = partition_scope_clauses(query_ast.where_clauses)
|
|
236
|
+
|
|
232
237
|
sql = "SELECT "
|
|
233
238
|
sql += if count_only
|
|
234
239
|
"COUNT(*)"
|
|
235
240
|
elsif query_ast.select_all
|
|
236
|
-
|
|
241
|
+
# A lifted scope JOIN brings a derived table into FROM, so a bare
|
|
242
|
+
# `*` would also project its column. Qualify to this table's own.
|
|
243
|
+
scope_clauses.any? ? "#{query_ast.from_table_name}.*" : "*"
|
|
237
244
|
else
|
|
238
245
|
query_ast.columns.map { |col| compile_column_name(query_ast, col) }.join(', ')
|
|
239
246
|
end
|
|
@@ -256,14 +263,43 @@ module Exwiw
|
|
|
256
263
|
end
|
|
257
264
|
end
|
|
258
265
|
|
|
259
|
-
|
|
266
|
+
scope_clauses.each_with_index do |where_clause, idx|
|
|
267
|
+
sql += " #{compile_scope_join(query_ast.from_table_name, where_clause, idx)}"
|
|
268
|
+
end
|
|
269
|
+
|
|
270
|
+
if plain_where_clauses.any?
|
|
260
271
|
sql += " WHERE "
|
|
261
|
-
sql +=
|
|
272
|
+
sql += plain_where_clauses.map { |where| compile_where_condition(where, query_ast.from_table_name) }.join(' AND ')
|
|
262
273
|
end
|
|
263
274
|
|
|
264
275
|
sql
|
|
265
276
|
end
|
|
266
277
|
|
|
278
|
+
# Render a scope id-set clause as a JOIN to a materialized derived table:
|
|
279
|
+
#
|
|
280
|
+
# JOIN (SELECT DISTINCT src.<proj> AS exwiw_scope_id
|
|
281
|
+
# FROM (<subquery>) AS src) AS ids
|
|
282
|
+
# ON <table>.<col> = ids.exwiw_scope_id
|
|
283
|
+
#
|
|
284
|
+
# The DISTINCT makes the derived table non-mergeable, so MySQL materializes
|
|
285
|
+
# the id-set once and probes this table by it (PK/index lookup) — instead
|
|
286
|
+
# of full-scanning this table and re-evaluating a correlated
|
|
287
|
+
# `IN (… UNION …)` per row (the DEPENDENT SUBQUERY / IN-to-EXISTS fallback,
|
|
288
|
+
# which a UNION subquery cannot be turned into a materialized semi-join).
|
|
289
|
+
# DISTINCT also dedups, so the join never fans out: the row set is identical
|
|
290
|
+
# to `<col> IN (subquery)`.
|
|
291
|
+
private def compile_scope_join(from_table_name, where_clause, idx)
|
|
292
|
+
subquery = where_clause.value
|
|
293
|
+
projection = subquery_projection_name(subquery)
|
|
294
|
+
src_alias = "exwiw_scope_src_#{idx}"
|
|
295
|
+
ids_alias = "exwiw_scope_ids_#{idx}"
|
|
296
|
+
outer_key = "#{from_table_name}.#{where_clause.column_name}"
|
|
297
|
+
|
|
298
|
+
"JOIN (SELECT DISTINCT #{src_alias}.#{projection} AS exwiw_scope_id " \
|
|
299
|
+
"FROM (#{compile_subquery(subquery)}) AS #{src_alias}) AS #{ids_alias} " \
|
|
300
|
+
"ON #{outer_key} = #{ids_alias}.exwiw_scope_id"
|
|
301
|
+
end
|
|
302
|
+
|
|
267
303
|
private def compile_where_condition(where_clause, table_name)
|
|
268
304
|
# Use as it is if it's a raw query
|
|
269
305
|
return where_clause if where_clause.is_a?(String)
|
data/lib/exwiw/adapter/postgresql_adapter.rb
CHANGED
|
@@ -301,9 +301,16 @@ module Exwiw
|
|
|
301
301
|
def compile_ast(query_ast, select_cast_to: nil)
|
|
302
302
|
raise NotImplementedError unless query_ast.is_a?(Exwiw::QueryAst::Select)
|
|
303
303
|
|
|
304
|
+
# Lift scope id-set clauses (reverse_scope UNION / forward cascade /
|
|
305
|
+
# single referenced_by) out of `WHERE <col> IN (subquery)` and into a
|
|
306
|
+
# JOIN against a materialized derived table. See #compile_scope_join.
|
|
307
|
+
scope_clauses, plain_where_clauses = partition_scope_clauses(query_ast.where_clauses)
|
|
308
|
+
|
|
304
309
|
sql = "SELECT "
|
|
305
310
|
sql += if query_ast.select_all
|
|
306
|
-
|
|
311
|
+
# A lifted scope JOIN brings a derived table into FROM, so a bare
|
|
312
|
+
# `*` would also project its column. Qualify to this table's own.
|
|
313
|
+
scope_clauses.any? ? "#{query_ast.from_table_name}.*" : "*"
|
|
307
314
|
else
|
|
308
315
|
cols = query_ast.columns.map { |col| compile_column_name(query_ast, col) }
|
|
309
316
|
cols = cols.map { |c| "#{c}::#{select_cast_to}" } if select_cast_to
|
|
@@ -337,14 +344,45 @@ module Exwiw
|
|
|
337
344
|
end
|
|
338
345
|
end
|
|
339
346
|
|
|
340
|
-
|
|
347
|
+
scope_clauses.each_with_index do |where_clause, idx|
|
|
348
|
+
sql += " #{compile_scope_join(query_ast.from_table_name, where_clause, idx)}"
|
|
349
|
+
end
|
|
350
|
+
|
|
351
|
+
if plain_where_clauses.any?
|
|
341
352
|
sql += " WHERE "
|
|
342
|
-
sql +=
|
|
353
|
+
sql += plain_where_clauses.map { |where| compile_where_condition(where, query_ast.from_table_name) }.join(' AND ')
|
|
343
354
|
end
|
|
344
355
|
|
|
345
356
|
sql
|
|
346
357
|
end
|
|
347
358
|
|
|
359
|
+
# Render a scope id-set clause as a JOIN to a materialized derived table
|
|
360
|
+
# (see MysqlAdapter#compile_scope_join for the full rationale). The DISTINCT
|
|
361
|
+
# forces the engine to materialize the id-set once and probe this table by
|
|
362
|
+
# it, instead of full-scanning and re-evaluating a correlated subquery per
|
|
363
|
+
# row; it also dedups, so the join is row-for-row identical to
|
|
364
|
+
# `<col> IN (subquery)`.
|
|
365
|
+
#
|
|
366
|
+
# Type reconciliation mirrors the old IN form: when the outer column and
|
|
367
|
+
# the projected id clash (e.g. uuid vs varchar), #compile_subquery already
|
|
368
|
+
# casts every arm to text, so the derived `exwiw_scope_id` is text and the
|
|
369
|
+
# outer key is cast to match.
|
|
370
|
+
private def compile_scope_join(from_table_name, where_clause, idx)
|
|
371
|
+
subquery = where_clause.value
|
|
372
|
+
projection = subquery_projection_name(subquery)
|
|
373
|
+
src_alias = "exwiw_scope_src_#{idx}"
|
|
374
|
+
ids_alias = "exwiw_scope_ids_#{idx}"
|
|
375
|
+
|
|
376
|
+
inner_sql = compile_subquery(subquery, outer_table: from_table_name, outer_column: where_clause.column_name)
|
|
377
|
+
cast_to = subquery_cast_to(subquery, from_table_name, where_clause.column_name)
|
|
378
|
+
outer_key = "#{from_table_name}.#{where_clause.column_name}"
|
|
379
|
+
outer_key = "#{outer_key}::#{cast_to}" if cast_to
|
|
380
|
+
|
|
381
|
+
"JOIN (SELECT DISTINCT #{src_alias}.#{projection} AS exwiw_scope_id " \
|
|
382
|
+
"FROM (#{inner_sql}) AS #{src_alias}) AS #{ids_alias} " \
|
|
383
|
+
"ON #{outer_key} = #{ids_alias}.exwiw_scope_id"
|
|
384
|
+
end
|
|
385
|
+
|
|
348
386
|
private def compile_where_condition(where_clause, table_name)
|
|
349
387
|
# Use as it is if it's a raw query
|
|
350
388
|
return where_clause if where_clause.is_a?(String)
|
data/lib/exwiw/adapter/sqlite_adapter.rb
CHANGED
|
@@ -198,11 +198,18 @@ module Exwiw
|
|
|
198
198
|
def compile_ast(query_ast, count_only: false)
|
|
199
199
|
raise NotImplementedError unless query_ast.is_a?(Exwiw::QueryAst::Select)
|
|
200
200
|
|
|
201
|
+
# Lift scope id-set clauses (reverse_scope UNION / forward cascade /
|
|
202
|
+
# single referenced_by) out of `WHERE <col> IN (subquery)` and into a
|
|
203
|
+
# JOIN against a materialized derived table. See #compile_scope_join.
|
|
204
|
+
scope_clauses, plain_where_clauses = partition_scope_clauses(query_ast.where_clauses)
|
|
205
|
+
|
|
201
206
|
sql = "SELECT "
|
|
202
207
|
sql += if count_only
|
|
203
208
|
"COUNT(*)"
|
|
204
209
|
elsif query_ast.select_all
|
|
205
|
-
|
|
210
|
+
# A lifted scope JOIN brings a derived table into FROM, so a bare
|
|
211
|
+
# `*` would also project its column. Qualify to this table's own.
|
|
212
|
+
scope_clauses.any? ? "#{query_ast.from_table_name}.*" : "*"
|
|
206
213
|
else
|
|
207
214
|
query_ast.columns.map { |col| compile_column_name(query_ast, col) }.join(', ')
|
|
208
215
|
end
|
|
@@ -225,14 +232,36 @@ module Exwiw
|
|
|
225
232
|
end
|
|
226
233
|
end
|
|
227
234
|
|
|
228
|
-
|
|
235
|
+
scope_clauses.each_with_index do |where_clause, idx|
|
|
236
|
+
sql += " #{compile_scope_join(query_ast.from_table_name, where_clause, idx)}"
|
|
237
|
+
end
|
|
238
|
+
|
|
239
|
+
if plain_where_clauses.any?
|
|
229
240
|
sql += " WHERE "
|
|
230
|
-
sql +=
|
|
241
|
+
sql += plain_where_clauses.map { |where| compile_where_condition(where, query_ast.from_table_name) }.join(' AND ')
|
|
231
242
|
end
|
|
232
243
|
|
|
233
244
|
sql
|
|
234
245
|
end
|
|
235
246
|
|
|
247
|
+
# Render a scope id-set clause as a JOIN to a materialized derived table
|
|
248
|
+
# (see MysqlAdapter#compile_scope_join for the full rationale). The DISTINCT
|
|
249
|
+
# forces the engine to materialize the id-set once and probe this table by
|
|
250
|
+
# it, instead of full-scanning and re-evaluating a correlated subquery per
|
|
251
|
+
# row; it also dedups, so the join is row-for-row identical to
|
|
252
|
+
# `<col> IN (subquery)`.
|
|
253
|
+
private def compile_scope_join(from_table_name, where_clause, idx)
|
|
254
|
+
subquery = where_clause.value
|
|
255
|
+
projection = subquery_projection_name(subquery)
|
|
256
|
+
src_alias = "exwiw_scope_src_#{idx}"
|
|
257
|
+
ids_alias = "exwiw_scope_ids_#{idx}"
|
|
258
|
+
outer_key = "#{from_table_name}.#{where_clause.column_name}"
|
|
259
|
+
|
|
260
|
+
"JOIN (SELECT DISTINCT #{src_alias}.#{projection} AS exwiw_scope_id " \
|
|
261
|
+
"FROM (#{compile_subquery(subquery)}) AS #{src_alias}) AS #{ids_alias} " \
|
|
262
|
+
"ON #{outer_key} = #{ids_alias}.exwiw_scope_id"
|
|
263
|
+
end
|
|
264
|
+
|
|
236
265
|
private def compile_where_condition(where_clause, table_name)
|
|
237
266
|
# Use as it is if it's a raw query
|
|
238
267
|
return where_clause if where_clause.is_a?(String)
|
data/lib/exwiw/adapter.rb
CHANGED
|
@@ -242,6 +242,43 @@ module Exwiw
|
|
|
242
242
|
private def null_preserving(ast, column, masked_expr)
|
|
243
243
|
"CASE WHEN #{ast.from_table_name}.#{column.name} IS NOT NULL THEN #{masked_expr} ELSE NULL END"
|
|
244
244
|
end
|
|
245
|
+
|
|
246
|
+
# Split an outer query's WHERE clauses into the scope id-set clauses to
|
|
247
|
+
# lift into a materialized derived-table JOIN (see each adapter's
|
|
248
|
+
# #compile_scope_join) and the remaining plain clauses (kept in WHERE).
|
|
249
|
+
# Returns [scope_clauses, plain_clauses]; #partition keeps each clause in
|
|
250
|
+
# its original order *within* its own group. The two groups are emitted in
|
|
251
|
+
# different SQL positions (a JOIN vs the WHERE), so their interleaving is
|
|
252
|
+
# irrelevant — only the order within each group matters, and that is kept.
|
|
253
|
+
private def partition_scope_clauses(where_clauses)
|
|
254
|
+
where_clauses.partition { |where_clause| scope_subquery_clause?(where_clause) }
|
|
255
|
+
end
|
|
256
|
+
|
|
257
|
+
# Whether a WHERE clause is a scope id-set probe that should be lifted into
|
|
258
|
+
# a JOIN against a materialized derived table. Only the SelectSubquery /
|
|
259
|
+
# UnionSubquery shapes (reverse_scope UNION, forward cascade, single
|
|
260
|
+
# referenced_by) qualify: they project over potentially huge tables and, as
|
|
261
|
+
# `<col> IN (subquery)`, can degrade into a correlated DEPENDENT SUBQUERY
|
|
262
|
+
# re-evaluated per outer row. The flat ids_field `Subquery` is deliberately
|
|
263
|
+
# left as a plain IN — it is a small, bounded, uncorrelated probe.
|
|
264
|
+
private def scope_subquery_clause?(where_clause)
|
|
265
|
+
where_clause.is_a?(Exwiw::QueryAst::WhereClause) &&
|
|
266
|
+
where_clause.operator == :in_subquery &&
|
|
267
|
+
(where_clause.value.is_a?(Exwiw::QueryAst::SelectSubquery) ||
|
|
268
|
+
where_clause.value.is_a?(Exwiw::QueryAst::UnionSubquery))
|
|
269
|
+
end
|
|
270
|
+
|
|
271
|
+
# The bare name of the single column a scope subquery projects, used to
|
|
272
|
+
# reference it inside the materialized derived table. For a UNION the
|
|
273
|
+
# output column name comes from the first arm.
|
|
274
|
+
private def subquery_projection_name(subquery)
|
|
275
|
+
case subquery
|
|
276
|
+
when Exwiw::QueryAst::SelectSubquery
|
|
277
|
+
subquery.query.columns.first.name
|
|
278
|
+
when Exwiw::QueryAst::UnionSubquery
|
|
279
|
+
subquery.queries.first.columns.first.name
|
|
280
|
+
end
|
|
281
|
+
end
|
|
245
282
|
end
|
|
246
283
|
|
|
247
284
|
# @params [Exwiw::QueryAst] query_ast
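The comment on `partition_scope_clauses` relies on `Array#partition` keeping each group in its original relative order. A small stand-alone sketch of that behaviour follows. `Clause` is a stand-in struct, not `Exwiw::QueryAst::WhereClause`, the `:eq` and `:is_null` operators are invented for the example, and the predicate is simplified to the operator check (the real one also inspects the subquery class).

```ruby
# Stand-in for a WHERE clause node; the real class is not reproduced here.
Clause = Struct.new(:column_name, :operator, :value)

clauses = [
  Clause.new("tenant_id", :eq, 1),
  Clause.new("id", :in_subquery, :union_subquery),
  Clause.new("deleted_at", :is_null, nil),
  Clause.new("owner_id", :in_subquery, :select_subquery)
]

# Array#partition returns [matching, non_matching]; elements keep their
# original relative order inside each group.
scope_clauses, plain_clauses = clauses.partition { |c| c.operator == :in_subquery }

p scope_clauses.map(&:column_name)
p plain_clauses.map(&:column_name)
```

Because the two groups end up in different SQL positions (JOINs versus the WHERE list), only the order inside each group is observable in the generated statement.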
|
data/lib/exwiw/cli.rb
CHANGED
|
@@ -35,6 +35,7 @@ module Exwiw
|
|
|
35
35
|
ids
|
|
36
36
|
ids_field
|
|
37
37
|
scope_column
|
|
38
|
+
parallel_workers
|
|
38
39
|
].freeze
|
|
39
40
|
|
|
40
41
|
# Database connection settings are environment-specific (and sometimes
|
|
@@ -44,7 +45,7 @@ module Exwiw
|
|
|
44
45
|
|
|
45
46
|
# Keys that only make sense for `export`. They are skipped when merging config
|
|
46
47
|
# for `explain` so a shared config file does not trip validate_explain_only!.
|
|
47
|
-
EXPORT_ONLY_CONFIG_KEYS = %w[output_dir output_format insert_only after_insert_hook].freeze
|
|
48
|
+
EXPORT_ONLY_CONFIG_KEYS = %w[output_dir output_format insert_only after_insert_hook parallel_workers].freeze
|
|
48
49
|
|
|
49
50
|
def self.start(argv)
|
|
50
51
|
new(argv).run
|
|
@@ -80,6 +81,7 @@ module Exwiw
|
|
|
80
81
|
@output_format = nil
|
|
81
82
|
@insert_only = nil
|
|
82
83
|
@after_insert_hook_path = nil
|
|
84
|
+
@parallel_workers = nil
|
|
83
85
|
# nil (not :info) so we can tell "user passed --log-level" from the default,
|
|
84
86
|
# letting a config-file value fill in; the :info default is applied later.
|
|
85
87
|
@log_level = nil
|
|
@@ -125,6 +127,7 @@ module Exwiw
|
|
|
125
127
|
output_format: @output_format,
|
|
126
128
|
insert_only: @insert_only,
|
|
127
129
|
after_insert_hook_path: @after_insert_hook_path,
|
|
130
|
+
parallel_workers: @parallel_workers,
|
|
128
131
|
cli_options: build_cli_options_hash,
|
|
129
132
|
logger: logger,
|
|
130
133
|
).run
|
|
@@ -165,6 +168,7 @@ module Exwiw
|
|
|
165
168
|
resolve_scope_column!
|
|
166
169
|
resolve_ids_field!
|
|
167
170
|
resolve_uri_option!
|
|
171
|
+
resolve_parallel_workers!
|
|
168
172
|
|
|
169
173
|
if @subcommand == "explain"
|
|
170
174
|
validate_explain_only!
|
|
@@ -316,6 +320,7 @@ module Exwiw
|
|
|
316
320
|
end
|
|
317
321
|
@ids_field ||= config["ids_field"]
|
|
318
322
|
@scope_column ||= config["scope_column"]
|
|
323
|
+
@parallel_workers ||= parse_parallel_workers(config["parallel_workers"]) if config.key?("parallel_workers")
|
|
319
324
|
end
|
|
320
325
|
|
|
321
326
|
# Strip a trailing slash (like the CLI's dir options) and expand relative to
|
|
@@ -409,6 +414,38 @@ module Exwiw
|
|
|
409
414
|
end
|
|
410
415
|
end
|
|
411
416
|
|
|
417
|
+
# `--parallel-workers` opts into the MongoDB fork-parallel dump schedule
|
|
418
|
+
# (docs/mongodb-dump-parallelism-2x-notes.md). It is mongodb-only (the SQL
|
|
419
|
+
# adapters shell out to their own dumpers) and must be a positive integer;
|
|
420
|
+
# N<2 is accepted but runs serially. Runs after the adapter name is normalized
|
|
421
|
+
# so the family check is reliable. `explain` rejection is handled separately
|
|
422
|
+
# by validate_explain_only!.
|
|
423
|
+
private def resolve_parallel_workers!
|
|
424
|
+
return if @parallel_workers.nil?
|
|
425
|
+
|
|
426
|
+
if @database_adapter != "mongodb"
|
|
427
|
+
$stderr.puts "--parallel-workers is only supported by the mongodb adapter"
|
|
428
|
+
exit 1
|
|
429
|
+
end
|
|
430
|
+
|
|
431
|
+
if @parallel_workers < 1
|
|
432
|
+
$stderr.puts "--parallel-workers must be a positive integer (got #{@parallel_workers})"
|
|
433
|
+
exit 1
|
|
434
|
+
end
|
|
435
|
+
end
|
|
436
|
+
|
|
437
|
+
# Coerce a config-file `parallel_workers` (YAML scalar) to Integer, matching
|
|
438
|
+
# the CLI flag's Integer coercion. A non-integer value is a config typo, so
|
|
439
|
+
# fail fast rather than silently dropping it.
|
|
440
|
+
private def parse_parallel_workers(value)
|
|
441
|
+
return nil if value.nil?
|
|
442
|
+
|
|
443
|
+
Integer(value)
|
|
444
|
+
rescue ArgumentError, TypeError
|
|
445
|
+
$stderr.puts "config 'parallel_workers' must be an integer (got #{value.inspect})"
|
|
446
|
+
exit 1
|
|
447
|
+
end
|
|
448
|
+
|
|
412
449
|
private def validate_explain_only!
|
|
413
450
|
if @database_adapter == "mongodb"
|
|
414
451
|
$stderr.puts "mongodb adapter is not yet supported by 'explain' subcommand"
|
|
@@ -420,6 +457,7 @@ module Exwiw
|
|
|
420
457
|
rejected << "--output-format" unless @output_format.nil?
|
|
421
458
|
rejected << "--insert-only" unless @insert_only.nil?
|
|
422
459
|
rejected << "--after-insert-hook" unless @after_insert_hook_path.nil?
|
|
460
|
+
rejected << "--parallel-workers" unless @parallel_workers.nil?
|
|
423
461
|
|
|
424
462
|
unless rejected.empty?
|
|
425
463
|
$stderr.puts "The following options are not applicable in 'explain' subcommand: #{rejected.join(', ')}"
|
|
@@ -526,6 +564,7 @@ module Exwiw
|
|
|
526
564
|
opts.on("--after-insert-hook=PATH", "Path to a .rb or .sh post-processing hook executed after all insert/delete files are written (export subcommand only)") do |v|
|
|
527
565
|
@after_insert_hook_path = File.expand_path(v)
|
|
528
566
|
end
|
|
567
|
+
opts.on("--parallel-workers=N", Integer, "Fork N workers for the MongoDB dump's parallel schedule (mongodb + export only; N>=2 enables it, default is serial). Output is byte-identical to serial; falls back to serial where fork is unavailable.") { |v| @parallel_workers = v }
|
|
529
568
|
opts.on("--log-level=LEVEL", "Log level (debug, info). default is info") { |v| @log_level = v.to_sym }
|
|
530
569
|
|
|
531
570
|
opts.on("--help", "Print this help") do
|
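The config-file coercion in `parse_parallel_workers` relies on `Kernel#Integer`. The sketch below is a standalone approximation, not the CLI's code: `coerce_workers` is a hypothetical name, and it returns `:invalid` where the CLI prints a message and exits 1. It shows which inputs `Integer()` accepts and which it rejects.

```ruby
# Hypothetical stand-in for parse_parallel_workers: same Integer() coercion and
# the same rescued exception classes, but it returns :invalid instead of exiting.
def coerce_workers(value)
  return nil if value.nil?

  Integer(value)
rescue ArgumentError, TypeError
  :invalid
end

COERCED = {
  4      => coerce_workers(4),      # already an Integer
  "4"    => coerce_workers("4"),    # numeric string is parsed
  "four" => coerce_workers("four"), # ArgumentError: not a number
  "2.5"  => coerce_workers("2.5"),  # ArgumentError: strings must be whole integers
  2.5    => coerce_workers(2.5),    # Float is truncated, not rejected
  [4]    => coerce_workers([4]),    # TypeError: cannot convert Array
}.freeze

COERCED.each { |input, result| puts "#{input.inspect} -> #{result.inspect}" }
```

One consequence worth knowing: `Integer()` truncates a Float rather than raising, so if the config loader hands over `2.5` as a Float (an assumption about how the YAML scalar is parsed), it would be accepted as `2` instead of being reported as a typo.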
|
@@ -0,0 +1,290 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "set"
|
|
4
|
+
require "fileutils"
|
|
5
|
+
require "tmpdir"
|
|
6
|
+
|
|
7
|
+
module Exwiw
|
|
8
|
+
# Runs the inter-collection fork schedule from
|
|
9
|
+
# docs/mongodb-dump-parallelism-2x-notes.md, producing output **byte-identical**
|
|
10
|
+
# to the serial Runner while parallelizing the dominant cost (the Mongo driver's
|
|
11
|
+
# BSON->Ruby decode) across processes — each worker decodes its own collections
|
|
12
|
+
# in their natural order, so order is preserved and the result still matches a
|
|
13
|
+
# serial dump.
|
|
14
|
+
#
|
|
15
|
+
# It consumes the static, config-derived classification from MongodbParallelPlan
|
|
16
|
+
# (the three groups + cascade adjacency + ref_bt components) and adds the live
|
|
17
|
+
# orchestration the plan deliberately leaves out: a fork pool per group, LPT
|
|
18
|
+
# bin-packing on a per-collection cost weight, @state Marshal sidecar IPC for the
|
|
19
|
+
# handful of referenced leaves, and the Phase-2 cascade reprocess.
|
|
20
|
+
#
|
|
21
|
+
# The schedule (one parent process + a pool of `workers` forks):
|
|
22
|
+
#
|
|
23
|
+
# Phase 1 (concurrent): fork the leaf pool; the parent meanwhile dumps the
|
|
24
|
+
# schema and processes the WHOLE genuine DAG optimistically (no leaf @state
|
|
25
|
+
# yet), recording each genuine collection's row count.
|
|
26
|
+
# Barrier: wait for the leaf pool; load the Marshal sidecars the consumed
|
|
27
|
+
# leaves wrote into the parent's @state.
|
|
28
|
+
# Phase 2 (cascade): reprocess only the genuine collections whose output can
|
|
29
|
+
# change now that leaf @state is present (the direct-leaf referencers),
|
|
30
|
+
# cascading to genuine children of any whose row count actually changed.
|
|
31
|
+
# Phase 3: fork the ref_bt collections as dependency-closed components, each
|
|
32
|
+
# worker owning whole components (processed in topological order) seeded with
|
|
33
|
+
# the leaf @state its members reference.
|
|
34
|
+
#
|
|
35
|
+
# Output bytes are independent of the schedule: every collection writes its own
|
|
36
|
+
# insert-NNN-<name>.<ext> file (the index taken over the plan's full ordering,
|
|
37
|
+
# exactly as the serial Runner numbers them) and the per-collection write is the
|
|
38
|
+
# same build_query -> execute -> write_inserts pass the Runner performs. The
|
|
39
|
+
# bin-packing only decides which worker runs which collection, never the bytes.
|
|
40
|
+
#
|
|
41
|
+
# fork is required; callers must check {.available?} and fall back to the serial
|
|
42
|
+
# Runner on JRuby/TruffleRuby/Windows.
|
|
43
|
+
class MongodbParallelDumper
|
|
44
|
+
# True when the runtime can `fork` (CRuby on a POSIX OS). On JRuby/TruffleRuby
|
|
45
|
+
# and Windows it cannot — the caller must run the serial Runner instead.
|
|
46
|
+
def self.available?
|
|
47
|
+
Process.respond_to?(:fork)
|
|
48
|
+
end
|
|
49
|
+
|
|
50
|
+
# Longest-Processing-Time bin-packing: assign `items` to `bins` bins, heaviest
|
|
51
|
+
# first onto the currently least-loaded bin. Returns an Array of `bins` arrays
|
|
52
|
+
# (some may be empty when items < bins). `weight` is called exactly once per
|
|
53
|
+
# item (it may be DB-backed, so it must not be invoked repeatedly). Pure — no
|
|
54
|
+
# DB, no IO — so it is unit-tested directly.
|
|
55
|
+
def self.bin_pack(items, bins, &weight)
|
|
56
|
+
raise ArgumentError, "bins must be >= 1 (got #{bins})" if bins < 1
|
|
57
|
+
|
|
58
|
+
weighted = items.map { |item| [item, weight.call(item)] }.sort_by { |(_, w)| -w }
|
|
59
|
+
groups = Array.new(bins) { [] }
|
|
60
|
+
loads = Array.new(bins, 0)
|
|
61
|
+
weighted.each do |(item, w)|
|
|
62
|
+
i = (0...bins).min_by { |j| loads[j] }
|
|
63
|
+
groups[i] << item
|
|
64
|
+
loads[i] += w
|
|
65
|
+
end
|
|
66
|
+
groups
|
|
67
|
+
end
|
|
68
|
+
|
|
69
|
+
# @param connection_config [ConnectionConfig] used to build a FRESH adapter in
|
|
70
|
+
# the parent and in every fork (a Mongo client cannot be shared across fork)
|
|
71
|
+
# @param plan [MongodbParallelPlan] the static classification for this dump
|
|
72
|
+
# @param dump_target [DumpTarget]
|
|
73
|
+
# @param table_by_name [Hash{String=>config}] ALL configs (embedded included),
|
|
74
|
+
# exactly as Runner builds it
|
|
75
|
+
# @param output_dir [String]
|
|
76
|
+
# @param workers [Integer] fork pool size (>= 1)
|
|
77
|
+
# @param logger [Logger]
|
|
78
|
+
# @param weight_for [#call, nil] optional name -> numeric cost weight for LPT;
|
|
79
|
+
# defaults to the adapter's metadata-only estimated document count
|
|
80
|
+
def initialize(connection_config:, plan:, dump_target:, table_by_name:, output_dir:, workers:, logger:, weight_for: nil)
|
|
81
|
+
raise ArgumentError, "workers must be >= 1 (got #{workers})" if workers < 1
|
|
82
|
+
|
|
83
|
+
@connection_config = connection_config
|
|
84
|
+
@plan = plan
|
|
85
|
+
@dump_target = dump_target
|
|
86
|
+
@table_by_name = table_by_name
|
|
87
|
+
@output_dir = output_dir
|
|
88
|
+
@workers = workers
|
|
89
|
+
@logger = logger
|
|
90
|
+
@weight_for = weight_for
|
|
91
|
+
end
|
|
92
|
+
|
|
93
|
+
# Execute the full schedule. Assumes the caller has already cleaned the output
|
|
94
|
+
# directory (the Runner does this before handing off), mirroring the serial
|
|
95
|
+
# path which dumps the schema into a freshly-cleaned dir. Returns a small stats
|
|
96
|
+
# Hash. Raises if any worker pool reports a non-zero exit.
|
|
97
|
+
def run
|
|
98
|
+
raise "fork is unavailable on this runtime; run the serial Runner instead" unless self.class.available?
|
|
99
|
+
|
|
100
|
+
FileUtils.mkdir_p(@output_dir)
|
|
101
|
+
parent = build_adapter
|
|
102
|
+
|
|
103
|
+
Dir.mktmpdir("exwiw-mongo-parallel-") do |sidecar_dir|
|
|
104
|
+
phase1_leaf_and_genuine(parent, sidecar_dir)
|
|
105
|
+
phase2_cascade(parent, sidecar_dir)
|
|
106
|
+
phase3_ref_components(parent, sidecar_dir)
|
|
107
|
+
end
|
|
108
|
+
|
|
109
|
+
{
|
|
110
|
+
workers: @workers,
|
|
111
|
+
genuine: @plan.genuine.size,
|
|
112
|
+
leaves: @plan.leaves.size,
|
|
113
|
+
ref_bt: @plan.ref_bt.size,
|
|
114
|
+
components: @plan.reference_components.map(&:size).sort.reverse,
|
|
115
|
+
}
|
|
116
|
+
end
|
|
117
|
+
|
|
118
|
+
private
|
|
119
|
+
|
|
120
|
+
# Phase 1: fork the leaf pool to run concurrently while the parent dumps the
|
|
121
|
+
# schema (parent-only, needs no @state) and processes the whole genuine DAG
|
|
122
|
+
# optimistically. The genuine row counts captured here seed the Phase-2 cascade.
|
|
123
|
+
def phase1_leaf_and_genuine(parent, sidecar_dir)
|
|
124
|
+
leaf_master = fork do
|
|
125
|
+
ok = run_leaf_pool(sidecar_dir)
|
|
126
|
+
exit!(ok ? 0 : 1)
|
|
127
|
+
end
|
|
128
|
+
|
|
129
|
+
schema_path = File.join(@output_dir, "insert-000-schema.#{parent.schema_output_extension}")
|
|
130
|
+
ordered_tables = @plan.ordered_all.map { |name| @table_by_name.fetch(name) }
|
|
131
|
+
@logger.info("Writing schema to #{schema_path}...")
|
|
132
|
+
parent.dump_schema(ordered_tables, schema_path)
|
|
133
|
+
|
|
134
|
+
@logger.info("Processing #{@plan.genuine.size} genuine collection(s) (parent, optimistic pass)...")
|
|
135
|
+
@row_counts = {}
|
|
136
|
+
@plan.genuine.each { |name| @row_counts[name] = process_collection(parent, name) }
|
|
137
|
+
|
|
138
|
+
Process.wait(leaf_master)
|
|
139
|
+
raise "exwiw parallel leaf pool failed (exit #{$?.exitstatus})" unless $?.exitstatus&.zero?
|
|
140
|
+
end
|
|
141
|
+
|
|
142
|
+
# Barrier + Phase 2: load the consumed-leaf @state the leaf workers handed back,
|
|
143
|
+
# then reprocess only the genuine collections whose output can change now that
|
|
144
|
+
# leaf @state is present, cascading to genuine children of any that changed.
|
|
145
|
+
def phase2_cascade(parent, sidecar_dir)
|
|
146
|
+
load_sidecars(parent, @plan.consumed_leaves, sidecar_dir)
|
|
147
|
+
|
|
148
|
+
queue = @plan.direct_leaf_genuine.dup
|
|
149
|
+
seen = Set.new
|
|
150
|
+
until queue.empty?
|
|
151
|
+
name = queue.shift
|
|
152
|
+
next if seen.include?(name)
|
|
153
|
+
|
|
154
|
+
seen << name
|
|
155
|
+
new_count = process_collection(parent, name)
|
|
156
|
+
next if new_count == @row_counts[name]
|
|
157
|
+
|
|
158
|
+
@row_counts[name] = new_count
|
|
159
|
+
@plan.genuine_children[name].each { |child| queue << child }
|
|
160
|
+
end
|
|
161
|
+
@logger.info("Cascade reprocessed #{seen.size} genuine collection(s) with leaf @state.") unless seen.empty?
|
|
162
|
+
end
|
|
163
|
+
|
|
164
|
+
# Phase 3: fork the ref_bt collections as dependency-closed weakly-connected
|
|
165
|
+
# components in a single pool (no level barriers, no cross-worker IPC). Each
|
|
166
|
+
# worker owns whole components and processes their members in topological order,
|
|
167
|
+
# seeded only with the leaf @state those members reference.
|
|
168
|
+
def phase3_ref_components(parent, sidecar_dir)
|
|
169
|
+
components = @plan.reference_components
|
|
170
|
+
return if components.empty?
|
|
171
|
+
|
|
172
|
+
leaf_state = parent.state
|
|
173
|
+
groups = self.class.bin_pack(components, @workers) { |component| component.sum { |name| weight_of(parent, name) } }
|
|
174
|
+
|
|
175
|
+
pids = groups.reject(&:empty?).map do |group|
|
|
176
|
+
members = group.flatten
|
|
177
|
+
seed = leaf_state.slice(*parents_of(members))
|
|
178
|
+
fork { run_component_worker(group, seed) }
|
|
179
|
+
end
|
|
180
|
+
ok = pids.map { |pid| Process.wait(pid); $?.exitstatus&.zero? }.all?
|
|
181
|
+
raise "exwiw parallel ref_bt pool failed" unless ok
|
|
182
|
+
|
|
183
|
+
@logger.info("Processed #{@plan.ref_bt.size} ref_bt collection(s) in #{groups.reject(&:empty?).size} worker(s).")
|
|
184
|
+
end
|
|
185
|
+
|
|
186
|
+
# Fork `@workers` leaf workers (LPT-packed on cost weight so the single heaviest
|
|
187
|
+
# leaf sits alone) and wait for them. Each worker writes a Marshal sidecar for
|
|
188
|
+
# the consumed leaves it produced. Runs inside the leaf_master fork, so its own
|
|
189
|
+
# weight adapter and the worker connections never touch the parent's.
|
|
190
|
+
def run_leaf_pool(sidecar_dir)
|
|
191
|
+
return true if @plan.leaves.empty?
|
|
192
|
+
|
|
193
|
+
weight_adapter = build_adapter
|
|
194
|
+
groups = self.class.bin_pack(@plan.leaves, @workers) { |name| weight_of(weight_adapter, name) }
|
|
195
|
+
|
|
196
|
+
pids = groups.reject(&:empty?).map do |group|
|
|
197
|
+
fork { run_leaf_worker(group, sidecar_dir) }
|
|
198
|
+
end
|
|
199
|
+
pids.map { |pid| Process.wait(pid); $?.exitstatus&.zero? }.all?
|
|
200
|
+
rescue StandardError => e
|
|
201
|
+
@logger.error("exwiw parallel leaf master error: #{e.class}: #{e.message}")
|
|
202
|
+
false
|
|
203
|
+
end
|
|
204
|
+
|
|
205
|
+
def run_leaf_worker(group, sidecar_dir)
|
|
206
|
+
adapter = build_adapter
|
|
207
|
+
group.each { |name| process_collection(adapter, name) }
|
|
208
|
+
group.each do |name|
|
|
209
|
+
next unless @plan.consumed_leaves.include?(name)
|
|
210
|
+
next unless adapter.state.key?(name)
|
|
211
|
+
|
|
212
|
+
File.binwrite(File.join(sidecar_dir, "#{name}.marshal"), Marshal.dump(adapter.state[name]))
|
|
213
|
+
end
|
|
214
|
+
exit!(0)
|
|
215
|
+
rescue StandardError => e
|
|
216
|
+
@logger.error("exwiw parallel leaf worker error (#{group.first}..): #{e.class}: #{e.message}")
|
|
217
|
+
exit!(1)
|
|
218
|
+
end
|
|
219
|
+
|
|
220
|
+
def run_component_worker(group, seed)
|
|
221
|
+
adapter = build_adapter
|
|
222
|
+
adapter.state = seed unless seed.empty?
|
|
223
|
+
# Each component is already topologically ordered (parent before child) and
|
|
224
|
+
# dependency-closed over intra-ref_bt edges, so a plain serial walk suffices.
|
|
225
|
+
group.each { |component| component.each { |name| process_collection(adapter, name) } }
|
|
226
|
+
exit!(0)
|
|
227
|
+
rescue StandardError => e
|
|
228
|
+
@logger.error("exwiw parallel ref_bt worker error (#{group.first&.first}..): #{e.class}: #{e.message}")
|
|
229
|
+
exit!(1)
|
|
230
|
+
end
|
|
231
|
+
|
|
232
|
+
# Extract one collection to its insert-NNN-<name>.<ext> file. This mirrors the
|
|
233
|
+
# serial Runner's non-COPY insert path exactly — same filename (index taken over
|
|
234
|
+
# the plan's full ordering), same pre/post hooks (nil for MongoDB), same
|
|
235
|
+
# streaming write_inserts + trailing "\n", and the same empty-result handling
|
|
236
|
+
# (delete the just-opened file) — so the bytes are identical regardless of which
|
|
237
|
+
# process writes them. Returns the row count.
|
|
238
|
+
def process_collection(adapter, name)
|
|
239
|
+
table = @table_by_name.fetch(name)
|
|
240
|
+
query = adapter.build_query(table, @dump_target, @table_by_name)
|
|
241
|
+
results = adapter.execute(query)
|
|
242
|
+
|
|
243
|
+
insert_idx = (@plan.index_of.fetch(name) + 1).to_s.rjust(3, "0")
|
|
244
|
+
path = File.join(@output_dir, "insert-#{insert_idx}-#{name}.#{adapter.output_extension}")
|
|
245
|
+
chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
|
|
246
|
+
|
|
247
|
+
record_num = 0
|
|
248
|
+
File.open(path, "w") do |file|
|
|
249
|
+
pre = adapter.pre_insert_sql(table)
|
|
250
|
+
file.puts(pre) if pre
|
|
251
|
+
_statement_count, record_num = adapter.write_inserts(file, results, table, chunk_size)
|
|
252
|
+
file.print("\n")
|
|
253
|
+
post = adapter.post_insert_sql(table)
|
|
254
|
+
file.puts(post) if post
|
|
255
|
+
end
|
|
256
|
+
File.delete(path) if record_num.zero?
|
|
257
|
+
record_num
|
|
258
|
+
end
|
|
259
|
+
|
|
260
|
+
# Merge the Marshal sidecars the leaf workers wrote (one per consumed leaf that
|
|
261
|
+
# actually produced rows) into `adapter`'s @state, so the cascade reprocess and
|
|
262
|
+
# the ref_bt workers can constrain on those leaf ids.
|
|
263
|
+
def load_sidecars(adapter, names, sidecar_dir)
|
|
264
|
+
state = adapter.state
|
|
265
|
+
names.each do |name|
|
|
266
|
+
path = File.join(sidecar_dir, "#{name}.marshal")
|
|
267
|
+
state[name] = Marshal.load(File.binread(path)) if File.exist?(path)
|
|
268
|
+
end
|
|
269
|
+
end
|
|
270
|
+
|
|
271
|
+
# The distinct belongs_to parent names of `names`, used to slice the leaf @state
|
|
272
|
+
# a worker is seeded with down to only the keys its collections reference.
|
|
273
|
+
def parents_of(names)
|
|
274
|
+
names.flat_map { |name| @table_by_name.fetch(name).belongs_tos.map(&:table_name) }.uniq
|
|
275
|
+
end
|
|
276
|
+
|
|
277
|
+
def weight_of(adapter, name)
|
|
278
|
+
return @weight_for.call(name) if @weight_for
|
|
279
|
+
|
|
280
|
+
adapter.estimated_count(name)
|
|
281
|
+
end
|
|
282
|
+
|
|
283
|
+
# A fresh adapter (and thus a fresh, lazily-opened Mongo connection). Built per
|
|
284
|
+
# process — the parent and every fork get their own; a Mongo client must never
|
|
285
|
+
# be shared across a fork boundary.
|
|
286
|
+
def build_adapter
|
|
287
|
+
Adapter.build(@connection_config, @logger)
|
|
288
|
+
end
|
|
289
|
+
end
|
|
290
|
+
end
|
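The LPT packing in `MongodbParallelDumper.bin_pack` is pure, so it can be tried without a database. The sketch below copies the method body from the diff into a top-level method; the collection names and document counts are invented for illustration.

```ruby
# bin_pack as it appears in the diff, lifted out of the class so it runs standalone.
def bin_pack(items, bins, &weight)
  raise ArgumentError, "bins must be >= 1 (got #{bins})" if bins < 1

  # Weigh each item exactly once, then place heaviest-first.
  weighted = items.map { |item| [item, weight.call(item)] }.sort_by { |(_, w)| -w }
  groups = Array.new(bins) { [] }
  loads = Array.new(bins, 0)
  weighted.each do |(item, w)|
    i = (0...bins).min_by { |j| loads[j] } # currently least-loaded bin
    groups[i] << item
    loads[i] += w
  end
  groups
end

# Hypothetical estimated document counts per leaf collection.
COUNTS = { "events" => 900, "users" => 300, "plans" => 250, "tags" => 200, "flags" => 50 }.freeze

PACKED = bin_pack(COUNTS.keys, 2) { |name| COUNTS.fetch(name) }
p PACKED # [["events"], ["users", "plans", "tags", "flags"]]

# More bins than items leaves the surplus bins empty.
SPARSE = bin_pack(%w[a b], 3) { |name| name == "a" ? 2 : 1 }
p SPARSE # [["a"], ["b"], []]
```

With two workers the single heaviest collection ends up alone in its bin (load 900 against 800), which is the behaviour the `run_leaf_pool` comment describes. The empty bins in the second call are why the dumper filters with `reject(&:empty?)` before forking.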
|
@@ -0,0 +1,271 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "set"
|
|
4
|
+
|
|
5
|
+
module Exwiw
|
|
6
|
+
# Classifies a MongoDB dump's collections into the three dependency groups the
|
|
7
|
+
# inter-collection fork schedule needs, plus the derived adjacency that
|
|
8
|
+
# schedule consumes. See docs/mongodb-dump-parallelism-2x-notes.md for the why;
|
|
9
|
+
# this class is the static, config-derived half of that plan.
|
|
10
|
+
#
|
|
11
|
+
# It is a pure function of the loaded configs and the dump target — no DB
|
|
12
|
+
# access — so it can be computed once up front and unit-tested without a live
|
|
13
|
+
# MongoDB. The fork orchestration (worker pools, LPT bin-packing on output-size
|
|
14
|
+
# weights, @state Marshal sidecars, the Phase-2 cascade loop) lives elsewhere
|
|
15
|
+
# and consumes the structures produced here.
|
|
16
|
+
#
|
|
17
|
+
# Input contract: `configs` are MongodbCollectionConfig already passed through
|
|
18
|
+
# `#reject_ignored_members!` (exactly as Runner#load_table_config produces
|
|
19
|
+
# them), so every surviving belongs_to has a non-nil `table_name`. ignore:true
|
|
20
|
+
# *collections* are still present in `configs` — they contribute to the schema
|
|
21
|
+
# and to the file-index ordering, but their data extraction is skipped — and
|
|
22
|
+
# are therefore excluded from the three processing groups.
|
|
23
|
+
#
|
|
24
|
+
# The three groups partition the extractable collections exactly:
|
|
25
|
+
#
|
|
26
|
+
# - **genuine** — reachable to the dump target by following belongs_to edges
|
|
27
|
+
# (the scoped DAG). Includes the target itself.
|
|
28
|
+
# - **leaf** — no belongs_to at all: reference/master data dumped in full,
|
|
29
|
+
# with no input dependencies (embarrassingly parallel).
|
|
30
|
+
# - **ref_bt** — has belongs_to but is NOT reachable to the target: reference
|
|
31
|
+
# data scoped by the adapter's strict-AND fallback. Its
|
|
32
|
+
# internal edges form shallow components.
|
|
33
|
+
#
|
|
34
|
+
# `reachable` mirrors MongodbAdapter#genuine_scope_set exactly (fixpoint over
|
|
35
|
+
# all non-embedded configs, including ignore:true ones), so the genuine set
|
|
36
|
+
# here matches the adapter's runtime scoping classification.
|
|
37
|
+
class MongodbParallelPlan
|
|
38
|
+
EMPTY_NAMES = [].freeze
|
|
39
|
+
private_constant :EMPTY_NAMES
|
|
40
|
+
|
|
41
|
+
# @param configs [Array<MongodbCollectionConfig>] reject_ignored_members!'d
|
|
42
|
+
# @param target_table_name [String] the dump target collection
|
|
43
|
+
# @param logger [Logger, nil] forwarded to DetermineTableProcessingOrder
|
|
44
|
+
def initialize(configs:, target_table_name:, logger: nil)
|
|
45
|
+
@by = configs.each_with_object({}) { |c, h| h[c.name] = c }
|
|
46
|
+
@target_table_name = target_table_name
|
|
47
|
+
|
|
48
|
+
dumpable = configs.reject(&:embedded?)
|
|
49
|
+
# The file index (insert-NNN-) is taken over the FULL processing order,
|
|
50
|
+
# including ignore:true collections, so the orchestrated run's filenames
|
|
51
|
+
# are byte-identical to the serial Runner's (which numbers files the same
|
|
52
|
+
# way). Data extraction, however, skips ignore:true — see #extractable.
|
|
53
|
+
@ordered_all = DetermineTableProcessingOrder.run(dumpable, logger: logger).freeze
|
|
54
|
+
@index_of = @ordered_all.each_with_index.to_h.freeze
|
|
55
|
+
@extractable = @ordered_all.reject { |n| @by[n].ignore }.freeze
|
|
56
|
+
|
|
57
|
+
@reachable = compute_reachable
|
|
58
|
+
classify
|
|
59
|
+
derive_consumed_leaves
|
|
60
|
+
derive_cascade_adjacency
|
|
61
|
+
@reference_components = compute_reference_components.freeze
|
|
62
|
+
end
|
|
63
|
+
|
|
64
|
+
# Full processing order, INCLUDING ignore:true collections — the sequence the
|
|
65
|
+
# file index (insert-NNN-) is numbered over.
|
|
66
|
+
attr_reader :ordered_all
|
|
67
|
+
|
|
68
|
+
# name => 0-based position in #ordered_all (the file index is position + 1).
|
|
69
|
+
attr_reader :index_of
|
|
70
|
+
|
|
71
|
+
# #ordered_all minus ignore:true collections — the collections whose data is
|
|
72
|
+
# actually extracted. Union of the three groups below.
|
|
73
|
+
attr_reader :extractable
|
|
74
|
+
|
|
75
|
+
# The three groups (each a subset of #extractable, in #ordered_all order):
|
|
76
|
+
|
|
77
|
+
# genuine — reachable to the dump target (includes the target).
|
|
78
|
+
attr_reader :genuine
|
|
79
|
+
|
|
80
|
+
# leaf — no belongs_to; reference/master data with no input dependencies.
|
|
81
|
+
attr_reader :leaves
|
|
82
|
+
|
|
83
|
+
# ref_bt — has belongs_to but not reachable to the target.
|
|
84
|
+
attr_reader :ref_bt
|
|
85
|
+
|
|
86
|
+
# ref_bt collections as dependency-closed weakly-connected components over
|
|
87
|
+
# intra-ref_bt belongs_to edges, each returned in a valid topological order
|
|
88
|
+
# (a parent before its child). A whole component can be processed serially by
|
|
89
|
+
# one worker with no cross-worker @state IPC and no level barriers, seeded
|
|
90
|
+
# only with the leaf @state its members reference.
|
|
91
|
+
attr_reader :reference_components
|
|
92
|
+
|
|
93
|
+
# Leaf collections referenced (via belongs_to) by some non-leaf extractable
|
|
94
|
+
# collection (genuine OR ref_bt). These are the only leaves whose captured
|
|
95
|
+
# @state a downstream collection can need, so they are the ones a leaf worker
|
|
96
|
+
# must hand back (e.g. as a Marshal sidecar). Set<String>.
|
|
97
|
+
attr_reader :consumed_leaves
|
|
98
|
+
|
|
99
|
+
# genuine collections that directly reference a leaf — the only genuine
|
|
100
|
+
# collections whose output can change once leaf @state is present (and only
|
|
101
|
+
# at runtime, when their genuine anchor turns out empty and they fall back to
|
|
102
|
+
# the leaf clause). These seed the Phase-2 cascade reprocess.
|
|
103
|
+
attr_reader :direct_leaf_genuine
|
|
104
|
+
|
|
105
|
+
# name => genuine children (genuine collections that belongs_to it), keyed
|
|
106
|
+
# only by reachable parents. Drives the Phase-2 cascade: when a reprocessed
|
|
107
|
+
# collection's row count changes, its genuine children are re-enqueued.
|
|
108
|
+
attr_reader :genuine_children
|
|
109
|
+
|
|
110
|
+
# The set of collection names genuinely scoped by the target (the target plus
|
|
111
|
+
# everything that can reach it through belongs_to). Exposed for inspection.
|
|
112
|
+
attr_reader :reachable
|
|
113
|
+
|
|
114
|
+
def summary
|
|
115
|
+
{
|
|
116
|
+
extractable: @extractable.size,
|
|
117
|
+
genuine: @genuine.size,
|
|
118
|
+
leaves: @leaves.size,
|
|
119
|
+
ref_bt: @ref_bt.size,
|
|
120
|
+
consumed_leaves: @consumed_leaves.size,
|
|
121
|
+
direct_leaf_genuine: @direct_leaf_genuine.size,
|
|
122
|
+
reference_components: @reference_components.map(&:size).sort.reverse,
|
|
123
|
+
}
|
|
124
|
+
end
|
|
125
|
+
|
|
126
|
+
private
|
|
127
|
+
|
|
128
|
+
# Fixpoint over non-embedded configs: the target, plus every collection that
|
|
129
|
+
# can reach it by following belongs_to (child -> parent) transitively.
|
|
130
|
+
# Mirrors MongodbAdapter#genuine_scope_set (same traversal, same inclusion of
|
|
131
|
+
# ignore:true collections) so the genuine set matches the adapter's runtime
|
|
132
|
+
# scoping decision.
|
|
133
|
+
def compute_reachable
|
|
134
|
+
reachable = Set.new([@target_table_name])
|
|
135
|
+
loop do
|
|
136
|
+
added = false
|
|
137
|
+
@by.each_value do |cfg|
|
|
138
|
+
next if cfg.embedded? || reachable.include?(cfg.name)
|
|
139
|
+
next unless cfg.belongs_tos.any? { |rel| reachable.include?(rel.table_name) }
|
|
140
|
+
|
|
141
|
+
reachable << cfg.name
|
|
142
|
+
added = true
|
|
143
|
+
end
|
|
144
|
+
break unless added
|
|
145
|
+
end
|
|
146
|
+
reachable
|
|
147
|
+
end
|
|
148
|
+
|
|
149
|
+
def classify
|
|
150
|
+
# The three groups partition #extractable: reachable -> genuine; otherwise
|
|
151
|
+
# leaf (no belongs_to) -> leaves; otherwise -> ref_bt. The target is
|
|
152
|
+
# reachable (it seeds the set), so it lands in genuine and is never
|
|
153
|
+
# mis-grouped as a leaf even when it has no belongs_to of its own — which
|
|
154
|
+
# would otherwise double-process it (leaf pool AND parent).
|
|
155
|
+
@genuine = []
|
|
156
|
+
@leaves = []
|
|
157
|
+
@ref_bt = []
|
|
158
|
+
@extractable.each do |name|
|
|
159
|
+
if @reachable.include?(name)
|
|
160
|
+
@genuine << name
|
|
161
|
+
elsif leaf?(name)
|
|
162
|
+
@leaves << name
|
|
163
|
+
else
|
|
164
|
+
@ref_bt << name
|
|
165
|
+
end
|
|
166
|
+
end
|
|
167
|
+
@genuine.freeze
|
|
168
|
+
@leaves.freeze
|
|
169
|
+
@ref_bt.freeze
|
|
170
|
+
# Membership against the leaf *group* (which excludes the target), not the
|
|
171
|
+
# raw structural #leaf? predicate. The target has no belongs_to and is thus
|
|
172
|
+
# structurally leaf-like, but it is genuine — processed by the parent, not a
|
|
173
|
+
# leaf worker — so a belongs_to to the target must not count as referencing
|
|
174
|
+
# a leaf (it would wrongly demand a sidecar / seed the cascade).
|
|
175
|
+
@leaf_set = @leaves.to_set
|
|
176
|
+
end
|
|
177
|
+
|
|
178
|
+
def derive_consumed_leaves
|
|
179
|
+
consumed = Set.new
|
|
180
|
+
(@genuine + @ref_bt).each do |name|
|
|
181
|
+
@by[name].belongs_tos.each do |rel|
|
|
182
|
+
consumed << rel.table_name if @leaf_set.include?(rel.table_name)
|
|
183
|
+
end
|
|
184
|
+
end
|
|
185
|
+
@consumed_leaves = consumed.freeze
|
|
186
|
+
end
|
|
187
|
+
|
|
188
|
+
def derive_cascade_adjacency
|
|
189
|
+
@direct_leaf_genuine = @genuine.select do |name|
|
|
190
|
+
@by[name].belongs_tos.any? { |rel| @leaf_set.include?(rel.table_name) }
|
|
191
|
+
end.freeze
|
|
192
|
+
|
|
193
|
+
children = Hash.new { |h, k| h[k] = [] }
|
|
194
|
+
@genuine.each do |name|
|
|
195
|
+
@by[name].belongs_tos.each do |rel|
|
|
196
|
+
children[rel.table_name] << name if @reachable.include?(rel.table_name)
|
|
197
|
+
end
|
|
198
|
+
end
|
|
199
|
+
# Freeze with a non-mutating default so a lookup of a parent with no genuine
|
|
200
|
+
# children returns [] without trying to write into the frozen hash.
|
|
201
|
+
children.default_proc = nil
|
|
202
|
+
children.default = EMPTY_NAMES
|
|
203
|
+
@genuine_children = children.freeze
|
|
204
|
+
end
|
|
205
|
+
|
|
206
|
+
# ref_bt as dependency-closed weakly-connected components over intra-ref_bt
|
|
207
|
+
# belongs_to edges, each topo-ordered. Ported from the bench prototype: build
|
|
208
|
+
# the directed (child indegree) and undirected (component) views of the
|
|
209
|
+
# intra-ref_bt edges, find weakly-connected components, then Kahn-order each.
|
|
210
|
+
def compute_reference_components
|
|
211
|
+
ref_set = @ref_bt.to_set
|
|
212
|
+
children = Hash.new { |h, k| h[k] = [] }
|
|
213
|
+
adjacency = Hash.new { |h, k| h[k] = [] }
|
|
214
|
+
@ref_bt.each do |name|
|
|
215
|
+
@by[name].belongs_tos.each do |rel|
|
|
216
|
+
next unless ref_set.include?(rel.table_name)
|
|
217
|
+
|
|
218
|
+
children[rel.table_name] << name
|
|
219
|
+
adjacency[rel.table_name] << name
|
|
220
|
+
adjacency[name] << rel.table_name
|
|
221
|
+
end
|
|
222
|
+
end
|
|
223
|
+
|
|
224
|
+
seen = Set.new
|
|
225
|
+
components = []
|
|
226
|
+
@ref_bt.each do |start|
|
|
227
|
+
next if seen.include?(start)
|
|
228
|
+
|
|
229
|
+
stack = [start]
|
|
230
|
+
members = []
|
|
231
|
+
until stack.empty?
|
|
232
|
+
node = stack.pop
|
|
233
|
+
next if seen.include?(node)
|
|
234
|
+
|
|
235
|
+
seen << node
|
|
236
|
+
members << node
|
|
237
|
+
adjacency[node].each { |neighbor| stack << neighbor unless seen.include?(neighbor) }
|
|
238
|
+
end
|
|
239
|
+
components << members
|
|
240
|
+
end
|
|
241
|
+
|
|
242
|
+
components.map { |members| topo_order(members, children) }
|
|
243
|
+
end
|
|
244
|
+
|
|
245
|
+
# Kahn topological order of `members` over intra-component belongs_to edges
|
|
246
|
+
# (parent before child). `children` is the directed intra-ref_bt adjacency.
|
|
247
|
+
def topo_order(members, children)
|
|
248
|
+
member_set = members.to_set
|
|
249
|
+
indegree = members.to_h do |name|
|
|
250
|
+
[name, @by[name].belongs_tos.count { |rel| member_set.include?(rel.table_name) }]
|
|
251
|
+
end
|
|
252
|
+
queue = members.select { |name| indegree[name].zero? }
|
|
253
|
+
ordered = []
|
|
254
|
+
until queue.empty?
|
|
255
|
+
node = queue.shift
|
|
256
|
+
ordered << node
|
|
257
|
+
children[node].each do |child|
|
|
258
|
+
next unless member_set.include?(child)
|
|
259
|
+
|
|
260
|
+
indegree[child] -= 1
|
|
261
|
+
queue << child if indegree[child].zero?
|
|
262
|
+
end
|
|
263
|
+
end
|
|
264
|
+
ordered
|
|
265
|
+
end
|
|
266
|
+
|
|
267
|
+
def leaf?(name)
|
|
268
|
+
(cfg = @by[name]) && !cfg.embedded? && cfg.belongs_tos.empty?
|
|
269
|
+
end
|
|
270
|
+
end
|
|
271
|
+
end
|
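Note on `compute_reference_components` / `topo_order` above: the comments describe grouping collections into weakly-connected components over `belongs_to` edges and then Kahn-ordering each component so parents come before children. The following is a minimal standalone sketch of that idea, not the gem's code. The `BELONGS_TO` hash, the collection names, and the method names `reference_components` and `kahn_order` are hypothetical stand-ins for `@by[name].belongs_tos`.

```ruby
require "set"

# Hypothetical input: collection name => parents it belongs_to.
# Every parent is assumed to also appear as a key.
BELONGS_TO = {
  "venues"    => ["cities"],
  "cities"    => ["countries"],
  "countries" => [],
  "tag_links" => ["tags"],
  "tags"      => [],
}.freeze

# Kahn topological order of `members`, parent before child.
def kahn_order(members, belongs_to)
  member_set = members.to_set
  children = Hash.new { |h, k| h[k] = [] }
  indegree = members.to_h do |name|
    parents = belongs_to.fetch(name).select { |p| member_set.include?(p) }
    parents.each { |p| children[p] << name }
    [name, parents.size]
  end

  queue = members.select { |name| indegree[name].zero? }
  ordered = []
  until queue.empty?
    node = queue.shift
    ordered << node
    children[node].each do |child|
      indegree[child] -= 1
      queue << child if indegree[child].zero?
    end
  end
  ordered
end

# Weakly-connected components (edge direction ignored), each Kahn-ordered.
def reference_components(belongs_to)
  adjacency = Hash.new { |h, k| h[k] = [] }
  belongs_to.each do |child, parents|
    parents.each do |parent|
      adjacency[parent] << child
      adjacency[child] << parent
    end
  end

  seen = Set.new
  components = []
  belongs_to.each_key do |start|
    next if seen.include?(start)

    stack = [start]
    members = []
    until stack.empty?
      node = stack.pop
      next if seen.include?(node)

      seen << node
      members << node
      adjacency[node].each { |n| stack << n unless seen.include?(n) }
    end
    components << kahn_order(members, belongs_to)
  end
  components
end
```

Components share no edges, so under this scheme each one could be handed to a different worker while the order inside a component still guarantees a parent is processed before its children.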
data/lib/exwiw/runner.rb
CHANGED
|
@@ -13,6 +13,7 @@ module Exwiw
|
|
|
13
13
|
output_format: 'insert',
|
|
14
14
|
insert_only: false,
|
|
15
15
|
after_insert_hook_path: nil,
|
|
16
|
+
parallel_workers: nil,
|
|
16
17
|
cli_options: {}
|
|
17
18
|
)
|
|
18
19
|
@connection_config = connection_config
|
|
@@ -22,6 +23,7 @@ module Exwiw
|
|
|
22
23
|
@output_format = output_format
|
|
23
24
|
@insert_only = insert_only
|
|
24
25
|
@after_insert_hook_path = after_insert_hook_path
|
|
26
|
+
@parallel_workers = parallel_workers
|
|
25
27
|
@cli_options = cli_options
|
|
26
28
|
@logger = logger
|
|
27
29
|
end
|
|
@@ -49,6 +51,19 @@ module Exwiw
|
|
|
49
51
|
|
|
50
52
|
clean_output_dir!
|
|
51
53
|
|
|
54
|
+
# Opt-in MongoDB inter-collection fork parallelism (see
|
|
55
|
+
# docs/mongodb-dump-parallelism-2x-notes.md). It is byte-identical to the
|
|
56
|
+
# serial loop below — same filenames (the file index is taken over the same
|
|
57
|
+
# full processing order) and same per-collection bytes — so it is a drop-in
|
|
58
|
+
# replacement for the whole schema+inserts pass, after which the common
|
|
59
|
+
# after-insert hook still runs. Everything before this point (validation,
|
|
60
|
+
# scope check, ordering, output-dir clean) applies to both paths.
|
|
61
|
+
if use_mongodb_parallel?(adapter)
|
|
62
|
+
dump_mongodb_parallel(configs, table_by_name)
|
|
63
|
+
run_after_insert_hook(adapter, ordered_table_names.size)
|
|
64
|
+
return
|
|
65
|
+
end
|
|
66
|
+
|
|
52
67
|
ordered_tables = ordered_table_names.map { |n| table_by_name.fetch(n) }
|
|
53
68
|
schema_path = File.join(@output_dir, "insert-000-schema.#{adapter.schema_output_extension}")
|
|
54
69
|
@logger.info("Writing schema to #{schema_path}...")
|
|
@@ -161,17 +176,71 @@ module Exwiw
|
|
|
161
176
|
end
|
|
162
177
|
end
|
|
163
178
|
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
179
|
+
run_after_insert_hook(adapter, total_size)
|
|
180
|
+
end
|
|
181
|
+
|
|
182
|
+
# Run the post-processing hook (no-op when none configured). `total_size` is
|
|
183
|
+
# the count of processed tables/collections; the hook's first output file is
|
|
184
|
+
# numbered just past them. Shared by the serial and parallel dump paths.
|
|
185
|
+
private def run_after_insert_hook(adapter, total_size)
|
|
186
|
+
return unless @after_insert_hook_path
|
|
187
|
+
|
|
188
|
+
@logger.info("Running after-insert hook: #{@after_insert_hook_path}")
|
|
189
|
+
AfterInsertHook.run(
|
|
190
|
+
path: @after_insert_hook_path,
|
|
191
|
+
cli_options: @cli_options,
|
|
192
|
+
output_dir: @output_dir,
|
|
193
|
+
next_idx: total_size + 1,
|
|
194
|
+
output_extension: adapter.output_extension,
|
|
195
|
+
logger: @logger,
|
|
196
|
+
)
|
|
197
|
+
end
|
|
198
|
+
|
|
199
|
+
# True when the opt-in MongoDB fork-parallel dump should run instead of the
|
|
200
|
+
# serial loop: the mongodb adapter, a worker count > 1, a genuine-anchor dump
|
|
201
|
+
# target (the schedule is built around the scoped DAG), and a runtime that can
|
|
202
|
+
# fork. Anything else falls back to the serial path (warning when the user
|
|
203
|
+
# explicitly asked for parallelism but it cannot apply).
|
|
204
|
+
private def use_mongodb_parallel?(adapter)
|
|
205
|
+
return false unless adapter.is_a?(Adapter::MongodbAdapter)
|
|
206
|
+
return false unless @parallel_workers && @parallel_workers > 1
|
|
207
|
+
|
|
208
|
+
if @dump_target.table_name.nil?
|
|
209
|
+
@logger.warn("--parallel-workers ignored: MongoDB parallelism needs a --target-collection; running serially.")
|
|
210
|
+
return false
|
|
174
211
|
end
|
|
212
|
+
|
|
213
|
+
unless MongodbParallelDumper.available?
|
|
214
|
+
@logger.warn("--parallel-workers ignored: fork is unavailable on this runtime; running serially.")
|
|
215
|
+
return false
|
|
216
|
+
end
|
|
217
|
+
|
|
218
|
+
true
|
|
219
|
+
end
|
|
220
|
+
|
|
221
|
+
# Build the static plan and hand the whole schema+inserts pass to the fork
|
|
222
|
+
# orchestrator. `configs` have already been through reject_ignored_members! (the plan
|
|
223
|
+
# rejects embedded and orders them itself, identically to the serial path).
|
|
224
|
+
private def dump_mongodb_parallel(configs, table_by_name)
|
|
225
|
+
plan = MongodbParallelPlan.new(
|
|
226
|
+
configs: configs,
|
|
227
|
+
target_table_name: @dump_target.table_name,
|
|
228
|
+
logger: @logger,
|
|
229
|
+
)
|
|
230
|
+
@logger.info(
|
|
231
|
+
"MongoDB parallel dump with #{@parallel_workers} worker(s): " \
|
|
232
|
+
"genuine=#{plan.genuine.size}, leaves=#{plan.leaves.size}, ref_bt=#{plan.ref_bt.size}."
|
|
233
|
+
)
|
|
234
|
+
stats = MongodbParallelDumper.new(
|
|
235
|
+
connection_config: @connection_config,
|
|
236
|
+
plan: plan,
|
|
237
|
+
dump_target: @dump_target,
|
|
238
|
+
table_by_name: table_by_name,
|
|
239
|
+
output_dir: @output_dir,
|
|
240
|
+
workers: @parallel_workers,
|
|
241
|
+
logger: @logger,
|
|
242
|
+
).run
|
|
243
|
+
@logger.info("MongoDB parallel dump complete: #{stats.inspect}")
|
|
175
244
|
end
|
|
176
245
|
|
|
177
246
|
# Empty the output dir before writing so each export starts from a clean
|
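Note on `use_mongodb_parallel?` above: the gate checks four conditions in order and only warns when the user asked for parallelism that cannot apply. The sketch below restates that decision as a pure function so the fallback rules are easy to read. `parallel_decision` and its keyword arguments are hypothetical names. Using `Process.respond_to?(:fork)` as the availability check is an assumption, since the body of `MongodbParallelDumper.available?` is not shown in this part of the diff.

```ruby
# Returns [use_parallel, warning]; warning is nil when nothing needs reporting.
# All names here are illustrative and are not part of exwiw's API.
def parallel_decision(mongodb:, workers:, target:,
                      fork_available: Process.respond_to?(:fork))
  # Not the mongodb adapter, or no explicit worker count > 1: serial, silently.
  return [false, nil] unless mongodb
  return [false, nil] unless workers && workers > 1

  # The user asked for parallelism, so explain why it is being ignored.
  if target.nil?
    return [false, "--parallel-workers ignored: no dump target; running serially."]
  end
  unless fork_available
    return [false, "--parallel-workers ignored: fork is unavailable; running serially."]
  end

  [true, nil]
end
```

Returning the warning instead of logging it keeps the sketch free of a logger dependency; the real method logs through `@logger.warn` and returns only the boolean.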
data/lib/exwiw/version.rb
CHANGED
data/lib/exwiw.rb
CHANGED
|
@@ -23,6 +23,8 @@ require_relative "exwiw/adapter/mysql_adapter"
|
|
|
23
23
|
require_relative "exwiw/adapter/postgresql_adapter"
|
|
24
24
|
require_relative "exwiw/adapter/mongodb_adapter"
|
|
25
25
|
require_relative "exwiw/determine_table_processing_order"
|
|
26
|
+
require_relative "exwiw/mongodb_parallel_plan"
|
|
27
|
+
require_relative "exwiw/mongodb_parallel_dumper"
|
|
26
28
|
require_relative "exwiw/mongo_query"
|
|
27
29
|
require_relative "exwiw/query_ast"
|
|
28
30
|
require_relative "exwiw/query_ast_builder"
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: exwiw
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.8.
|
|
4
|
+
version: 0.8.5
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Shia
|
|
@@ -72,6 +72,8 @@ files:
|
|
|
72
72
|
- lib/exwiw/mongo_query.rb
|
|
73
73
|
- lib/exwiw/mongodb_collection_config.rb
|
|
74
74
|
- lib/exwiw/mongodb_field.rb
|
|
75
|
+
- lib/exwiw/mongodb_parallel_dumper.rb
|
|
76
|
+
- lib/exwiw/mongodb_parallel_plan.rb
|
|
75
77
|
- lib/exwiw/mongoid_schema_generator.rb
|
|
76
78
|
- lib/exwiw/query_ast.rb
|
|
77
79
|
- lib/exwiw/query_ast_builder.rb
|