exwiw 0.7.0 → 0.8.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -1
- data/README.md +63 -24
- data/docs/mongodb-dump-parallelism-2x-notes.md +146 -0
- data/docs/optimization-notes.md +12 -0
- data/docs/scope-column-redesign.md +107 -0
- data/lib/exwiw/adapter/sql_bulk_insert.rb +25 -7
- data/lib/exwiw/adapter.rb +33 -4
- data/lib/exwiw/cli.rb +25 -43
- data/lib/exwiw/query_ast.rb +1 -1
- data/lib/exwiw/query_ast_builder.rb +34 -20
- data/lib/exwiw/runner.rb +24 -9
- data/lib/exwiw/schema_generator.rb +81 -3
- data/lib/exwiw/table_config.rb +5 -4
- data/lib/exwiw/version.rb +1 -1
- data/lib/tasks/exwiw.rake +17 -1
- metadata +3 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 80610cc2d13a87793171563b2b8e4ff0568135eb987b550dd9f89bffe89b1d67
|
|
4
|
+
data.tar.gz: 5e9f4976043571647163e9f743c3c8fb709a29944b590c55424dce374cca3269
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 8508aaa2d9cba3310a9ee4d4b940682b894bbafcab9672e4ccca5fb8f1a4b43e6fe9c36cbfcd0efe8f12c5308fa9be33aeb16d53b6d98ef71692ec2e30076be0
|
|
7
|
+
data.tar.gz: 8a1a8187e7547f4ce0eeaee458d36b16e56c08c1f2275977127de56a49eb2a3b25e22d8c52d3678fa6ea91e553ebf6c69bf943e5bb108e3984fa8fc34f03a2f0
|
data/CHANGELOG.md
CHANGED
|
@@ -2,6 +2,25 @@
|
|
|
2
2
|
|
|
3
3
|
## [Unreleased]
|
|
4
4
|
|
|
5
|
+
## [0.8.1] - 2026-06-24
|
|
6
|
+
|
|
7
|
+
### Fixed
|
|
8
|
+
|
|
9
|
+
- **`schema:generate` skips a `belongs_to` whose target is not an ActiveRecord model instead of crashing.** A `belongs_to` can point at a non-ActiveRecord class — most commonly an ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name: "SomeActiveYamlModel"`). active_hash registers these as ordinary `belongs_to` reflections, but the target class has no database table, so resolving its `table_name` raised and aborted generation. Such a relation is not a DB edge exwiw can join or extract across, so it is now dropped from the generated belongs_tos; the underlying foreign-key column is still emitted as a plain column. A bare `belongs_to` to a plain non-AR class — which makes ActiveRecord raise while resolving the target — is treated the same way. Polymorphic associations are unaffected.
|
|
10
|
+
|
|
11
|
+
## [0.8.0] - 2026-06-24
|
|
12
|
+
|
|
13
|
+
### Added
|
|
14
|
+
|
|
15
|
+
- **Scope-column mode is now declared per table.** A table that should be filtered by a shared scope/tenant column declares `scope_column: <column>` in its schema config. Naming any such table as `--target-table` then runs in scope-column mode: `--ids` are values of that shared column (not primary keys), and every table is filtered by its own `scope_column` — tables that lack one are reached via `belongs_to`, `scope_exempt: true` tables are dumped in full, and a table reachable by none aborts the run (so nothing is silently dumped unscoped). This is the primary way to extract across a foreign key that cannot be joined — most importantly a cross-database `belongs_to` (its join is impossible, but the FK column is still filterable): declare `scope_column: "<that foreign key>"` on the owning table and it is filtered by the FK value directly, with no join. SQL adapters only.
|
|
16
|
+
- **`schema:generate` detects a cross-database `belongs_to` and ignores just the relation — not the table, not the foreign-key column.** A `belongs_to` whose target model lives in a different database (Rails multi-database `connects_to`) cannot be joined — each database is exported separately — so the *belongs_to entry* is emitted with `ignore: true` and `ignore_type: "cross_database"` and a `comment` recording why and pointing at the per-table `scope_column` declaration for cross-boundary extraction. The owning table is still exported normally and its foreign-key column is still exported as a plain column; only the join/dependency edge is dropped at load (otherwise the dangling cross-database target would crash dependency resolution). Polymorphic associations are handled per target. The task prints a summary of every cross-database relation it ignored. (`ignore_type` is now also preserved across regeneration by `TableConfig#merge`.) Single-database applications are unaffected.
|
|
17
|
+
|
|
18
|
+
### Changed (breaking)
|
|
19
|
+
|
|
20
|
+
- **`--ids-column` is removed.** The SQL-adapter flag that matched `--ids` against a non-primary-key column on the target table has no remaining use case; for a scoped table, the per-table `scope_column` is the column that `--ids` are matched against. (The mongodb-only `--ids-field` is unaffected.)
|
|
21
|
+
- **`--scope-column` is deprecated.** The global flag still selects scope-column mode (SQL-only, mutually exclusive with `--target-table`) but now emits a deprecation warning. Prefer declaring a per-table `scope_column:` in the schema config and running with `--target-table`. A per-table `scope_column` takes precedence over the flag for any table that sets both.
|
|
22
|
+
- **A scoped target's `--ids` now mean scope values, not primary keys.** When `--target-table` names a table that declares a `scope_column`, exwiw runs in scope-column mode and `--ids` are matched against that shared column; the target is scoped like any other table rather than anchored by primary key. A table that declares a `scope_column` can therefore no longer be single-extracted by primary key.
|
|
23
|
+
|
|
5
24
|
## [0.7.0] - 2026-06-23
|
|
6
25
|
|
|
7
26
|
### Changed
|
|
@@ -16,7 +35,7 @@
|
|
|
16
35
|
|
|
17
36
|
### Added
|
|
18
37
|
|
|
19
|
-
- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
|
|
38
|
+
- Optimize memory usage https://github.com/heyinc/exwiw/pull/118
|
|
20
39
|
- **MongoDB: optional native (C) encoder for the Extended-JSON dump path** (no flag, byte-identical output, pure-Ruby fallback). Encoding each document to MongoDB Relaxed Extended JSON — previously `JSON.generate(doc.as_extended_json(mode: :relaxed))`, which rebuilds the whole document into an intermediate transformed Hash tree and then walks it again — was the dominant per-document CPU cost (~82% of serialization on embed-heavy data). A new C extension (`ext/exwiw/ext_json/`) emits the JSONL line in a single native tree-walk. It formats the structural bulk plus the leaves that dominate a dumped document — `Hash`, `Array`, `String`, fixnum `Integer`, `true`/`false`/`nil`, `BSON::ObjectId` (`_id`), and in-range `Time` (the Mongoid `created_at`/`updated_at` timestamps) — and delegates everything else (`Float`, out-of-int64 `Integer`, out-of-range `Time`, `Symbol`, `Decimal128`, …) back to the exact pure-Ruby path, so the output is provably byte-for-byte identical. On a 30-embedded-post timestamp-heavy document this serializes ~2.8× faster. With `gem install exwiw` the extension compiles automatically; hosts that cannot compile (JRuby/TruffleRuby, no toolchain) fall back to the pure-Ruby encoder, so exwiw stays installable as a pure-Ruby gem. See [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md).
|
|
21
40
|
|
|
22
41
|
## [0.5.3] - 2026-06-19
|
data/README.md
CHANGED
|
@@ -79,7 +79,7 @@ exwiw \
|
|
|
79
79
|
--log-level=info
|
|
80
80
|
```
|
|
81
81
|
|
|
82
|
-
By default `--ids` are matched against the target table's primary key.
|
|
82
|
+
By default `--ids` are matched against the target table's primary key. If the target table declares a per-table `scope_column`, exwiw runs in [scope-column mode](#scope-column-mode) instead — `--ids` are then values of that shared column, and the table is scoped like any other rather than anchored by primary key.
|
|
83
83
|
|
|
84
84
|
When `--target-table` and `--ids` are omitted, exwiw dumps all tables defined in `--schema-dir`:
|
|
85
85
|
|
|
@@ -129,18 +129,31 @@ exwiw explain \
|
|
|
129
129
|
|
|
130
130
|
The `--output-dir`, `--output-format`, `--insert-only`, and `--after-insert-hook` options are dump-specific and rejected when used with `explain`.
|
|
131
131
|
|
|
132
|
-
### Scope-column mode
|
|
132
|
+
### Scope-column mode
|
|
133
133
|
|
|
134
134
|
The default `--target-table` extraction assumes the schema converges on a single
|
|
135
135
|
root: every table is reached by walking `belongs_to` toward that one table. Some
|
|
136
136
|
schemas are not shaped that way — many independent top-level tables each carry the
|
|
137
|
-
*same* scope/tenant column (e.g. `tenant_id`, `
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
|
|
137
|
+
*same* scope/tenant column (e.g. `tenant_id`, `business_entity_id`), and a foreign
|
|
138
|
+
key that **cannot be joined** (most importantly a cross-database `belongs_to`,
|
|
139
|
+
whose join is impossible but whose FK column is still filterable) is not reached at
|
|
140
|
+
all. Choosing one table as `--target-table` would leave the others unrelated to it,
|
|
141
|
+
and an unrelated table is dumped in full — a problem if it holds personal data.
|
|
141
142
|
|
|
142
|
-
|
|
143
|
-
filtered by a shared column** whose values are `--ids`.
|
|
143
|
+
Scope-column mode handles this shape: instead of anchoring on one table's primary
|
|
144
|
+
key, **every table is filtered by a shared column** whose values are `--ids`.
|
|
145
|
+
Declare that column per table in the schema config with `scope_column:`:
|
|
146
|
+
|
|
147
|
+
```json
|
|
148
|
+
{
|
|
149
|
+
"name": "shops",
|
|
150
|
+
"primary_key": "id",
|
|
151
|
+
"scope_column": "business_entity_id",
|
|
152
|
+
"columns": [{ "name": "id" }, { "name": "name" }, { "name": "business_entity_id" }]
|
|
153
|
+
}
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
Then name any scoped table as `--target-table` and pass the scope values as `--ids`:
|
|
144
157
|
|
|
145
158
|
```bash
|
|
146
159
|
exwiw \
|
|
@@ -148,15 +161,21 @@ exwiw \
|
|
|
148
161
|
--host=localhost --port=5432 --user=reader \
|
|
149
162
|
--database=app_production \
|
|
150
163
|
--schema-dir=exwiw/schema \
|
|
151
|
-
--
|
|
152
|
-
--ids=42,43 \
|
|
164
|
+
--target-table=shops --ids=42,43 \
|
|
153
165
|
--output-dir=dump
|
|
154
166
|
```
|
|
155
167
|
|
|
168
|
+
Because `shops` declares a `scope_column`, exwiw switches to scope-column mode: the
|
|
169
|
+
`--ids` (`42,43`) are **`business_entity_id` values, not shop primary keys**, and
|
|
170
|
+
`shops` itself is scoped by `business_entity_id IN (42,43)` like every other scoped
|
|
171
|
+
table — it is *not* used as a primary-key anchor. (A table that declares a
|
|
172
|
+
`scope_column` therefore can no longer be single-extracted by primary key.)
|
|
173
|
+
|
|
156
174
|
Each table is resolved as follows:
|
|
157
175
|
|
|
158
|
-
- **
|
|
159
|
-
|
|
176
|
+
- **Declares the scope column** (`scope_column:`, or carries the global column of
|
|
177
|
+
the deprecated `--scope-column` flag) → `WHERE scope_column IN (ids)`.
|
|
178
|
+
- **Does not, but `belongs_to` reaches a table that does** → exwiw joins up to the
|
|
160
179
|
nearest such table and applies the scope filter there (the same join machinery
|
|
161
180
|
the single-target mode uses).
|
|
162
181
|
- **`belongs_to` a parent that is itself scoped but carries no scope column of its
|
|
@@ -168,13 +187,22 @@ Each table is resolved as follows:
|
|
|
168
187
|
a single forward hop and a single unambiguous scopable parent.
|
|
169
188
|
- **Cannot be scoped at all** (no scope column and no path to one) → exwiw
|
|
170
189
|
**aborts** and lists the offending tables, so an unscoped table is never silently
|
|
171
|
-
dumped in full. For each, either
|
|
172
|
-
skip it, or mark it `scope_exempt: true` (below) to
|
|
190
|
+
dumped in full. For each, either declare a `scope_column`, add a `belongs_to`
|
|
191
|
+
path, set `ignore: true` to skip it, or mark it `scope_exempt: true` (below) to
|
|
192
|
+
export it in full.
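
The resolution rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not exwiw's implementation: the `Table` struct, `resolve_scope`, and the result symbols are hypothetical names, the order in which `scope_exempt` and `scope_column` are checked is assumed, and the reverse-reference rule is left out.

```ruby
# Hypothetical stand-in for a table config; not exwiw's TableConfig.
Table = Struct.new(:name, :scope_column, :scope_exempt, :belongs_to, keyword_init: true)

# Returns a tag describing how a table would be scoped (assumed rule order).
def resolve_scope(table, tables_by_name, seen = [])
  return [:full_dump] if table.scope_exempt
  return [:where_in, table.scope_column] if table.scope_column
  return [:unscopable] if seen.include?(table.name) # guard against cycles

  Array(table.belongs_to).each do |parent_name|
    parent = tables_by_name[parent_name]
    next unless parent

    result = resolve_scope(parent, tables_by_name, seen + [table.name])
    # A parent that is itself scoped (directly or via a join) makes this table scopable.
    return [:join_to, parent_name] if %i[where_in join_to].include?(result.first)
  end
  [:unscopable]
end

tables = [
  Table.new(name: "shops", scope_column: "business_entity_id"),
  Table.new(name: "items", belongs_to: ["shops"]),
  Table.new(name: "countries", scope_exempt: true),
  Table.new(name: "audit_logs")
]
by_name = tables.to_h { |t| [t.name, t] }

unscopable = tables.select { |t| resolve_scope(t, by_name).first == :unscopable }
puts "would abort, unscopable tables: #{unscopable.map(&:name).join(', ')}" unless unscopable.empty?
```

Collecting every unscopable table before aborting mirrors the behaviour described above, where the offending tables are listed together rather than one per run.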
|
|
193
|
+
|
|
194
|
+
Scope-column mode is SQL-only (mysql / postgresql / sqlite). It works with `exwiw
|
|
195
|
+
explain` too, which is the recommended way to preview the queries before exporting.
|
|
196
|
+
|
|
197
|
+
#### Cross-database foreign keys
|
|
173
198
|
|
|
174
|
-
|
|
175
|
-
|
|
176
|
-
|
|
177
|
-
|
|
199
|
+
The motivating case for declaring a `scope_column` is a foreign key that cannot be
|
|
200
|
+
joined: when a `belongs_to` target lives in a different database (see the
|
|
201
|
+
cross-database `belongs_to` note under the generator), that join is impossible, but
|
|
202
|
+
the foreign-key *column* is still present and can be filtered directly. Declaring
|
|
203
|
+
`scope_column: "<that foreign key>"` on the owning table scopes it by the column
|
|
204
|
+
value, with no join — `schema:generate` points this out in the ignored relation's
|
|
205
|
+
`comment`.
|
|
178
206
|
|
|
179
207
|
#### `scope_exempt` (intentional full dump)
|
|
180
208
|
|
|
@@ -193,11 +221,11 @@ opt out of the strict check and be exported in full:
|
|
|
193
221
|
Rails-managed tables (`schema_migrations`, `ar_internal_metadata`) are treated as
|
|
194
222
|
exempt automatically.
|
|
195
223
|
|
|
196
|
-
#### Per-table `scope_column`
|
|
224
|
+
#### Per-table `scope_column` and the value space
|
|
197
225
|
|
|
198
|
-
|
|
199
|
-
to every scoped table.
|
|
200
|
-
|
|
226
|
+
Scope-column mode assumes a single shared **value** space — the same `--ids` apply
|
|
227
|
+
to every scoped table. Each table names its own column, so a table that stores that
|
|
228
|
+
same value under a differently named column simply declares that name:
|
|
201
229
|
|
|
202
230
|
```json
|
|
203
231
|
{
|
|
@@ -211,6 +239,15 @@ column, override the column name for that table:
|
|
|
211
239
|
Both `scope_exempt` and `scope_column` are user-maintained and preserved across
|
|
212
240
|
`schema:generate` regeneration (the generators never emit them).
|
|
213
241
|
|
|
242
|
+
#### Deprecated: the `--scope-column` flag
|
|
243
|
+
|
|
244
|
+
Before per-table declarations, scope-column mode was selected with a global
|
|
245
|
+
`--scope-column=COLUMN` flag (every table filtered by that one column, `--ids` its
|
|
246
|
+
values, no `--target-table`). The flag still works — SQL-only and mutually
|
|
247
|
+
exclusive with `--target-table` — but is **deprecated** and emits a warning; prefer
|
|
248
|
+
declaring a per-table `scope_column` and running with `--target-table`. A per-table
|
|
249
|
+
`scope_column` takes precedence over the flag for any table that sets both.
|
|
250
|
+
|
|
214
251
|
### Config file (`exwiw.yml`)
|
|
215
252
|
|
|
216
253
|
Options you would otherwise repeat on every run can be kept in a YAML config file. Pass it with `--config=PATH`; when `--config` is omitted, exwiw automatically loads `exwiw.yml` (or `exwiw.yaml`) from the current directory if present.
|
|
@@ -226,7 +263,7 @@ output_format: insert # insert | copy
|
|
|
226
263
|
insert_only: false
|
|
227
264
|
after_insert_hook: hooks/seed.rb
|
|
228
265
|
log_level: info # debug | info
|
|
229
|
-
# target_table / ids / ids_field /
|
|
266
|
+
# target_table / ids / ids_field / scope_column may also be set here
|
|
230
267
|
```
|
|
231
268
|
|
|
232
269
|
With the file above, only the connection details need to be supplied on the CLI:
|
|
@@ -298,6 +335,8 @@ exwiw/schema/
|
|
|
298
335
|
|
|
299
336
|
Each database keeps its own Rails migration history, so a `schema_migrations` (and `ar_internal_metadata`) entry is emitted under every database that contains one — the example above shows `primary/schema_migrations.json` and would also produce `analytics/schema_migrations.json` when the analytics database has its own migration table. Single-database applications are unaffected and continue to write files flat into the output directory.
|
|
300
337
|
|
|
338
|
+
A `belongs_to` whose target model lives in a *different* database (e.g. a `primary` model referencing an `analytics` one) cannot be joined: each database is exported on its own connection and into its own subdirectory, so the target table is absent from the directory this config is loaded with. `schema:generate` detects such a relation (by comparing the owning and target models' database config names) and emits it with `ignore: true` and `ignore_type: "cross_database"`, recording why in the `comment`; the relation is then dropped from extraction at load time, while the foreign-key column itself is still exported as a plain column. Polymorphic associations are handled per target, so only the targets that cross a database boundary are ignored. The task also prints a summary of every cross-database `belongs_to` it ignored. **To extract across such a boundary, declare `scope_column: "<foreign_key>"` on the owning table (see [scope-column mode](#scope-column-mode)) so its rows are filtered by the foreign-key value directly** — there is no join, so the cross-database boundary is not a problem there.
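
The detection described above (comparing the database config names of the owning and target models) might look roughly like the sketch below. Everything here is an assumption for illustration: `Model`, `Relation`, `db_config_name`, and `belongs_to_entry` are hypothetical stand-ins rather than exwiw or ActiveRecord APIs, and only the `ignore`, `ignore_type`, and `comment` keys come from the text.

```ruby
# Hypothetical stand-ins for a model and its belongs_to reflection.
Model = Struct.new(:name, :db_config_name, keyword_init: true)
Relation = Struct.new(:name, :owner, :target, keyword_init: true)

# Builds a belongs_to entry, flagging it when owner and target use different databases.
def belongs_to_entry(relation)
  entry = { "name" => relation.name }
  if relation.owner.db_config_name != relation.target.db_config_name
    entry["ignore"] = true
    entry["ignore_type"] = "cross_database"
    entry["comment"] = "target is in the #{relation.target.db_config_name} database; " \
                       "declare scope_column on the owning table to extract across it"
  end
  entry
end

order  = Model.new(name: "Order", db_config_name: "primary")
shop   = Model.new(name: "Shop", db_config_name: "primary")
report = Model.new(name: "Report", db_config_name: "analytics")

same_db  = belongs_to_entry(Relation.new(name: "shop", owner: order, target: shop))
cross_db = belongs_to_entry(Relation.new(name: "report", owner: order, target: report))
puts cross_db
```

Only the relation entry is flagged; the sketch never touches the owning table or its foreign-key column, which matches the behaviour described above.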
|
|
339
|
+
|
|
301
340
|
**Limitations**
|
|
302
341
|
|
|
303
342
|
- The rails-managed table *names* are resolved from the global `ActiveRecord::Base.schema_migrations_table_name` / `internal_metadata_table_name` accessors, which are shared across all connections. A per-database override of these names is not detected, so such a table will be missing from that database's generated configs.
|
|
@@ -658,7 +697,7 @@ The MongoDB adapter is experimental. To use it:
|
|
|
658
697
|
- The MongoDB adapter consumes a separate config type, `MongodbCollectionConfig`, with MongoDB-native naming. Use `fields` (instead of the SQL adapters' `columns`), and set `"primary_key": "_id"`. Foreign keys (`shop_id`, `user_id`, ...) stay as ordinary fields.
|
|
659
698
|
- `--ids` values are coerced to the type actually stored in `_id` before filtering: integer-looking ids become `Integer`, 24-char hex ids become `BSON::ObjectId` (Mongoid's default `_id` type — a plain String would never match an ObjectId), and any other string is left as-is.
|
|
660
699
|
- `--target-collection=COLLECTION` is a mongodb-only alias of `--target-table` (use whichever reads better for MongoDB). Specifying both, or using `--target-collection` with a non-mongodb adapter, is an error.
|
|
661
|
-
- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only
|
|
700
|
+
- `--ids-field=FIELD` matches `--ids` against `FIELD` on the target collection instead of its primary key (e.g. `--target-collection=users --ids=a@example.com --ids-field=email`). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are **not** type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is **mongodb-only** (the SQL adapters have no equivalent).
|
|
662
701
|
- Large or embedded-document-heavy dumps are streamed automatically: the adapter reads the collection through a lazy cursor (not `.to_a`) and writes JSONL in chunks, so peak memory is bounded by the chunk size rather than the collection size — no flag to set. Encoding each document to MongoDB Extended JSON is accelerated by an **optional native (C) extension** that compiles automatically on `gem install`; where it cannot compile, exwiw falls back to a byte-identical pure-Ruby encoder. See [`docs/optimization-notes.md`](docs/optimization-notes.md) for the performance investigation and [`docs/optimize-mongodb-export-with-native-ext.md`](docs/optimize-mongodb-export-with-native-ext.md) for the native encoder's design. Benchmark your own data with `script/bench_mongodb_dump.rb`.
|
|
663
702
|
- Output is JSON Lines (`insert-{idx}-{collection}.jsonl`) using MongoDB Extended JSON (relaxed mode). Import with `mongoimport`:
|
|
664
703
|
```bash
|
|
@@ -0,0 +1,146 @@
|
|
|
1
|
+
# MongoDB dump: reaching the 2× target with inter-collection fork parallelism
|
|
2
|
+
|
|
3
|
+
This continues [`optimization-notes.md`](./optimization-notes.md). That document
|
|
4
|
+
ends at "the remaining cost is the serial BSON→Ruby decode, and the lever left is
|
|
5
|
+
a native encoder bounded by the same serial-decode ceiling." This note records a
|
|
6
|
+
**different** lever that was measured to clear a **2× end-to-end** target on a
|
|
7
|
+
large real extraction, **byte-identically**, and explains why it succeeds where
|
|
8
|
+
the previously-removed `--parallel-workers` did not.
|
|
9
|
+
|
|
10
|
+
The measurements below are from a large staging extraction (~300 non-empty
|
|
11
|
+
collections, target → 1 store → ~hundreds of scoped rows, plus reference/master
|
|
12
|
+
data dumped in full). Reproduce with the bench harness against a restored backup
|
|
13
|
+
(see [`mongodb-scoping-fullscan-notes.md`](./mongodb-scoping-fullscan-notes.md)
|
|
14
|
+
for serving a WiredTiger backup as a standalone `mongod`).
|
|
15
|
+
|
|
16
|
+
## Where the time goes (warm cache, native ext compiled)
|
|
17
|
+
|
|
18
|
+
Serial baseline ≈ **6.8–7.0 s**. Per-collection instrumentation (wrapping
|
|
19
|
+
`build_query` / `execute` / `write_inserts`) attributes ~6.0 s to the write pass.
|
|
20
|
+
Splitting one collection's write pass into "drain the cursor only" vs
|
|
21
|
+
"drain + native-encode" shows **decode is 64–92 %** of per-collection cost — the
|
|
22
|
+
Mongo driver's BSON→Ruby-Hash decode dominates; the native Extended-JSON encode is
|
|
23
|
+
already cheap. The single heaviest collection (a 237k-doc reference table) is
|
|
24
|
+
**1.31 s by itself** and cannot be split without intra-collection partitioning.
|
|
25
|
+
|
|
26
|
+
Classifying every processed collection by how it is scoped (using the
|
|
27
|
+
genuine-anchor model from `mongodb-scoping-fullscan-notes.md`) gives three groups
|
|
28
|
+
with very different dependency shapes:
|
|
29
|
+
|
|
30
|
+
| group | meaning | count | time |
|
|
31
|
+
|---|---|---|---|
|
|
32
|
+
| **leaf** | no `belongs_to` → dumped in full, no input deps | 54 | ~3.5 s |
|
|
33
|
+
| **genuine** | reachable to the dump target via `belongs_to` (the scoped DAG) | 245 | ~0.8 s |
|
|
34
|
+
| **ref_bt** | has `belongs_to` but NOT reachable to target (reference data scoped by strict-AND fallback) | 23 | ~2.1 s |
|
|
35
|
+
|
|
36
|
+
The crucial structural facts, all derivable from the configs:
|
|
37
|
+
|
|
38
|
+
- **genuine never consumes leaf/ref_bt `@state`.** A genuine collection drops its
|
|
39
|
+
reference (leaf) parents when it has a genuine anchor, and a ref_bt parent is by
|
|
40
|
+
definition not genuine. So the whole scoped DAG is independent of the heavy
|
|
41
|
+
reference dumps — it can run **concurrently** with them.
|
|
42
|
+
- **leaf has no input deps at all** (no `belongs_to`) → embarrassingly parallel.
|
|
43
|
+
- **ref_bt consumes only leaf `@state` and other ref_bt `@state`** (never genuine).
|
|
44
|
+
Its internal edges form shallow weakly-connected components.
|
|
45
|
+
|
|
46
|
+
## Why process-level (not thread/serialization) parallelism is the right lever
|
|
47
|
+
|
|
48
|
+
`optimization-notes.md` removed two parallelism attempts. The distinction that
|
|
49
|
+
matters:
|
|
50
|
+
|
|
51
|
+
- `--parallel-workers` parallelized **serialization only** (the parent still did
|
|
52
|
+
the single serial cursor decode), so it was capped at ~1.1–1.4× — decode is the
|
|
53
|
+
bottleneck and it was untouched.
|
|
54
|
+
- `--cursor-parallel` parallelized **decode within a collection** (≈2.5–5.5×) but
|
|
55
|
+
forced `sort(_id)`, changing row order → not byte-identical, could not be the
|
|
56
|
+
default.
|
|
57
|
+
|
|
58
|
+
Forking **whole collections** across processes parallelizes the decode (each
|
|
59
|
+
worker decodes its own collections) **while preserving each collection's natural
|
|
60
|
+
order** — so it is both decode-parallel and byte-identical. That is the property
|
|
61
|
+
neither removed approach had.
|
|
62
|
+
|
|
63
|
+
## The schedule that was measured at ≥2×
|
|
64
|
+
|
|
65
|
+
One parent process plus a fork pool of `N` workers:
|
|
66
|
+
|
|
67
|
+
1. **Phase 1 (concurrent).** Fork the leaf pool (`N` workers, leaves assigned by
|
|
68
|
+
LPT bin-packing on output-size weight so the 1.31 s collection sits alone).
|
|
69
|
+
Concurrently the parent (a) dumps the schema (parent-only, needs no `@state`)
|
|
70
|
+
and (b) processes the **whole genuine DAG optimistically** — without leaf
|
|
71
|
+
`@state` yet — recording each collection's row count.
|
|
72
|
+
2. **Barrier + leaf `@state`.** Wait for the leaf pool; load the small Marshal
|
|
73
|
+
sidecars each consumed leaf wrote (only the ~11 leaves a downstream collection
|
|
74
|
+
references need one; the heavy 1.31 s table is not referenced, so no sidecar).
|
|
75
|
+
3. **Phase 2 (cascade reprocess).** Only **8** genuine collections reference a leaf
|
|
76
|
+
at all, and **0** are statically forced to use it — a genuine collection only
|
|
77
|
+
falls back to a leaf clause at runtime when its genuine anchor turns out empty
|
|
78
|
+
(e.g. an empty intermediate parent). Reprocess those 8 with leaf `@state` now
|
|
79
|
+
present; if a reprocessed collection's row count changes, enqueue its genuine
|
|
80
|
+
children and repeat. Non-fallback collections (e.g. the DAG root one level below
|
|
81
|
+
the dump target, which has a genuine anchor and drops its leaf parents)
|
|
82
|
+
reprocess to an identical result and stop the cascade,
|
|
83
|
+
so this terminates after re-touching only the genuinely affected handful.
|
|
84
|
+
4. **Phase 3 (ref_bt components).** Fork the 23 ref_bt collections as
|
|
85
|
+
**dependency-closed weakly-connected components** (over intra-ref_bt edges) in a
|
|
86
|
+
single pool — no level barriers, no cross-worker IPC. Each worker owns whole
|
|
87
|
+
components and processes their members in topological order, seeded with the
|
|
88
|
+
(sliced) leaf `@state` its members reference. Components are assigned by LPT.
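
The LPT (longest-processing-time-first) assignment used for the leaf pool and the ref_bt components can be sketched as below. This is a minimal illustration, not the bench prototype's code: `lpt_assign`, the collection names, and the weights are hypothetical, with only the 1.31 figure echoing the heaviest collection mentioned above.

```ruby
# Hypothetical sketch of LPT bin-packing: heaviest item first, always into
# the currently least-loaded worker.
def lpt_assign(weights, worker_count)
  bins = Array.new(worker_count) { { load: 0.0, items: [] } }
  # Sort by descending weight; break ties by name so the result is deterministic.
  weights.sort_by { |name, weight| [-weight, name] }.each do |name, weight|
    target = bins.min_by { |bin| bin[:load] }
    target[:items] << name
    target[:load] += weight
  end
  bins
end

weights = { "reference_big" => 1.31, "a" => 0.4, "b" => 0.35, "c" => 0.3, "d" => 0.2 }
bins = lpt_assign(weights, 3)
bins.each_with_index do |bin, i|
  puts "worker #{i}: #{bin[:items].join(', ')} (#{bin[:load].round(2)})"
end
```

With these weights the heaviest collection ends up alone on its worker, which is the property the schedule relies on: the wall time of the pool is then bounded below by that single collection rather than by an unlucky grouping.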
|
|
89
|
+
|
|
90
|
+
`@state` IPC is just Marshal sidecars for the ~11 referenced leaves; everything
|
|
91
|
+
else is COW-inherited at fork or kept inside a component.
|
|
92
|
+
|
|
93
|
+
## Measured result (warm cache, same backup)
|
|
94
|
+
|
|
95
|
+
Internal compute timer (excludes the fixed ~0.5 s Ruby/bundler startup both paths
|
|
96
|
+
pay):
|
|
97
|
+
|
|
98
|
+
| workers | compute | speedup |
|
|
99
|
+
|---|---|---|
|
|
100
|
+
| serial | 7.01 s | 1.00× |
|
|
101
|
+
| N=2 | 3.87 s | 1.81× |
|
|
102
|
+
| N=4 | 2.80 s | **2.50×** |
|
|
103
|
+
| N=6 | 2.79 s | **2.51×** |
|
|
104
|
+
|
|
105
|
+
Full wall-clock (includes Ruby startup — the honest end-to-end number):
|
|
106
|
+
|
|
107
|
+
| workers | wall | speedup |
|
|
108
|
+
|---|---|---|
|
|
109
|
+
| serial | 6.79 s | 1.00× |
|
|
110
|
+
| N=2 | 4.06 s | 1.67× |
|
|
111
|
+
| N=4 | 3.24 s | **2.10×** |
|
|
112
|
+
| N=6 | 3.21 s | **2.12×** |
|
|
113
|
+
|
|
114
|
+
**Byte-identical:** all 189 output files (schema + inserts) match the serial run
|
|
115
|
+
exactly — same filenames (the insert index is taken over the full ordering,
|
|
116
|
+
including `ignore:true` collections, exactly as the Runner numbers them) and same
|
|
117
|
+
content (0/189 cmp mismatches).
|
|
118
|
+
|
|
119
|
+
The curve **saturates at N≈4**: past that the wall time is bounded by the single
|
|
120
|
+
1.31 s leaf decode plus the longest ref_bt chain (~0.6 s) plus startup. Going
|
|
121
|
+
beyond ~2.5× needs **intra-collection** decode parallelism (the `_id`-range cursor
|
|
122
|
+
split that was removed for changing row order) or a **native BSON→Extended-JSON
|
|
123
|
+
transcoder** that skips the Ruby-Hash decode for unmasked reference collections
|
|
124
|
+
(decode being 64–92 % of the cost, this is the largest single-process lever left
|
|
125
|
+
and composes with the fork approach).
|
|
126
|
+
|
|
127
|
+
## Operational note: this needs vCPUs to spend
|
|
128
|
+
|
|
129
|
+
The win is real cores doing real decode in parallel. On the current ECS task
|
|
130
|
+
(`cpu: 2048` = **2 vCPU**) the schedule reaches only ~1.67× (N=2). Clearing 2×
|
|
131
|
+
requires scaling to **≥4 vCPU** (`cpu: 4096`); 8 vCPU does not help further given
|
|
132
|
+
the N≈4 saturation. Memory is unchanged — the existing streaming keeps each worker
|
|
133
|
+
bounded by chunk size, and `N×` that stays well under the 7 GiB container limit.
|
|
134
|
+
|
|
135
|
+
## Status
|
|
136
|
+
|
|
137
|
+
This is a **measured, byte-identical proof** (bench prototype, not yet integrated
|
|
138
|
+
into the Runner/CLI). Integrating it means re-introducing process orchestration
|
|
139
|
+
and `@state` sidecar IPC — the machinery `optimization-notes.md` deliberately
|
|
140
|
+
removed as over-engineered for a flag. It is recorded here because, unlike that
|
|
141
|
+
removed work, this schedule is (a) byte-identical by construction, (b) measured
|
|
142
|
+
past the 2× target on a real extraction, and (c) the lever the task explicitly
|
|
143
|
+
invited (scale the task to go faster). A production version would gate it behind a
|
|
144
|
+
worker-count option, fall back to serial where `fork` is unavailable
|
|
145
|
+
(Windows/JRuby), and reuse the existing genuine-anchor classification to derive
|
|
146
|
+
the three groups.
|
data/docs/optimization-notes.md
CHANGED
|
@@ -109,6 +109,18 @@ The full design (byte-identity strategy, fast-path vs Ruby-delegate types,
|
|
|
109
109
|
optional-load + pure-Ruby fallback, packaging) is in
|
|
110
110
|
[`optimize-mongodb-export-with-native-ext.md`](./optimize-mongodb-export-with-native-ext.md).
|
|
111
111
|
|
|
112
|
+
The native ext shipped (0.6.1+) and the dump is now **decode-bound** (the driver's
|
|
113
|
+
BSON→Ruby decode is 64–92 % of per-collection cost). The single-process levers are
|
|
114
|
+
exhausted short of 2×. The lever that *was* measured past 2× — **byte-identical
|
|
115
|
+
inter-collection fork parallelism** (parallelize whole collections across
|
|
116
|
+
processes, so the decode itself runs in parallel while each collection keeps its
|
|
117
|
+
natural order) — is written up in
|
|
118
|
+
[`mongodb-dump-parallelism-2x-notes.md`](./mongodb-dump-parallelism-2x-notes.md).
|
|
119
|
+
Unlike the `--parallel-workers` attempt removed above (which parallelized
|
|
120
|
+
serialization only and so was capped at ~1.4×), this parallelizes the actual
|
|
121
|
+
bottleneck and reaches ~2.1× wall / ~2.5× compute on a real extraction, given
|
|
122
|
+
≥4 vCPU to spend.
|
|
123
|
+
|
|
112
124
|
## Methodology notes (for re-running)
|
|
113
125
|
|
|
114
126
|
- The CPU hotspot reproduces **with no database**: the Mongo driver hands back
|
|
@@ -0,0 +1,107 @@
|
|
|
1
|
+
# Redesign notes on scope-column (2026-06-23)
|
|
2
|
+
|
|
3
|
+
> Status: **decided on (a)**. Hybrid is dropped (revert the hybrid core of PR #128 and rebuild it as a per-table scope-column mode). For the final spec, see "Decision: final spec for (a)" at the end.
|
|
4
|
+
|
|
5
|
+
## Background / the problem to solve
|
|
6
|
+
|
|
7
|
+
With Rails multi-database (`connects_to`), a `belongs_to` that points at a model in another database (cross-database) **cannot be joined**.
|
|
8
|
+
|
|
9
|
+
```ruby
|
|
10
|
+
# main DB
|
|
11
|
+
class Customer < ApplicationRecord
|
|
12
|
+
belongs_to :tenant, class_name: "Org::Tenant" # tenants lives in another DB
|
|
13
|
+
end
|
|
14
|
+
# org DB
|
|
15
|
+
class Org::Tenant < OrgApplicationRecord
|
|
16
|
+
end
|
|
17
|
+
```
|
|
18
|
+
|
|
19
|
+
`customers.tenant_id` cannot be narrowed down by a join, but it **can be filtered directly by the column value (tenant_id)**. We want to use this to extract data per tenant / per target.
|
|
20
|
+
|
|
21
|
+
## Current implementation (PR #128, draft / branch `feat/hybrid-target-scope-column`)
|
|
22
|
+
|
|
23
|
+
- **hybrid mode**: combines `--target-table` and `--scope-column`. Each table is resolved as the **union (OR)** of "reachable from the target" and "reachable via the scope column". The scope value is **derived** from the target
|
|
24
|
+
(`scope_column IN (SELECT target.scope_column FROM target WHERE <target ids>)`). SQL only / insert-only / the single-anchor case is byte-identical to before.
|
|
25
|
+
- **generator**: detects cross-database `belongs_to` and marks **only the relation** with `ignore: true` + `ignore_type: "cross_database"` (the table itself and the FK column are kept). Prints a summary at generation time.
|
|
26
|
+
- **R2**: error if a per-table `scope_column` exists but `--scope-column` is not given.
|
|
27
|
+
|
|
28
|
+
## Redesign policy (decisions from this discussion)
|
|
29
|
+
|
|
30
|
+
1. **Make per-table `scope_column` the only declaration.** A scoped table states `scope_column: <column>` explicitly in its config.
|
|
31
|
+
The two-layer scheme of a global `--scope-column` (canonical name) plus per-table overrides, and the notion of an "override" itself, are **abolished**.
|
|
32
|
+
Each table declares its own column (differing column names are expressed naturally; all scope columns are assumed to share the same value space).
|
|
33
|
+
2. **The `--scope-column` flag is deprecated** (it prints a warning steering users to the per-table declaration).
|
|
34
|
+
3. **`--ids-column` is removed**, because there is no use case for "specifying the target by a natural key (email, etc.)".
|
|
35
|
+
For a scoped target, `scope_column` doubles as "the column `--ids` is matched against".
|
|
36
|
+
4. **Update the generator's guidance text**: "use `--scope-column` to cross the boundary" → "**declare `scope_column:` on this table**".
|
|
37
|
+
Detection, relation ignore, keeping the FK column, and the summary itself stay as they are.
|
|
38
|
+
5. **R2 is withdrawn** (the global-flag concept goes away, so the per-table declaration becomes the standard path).
|
|
39
|
+
|
|
40
|
+
Everything up to here is settled regardless of whether hybrid stays.
|
|
41
|
+
|
|
42
|
+
## Open question: is hybrid needed?
|
|
43
|
+
|
|
44
|
+
There are two kinds of extraction needs, and one of them cannot be met by pure scope-column mode.
|
|
45
|
+
|
|
46
|
+
### (a) Whole-tenant extraction: pure scope is enough
|
|
47
|
+
"All of tenant 42's data." Each scoped table is simply filtered by its own `scope_column IN (42)`. Unscoped tables are reached via belongs_to (via_path / referenced_by). → **hybrid not needed**.
|
|
48
|
+
|
|
49
|
+
### (b) Minimal, target-narrowed extraction + cross-DB rows: hybrid is required
|
|
50
|
+
"I want only the minimal data for shop 1, but I also want to include the cross-DB `customers` (belonging to shop 1's tenant), which cannot be joined."
|
|
51
|
+
|
|
52
|
+
- Pure scope (per tenant) pulls in **all of tenant 42**, which is too much (shops other than shop 1, and customers as well, are included).
|
|
53
|
+
- Plain `--target-table=shops` (join-based extraction) **cannot reach** the cross-DB `customers` (no join is possible).
|
|
54
|
+
- → **This is only achievable with hybrid, which anchors shop 1 by primary key while deriving its tenant_id to scope the cross-DB tables**.
|
|
55
|
+
|
|
56
|
+
Test-data extraction tends to be case (b) (minimal, target-limited), so hybrid is hard to give up.
|
|
57
|
+
|
|
58
|
+
### Conflict at the CLI level (this is the point at issue)
|
|
59
|
+
|
|
60
|
+
Since `--ids-column` was removed, **`--ids` on a scoped target can no longer carry both interpretations**:
|
|
61
|
+
|
|
62
|
+
- Interpretation A: in `--target-table=shops --ids=42`, 42 = a `scope_column` value (tenant extraction)
|
|
63
|
+
- Interpretation B: in `--target-table=shops --ids=1`, 1 = the primary key (hybrid: anchor shop 1 and derive from it)
|
|
64
|
+
|
|
65
|
+
The two cannot coexist under the same CLI form. → Pick one:
|
|
66
|
+
|
|
67
|
+
- **If Interpretation A is adopted**, hybrid is dropped (rebuilt as pure scope-column mode). (b) is lost.
|
|
68
|
+
- **If hybrid is kept**, `--ids` on a scoped target goes back to meaning the **primary key** (Interpretation B), and tenant extraction (a) is done with **pure scope without a target** (or the deprecated `--scope-column`).
|
|
69
|
+
|
|
70
|
+
### Recommendation (strawman)
|
|
71
|
+
|
|
72
|
+
The **keep hybrid** option. Reason: the original motivation (extract the target in one pass, cross-DB FKs included) matches need (b), and it already works in #128.
|
|
73
|
+
It is also compatible with the move to per-table `scope_column`: if hybrid takes the scope column it derives from not from the "`--scope-column` flag" but from **the `scope_column` in the target's config**, it works with no flag and no `--ids-column`:
|
|
74
|
+
|
|
75
|
+
| Goal | Command | Meaning of ids | Behavior |
|
|
76
|
+
|---|---|---|---|
|
|
77
|
+
| Minimal extraction of shop 1 (cross-DB included) | `--target-table=shops --ids=1` | Primary key | shop 1 + what is reachable from it. Cross-DB/scoped tables are scoped to the tenant derived from shops.`scope_column` (hybrid) |
|
|
78
|
+
| All of tenant 42 | `--ids=42` (no target) | Scope value | Each scoped table is filtered by its own `scope_column IN (42)` (pure scope) |
|
|
79
|
+
| Ordinary single-target extraction (no scope) | `--target-table=orders --ids=1` (orders has no scope_column) | Primary key | Same as before |
|
|
80
|
+
|
|
81
|
+
Under this option, "Interpretation A (ids = scope value on a scoped target)" is discarded and tenant extraction is moved to no-target pure scope.
|
|
82
|
+
|
|
83
|
+
## Decision: final spec for (a)
|
|
84
|
+
|
|
85
|
+
Only (a) is adopted. Hybrid ((b)) is not included this time.
|
|
86
|
+
|
|
87
|
+
### Model
|
|
88
|
+
- A scoped table declares `scope_column: <column>` in its config (per-table only; no global canonical name and no overrides; all scope columns share the same value space).
|
|
89
|
+
|
|
90
|
+
### Invocation and behavior
|
|
91
|
+
| Command | Condition | Meaning of ids | Behavior |
|
|
92
|
+
|---|---|---|---|
|
|
93
|
+
| `--target-table=shops --ids=42` | shops declares `scope_column` | **Scope value** | Scope-column mode. Every table that declares `scope_column` is filtered by its own `scope_column IN (42)`. Undeclared tables are reached via belongs_to (via_path / referenced_by). `scope_exempt` tables get a full dump. Unresolvable tables abort (validate_scope!). The target name is not used for a PK filter (the target is scoped as one of the scoped tables) |
|
|
94
|
+
| `--target-table=orders --ids=1` | orders has no `scope_column` | Primary key | The existing single-target extraction (unchanged) |
|
|
95
|
+
| `--scope-column=C --ids=42` | deprecated | Scope value | The old pure scope (no target, global column C). Prints a **warning**. Mutually exclusive with `--target-table` (restored to pre-#128) |
|
|
96
|
+
| `--ids=42` only | neither target nor flag | n/a | Error (same as before) |
|
|
97
|
+
|
|
98
|
+
- `--ids-column` is **removed**. For a scoped target, `scope_column` doubles as "the column `--ids` is matched against".
|
|
99
|
+
- Extracting a scoped table on its own by primary key is no longer possible (the cost of Interpretation A; accepted).
|
|
100
|
+
|
|
101
|
+
### Implementation scope (rebuilding PR #128)
|
|
102
|
+
1. builder: **revert** the whole hybrid set from #128 (build_hybrid / compose_or / pk_in_subquery / the mode argument / the derived-value scope_where_clause / validate_hybrid! / R2=validate_scope_column_usage!) and `:or` (query_ast + 3 adapters). `scope_where_clause` goes back to the literal `scope_column IN (ids)`.
|
|
103
|
+
2. builder: extend `scope_mode?` so it is decided by "the `--scope-column` flag or **the target declares `scope_column`**". `resolved_scope_column = table.scope_column || (deprecated flag)`.
|
|
104
|
+
3. CLI: remove `--ids-column`; make `--scope-column` emit a deprecation warning and be mutually exclusive with `--target-table` (restored to pre-#128); drop the `--insert-only` requirement that hybrid needed.
|
|
105
|
+
4. runner/explain: remove the `validate_hybrid!` / `validate_scope_column_usage!` calls. Align the trigger condition of `validate_scope!` with the new `scope_mode?`.
|
|
106
|
+
5. generator: keep cross-DB detection. Update the guidance in the comment / summary from "use `--scope-column`" to "declare `scope_column:` on this table".
|
|
107
|
+
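6. docs / CHANGELOG: remove the hybrid descriptions and rewrite them for this model. Call out the breaking changes explicitly (`--ids-column` removed, `--scope-column` deprecated, meaning of `--ids` on a scoped target changed). Replace the tests as well.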
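
The invocation table in the decision section can be read as a small decision function. The sketch below is only an illustration of that table, not exwiw's code: `resolve_mode`, its keyword arguments, the returned hash, and the error messages are hypothetical, and `scope_columns` stands in for the per-table `scope_column` declarations in the schema config.

```ruby
# Hypothetical classifier for an invocation.
#   target_table:  value of --target-table, or nil
#   scope_flag:    value of the deprecated --scope-column, or nil
#   scope_columns: { table name => declared scope_column or nil }
def resolve_mode(target_table:, scope_flag:, scope_columns:)
  if scope_flag
    # The deprecated flag is mutually exclusive with --target-table.
    if target_table
      raise ArgumentError, "--scope-column cannot be combined with --target-table"
    end
    return { mode: :scope_column, ids_mean: :scope_value, deprecated: true }
  end

  # With neither a target nor the flag there is nothing to anchor --ids on.
  raise ArgumentError, "--target-table is required" unless target_table

  if scope_columns[target_table]
    # A scoped target: --ids are scope values, not primary keys.
    { mode: :scope_column, ids_mean: :scope_value, deprecated: false }
  else
    # An unscoped target: the existing single-target extraction.
    { mode: :single_target, ids_mean: :primary_key, deprecated: false }
  end
end
```

For example, with `scope_columns = { "shops" => "tenant_id", "orders" => nil }`, a `shops` target resolves to scope-column mode and an `orders` target to the single-target extraction.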
|
|
@@ -35,32 +35,50 @@ module Exwiw
|
|
|
35
35
|
# #to_bulk_insert per chunk joined by "\n" (verified by
|
|
36
36
|
# insert_output_snapshot_spec), but only one ~STREAM_FLUSH_BYTES buffer is
|
|
37
37
|
# resident at a time rather than the entire table's INSERT string. Returns
|
|
38
|
-
#
|
|
38
|
+
# [statement_count, record_count]; record_count is tallied during the single
|
|
39
|
+
# streaming drain so the Runner needs no separate SELECT COUNT(*) pass.
|
|
39
40
|
def write_inserts(io, results, table, chunk_size)
|
|
40
41
|
chunks = chunk_size ? results.each_slice(chunk_size) : [results]
|
|
41
42
|
statement_count = 0
|
|
43
|
+
record_count = 0
|
|
42
44
|
chunks.each do |chunk_rows|
|
|
43
45
|
io.print("\n") if statement_count.positive?
|
|
44
|
-
stream_single_insert(io, chunk_rows, table)
|
|
46
|
+
record_count += stream_single_insert(io, chunk_rows, table)
|
|
45
47
|
statement_count += 1
|
|
46
48
|
end
|
|
47
|
-
statement_count
|
|
49
|
+
[statement_count, record_count]
|
|
48
50
|
end
|
|
49
51
|
|
|
50
52
|
# Emit one `INSERT INTO ... VALUES <tuples>;` statement to `io`, building
|
|
51
53
|
# and flushing the value tuples STREAM_FLUSH_ROWS at a time so the full
|
|
52
|
-
# statement text is never held in memory at once. Each
|
|
53
|
-
# map+join; the ",\n" between
|
|
54
|
+
# statement text is never held in memory at once. Each flush is one fast
|
|
55
|
+
# map+join; the ",\n" between flushes reproduces the same separator
|
|
54
56
|
# #to_bulk_insert puts between every tuple, so the bytes are identical.
|
|
57
|
+
# Returns the number of rows written, tallied as the stream is drained.
|
|
58
|
+
#
|
|
59
|
+
# Rows are buffered off `#each` rather than `rows.each_slice(...)`:
|
|
60
|
+
# each_slice queries the receiver's `#size`, which on a streaming result
|
|
61
|
+
# issues a redundant `SELECT COUNT(*)` (a second full pass over the same
|
|
62
|
+
# filter). Manual buffering walks the cursor exactly once.
|
|
55
63
|
private def stream_single_insert(io, rows, table)
|
|
56
64
|
io.print(insert_header(table))
|
|
57
65
|
first = true
|
|
58
|
-
|
|
66
|
+
record_count = 0
|
|
67
|
+
buffer = []
|
|
68
|
+
flush = lambda do
|
|
59
69
|
io.print(",\n") unless first
|
|
60
70
|
first = false
|
|
61
|
-
io.print(
|
|
71
|
+
io.print(buffer.map { |row| insert_tuple(row) }.join(",\n"))
|
|
72
|
+
record_count += buffer.size
|
|
73
|
+
buffer.clear
|
|
62
74
|
end
|
|
75
|
+
rows.each do |row|
|
|
76
|
+
buffer << row
|
|
77
|
+
flush.call if buffer.size >= STREAM_FLUSH_ROWS
|
|
78
|
+
end
|
|
79
|
+
flush.call unless buffer.empty?
|
|
63
80
|
io.print(";")
|
|
81
|
+
record_count
|
|
64
82
|
end
|
|
65
83
|
|
|
66
84
|
private def insert_tuple(row)
|
data/lib/exwiw/adapter.rb
CHANGED
|
@@ -140,15 +140,44 @@ module Exwiw
|
|
|
140
140
|
# @param results [Enumerable] rows/documents from #execute
|
|
141
141
|
# @param table the table/collection config
|
|
142
142
|
# @param chunk_size [Integer, nil] rows per statement (nil => one statement)
|
|
143
|
+
# @return [Array(Integer, Integer)] [statement_count, record_count]
|
|
144
|
+
#
|
|
145
|
+
# record_count is tallied from the rows actually streamed here so the
|
|
146
|
+
# Runner no longer needs a separate upfront count query (MongoDB's
|
|
147
|
+
# count_documents / the SQL adapters' SELECT COUNT(*)) just to log the row
|
|
148
|
+
# count and decide whether an empty table can be skipped. That count was a
|
|
149
|
+
# second full pass over the same filter — a wasted COLLSCAN when the scope
|
|
150
|
+
# is unindexed; counting during the single streaming pass removes it.
|
|
151
|
+
#
|
|
152
|
+
# Batches rows by accumulating into a buffer and flushing every chunk_size
|
|
153
|
+
# rows, rather than `results.each_slice(chunk_size)`. This is deliberate:
|
|
154
|
+
# `Enumerable#each_slice` calls `#size` on the receiver as an allocation
|
|
155
|
+
# hint, which for a streaming result (MongoDB's StreamingResult) issues a
|
|
156
|
+
# `count_documents` — the very redundant count this single-pass design
|
|
157
|
+
# removes. Driving the buffer off `#each` keeps the result's `#size`
|
|
158
|
+
# untouched, so the cursor is walked exactly once. The chunk boundaries and
|
|
159
|
+
# "\n" separators reproduce the each_slice output byte-for-byte.
|
|
160
|
+
#
|
|
161
|
+
# chunk_size is always positive for callers of this default (MongoDB); the
|
|
162
|
+
# SQL adapters pass nil and override #write_inserts, so the unbounded
|
|
163
|
+
# nil-branch buffer is never reached here in practice.
|
|
143
164
|
def write_inserts(io, results, table, chunk_size)
|
|
144
|
-
chunks = chunk_size ? results.each_slice(chunk_size) : [results]
|
|
145
165
|
statement_count = 0
|
|
146
|
-
|
|
166
|
+
record_count = 0
|
|
167
|
+
buffer = []
|
|
168
|
+
flush = lambda do
|
|
147
169
|
io.print("\n") if statement_count.positive?
|
|
148
|
-
io.print(to_bulk_insert(
|
|
170
|
+
io.print(to_bulk_insert(buffer, table))
|
|
149
171
|
statement_count += 1
|
|
172
|
+
record_count += buffer.size
|
|
173
|
+
buffer.clear
|
|
174
|
+
end
|
|
175
|
+
results.each do |row|
|
|
176
|
+
buffer << row
|
|
177
|
+
flush.call if chunk_size && buffer.size >= chunk_size
|
|
150
178
|
end
|
|
151
|
-
|
|
179
|
+
flush.call unless buffer.empty?
|
|
180
|
+
[statement_count, record_count]
|
|
152
181
|
end
|
|
153
182
|
|
|
154
183
|
# Run the database-specific EXPLAIN for the given query and return the
|
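
The buffering rationale in the `write_inserts` comment above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the adapter's code: `CountingResult` is a hypothetical stand-in for a streaming result whose `#size` would cost a count query, and `each_chunk` is a hypothetical helper that batches rows using `#each` alone.

```ruby
# Hypothetical streaming result: `#each` walks the rows once, while `#size`
# stands in for an expensive count query and records how often it is called.
class CountingResult
  include Enumerable

  attr_reader :size_calls

  def initialize(rows)
    @rows = rows
    @size_calls = 0
  end

  def each(&block)
    @rows.each(&block)
  end

  def size
    @size_calls += 1
    @rows.size
  end
end

# Yield rows in chunks of `chunk_size`, driving the buffer off `#each` only,
# and return the number of rows seen during that single pass.
def each_chunk(results, chunk_size)
  buffer = []
  record_count = 0
  results.each do |row|
    buffer << row
    next if buffer.size < chunk_size

    yield buffer.dup
    record_count += buffer.size
    buffer.clear
  end
  unless buffer.empty?
    yield buffer.dup
    record_count += buffer.size
  end
  record_count
end
```

The row count falls out of the same pass that produces the chunks, which is the property the comment relies on to drop the separate upfront count.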
data/lib/exwiw/cli.rb
CHANGED
|
@@ -34,7 +34,6 @@ module Exwiw
|
|
|
34
34
|
target_collection
|
|
35
35
|
ids
|
|
36
36
|
ids_field
|
|
37
|
-
ids_column
|
|
38
37
|
scope_column
|
|
39
38
|
].freeze
|
|
40
39
|
|
|
@@ -77,7 +76,6 @@ module Exwiw
|
|
|
77
76
|
@target_collection_name = nil
|
|
78
77
|
@ids = []
|
|
79
78
|
@ids_field = nil
|
|
80
|
-
@ids_column = nil
|
|
81
79
|
@scope_column = nil
|
|
82
80
|
@output_format = nil
|
|
83
81
|
@insert_only = nil
|
|
@@ -165,7 +163,7 @@ module Exwiw
|
|
|
165
163
|
|
|
166
164
|
resolve_target_collection_alias!
|
|
167
165
|
resolve_scope_column!
|
|
168
|
-
|
|
166
|
+
resolve_ids_field!
|
|
169
167
|
resolve_uri_option!
|
|
170
168
|
|
|
171
169
|
if @subcommand == "explain"
|
|
@@ -317,7 +315,6 @@ module Exwiw
|
|
|
317
315
|
@ids = (raw.is_a?(String) ? raw.split(",") : Array(raw)).map(&:to_s)
|
|
318
316
|
end
|
|
319
317
|
@ids_field ||= config["ids_field"]
|
|
320
|
-
@ids_column ||= config["ids_column"]
|
|
321
318
|
@scope_column ||= config["scope_column"]
|
|
322
319
|
end
|
|
323
320
|
|
|
@@ -349,49 +346,33 @@ module Exwiw
|
|
|
349
346
|
@target_table_name = @target_collection_name
|
|
350
347
|
end
|
|
351
348
|
|
|
352
|
-
# `--ids-
|
|
353
|
-
#
|
|
354
|
-
#
|
|
355
|
-
#
|
|
356
|
-
#
|
|
357
|
-
# exclusive. Runs after resolve_target_collection_alias! so
|
|
349
|
+
# `--ids-field` overrides which field `--ids` is matched against on the target
|
|
350
|
+
# collection (defaulting to its primary key). It is mongodb-only and
|
|
351
|
+
# meaningless without a target collection to constrain. (The SQL adapters have
|
|
352
|
+
# no equivalent: a scoped target's `scope_column` is the column `--ids` filter
|
|
353
|
+
# against there.) Runs after resolve_target_collection_alias! so
|
|
358
354
|
# @target_table_name already reflects the collection alias.
|
|
359
|
-
private def
|
|
360
|
-
if @ids_field
|
|
361
|
-
$stderr.puts "Specify only one of --ids-field and --ids-column"
|
|
362
|
-
exit 1
|
|
363
|
-
end
|
|
355
|
+
private def resolve_ids_field!
|
|
356
|
+
return if @ids_field.nil?
|
|
364
357
|
|
|
365
|
-
if @
|
|
366
|
-
$stderr.puts "--ids-field is only supported by the mongodb adapter
|
|
358
|
+
if @database_adapter != "mongodb"
|
|
359
|
+
$stderr.puts "--ids-field is only supported by the mongodb adapter"
|
|
367
360
|
exit 1
|
|
368
361
|
end
|
|
369
362
|
|
|
370
|
-
|
|
371
|
-
|
|
372
|
-
unless sql_adapters.include?(@database_adapter)
|
|
373
|
-
$stderr.puts "--ids-column is only supported by the sql adapters (use --ids-field)"
|
|
374
|
-
exit 1
|
|
375
|
-
end
|
|
376
|
-
|
|
377
|
-
@ids_field = @ids_column
|
|
378
|
-
end
|
|
379
|
-
|
|
380
|
-
# --ids-field/--ids-column override the column --ids filters against on
|
|
381
|
-
# the target table; meaningless without a target table to constrain.
|
|
382
|
-
if @ids_field && !@target_table_name
|
|
383
|
-
flag = @ids_column ? "--ids-column" : "--ids-field"
|
|
384
|
-
$stderr.puts "--target-table is required when #{flag} is specified"
|
|
363
|
+
unless @target_table_name
|
|
364
|
+
$stderr.puts "--target-table is required when --ids-field is specified"
|
|
385
365
|
exit 1
|
|
386
366
|
end
|
|
387
367
|
end
|
|
388
368
|
|
|
389
|
-
# `--scope-column`
|
|
390
|
-
#
|
|
391
|
-
#
|
|
392
|
-
#
|
|
393
|
-
#
|
|
394
|
-
#
|
|
369
|
+
# `--scope-column` is **deprecated**: it selected scope-column mode with a
|
|
370
|
+
# single global column for every table. The preferred way is to declare a
|
|
371
|
+
# per-table `scope_column:` in the schema config and pass `--target-table`
|
|
372
|
+
# (the target is then scoped like any other table). The flag still works as
|
|
373
|
+
# before — SQL-only and mutually exclusive with `--target-table` — but emits a
|
|
374
|
+
# deprecation warning. Runs after resolve_target_collection_alias! so
|
|
375
|
+
# --target-collection is already folded into @target_table_name.
|
|
395
376
|
private def resolve_scope_column!
|
|
396
377
|
return if @scope_column.nil?
|
|
397
378
|
|
|
@@ -406,11 +387,13 @@ module Exwiw
|
|
|
406
387
|
exit 1
|
|
407
388
|
end
|
|
408
389
|
|
|
409
|
-
if @ids_field
|
|
410
|
-
|
|
411
|
-
$stderr.puts "--scope-column cannot be combined with #{flag}"
|
|
390
|
+
if @ids_field
|
|
391
|
+
$stderr.puts "--scope-column cannot be combined with --ids-field"
|
|
412
392
|
exit 1
|
|
413
393
|
end
|
|
394
|
+
|
|
395
|
+
$stderr.puts "warning: --scope-column is deprecated; declare a per-table `scope_column:` " \
|
|
396
|
+
"in the schema config and run with --target-table instead."
|
|
414
397
|
end
|
|
415
398
|
|
|
416
399
|
# `--uri` supplies a full connection string (e.g. `mongodb+srv://...`) and is
|
|
@@ -537,8 +520,7 @@ module Exwiw
|
|
|
537
520
|
opts.on("--target-collection=[COLLECTION]", "Alias of --target-table for the mongodb adapter.") { |v| @target_collection_name = v }
|
|
538
521
|
opts.on("--ids=[IDS]", "Comma-separated list of identifiers. Required when --target-table is given.") { |v| @ids = v.split(',') }
|
|
539
522
|
opts.on("--ids-field=[FIELD]", "Field on the target collection that --ids is matched against. Defaults to the primary key. (mongodb adapter only)") { |v| @ids_field = v }
|
|
540
|
-
opts.on("--
|
|
541
|
-
opts.on("--scope-column=[COLUMN]", "Filter every table by this shared column (--ids are its values) instead of a single --target-table. Tables lacking it are reached via belongs_to. SQL adapters only; mutually exclusive with --target-table.") { |v| @scope_column = v }
|
|
523
|
+
opts.on("--scope-column=[COLUMN]", "DEPRECATED. Filter every table by this shared global column (--ids are its values) instead of a single --target-table. SQL adapters only; mutually exclusive with --target-table. Prefer declaring a per-table `scope_column:` in the schema config and running with --target-table.") { |v| @scope_column = v }
|
|
542
524
|
opts.on("--output-format=[FORMAT]", "Output format: insert (default) or copy (PostgreSQL only, export subcommand only)") { |v| @output_format = v }
|
|
543
525
|
opts.on("--insert-only", "Do not generate DELETE SQL files (export subcommand only)") { @insert_only = true }
|
|
544
526
|
opts.on("--after-insert-hook=PATH", "Path to a .rb or .sh post-processing hook executed after all insert/delete files are written (export subcommand only)") do |v|
|
data/lib/exwiw/query_ast.rb
CHANGED
|
@@ -48,7 +48,7 @@ module Exwiw
|
|
|
48
48
|
|
|
49
49
|
# Resolves a set of values on `where_column` to the rows' `select_column`
|
|
50
50
|
# via a nested SELECT. Used as the `value` of a WhereClause whose operator
|
|
51
|
-
# is `:in_subquery`, so
|
|
51
|
+
# is `:in_subquery`, so a non primary-key `ids_field` can filter related
|
|
52
52
|
# tables through the target table's primary key:
|
|
53
53
|
#
|
|
54
54
|
# <table>.<fk> IN (SELECT <table_name>.<select_column>
|
|
@@ -12,12 +12,26 @@ module Exwiw
|
|
|
12
12
|
new(table_name, table_by_name, dump_target, logger).scope_category
|
|
13
13
|
end
|
|
14
14
|
|
|
15
|
+
# Scope-column mode is active when EITHER the named `--target-table` declares a
|
|
16
|
+
# per-table `scope_column` (the preferred trigger: the target is then scoped
|
|
17
|
+
# like any other table — its `--ids` are scope-column values, not primary
|
|
18
|
+
# keys), OR the deprecated `--scope-column` flag is set (a global column with no
|
|
19
|
+
# target). In both cases every table is filtered by a shared column instead of
|
|
20
|
+
# being anchored on one named target's primary key.
|
|
21
|
+
def self.scope_mode?(table_by_name, dump_target)
|
|
22
|
+
return true unless dump_target.scope_column.nil?
|
|
23
|
+
return false if dump_target.table_name.nil?
|
|
24
|
+
|
|
25
|
+
target = table_by_name[dump_target.table_name]
|
|
26
|
+
!!(target && target.respond_to?(:scope_column) && target.scope_column)
|
|
27
|
+
end
|
|
28
|
+
|
|
15
29
|
# Strict pre-flight for scope-column mode: abort if any extractable table
|
|
16
30
|
# cannot be scoped, so an unscoped (potentially sensitive) table is never
|
|
17
31
|
# silently dumped in full. No-op outside scope mode. `tables` is the set of
|
|
18
32
|
# dumpable configs (ignore:true tables are skipped — they are not extracted).
|
|
19
33
|
def self.validate_scope!(tables, table_by_name, dump_target, logger)
|
|
20
|
-
return
|
|
34
|
+
return unless scope_mode?(table_by_name, dump_target)
|
|
21
35
|
|
|
22
36
|
unscopable =
|
|
23
37
|
tables.reject(&:ignore).select do |table|
|
|
@@ -27,11 +41,10 @@ module Exwiw
|
|
|
27
41
|
|
|
28
42
|
names = unscopable.map(&:name).sort.join(", ")
|
|
29
43
|
raise ArgumentError,
|
|
30
|
-
"scope-column mode: #{unscopable.size} table(s) cannot be scoped
|
|
31
|
-
"
|
|
32
|
-
"
|
|
33
|
-
"
|
|
34
|
-
"column name differs on that table)."
|
|
44
|
+
"scope-column mode: #{unscopable.size} table(s) cannot be scoped: #{names}. " \
|
|
45
|
+
"For each, declare `scope_column: <column>` on the table to filter it directly, " \
|
|
46
|
+
"add a belongs_to path to a table that carries the scope column, mark it " \
|
|
47
|
+
"`scope_exempt: true` to export it in full, or set `ignore: true` to skip it."
|
|
35
48
|
end
|
|
36
49
|
|
|
37
50
|
attr_reader :table_name, :table_by_name, :dump_target
|
|
@@ -268,10 +281,10 @@ module Exwiw
|
|
|
268
281
|
clauses = []
|
|
269
282
|
|
|
270
283
|
if table.name == dump_target.table_name
|
|
271
|
-
#
|
|
272
|
-
#
|
|
273
|
-
#
|
|
274
|
-
#
|
|
284
|
+
# When `dump_target.ids_field` is set, `--ids` match a non primary-key
|
|
285
|
+
# column on the target table; otherwise fall back to the primary key.
|
|
286
|
+
# Only the target table's filter changes — downstream foreign-key
|
|
287
|
+
# propagation still keys off the primary key.
|
|
275
288
|
clauses.push Exwiw::QueryAst::WhereClause.new(
|
|
276
289
|
column_name: dump_target.ids_field || table.primary_key,
|
|
277
290
|
operator: :eq,
|
|
@@ -306,9 +319,9 @@ module Exwiw
|
|
|
306
319
|
|
|
307
320
|
# Builds the WHERE clause that constrains a `foreign_key` pointing at the
|
|
308
321
|
# dump target. Normally `--ids` are the target's primary keys, so a plain
|
|
309
|
-
# `foreign_key IN (ids)` suffices. When
|
|
310
|
-
#
|
|
311
|
-
#
|
|
322
|
+
# `foreign_key IN (ids)` suffices. When `dump_target.ids_field` is set, `--ids`
|
|
323
|
+
# match a non primary-key column instead, so the foreign key must be resolved
|
|
324
|
+
# through the target table:
|
|
312
325
|
# `foreign_key IN (SELECT pk FROM target WHERE ids_field IN (ids))`.
|
|
313
326
|
# This keeps related-table extraction correct regardless of whether the
|
|
314
327
|
# relation is direct, indirect, or polymorphic.
|
|
@@ -370,7 +383,7 @@ module Exwiw
|
|
|
370
383
|
# ------------------------------------------------------------------
|
|
371
384
|
|
|
372
385
|
private def scope_mode?
|
|
373
|
-
|
|
386
|
+
self.class.scope_mode?(table_by_name, dump_target)
|
|
374
387
|
end
|
|
375
388
|
|
|
376
389
|
# Classifier used by validate_scope! and mirrored by build_scoped below.
|
|
@@ -449,8 +462,8 @@ module Exwiw
|
|
|
449
462
|
ast
|
|
450
463
|
end
|
|
451
464
|
|
|
452
|
-
# The shared column this table is filtered on: a per-table `scope_column`
|
|
453
|
-
#
|
|
465
|
+
# The shared column this table is filtered on: a per-table `scope_column` when
|
|
466
|
+
# declared, otherwise the deprecated global `--scope-column` flag.
|
|
454
467
|
private def resolved_scope_column(table)
|
|
455
468
|
table.scope_column || dump_target.scope_column
|
|
456
469
|
end
|
|
@@ -557,10 +570,11 @@ module Exwiw
|
|
|
557
570
|
end
|
|
558
571
|
|
|
559
572
|
private def scope_unscopable_message(table)
|
|
560
|
-
"Table '#{table.name}' cannot be scoped in scope-column mode: it
|
|
561
|
-
"
|
|
562
|
-
"
|
|
563
|
-
"set `ignore: true` to skip it, or add
|
|
573
|
+
"Table '#{table.name}' cannot be scoped in scope-column mode: it carries no scope " \
|
|
574
|
+
"column (no per-table `scope_column` is declared on it) and has no belongs_to path " \
|
|
575
|
+
"to a table that does. Declare `scope_column: <column>` on it, mark it " \
|
|
576
|
+
"`scope_exempt: true` to export it in full, set `ignore: true` to skip it, or add " \
|
|
577
|
+
"the missing belongs_to."
|
|
564
578
|
end
|
|
565
579
|
end
|
|
566
580
|
end
|
data/lib/exwiw/runner.rb
CHANGED
|
@@ -74,15 +74,16 @@ module Exwiw
|
|
|
74
74
|
phase = "executing extraction query"
|
|
75
75
|
begin
|
|
76
76
|
results = adapter.execute(query_ast)
|
|
77
|
-
record_num = results.size
|
|
78
|
-
|
|
79
|
-
if record_num.zero?
|
|
80
|
-
@logger.info(" No records matched. skip this table.")
|
|
81
|
-
next
|
|
82
|
-
end
|
|
83
77
|
insert_idx = (idx + 1).to_s.rjust(3, '0')
|
|
84
78
|
|
|
85
79
|
if @output_format == 'copy'
|
|
80
|
+
# COPY mode (PostgreSQL only) builds the whole body up front rather
|
|
81
|
+
# than streaming, so it keeps the explicit count + early skip.
|
|
82
|
+
record_num = results.size
|
|
83
|
+
if record_num.zero?
|
|
84
|
+
@logger.info(" No records matched. skip this table.")
|
|
85
|
+
next
|
|
86
|
+
end
|
|
86
87
|
phase = "generating COPY statement"
|
|
87
88
|
@logger.debug(" Generate COPY statement...")
|
|
88
89
|
copy_sql = adapter.to_copy_from_stdin(results, table)
|
|
@@ -108,19 +109,33 @@ module Exwiw
|
|
|
108
109
|
# config does not set one (SQL adapters: nil -> one statement, but
|
|
109
110
|
# streamed in bounded buffers; MongoDB: a positive default so the
|
|
110
111
|
# JSONL is chunked). #write_inserts emits bytes identical to the
|
|
111
|
-
# previous inline chunk loop and returns
|
|
112
|
+
# previous inline chunk loop and returns [statement_count, record_num].
|
|
113
|
+
#
|
|
114
|
+
# The row count comes from this single streaming pass, so empty
|
|
115
|
+
# tables are detected here (and the just-opened file removed) rather
|
|
116
|
+
# than via a separate upfront count query — eliminating a redundant
|
|
117
|
+
# second scan of the same filter (e.g. MongoDB count_documents / SQL
|
|
118
|
+
# SELECT COUNT(*)), which is a full COLLSCAN for an unindexed scope.
|
|
112
119
|
chunk_size = table.bulk_insert_chunk_size || adapter.default_bulk_insert_chunk_size
|
|
113
120
|
|
|
121
|
+
insert_path = File.join(@output_dir, "insert-#{insert_idx}-#{table_name}.#{adapter.output_extension}")
|
|
114
122
|
statement_count = 0
|
|
115
|
-
|
|
123
|
+
record_num = 0
|
|
124
|
+
File.open(insert_path, 'w') do |file|
|
|
116
125
|
pre = adapter.pre_insert_sql(table)
|
|
117
126
|
file.puts(pre) if pre
|
|
118
|
-
statement_count = adapter.write_inserts(file, results, table, chunk_size)
|
|
127
|
+
statement_count, record_num = adapter.write_inserts(file, results, table, chunk_size)
|
|
119
128
|
file.print("\n")
|
|
120
129
|
post = adapter.post_insert_sql(table)
|
|
121
130
|
file.puts(post) if post
|
|
122
131
|
end
|
|
123
132
|
|
|
133
|
+
if record_num.zero?
|
|
134
|
+
File.delete(insert_path)
|
|
135
|
+
@logger.info(" No records matched. skip this table.")
|
|
136
|
+
next
|
|
137
|
+
end
|
|
138
|
+
|
|
124
139
|
@logger.info(" Generated INSERT statement for #{record_num} records (#{statement_count} statement(s)).")
|
|
125
140
|
end
|
|
126
141
|
|
|
@@ -59,6 +59,20 @@ module Exwiw
|
|
|
59
59
|
groups
|
|
60
60
|
end
|
|
61
61
|
|
|
62
|
+
# Flatten the generated `groups` (the Hash returned by generate! /
|
|
63
|
+
# build_table_groups) into the list of cross-database belongs_tos the
|
|
64
|
+
# generator auto-ignored, so a caller (the rake task) can surface them. Each
|
|
65
|
+
# entry is `{ table:, foreign_key:, target: }`. Empty for single-database apps.
|
|
66
|
+
def self.cross_database_belongs_tos(groups)
|
|
67
|
+
groups.values.flatten.flat_map do |table|
|
|
68
|
+
next [] unless table.respond_to?(:belongs_tos)
|
|
69
|
+
|
|
70
|
+
table.belongs_tos
|
|
71
|
+
.select { |bt| bt.ignore_type == CROSS_DATABASE_IGNORE_TYPE }
|
|
72
|
+
.map { |bt| { table: table.name, foreign_key: bt.foreign_key, target: bt.table_name } }
|
|
73
|
+
end
|
|
74
|
+
end
|
|
75
|
+
|
|
62
76
|
# Reconcile the config files already on disk against the live database,
|
|
63
77
|
# removing only what no longer exists there:
|
|
64
78
|
#
|
|
@@ -277,11 +291,17 @@ module Exwiw
|
|
|
277
291
|
end
|
|
278
292
|
|
|
279
293
|
private def aggregate_belongs_tos(models)
|
|
280
|
-
belongs_to_assocs = models
|
|
294
|
+
belongs_to_assocs = models
|
|
295
|
+
.flat_map { |m| belongs_to_associations_for(m) }
|
|
296
|
+
.select { |assoc| assoc.polymorphic? || active_record_target?(assoc) }
|
|
297
|
+
owner_db = database_name_for(models.first)
|
|
281
298
|
|
|
282
299
|
non_polymorphic = belongs_to_assocs
|
|
283
300
|
.reject(&:polymorphic?)
|
|
284
|
-
.map
|
|
301
|
+
.map do |assoc|
|
|
302
|
+
entry = { table_name: assoc.table_name, foreign_key: assoc.foreign_key }
|
|
303
|
+
annotate_cross_database(entry, owner_db, assoc.klass)
|
|
304
|
+
end
|
|
285
305
|
|
|
286
306
|
# A polymorphic belongs_to (`belongs_to :reviewable, polymorphic: true`)
|
|
287
307
|
# has no single target table. The candidate tables are found by looking up
|
|
@@ -292,18 +312,51 @@ module Exwiw
|
|
|
292
312
|
.select(&:polymorphic?)
|
|
293
313
|
.flat_map do |assoc|
|
|
294
314
|
polymorphic_target_models(assoc.name).map do |target_model|
|
|
295
|
-
{
|
|
315
|
+
entry = {
|
|
296
316
|
table_name: target_model.table_name,
|
|
297
317
|
foreign_key: assoc.foreign_key,
|
|
298
318
|
foreign_type: assoc.foreign_type,
|
|
299
319
|
type_value: target_model.polymorphic_name,
|
|
300
320
|
}
|
|
321
|
+
annotate_cross_database(entry, owner_db, target_model)
|
|
301
322
|
end
|
|
302
323
|
end
|
|
303
324
|
|
|
304
325
|
(non_polymorphic + polymorphic).uniq
|
|
305
326
|
end
|
|
306
327
|
|
|
328
|
+
CROSS_DATABASE_IGNORE_TYPE = "cross_database"
|
|
329
|
+
|
|
330
|
+
# A belongs_to whose target model lives in a *different* database than the
|
|
331
|
+
# owning table cannot be joined: in a Rails multi-database (`connects_to`)
|
|
332
|
+
# setup each database is exported on its own connection and into its own
|
|
333
|
+
# per-database config directory, so the target table is absent from the
|
|
334
|
+
# directory this config is loaded with, and there is no single connection to
|
|
335
|
+
# join the two on. Leaving the relation live would emit a dangling belongs_to
|
|
336
|
+
# whose target is never present at extraction time (a nil-target crash in
|
|
337
|
+
# dependency resolution). So emit it with `ignore: true` (dropped from
|
|
338
|
+
# extraction at load via TableConfig#reject_ignored_members!) tagged
|
|
339
|
+
# `ignore_type: "cross_database"`, with a `comment` recording why and pointing
|
|
340
|
+
# at the recovery path. The foreign-key column itself is still exported as a
|
|
341
|
+
# plain column; only the join/dependency edge is dropped. Polymorphic
|
|
342
|
+
# associations are annotated per target, so only the targets that cross a
|
|
343
|
+
# database boundary are ignored. Single-database apps are unaffected.
|
|
344
|
+
private def annotate_cross_database(entry, owner_db, target_model)
|
|
345
|
+
target_db = database_name_for(target_model)
|
|
346
|
+
return entry if target_db == owner_db
|
|
347
|
+
|
|
348
|
+
entry.merge(
|
|
349
|
+
ignore: true,
|
|
350
|
+
ignore_type: CROSS_DATABASE_IGNORE_TYPE,
|
|
351
|
+
comment: "Cross-database belongs_to: target '#{entry[:table_name]}' is in database " \
|
|
352
|
+
"'#{target_db}', not '#{owner_db}'. exwiw exports each database separately and " \
|
|
353
|
+
"cannot join across them, so this relation is ignored during extraction; its " \
|
|
354
|
+
"foreign-key column '#{entry[:foreign_key]}' is still exported. To extract across " \
|
|
355
|
+
"this boundary, declare `scope_column: \"#{entry[:foreign_key]}\"` on this table's " \
|
|
356
|
+
"config so its rows are filtered by that foreign-key value directly (scope-column mode).",
|
|
357
|
+
)
|
|
358
|
+
end
|
|
359
|
+
|
|
307
360
|
# `belongs_to` reflections for a model, with the synthetic HABTM left-side
|
|
308
361
|
# association removed.
|
|
309
362
|
#
@@ -325,6 +378,31 @@ module Exwiw
       assocs.reject { |assoc| assoc.equal?(left) }
     end
 
+    # Whether a (non-polymorphic) belongs_to points at an ActiveRecord model.
+    #
+    # A belongs_to can target a non-ActiveRecord class — most commonly an
+    # ActiveHash/ActiveYaml master (`belongs_to :equipment, class_name:
+    # "SomeActiveYamlModel"`). active_hash registers these as ordinary
+    # `belongs_to` reflections, yet the target class has no database table, so
+    # `assoc.table_name` (which delegates to `klass.table_name`) raises. Such a
+    # relation is not a DB edge exwiw can join or extract across, so it is
+    # dropped from the generated belongs_tos; the underlying foreign-key column
+    # is still emitted as a plain column. Polymorphic associations cannot be
+    # `klass`-resolved, so callers must screen those out before calling this.
+    #
+    # Resolving the target class behaves differently per non-AR shape: an
+    # ActiveHash reflection returns the class fine (the crash is later, at
+    # `table_name`), while a bare `belongs_to` to a plain class makes AR raise
+    # ArgumentError ("... is not an ActiveRecord::Base subclass") right here when
+    # the klass is computed. Both mean "not a DB relation", so rescue the lookup
+    # and treat either as a non-AR target to skip.
+    private def active_record_target?(assoc)
+      klass = assoc.klass
+      klass.is_a?(Class) && klass < ActiveRecord::Base ? true : false
+    rescue StandardError
+      false
+    end
+
     # Enumerate the concrete models that can be targets of the polymorphic
     # association `association_name`, by looking them up from every model's
     # `has_many` / `has_one` `as:` option. The order of `concrete_models` depends
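Review note: the two failure shapes that `active_record_target?` covers can be reproduced without Rails. The sketch below is an illustration under stated assumptions: `FakeRecordBase`, `User`, `Equipment`, `FakeReflection`, and `record_target?` are hypothetical stand-ins for `ActiveRecord::Base`, a table-backed model, an ActiveHash-style master, a reflection, and the new predicate.

```ruby
# Hypothetical stand-ins so the sketch runs on plain Ruby.
class FakeRecordBase; end          # plays the role of ActiveRecord::Base
class User < FakeRecordBase; end   # a table-backed model
class Equipment; end               # a master class with no database table

# Minimal reflection double: `klass` returns the target class, or raises when
# the target cannot be resolved (mimicking the ArgumentError case).
FakeReflection = Struct.new(:target) do
  def klass
    raise ArgumentError, "target is not a record class" if target.nil?

    target
  end
end

def record_target?(assoc, base: FakeRecordBase)
  klass = assoc.klass
  # Module#< yields true, false, or nil (unrelated classes); the ternary
  # collapses all three into a strict boolean.
  klass.is_a?(Class) && klass < base ? true : false
rescue StandardError
  # A lookup that raises also means "not a DB relation".
  false
end

results = {
  model:      record_target?(FakeReflection.new(User)),
  master:     record_target?(FakeReflection.new(Equipment)),
  unresolved: record_target?(FakeReflection.new(nil)),
}
```

The ternary matters because `Equipment < FakeRecordBase` evaluates to `nil` rather than `false`; without it the predicate would leak `nil` to callers.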
data/lib/exwiw/table_config.rb
CHANGED
@@ -154,10 +154,10 @@ module Exwiw
       merged_table.scope_column = scope_column
 
       # Structural facts of each belongs_to come from the freshly generated
-      # config, but the user-owned `comment`/`ignore`/`references` carry over
-      # when the same relation still exists. (`references` is only consumed by
-      # the MongodbAdapter, but it lives on the shared BelongsTo, so preserve
-      # it here too rather than silently dropping a hand-added value.)
+      # config, but the user-owned `comment`/`ignore`/`ignore_type`/`references`
+      # carry over when the same relation still exists. (`references` is only
+      # consumed by the MongodbAdapter, but it lives on the shared BelongsTo, so
+      # preserve it here too rather than silently dropping a hand-added value.)
       receiver_belongs_to_by_identity = belongs_tos.each_with_object({}) { |bt, hash| hash[bt.identity] = bt }
       merged_table.belongs_tos =
         passed_table.belongs_tos.map do |passed_belongs_to|
@@ -165,6 +165,7 @@ module Exwiw
           if receiver_belongs_to
             passed_belongs_to.comment = receiver_belongs_to.comment if receiver_belongs_to.comment
             passed_belongs_to.ignore = receiver_belongs_to.ignore unless receiver_belongs_to.ignore.nil?
+            passed_belongs_to.ignore_type = receiver_belongs_to.ignore_type if receiver_belongs_to.ignore_type
             passed_belongs_to.references = receiver_belongs_to.references if receiver_belongs_to.references
           end
           passed_belongs_to
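Review note: the carry-over rule in this hunk can be sketched with a plain `Struct`. This is a minimal illustration, not the gem's `BelongsTo` class: the `identity` definition below is an assumption (the diff does not show the real one), and `carry_over` is a hypothetical helper name.

```ruby
BelongsTo = Struct.new(:name, :foreign_key, :comment, :ignore, :ignore_type, :references,
                       keyword_init: true) do
  # Assumed identity key for matching old and regenerated relations.
  def identity
    [name, foreign_key]
  end
end

# Structural facts come from `regenerated`; user-owned fields are copied from
# `existing` when the same relation is still present.
def carry_over(existing, regenerated)
  by_identity = existing.each_with_object({}) { |bt, hash| hash[bt.identity] = bt }
  regenerated.map do |fresh|
    old = by_identity[fresh.identity]
    if old
      fresh.comment = old.comment if old.comment
      # `nil?` rather than truthiness, so an explicit `ignore: false` survives.
      fresh.ignore = old.ignore unless old.ignore.nil?
      fresh.ignore_type = old.ignore_type if old.ignore_type
      fresh.references = old.references if old.references
    end
    fresh
  end
end

existing    = [BelongsTo.new(name: "account", foreign_key: "account_id", ignore: false, comment: "reviewed")]
regenerated = [BelongsTo.new(name: "account", foreign_key: "account_id", ignore: true, ignore_type: "cross_database"),
               BelongsTo.new(name: "team", foreign_key: "team_id")]
merged = carry_over(existing, regenerated)
```

In this example the user's explicit `ignore: false` overrides the regenerated `ignore: true`, while the regenerated `ignore_type` is kept because the existing entry had none.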
data/lib/exwiw/version.rb
CHANGED
data/lib/tasks/exwiw.rake
CHANGED
@@ -17,9 +17,25 @@ namespace :exwiw do
     task generate: :environment do
       require "exwiw"
 
-      Exwiw::SchemaGenerator.from_rails_application(
+      groups = Exwiw::SchemaGenerator.from_rails_application(
         output_dir: resolve_schema_dir.call,
       ).generate!
+
+      # Surface cross-database belongs_tos the generator auto-ignored: these
+      # cannot be joined (each database is exported separately), so they were
+      # emitted with ignore:true and need a decision from the user.
+      cross = Exwiw::SchemaGenerator.cross_database_belongs_tos(groups)
+      unless cross.empty?
+        $stderr.puts "exwiw: detected #{cross.size} cross-database belongs_to(s), " \
+                     "each emitted with ignore:true (exwiw cannot join across databases). " \
+                     "The foreign-key column is still exported."
+        cross.each do |c|
+          $stderr.puts "  - #{c[:table]}.#{c[:foreign_key]} -> #{c[:target]} (in another database)"
+        end
+        $stderr.puts "  To extract across a boundary, declare `scope_column: <foreign_key>` on the " \
+                     "owning table's config (scope-column mode). Otherwise the relation stays ignored " \
+                     "and the foreign key is exported as a plain value."
+      end
     end
 
     desc "Remove tables/columns from the schema config that no longer exist in the application"
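Review note: the warning emitted by the rake task can be exercised separately from Rails by building the lines first and printing them afterwards. The sketch below assumes each entry is a hash with `:table`, `:foreign_key`, and `:target` keys, as the interpolations in the task suggest; `cross_database_warning` is a hypothetical helper, and the header text is shortened.

```ruby
# Builds the warning lines for auto-ignored cross-database relations.
# Returns an empty array when there is nothing to report.
def cross_database_warning(cross)
  return [] if cross.empty?

  lines = ["exwiw: detected #{cross.size} cross-database belongs_to(s), each emitted with ignore:true."]
  cross.each do |c|
    lines << "  - #{c[:table]}.#{c[:foreign_key]} -> #{c[:target]} (in another database)"
  end
  lines
end

# Hypothetical sample entry.
sample = [{ table: "invoices", foreign_key: "account_id", target: "accounts" }]
warning = cross_database_warning(sample)
warning.each { |line| $stderr.puts line }
```

Keeping the formatting in a function that returns strings makes the message testable without capturing `$stderr`.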
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: exwiw
 version: !ruby/object:Gem::Version
-  version: 0.7.0
+  version: 0.8.1
 platform: ruby
 authors:
 - Shia
@@ -36,6 +36,7 @@ files:
 - CHANGELOG.md
 - LICENSE.txt
 - README.md
+- docs/mongodb-dump-parallelism-2x-notes.md
 - docs/mongodb-scoping-fullscan-notes.md
 - docs/optimization-notes.md
 - docs/optimize-mongodb-export-with-native-ext.md
@@ -46,6 +47,7 @@ files:
 - docs/plans/2026-05-29-rails-managed-tables.md
 - docs/plans/2026-05-31-ids-column-for-sql-adapters.md
 - docs/plans/2026-06-19-mongodb-export-remove-parallelism-native-ext.md
+- docs/scope-column-redesign.md
 - docs/sql-dump-optimization-notes.md
 - exe/exwiw
 - ext/exwiw/ext_json/ext_json.c